Word Unmunger

The Word Unmunger is a small Python program that removes much of the HTML cruft from files produced by Microsoft Word 2002 (Word version 10), making them much easier to hand-edit. It removes formatting and Office-specific data.

New! The Word Unmunger now features a batch mode. You can process several files at once and have each one written to an output directory under its original filename. It works like this:

    word-unmunger.py --output-dir=myDirectory file1.htm file2.htm file3.htm
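
To illustrate what batch mode amounts to, here is a minimal sketch in modern Python. It is not the Unmunger's actual code: the function names batch_output_paths and process_batch are hypothetical, and the clean argument stands in for whatever cleanup function you supply. It assumes the files can be read as latin-1, which round-trips every byte and so leaves the real encoding alone.

```python
import os


def batch_output_paths(input_files, output_dir):
    """Pair each input file with a path in output_dir that keeps its original filename."""
    return [(src, os.path.join(output_dir, os.path.basename(src)))
            for src in input_files]


def process_batch(input_files, output_dir, clean):
    """Apply clean (a str -> str function) to each file and write the result into output_dir.

    Hypothetical helper for illustration only; the originals are never modified.
    """
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for src, dest in batch_output_paths(input_files, output_dir):
        # latin-1 maps every byte to a character, so nothing is lost on the round trip.
        with open(src, encoding="latin-1", newline="") as f:
            html = f.read()
        with open(dest, "w", encoding="latin-1", newline="") as f:
            f.write(clean(html))
        written.append(dest)
    return written
```

Because the output always goes into a separate directory, the input files stay untouched, which matches the behavior described above.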

I don't recommend running this software on HTML files you've created yourself, because it will probably remove a great deal of formatting from them. But it's just the ticket when you want to clean up Word's output for hand-editing.

Sample Output: Output straight from Word 2002 versus Output from the Unmunger.

Download the Word Unmunger (You'll need Python 2.2 or better). The Unmunger is under the permissive MIT License.

Freshmeat page for the Word Unmunger


Why did you write the Word Unmunger?
I store my resume in Microsoft Word format because so many employers request resumes in that format and it is easy to edit (decent layout for printing, spell checking, etc.). I then export it to HTML and ASCII for my website. I got tired of stripping all of Word's formatting by hand and I wanted a chance to write a simple Python program, and thus the Unmunger was born.

Why would I want to use the Word Unmunger? The HTML layout looks like crap compared to Word!
It isn't as pretty in the browser, but do a View Source on each of my sample pages and you'll see that the unmunged version is much easier to edit. If you only want to look at the web page, you don't need the Word Unmunger.

How is the Word Unmunger different from the Demoroniser?
They are totally different. The Demoroniser corrects the HTML mistakes of earlier versions of Microsoft Office and removes Microsoft's "smart quotes".

But the HTML produced by Word 10 is technically pretty good. I bet it even validates. But it's a pain in the ass to hand-edit. The focus of the Unmunger is to remove formatting and Office-specific data, not correct Microsoft's mistakes. I may extend the Unmunger's capabilities in the future. Patches are welcome.
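
To give a feel for what "removing formatting and Office-specific data" can mean, here is a rough sketch in modern Python. It is an assumption about the kind of cleanup involved, not the Unmunger's actual implementation; the function name strip_word_markup and the three patterns are hypothetical. It targets markup Word is known to emit: conditional comments, namespaced tags such as <o:p>, and class, style and lang attributes.

```python
import re

# Conditional comments such as <!--[if gte mso 9]> ... <![endif]--> hold Office-only data.
CONDITIONAL_COMMENT = re.compile(
    r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", re.DOTALL | re.IGNORECASE)

# Namespaced Office tags, opening or closing, such as <o:p> and </o:p>.
OFFICE_TAG = re.compile(r"</?[a-z]+:[^>]*>", re.IGNORECASE)

# class, style and lang attributes, quoted or unquoted (Word writes class=MsoNormal).
FORMAT_ATTR = re.compile(
    r"""\s+(?:class|style|lang)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
    re.IGNORECASE)


def strip_word_markup(html):
    """Remove Office-specific markup and presentational attributes (illustrative only)."""
    html = CONDITIONAL_COMMENT.sub("", html)
    html = OFFICE_TAG.sub("", html)
    html = FORMAT_ATTR.sub("", html)
    return html
```

For example, strip_word_markup("<p class=MsoNormal style='margin:0in'>Hello<o:p></o:p></p>") returns "<p>Hello</p>". Regular expressions are a blunt tool here: the attribute pattern does not check that it is inside a tag, so it would also eat text like " class=foo" in the body. That is another reason to run this kind of cleanup only on Word's output.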

How well has the Word Unmunger been tested?
Not very well at all. It works on my resume and that's all I need it for. USE THE SOFTWARE AT YOUR OWN RISK. It will not overwrite your file unless you explicitly tell it to, but be careful. Bug reports and patches are greatly appreciated.

Who wrote the Unmunger?
Luke Francl, a programmer from Minneapolis. You can email me at look@recursion.org. I also have a weblog.

Luke Francl (look@recursion.org)