I found a good guide to cleaning out the gunk that’s in Word’s HTML documents. For the smallest, most efficient files it seems to conclude that the Textism Wordcleaner is the way to go: it is free for files under 20KB, and subscription options are available for bigger files. This issue has been causing me angst for some time, and one of these days I’m going to bash out a tool for this myself. (Don’t hold your breath.)
Once upon a time, demoroniser was a popular choice, though it may be getting on a bit these days since it’s no longer maintained. Another option that looks good is HTML Tidy, but although there is a Windows port, it requires the use of the command line.
Probably not worth you bashing out your own tool. The next version of Word in Office 12 will save documents in XML, which should easily transform to HTML.
Thanks Jeremy… though installing Perl for Demoroniser is beyond the average user.
Good point Malcolm. If I do have a go at it, it would probably involve learning XSLT, which to my mind might be a quick and easy way of filtering out the gunge.
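To give a rough idea of what "filtering out the gunge" might involve, here is a minimal sketch. It is not XSLT; it swaps in plain regular expressions in Python, purely as an illustration. The function name `strip_word_gunk` and the list of patterns are my own assumptions about typical Word output (conditional comments, `<o:p>` style tags, `Mso` classes, inline styles, spans), not a complete or authoritative cleaner.

```python
import re


def strip_word_gunk(html: str) -> str:
    """Remove a few common Word-specific constructs from an HTML string.

    Hypothetical helper: the patterns below are assumptions about what
    Word typically emits, and regexes are not a real HTML parser.
    """
    # Conditional comments, e.g. <!--[if gte mso 9]> ... <![endif]-->
    html = re.sub(r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    # Office namespace tags, e.g. <o:p>, </o:p>, <w:WordDocument>
    html = re.sub(r"</?[ovw]:[^>]*>", "", html, flags=re.IGNORECASE)
    # class=MsoNormal, class="MsoBodyText" and similar
    html = re.sub(r'\s+class="?Mso\w*"?', "", html, flags=re.IGNORECASE)
    # Double-quoted inline style attributes (aggressive: drops all of them)
    html = re.sub(r'\s+style="[^"]*"', "", html, flags=re.IGNORECASE)
    # Span tags (also aggressive: drops every span, keeps its contents)
    html = re.sub(r"</?span\b[^>]*>", "", html, flags=re.IGNORECASE)
    return html


sample = ('<p class=MsoNormal style="margin:0cm">'
          '<span style="mso-bidi-font-size:10.0pt">Hello</span><o:p></o:p></p>')
print(strip_word_gunk(sample))
```

This throws away all inline styling and every span, which is probably too blunt for documents where formatting matters. A proper tool would parse the markup rather than pattern-match it.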