About MS Word Generated HTML

Word and HTML

How Word saves a document as HTML, and what the result looks like, depends on the Word version. The table below shows the web-page saving options in each version from Word 95 to Word 2007.

Word Version                  95   97   2000   2002 (XP)   2003   2007
Save as HTML                  -    x    -      -           -      -
Save as Web Page              -    -    x      x           x      x
Save as Web Page (filtered)   -    -    -      x           x      x
Publish to blog               -    -    -      -           -      x

(x = available, - = not available)

    Starting with Word 97, Microsoft made converting a DOC file to HTML easy: just use “Save as web page” in the MS Word menu.
    The resulting HTML, however, is not pleasant, and some experts call it “nasty”.

    Word 2000 made this worse: the new HTML, with its styles and Office-specific tags, made files larger and more complex. Although it
    preserves the ability to edit the file in Word again, that is not what most users want. Facing criticism, Microsoft had to release a patching tool,
    the Office 2000 HTML Filter, which strips out the Office-specific tags.

    In Word 2002 (XP), Microsoft built this function into Word as a new menu item, “Save as Web Page (filtered)”. Since then,
    these two ways of saving HTML have remained unchanged.

    Word 2007 only added a new “Publish to Blog” function, which simplifies getting from a DOC file to the internet for those who
    want to use Word as their blog editor.

    Word HTML types

    Altogether, MS Word generates three types of HTML file. (This tool can clean most Word-generated HTML files, except those from Word 97.)

    Filtered HTML (by Word XP, 2003, 2007)

    This is the general type this tool works on.

    Unfiltered HTML (by Word 2000, XP, 2003, 2007)

    This tool can deal with them, but we recommend you use the filter function in MS Word if possible.

    Word 97 generated HTML

    Unfortunately, Word 97 generates HTML in an entirely different way, and this version of HTML Cleaner for Word cannot clean it yet. We will add support for cleaning Word 97 HTML in a later version.

    Nasty Word HTML

    Unfortunately, the quality of the HTML that Word generates has never improved. In fact, the basic style of Word-generated HTML has not changed since Word 2000: bloated, redundant, complex, even foolish.

    Here’s a comment on Word-generated HTML from Jeff Atwood, the founder of Coding Horror:

    Word offers two HTML options in its save dialog: “Save as HTML” and “Save as Filtered HTML”. In practice, that means you get to choose between totally nasty HTML and slightly less nasty HTML.

    Actually, the HTML generated by MS Word 97 is rather concise; it is plain HTML, though redundant. But starting with Word 2000, Word began to generate HTML files with stylesheets, which always contain
    useless Office-specific tags and even more redundant HTML code. The Office HTML Filter and “Save as web page (filtered)” can remove the Office-specific tags, but the redundancy, especially the endlessly repeated styles, still remains.
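
To make the filtering step concrete, here is a minimal Python sketch of the kind of stripping such a filter performs. It is not Microsoft's filter and not this tool's implementation. It assumes that Office-specific tags are the ones with a namespace prefix (such as <o:p>) and that Office-specific style properties start with "mso-". The function name strip_office_markup and the sample line are made up for illustration.

```python
import re

# Assumption: Office-specific tags carry a namespace prefix, e.g. <o:p>, <v:shape>.
NAMESPACED_TAG = re.compile(r"</?[A-Za-z][A-Za-z0-9]*:[^>]*>")
# Assumption: Office-specific style declarations start with "mso-".
MSO_DECLARATION = re.compile(r"mso-[a-z0-9-]+\s*:[^;'\"]*;?\s*")


def strip_office_markup(html: str) -> str:
    """Remove namespaced tags and mso-* declarations (hypothetical helper).

    The regexes run over the whole string, so body text that happens to
    contain "mso-...:" would be affected too; a real cleaner would parse
    the document instead.
    """
    html = NAMESPACED_TAG.sub("", html)
    return MSO_DECLARATION.sub("", html)


# Hand-written line in the style of Word output, not taken from a real file.
sample = "<p class=MsoNormal style='margin:0cm;mso-pagination:widow-orphan'>Hello<o:p></o:p></p>"
print(strip_office_markup(sample))
# prints: <p class=MsoNormal style='margin:0cm;'>Hello</p>
```

Note that the inline style attribute itself survives on the tag. That leftover is the redundancy that filtering does not address.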

    Let’s look at some examples. Every tag carries inline styles, no matter how many times they are repeated and whether or not they are useful. Every table cell carries the same height and width values, and these values are repeated in both the styles and the HTML code.
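
The repetition can be measured with a short Python sketch using only the standard library. The table fragment below is written by hand to resemble Word-style output; it is not copied from a real file. The names StyleCounter and count_repeated_styles are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical fragment: three cells, each repeating the same width,
# height and paragraph style.
SAMPLE = """
<table>
<tr>
<td width=120 style='width:90.0pt;height:15.0pt'><p class=MsoNormal style='margin:0cm'>A</p></td>
<td width=120 style='width:90.0pt;height:15.0pt'><p class=MsoNormal style='margin:0cm'>B</p></td>
<td width=120 style='width:90.0pt;height:15.0pt'><p class=MsoNormal style='margin:0cm'>C</p></td>
</tr>
</table>
"""


class StyleCounter(HTMLParser):
    """Count how often each (tag, inline style) pair occurs."""

    def __init__(self):
        super().__init__()
        self.styles = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "style" and value:
                self.styles[(tag, value)] += 1


def count_repeated_styles(html: str) -> dict:
    """Return the inline styles that appear more than once, with their counts."""
    parser = StyleCounter()
    parser.feed(html)
    parser.close()
    return {key: n for key, n in parser.styles.items() if n > 1}


for (tag, style), n in count_repeated_styles(SAMPLE).items():
    print(f"<{tag}> style '{style}' repeated {n} times")
```

In this fragment both the cell style and the paragraph style occur three times, and the cell width is additionally stated once more per cell as a plain width attribute.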

    To professional HTML coders, this style of coding is unacceptable, altogether garbage. It not only inflates file size and wastes space, but also makes the HTML file slow to view and, most importantly, slow to transfer over the internet. You may not see the rubbish inside, but it always occupies your space; you may not notice the slowdown on a high-speed network, but your network and CPU still spend half of their time transferring and parsing that rubbish.
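
One rough way to put a number on this overhead is to compare the length of the visible text with the length of the whole file. The Python sketch below does that under simplifying assumptions: it counts characters rather than bytes, and it treats everything outside text nodes as markup. The names TextCollector and markup_overhead, and both sample strings, are made up for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def markup_overhead(html: str) -> float:
    """Return the fraction of characters that are not visible text."""
    if not html:
        return 0.0
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    # Whitespace around each text chunk is ignored.
    text_len = sum(len(chunk.strip()) for chunk in collector.chunks)
    return 1 - text_len / len(html)


# Hand-written samples: the same two letters, with and without Word-style markup.
bloated = "<p class=MsoNormal style='margin:0cm;font-size:12.0pt'><span style='font-size:12.0pt'>Hi</span></p>"
lean = "<p>Hi</p>"
print(f"bloated: {markup_overhead(bloated):.0%} markup, lean: {markup_overhead(lean):.0%} markup")
```

The figures depend entirely on the sample, so they only illustrate the direction of the effect, not how large it is in real documents.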

    So why not root that junk out?

    We need tools to clean it up. Fortunately, such tools exist.