Word HTML Cleaners Tuesday, Jul 28 2009 

Word HTML Cleaners

Here is a summary of the most commonly used Word HTML cleaning
tools. I hope it helps you choose the right tool for a Word HTML
cleaning job. I have no intention of praising or criticizing any particular tool; if
my opinion is incorrect, please email me and I will correct it.

  1. HTML Cleaner for Word
    This is the newest tool of its kind. It can be downloaded from
    http://www.htmlcleaner.com/cleanW/index.htm

  2. Dreamweaver
    Dreamweaver has a function to clean up Word HTML, but
    it is very slow. When I tried to clean a 500KB file, it stopped responding.

    See the official article on using Dreamweaver to clean up Word HTML.

  3. Word HTML Cleaner by Textism
    An online service. Files smaller
    than 20K are free. I tried one and it works: it generates basic HTML, but the styles are lost.
    Most Word-generated HTML files are much larger than that, so a subscription is needed.
  4. Word Cleaner by Zapadoo
    A powerful tool that can convert
    DOC to HTML directly, without MS Word. I tried it, but the trial version converted only
    part of the file, so I can’t judge the real result. It costs
    $99.
  5. Word to HTML by Maluke
    This tool is a converter from Word to
    HTML.
  6. Gmail
    Send a DOC file as an attachment to your Gmail address. Don’t open
    or download it; just view it as HTML. You get very clean HTML, but
    without styles, much like the HTML generated by Word 97. It’s an alternative way
    to generate simple, pure HTML.
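
The “pure HTML” that several of these tools produce is essentially the same markup with the styling removed. As a rough illustration only (this is not how any of the tools listed here actually works), the Python sketch below drops `<style>` blocks and `style`/`class` attributes using regular expressions. The function name `to_plain_html` and the sample markup are made up for this example, and a regex approach like this can also hit text that merely looks like an attribute, so treat it as a sketch rather than a real cleaner.

```python
import re

# <style> ... </style> blocks, including their contents
STYLE_BLOCK = re.compile(r"<style\b[^>]*>.*?</style>", re.DOTALL | re.IGNORECASE)

# style=... or class=... with double-quoted, single-quoted, or unquoted values
PRESENTATION_ATTR = re.compile(
    r"""\s+(?:style|class)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
    re.IGNORECASE,
)


def to_plain_html(html):
    """Remove stylesheets and style/class attributes (hypothetical helper)."""
    html = STYLE_BLOCK.sub("", html)
    return PRESENTATION_ATTR.sub("", html)


sample = (
    "<style>p.MsoNormal{margin:0cm}</style>"
    "<p class=MsoNormal style='margin-left:36.0pt'>Hello <b>world</b></p>"
)
print(to_plain_html(sample))  # <p>Hello <b>world</b></p>
```

The structure of the document (paragraphs, bold text) survives, but the appearance does not, which matches what the styleless results described above look like.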
Solutions         Word HTML Cleaner  Word Cleaner        Word to HTML  Gmail            Dreamweaver         HTML Cleaner for Word
Producer          Textism            Zapadoo             Maluke        Google           Macromedia (Adobe)  Wonder Studio
Price             C49/year           $99                 $47           Free             -                   $39
Type              Online Service     Software            Software      Online Function  Software Function   Software
Speed             Medium             Medium              Medium        Fast             Slow                Fast
Input file type   HTML               DOC                 DOC           DOC              HTML                HTML
Trial limitation  File size < 20K    Partial conversion                                                     No saving

The table also compared big-file support, multi-file support, cleaning Office tags, cleaning HTML redundancy, retaining appearance, generating pure HTML, and various options; the per-tool marks for those rows were images and are not reproduced here.
    • The data is for reference only.
    • Some items are not exact because the tools are hard to compare directly.
    • This table will be updated as soon as new information is available.

    Some other tools may also be helpful for cleaning.

    • MS Office HTML Filter 2.0 This was originally a patch tool for Word 2000,
      which can’t generate filtered web pages itself. It removes Office-specific tags,
      turning unfiltered HTML into filtered HTML. Since Word 2002 (XP) this function
      has been built into Word, so most users don’t need the tool
      anymore, except those who still use Word 2000.
  • Word 2007 Word 2007 has a new feature, "Publish to Blog". Some say
    it can generate clean HTML, but I am not sure. I don’t have Word 2007, so I asked
    a friend to try it; he said the document is published directly to some MS blog. I
    don’t know how the result can be saved as an HTML file.
  • HTML Tidy
    I don’t know why it is always recommended,
    even in this field, as if it could do everything in HTML cleaning.
    After trying it, I couldn’t find anything it does for Word HTML cleaning.
  • Word HTML Cleaner by wordcleaner.co.uk

    A free online cleaning website. Generates pure HTML.

  • Word HTML Clean-up  by Bersoft
    A small tool to clean up Word HTML.
  • word2cleanhtml by Oliver Cope
    A free online converter.
  • Microsoft Word 2000 HTML Mess Cleaner by Morten Nilsson
    A free
    online service written in ASP and VBScript; the source code can be purchased.
    About Word HTML Wednesday, May 6 2009

    About MS Word Generated HTML

    Word and HTML

    How a document is saved as HTML, and what the result looks like, varies with the version of Word. Here’s a table of the save-as-web-page abilities of every Word version from 95 to 2007.

    Word Version                  95   97   2000   2002 (XP)   2003   2007
    Save as HTML                       x
    Save as Web Page                        x      x           x      x
    Save as Web Page (filtered)                    x           x      x
    Publish to Blog                                                   x

    Starting with Word 97, Microsoft made the conversion from DOC to HTML easy: just use “Save as web page” in the MS Word menu.
    But the resulting HTML is not pleasant; some experts even call it “nasty”.

    Word 2000 made this worse: the new HTML, with styles and Office-specific tags, made the files fatter and more complex. Though it
    retains the ability to be edited in Word, that is not what most users want. Facing the criticism, Microsoft had to release a patch tool,
    the Office 2000 HTML Filter, which strips the Office-specific tags out.
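
To give an idea of what “Office-specific tags” means in practice, here is a minimal Python sketch that removes two common kinds of Word markup: conditional comments such as `<!--[if gte mso 9]> ... <![endif]-->`, and tags with an Office namespace prefix such as `<o:p>`. It is only an illustration and not how Microsoft’s filter is implemented; the function name `strip_office_markup`, the list of prefixes, and the sample markup are assumptions made for this example.

```python
import re

# Conditional comments, e.g. <!--[if gte mso 9]> ... <![endif]-->
CONDITIONAL_COMMENT = re.compile(
    r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", re.DOTALL | re.IGNORECASE
)

# Opening or closing tags with an Office namespace prefix (assumed list)
OFFICE_TAG = re.compile(r"</?(?:o|v|w|x|st1):[^>]*>", re.IGNORECASE)


def strip_office_markup(html):
    """Drop conditional comments first, then any remaining namespaced tags."""
    html = CONDITIONAL_COMMENT.sub("", html)
    return OFFICE_TAG.sub("", html)


sample = (
    "<!--[if gte mso 9]><xml><w:WordDocument></w:WordDocument></xml><![endif]-->"
    "<p class=MsoNormal>Hello<o:p></o:p></p>"
)
print(strip_office_markup(sample))  # <p class=MsoNormal>Hello</p>
```

Note that the `class=MsoNormal` attribute is still there afterwards: removing Office-specific tags does not remove the styles, which is the redundancy problem that remains after filtering.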

    In Word 2002 (XP), Microsoft built this function into Word as a new menu item, “Save as Web Page (filtered)”. Since then,
    these two ways of saving HTML have remained unchanged.

    Word 2007 only adds a new “Publish to Blog” function, which simplifies the path from DOC to the internet for those who
    want to use Word as their blog editor.

    Word HTML types

    Altogether there are three types of HTML file that MS Word generates. (This tool, HTML Cleaner for Word, can clean most Word-generated HTML files, except those from Word 97.)

    Filtered HTML (by Word XP, 2003, 2007)

    The main type this tool works on.

    Unfiltered HTML (by Word 2000, XP, 2003, 2007)

    This tool can handle them, but we recommend using the filter function in MS Word if possible.

    Word 97 generated HTML

    Unfortunately, Word 97 generates HTML files in an entirely different way, and this version of HTML Cleaner for Word can’t clean them yet. We will add cleaning of Word 97 HTML in a later version.

    Nasty Word HTML

    Unfortunately, the quality of the HTML that Word generates has never improved. In fact, the basic style of Word-generated HTML has not changed since Word 2000: fat, redundant, complex, even foolish.

    Here’s a comment on Word-generated HTML from Jeff Atwood, the author of Coding Horror:

    Word offers two HTML options in its save dialog: “Save as HTML” and “Save as Filtered HTML”. In practice, that means you get to choose between totally nasty HTML and slightly less nasty HTML.

    Actually, the HTML generated by MS Word 97 is rather concise; it’s just HTML, though redundant HTML. But starting with Word 2000, Word began to generate HTML files with stylesheets, which always contain
    useless Office-specific tags and more redundant HTML code. The Office HTML Filter and “Save as Web Page (filtered)” can filter out the Office-specific tags, but the redundancy, especially the endlessly repeated styles, still remains.

    Let’s look at some examples. There are inline styles in every tag, no matter how many times they are repeated and whether they are useful or not. Every table cell carries the same height and width values, and these values are repeated in both the styles and the HTML code.
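
The repetition described above is easy to measure. The following Python sketch uses the standard `html.parser` module to count how often each inline `style` value occurs and reports the ones that appear more than once. The class name `StyleCounter`, the function `repeated_styles`, and the sample table are invented for illustration; real Word output is far longer, but the pattern is the same.

```python
from collections import Counter
from html.parser import HTMLParser


class StyleCounter(HTMLParser):
    """Collects the value of every inline style attribute."""

    def __init__(self):
        super().__init__()
        self.styles = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "style" and value:
                # Collapse whitespace so equivalent values compare equal
                self.styles[" ".join(value.split())] += 1


def repeated_styles(html):
    """Return {style value: count} for styles used more than once."""
    parser = StyleCounter()
    parser.feed(html)
    parser.close()
    return {style: n for style, n in parser.styles.items() if n > 1}


sample = (
    "<table>"
    "<tr><td style='width:100pt;height:20pt'>a</td>"
    "<td style='width:100pt;height:20pt'>b</td></tr>"
    "<tr><td style='width:100pt;height:20pt'>c</td>"
    "<td style='color:red'>d</td></tr>"
    "</table>"
)
print(repeated_styles(sample))  # {'width:100pt;height:20pt': 3}
```

Every value in the result is a candidate for being written once, for example as a shared class, instead of being repeated in each cell.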

    To professional HTML coders, this style of coding is unacceptable, altogether garbage. It not only makes files large and wastes lots of space, but also makes the HTML file slow to view and, most important, slow to transfer over the internet. Maybe you can’t see the rubbish inside, but it always occupies your space. Maybe you don’t notice the slowdown on your high-speed network, but your network and CPU still spend half of their time transferring and parsing that rubbish.

    So, why not root this junk out?

    We need tools to clean it up. Fortunately, there are some.

    HTML Cleaner for Word Released Wednesday, May 6 2009 

    Based on the HTML cleaning technology of HTML Page Cleaner, I have released a new tool for a special kind of HTML cleaning: HTML Cleaner for Word. This tool is designed for those who suffer from MS Word generated HTML, and it exceeds most similar tools available now: it retains the exact appearance, greatly reduces file size, offers various cleaning schemes and options, and includes all the features of HTML Page Cleaner, such as safe cleaning, source code monitoring, and comparison. All these features make it the best Word HTML cleaning tool. I hope it can resolve the long-standing conflict between MS Word and clean HTML.
    It can now be downloaded at http://www.htmlcleaner.com/cleanW/index.htm

    HTML Page Cleaner 2.0 Released Thursday, Apr 2 2009 

    Today I received a letter from upload.com (the companion site of download.com) informing me that the new version of HTML Page Cleaner, 2.0, had been approved for listing.

    I submitted the new version a week ago, when HTML Page Cleaner 2.0 was finished. It can now be called mature software, with an easy-to-use UI and stable, efficient functions. It is now released to the public.

    It can be downloaded from my website here.
