Batch Word to HTML -- ConvertWordToHTML [Update: Word Converter Tool]

I recently had a requirement to batch convert Word files to HTML.

For a handful of Word files, Word’s built-in “Save As” feature is enough. With a large number of files, however, converting them one at a time by hand quickly becomes tedious.

After searching online, I found solutions in PHP, Python, Ruby, and C#. I also came across a tool called “Xunjie Converter”, but it didn’t quite fit my needs, so I decided to write my own. Since Word is a Microsoft product, I figured C# would be the most natural choice for this task.

I open-sourced a GUI-based solution on GitHub: https://github.com/hujiulin/ConvertWordToHTML (currently single-threaded; I plan to make it multi-threaded later).

Screenshots of the running application:

  1. Initial program interface:

    [Screenshot]

  2. Click “Open” to select the input folder containing the Word documents:

    [Screenshot]

  3. Click “SaveAs” to select the output folder:

    [Screenshot]

  4. The program after the conversion has finished:

    [Screenshot]

  5. Input folder and output results:

    [Screenshots: input and output folders]
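
The workflow in the steps above (pick an input folder, pick an output folder, convert every document) can be sketched independently of Word itself. The snippet below is a minimal Python illustration, not the tool's actual C# code. The function name `plan_conversions`, the choice of `.doc` and `.docx` as input extensions, and the decision to skip Word's `~$` lock files are all assumptions made for the example.

```python
from pathlib import Path

# Assumed input extensions; the original tool may accept a different set.
WORD_EXTENSIONS = {".doc", ".docx"}


def plan_conversions(input_dir, output_dir, target_ext=".htm"):
    """Pair every Word file in input_dir with its output path in output_dir.

    Only the top level of input_dir is scanned. Word's temporary lock
    files (names starting with "~$") are skipped.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    jobs = []
    for src in sorted(input_dir.iterdir()):
        if not src.is_file() or src.name.startswith("~$"):
            continue
        if src.suffix.lower() in WORD_EXTENSIONS:
            # Keep the base name and swap only the extension.
            jobs.append((src, output_dir / (src.stem + target_ext)))
    return jobs
```

Each returned pair can then be handed to whatever performs the actual conversion, which keeps the folder handling separate from the Word automation.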

Program notes:

  1. Dependencies: Windows OS, .NET Framework 3.5, Office Word

  2. Word’s “Save As” dialog offers several web formats: single-file web page (mht), web page (htm), and filtered web page (htm). I chose the filtered option, which converts all formulas to gif or jpg images and leaves out the Office-specific formatting markup that Word otherwise writes into the file.
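
For reference, the three web formats correspond to values of Word's `WdSaveFormat` enumeration. The sketch below shows how the filtered option might be selected when driving Word through COM automation. It is written in Python with the third-party `pywin32` package purely for illustration; the tool itself is a C# application and its code is not reproduced here. The function name `convert_to_filtered_html` is hypothetical, and the function only runs on Windows with Word installed.

```python
# Numeric values of Word's WdSaveFormat enumeration for the web formats.
WD_FORMAT_HTML = 8            # web page (htm)
WD_FORMAT_WEB_ARCHIVE = 9     # single-file web page (mht)
WD_FORMAT_FILTERED_HTML = 10  # filtered web page (htm)


def convert_to_filtered_html(src_path, dst_path):
    """Open one document in Word and save it as filtered HTML.

    Both paths should be absolute, because Word resolves relative
    paths against its own working directory.
    """
    import win32com.client  # pywin32; imported lazily, Windows only

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        doc = word.Documents.Open(str(src_path), ReadOnly=True)
        try:
            doc.SaveAs(str(dst_path), FileFormat=WD_FORMAT_FILTERED_HTML)
        finally:
            doc.Close(False)  # close without saving changes to the source
    finally:
        word.Quit()
```

Starting and quitting Word for every file is slow; a batch run would more likely create the application object once and reuse it for all documents.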

GitHub: https://github.com/hujiulin/ConvertWordToHTML

Download: http://devhu-github.stor.sinaapp.com/ConvertWordToHTML.rar


2015-1-24 Update:

  • Renamed the solution and project to WordConverter; added Word-to-PDF conversion; added a switch for choosing the output extension.

The Word Converter tool now supports both HTML and PDF formats.
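
Supporting a second output format mostly comes down to mapping the requested extension to a different save format. Here is a minimal sketch in Python, offered as an illustration only and not as the tool's C# code. `SAVE_FORMATS` and `resolve_target` are hypothetical names, and the numeric values are Word's `WdSaveFormat` constants for filtered HTML and PDF.

```python
from pathlib import Path

# WdSaveFormat values: wdFormatFilteredHTML = 10, wdFormatPDF = 17.
SAVE_FORMATS = {".htm": 10, ".html": 10, ".pdf": 17}


def resolve_target(src_path, output_dir, target_ext):
    """Return (output path, save-format value) for the requested extension."""
    ext = target_ext.lower()
    if not ext.startswith("."):
        ext = "." + ext
    if ext not in SAVE_FORMATS:
        raise ValueError(f"unsupported target extension: {target_ext!r}")
    return Path(output_dir) / (Path(src_path).stem + ext), SAVE_FORMATS[ext]
```

The returned format value would be passed as the `FileFormat` argument when saving the document through Word, in Word versions that support PDF export.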

Updated GitHub link: https://github.com/hujiulin/WordConverter

Download: http://devhu-github.stor.sinaapp.com/WordConverter.rar