SlideGuitarist: Exporting Word Documents to HTML

I mean real HTML. Here's what I get if I save this blog entry (which I'm editing in Word 2007) as HTML:

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" 
        xmlns:m=http://schemas.microsoft.com/office/2004/12/omml
        xmlns="http://www.w3.org/TR/REC-html40">
    <head>
        <meta http-equiv=Content-Type content="text/html; charset=windows-1252">
        <meta name=ProgId content=Word.Document>

If I save as "filtered HTML," I get something similar:

<html>
    <head>
        <meta http-equiv=Content-Type content="text/html; charset=windows-1252">
        <meta name=Generator content="Microsoft Word 12 (filtered)">

In the first case, I get namespace declarations that I don't want, including a default namespace of http://www.w3.org/TR/REC-html40. What I want is legitimate XHTML (http://www.w3.org/1999/xhtml). C'mon, people, it's 2011! I don't care who's using browsers or authoring tools that can't handle it; they should upgrade.

What I really, really, really want, however, is not to export with the Windows-1252 character encoding. There seems to be no way to prevent that if it's the default encoding on your Windows machine: if you export to plain text, you can specify an encoding, but not if you export to HTML (as far as I can tell). Weird.

What I also really, really, really want is attribute values enclosed in quotation marks, as has been the standard for, oh, about 12 years. I did try to process this HTML further as XML, and the parser choked on it, as any proper XML parser would.
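To make the complaint about unquoted attributes concrete, here is a minimal sketch (not one of my original scripts) that hands Word-style markup to the JDK's stock XML parser. The parses helper and both sample strings are made up for illustration.

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import org.xml.sax.InputSource
import org.xml.sax.SAXParseException
import org.xml.sax.helpers.DefaultHandler

// Hypothetical helper: true if the markup is well-formed XML, false otherwise.
def parses = { String markup ->
    def builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    // DefaultHandler rethrows fatal errors instead of printing them to stderr.
    builder.setErrorHandler(new DefaultHandler())
    try {
        builder.parse(new InputSource(new StringReader(markup)))
        return true
    } catch (SAXParseException e) {
        return false
    }
}

// Invented samples: one in the style of Word's export, one in the style of XHTML.
def wordStyle = '<html><head><meta name=Generator content="Microsoft Word 12 (filtered)"></head></html>'
def xhtmlStyle = '<html><head><meta name="Generator" content="Microsoft Word 12 (filtered)"/></head></html>'

println "Word-style markup parses: ${parses(wordStyle)}"
println "XHTML-style markup parses: ${parses(xhtmlStyle)}"
```

The Word-style sample fails on two counts: the unquoted attribute value and the meta element that is never closed. Either one is enough to stop an XML parser.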

There are parsers, however, that will read this garbage and turn it into legitimate XHTML. I prefer TagSoup for this purpose. John Cowan, the author, recommends that you not use the JAXP interface, but that's exactly what I want. I'll demonstrate below.

Let me backtrack for a minute. A good part of my job at Palantir requires me to deal with unfriendly input and output formats. A lot of these formats are XML, and usually not very nice XML (if there's a schema, it's almost certain to be misleading). So I've gotten used to dealing with character encodings, and I've gravitated toward Groovy for a lot of my work, as I can often whip something up in minutes without needing an IDE. A colleague recently asked me, quite reasonably, if it would be "easy" to convert Word documents automatically to wiki markup. "Sure!" I said. It was not easy, as it turned out.

The first step was to convert every Word document on hand to HTML. I use Groovy+Scriptom to do this:

import org.codehaus.groovy.scriptom.ActiveXObject
import org.codehaus.groovy.scriptom.Scriptom
import org.codehaus.groovy.scriptom.tlb.office.word.WdSaveFormat
import org.codehaus.groovy.scriptom.tlb.office.MsoEncoding

userHome = new File(System.properties.'user.home')
myDocuments = new File(userHome, 'My Documents')
output = new File(userHome, "Desktop/Tony's Output")

def word = new ActiveXObject('Word.Application')
Scriptom.inApartment {
    try {
        word.Visible = false
        word.DisplayAlerts = false
        def documents = word.Documents
        myDocuments.eachFileMatch ~/.*\.docx/, { doc ->    
            println "Opening $doc"
            documents.Open doc.absolutePath
            def activeDocument = word.ActiveDocument
            assert activeDocument
            try {
                activeDocument.AcceptAllRevisions()
                def html = new File(output, doc.name - ~/\.docx$/ + '.html')
                // WdSaveFormat values, for reference. See MSFT's docs for the WdSaveFormat enumeration:
                // http://msdn.microsoft.com/en-us/library/bb238158%28office.12%29.aspx
                // 7 is Unicode text; 8 is wdFormatHTML; 10 is wdFormatFilteredHTML (used below);
                // 16 is "default," thus Office 2007; 17 is PDF.
                // Scriptom.MISSING fills the optional arguments between LockComments and Encoding,
                // which is the twelfth parameter of SaveAs.
                def n = Scriptom.MISSING
                activeDocument.SaveAs html.absolutePath, WdSaveFormat.wdFormatFilteredHTML, false, n, n, n, n, n, n, n, n, MsoEncoding.msoEncodingUTF8
            } finally {
                activeDocument.Close()
            }
        } // each
    } finally {
        // This is the Office automation API call, which Scriptom resolves for you.
        word.Quit()    
        // Apparently it now works: winword.exe disappears from my Task Manager, which is what I want.
    }
} // Close apartment.

I'm not going to explain Scriptom here. Suffice it to say that the code above does roughly what VBA would do. Scriptom can use the constants defined at http://msdn.microsoft.com/en-US/library/microsoft.office.interop.word.wdsaveformat.aspx because someone was thoughtful enough to copy them to http://groovy.codehaus.org/modules/scriptom/1.6.0/scriptom-office-2K3-tlb/apidocs/org/codehaus/groovy/scriptom/tlb/office/word/WdSaveFormat.html. Unfortunately, my attempt to specify the encoding for the output file failed, and I could have left off all the arguments to SaveAs() after the first two.
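Since the export ignored the encoding I asked for, anything that reads these files has to decode them as Windows-1252. Here is a small sketch of what goes wrong otherwise. The sample string is invented, and nothing in it touches Word.

```groovy
// Curly quotes, which Word loves, occupy single bytes (0x93 and 0x94) in Windows-1252.
def text = '\u201cHi\u201d'
byte[] cp1252 = text.getBytes('windows-1252')

// Decoding with the right charset round-trips the text.
def right = new String(cp1252, 'windows-1252')

// Decoding the same bytes as UTF-8 turns each curly quote into U+FFFD, the replacement character.
def wrong = new String(cp1252, 'UTF-8')

// Re-encoding the correctly decoded text yields valid UTF-8 (three bytes per curly quote).
byte[] utf8 = right.getBytes('UTF-8')

println "Windows-1252 bytes: ${cp1252.length}, UTF-8 bytes: ${utf8.length}"
println "Decoded as Windows-1252 matches: ${right == text}; decoded as UTF-8 matches: ${wrong == text}"
```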

Now, unfortunately, I've got a folder of bad HTML. How do I turn that into XHTML (actually, I wanted to run that through a further XSLT, to produce wiki markup)? Like this:

import org.ccil.cowan.tagsoup.Parser
import org.xml.sax.InputSource
import javax.xml.transform.TransformerFactory
import javax.xml.transform.sax.SAXSource
import javax.xml.transform.stream.StreamResult

// 'output' is the folder of exported HTML from the previous script.
output.eachFileMatch ~/.*\.html/, { html ->
    // A Transformer created without a stylesheet copies its input to its output.
    def transformer = TransformerFactory.newInstance().newTransformer()
    new File(html.parentFile, html.name - ~/html$/ + 'xhtml').withWriter 'UTF-8', { writer ->
        // Word wrote the file as Windows-1252, so decode it that way; TagSoup repairs the markup.
        html.withReader 'Windows-1252', { reader ->
            transformer.transform(new SAXSource(new Parser(), new InputSource(reader)), new StreamResult(writer))
        }
    }
}

TransformerFactory#newTransformer(), called with no arguments, returns a Transformer that performs the "identity transform," which is what I want: I don't want to change the structure of the XML at all.
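If you want to convince yourself that a Transformer created without a stylesheet leaves the markup alone, here is a sketch that uses only the JDK. The sample element is invented, and I omit the XML declaration only to make the output easier to eyeball.

```groovy
import javax.xml.transform.OutputKeys
import javax.xml.transform.TransformerFactory
import javax.xml.transform.stream.StreamResult
import javax.xml.transform.stream.StreamSource

// No stylesheet argument, so this Transformer copies the source to the result.
def identity = TransformerFactory.newInstance().newTransformer()
// Leave out the <?xml ...?> declaration so the output is easier to compare with the input.
identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, 'yes')

def input = '<p class="note">Hello, <b>world</b></p>'
def out = new StringWriter()
identity.transform(new StreamSource(new StringReader(input)), new StreamResult(out))
def result = out.toString()

println result
```

In the TagSoup script the only real difference is the source: a SAXSource wrapping TagSoup's Parser takes the place of the StreamSource, so the bad HTML is repaired on the way in and serialized as XML on the way out.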

Sunday, March 27, 2011
