Sunday, March 27, 2011

Exporting Word Documents to HTML

I mean real HTML. Here's what I get if I save this blog entry (which I'm editing in Word 2007) as HTML:

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" 
        xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
        xmlns="http://www.w3.org/TR/REC-html40">
    <head>
        <meta http-equiv=Content-Type content="text/html; charset=windows-1252">
        <meta name=ProgId content=Word.Document>

If I save as "filtered HTML," I get something similar:

<html>
    <head>
        <meta http-equiv=Content-Type content="text/html; charset=windows-1252">
        <meta name=Generator content="Microsoft Word 12 (filtered)">

In the first case, I get a namespace declaration that I don't want. What I want is legitimate XHTML (http://www.w3.org/1999/xhtml). C'mon, people, it's 2011! I don't care who's using browsers or authoring tools that can't handle it; they should upgrade.

What I really, really, really want, however, is not to export with the Windows-1252 character encoding, but there seems to be no way to prevent that if it's the default encoding on your Windows machine. If you export to plain text, you can specify an encoding, but not if you export to HTML (as far as I can tell). Weird.

What I also really, really, really want is attribute values enclosed in quotation marks, as has been the standard for, oh, about 12 years. When I tried to process this HTML further as XML, the parser choked on it, as any proper XML parser would.
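
To make the "choke" concrete, here is a minimal sketch in plain Java (rather than Groovy, so it runs with nothing but the JDK). It feeds a Word-style fragment with unquoted attribute values to the standard JAXP parser. The class name StrictParseDemo and the helper isWellFormed are made up for illustration, and the sample markup is a shortened stand-in for Word's real output.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical demo class: checks whether markup is well-formed XML.
public class StrictParseDemo {

    /** Returns true if the JDK's XML parser accepts the markup, false if it rejects it. */
    public static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow parse errors instead of letting the parser print them to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXParseException { throw e; }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed XML
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unquoted attribute values, as in Word's export: rejected.
        System.out.println(isWellFormed(
            "<html><head><meta name=ProgId content=Word.Document></head></html>")); // false
        // Quoted values and a closed <meta/> element: accepted.
        System.out.println(isWellFormed(
            "<html><head><meta name=\"ProgId\" content=\"Word.Document\"/></head></html>")); // true
    }
}
```

Note that the unclosed meta element would be a second well-formedness error even if the attribute values were quoted.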

There are parsers, however, that will read this garbage and turn it into legitimate XHTML. I prefer TagSoup for this purpose. John Cowan, the author, recommends that you not use the JAXP interface, but that's exactly what I want. I'll demonstrate below.

Let me backtrack for a minute. A good part of my job at Palantir requires me to deal with unfriendly input and output formats. A lot of these formats are XML, and usually not very nice XML (if there's a schema, it's almost certain to be misleading). So I've gotten used to dealing with character encodings, and I've gravitated toward Groovy for a lot of my work, as I can often whip something up in minutes without needing an IDE. A colleague recently asked me, quite reasonably, if it would be "easy" to convert Word documents automatically to wiki markup. "Sure!" I said. It was not easy, as it turned out.

The first step was to convert every Word document on hand to HTML. I use Groovy+Scriptom to do this:

import org.codehaus.groovy.scriptom.ActiveXObject
import org.codehaus.groovy.scriptom.Scriptom
import org.codehaus.groovy.scriptom.tlb.office.word.WdSaveFormat
import org.codehaus.groovy.scriptom.tlb.office.MsoEncoding

userHome = new File(System.properties.'user.home')
myDocuments = new File(userHome, 'My Documents')
output = new File(userHome, "Desktop/Tony's Output")

def word = new ActiveXObject('Word.Application')
Scriptom.inApartment {
    try {
        // Run Word invisibly and suppress its dialogs.
        word.Visible = false
        word.DisplayAlerts = false
        def documents = word.Documents
        myDocuments.eachFileMatch ~/.*\.docx/, { doc ->
            println "Opening $doc"
            documents.Open doc.absolutePath
            def activeDocument = word.ActiveDocument
            assert activeDocument
            try {
                activeDocument.AcceptAllRevisions()
                def html = new File(output, doc.name - ~/\.docx$/ + '.html')
                // WdSaveFormat values, from MSFT's docs for the WdSaveFormat enumeration:
                // http://msdn.microsoft.com/en-us/library/bb238158%28office.12%29.aspx
                // 7 is Unicode text; 8 is wdFormatHTML; 10 is wdFormatFilteredHTML;
                // 16 is "default," thus Office 2007; 17 is PDF.
                // Scriptom.MISSING stands in for the optional arguments I'm skipping.
                def n = Scriptom.MISSING
                activeDocument.SaveAs html.absolutePath, WdSaveFormat.wdFormatFilteredHTML, false, n, n, n, n, n, n, n, n, MsoEncoding.msoEncodingUTF8
            } finally {
                activeDocument.Close()
            }
        } // each
    } finally {
        // This is the Office automation API call, which Scriptom resolves for you.
        word.Quit()
        // Apparently it now works: winword.exe disappears from my Task Manager, which is what I want.
    }
} // Close apartment.

I'm not going to explain Scriptom here. Suffice it to say that the code above does roughly what VBA would do. Scriptom can use the constants defined at http://msdn.microsoft.com/en-US/library/microsoft.office.interop.word.wdsaveformat.aspx because someone was thoughtful enough to copy them to http://groovy.codehaus.org/modules/scriptom/1.6.0/scriptom-office-2K3-tlb/apidocs/org/codehaus/groovy/scriptom/tlb/office/word/WdSaveFormat.html. Unfortunately, my attempt to specify the encoding for the output file failed, and I could have left off all the arguments to SaveAs() after the first two.
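
Since the encoding argument was ignored, the exported files stay in Windows-1252 and have to be re-encoded afterwards. As a rough sketch of what that re-encoding amounts to, here is a plain-Java example using only the JDK. The class name ReencodeDemo and the helper toUtf8 are hypothetical, and the sample bytes are invented: 0x93 and 0x94 are the Windows-1252 curly quotes that Word likes to produce.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: converts Windows-1252 bytes to UTF-8 bytes.
public class ReencodeDemo {

    private static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    /** Decodes the bytes as Windows-1252 text and encodes that text as UTF-8. */
    public static byte[] toUtf8(byte[] windows1252Bytes) {
        String text = new String(windows1252Bytes, WINDOWS_1252);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A left curly quote, "Hi", and a right curly quote in Windows-1252: four bytes.
        byte[] input = {(byte) 0x93, 'H', 'i', (byte) 0x94};
        byte[] utf8 = toUtf8(input);
        // Each curly quote takes three bytes in UTF-8, so 3 + 1 + 1 + 3 bytes in total.
        System.out.println(utf8.length); // 8
    }
}
```

Re-encoding the bytes does not touch the meta element inside the file, which would still claim charset=windows-1252 unless it is rewritten as well.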

Now, unfortunately, I've got a folder of bad HTML. How do I turn that into XHTML? (Actually, I wanted to run the result through a further XSLT to produce wiki markup.) Like this:

import org.ccil.cowan.tagsoup.Parser
import org.xml.sax.InputSource
import javax.xml.transform.TransformerFactory
import javax.xml.transform.sax.SAXSource
import javax.xml.transform.stream.StreamResult

// 'output' is the folder of exported HTML defined in the previous script.
output.eachFileMatch ~/.*\.html/, { html ->
    // A Transformer created without a stylesheet just copies its input to its output.
    def transformer = TransformerFactory.newInstance().newTransformer()
    new File(html.parentFile, html.name - ~/html$/ + 'xhtml').withWriter 'UTF-8', { writer ->
        // Word wrote the file as Windows-1252, so read it that way.
        html.withReader 'Windows-1252', { reader ->
            // TagSoup's Parser is a SAX XMLReader, so it can feed the transformer directly.
            transformer.transform(new SAXSource(new Parser(), new InputSource(reader)), new StreamResult(writer))
        }
    }
}

TransformerFactory#newTransformer(), called with no stylesheet, simply returns the "identity transform," which is what I want: I don't want to change the structure of the XML at all.
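
The identity transform is easy to try on its own with nothing but the JDK. The following is a minimal sketch in plain Java: the class name IdentityTransformDemo and the helper copy are invented for illustration, and the input is an already well-formed string rather than TagSoup output.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class: runs XML through a stylesheet-less Transformer.
public class IdentityTransformDemo {

    /** Copies the XML through the identity transform and returns the serialized result. */
    public static String copy(String xml) {
        try {
            // newTransformer() with no arguments yields the identity transform.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            // Leave out the <?xml ...?> declaration so the output is easy to compare.
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The element structure and attribute come out as they went in.
        System.out.println(copy("<p class=\"x\">Hello <b>world</b></p>"));
    }
}
```

Swapping the StreamSource for a SAXSource backed by a lenient parser is what turns this same copy into an HTML-to-XHTML converter.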
