Sunday, March 27, 2011

Exporting Word Documents to HTML

I mean real HTML. Here's what I get if I save this blog entry (which I'm editing in Word 2007) as HTML:

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" 
        xmlns:m=http://schemas.microsoft.com/office/2004/12/omml
        xmlns="http://www.w3.org/TR/REC-html40">
    <head>
        <meta http-equiv=Content-Type content="text/html; charset=windows-1252">
        <meta name=ProgId content=Word.Document>


 

If I save as "filtered HTML," I get something similar:

<html>
    <head>
        <meta http-equiv=Content-Type content="text/html; charset=windows-1252">
        <meta name=Generator content="Microsoft Word 12 (filtered)">


 

In the first case, I get a namespace declaration that I don't want. What I want is legitimate XHTML (http://www.w3.org/1999/xhtml). C'mon, people, it's 2011! I don't care who's using browsers or authoring tools that can't handle it; they should upgrade. What I really, really, really want, however, is not to export with the Windows-1252 character encoding. However, there seems to be no way to prevent that if it's the default encoding on your Windows machine. If you export to plain text, you can specify an encoding, but not if you export to HTML (as far as I can tell). Weird. What I also really, really, really want is attribute values enclosed in quotation marks, as has been the standard for, oh, about 12 years. If I attempted (I did attempt it, actually) to further process this HTML as XML, any proper XML parser would choke on it.

There are parsers, however, that will read this garbage and turn it into legitimate XHTML. I prefer TagSoup for this purpose. John Cowan, the author, recommends that you not use the JAXP interface, but that's exactly what I want. I'll demonstrate below.

Let me backtrack for a minute. A good part of my job at Palantir requires me to deal with unfriendly input and output formats. A lot of these formats are XML, and usually not very nice XML (if there's a schema, it's almost certain to be misleading). So I've gotten used to dealing with character encodings, and I've gravitated toward Groovy for a lot of my work, as I can often whip something up in minutes without needing an IDE. A colleague recently asked me, quite reasonably, if it would be "easy" to convert Word documents automatically to wiki markup. "Sure!" I said. It was not easy, as it turned out.

The first step was to convert every Word document on hand to HTML. I use Groovy+Scriptom to do this:

import org.codehaus.groovy.scriptom.ActiveXObject
import org.codehaus.groovy.scriptom.Scriptom
import org.codehaus.groovy.scriptom.tlb.office.word.WdSaveFormat
import org.codehaus.groovy.scriptom.tlb.office.MsoEncoding

userHome = new File(System.properties.'user.home')
myDocuments = new File(userHome, 'My Documents')
output = new File(userHome, "Desktop/Tony's Output")

def word = new ActiveXObject('Word.Application')
Scriptom.inApartment {
try {
word.Visible = false // ('Visible', new Variant(false))
word.DisplayAlerts = false // ', new Variant(false))
def documents = word.Documents // ').toDispatch()
myDocuments.eachFileMatch ~/.*\.docx/, { doc ->
println "Opening $doc"
documents.Open doc.absolutePath
def activeDocument = word.ActiveDocument
assert activeDocument
try {
activeDocument.AcceptAllRevisions()
def html = new File(output, doc.name - ~/\.docx$/ + '.html')
// 7 is the magic number for Unicode text. See MSFT's docs for WdSaveFormatEnumeration.
// http://msdn.microsoft.com/en-us/library/bb238158%28office.12%29.aspx
// 17 is PDF; 16 is "default," thus Office 2007.
// wdFormatHTML = 8
def n = Scriptom.MISSING
activeDocument.SaveAs html.absolutePath, WdSaveFormat.wdFormatFilteredHTML, false, n, n, n, n, n, n, n, n, MsoEncoding.msoEncodingUTF8
} finally {
activeDocument.Close()
}
} // each
} finally {
// This is the Office automation API call, which Scriptom resolves for you.
word.Quit()
// Apparently it now works: winword.exe disappears from my Task Manager, which is what I want.
}
} // Close apartment.


 

I'm not going to explain Scriptom here. Suffice it to say that the code above does roughly what VBA would do. Scriptom can use the constants defined at http://msdn.microsoft.com/en-US/library/microsoft.office.interop.word.wdsaveformat.aspx because someone was thoughtful enough to copy them to http://groovy.codehaus.org/modules/scriptom/1.6.0/scriptom-office-2K3-tlb/apidocs/org/codehaus/groovy/scriptom/tlb/office/word/WdSaveFormat.html. Unfortunately, my attempt to specify the encoding for the output file failed, and I could have left off all the arguments to SaveAs() after the first two.

Now, unfortunately, I've got a folder of bad HTML. How do I turn that into XHTML (actually, I wanted to run that through a further XSLT, to produce wiki markup)? Like this:

import org.ccil.cowan.tagsoup.Parser
import org.xml.sax.*
import javax.xml.transform.*
import javax.xml.transform.sax.SAXSource
import javax.xml.transform.stream.StreamResult
import javax.xml.transform.stream.StreamSource

output.eachFileMatch ~/.*\.html/, { html ->
def transformer = TransformerFactory.newInstance().newTransformer()
new File(html.parentFile, html.name - ~/html$/ + 'xhtml').withWriter 'UTF-8', { writer ->
html.withReader 'Windows-1252', { reader ->
transformer.transform(new SAXSource(new Parser(), new InputSource(reader)), new StreamResult(writer));
}
}
}


 

TransformerFactory#newInstance() simply returns the "identity transform," which is what I want: I don't want to change the structure of the XML at all.

Saturday, March 26, 2011

Interval Coalesce with XSLT

I was intrigued by Vadim Tropashko's SQL Design Patterns. I'm not a SQL guy, so like lots of developers, I think of databases as places to put data, from which to retrieve it, and which occasionally need tuning. I don't think of the execution engine as something that virtually can create additional information for me. I've written maybe two pivot queries in my life, for example. Someone who's adept with Mathematica, or Excel for that matter, can do things in a minute that would take a Java developer days, since a SQL rowset is a very impoverished data structure in Java.

Anyway, I saw an immediate use for Tropashko's "interval coalesce" algorithm. The problem is, given a collection of intervals, to coalesce those that overlap, and thus produce a small collection of non-overlapping intervals. Well, "immediate" is misleading; I waited a year to do this. Anyway, I fired up Oracle XE and made up some data. Then I entered the query on p. 37…only to find out that there's a typo in it. Several hours later, I figured out where it was:

SELECT fst.x, lst.y -- Find two endpoints.

FROM intervals fst, intervals lst WHERE fst.x < lst.y

AND NOT EXISTS ( -- There's no interval beginning between these endpoints...

  SELECT * FROM intervals i

  WHERE i.x > fst.x AND i.x < lst.y

  AND NOT EXISTS ( -- ...for which there's no covering interval.

    SELECT * FROM intervals cov 

    WHERE i.x > cov.x AND i.x <= cov.y

  )

) AND NOT EXISTS (

  SELECT * FROM intervals cov

  WHERE cov.x < fst.x AND fst.x <= cov.y

  OR cov.x <= lst.y AND lst.y < cov.y

)


 

You might notice a discrepancy from the query in the book...and I've reported the erratum!

Tropashko goes on to show a more elegant and efficient way to produce this result, but at this point I wanted to get back to my own practical problem, which was to coalesce intervals defined in an XML document. Since the query above involves three self-joins, I could expect n^4 performance if I simply turned Tropashko's SQL into XPath. However, an XML document is inherently ordered (there's no ANSI SQL equivalent to following-sibling::*[1]). If I sorted the intervals according to their left end, i.e. wrote an XSLT to preprocess the input, then I could simply iterate through the intervals one time, gradually coalescing them where possible. The code falls into a pattern familiar to functional programmers. In other words, I apply a template to the first interval in the document, and that template calls a second template that uses "accumulators" for the current coalesced interval. This pattern is also used in Jeni Tennison's (http://www.jenitennison.com/xslt/index.html) <span style="font-style: italic;">XSLT on the Edge</span> as a way of grouping adjacent elements. I wanted to avoid XPath stunts like Tropashko's SQL, because I will have to apply this transform to documents with 10s of 1000s of data points. I very deliberate select only following-sibling::*[1].

Since blogger's editor widget hacks up my XSLT in horrifying ways, and I can't attach text files, I had to format this XSTL with non-breaking spaces and whatnot. Let me know if you want the real thing. Let's say the input looks like this (it must be sorted, perhaps by another XSLT, by starting time only):

<?xml version='1.0' ?>
<intervals>

<interval x="10" y="14.1155607415243584888223696987094362306"/>

<interval x="10" y="27.2574271976039591982711142331954001295"/>

<interval x="30" y="33.7147910106524433624672477551503835313"/>

<interval x="40" y="46.844920420920822280815378491005656766"/>

<interval x="50" y="61.30421963829719394371538034317434986025"/>


Then the XSLT to coalesce these intervals will look about like this:

<?xml version="1.0"?>
<xsl:stylesheet version="2.0"

xmlns:xsl=
"http://www.w3.org/1999/XSL/Transform"

xmlns:xs=
"http://www.w3.org/2001/XMLSchema">


<xsl:output indent="yes"/>


<!-- I could have named this template "start-recursion", but it's
convenient to be able to set the context node. --
>

<xsl:template match="interval">

<xsl:call-template name="coalesce-intervals">

<xsl:with-param name="next-interval" select="."/>

<xsl:with-param name="from" select="number(@x)" tunnel="yes"/>

<xsl:with-param name="to" select="number(@y)" tunnel="yes"/>

</xsl:call-template>

</xsl:template>


<xsl:template name="coalesce-intervals">

<xsl:param name="next-interval" as="element(interval)?"/>

<xsl:param name="from" as="xs:double" tunnel="yes"/>

<xsl:param name="to" as="xs:double" tunnel="yes"/>

<xsl:choose>

<!-- Stop the recursion. -->

<xsl:when test="not($next-interval)">

<interval x="{$from}" y="{$to}"/>

</xsl:when>

<!-- The current coalesced interval overlaps this one, so move on. -->

<xsl:when test="$to gt number($next-interval/@x)">

<xsl:call-template name="coalesce-intervals">

<xsl:with-param name="next-interval" select="$next-interval/following-sibling::interval[1]"/>

<!-- Extend the current interval if possible (hence max). -->

<xsl:with-param name="to" select="max((number($next-interval/@y), $to))" tunnel="yes"/>

</xsl:call-template>

</xsl:when>

<xsl:otherwise>

<!-- No more to coalesce. Output the "accumulator" and start again. -->

<interval x="{$from}" y="{$to}"/>

<xsl:apply-templates select="$next-interval"/>

</xsl:otherwise>

</xsl:choose>

</xsl:template>


<xsl:template match="/intervals">

<intervals>

<xsl:apply-templates select="interval[1]"/>

</intervals>

</xsl:template>
</xsl:stylesheet>