Saturday, May 02, 2009

Extracting the Text of an HTML Document

This is something I often have to do within an XSLT: there's some base-64 encoded HTML in a text node, and I'd like to extract the body text. Saxon offers saxon:parse() and saxon:base64Binary-to-string(), which are useful here. If you use the TagSoup parser to turn possibly nasty HTML into XHTML, then you can extract the text from the "thing" element with the following (the 'UTF-8' argument assumes the HTML was UTF-8 before it was encoded; adjust to taste):

<xsl:variable name="html" select="saxon:parse(saxon:base64Binary-to-string(xs:base64Binary(thing), 'UTF-8'))"/>
<xsl:value-of select="$html/xhtml:html/xhtml:body//xhtml:*[local-name() != 'script']/text()" separator=""/>

Saxon's parse() function will fail if you don't direct Saxon (from the command line, or programmatically) to use the TagSoup parser. OTOH, you could resort to Groovy. Put TagSoup on your CLASSPATH, and:

import org.ccil.cowan.tagsoup.*

parser = new Parser()
// TagSoup offers loads of interesting options...check 'em out!
f = new File('C:\\Documents and Settings\\tnassar\\Desktop\\Efficient.html')

As we're reminded here, we can probably do without the namespace declarations. I quote:
  • name or "*:name" matches an element named "name" irrespective of the namespace it's in (i.e. this is the default mode of operation)
  • ":name" matches an element named "name" only if the element is not in a namespace
  • "prefix:name" matches an element named "name" only if it is in the namespace identified by the prefix "prefix" (and the prefix to namespace mapping was defined by a previous call to declareNamespace)
Anyway, force of habit:
html = new XmlSlurper(parser).parse(f).declareNamespace(xhtml: 'http://www.w3.org/1999/xhtml')

html.'xhtml:body'.depthFirst().findAll { it.name() != 'script' }*.text().join('\n')
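To make the three matching rules quoted above concrete, here is a minimal sketch that needs no TagSoup at all. The inline document is made up for illustration and simply stands in for what TagSoup would produce; the import line assumes Groovy 3 or later (on older versions XmlSlurper is available without it).

```groovy
// Assumption: Groovy 3+, where XmlSlurper lives in groovy.xml.
// On Groovy 2.x and earlier, drop this import.
import groovy.xml.XmlSlurper

// Made-up sample standing in for TagSoup output (XHTML namespace).
def xml = '''<html xmlns="http://www.w3.org/1999/xhtml">
  <body><p>Hello</p><script>var x = 1;</script><p>World</p></body>
</html>'''

def html = new XmlSlurper().parseText(xml)

// Rule 1: a plain name matches regardless of namespace (the default).
def plain = html.body.p.collect { it.text() }

// Rule 2: ":name" matches only elements in no namespace, so nothing here.
def noNamespace = html.':body'.size()

// Rule 3: "prefix:name" works once the prefix has been declared.
def declared = html.declareNamespace(xhtml: 'http://www.w3.org/1999/xhtml')
def prefixed = declared.'xhtml:body'.'xhtml:p'.collect { it.text() }

println plain
println noNamespace
println prefixed
```

One thing to keep in mind: text() on a GPath node returns the text of all its descendants, so calling it on both a parent and its children (as depthFirst() does) will repeat text.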

Extracting the Text from a Word Document w/ Groovy

I'm doing a lot of data munging these days, and often have to talk to Java APIs. Since I'm not writing production code, don't have lots of RAM or lots of good tools at my disposal, and would therefore just as soon use a text editor (SciTE's my current one; I can't get the Win32 port of Emacs!), Groovy is definitely my best choice. Notwithstanding my preference for XSLT (over GPath) to handle XML, I can't deny that you can do some slick stuff w/ GPath. At another munger's request, I cooked this up in 5 minutes, and was almost shocked at how easy it was to get the text out of an Office 2007 docx file:


import java.util.zip.ZipFile

docx = new File('C:\\Documents and Settings\\tnassar\\My Documents\\Efficient.docx')
zip = new ZipFile(docx)
entry = zip.getEntry('word/document.xml')
stream = zip.getInputStream(entry)

// The namespace was gleaned from the decompressed XML.
wordMl = new XmlSlurper().parse(stream).declareNamespace(w: 'http://schemas.openxmlformats.org/wordprocessingml/2006/main')

// The outermost XML element node is assigned to the variable wordMl, so
// GPath expressions will start after that. To print out the concatenated
// descendant text nodes of w:body, you use:

text = wordMl.'w:body'.children().collect { it.text() }.join('')

println text

It would be nice if Groovy offered "raw strings" like Python (r'C:\Documents and Settings\...') or C#, which lets you prepend a '@' to have backslashes treated literally, especially when it comes to Windows pathnames, but whatever.
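For what it's worth, Groovy's slashy strings get fairly close to a raw string for this purpose. A small sketch, reusing the path from the first example:

```groovy
// Slashy string: backslashes are taken literally, so no doubling is needed.
def slashy = /C:\Documents and Settings\tnassar\Desktop\Efficient.html/

// The same path in the single-quoted, escaped form used above.
def escaped = 'C:\\Documents and Settings\\tnassar\\Desktop\\Efficient.html'

println slashy == escaped
```

The caveats: a slashy string can't end with a backslash, and a '$' in it triggers interpolation. Java's File class also accepts forward slashes on Windows, which sidesteps the problem entirely.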

This will not work well for complex document formats (I can imagine that tables and such would be a disaster), but for me it was just enough.
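A small step up, if paragraph boundaries matter, is to collect the text one w:p element at a time. The sketch below is only that, a sketch: it builds a tiny made-up stand-in for a .docx in a temp file so that it runs on its own, and the import line assumes Groovy 3 or later. Tables and other structures would still need their own handling.

```groovy
// Assumption: Groovy 3+ (on older versions, drop the XmlSlurper import).
import groovy.xml.XmlSlurper
import java.util.zip.ZipEntry
import java.util.zip.ZipFile
import java.util.zip.ZipOutputStream

def wNs = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

// Build a minimal, hypothetical stand-in for a .docx so the example is self-contained.
def docx = File.createTempFile('sample', '.docx')
docx.deleteOnExit()
def out = new ZipOutputStream(new FileOutputStream(docx))
out.putNextEntry(new ZipEntry('word/document.xml'))
def body = """<w:document xmlns:w="${wNs}"><w:body>
<w:p><w:r><w:t>First </w:t></w:r><w:r><w:t>paragraph.</w:t></w:r></w:p>
<w:p><w:r><w:t>Second paragraph.</w:t></w:r></w:p>
</w:body></w:document>"""
out.write(body.toString().getBytes('UTF-8'))
out.closeEntry()
out.close()

// Same approach as above: open the zip, slurp word/document.xml.
def zip = new ZipFile(docx)
def wordMl = new XmlSlurper()
        .parse(zip.getInputStream(zip.getEntry('word/document.xml')))
        .declareNamespace(w: wNs)

// One string per w:p, so paragraphs can be joined with newlines.
def paragraphs = wordMl.'w:body'.'w:p'.collect { it.text() }
zip.close()

println paragraphs.join('\n')
```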