SlideGuitarist: Extracting the Text from a Word Document w/ Groovy

Saturday, May 02, 2009

Extracting the Text from a Word Document w/ Groovy

As I'm doing a lot of data munging these days, and often have to talk to Java APIs. Since I'm not writing production code, don't have lots of RAM or lots of good tools at my disposal and therefore would just as soon use a text editor (SciTE's my current preference...no, I can't get the Win32 port of emacs!), Groovy is definitely my best choice. Notwithstanding my preference for XSLT (over GPath) to handle XML, I can't deny that you can do some slick stuff w/ GPath. At another munger's request, I cooked this up in 5 minutes, and was almost shocked at how easy it was to get the text out of an Office 2007 docx file:


import java.util.zip.*

docx = new File('C:\\Documents and Settings\\tnassar\\My Documents\\Efficient.docx')
zip = new ZipFile(docx)
entry = zip.getEntry('word/document.xml')
stream = zip.getInputStream(entry)

// The namespace was gleaned from the decompressed XML.
wordMl = new XmlSlurper().parse(stream).declareNamespace(w: 'http://schemas.openxmlformats.org/wordprocessingml/2006/main')

// The outermost XML element node is assigned to the variable wordMl, so
// GPath expressions will start after that. To print out the concatenated
// descendant text nodes of w:body, you use:

text = wordMl.'w:body'.children().collect { it.text() }.join('')

println text

It would be nice if Groovy offered "raw strings" like Python--r'C:\Documents and Settings\...'--or C#, which lets you prepend a '@' to have backslashes treated literally--esp. when it comes to Windows pathnames, but whatever.

This will not work well for complex document formats (I can imagine that tables and such would be a disaster), but for me it was just enough.