Wednesday, July 22, 2009

An XSLT to Extract Text from Word

Well, this turns out to be even easier. If you've got the text:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl=""

<xsl:output method="text" encoding="UTF-8" />

<xsl:template match="*">
<!-- Simply recurse over the children. -->
<xsl:apply-templates />

<!-- Any piece of text from the .docx is enclosed in a w:t element. -->
<xsl:template match="w:t">
<xsl:value-of select="."/>
<!-- Look for the xml:space attribute. A namespace is not necessary. -->
<!-- XPath 1.0 doesn't have the starts-with() and ends-with() functions, unfortunately. -->
<xsl:if test="@space = 'preserve'">
<xsl:text> </xsl:text>

<xsl:template match="w:p">
<xsl:apply-templates />


To get the text, here's some Groovy:

def createDocumentXml(f) {
def zip = new ZipFile(f)
def entry = zip.getEntry('word/document.xml')
assert entry
def xml = new File(f.absolutePath - '.docx' + '.xml')
// I want to copy the XML to a file, for purposes of visual inspection.
// In production code I'd simply return the stream.
zip.getInputStream(entry).withStream { i ->
xml.withOutputStream { o ->
o << i
return xml

To apply the XSLT to the document, see the attachments.