Well, this turns out to be even easier. Once you've got the document XML, this stylesheet extracts the text:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
                version="1.0">
    <xsl:output method="text" encoding="UTF-8" />
    <xsl:template match="*">
        <!-- Simply recurse over the children. -->
        <xsl:apply-templates />
    </xsl:template>
    <!-- Any piece of text from the .docx is enclosed in a w:t element. -->
    <xsl:template match="w:t">
        <xsl:value-of select="."/>
        <!-- Look for the xml:space attribute. It is in the XML namespace, so it has to be
             addressed as @xml:space; the xml prefix is predeclared, so no xmlns declaration is necessary. -->
        <!-- XPath 1.0 has starts-with() but, unfortunately, no ends-with(). -->
        <xsl:if test="@xml:space = 'preserve'">
            <xsl:text> </xsl:text>
        </xsl:if>
    </xsl:template>
    <xsl:template match="w:p">
        <xsl:apply-templates />
        <!-- End each paragraph with a newline. -->
        <xsl:text>&#10;</xsl:text>
    </xsl:template>
</xsl:stylesheet>
To get the XML out of the .docx in the first place, here's some Groovy:
import java.util.zip.ZipFile

def createDocumentXml(File f) {
    def zip = new ZipFile(f)
    try {
        // A .docx is a zip archive; the main body lives in word/document.xml.
        def entry = zip.getEntry('word/document.xml')
        assert entry
        def xml = new File(f.absolutePath - '.docx' + '.xml')
        // I want to copy the XML to a file, for purposes of visual inspection.
        // In production code I'd simply return the stream.
        zip.getInputStream(entry).withStream { i ->
            xml.withOutputStream { o ->
                o << i
            }
        }
        return xml
    } finally {
        // Release the file handle held by the archive.
        zip.close()
    }
}
To apply the XSLT to the document, see the attachments.
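For readers without the attachments, here is a minimal sketch of that last step. It assumes the standard JAXP API (javax.xml.transform) that ships with the JDK. The helper name extractText and the file names in the usage note are hypothetical and may not match what the attachments actually do.

```groovy
import javax.xml.transform.TransformerFactory
import javax.xml.transform.stream.StreamResult
import javax.xml.transform.stream.StreamSource

// Hypothetical helper (not from the original post): runs the stylesheet
// over the extracted document XML and returns the resulting plain text.
String extractText(File documentXml, File stylesheet) {
    def transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(stylesheet))
    def out = new StringWriter()
    transformer.transform(new StreamSource(documentXml), new StreamResult(out))
    return out.toString()
}
```

Assuming the stylesheet above is saved as docx2txt.xsl (a made-up name), something like println extractText(createDocumentXml(new File('report.docx')), new File('docx2txt.xsl')) should print the document's text, one paragraph per line.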