SlideGuitarist: July 2009

Wednesday, July 22, 2009

An XSLT to Extract Text from Word

Well, this turns out to be even easier. If you've got the document's XML, this stylesheet pulls out the text:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
       xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
       version="1.0">

  <xsl:output method="text" encoding="UTF-8" />

  <xsl:template match="*">
    <!-- Simply recurse over the children. -->
    <xsl:apply-templates />
  </xsl:template>
 
  <!-- Any piece of text from the .docx is enclosed in a w:t element. -->
  <xsl:template match="w:t">
    <xsl:value-of select="."/>
    <!-- Look for the xml:space attribute. The xml prefix is predeclared, so no
         namespace declaration is necessary, but the prefix itself is required:
         a bare @space would never match it. -->
    <!-- XPath 1.0 has starts-with() but no ends-with() function, unfortunately. -->
    <xsl:if test="@xml:space = 'preserve'">
      <xsl:text> </xsl:text>
    </xsl:if>
  </xsl:template>
  
  <xsl:template match="w:p">
    <xsl:apply-templates />
    <xsl:text>&#xa;</xsl:text>
  </xsl:template>
  
</xsl:stylesheet>

To get word/document.xml out of the .docx (which is just a zip archive), here's some Groovy:


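import java.util.zip.ZipFile

// f is the .docx as a java.io.File; returns a sibling .xml File holding word/document.xml.
def createDocumentXml(f) {
    def zip = new ZipFile(f)
    try {
        def entry = zip.getEntry('word/document.xml')
        assert entry
        def xml = new File(f.absolutePath - '.docx' + '.xml')
        // I want to copy the XML to a file, for purposes of visual inspection.
        // In production code I'd simply return the stream.
        zip.getInputStream(entry).withStream { i ->
            xml.withOutputStream { o ->
                o << i
            }
        }
        return xml
    } finally {
        // Release the archive's file handle once the entry has been copied.
        zip.close()
    }
}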
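
The comment in the code above mentions returning the stream instead of writing a file. A minimal sketch of that variant follows; the name withDocumentXml and the closure-based shape are assumptions for illustration, chosen so that the ZipFile can still be closed once the caller is done with the stream.

```groovy
import java.util.zip.ZipFile

// Hypothetical helper: hands the word/document.xml stream to the given closure
// and returns whatever the closure returns. The archive is closed afterwards,
// so the stream must not be used outside the closure.
def withDocumentXml(File docx, Closure action) {
    def zip = new ZipFile(docx)
    try {
        def entry = zip.getEntry('word/document.xml')
        assert entry : "no word/document.xml in ${docx}"
        // withStream closes the entry's stream after the closure has run.
        return zip.getInputStream(entry).withStream(action)
    } finally {
        zip.close()
    }
}
```

For example, `withDocumentXml(new File('report.docx')) { it.getText('UTF-8') }` would return the XML as a String (report.docx is a made-up file name).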

To apply the XSLT to the document, see the attachments.
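
A minimal sketch of one way to do that from Groovy, using the JDK's standard javax.xml.transform API. This is not necessarily how the attachments do it, and the name extractText is made up for illustration.

```groovy
import javax.xml.transform.TransformerFactory
import javax.xml.transform.stream.StreamResult
import javax.xml.transform.stream.StreamSource

// Hypothetical helper: applies the stylesheet file to the extracted XML file
// and returns the transformer's output as a String.
def extractText(File xslt, File xml) {
    def transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(xslt))
    def out = new StringWriter()
    transformer.transform(new StreamSource(xml), new StreamResult(out))
    return out.toString()
}
```

With the stylesheet saved as, say, extract-text.xsl (an assumed file name), `println extractText(new File('extract-text.xsl'), createDocumentXml(new File('report.docx')))` would print the document's text, one paragraph per line.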
