SlideGuitarist: An XSLT to Extract Text from Word

Wednesday, July 22, 2009

An XSLT to Extract Text from Word

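Well, this turns out to be even easier. If you've got the XML text of the document (word/document.xml inside the .docx), this stylesheet pulls the plain text out of it: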

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
       xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
       version="1.0">

  <xsl:output method="text" encoding="UTF-8" />

  <xsl:template match="*">
    <!-- Simply recurse over the children. -->
    <xsl:apply-templates />
  </xsl:template>
 
  <!-- Any piece of text from the .docx is enclosed in a w:t element. -->
  <xsl:template match="w:t">
    <xsl:value-of select="."/>
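    <!-- No special handling is needed for xml:space="preserve": value-of already
         emits the run's leading and trailing whitespace verbatim. To test the
         attribute you would write @xml:space (the xml prefix is bound implicitly,
         so no namespace declaration is necessary); a bare @space never matches. -->
    <!-- XPath 1.0 has starts-with() but, unfortunately, no ends-with(). -->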
  </xsl:template>
  
  <xsl:template match="w:p">
    <xsl:apply-templates />
    <xsl:text>&#xa;</xsl:text>
  </xsl:template>
  
</xsl:stylesheet>

To get document.xml out of the .docx (which is just a zip archive), here's some Groovy:


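import java.util.zip.ZipFile

// Copies word/document.xml out of the .docx into a sibling .xml file and returns that file.
def createDocumentXml(f) {
    def zip = new ZipFile(f)
    try {
        def entry = zip.getEntry('word/document.xml')
        assert entry
        def xml = new File(f.absolutePath - '.docx' + '.xml')
        // I want to copy the XML to a file, for purposes of visual inspection.
        // In production code I'd simply return the stream.
        zip.getInputStream(entry).withStream { i ->
            xml.withOutputStream { o ->
                o << i
            }
        }
        return xml
    } finally {
        // Release the archive's file handle even if the copy fails.
        zip.close()
    }
}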

To apply the XSLT to the document, see the attachments.
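
For a rough idea of what that step can look like, here is a minimal sketch using the JDK's built-in javax.xml.transform API. The helper name applyStylesheet is an assumption made for illustration and is not necessarily how the attachments do it.

```groovy
import javax.xml.transform.TransformerFactory
import javax.xml.transform.stream.StreamResult
import javax.xml.transform.stream.StreamSource

// Hypothetical helper (name chosen for illustration): runs the stylesheet in
// `xslt` over the XML file `xml` and returns the transformed output as a String.
def applyStylesheet(File xml, File xslt) {
    // Compile the stylesheet with whatever XSLT processor the JDK provides.
    def transformer = TransformerFactory.newInstance().newTransformer(new StreamSource(xslt))
    def out = new StringWriter()
    // Collect the text output in memory rather than writing it to a file.
    transformer.transform(new StreamSource(xml), new StreamResult(out))
    return out.toString()
}
```

Combined with the function above, something like applyStylesheet(createDocumentXml(new File('example.docx')), new File('extract-text.xsl')) should return the document's text, one paragraph per line. Both file names are placeholders.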
