<xsl:variable name="html" select="saxon:parse(saxon:base64-to-string(xs:base64(thing)))"/>
<xsl:value-of select="$html/xhtml:body//xhtml:*[local-name() != 'script']/text()" separator=""/>
Saxon's parse() function will fail if you don't direct Saxon (from the command line, or programmatically) to use the TagSoup parser. OTOH, you could resort to Groovy. Put TagSoup on your CLASSPATH, and:
import org.ccil.cowan.tagsoup.*
parser = new Parser()
// TagSoup offers loads of interesting options...check 'em out!
f = new File('C:\\Documents and Settings\\tnassar\\Desktop\\Efficient.html')
As we're reminded here, we can probably do without the namespace declarations. I quote:
- name or "*:name" matches an element named "name" irrespective of the namespace it's in (i.e. this is the default mode of operation)
- ":name" matches an element named "name" only id the element is not in a namespace
- "prefix:name" matches an element names "name" only if it is in the namespace identified by the prefix "prefix" (and the prefix to namespace mapping was defined by a previous call to declareNamespace)
html = new XmlSlurper(parser).parse(f).declareNamespace(xhtml: 'http://www.w3.org/1999/xhtml')
html.'xhtml:body'.depthFirst().findAll { it.name() != 'script' }*.text().join('\n')