Saturday, May 02, 2009

Extracting the Text of an HTML Document

This is something I often have to do within an XSLT stylesheet: there's some base64-encoded HTML in a text node, and I'd like to extract the body text. Saxon offers two extension functions that are useful here, saxon:parse() and saxon:base64Binary-to-string(). If you use the TagSoup parser to turn possibly nasty HTML into XHTML, then you can extract the text from the "thing" element as follows (the 'UTF-8' argument assumes that is the encoding of the decoded bytes; saxon:parse() returns a document node, so the path starts at xhtml:html):

<xsl:variable name="html" select="saxon:parse(saxon:base64Binary-to-string(xs:base64Binary(thing), 'UTF-8'))"/>
<xsl:value-of select="$html/xhtml:html/xhtml:body//xhtml:*[local-name() != 'script']/text()" separator=""/>

Saxon's parse() function will fail on HTML that isn't well-formed XML unless you direct Saxon (from the command line via the -x option, or programmatically) to use the TagSoup parser. On the other hand, you could resort to Groovy. Put TagSoup on your CLASSPATH, and:

import org.ccil.cowan.tagsoup.*

parser = new Parser()
// TagSoup offers loads of interesting options...check 'em out!
f = new File('C:\\Documents and Settings\\tnassar\\Desktop\\Efficient.html')

As we're reminded here, we can probably do without the namespace declarations. I quote:
  • name or "*:name" matches an element named "name" irrespective of the namespace it's in (i.e. this is the default mode of operation)
  • ":name" matches an element named "name" only if the element is not in a namespace
  • "prefix:name" matches an element named "name" only if it is in the namespace identified by the prefix "prefix" (and the prefix to namespace mapping was defined by a previous call to declareNamespace)
Anyway, force of habit:
html = new XmlSlurper(parser).parse(f).declareNamespace(xhtml: 'http://www.w3.org/1999/xhtml')

html.'xhtml:body'.depthFirst().findAll { it.name() != 'script' }*.text().join('\n')
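
To tie this back to the base64 case and to the namespace rules quoted above, here is a minimal sketch. It assumes Groovy 3 or later, and it uses a made-up, already well-formed XHTML sample (the names sample, encoded, and decoded are mine) so that it runs without TagSoup on the CLASSPATH. It only shows that a plain name and a declared prefix select the same elements; it is not a replacement for the TagSoup-backed slurper above.

```groovy
// Assumes Groovy 3 or later; on Groovy 2.x drop this import
// (XmlSlurper is groovy.util.XmlSlurper there and is imported by default).
import groovy.xml.XmlSlurper

// Hypothetical sample: well-formed XHTML, so no TagSoup is needed to run this.
def sample = '''<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Sample</title></head>
<body><p>Hello</p><script>var x = 1;</script><p>World</p></body>
</html>'''

// Simulate the base64-encoded text node from the XSLT scenario.
def encoded = sample.getBytes('UTF-8').encodeBase64().toString()

// Decode it back to markup (UTF-8 is an assumption about the payload).
def decoded = new String(encoded.decodeBase64(), 'UTF-8')

// No declareNamespace() call: a plain name matches in any namespace.
def html = new XmlSlurper().parseText(decoded)
def plain = html.body.p.collect { it.text() }

// With the prefix declared, the qualified name selects the same elements.
def declared = new XmlSlurper().parseText(decoded)
        .declareNamespace(xhtml: 'http://www.w3.org/1999/xhtml')
def prefixed = declared.'xhtml:body'.'xhtml:p'.collect { it.text() }

println plain.join('\n')
```

For real-world HTML you would still pass the TagSoup Parser to the XmlSlurper constructor, as in the snippet above.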
