Saturday, May 02, 2009

Extracting the Text of an HTML Document

This is something I often have to do within an XSLT stylesheet: there's some base-64 encoded HTML in a text node, and I'd like to extract the body text. Saxon offers the extension functions saxon:parse() and saxon:base64Binary-to-string(), which are useful here. If you use the TagSoup parser to turn possibly nasty HTML into XHTML, then you can extract the text from the "thing" element with

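<!-- 'UTF-8' is an assumption; pass whatever encoding the HTML was in before it was base-64 encoded -->
<xsl:variable name="html" select="saxon:parse(saxon:base64Binary-to-string(xs:base64Binary(thing), 'UTF-8'))"/>
<!-- saxon:parse() returns a document node, so xhtml:body is reached through a descendant step -->
<xsl:value-of select="$html//xhtml:body//xhtml:*[local-name() != 'script']/text()" separator=""/>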

Saxon's parse() function will fail on HTML that isn't well-formed XML unless you direct Saxon (from the command line, or programmatically) to use the TagSoup parser. On the other hand, you could resort to Groovy. Put TagSoup on your CLASSPATH, and:

import org.ccil.cowan.tagsoup.*

parser = new Parser()
// TagSoup offers loads of interesting options...check 'em out!
f = new File('C:\\Documents and Settings\\tnassar\\Desktop\\Efficient.html')

As we're reminded here, we can probably do without the namespace declarations. I quote:
  • name or "*:name" matches an element named "name" irrespective of the namespace it's in (i.e. this is the default mode of operation)
  • ":name" matches an element named "name" only if the element is not in a namespace
  • "prefix:name" matches an element named "name" only if it is in the namespace identified by the prefix "prefix" (and the prefix to namespace mapping was defined by a previous call to declareNamespace)
Anyway, force of habit:
html = new XmlSlurper(parser).parse(f).declareNamespace(xhtml: 'http://www.w3.org/1999/xhtml')

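// text() on an element also returns the text of all its descendants (script included),
// so collect only each element's own text; localText() assumes Groovy 2.3 or later
html.'xhtml:body'.depthFirst().findAll { it.name() != 'script' }.collect { it.localText().join('') }.join('\n')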
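
For a quick cross-check outside the JVM, the same three steps (decode the base-64 text, parse the HTML leniently, keep the body text that isn't inside a script element) can be sketched with nothing but the Python standard library. This is only a sketch under assumptions: it swaps TagSoup for Python's html.parser, the names BodyTextExtractor and extract_body_text and the sample markup are made up for illustration, and the UTF-8 default is a guess about how the HTML was encoded.

```python
import base64
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects text found inside <body>, skipping anything inside <script>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_body = False
        self.script_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case
        if tag == "body":
            self.in_body = True
        elif tag == "script":
            self.script_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag == "script" and self.script_depth > 0:
            self.script_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are inside the body and outside any script
        if self.in_body and self.script_depth == 0:
            self.chunks.append(data)


def extract_body_text(encoded_html, encoding="utf-8"):
    """Decode base-64 HTML and return the body text outside <script> elements."""
    html = base64.b64decode(encoded_html).decode(encoding)
    extractor = BodyTextExtractor()
    extractor.feed(html)
    extractor.close()
    # Join with no separator, like separator="" in the stylesheet version
    return "".join(extractor.chunks)


# Hypothetical sample input; the second <p> is left unclosed to mimic messy HTML
sample = base64.b64encode(
    b"<html><head><title>t</title></head>"
    b"<body><p>Hello <b>world</b><script>var x = 1;</script><p>Bye</body></html>"
).decode("ascii")
print(extract_body_text(sample))
```

Unlike TagSoup, html.parser doesn't repair the markup into a tree; it just reports tags and text as it meets them, which is enough for pulling out text but not for anything that needs real structure.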
