Friday, August 21, 2009

Converting Oracle TIMESTAMP WITH TIME ZONE to Current TZ

I see answers to this question all over the Web, usually involving some arithmetic with abbreviations such as "EDT", the use of hacky Oracle extension functions such as NEW_TIME, etc. Forget about that. Here's how you do it: use CAST(). If this seems pedantic, it's not. I recently had to deal with an Oracle database in which times were stored as epoch times, i.e. milliseconds since January 1, 1970 UTC (e.g. new java.util.Date().getTime()). So you can get the beginning of the epoch in Oracle SQL like so:

SELECT TIMESTAMP '1970-01-01 00:00:00 +00:00' FROM DUAL;

That'll get you a timestamp in terms of Greenwich Mean Time (GMT or UTC):

SELECT TO_CHAR(TIMESTAMP '1970-01-01 00:00:00 +00:00', 'TZR') FROM DUAL;

If I've got "epoch time" in ms, I can easily convert it to a timestamp relative to UTC:

SELECT TIMESTAMP '1970-01-01 00:00:00 +00:00' + NUMTODSINTERVAL(epoch_time / 1000, 'SECOND') FROM my_table;

How do I then display, and especially export these values relative to my current time zone? Casting to TIMESTAMP WITH LOCAL TIME ZONE doesn't quite work; by default, the time zone isn't displayed for this column, and the 'TZR' format doesn't work on it (I get an Oracle error, which confused me. So CAST() them to TIMESTAMP WITH LOCAL TIME ZONE, then CAST() to TIMESTAMP WITH TIME ZONE. Oracle does all the work, and I don't have to hard-code any information in the query about my time zone, or rely on EXTRACT(). So:

SELECT CAST(CAST(TIMESTAMP '1970-01-01 00:00:00 +00:00' AS TIMESTAMP WITH LOCAL TIME ZONE) AS TIMESTAMP WITH TIME ZONE) FROM DUAL;





Wednesday, July 22, 2009

An XSLT to Extract Text from Word

Well, this turns out to be even easier. If you've got the text:


<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
version="1.0">

<xsl:output method="text" encoding="UTF-8" />

<xsl:template match="*">
<!-- Simply recurse over the children. -->
<xsl:apply-templates />
</xsl:template>

<!-- Any piece of text from the .docx is enclosed in a w:t element. -->
<xsl:template match="w:t">
<xsl:value-of select="."/>
<!-- Look for the xml:space attribute. A namespace is not necessary. -->
<!-- XPath 1.0 doesn't have the starts-with() and ends-with() functions, unfortunately. -->
<xsl:if test="@space = 'preserve'">
<xsl:text> </xsl:text>
</xsl:if>
</xsl:template>

<xsl:template match="w:p">
<xsl:apply-templates />
<xsl:text>&#xa;</xsl:text>
</xsl:template>

</xsl:stylesheet>

To get the text, here's some Groovy:



def createDocumentXml(f) {
def zip = new ZipFile(f)
def entry = zip.getEntry('word/document.xml')
assert entry
def xml = new File(f.absolutePath - '.docx' + '.xml')
// I want to copy the XML to a file, for purposes of visual inspection.
// In production code I'd simply return the stream.
zip.getInputStream(entry).withStream { i ->
xml.withOutputStream { o ->
o << i
}
}
return xml
}

To apply the XSLT to the document, see the attachments.

Saturday, May 02, 2009

Extracting the Text of an HTML Document

This is something I often have to do within an XSLT: there's some base-64 encoded HTML in a text node, and I'd like to extract the body text. Saxon offers saxon:parse() and saxon:base64Binary-to-string() that might be useful here. If you use the TagSoup parser to turn possibly nasty HTML into XHTML, then you can extract the text from the "thing" element with

<xsl:variable name="html" select="saxon:parse(saxon:base64-to-string(xs:base64(thing)))"/>
<xsl:value-of select="$html/xhtml:body//xhtml:*[local-name() != 'script']/text()" separator=""/>

Saxon's parse() function will fail if you don't direct Saxon (from the command line, or programmatically) to use the TagSoup parser. OTOH, you could resort to Groovy. Put TagSoup on your CLASSPATH, and:

import org.ccil.cowan.tagsoup.*

parser = new Parser()
// TagSoup offers loads of interesting options...check 'em out!
f = new File('C:\\Documents and Settings\\tnassar\\Desktop\\Efficient.html')

As we're reminded here, we can probably do without the namespace declarations. I quote:
  • name or "*:name" matches an element named "name" irrespective of the namespace it's in (i.e. this is the default mode of operation)
  • ":name" matches an element named "name" only id the element is not in a namespace
  • "prefix:name" matches an element names "name" only if it is in the namespace identified by the prefix "prefix" (and the prefix to namespace mapping was defined by a previous call to declareNamespace)
Anyway, force of habit:
html = new XmlSlurper(parser).parse(f).declareNamespace(xhtml: 'http://www.w3.org/1999/xhtml')

html.'xhtml:body'.depthFirst().findAll { it.name() != 'script' }*.text().join('\n')

Extracting the Text from a Word Document w/ Groovy

As I'm doing a lot of data munging these days, and often have to talk to Java APIs. Since I'm not writing production code, don't have lots of RAM or lots of good tools at my disposal and therefore would just as soon use a text editor (SciTE's my current preference...no, I can't get the Win32 port of emacs!), Groovy is definitely my best choice. Notwithstanding my preference for XSLT (over GPath) to handle XML, I can't deny that you can do some slick stuff w/ GPath. At another munger's request, I cooked this up in 5 minutes, and was almost shocked at how easy it was to get the text out of an Office 2007 docx file:

import java.util.zip.*

docx = new File('C:\\Documents and Settings\\tnassar\\My Documents\\Efficient.docx')
zip = new ZipFile(docx)
entry = zip.getEntry('word/document.xml')
stream = zip.getInputStream(entry)

// The namespace was gleaned from the decompressed XML.
wordMl = new XmlSlurper().parse(stream).declareNamespace(w: 'http://schemas.openxmlformats.org/wordprocessingml/2006/main')

// The outermost XML element node is assigned to the variable wordMl, so
// GPath expressions will start after that. To print out the concatenated
// descendant text nodes of w:body, you use:

text = wordMl.'w:body'.children().collect { it.text() }.join('')

println text
It would be nice if Groovy offered "raw strings" like Python--r'C:\Documents and Settings\...'--or C#, which lets you prepend a '@' to have backslashes treated literally--esp. when it comes to Windows pathnames, but whatever.

This will not work well for complex document formats (I can imagine that tables and such would be a disaster), but for me it was just enough.