Saturday, October 01, 2011

XSLT to Produce Multiple Results from a Single Input

I was recently confronted with some XML structured in this way: a <person> at the top, followed by 0 or more entities to whom the person has a certain sort of relationship. These may include other persons, and the same persons may appear in several input documents. I’d like to split up all these documents, and apply additional XSLTs to them. Moreover, I can only decide on where the output XML is going to go based on information not available to the XSLT processor, i.e. I can’t simply calculate a URI in <xsl:result-document> and let the processor open the file for me. This is the first time I’ve used the Saxon processor’s setOutputURIResolver() method; in fact, I didn’t know it existed until I did some hunting in Eclipse.

Here’s what the input looks like, more or less:

<?xml version="1.0"?>
<person-with-relationships>
<person id="1">
  <name>Anthony Albert Nassar</name>
  <phone-number>800-555-1212</phone-number>
</person>
<relationships>
  <relationship>
   <employment>
    <start-date/>
   </employment>
   <organization id="2">
    <organization-name>Palantir Technologies, Inc.</organization-name>
    <url>palantir.com</url>
   </organization>
  </relationship>
  <relationship>
   <marriage>
    <start-date/>
   </marriage>
   <person id="3">
    <name>Donavan Arizmendi</name>
   </person>
  </relationship>
</relationships>
</person-with-relationships>


The XSLT looks like this:



<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xpath-default-namespace="">
<xsl:import href="identity.xslt"/>

<xsl:template match="relationship/*[2]">
  <xsl:message>Opening document for <xsl:value-of select="local-name()"/> with ID <xsl:value-of select="@id"/></xsl:message>
  <xsl:result-document href="{@id}.xml">
   <xsl:apply-imports/>
  </xsl:result-document>
</xsl:template>

  <xsl:template match="/person-with-relationships/person">
  <xsl:message>Opening document for person with ID <xsl:value-of select="@id"/></xsl:message>
  <xsl:result-document href="{@id}.xml">
   <!-- Invoke the identity template, i.e. just copy this subtree to the output. -->
   <!-- If you have some local template with lower priority that you'd like to
    invoke, use <xsl:next-match/>
   -->
   <xsl:apply-imports/>
  </xsl:result-document>
</xsl:template>


</xsl:stylesheet>


Let’s say that I want to turn each output into a DOM, or a dom4j Document, before I do anything else with it, i.e. I can’t just write the output to files. Moreover, I want to avoid overwriting files I’ve already created, and I may need to aggregate information from all the inputs in a way not suited to XSLT (I could do some of this in XQuery…but I digress). The Java for this purpose might look like the following. I’m using dom4j to set the output subtrees aside. I could have used DOM (quelle horreur), or, probably, something in Saxon, but I actually wanted to stay closer to JAXP. So:



package com.demo.xml;


import java.io.File;
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;


import javax.xml.transform.Result;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;


import net.sf.saxon.Controller;
import net.sf.saxon.FeatureKeys;
import net.sf.saxon.OutputURIResolver;
import net.sf.saxon.TransformerFactoryImpl;
import net.sf.saxon.event.SequenceWriter;
import net.sf.saxon.om.Item;
import net.sf.saxon.trans.XPathException;


import org.apache.commons.io.FileUtils;
import org.dom4j.Document;
import org.dom4j.io.DocumentResult;
import org.dom4j.io.DocumentSource;


import com.google.common.io.NullOutputStream;


public class InputSplitter {
       private final Templates splitterTemplates;


       // http://dhruba.name/2009/08/05/concurrent-set-implementations-in-java-6/
       private final ConcurrentMap<String,DocumentResult> urisProcessed = new ConcurrentHashMap<String,DocumentResult>();


       private final TransformerFactoryImpl factory;


       public InputSplitter() throws TransformerException {
              factory = new TransformerFactoryImpl();
              // I also have the requirement of removing elements with only
              // whitespace nodes among their descendants. This attribute
              // lets the parser throw away such whitespace nodes. The
              // XPath expression to discard elements with no content
              // then becomes trivial.
              factory.setAttribute(FeatureKeys.STRIP_WHITESPACE, "all");
              File splitterXsltFile = new File("resources/splitter.xsl");
              // Calling newTemplates(), rather than newTransformer(), gives me
              // one thread-safe object that I can use repeatedly. Each time I
              // want to transform an input, I have to create a new Transformer.
              this.splitterTemplates = factory.newTemplates(new StreamSource(splitterXsltFile));
       }


       public void splitFile(File xmlFile) throws TransformerException {
              final StreamSource xmlSource = new StreamSource(xmlFile);


              TransformerHandler handler = factory.newTransformerHandler(splitterTemplates);
              Transformer transformer = handler.getTransformer();
              Controller controller = (Controller) transformer;
              // You might not want an anonymous implementation of OutputURIResolver,
              // but that's irrelevant to the example. In any case, this is Saxon's
              // back door to <xsl:result-document>.
              controller.setOutputURIResolver(new OutputURIResolver() {
                     @Override
                     public void close(Result result) throws TransformerException {
                           // If you opened a Stream in resolve(), you'd want to close it
                           // here.                      


                     } 


                     @Override
                     public Result resolve(String href, String base) throws TransformerException {
                           DocumentResult result = new DocumentResult();
                           DocumentResult existingResult = urisProcessed.putIfAbsent(href, result);

                           if (existingResult == null) {

                                  return result;
                           } else {
                                  // Throw the results away. There might be a way to implement
                                  // a null SAXResult, but I'll leave that as an exercise for the
                                  // reader.
                                  return new StreamResult(new NullOutputStream());
                           }
                         
                     }});


              
              controller.setMessageEmitter(new SequenceWriter() {


                     @Override
                     public void write(Item item) throws XPathException {
                           System.out.println(item.getStringValue());


                    }});
              // Discard the output from the entire document.
              transformer.transform(xmlSource, new StreamResult(new NullOutputStream()));
       }


       public void transformFolder(File folder) throws TransformerException {
              for (File xmlFile : folder.listFiles()) {
                     splitFile(xmlFile);
              }
       }
      


       static public void main(String[] args) throws TransformerException, IOException {
              InputSplitter splitter = new InputSplitter();
             
              assert args.length == 2;
              File inputXml = new File(args[0]);
              splitter.splitFile(inputXml);
             
              final File outputDirectory = new File(args[1]);


              if (!outputDirectory.mkdirs())
                     FileUtils.cleanDirectory(outputDirectory);             


              for (String entry : splitter.urisProcessed.keySet()) {
                     File outputFile = new File(outputDirectory, entry);
                     // Use the identity transform to turn the dom4j tree into a file.
                     Transformer newTransformer = splitter.factory.newTransformer();
                     newTransformer.setOutputProperty("indent", "yes");
                     Document document = splitter.urisProcessed.get(entry).getDocument();
                     newTransformer.transform(new DocumentSource(document), new StreamResult(outputFile));
              }
        }
}
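
The deduplication in resolve() rests entirely on putIfAbsent(): the first caller to claim an href gets its own result object back, and every later caller learns that the href is already taken. Here is a minimal sketch of just that idea, stripped of Saxon and dom4j. The class name FirstWriterWins and the use of StringBuilder as a stand-in for DocumentResult are my own assumptions for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for the urisProcessed map in InputSplitter.
class FirstWriterWins {
    private final ConcurrentMap<String, StringBuilder> seen = new ConcurrentHashMap<>();

    /** Returns a fresh buffer if href is new, or null if it was already claimed. */
    StringBuilder claim(String href) {
        StringBuilder fresh = new StringBuilder();
        // putIfAbsent returns null only when no mapping existed before this call.
        StringBuilder existing = seen.putIfAbsent(href, fresh);
        return existing == null ? fresh : null;
    }

    int size() {
        return seen.size();
    }

    public static void main(String[] args) {
        FirstWriterWins registry = new FirstWriterWins();
        System.out.println(registry.claim("3.xml") != null); // first claim succeeds
        System.out.println(registry.claim("3.xml") != null); // duplicate is rejected
    }
}
```

Because the check and the insert happen in one atomic call, this stays correct even if several transformations run on different threads and hit the same href at once.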

Thursday, September 08, 2011

Tokenizing a String with Oracle SQL

This problem actually comes up pretty frequently for me. Audit log records at my place of employment are written to the DB. I often get requests to pull out and aggregate the object IDs in a set of rows. The IDs are space-separated within a VARCHAR2 column. The details aren't that interesting, though.

The first trick to know is the by-now conventional way of generating a sequence of integers in Oracle SQL:

SELECT ROWNUM i FROM DUAL CONNECT BY LEVEL <= 10;

The 10 above is a necessary but arbitrary cutoff. My sample data happens to have fewer than 10 tokens per row; if there were more, I'd boost the cutoff. Anyway, the next trick to know concerns Oracle's REGEXP_SUBSTR() function, namely that it takes an optional fourth argument saying which occurrence of the match to return. You can see where this is going, right? If I JOIN each row of the audit log to the sequence of integers, then I can use those integers as occurrence indexes.

Since Oracle's regex implementation doesn't include look-ahead operators, the token separator will be part of the match, and I'll have to remove it, hence TRIM(). If your data is comma-separated, your SQL will look a little different. But enough:

SELECT * FROM (
  SELECT s, i, TRIM(REGEXP_SUBSTR(s, '\d+( |$)', 1, i)) token FROM (
    SELECT '123 456' s FROM DUAL
    UNION
    SELECT '789 101112' s FROM DUAL
  ), (
    SELECT ROWNUM i FROM DUAL CONNECT BY LEVEL <= 10
  )
) WHERE token IS NOT NULL;
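
If the "ask for the i-th match, for every i up to a cutoff" idea seems odd, it may help to see it outside SQL. The following is only a sketch in Java of the same logic, not something I run against the audit log; the class and method names are made up for the illustration, and the regex is the one from the query above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the REGEXP_SUBSTR(s, pattern, 1, i) trick.
class NthMatchTokenizer {
    private static final Pattern TOKEN = Pattern.compile("\\d+( |$)");

    /** Returns the i-th (1-based) match with the separator trimmed, or null if there is none. */
    static String nthToken(String s, int i) {
        Matcher m = TOKEN.matcher(s);
        for (int n = 1; m.find(); n++) {
            if (n == i) {
                return m.group().trim();
            }
        }
        return null;
    }

    /** Plays the role of the join against the integer sequence 1..cutoff. */
    static List<String> tokens(String s, int cutoff) {
        List<String> out = new ArrayList<>();
        for (int i = 1; i <= cutoff; i++) {
            String token = nthToken(s, i);
            if (token != null) { // WHERE token IS NOT NULL
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("123 456", 10)); // prints [123, 456]
    }
}
```

As in the SQL, most values of i produce nothing and are filtered out, which is why the cutoff only has to be large enough, not exact.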

Wednesday, September 07, 2011

Refactoring Groovy to Generate XML

There are tons of examples out there about how to generate XML using Groovy’s builders. The usual pattern is to use StreamingMarkupBuilder: create a massive nested closure resembling almost exactly the XML you want as output, then pass that to StreamingMarkupBuilder#bind(). This does create problems, though. The first is that even when the closure represents the structure of a single object with a data source and a few properties, it's already pretty big. The second is that it quickly gets cluttered with programmatic logic: checks for invalid property values, calls to some external function to translate or normalize some input, base-64-encoding of raw text or the content of external binary files, etc. I finally found out how to avoid these problems, after hours of trial and error. I might have saved myself that time if I’d just looked at the code for StreamingMarkupBuilder, but that’s life. In the following:

import groovy.xml.*

// Note that this function has no dependency on the instance of StreamingMarkupBuilder, below.
def createPersonMarkup(builder, name, occupation, age) {
    // Putting this check inside a function means I can just return, without
    // generating *anything*, yet not add a nesting level to my code.
    // Here the check is on age; substitute whichever argument makes a
    // person not worth emitting.
    if (!age)
        return
    assert name // This assertion will fire only when the closure is bound!
    // Note the use of the "Elvis operator" to avoid a null attribute value.
    builder.person(occupation: (occupation ?: 'Unemployed')) {
        builder.name(name)
        if (occupation)
            builder.occupation(occupation)
    }
}

builder = new StreamingMarkupBuilder()

xml = builder.bind {
    // This is the strange part: the builder actually gets passed into each closure,
    // but you have to declare a closure argument to get at it. You can't rely on the
    // variable declaration for "builder," above, because that binding is no longer available
    // when the Builder actually constructs the XML, and you'll get some hellacious error
    // meaning, basically, "unbound variable name 'builder'".
    persons { builder ->
        createPersonMarkup builder, 'Anthony Albert Nassar', null, 49
        createPersonMarkup builder, 'Donavan Arizmendi', 'Teacher', 40
    }
}

XmlUtil.serialize(xml, System.out)

If you don't name the single closure argument, it must already be available as "it," and so it is in this case. This code works:

xml = builder.bind {
    persons {
        createPersonMarkup it, 'Anthony Albert Nassar', null, 49
    }
}

XmlUtil.serialize(xml, System.out)

So that's how the StreamingMarkupBuilder works: it interprets the strings that you intend as element names, as method invocations, and tries to invoke them on itself. The builder itself is always the first argument to any of these methods, and it passes itself into whatever methods (i.e. nested elements) are invoked in turn. When you call bind(), it intercepts all these method calls to generate XML.

Sunday, March 27, 2011

Exporting Word Documents to HTML

I mean real HTML. Here's what I get if I save this blog entry (which I'm editing in Word 2007) as HTML:

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" 
        xmlns:m=http://schemas.microsoft.com/office/2004/12/omml
        xmlns="http://www.w3.org/TR/REC-html40">
    <head>
        <meta http-equiv=Content-Type content="text/html; charset=windows-1252">
        <meta name=ProgId content=Word.Document>


 

If I save as "filtered HTML," I get something similar:

<html>
    <head>
        <meta http-equiv=Content-Type content="text/html; charset=windows-1252">
        <meta name=Generator content="Microsoft Word 12 (filtered)">


 

In the first case, I get a namespace declaration that I don't want. What I want is legitimate XHTML (http://www.w3.org/1999/xhtml). C'mon, people, it's 2011! I don't care who's using browsers or authoring tools that can't handle it; they should upgrade. What I really, really, really want, however, is not to export with the Windows-1252 character encoding. However, there seems to be no way to prevent that if it's the default encoding on your Windows machine. If you export to plain text, you can specify an encoding, but not if you export to HTML (as far as I can tell). Weird. What I also really, really, really want is attribute values enclosed in quotation marks, as has been the standard for, oh, about 12 years. If I attempted (I did attempt it, actually) to further process this HTML as XML, any proper XML parser would choke on it.

There are parsers, however, that will read this garbage and turn it into legitimate XHTML. I prefer TagSoup for this purpose. John Cowan, the author, recommends that you not use the JAXP interface, but that's exactly what I want. I'll demonstrate below.

Let me backtrack for a minute. A good part of my job at Palantir requires me to deal with unfriendly input and output formats. A lot of these formats are XML, and usually not very nice XML (if there's a schema, it's almost certain to be misleading). So I've gotten used to dealing with character encodings, and I've gravitated toward Groovy for a lot of my work, as I can often whip something up in minutes without needing an IDE. A colleague recently asked me, quite reasonably, if it would be "easy" to convert Word documents automatically to wiki markup. "Sure!" I said. It was not easy, as it turned out.

The first step was to convert every Word document on hand to HTML. I use Groovy+Scriptom to do this:

import org.codehaus.groovy.scriptom.ActiveXObject
import org.codehaus.groovy.scriptom.Scriptom
import org.codehaus.groovy.scriptom.tlb.office.word.WdSaveFormat
import org.codehaus.groovy.scriptom.tlb.office.MsoEncoding

userHome = new File(System.properties.'user.home')
myDocuments = new File(userHome, 'My Documents')
output = new File(userHome, "Desktop/Tony's Output")

def word = new ActiveXObject('Word.Application')
Scriptom.inApartment {
    try {
        word.Visible = false
        word.DisplayAlerts = false
        def documents = word.Documents
        myDocuments.eachFileMatch ~/.*\.docx/, { doc ->
            println "Opening $doc"
            documents.Open doc.absolutePath
            def activeDocument = word.ActiveDocument
            assert activeDocument
            try {
                activeDocument.AcceptAllRevisions()
                def html = new File(output, doc.name - ~/\.docx$/ + '.html')
                // 7 is the magic number for Unicode text. See MSFT's docs for WdSaveFormatEnumeration.
                // http://msdn.microsoft.com/en-us/library/bb238158%28office.12%29.aspx
                // 17 is PDF; 16 is "default," thus Office 2007.
                // wdFormatHTML = 8
                def n = Scriptom.MISSING
                activeDocument.SaveAs html.absolutePath, WdSaveFormat.wdFormatFilteredHTML, false, n, n, n, n, n, n, n, n, MsoEncoding.msoEncodingUTF8
            } finally {
                activeDocument.Close()
            }
        } // each
    } finally {
        // This is the Office automation API call, which Scriptom resolves for you.
        word.Quit()
        // Apparently it now works: winword.exe disappears from my Task Manager, which is what I want.
    }
} // Close apartment.


 

I'm not going to explain Scriptom here. Suffice it to say that the code above does roughly what VBA would do. Scriptom can use the constants defined at http://msdn.microsoft.com/en-US/library/microsoft.office.interop.word.wdsaveformat.aspx because someone was thoughtful enough to copy them to http://groovy.codehaus.org/modules/scriptom/1.6.0/scriptom-office-2K3-tlb/apidocs/org/codehaus/groovy/scriptom/tlb/office/word/WdSaveFormat.html. Unfortunately, my attempt to specify the encoding for the output file failed, and I could have left off all the arguments to SaveAs() after the first two.

Now, unfortunately, I've got a folder of bad HTML. How do I turn that into XHTML (actually, I wanted to run that through a further XSLT, to produce wiki markup)? Like this:

import org.ccil.cowan.tagsoup.Parser
import org.xml.sax.*
import javax.xml.transform.*
import javax.xml.transform.sax.SAXSource
import javax.xml.transform.stream.StreamResult
import javax.xml.transform.stream.StreamSource

output.eachFileMatch ~/.*\.html/, { html ->
def transformer = TransformerFactory.newInstance().newTransformer()
new File(html.parentFile, html.name - ~/html$/ + 'xhtml').withWriter 'UTF-8', { writer ->
html.withReader 'Windows-1252', { reader ->
transformer.transform(new SAXSource(new Parser(), new InputSource(reader)), new StreamResult(writer));
}
}
}


 

Calling TransformerFactory#newTransformer() with no arguments simply returns the "identity transform," which is what I want: I don't want to change the structure of the XML at all.
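
For anyone who hasn't run into the identity transform before, here is a minimal sketch in plain Java using only JAXP from the JDK. It copies a well-formed XML string to a new string without any stylesheet. The class name IdentityCopy is made up, and I've left TagSoup out so the example has no external dependencies; in the Groovy above, the only difference is that the Source is a SAXSource wrapping the TagSoup parser.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: copies XML from a string to a string with no stylesheet.
class IdentityCopy {
    static String copy(String xml) {
        try {
            // newTransformer() with no Source argument yields the identity transform.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not copy the input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(copy("<html><body><p>Hello</p></body></html>"));
    }
}
```

The point of going through a Transformer at all is that the Source and the Result can be of different kinds: SAX events in, a character stream out, or a dom4j tree in and a file out.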

Saturday, March 26, 2011

Interval Coalesce with XSLT

I was intrigued by Vadim Tropashko's SQL Design Patterns. I'm not a SQL guy, so like lots of developers, I think of databases as places to put data, from which to retrieve it, and which occasionally need tuning. I don't think of the execution engine as something that can, in effect, create additional information for me. I've written maybe two pivot queries in my life, for example. Someone who's adept with Mathematica, or Excel for that matter, can do things in a minute that would take a Java developer days, since a SQL rowset is a very impoverished data structure in Java.

Anyway, I saw an immediate use for Tropashko's "interval coalesce" algorithm. The problem is, given a collection of intervals, to coalesce those that overlap, and thus produce a small collection of non-overlapping intervals. Well, "immediate" is misleading; I waited a year to do this. Anyway, I fired up Oracle XE and made up some data. Then I entered the query on p. 37…only to find out that there's a typo in it. Several hours later, I figured out where it was:

SELECT fst.x, lst.y -- Find two endpoints.
FROM intervals fst, intervals lst WHERE fst.x < lst.y
AND NOT EXISTS ( -- There's no interval beginning between these endpoints...
  SELECT * FROM intervals i
  WHERE i.x > fst.x AND i.x < lst.y
  AND NOT EXISTS ( -- ...for which there's no covering interval.
    SELECT * FROM intervals cov
    WHERE i.x > cov.x AND i.x <= cov.y
  )
) AND NOT EXISTS (
  SELECT * FROM intervals cov
  WHERE cov.x < fst.x AND fst.x <= cov.y
  OR cov.x <= lst.y AND lst.y < cov.y
)


 

You might notice a discrepancy from the query in the book...and I've reported the erratum!

Tropashko goes on to show a more elegant and efficient way to produce this result, but at this point I wanted to get back to my own practical problem, which was to coalesce intervals defined in an XML document. Since the query above involves three self-joins, I could expect n^4 performance if I simply turned Tropashko's SQL into XPath. However, an XML document is inherently ordered (there's no ANSI SQL equivalent to following-sibling::*[1]). If I sorted the intervals according to their left end, i.e. wrote an XSLT to preprocess the input, then I could simply iterate through the intervals one time, gradually coalescing them where possible. The code falls into a pattern familiar to functional programmers. In other words, I apply a template to the first interval in the document, and that template calls a second template that uses "accumulators" for the current coalesced interval. This pattern is also used in Jeni Tennison's (http://www.jenitennison.com/xslt/index.html) XSLT on the Edge as a way of grouping adjacent elements. I wanted to avoid XPath stunts like Tropashko's SQL, because I will have to apply this transform to documents with tens of thousands of data points. I very deliberately select only following-sibling::*[1].

Since blogger's editor widget hacks up my XSLT in horrifying ways, and I can't attach text files, I had to format this XSLT with non-breaking spaces and whatnot. Let me know if you want the real thing. Let's say the input looks like this (it must be sorted, perhaps by another XSLT, by starting time only):

<?xml version='1.0' ?>
<intervals>
  <interval x="10" y="14.1155607415243584888223696987094362306"/>
  <interval x="10" y="27.2574271976039591982711142331954001295"/>
  <interval x="30" y="33.7147910106524433624672477551503835313"/>
  <interval x="40" y="46.844920420920822280815378491005656766"/>
  <interval x="50" y="61.30421963829719394371538034317434986025"/>
</intervals>


Then the XSLT to coalesce these intervals will look about like this:

<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:output indent="yes"/>

  <!-- I could have named this template "start-recursion", but it's
       convenient to be able to set the context node. -->
  <xsl:template match="interval">
    <xsl:call-template name="coalesce-intervals">
      <xsl:with-param name="next-interval" select="."/>
      <xsl:with-param name="from" select="number(@x)" tunnel="yes"/>
      <xsl:with-param name="to" select="number(@y)" tunnel="yes"/>
    </xsl:call-template>
  </xsl:template>

  <xsl:template name="coalesce-intervals">
    <xsl:param name="next-interval" as="element(interval)?"/>
    <xsl:param name="from" as="xs:double" tunnel="yes"/>
    <xsl:param name="to" as="xs:double" tunnel="yes"/>
    <xsl:choose>
      <!-- Stop the recursion. -->
      <xsl:when test="not($next-interval)">
        <interval x="{$from}" y="{$to}"/>
      </xsl:when>
      <!-- The current coalesced interval overlaps this one, so move on. -->
      <xsl:when test="$to gt number($next-interval/@x)">
        <xsl:call-template name="coalesce-intervals">
          <xsl:with-param name="next-interval" select="$next-interval/following-sibling::interval[1]"/>
          <!-- Extend the current interval if possible (hence max). -->
          <xsl:with-param name="to" select="max((number($next-interval/@y), $to))" tunnel="yes"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <!-- No more to coalesce. Output the "accumulator" and start again. -->
        <interval x="{$from}" y="{$to}"/>
        <xsl:apply-templates select="$next-interval"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <xsl:template match="/intervals">
    <intervals>
      <xsl:apply-templates select="interval[1]"/>
    </intervals>
  </xsl:template>
</xsl:stylesheet>
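
For comparison, here is roughly the same single pass with an accumulator, written as a loop in Java. This is only a sketch under my own assumptions: the class name IntervalCoalescer and the representation of an interval as a two-element double array {x, y} are invented for the example. Like the XSLT, it treats intervals as overlapping only when the accumulated right end is strictly greater than the next left end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical Java counterpart of the coalesce-intervals template.
class IntervalCoalescer {
    /** Each interval is a two-element array {x, y}; the input need not be sorted. */
    static List<double[]> coalesce(List<double[]> intervals) {
        List<double[]> sorted = new ArrayList<>(intervals);
        // Sort by left end, which the XSLT expects to have been done already.
        sorted.sort((a, b) -> Double.compare(a[0], b[0]));

        List<double[]> result = new ArrayList<>();
        double[] acc = null; // the current coalesced interval, i.e. "from" and "to"
        for (double[] next : sorted) {
            if (acc != null && acc[1] > next[0]) {
                // Overlap: extend the accumulator if possible (hence max).
                acc[1] = Math.max(acc[1], next[1]);
            } else {
                // No more to coalesce: emit the accumulator and start again.
                if (acc != null) {
                    result.add(acc);
                }
                acc = new double[] { next[0], next[1] };
            }
        }
        if (acc != null) {
            result.add(acc); // the "stop the recursion" case
        }
        return result;
    }

    public static void main(String[] args) {
        List<double[]> input = Arrays.asList(
                new double[] { 10, 14.1 }, new double[] { 10, 27.3 },
                new double[] { 30, 33.7 }, new double[] { 40, 46.8 });
        for (double[] interval : coalesce(input)) {
            System.out.println(interval[0] + " .. " + interval[1]);
        }
    }
}
```

After the sort, the loop touches each interval once, which is the whole reason for preferring this shape over a direct translation of the self-joins.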