Thursday, September 08, 2011

Tokenizing a String with Oracle SQL

This problem actually comes up pretty frequently for me. Audit log records at my place of employment are written to the DB. I often get requests to pull out and aggregate the objects IDs in a set of rows. The IDs are space-separated within a VARCHAR2 column. The details aren't that interesting, though.

The first trick to know is the by-now conventional way of generating a sequence of integers in Oracle SQL:


The 10 above is a necessary but arbitrary cutoff. My sample data happens to have < 10 tokens per row; if there were more, I'd boost the cutoff. Anyway, the next trick to know concerns Oracle's REGEXP_SUBSTR() function, namely that it has an optional argument for the match. You can see where this is going, right? If I JOIN each row of the audit log to the sequence of integers, then I can use the latter integers as match indexes.

Since Oracle's regex implementation doesn't include look-ahead operators, the token separator will be part of the match, and I'll have to remove it, hence TRIM(). If your data is comma-separated, your SQL will look a little different. But enough:

SELECT s, i, TRIM(REGEXP_SUBSTR(s, '\d+( |$)', 1, i)) token FROM (
SELECT '789 101112' FROM DUAL
), (

Wednesday, September 07, 2011

Refactoring Groovy to Generate XML

There are tons of examples out there about how to generate XML using Groovy’s builders. The usual pattern is use StreamingMarkupBuilder, then create a massive nested closure resembling almost exactly the XML you want as output, then passing that to StreamingMarkupBuilder#bind(). This does create problems, though. The first is that even when the closure represents the structure of a single object with a data source and a few properties, it's already pretty big. The second is that it quickly gets cluttered with programmatic logic: checks for invalid property values, calls to some external function to translate or normalize some input, base-64-encoding of raw text or the content of external binary files, etc. I finally found out how to avoid these problems, after hours of trial and error. I might have saved myself that time if I’d just looked at the code for StreamingMarkupBuilder, but that’s life. In the following:

import groovy.xml.*

// Note that this function has no dependency on the instance of StreamingMarkupBuilder, below.
def createPersonMarkup(builder, name, occupation, age) {
// Putting this check inside a function means I can just return, without
// generating *anything*, yet not add a nesting level to my code.
if (!value)
assert name // This assertion will fire only when the closure is bound!
// Note the use of the "Elvis operator" to avoid a null attribute value.
builder.person(occupation: (occupation ?: ‘Unemployed’)) {
if (occupation)

builder = new StreamingMarkupBuilder()

xml = builder.bind {
// This is the strange part: the builder actually gets passed into each closure,
// but you have to declare a closure argument to get at it. You can't rely on the
// variable declaration for "builder," above, because that binding is no longer available
// when the Builder actually constructs the XML, and you'll get some hellacious error
// meaning, basically, "unbound variable name 'builder'".
persons { builder ->
createPersonMarkup builder, 'Anthony Albert Nassar', null, 49
createPersonMarkup builder, ‘Donavan Arizmendi’, ‘Teacher’, 40

XmlUtil.serialize(xml, System.out)

If you don't name the single closure argument, it must already be available as "it," and so it is in this case. This code works:

xml = builder.bind {
palantir {
createPropertyAsRawValue it, '', 'Anthony Nassar', null

XmlUtil.serialize(xml, System.out)

So that's how the StreamingMarkupBuilder works: it interprets the strings that you intend as element names, as method invocations, and tries to invoke them on itself. The builder itself is always the first argument to any of these methods, and it passes itself into whatever methods (i.e. nested elements) are invoked in turn. When you call bind(), it intercepts all these method calls to generate XML.