Wednesday, June 05, 2013

Running JUnit Tests in a Particular Time Zone

In the olden days, you might have done this with @BeforeClass and @AfterClass annotations. A neater way of ensuring that each unit test is run in the same time zone is to use the newer JUnit rules.
class TimeZoneTestRule implements TestRule {

  private final DateTimeZone zone;

  TimeZoneTestRule(DateTimeZone zone) {
    this.zone = zone;
  }

  @Override
  public Statement apply(final Statement base, Description description) {
    return new Statement() {

      @Override
      public void evaluate() throws Throwable {
        DateTimeZone defaultTimeZone = DateTimeZone.getDefault();
        try {
          DateTimeZone.setDefault(zone);
          base.evaluate();
        } finally {
          DateTimeZone.setDefault(defaultTimeZone);
        }
      }
    };
  }
}
Now I want to ensure that my unit tests, which confirm that I'm parsing user input correctly, can compare the result to some constants. I annotate my test with:
  @Rule public TestRule timeZoneRule = new TimeZoneTestRule(DateTimeZone.forOffsetHours(2));
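
A complete test might look like the following. This is just a sketch (the test class, the expected values, and the parsing call are made up for illustration), but it shows the rule doing its job:

import static org.junit.Assert.assertEquals;

import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.TestRule;

public class UserInputParsingTest {

  // Every test in this class now runs with a default zone of UTC+02:00.
  @Rule public TestRule timeZoneRule = new TimeZoneTestRule(DateTimeZone.forOffsetHours(2));

  @Test
  public void parsesUserInputInFixedZone() {
    // Joda-Time applies the default zone when none is given in the input.
    DateTime parsed = new DateTime("2013-06-05T10:00:00");
    assertEquals(new DateTime(2013, 6, 5, 10, 0, 0, 0, DateTimeZone.forOffsetHours(2)), parsed);
  }
}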

Saturday, October 01, 2011

XSLT to Produce Multiple Results from a Single Input

I was recently confronted with some XML structured in this way: a <person> at the top, followed by 0 or more entities to whom the person has a certain sort of relationship. These may include other persons, and the same persons may appear in several input documents. I’d like to split up all these documents, and apply additional XSLTs to them. Moreover, I can only decide on where the output XML is going to go based on information not available to the XSLT processor, i.e. I can’t simply calculate a URI in <xsl:result-document> and let the processor open the file for me. This is the first time I’ve used the Saxon processor’s setOutputURIResolver() method; in fact, I didn’t know it existed until I did some hunting in Eclipse.

Here’s what the input looks like, more or less:

<?xml version="1.0"?>
<person-with-relationships>
<person id="1">
  <name>Anthony Albert Nassar</name>
  <phone-number>800-555-1212</phone-number>
</person>
<relationships>
  <relationship>
   <employment>
    <start-date/>
   </employment>
   <organization id="2">
    <organization-name>Palantir Technologies, Inc.</organization-name>
    <url>palantir.com</url>
   </organization>
  </relationship>
  <relationship>
   <marriage>
    <start-date/>
   </marriage>
   <person id="3">
    <name>Donavan Arizmendi</name>
   </person>
  </relationship>
</relationships>
</person-with-relationships>


The XSLT looks like this:



<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xpath-default-namespace="">
<xsl:import href="identity.xslt"/>

<xsl:template match="relationship/*[2]">
  <xsl:message>Opening document for <xsl:value-of select="local-name()"/> with ID <xsl:value-of select="@id"/></xsl:message>
  <xsl:result-document href="{@id}.xml">
   <xsl:apply-imports/>
  </xsl:result-document>
</xsl:template>

  <xsl:template match="/person-with-relationships/person">
  <xsl:message>Opening document for person with ID <xsl:value-of select="@id"/></xsl:message>
  <xsl:result-document href="{@id}.xml">
   <!-- Invoke the identity template, i.e. just copy this subtree to the output. -->
   <!-- If you have some local template with lower priority that you'd like to
    invoke, use <xsl:next-match/>
   -->
   <xsl:apply-imports/>
  </xsl:result-document>
</xsl:template>


</xsl:stylesheet>


Let’s say that I want to turn each output into a DOM, or a dom4j Document, before I do anything else with it, i.e. I can’t just write the output to files. Moreover, I want to avoid overwriting files I’ve already created, and I may need to aggregate information from all the inputs in a way not suited to XSLT (I could do some of this in XQuery…but I digress). The Java for this purpose might look like the following. I’m using dom4j to set the output subtrees aside. I could have used DOM (quelle horreur), or, probably, something in Saxon, but I actually wanted to stay closer to JAXP. So:



package com.demo.xml;


import java.io.File;
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;


import javax.xml.transform.Result;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;


import net.sf.saxon.Controller;
import net.sf.saxon.FeatureKeys;
import net.sf.saxon.OutputURIResolver;
import net.sf.saxon.TransformerFactoryImpl;
import net.sf.saxon.event.SequenceWriter;
import net.sf.saxon.om.Item;
import net.sf.saxon.trans.XPathException;


import org.apache.commons.io.FileUtils;
import org.dom4j.Document;
import org.dom4j.io.DocumentResult;
import org.dom4j.io.DocumentSource;


import com.google.common.io.NullOutputStream;


public class InputSplitter {
       private final Templates splitterTemplates;


       // http://dhruba.name/2009/08/05/concurrent-set-implementations-in-java-6/
       private final ConcurrentMap<String,DocumentResult> urisProcessed = new ConcurrentHashMap<String,DocumentResult>();


       private final TransformerFactoryImpl factory;


       public InputSplitter() throws TransformerException {
              factory = new TransformerFactoryImpl();
              // I also have the requirement of removing elements with only
              // whitespace nodes among their descendants. This attribute
              // lets the parser throw away such whitespace nodes. The
              // XPath expression to discard elements with no content
              // then becomes trivial.
              factory.setAttribute(FeatureKeys.STRIP_WHITESPACE, "all");
               File splitterXsltFile = new File("resources/splitter.xsl");
               // Calling newTemplates(), rather than newTransformer(), gives me
               // one thread-safe object that I can use repeatedly. Each time I
               // want to transform an input, I have to create a new Transformer.
               this.splitterTemplates = factory.newTemplates(new StreamSource(splitterXsltFile));
       }


       public void splitFile(File xmlFile) throws TransformerException {
              final StreamSource xmlSource = new StreamSource(xmlFile);


              TransformerHandler handler = factory.newTransformerHandler(splitterTemplates);
              Transformer transformer = handler.getTransformer();
              Controller controller = (Controller) transformer;
              // You might not want an anonymous implementation of OutputURIResolver,
              // but that's irrelevant to the example. In any case, this is Saxon's
              // back door to <xsl:result-document>.
              controller.setOutputURIResolver(new OutputURIResolver() {
                     @Override
                     public void close(Result result) throws TransformerException {
                           // If you opened a Stream in resolve(), you'd want to close it
                           // here.                      


                     } 


                     @Override
                     public Result resolve(String href, String base) throws TransformerException {
                           DocumentResult result = new DocumentResult();
                           DocumentResult existingResult = urisProcessed.putIfAbsent(href, result);

                           if (existingResult == null) {

                                  return result;
                           } else {
                                  // Throw the results away. There might be a way to implement
                                  // a null SAXResult, but I'll leave that as an exercise for the
                                  // reader.
                                  return new StreamResult(new NullOutputStream());
                           }
                         
                     }});


              
              controller.setMessageEmitter(new SequenceWriter() {


                      @Override
                      public void write(Item item) throws XPathException {
                            System.out.println(item.getStringValue());
                      }});

               // Discard the output from the entire document.
               transformer.transform(xmlSource, new StreamResult(new NullOutputStream()));
       }


       public void transformFolder(File folder) throws TransformerException {
              for (File xmlFile : folder.listFiles()) {
                     splitFile(xmlFile);
              }
       }
      


       static public void main(String[] args) throws TransformerException, IOException {
              InputSplitter splitter = new InputSplitter();
             
              assert args.length == 2;
              File inputXml = new File(args[0]);
              splitter.splitFile(inputXml);
             
              final File outputDirectory = new File(args[1]);


              if (!outputDirectory.mkdirs())
                     FileUtils.cleanDirectory(outputDirectory);             


              for (String entry : splitter.urisProcessed.keySet()) {
                     File outputFile = new File(outputDirectory, entry);
                     // Use the identity transform to turn the dom4j tree into a file.
                     Transformer newTransformer = splitter.factory.newTransformer();
                     newTransformer.setOutputProperty("indent", "yes");
                     Document document = splitter.urisProcessed.get(entry).getDocument();
                     newTransformer.transform(new DocumentSource(document), new StreamResult(outputFile));
              }
        }
}

Thursday, September 08, 2011

Tokenizing a String with Oracle SQL

This problem actually comes up pretty frequently for me. Audit log records at my place of employment are written to the DB. I often get requests to pull out and aggregate the object IDs in a set of rows. The IDs are space-separated within a VARCHAR2 column. The details aren't that interesting, though.

The first trick to know is the by-now conventional way of generating a sequence of integers in Oracle SQL:

SELECT ROWNUM i FROM DUAL CONNECT BY LEVEL <= 10;

The 10 above is a necessary but arbitrary cutoff. My sample data happens to have < 10 tokens per row; if there were more, I'd boost the cutoff. Anyway, the next trick to know concerns Oracle's REGEXP_SUBSTR() function, namely that it takes an optional occurrence argument telling it which match to return. You can see where this is going, right? If I JOIN each row of the audit log to the sequence of integers, then I can use those integers as the occurrence argument.

Since Oracle's regex implementation doesn't include look-ahead operators, the token separator will be part of the match, and I'll have to remove it, hence TRIM(). If your data is comma-separated, your SQL will look a little different. But enough:

SELECT * FROM (
SELECT s, i, TRIM(REGEXP_SUBSTR(s, '\d+( |$)', 1, i)) token FROM (
SELECT '123 456' s FROM DUAL
UNION
SELECT '789 101112' s FROM DUAL
), (
SELECT ROWNUM i FROM DUAL CONNECT BY LEVEL <= 10
)
) WHERE token IS NOT NULL;

Wednesday, September 07, 2011

Refactoring Groovy to Generate XML

There are tons of examples out there about how to generate XML using Groovy’s builders. The usual pattern is to use StreamingMarkupBuilder, create a massive nested closure resembling almost exactly the XML you want as output, and then pass that to StreamingMarkupBuilder#bind(). This does create problems, though. The first is that even when the closure represents the structure of a single object with a data source and a few properties, it's already pretty big. The second is that it quickly gets cluttered with programmatic logic: checks for invalid property values, calls to some external function to translate or normalize some input, base-64-encoding of raw text or the content of external binary files, etc. I finally found out how to avoid these problems, after hours of trial and error. I might have saved myself that time if I’d just looked at the code for StreamingMarkupBuilder, but that’s life. In the following:

import groovy.xml.*

// Note that this function has no dependency on the instance of StreamingMarkupBuilder, below.
def createPersonMarkup(builder, name, occupation, age) {
    // Putting this check inside a function means I can just return, without
    // generating *anything*, yet not add a nesting level to my code.
    // (Guarding on age here is arbitrary; the point is the early return.)
    if (!age)
        return
    assert name // This assertion will fire only when the closure is bound!
    // Note the use of the "Elvis operator" to avoid a null attribute value.
    builder.person(occupation: (occupation ?: 'Unemployed')) {
        builder.name(name)
        if (occupation)
            builder.occupation(occupation)
    }
}

builder = new StreamingMarkupBuilder()

xml = builder.bind {
    // This is the strange part: the builder actually gets passed into each closure,
    // but you have to declare a closure argument to get at it. You can't rely on the
    // variable declaration for "builder," above, because that binding is no longer available
    // when the Builder actually constructs the XML, and you'll get some hellacious error
    // meaning, basically, "unbound variable name 'builder'".
    persons { builder ->
        createPersonMarkup builder, 'Anthony Albert Nassar', null, 49
        createPersonMarkup builder, 'Donavan Arizmendi', 'Teacher', 40
    }
}

XmlUtil.serialize(xml, System.out)

If you don't name the single closure argument, it's still available as "it," and so it is in this case. This code works:

xml = builder.bind {
    persons {
        createPersonMarkup it, 'Anthony Albert Nassar', null, 49
    }
}

XmlUtil.serialize(xml, System.out)

So that's how the StreamingMarkupBuilder works: it interprets the strings that you intend as element names, as method invocations, and tries to invoke them on itself. The builder itself is always the first argument to any of these methods, and it passes itself into whatever methods (i.e. nested elements) are invoked in turn. When you call bind(), it intercepts all these method calls to generate XML.

Sunday, March 27, 2011

Exporting Word Documents to HTML

I mean real HTML. Here's what I get if I save this blog entry (which I'm editing in Word 2007) as HTML:

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" 
        xmlns:m=http://schemas.microsoft.com/office/2004/12/omml
        xmlns="http://www.w3.org/TR/REC-html40">
    <head>
        <meta http-equiv=Content-Type content="text/html; charset=windows-1252">
        <meta name=ProgId content=Word.Document>


 

If I save as "filtered HTML," I get something similar:

<html>
    <head>
        <meta http-equiv=Content-Type content="text/html; charset=windows-1252">
        <meta name=Generator content="Microsoft Word 12 (filtered)">


 

In the first case, I get a namespace declaration that I don't want. What I want is legitimate XHTML (http://www.w3.org/1999/xhtml). C'mon, people, it's 2011! I don't care who's using browsers or authoring tools that can't handle it; they should upgrade. What I really, really, really want, however, is not to export with the Windows-1252 character encoding. However, there seems to be no way to prevent that if it's the default encoding on your Windows machine. If you export to plain text, you can specify an encoding, but not if you export to HTML (as far as I can tell). Weird. What I also really, really, really want is attribute values enclosed in quotation marks, as has been the standard for, oh, about 12 years. If I attempted (I did attempt it, actually) to further process this HTML as XML, any proper XML parser would choke on it.

There are parsers, however, that will read this garbage and turn it into legitimate XHTML. I prefer TagSoup for this purpose. John Cowan, the author, recommends that you not use the JAXP interface, but that's exactly what I want. I'll demonstrate below.

Let me backtrack for a minute. A good part of my job at Palantir requires me to deal with unfriendly input and output formats. A lot of these formats are XML, and usually not very nice XML (if there's a schema, it's almost certain to be misleading). So I've gotten used to dealing with character encodings, and I've gravitated toward Groovy for a lot of my work, as I can often whip something up in minutes without needing an IDE. A colleague recently asked me, quite reasonably, if it would be "easy" to convert Word documents automatically to wiki markup. "Sure!" I said. It was not easy, as it turned out.

The first step was to convert every Word document on hand to HTML. I use Groovy+Scriptom to do this:

import org.codehaus.groovy.scriptom.ActiveXObject
import org.codehaus.groovy.scriptom.Scriptom
import org.codehaus.groovy.scriptom.tlb.office.word.WdSaveFormat
import org.codehaus.groovy.scriptom.tlb.office.MsoEncoding

userHome = new File(System.properties.'user.home')
myDocuments = new File(userHome, 'My Documents')
output = new File(userHome, "Desktop/Tony's Output")

def word = new ActiveXObject('Word.Application')
Scriptom.inApartment {
try {
word.Visible = false // ('Visible', new Variant(false))
word.DisplayAlerts = false // ', new Variant(false))
def documents = word.Documents // ').toDispatch()
myDocuments.eachFileMatch ~/.*\.docx/, { doc ->
println "Opening $doc"
documents.Open doc.absolutePath
def activeDocument = word.ActiveDocument
assert activeDocument
try {
activeDocument.AcceptAllRevisions()
def html = new File(output, doc.name - ~/\.docx$/ + '.html')
// 7 is the magic number for Unicode text. See MSFT's docs for WdSaveFormatEnumeration.
// http://msdn.microsoft.com/en-us/library/bb238158%28office.12%29.aspx
// 17 is PDF; 16 is "default," thus Office 2007.
// wdFormatHTML = 8
def n = Scriptom.MISSING
activeDocument.SaveAs html.absolutePath, WdSaveFormat.wdFormatFilteredHTML, false, n, n, n, n, n, n, n, n, MsoEncoding.msoEncodingUTF8
} finally {
activeDocument.Close()
}
} // each
} finally {
// This is the Office automation API call, which Scriptom resolves for you.
word.Quit()
// Apparently it now works: winword.exe disappears from my Task Manager, which is what I want.
}
} // Close apartment.


 

I'm not going to explain Scriptom here. Suffice it to say that the code above does roughly what VBA would do. Scriptom can use the constants defined at http://msdn.microsoft.com/en-US/library/microsoft.office.interop.word.wdsaveformat.aspx because someone was thoughtful enough to copy them to http://groovy.codehaus.org/modules/scriptom/1.6.0/scriptom-office-2K3-tlb/apidocs/org/codehaus/groovy/scriptom/tlb/office/word/WdSaveFormat.html. Unfortunately, my attempt to specify the encoding for the output file failed, and I could have left off all the arguments to SaveAs() after the first two.

Now, unfortunately, I've got a folder of bad HTML. How do I turn that into XHTML (actually, I wanted to run that through a further XSLT, to produce wiki markup)? Like this:

import org.ccil.cowan.tagsoup.Parser
import org.xml.sax.*
import javax.xml.transform.*
import javax.xml.transform.sax.SAXSource
import javax.xml.transform.stream.StreamResult
import javax.xml.transform.stream.StreamSource

output.eachFileMatch ~/.*\.html/, { html ->
def transformer = TransformerFactory.newInstance().newTransformer()
new File(html.parentFile, html.name - ~/html$/ + 'xhtml').withWriter 'UTF-8', { writer ->
html.withReader 'Windows-1252', { reader ->
transformer.transform(new SAXSource(new Parser(), new InputSource(reader)), new StreamResult(writer));
}
}
}


 

Calling newTransformer() on the factory with no stylesheet simply returns the "identity transform," which is what I want: I don't want to change the structure of the XML at all.

Saturday, March 26, 2011

Interval Coalesce with XSLT

I was intrigued by Vadim Tropashko's SQL Design Patterns. I'm not a SQL guy, so like lots of developers, I think of databases as places to put data, from which to retrieve it, and which occasionally need tuning. I don't think of the execution engine as something that virtually can create additional information for me. I've written maybe two pivot queries in my life, for example. Someone who's adept with Mathematica, or Excel for that matter, can do things in a minute that would take a Java developer days, since a SQL rowset is a very impoverished data structure in Java.

Anyway, I saw an immediate use for Tropashko's "interval coalesce" algorithm. The problem is, given a collection of intervals, to coalesce those that overlap, and thus produce a small collection of non-overlapping intervals. Well, "immediate" is misleading; I waited a year to do this. Anyway, I fired up Oracle XE and made up some data. Then I entered the query on p. 37…only to find out that there's a typo in it. Several hours later, I figured out where it was:

SELECT fst.x, lst.y -- Find two endpoints.
FROM intervals fst, intervals lst
WHERE fst.x < lst.y
AND NOT EXISTS ( -- There's no interval beginning between these endpoints...
  SELECT * FROM intervals i
  WHERE i.x > fst.x AND i.x < lst.y
  AND NOT EXISTS ( -- ...for which there's no covering interval.
    SELECT * FROM intervals cov
    WHERE i.x > cov.x AND i.x <= cov.y
  )
) AND NOT EXISTS (
  SELECT * FROM intervals cov
  WHERE cov.x < fst.x AND fst.x <= cov.y
  OR cov.x <= lst.y AND lst.y < cov.y
)


 

You might notice a discrepancy from the query in the book...and I've reported the erratum!

Tropashko goes on to show a more elegant and efficient way to produce this result, but at this point I wanted to get back to my own practical problem, which was to coalesce intervals defined in an XML document. Since the query above involves three self-joins, I could expect n^4 performance if I simply turned Tropashko's SQL into XPath. However, an XML document is inherently ordered (there's no ANSI SQL equivalent to following-sibling::*[1]). If I sorted the intervals according to their left end, i.e. wrote an XSLT to preprocess the input, then I could simply iterate through the intervals one time, gradually coalescing them where possible. The code falls into a pattern familiar to functional programmers. In other words, I apply a template to the first interval in the document, and that template calls a second template that uses "accumulators" for the current coalesced interval. This pattern is also used in Jeni Tennison's (http://www.jenitennison.com/xslt/index.html) XSLT on the Edge as a way of grouping adjacent elements. I wanted to avoid XPath stunts like Tropashko's SQL, because I will have to apply this transform to documents with 10s of 1000s of data points. I very deliberately select only following-sibling::*[1].
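
In an imperative language the same single pass is the familiar sweep over sorted intervals; here's a rough Java equivalent of what the stylesheet below does (the Interval class is made up for illustration):

import java.util.ArrayList;
import java.util.List;

final class Interval {
  final double x, y;
  Interval(double x, double y) { this.x = x; this.y = y; }
}

final class Coalescer {
  // Assumes the input is already sorted by the left endpoint x.
  static List<Interval> coalesce(List<Interval> sorted) {
    List<Interval> result = new ArrayList<Interval>();
    if (sorted.isEmpty()) {
      return result;
    }
    double from = sorted.get(0).x;
    double to = sorted.get(0).y;
    for (Interval next : sorted.subList(1, sorted.size())) {
      if (to > next.x) {
        // Overlap: extend the accumulated interval if necessary (hence max).
        to = Math.max(to, next.y);
      } else {
        // No overlap: emit the accumulator and start a new one.
        result.add(new Interval(from, to));
        from = next.x;
        to = next.y;
      }
    }
    result.add(new Interval(from, to));
    return result;
  }
}

The XSLT below carries the same from/to accumulator through tunnel parameters instead of mutating local variables.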

Since blogger's editor widget hacks up my XSLT in horrifying ways, and I can't attach text files, I had to format this XSLT with non-breaking spaces and whatnot. Let me know if you want the real thing. Let's say the input looks like this (it must be sorted, perhaps by another XSLT, by starting time only):

<?xml version='1.0' ?>
<intervals>
 <interval x="10" y="14.1155607415243584888223696987094362306"/>
 <interval x="10" y="27.2574271976039591982711142331954001295"/>
 <interval x="30" y="33.7147910106524433624672477551503835313"/>
 <interval x="40" y="46.844920420920822280815378491005656766"/>
 <interval x="50" y="61.30421963829719394371538034317434986025"/>
</intervals>

Then the XSLT to coalesce these intervals will look about like this:

<?xml version="1.0"?>
<xsl:stylesheet version="2.0"

xmlns:xsl=
"http://www.w3.org/1999/XSL/Transform"

xmlns:xs=
"http://www.w3.org/2001/XMLSchema">


<xsl:output indent="yes"/>


<!-- I could have named this template "start-recursion", but it's
convenient to be able to set the context node. --
>

<xsl:template match="interval">

<xsl:call-template name="coalesce-intervals">

<xsl:with-param name="next-interval" select="."/>

<xsl:with-param name="from" select="number(@x)" tunnel="yes"/>

<xsl:with-param name="to" select="number(@y)" tunnel="yes"/>

</xsl:call-template>

</xsl:template>


<xsl:template name="coalesce-intervals">

<xsl:param name="next-interval" as="element(interval)?"/>

<xsl:param name="from" as="xs:double" tunnel="yes"/>

<xsl:param name="to" as="xs:double" tunnel="yes"/>

<xsl:choose>

<!-- Stop the recursion. -->

<xsl:when test="not($next-interval)">

<interval x="{$from}" y="{$to}"/>

</xsl:when>

<!-- The current coalesced interval overlaps this one, so move on. -->

<xsl:when test="$to gt number($next-interval/@x)">

<xsl:call-template name="coalesce-intervals">

<xsl:with-param name="next-interval" select="$next-interval/following-sibling::interval[1]"/>

<!-- Extend the current interval if possible (hence max). -->

<xsl:with-param name="to" select="max((number($next-interval/@y), $to))" tunnel="yes"/>

</xsl:call-template>

</xsl:when>

<xsl:otherwise>

<!-- No more to coalesce. Output the "accumulator" and start again. -->

<interval x="{$from}" y="{$to}"/>

<xsl:apply-templates select="$next-interval"/>

</xsl:otherwise>

</xsl:choose>

</xsl:template>


<xsl:template match="/intervals">

<intervals>

<xsl:apply-templates select="interval[1]"/>

</intervals>

</xsl:template>
</xsl:stylesheet>


 

Friday, August 21, 2009

Converting Oracle TIMESTAMP WITH TIME ZONE to Current TZ

I see answers to this question all over the Web, usually involving some arithmetic with abbreviations such as "EDT", the use of hacky Oracle extension functions such as NEW_TIME, etc. Forget about that. Here's how you do it: use CAST(). If this seems pedantic, it's not. I recently had to deal with an Oracle database in which times were stored as epoch times, i.e. milliseconds since January 1, 1970 UTC (e.g. new java.util.Date().getTime()). So you can get the beginning of the epoch in Oracle SQL like so:

SELECT TIMESTAMP '1970-01-01 00:00:00 +00:00' FROM DUAL;

That'll get you a timestamp in terms of Greenwich Mean Time (GMT or UTC):

SELECT TO_CHAR(TIMESTAMP '1970-01-01 00:00:00 +00:00', 'TZR') FROM DUAL;

If I've got "epoch time" in ms, I can easily convert it to a timestamp relative to UTC:

SELECT TIMESTAMP '1970-01-01 00:00:00 +00:00' + NUMTODSINTERVAL(epoch_time / 1000, 'SECOND') FROM my_table;

How do I then display, and especially export, these values relative to my current time zone? Casting to TIMESTAMP WITH LOCAL TIME ZONE doesn't quite work; by default, the time zone isn't displayed for this column, and the 'TZR' format doesn't work on it (I get an Oracle error, which confused me). So CAST() them to TIMESTAMP WITH LOCAL TIME ZONE, then CAST() to TIMESTAMP WITH TIME ZONE. Oracle does all the work, and I don't have to hard-code any information in the query about my time zone, or rely on EXTRACT(). So:

SELECT CAST(CAST(TIMESTAMP '1970-01-01 00:00:00 +00:00' AS TIMESTAMP WITH LOCAL TIME ZONE) AS TIMESTAMP WITH TIME ZONE) FROM DUAL;
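
For comparison, here's the equivalent on the Java side, rendering an epoch-milliseconds value in the JVM's default time zone (a minimal sketch):

import java.text.SimpleDateFormat;
import java.util.Date;

public class EpochDemo {
  public static void main(String[] args) {
    long epochMillis = 0L; // the raw value stored in the column
    // A Date is just an instant (UTC internally); the formatter applies the
    // default time zone, which is what the double CAST() does on the Oracle side.
    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss Z");
    System.out.println(format.format(new Date(epochMillis)));
  }
}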





Wednesday, July 22, 2009

An XSLT to Extract Text from Word

Well, this turns out to be even easier than the GPath approach in the post below. If you've got the document.xml out of the .docx, here's the stylesheet:


<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
version="1.0">

<xsl:output method="text" encoding="UTF-8" />

<xsl:template match="*">
<!-- Simply recurse over the children. -->
<xsl:apply-templates />
</xsl:template>

<!-- Any piece of text from the .docx is enclosed in a w:t element. -->
<xsl:template match="w:t">
<xsl:value-of select="."/>
<!-- Look for the xml:space attribute. The xml prefix is predeclared, so no
     extra namespace declaration is necessary. -->
<!-- XPath 1.0 doesn't have the ends-with() function, unfortunately. -->
<xsl:if test="@xml:space = 'preserve'">
<xsl:text> </xsl:text>
</xsl:if>
</xsl:template>

<xsl:template match="w:p">
<xsl:apply-templates />
<xsl:text>&#xa;</xsl:text>
</xsl:template>

</xsl:stylesheet>

To get the text, here's some Groovy:



import java.util.zip.ZipFile

def createDocumentXml(f) {
    def zip = new ZipFile(f)
    def entry = zip.getEntry('word/document.xml')
    assert entry
    def xml = new File(f.absolutePath - '.docx' + '.xml')
    // I want to copy the XML to a file, for purposes of visual inspection.
    // In production code I'd simply return the stream.
    zip.getInputStream(entry).withStream { i ->
        xml.withOutputStream { o ->
            o << i
        }
    }
    return xml
}

To apply the XSLT to the document, see the attachments.
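
In lieu of the attachments, applying the stylesheet with plain JAXP might look like this (a sketch; the file names are assumptions):

import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class DocxToText {
  public static void main(String[] args) throws Exception {
    // extract-text.xsl is the stylesheet above; document.xml is what
    // createDocumentXml() pulled out of the .docx.
    Transformer transformer = TransformerFactory.newInstance()
        .newTransformer(new StreamSource(new File("extract-text.xsl")));
    transformer.transform(new StreamSource(new File("document.xml")),
        new StreamResult(new File("document.txt")));
  }
}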

Saturday, May 02, 2009

Extracting the Text of an HTML Document

This is something I often have to do within an XSLT: there's some base-64 encoded HTML in a text node, and I'd like to extract the body text. Saxon offers saxon:parse() and saxon:base64Binary-to-string() that might be useful here. If you use the TagSoup parser to turn possibly nasty HTML into XHTML, then you can extract the text from the "thing" element with

<xsl:variable name="html" select="saxon:parse(saxon:base64-to-string(xs:base64(thing)))"/>
<xsl:value-of select="$html/xhtml:body//xhtml:*[local-name() != 'script']/text()" separator=""/>

Saxon's parse() function will fail if you don't direct Saxon (from the command line, or programmatically) to use the TagSoup parser. OTOH, you could resort to Groovy. Put TagSoup on your CLASSPATH, and:

import org.ccil.cowan.tagsoup.*

parser = new Parser()
// TagSoup offers loads of interesting options...check 'em out!
f = new File('C:\\Documents and Settings\\tnassar\\Desktop\\Efficient.html')

As we're reminded here, we can probably do without the namespace declarations. I quote:
  • name or "*:name" matches an element named "name" irrespective of the namespace it's in (i.e. this is the default mode of operation)
  • ":name" matches an element named "name" only id the element is not in a namespace
  • "prefix:name" matches an element names "name" only if it is in the namespace identified by the prefix "prefix" (and the prefix to namespace mapping was defined by a previous call to declareNamespace)
Anyway, force of habit:
html = new XmlSlurper(parser).parse(f).declareNamespace(xhtml: 'http://www.w3.org/1999/xhtml')

html.'xhtml:body'.depthFirst().findAll { it.name() != 'script' }*.text().join('\n')

Extracting the Text from a Word Document w/ Groovy

I'm doing a lot of data munging these days, and often have to talk to Java APIs. Since I'm not writing production code, don't have lots of RAM or lots of good tools at my disposal, and therefore would just as soon use a text editor (SciTE's my current preference...no, I can't get the Win32 port of emacs!), Groovy is definitely my best choice. Notwithstanding my preference for XSLT (over GPath) to handle XML, I can't deny that you can do some slick stuff w/ GPath. At another munger's request, I cooked this up in 5 minutes, and was almost shocked at how easy it was to get the text out of an Office 2007 docx file:

import java.util.zip.*

docx = new File('C:\\Documents and Settings\\tnassar\\My Documents\\Efficient.docx')
zip = new ZipFile(docx)
entry = zip.getEntry('word/document.xml')
stream = zip.getInputStream(entry)

// The namespace was gleaned from the decompressed XML.
wordMl = new XmlSlurper().parse(stream).declareNamespace(w: 'http://schemas.openxmlformats.org/wordprocessingml/2006/main')

// The outermost XML element node is assigned to the variable wordMl, so
// GPath expressions will start after that. To print out the concatenated
// descendant text nodes of w:body, you use:

text = wordMl.'w:body'.children().collect { it.text() }.join('')

println text
It would be nice if Groovy offered "raw strings" like Python--r'C:\Documents and Settings\...'--or C#, which lets you prepend a '@' to have backslashes treated literally--esp. when it comes to Windows pathnames, but whatever.

This will not work well for complex document formats (I can imagine that tables and such would be a disaster), but for me it was just enough.

Saturday, October 25, 2008

Efficient “Disjoint Sets” Implementation in XSLT

I actually had to revisit this problem with slightly different input, and once again I struggled. If I represent an undirected graph like this:

<?xml version="1.0"?>
<graph>
    <nodes>
        <node id="1"/>
        <node id="2"/>
        <node id="3"/>
        <node id="4"/>
        <node id="5"/>
    </nodes>
    <links>
        <link end1="1" end2="2"/>
        <link end1="1" end2="3"/>
        <link end1="1" end2="4"/>
        <link end1="1" end2="5"/>
    </links>
</graph>

…then it's a little harder to go from one node to its connected nodes, because the information I need is elsewhere in the XML document (i.e. not in a child node of the <node> element) . To index any node element, I'd have to navigate up the tree, and back down to the links. What's worse, I'd have to scan all the links for every node. So the solution involves some indirection. First I declare these two mappings:

<xsl:key name="left" match="/graph/links/link" use="@end1"/>
<xsl:key name="right" match="/graph/links/link" use="@end2"/>


 

Now I map each node (its ID, actually; remember that the key has to be of an atomic type) to its connected nodes by going through these mappings (such a recursive definition is not possible in XSLT 1.0, I believe). Note that for links that were indexed by @end1, I'm interested in the value of @end2, and vice versa.


<xsl:key name="connected" match="node" use="key('left', @id)/@end2, key('right', @id)/@end1"/>


 

Finally, I can use this mapping:


 

<xsl:template match="/">
        <connectivity>
            <xsl:for-each select="graph/nodes/node">
                <node id="{@id}">
                    <connected-to>
                        <xsl:value-of select="key('connected', @id)/@id"/>
                    </connected-to>
                </node>
            </xsl:for-each>
        </connectivity>
    </xsl:template>


 

I believe that these are all the data structures I need. Now I can implement the "strongly connected components" algorithm as I did below, but using set operators on the nodes.

Friday, October 03, 2008

Graph Algorithms in XSLT

Notwithstanding that XSLT can be considered a functional programming language, you probably wouldn't choose to implement algorithms in it. Nonetheless, I recently had to split a very large XML document into smaller pieces (using xsl:result-document), and I wanted to avoid redundancies between the outputs. So what I essentially wanted to do was separate elements in the input into "strongly-connected components." XSLT 2.0 provides lots of operators on sequences and sets, and allows you to define your own functions, but what it does not lend itself to is the creation of new data structures. To be more precise, the data structures you'd need for depth-first search or Tarjan's disjoint sets would require you to copy elements into new structures. Easy, if you're working in a language that has pointer or reference equality; not possible in XSLT. However, if you assume that the XSLT processor will use hash tables to represent sets, and you put that together with <xsl:key>, you can try a different approach.
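
For contrast, here's roughly what Tarjan's disjoint-sets structure looks like in a language with reference equality and mutation; this Java sketch is only here to show what the XSLT has to get along without:

import java.util.HashMap;
import java.util.Map;

final class DisjointSets<T> {
  private final Map<T, T> parent = new HashMap<T, T>();

  // Find the representative of x's set, compressing paths along the way.
  T find(T x) {
    T p = parent.get(x);
    if (p == null) {
      parent.put(x, x);
      return x;
    }
    if (!p.equals(x)) {
      p = find(p);
      parent.put(x, p);
    }
    return p;
  }

  // Merge the sets containing a and b.
  void union(T a, T b) {
    parent.put(find(a), find(b));
  }
}

Union each document with every entity it references, and documents that share an entity end up with the same representative. The approach below gets to the same grouping with keys and set operators instead of mutable state.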

Here's the input I want to process:

<?xml version="1.0"?>
<data>
<documents>
<document id="a">
<related-entities>1 2 3</related-entities>
</document>
<document id="b">
<related-entities>2 3 4</related-entities>
</document>
<document id="c">
<related-entities>5 6 7</related-entities>
</document>
<document id="d">
<related-entities>7 8 9</related-entities>
</document>
<document id="e">
<related-entities>10</related-entities>
</document>
</documents>
<entities>
<entity id="1"/>
<entity id="2"/>
<entity id="3"/>
<entity id="4"/>
<entity id="5"/>
<entity id="6"/>
<entity id="7"/>
<entity id="8"/>
<entity id="9"/>
<entity id="10"/>
</entities>
</data>

I'm going to create a hashmap (actually, a hashmultimap) from entity IDs to documents, and vice versa. Note that the keys have to be of an atomic value type, and the values have to be elements. You can't map, say, integers to strings. Alright: for any document I can get the list of entity IDs, and use them as keys pointing back to the document element. So if any two documents point to the same entity, those two documents (and, obviously, the entity as well) must belong in the same strongly-connected component. For a sequence of entity IDs I can create a set of document elements; I then have to determine if this set intersects with the set I've already accumulated. If yes, I union both sets and recurse; otherwise I spit out the subgraph, pop the stack, and resume with the next, as-yet-unselected document.

<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="subgraphs">
<xsl:key name="linked-documents" match="/data/documents/document" use="tokenize(related-entities, ' ')"/>
<xsl:key name="linked-entities" match="/data/entities/entity" use="key('linked-documents', @id)/@id"/>
<xsl:template match="document">
<document id="{@id}"/>
</xsl:template>
<xsl:template name="create-subgraph">
<xsl:param name="subgraph" as="element(document)*"/>
<xsl:param name="selection" as="element(document)*" />
<xsl:message select="string-join($subgraph/@id, ',')" />
<xsl:message select="string-join($selection/@id, ',')" />
<xsl:choose>
<!-- If there are no more elements to select from, the subgraph is complete (even if empty). -->
<xsl:when test="not($selection)">
<subgraph>
<xsl:apply-templates select="$subgraph"/>
</subgraph>
</xsl:when>
<xsl:otherwise>
<xsl:variable name="linked-entities" as="element(entity)*" select="for $d in $subgraph return key('linked-entities', $d/@id)"/>
<xsl:message select="string-join($linked-entities/@id, ',')" />
<xsl:variable name="subselection" select="$selection[key('linked-entities', @id) intersect $linked-entities]"/>
<xsl:message select="string-join($subselection/@id, ',')" />
<xsl:choose>
<xsl:when test="$subselection">
<xsl:call-template name="create-subgraph">
<xsl:with-param name="selection" select="$selection except $subselection"/>
<xsl:with-param name="subgraph" select="$subgraph|$subselection"/>
</xsl:call-template>
</xsl:when>
<xsl:otherwise>
<subgraph>
<xsl:apply-templates select="$subgraph"/>
</subgraph>
<xsl:variable name="new-selection" select="$selection except $subselection"/>
<xsl:call-template name="create-subgraph">
<xsl:with-param name="selection" select="$new-selection[position() gt 1]"/>
<xsl:with-param name="subgraph" select="$new-selection[1]"/>
</xsl:call-template>
</xsl:otherwise>
</xsl:choose>
</xsl:otherwise>
</xsl:choose>
</xsl:template>

<xsl:template match="/">
<subgraphs>
<xsl:call-template name="create-subgraph">
<!-- Make the first document element an SCC. -->
<xsl:with-param name="subgraph" select="/data/documents/document[1]"/>
<xsl:with-param name="selection" select="/data/documents/document[position() gt 1]"/>
</xsl:call-template>
</subgraphs>
</xsl:template>
</xsl:stylesheet>


In the end, sure enough, I get:

<subgraphs xmlns:fn="subgraphs" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<subgraph>
<document id="a"/>
<document id="b"/>
</subgraph>
<subgraph>
<document id="c"/>
<document id="d"/>
</subgraph>
<subgraph>
<document id="e"/>
</subgraph>
</subgraphs>

Yes, defining one mapping in terms of another is allowed in XSLT 2.0. The problem I had at work was actually more complicated: the entities were in entirely separate XML documents. The trick there, which cost me some sweat, was to define one mapping, which pointed from a document immediately to other documents. However, I'll put off that explanation until someone asks for it.

Friday, July 18, 2008

How to Convert Julian Dates to Gregorian

I'm writing Java / Jython / Groovy these days, not F#, at a new job, and haven't had time to maintain this blog, but I did want to post one little snippet that I was surprised not to find elsewhere. I've had to ingest 10s of 1000s of records from a customer's Microsoft Access database, and the dates are represented in Julian format. Not relative to the Julian calendar; I mean the number of days since a particular Day 0. For Microsoft Access, Day 0 is December 30, 1899. If Microsoft Access interests you, read up on it here.

Anyway, there are quite a few Web pages out there describing how to do it in Excel or T-SQL, but none describing how to do it in Java. Actually, it's ridiculously easy, which doesn't say much for me. Here's the solution in Jython:

from java.util import *

def convertJulian(julian):
    calendar = Calendar.getInstance()
    # Why in God's name are months in Java 0-based?
    # This is December 30, 1899.
    calendar.set(1899, 11, 30)
    calendar.add(Calendar.DATE, julian)
    return calendar.time

D'oh! Set Day 0, add the days, and you're good to go. Now in Java:

private static Date convertJulian(final int julian) {
    Calendar calendar = Calendar.getInstance();
    calendar.set(1899, 11, 30);
    calendar.add(Calendar.DATE, julian);
    return calendar.getTime();
}

Finally, as XSLT (you'll need an XSLT 2.0-compliant processor for this):

<xsl:function name="local:convert-access-date" as="xs:date">
<xsl:param name="access-date" as="xs:integer"/>
<xsl:variable name="days-since-zero" as="xs:dayTimeDuration"
select="xs:dayTimeDuration(concat('P', $access-date, 'D'))"/>
<xsl:sequence select="$zero + $days-since-zero"/>
</xsl:function>
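
As an aside, on current JDKs (java.time didn't exist when this was written) the whole conversion is a one-liner; a sketch:

import java.time.LocalDate;

public class AccessDateDemo {
  public static void main(String[] args) {
    int julian = 39647; // some Access date serial; the value here is arbitrary
    // Day 0 for Access is December 30, 1899.
    LocalDate date = LocalDate.of(1899, 12, 30).plusDays(julian);
    System.out.println(date);
  }
}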

Saturday, May 24, 2008

John Cage for Children

Imagine my surprise when I came home from work a few days ago to find this album blasting from my Meadowlark speakers. I did buy them to listen to art music, not Wiggles DVDs, but I didn't expect my toddler to rummage through my CDs and pick this one to play. Kids today grow up so fast!






Here's the album I want him to listen to though, because I want to train him to be normal. Komar and Melamid, those jokesters, have assembled musical material that people say they like, resulting in "a musical work that will be unavoidably and uncontrollably 'liked' by 72 ± 12% of listeners (standard deviation; Kolmogorov-Smirnov statistic)." I have to admit that I prefer the composition that only 200 people in the world are supposed to like; it sounds to me like John Cage sounds to those other people. You can listen to recordings of both here. The unpleasant recording had to be fun to make; the pleasant one sounds like a bored Alicia Keys cover band playing at an airport lounge.

Sunday, May 04, 2008

Counting Change Again?

Problem 76 at Project Euler is really just the "change counting" problem: how many ways can you count out 100 cents if you have coins in denominations from 1 to 99? However, when I plug those values into my previous implementations of the change counting algorithm, it runs forever...well, maybe not forever, but for hours and hours, while using 100% of my RAM. OK, so that's not going to work. At first I thought that the problem was simply that too many of the answers were in memory (i.e. sequences of coins), so I changed the implementation, simply to count possible answers:

let rec countChange coins amount =
    match coins, amount with
    | _, 0 -> 1L
    | [], _ -> 0L
    | h::t, _ -> [ 0..h..amount ]
                 |> List.map (fun amt' -> countChange t (amount - amt'))
                 |> Seq.fold1 (+)
Let's review this. How many ways are there to make 0 in change, whatever the available coins? There's 1 way. How many ways are there to make some other quantity in change if you have no coins? 0. Finally, how many ways are there to make C in change if you have a list of coins h::t? Partition the problem thus: use the coin h to pay out a partition of C (obviously you can use coin h up to C/h times), then make up the remaining partition of C with the rest of the list. If, to solve each subproblem, you're dividing C by the denomination of a coin, then the size of the problem space is exponential in the number of denominations: (C/denomination_1) * (C/denomination_2) * ... * (C/denomination_n) = O(C^n). You can't exactly tell from this example, but the rate of growth here is clearly superpolynomial:

> countChange [ 4; 5 ] 20;;
val it : int64 = 2L
> countChange [ 3; 4; 5 ] 20;;
val it : int64 = 6L
> countChange [ 3; 4; 5; 6 ] 20;;
val it : int64 = 11L
> countChange [2; 3; 4; 5; 6 ] 20;;
val it : int64 = 47L
>
Anyway, it should be clear why there's no point in waiting for the solution to countChange [ 1..99 ] 100. However, I can memoize the change counting algorithm, since it's deterministic: for a given amount, and a given set of coins, the answer is always the same (commerce would be hopeless otherwise). This turns out to be fairly easy, notwithstanding the two arguments to the function. Here's my answer:

open System.Collections.Generic;;

let memoize f =
    let cache = Dictionary<_,_>()
    fun x y ->
        if cache.ContainsKey((x, y)) then
            cache.[(x, y)]
        else
            // Remember that f will always consult the memo
            // for subproblems.
            let result = f x y
            cache.[(x, y)] <- result
            result

let rec countChange =
    // Turn this into a function that returns a function.
    // The *returned* function will be evaluated over the arguments.
    let countChange' coins amount =
        match coins, amount with
        | _, 0 -> 1L
        | [], _ -> 0L
        | h::t, _ -> [ 0..h..amount ]
                     |> List.map (fun amt' -> countChange t (amount - amt'))
                     |> Seq.fold1 (+)
    memoize countChange'
The answer shows up in a few seconds. Since F# lists are immutable, they can easily be compared for reference equality, rather than in O(n^2) time. That is, the initial list [ 99..(-1)..1 ] is a 99 prepended to the list [ 98..(-1)..1 ], and so on. We partition the problem into subproblems corresponding to the head and tail of the list, but no additional lists are created beyond the argument to the function (there's only one list beginning with 50, ever, and it's the tail of the list beginning with 51). Microsoft.FSharp.Collections.List<_>.CompareTo() sees literally the same lists again and again when called from memoize(). Hence the lickety-split performance.

Perhaps this could all be done imperatively, though I don't see how OOP is at all applicable to this sort of classic optimization problem (what could justify a Coin class here)? If you haven't read Daniel Friedman's "The Role of the Study of Programming Languages in the Education of a Programmer," here's your chance. As my buddy George Paci puts it, "The title is clearly written by a professor; a marketing guy would have called it something like, 'Holy Crap! It's Amazing What You Can Do with Program Transformations!'"
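
For what it's worth, the memoized count translates to imperative Java readily enough, and still without a Coin class; a rough sketch:

import java.util.HashMap;
import java.util.Map;

public class ChangeCounter {
  private final int[] coins;
  private final Map<Long, Long> memo = new HashMap<Long, Long>();

  ChangeCounter(int[] coins) {
    this.coins = coins;
  }

  // Number of ways to make 'amount' out of coins[index..].
  long count(int index, int amount) {
    if (amount == 0) {
      return 1L;
    }
    if (index == coins.length) {
      return 0L;
    }
    long key = ((long) index << 32) | amount;
    Long cached = memo.get(key);
    if (cached != null) {
      return cached;
    }
    long total = 0L;
    // Use 0, 1, 2, ... of the current denomination, then recurse on the rest.
    for (int used = 0; used * coins[index] <= amount; used++) {
      total += count(index + 1, amount - used * coins[index]);
    }
    memo.put(key, total);
    return total;
  }

  public static void main(String[] args) {
    int[] coins = new int[99];
    for (int i = 0; i < 99; i++) {
      coins[i] = i + 1; // denominations 1 through 99
    }
    System.out.println(new ChangeCounter(coins).count(0, 100));
  }
}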

Wednesday, April 30, 2008

Agent Less Than Zero

I just started a new job, and I'm writing a lot of Java all of a sudden. Hey, maybe I'll learn Scala and piss off my new colleagues just as I was pissing people off with F# a few weeks ago! Or maybe I'll put off blogging about F# for a few weeks until I get settled in. I have been using F# to solve Project Euler problems, but I feel guilty blogging about those, since I'd have to give away answers.

In the meantime, I have to agree with Charles Barkley, who said, "I think the Washington Wizards have got to be the dumbest team in the history of civilization." I think that Caron Butler is a smart player, mind you, and he *should* have had the ball in his hands at the end of tonight's game.

Wednesday, April 02, 2008

Memoization, etc.

I've been sweating over some problems at Project Euler, and I've verified that I'm no mathematician. Some enterprising Haskell programmers got there way before I did, and if I were inclined to cheat, I'd simply translate their solutions into F#. However, in the effort or out of the necessity to remain pure, they don't use for-loops over arrays, or their own memoization, which I'd almost certainly do for any production code that had to calculate the Fibonacci series...well, actually, I don't do that kind of work. Anyway...

I did find the Haskell samples instructive, though I don't want to get sidetracked from my project of learning F#. To this end I've once again resorted to Don Syme's thoughts on memoization: here on the CAML list, and here more recently, where he avails himself of some popular .NET collections. So a memoized Fibonacci function in Dr. Syme's older formulation looks like this:

> #light;;

> let cache f =
-     let table = ref [] in
-     fun n ->
-         try
-             List.assoc n !table
-         with Not_found ->
-             let f_n = f n in
-             table := (n, f_n) :: !table;
-             f_n
-
- let rec fib_mem =
-     cache (function
-         // You'd better return a bigint, not an int!
-         | 0 -> 0I
-         | 1 -> 1I
-         | n -> fib_mem (n - 1) + fib_mem (n - 2))
- ;;

val cache : ('a -> 'b) -> ('a -> 'b)
val fib_mem : (int -> bigint)

> fib_mem(1000);;
val it : bigint
= 43466557686937456435688527675040625802564660517371780402481729089536555417949051890403879840079255169295922593080322634775209689623239873322471161642996440906533187938298969649928516003704476137795166849228875I

Several problems at Project Euler ask you to do something with the first 10 or last 10 digits of some big integer; for anyone who was wondering how to do that with a number such as the one above, here's one way:

> fib_mem(1000) |> Seq.unfold (fun n -> if n > 0I
-                                       then
-                                           let quotient, remainder = BigInt.divmod n 10I
-                                           Some (remainder, quotient)
-                                       else
-                                           None);;
val it : seq<bigint> = seq [5I; 7I; 8I; 8I; ...]
>
You'll notice that the least significant digits appear first, which is of course by design, since dividing by 10 is probably cheap. The last thing I'd want to do is reverse this sequence, i.e.

- |> List.of_seq
- |> List.rev;;
I could clean this up further by applying BigInt.to_Int32 to the generated elements, etc. To sum the first 10 digits (note that this is *not* the Project Euler problem; I'm not about to give away any answers) I could go further, like so:

- |> Seq.take 10
- |> Seq.fold1 (+);;
You get the idea.
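
The same least-significant-digit-first unfolding is easy in Java too, for comparison (a sketch using BigInteger.divideAndRemainder):

import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class Digits {
  // Returns the digits of n, least significant first, by repeated division by 10.
  static List<BigInteger> digits(BigInteger n) {
    List<BigInteger> result = new ArrayList<BigInteger>();
    while (n.signum() > 0) {
      BigInteger[] quotientAndRemainder = n.divideAndRemainder(BigInteger.TEN);
      result.add(quotientAndRemainder[1]);
      n = quotientAndRemainder[0];
    }
    return result;
  }
}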

Tuesday, March 11, 2008

Munging with Active Patterns

Have you checked out active patterns yet? C'mon! Here's what the code below might have looked like if I'd used them:

> open System;;

> open System.Text.RegularExpressions;;

> let re = new Regex("^(MON|TUE|WED|THU|FRI|SAT|SUN)", RegexOptions.Compiled);;

val re : Regex
> let (|DataRow|_|) (s : string) =
-     if re.IsMatch s
-     // Here I'd want to break the line down.
-     then Some(1, 2, "Hello!")
-     else None

val ( |DataRow|_| ) : string -> (int * int * string) option
As I said earlier, Seq.choose will throw out None, and I could then handle different kinds of data rows from the input:

> match "TUE" with
- | DataRow(x, y, s) -> Console.WriteLine(s)
- // Additional cases here...
- | _ -> Console.WriteLine("Got bupkes!");;
Hello!
val it : unit = ()

Monday, March 10, 2008

Munging Data in F#, Again

I couldn't help myself. I was recently asked to take a coding test to prove my fitness for a potential new job, and I figured I could do it more quickly in F#, interactively. I wouldn't have to fire up Visual Studio, I wouldn't have to create an executable project with NUnit tests and debuggability and all that ceremony. But really, my dabblings in functional programming have taught me that decomposition into classes is not always the natural way to decompose a problem. Why do I need any classes at all to do this exercise?

> #light;;
> open System.IO;;
> open System.Text.RegularExpressions;;
> let r = new Regex("^(MON|TUE|WED|THU|FRI|SAT|SUN)", RegexOptions.Compiled);;

The line must start with one of these abbreviations. Hey, Blogger keeps taking out the vertical bars between the abbreviations; if they're missing, don't get distracted. Then I want to turn the file into a sequence of strings, i.e. IEnumerable<string>. A "use" as opposed to a "let" binding ensures that the Close() or Dispose() is finally called. See Don Syme's blog for examples (this example, actually, which I simply copied). It's more or less like a C# function that returns IEnumerable<_> by means of "yield return."



> let reader =
-     { use reader = new StreamReader(File.OpenRead(@"C:\...\input.txt"))
-       while not reader.EndOfStream do
-           yield reader.ReadLine() };;
// Filter out the lines that aren’t data rows.
> let filtered = reader |> Seq.filter (fun line -> r.IsMatch line);;
> open System;;

Now decompose the line into the parts we’re interested in, based on the (assumed) fixed format. If you want to get your mind blown by "active patterns," dig my man Don Syme.



> let (|Record|) (line : string) =
-     let transientSold = Int32.Parse(line.Substring(67, 7))
-     let definite = Int32.Parse(line.Substring(25, 3))
-     let tentative = Int32.Parse(line.Substring(19, 3))
-     let date = line.Substring(5, 11)
-     // Return the tuple of interesting values.
-     date, transientSold, definite, tentative;;
// This should really be embedded in a function!
> open System.Xml;;
> let xml = XmlWriter.Create("output.xml");;
> xml.WriteStartDocument();;
> xml.WriteStartElement("Sample");;

Now pass the filtered lines through a function that decomposes the line into fields we care about, and output an XML element per. Seq.iter applies a function with side-effects but no return value to every element in a sequence.



> filtered |> Seq.iter (fun line ->
-     match line with
-     | Record(date, ts, d, t) ->
-         xml.WriteStartElement("Thing")
-         xml.WriteElementString("Date", date)
-         xml.WriteElementString("TransientSold", XmlConvert.ToString(ts))
-         xml.WriteElementString("CommitmentsDefinite", XmlConvert.ToString(d))
-         xml.WriteElementString("CommitmentsTentative", XmlConvert.ToString(t))
-         xml.WriteEndElement()
-     );;
val it : unit = ()
// Likewise, this should have been put into the function I inlined above.
> xml.WriteEndElement();;
val it : unit = ()
> xml.Close();;
val it : unit = ()

I could have made some different choices here. First of all, opening an XmlWriter in one function, or directly from the command line, then closing it the same way, is rather ugly. If a function is the unit of encapsulation in functional programming, then one function should really own the writer via a use-binding. So I could do something like this:


filtered |> (fun lines ->
    use xmlWriter = XmlWriter.Create("output.xml")
    xmlWriter.WriteStartDocument()
    xmlWriter.WriteStartElement("Sample")
    // Yes, it's a loop!
    for line in lines do
        // Create each element...
    )

The following may be too cute:


filtered |> Seq.fold (fun () -> let xmlWriter = ...
                                xmlWriter.WriteStartDocument()
                                // additional initialization
                                xmlWriter
                     )
                     (fun writer line ->
                         // Add an element to the XmlWriter.
                         writer
                     )
         |> (fun writer -> writer.Close())

A significant change would have been to use an active pattern to both look for lines of interest and decompose them into interesting tuples. In this case I either keep the line or throw it out, but there might be several data row formats that I care about. In this event I could use active patterns to discriminate between them, and embed any regular expressions in the pattern function. Furthermore, I could use Seq.choose instead of Seq.filter, because the former passes over None. If I get a chance tonight, I'll write that code out.
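
As an aside, the "one function owns the writer" idea is what try-with-resources gives you in Java; a minimal sketch (the element names are made up):

import java.io.FileWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class OwnedWriterDemo {
  public static void main(String[] args) throws Exception {
    // The underlying writer is opened, used, and closed in a single scope,
    // much like an F# use-binding inside one function.
    try (FileWriter out = new FileWriter("output.xml")) {
      XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
      xml.writeStartDocument();
      xml.writeStartElement("Sample");
      xml.writeStartElement("Date");
      xml.writeCharacters("MON 10 MAR 2008");
      xml.writeEndElement();
      xml.writeEndElement();
      xml.writeEndDocument();
      xml.close();
    }
  }
}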

Saturday, February 02, 2008

Munging some XML with F# & Linq

I recently wanted to change the format of a test script, and canonicalize a flattened listing of all the regression test models. So I set myself the goal of doing it in F#, which would force me to use some of the language features I'd only read about, and possibly prod me to use some LINQ as well. I could have done this transformation manually in about an hour, I admit. And I learned that if I really want to transform a lot of XML to XML, I'd rather do it via XML, i.e. XSLT. However, Microsoft doesn't offer XSLT 2.0 compliance, so I have less to gain from learning to do it that way (meaning, I'm stuck with the .NET XslCompiledTransform implementation at work). What I would have needed is the tokenize function (to split up pathnames on the '\\' character), XSLT functions that could return Boolean values, and the ability to select groups. Anyway, that's what I did below.

#light

open System
open System.Xml
open System.Linq
open System.Xml.Linq
open Microsoft.FSharp.Linq
open Microsoft.FSharp.Xml.Linq
open Microsoft.FSharp.Linq.SequenceOps
open Microsoft.FSharp.Xml.Linq.SequenceOps

let doc = XDocument.Load @"knownSolutions.xml"

// Simply tagged aliases for these various tuples. I want to create
// a tree of these, then finally convert them back to XML.
type Node =
| File of string * XElement
| Directory of string * Node list

// the temporary data structure: I'll create a list of these from
// the initial XML, then recursively group them into directories.
type Path = string list * XElement

let make_path (e : XElement) =
e.Element(xname "File") |> element_to_string |> String.split [ '\\' ], e

// I want to split up the path names, while carrying the original
// element around until I'm ready to use it.
let paths = doc.Root.Elements(xname "Model")
|> Seq.map make_path
|> Seq.to_list // There's no Seq.partition, so...

let is_file (p, _) =
(List.length p) = 1

let strip (p, e) =
// Strip the next part of the path, and return this tuple.
List.tl p, e

let separate (l : Path list) =
List.partition is_file l

let rec collect (l : Path list) =
match l with
| [] -> []
| _ -> let files, directories = separate l
let x = directories
|> Seq.groupBy (fun (l, _) -> List.hd l)
|> Seq.map make_directories
|> Seq.to_list
(make_files files) @ x
and make_directories (l, r) =
Directory ( l, r |> Seq.map strip |> Seq.to_list |> collect )
and make_files paths =
paths |> List.map (fun (l, e) -> File (List.hd l, e))

let collected = paths |> collect

let make_xml l =
// Yes, you need the "let rec" here, too. That took me a while to
// figure out. Otherwise these nested functions can't call each other.
let rec make_file name (xml : XElement) =
// If you want to mix XElement and XAttribute objects, you'll have
// to upcast them to XObject. F#'s type inference won't try to find the
// lowest common denominator between two classes in a heterogeneous
// collection, as far as I can tell.
XElement(xname "File", xargs [XAttribute(xname "Name", name); XAttribute(xname "ObjectiveValue", xml.Element(xname "Solution") |> element_to_string)])
and make_directory name nodes =
let element = XElement(xname "Directory", xargs (nodes |> List.map make_xml'))
element.SetAttributeValue(xname "Name", name)
element
and make_xml' e =
match e with
| File (name, element) -> make_file name element
| Directory (name, element) -> make_directory name element
match l with
| [] -> []
| _ -> List.map make_xml' l

let xml = XElement(xname "KnownSolutions", make_xml collected)
let s = xml.ToString()
Console.WriteLine(s)

What's an F# Tuple?


It's hardly a difficult question, but it's taken me a while to arrive at an answer. Reflector has been invaluable to me, and I can't recommend it strongly enough to anyone who wants or needs to understand what the F# compiler does. One often hears that functional programming more readily lets one find the proper level of abstraction, but sometimes one has to look at the IL, perhaps as a bridge to C# (the "canonical" .NET language), to improve.


Let's say I define these tuple types:


type StringInt = string * int
type StringFloat = string * float



If I disassemble the resulting DLL, I find...nothing at all! There's no trace in the IL of these tuple types, or anything else:
public static void main()
{
}



Let me add some functions that operate on this code:
let get_int (t : StringInt) =
let _, n = t in n

let get_float (t : StringFloat) =
let _, f = t in f


Something interesting now shows up in Reflector's tree view of the
assembly:


main() then looks like this (screenshots made using Cropper):

I can't claim that all of this is comprehensible to me, but one thing is obvious: the compiled IL has no trace of the named tuple types. They're like structs insofar as they're equal when they have equal structure. Tuple types with the same structure are really the same type. They are not implemented as structs (presumably to avoid having to copy them to the stack on every recursion), but they're implemented as classes. Sealed classes, mind you:



The disassembled get_int() function looks like this; I don't know why a new instance is being allocated from the heap:
Interestingly, if I change my code thus, I can no longer call get_int() and get_float() with the resulting values, because the "discriminated union" Tuples is implemented as a class, rather like a C-style union with type tag and some smarter operations (the disassembled IL is much too verbose to reproduce here, but I recommend trying it):


type Tuples =
| One of StringInt
| Two of StringFloat

let si = One ("Carlo", 6)
let sf = Two ("Leo", 1.5)

Console.WriteLine (si)
Console.WriteLine (sf)


Note that I don't have to specify the tuple type; the compiler infers it from the tag "One" or "Two". What I've learned from Don Syme's Expert F# is that the tag most recently put in scope is the one that the compiler tries to use. Then the disassembled code looks like this, meaning that you can no longer get at the tuple except through the tag: