Saturday, February 02, 2008

Munging some XML with F# & Linq

I recently wanted to change the format of a test script, and canonicalize a flattened listing of all the regression test models. So I set myself the goal of doing it in F#, which would force me to use some of the language features I'd only read about, and possibly prod me to use some LINQ as well. I could have done this transformation manually in about an hour, I admit. And I learned that if I really want to transform a lot of XML to XML, I'd rather do it via XML, i.e. XSLT. However, Microsoft doesn't offer XSLT 2.0 compliancy, so I have less to gain from learning to do it that way (meaning, I'm stuck with the .NET XsltCompiledTransform implementation at work). What I would have needed is the tokenize function (to split up pathnames on the '\\' character), XSLT functions that could return Boolean values, and the ability to select groups. Anyway, that's what I did below.


open System
open System.Xml
open System.Linq
open System.Xml.Linq
open Microsoft.FSharp.Linq
open Microsoft.FSharp.Xml.Linq
open Microsoft.FSharp.Linq.SequenceOps
open Microsoft.FSharp.Xml.Linq.SequenceOps

let doc = XDocument.Load @"knownSolutions.xml"

// Simply tagged aliases for these various tuples. I want to create
// a tree of these, then finally convert them back to XML.
type Node =
| File of string * XElement
| Directory of string * Node list

// the temporary data structure: I'll create a list of these from
// the initial XML, then recursively group them into directories.
type Path = string list * XElement

let make_path (e : XElement) =
e.Element(xname "File") |> element_to_string |> String.split [ '\\' ], e

// I want to split up the path names, while carrying the original
// element around until I'm ready to use it.
let paths = doc.Root.Elements(xname "Model")
|> make_path
|> Seq.to_list // There's no Seq.partition, so...

let is_file (p, _) =
(List.length p) = 1

let strip (p, e) =
// Strip the next part of the path, and return this tuple. p, e

let separate (l : Path list) =
List.partition is_file l

let rec collect (l : Path list) =
match l with
| [] -> []
| _ -> let files, directories = separate l
let x = directories
|> Seq.groupBy (fun (l, _) -> List.hd l)
|> make_directories
|> Seq.to_list
(make_files files) @ x
and make_directories (l, r) =
Directory ( l, r |> strip |> Seq.to_list |> collect )
and make_files paths =
paths |> (fun (l, e) -> File (List.hd l, e))

let collected = paths |> collect

let make_xml l =
// Yes, you need the "let rec" here, too. That took me a while to
// figure out. Otherwise these nested functions can't call each other.
let rec make_file name (xml : XElement) =
// If you want to mix XElement and XAttribute objects, you'll have
// to upcast them to XObject. F#'s type inference won't try to find the
// lowest common denominator between two classes in a heterogeneous
// collection, as far as I can tell.
XElement(xname "File", xargs [XAttribute(xname "Name", name); XAttribute(xname "ObjectiveValue", xml.Element(xname "Solution") |> element_to_string)])
and make_directory name nodes =
let element = XElement(xname "Directory", xargs (nodes |> make_xml'))
element.SetAttributeValue(xname "Name", name)
and make_xml' e =
match e with
| File (name, element) -> make_file name element
| Directory (name, element) -> make_directory name element
match l with
| [] -> []
| _ -> make_xml' l

let xml = XElement(xname "KnownSolutions", make_xml collected)
let s = xml.ToString()

What's an F# Tuple?

It's hardly a difficult question, but it's taken me a while to arrive at an answer. Reflector has been invaluable to me, and I can't recommend it strongly enough to anyone who wants or needs to understand what the F# compiler does. One often hears that functional programming more readily lets one find the proper level of abstraction, but sometimes one has to look at the IL, perhaps as a bridge to C# (the "canonical"
.NET language), to improve.

Let's say I define these tuple types:

type StringInt = string * int
type StringDouble = string * double

If I disassemble the resulting DLL, I find...nothing at all! There's no trace in the IL of these tuple types, or anything else:
public static void main()

Let me add some functions that operate on this code:
let get_int (t : StringInt) =
let _, n = t in n

let get_float (t : StringFloat) =
let _, f = t in f

Something interesting now shows up in Reflector's tree view of the

main() then looks like this (screenshots made using Cropper):

I can't claim that all of this is comprehensible to me, but one thing is obvious: the compiled IL has no trace of the named tuple types. They're like structs insofar as they're equal when they have equal structure. Tuple types with the same structure are really the same type. They are not implemented as structs (presumably to avoid having to copy them to the stack on every recursion), but they're implemented as classes. Sealed classes, mind you:

The disassembled get_int() function looks like this; I don't know why a new instance is being allocated from the heap:
Interestingly, if I change my code thus, I can no longer call get_int() and get_float() with the resulting values, because the "discrimated union" Tuples is implemented as a class, rather like a C-style union with type tag and some smarter operations (the disassembled IL is much too verbose to reproduce here, but I recommend trying it):

type Tuples =
| One of StringInt
| Two of StringFloat

let si = One ("Carlo", 6)
let sf = Two ("Leo", 1.5)

Console.WriteLine (si)
Console.WriteLine (sf)

Note that I don't have to specify the tuple type; the compiler infers if from the tag "One" or "Two". What I've learned from Don Syme's Expert F# is that the tag most recently put in scope is the one that the compiler tries to use. Then the disassembled code looks like this, meaning that you can no longer get at the tuple except through the tag: