Monday, March 10, 2008

Munging Data in F#, Again

I couldn't help myself. I was recently asked to take a coding test to prove my fitness for a potential new job, and I figured I could do it more quickly in F#, interactively. I wouldn't have to fire up Visual Studio, I wouldn't have to create an executable project with NUnit tests and debuggability and all that ceremony. But really, my dabblings in functional programming have taught me that decomposition into classes is not always the natural way to decompose a problem. Why do I need any classes at all to do this exercise?

> #light;;
> open System.Text.RegularExpressions;;
> let r = new Regex("^(MON¦TUE¦WED¦THU¦FRI¦SAT¦SUN)", RegexOptions.Compiled);;

The line must start with one of these abbreviations. Hey, Blogger keeps taking out the vertical bars between the abbreviations; if they're missing, don't get distracted. Then I want to turn the file into a sequence of strings, i.e. IEnumerable<string>. A "use" as opposed to a "let" binding ensures that the Close() or Dispose() is finally called. See Don Syme's blog for examples (this example, actually, which I simply copied). It's more or less like a C# function that returns IEnumerable<_> by means of "yield return."



> let reader =
- { use reader = new StreamReader(File.OpenRead(@"C:\...\input.txt"))
- while not reader.EndOfStream do
- yield reader.ReadLine() };;
// Filter out the lines that aren’t data rows.
> let filtered = reader ¦> Seq.filter (fun line -> r.IsMatch line);;
> open System;;

Now decompose the line into the parts we’re interested in, based on the (assumed) fixed format. If you want to get your mind blown by "active patterns," dig my man Don Syme.



> let (¦Record¦) (line : string) =
- let transientSold = Int32.Parse(line.Substring(67, 7))
- let definite = Int32.Parse(line.Substring(25, 3))
- let tentative = Int32.Parse(line.Substring(19, 3))
- let date = line.Substring(5, 11)
- // Return the tuple of interesting values.
- date, transientSold, definite, tentative;;
// This should really be embedded in a function!
> open System.Xml;;
> let xml = XmlWriter.Create("output.xml");;
> xml.WriteStartDocument();;
> xml.WriteStartElement("Sample");;

Now pass the filtered lines through a function that decomposes the line into fields we care about, and output an XML element per. Seq.iter applies a function with side-effects but no return value to every element in a sequence.



> filtered ¦> Seq.iter (fun line ->
- match line with
- ¦ Record(date, ts, d, t) -> xml.WriteStartElement("Thing")
- xml.WriteElementString("Date", date)
- xml.WriteElementString("TransientSold", XmlConvert.ToString(ts))
- xml.WriteElementString("CommitmentsDefinite", XmlConvert.ToString(d))
- xml.WriteElementString("CommitmentsTentative", XmlConvert.ToString(t))
- xml.WriteEndElement()
- );;
val it : unit = ()
// Likewise, this should have been put into the function I inlined above.
> xml.WriteEndElement();;
val it : unit = ()
> xml.Close();;
val it : unit = ()

I could have made some different choices here. First of all, opening an XmlWriter in one function, or directly from the command line, then closing it the same way, is rather ugly. If a function is the unit of encapsulation in functional programming, then one function should really own the writer via a use-binding. So I could do something like this:


filtered ¦> (fun lines -> 
use xmlWriter = XmlWriter.Create("output.xml")
xmlWriter .WriteStartDocument()
xmlWriter .WriteStartElement("Sample")
// Yes, it's a loop!
for line in lines do
// Create each element...
)

The following may be too cute:


filtered  ¦> Seq.fold (fun () -> let xmlWriter = ...
xmlWriter.WriteStartDocument()
// additional initialization
xmlWriter
)
(fun writer line ->
// Add an element to the XmlWriter.
writer
)
¦> (fun writer -> writer.Close())

A significant change would have been to use an active pattern to both look for lines of interest and decompose them into interesting tuples. In this case I either keep the line or throw it out, but there might be several data row formats that I care about. In this event I could use active patterns to discriminate between them, and embed any regular expressions in the pattern function. Furthermore, I could use Seq.choose instead of Seq.filter, because the former passes over None. If I get a chance tonight, I'll write that code out.

No comments: