Saturday, October 25, 2008

Efficient “Disjoint Sets” Implementation in XSLT

I actually had to revisit this problem with slightly different input, and once again I struggled. If I represent an undirected graph like this:

<?xml version="1.0"?>
<graph>
    <nodes>
        <node id="1"/>
        <node id="2"/>
        <node id="3"/>
        <node id="4"/>
        <node id="5"/>
    </nodes>
    <links>
        <link end1="1" end2="2"/>
        <link end1="1" end2="3"/>
        <link end1="1" end2="4"/>
        <link end1="1" end2="5"/>
    </links>
</graph>

…then it's a little harder to go from one node to its connected nodes, because the information I need is elsewhere in the XML document (i.e. not in a child node of the <node> element) . To index any node element, I'd have to navigate up the tree, and back down to the links. What's worse, I'd have to scan all the links for every node. So the solution involves some indirection. First I declare these two mappings:

<xsl:key name="left" match="/graph/links/link" use="@end1"/>
<xsl:key name="right" match="/graph/links/link" use="@end2"/>


 

Now I map each node (its ID, actually; remember that the key has to be of an atomic type) to its connected nodes by going through these mappings (such a recursive definition is not possible in XSLT 1.0, I believe). Note that for links that were indexed by @end1, I'm interested in the value of @end2, and v.v.


<xsl:key name="connected" match="node" use="key('left', @id)/@end2, key('right', @id)/@end1"/>


 

Finally, I can use this mapping:


 

<xsl:template match="/">
        <connectivity>
            <xsl:for-each select="graph/nodes/node">
                <node id="{@id}">
                    <connected-to>
                        <xsl:value-of select="key('connected', @id)/@id"/>
                    </connected-to>
                </node>
            </xsl:for-each>
        </connectivity>
    </xsl:template>


 

I believe that these are all the data structures I need. Now I can implement the "strongly connected components" algorithm as I did below, but using set operators on the nodes.

Friday, October 03, 2008

Graph Algorithms in XSLT

Notwithstanding that XSLT can be considered a functional programming language, you probably wouldn't choose to implement algorithms in it. Nonetheless, I recently had to split a very large XML document into smaller pieces (using xsl:result-document), and I wanted to redundancies between the outputs. So what I essentially wanted to do was separate elements in the input into "strongly-connected components." XSLT 2.0 provides lots of operators on sequences and sets, and allows you to define property functions, but what it does not lend itself to is the creation of new data structures. To be more precise, the data structures you'd need for depth-first search or Tarjan's disjoint sets would require you to copy elements into new structures. Easy, if you're working in a language that has pointer or reference equality; not possible in XLST. However, if you assume that the XSLT processor will use hash tables to represent sets, and you put that together with <xsl:key>, you can try a different approach.

Here's the input I want to process:

<?xml version="1.0"?>
<data>
<documents>
<document id="a">
<related-entities>1 2 3</related-entities>
</document>
<document id="b">
<related-entities>2 3 4</related-entities>
</document>
<document id="c">
<related-entities>5 6 7</related-entities>
</document>
<document id="d">
<related-entities>7 8 9</related-entities>
</document>
<document id="e">
<related-entities>10</related-entities>
</document>
</documents>
<entities>
<entity id="1"/>
<entity id="2"/>
<entity id="3"/>
<entity id="4"/>
<entity id="5"/>
<entity id="6"/>
<entity id="7"/>
<entity id="8"/>
<entity id="9"/>
<entity id="10"/>
</entities>
</data>

I'm going to create a hashmap (actually, a hashmultimap) from entity IDs to documents, and vice versa. Note that the keys have to be of an atomic value type, and the values have to be elements. You can't map, say, integers to strings. Alright: for any document I can get the list of entity IDs, and use them as keys pointing back to the document element. So if any two documents point to the same entity, those two documents (and, obviously, the entity as well) must belong in the same strongly-connected component. For a sequence of entity IDs I can create a set of document elements; I then have to determine if this set intersects with the set I've already accumulated. If yes, I union both sets and recurse; otherwise I spit out the subgraph, pop the stack, and resume with the next, as-yet-unselected document.

<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="subgraphs">
<xsl:key name="linked-documents" match="/data/documents/document" use="tokenize(related-entities, ' ')"/>
<xsl:key name="linked-entities" match="/data/entities/entity" use="key('linked-documents', @id)/@id"/>
<xsl:template match="document">
<document id="{@id}"/>
</xsl:template>
<xsl:template name="create-subgraph">
<xsl:param name="subgraph" as="element(document)*"/>
<xsl:param name="selection" as="element(document)*" />
<xsl:message select="string-join($subgraph/@id, ',')" />
<xsl:message select="string-join($selection/@id, ',')" />
<xsl:choose>
<!-- If there are no more elements to select from, the subgraph is complete (even if empty). -->
<xsl:when test="not($selection)">
<subgraph>
<xsl:apply-templates select="$subgraph"/>
</subgraph>
</xsl:when>
<xsl:otherwise>
<xsl:variable name="linked-entities" as="element(entity)*" select="for $d in $subgraph return key('linked-entities', $d/@id)"/>
<xsl:message select="string-join($linked-entities/@id, ',')" />
<xsl:variable name="subselection" select="$selection[key('linked-entities', @id) intersect $linked-entities]"/>
<xsl:message select="string-join($subselection/@id, ',')" />
<xsl:choose>
<xsl:when test="$subselection">
<xsl:call-template name="create-subgraph">
<xsl:with-param name="selection" select="$selection except $subselection"/>
<xsl:with-param name="subgraph" select="$subgraph|$subselection"/>
</xsl:call-template>
</xsl:when>
<xsl:otherwise>
<subgraph>
<xsl:apply-templates select="$subgraph"/>
</subgraph>
<xsl:variable name="new-selection" select="$selection except $subselection"/>
<xsl:call-template name="create-subgraph">
<xsl:with-param name="selection" select="$new-selection[position() gt 1]"/>
<xsl:with-param name="subgraph" select="$new-selection[1]"/>
</xsl:call-template>
</xsl:otherwise>
</xsl:choose>
</xsl:otherwise>
</xsl:choose>
</xsl:template>

<xsl:template match="/">
<subgraphs>
<xsl:call-template name="create-subgraph">
<!-- Make the first document element an SCC. -->
<xsl:with-param name="subgraph" select="/data/documents/document[1]"/>
<xsl:with-param name="selection" select="/data/documents/document[position() gt 1]"/>
</xsl:call-template>
</subgraphs>
</xsl:template>
</xsl:stylesheet>


In the end, sure enough, I get:

<subgraphs xmlns:fn="subgraphs" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<subgraph>
<document id="a"/>
<document id="b"/>
</subgraph>
<subgraph>
<document id="c"/>
<document id="d"/>
</subgraph>
<subgraph>
<document id="e"/>
</subgraph>
</subgraphs>

Yes, defining one mapping in terms of another is allowed in XSLT 2.0. The problem I had at work was actually more complicated: the entities were in entirely separate XML documents. The trick there, which cost me some sweat, was to define one mapping, which pointed from a document immediately to other documents. However, I'll put off that explanation until someone asks for it.

Friday, July 18, 2008

How to Convert Julian Dates to Gregorian

I'm writing Java / Jython / Groovy these days, not F#, at a new job, and haven't had time to maintain this blog, but I did want to post one little snippet that I was surprised not to find elsewhere. I've had to ingest 10s of 1000s of records from a customer's Microsoft Access database, and the dates are represented in Julian format. Not relative to the Julian calendar; I mean the number of days since a particular Day 0. For Microsoft Access, Day 0 is December 30, 1899. If Microsoft Access interests you, read up on it here.

Anyway, there are quite a few Web pages out there describing how to do it in Excel or T-SQL, but none describing how to do it in Java. Actually, it's ridiculously easy, which doesn't say much for me. Here's the solution in Jython:

from java.util import *

def convertJulian(julian):
calendar = Calendar.getInstance()
# Why in God's name are months in Java 0-based?
# This is December 30, 1899.
calendar.set(1899, 11, 30)
calendar.add(Calendar.DATE, julian)
return calendar.time

D'oh! Set Day 0, add the days, and you're good to go. Now in Java:

private static Date convertJulian(final int julian) {
Calendar calendar = Calendar.getInstance();
calendar.set(1899, 11, 30);
calendar.add(Calendar.DATE, julian);
return calendar.getTime();
}

Finally, as XSLT (you'll need an XSLT 2.0-compliant processor for this):

<xsl:function name="local:convert-access-date" as="xs:date">
<xsl:param name="access-date" as="xs:integer"/>
<xsl:variable name="days-since-zero" as="xs:dayTimeDuration"
select="xs:dayTimeDuration(concat('P', $access-date, 'D'))"/>
<xsl:sequence select="$zero + $days-since-zero"/>
</xsl:function>

Saturday, May 24, 2008

John Cage for Children

Imagine my surprise when I came home from work a few days ago to find this album blasting from my Meadowlark speakers. I did buy them to listen to art music, not Wiggles DVDs, but I didn't expect my toddler to rummage through my CDs and pick this one to play. Kids today grow up so fast!






Here's the album I want him to listen to though, because I want to train him to be normal. Komar and Melamid, those jokesters, have assembled musical material that people say they like, result in "a musical work that will be unavoidably and uncontrollably 'liked' by 72 ± 12% of listeners ((standard deviation; Kolmogorov-Smirnov statistic)." I have to admit that I prefer the composition that only 200 people in the world are supposed to like; it sounds to me like John Cage sounds to those other people. You can listen to recordings of both here. The unpleasant recording had to be fun to make; the pleasant one sounds like a bored Alicia Keys cover band playing at an airport lounge.

Sunday, May 04, 2008

Counting Change Again?

Problem 76 at Project Euler is really just the "change counting" problem: how many ways can you count out 100 cents if you have coins in denominations from 1 to 99? However, when I plug those values into my previous implementations of the change counting algorithm, it runs forever...well, maybe not forever, but for hours and hours, while using 100% of my RAM. OK, so that's not going to work. At first I thought that the problem was simply that too many of the answers were in memory (i.e. sequences of coins), so I changed the implementation, simply to count possible answers:

let rec countChange =
match coins, amount with
¦ _, 0 -> 1L
¦ [], _ -> 0L
¦ h::t, _ -> [ 0..h..amount ]
¦> List.map (fun amt' -> countChange t (amount - amt'))
¦> Seq.fold1 (+)
Let's review this. How many ways are there to make 0 in change, whatever the available coins? There's 1 way. How many ways are there to make some other quantity in change if you have no coins? 0. Finally, how many ways are there to make C in change if you have a a list of coins h::t? Partition the problem thus: use the coin h to pay out a partition of C (obviously you can use coin h up to C/h times), then make up the remaining partition of C with the rest of the list. If, to solve each subproblem, you're dividing C by the denomination of a coin, then the size of the problem space is exponential in the number of denominations: (C/denomination1) * (C/denomination2) * ... * (C/denominationn) = O(Cn). You can't exactly tell from this example, but the rate of growth here is clearly superpolynomial:

> countChange [ 4; 5 ] 20;;
val it : int64 = 2L
> countChange [ 3; 4; 5 ] 20;;
val it : int64 = 6L
> countChange [ 3; 4; 5; 6 ] 20;;
val it : int64 = 11L
> countChange [2; 3; 4; 5; 6 ] 20;;
val it : int64 = 47L
>
Anyway, it should be clear why there's no point in waiting for the solution to countChange [ 1..99 ] 100. However, I can memoize the change counting algorithm, since it's deterministic: for a given amount, and a given set of coins, the answer is always the same (commerce would be hopeless otherwise). This turns out to be fairly easy, notwithstanding the two arguments to the function. Here's my answer:

open System.Collections.Generic;;

let memoize f =
let cache = Dictionary<_,>()
fun x y ->
if cache.ContainsKey((x, y)) then cache.[(x, y)]
// Remember that f will always consult the memo
// for subproblems.
else let result = f x y
cache.[(x, y)] <- result
result

let rec countChange =
// Turn this into a function that returns a function.
// The *returned* function will be evaluated over the arguments.
let countChange' coins amount =
match coins, amount with
¦ _, 0 -> 1L
¦ [], _ -> 0L
¦ h::t, _ -> [ 0..h..amount ]
¦> List.map (fun amt' -> countChange t (amount - amt'))
¦> Seq.fold1 (+)
memoize countChange'
The answer shows up in a few seconds. Since F# lists are immutable, they can easily be compared for reference equality, rather than in O(n2) time. That is, the initial list [ 99..(-1)..1 ] is a 99 prepended to the list [ 98..(-1)..1 ], and so on. We partition the problem into subproblems corresponding to the head and tail of the list, but no additional lists are created beyond the argument to the function (there's only one list beginning with 50, ever, and it's the tail of the list beginning with 51). Microsoft.FSharp.Collections.List<_>.CompareTo() sees literally the same lists again and again when called from memoize(). Hence the lickety-split performance.

Perhaps this could all be done imperatively, though I don't how OOP is at all applicable to this sort of classic optimization problem (what could justify a Coin class here)? If you haven't read Daniel Friedman's "The Role of the Study of Programming Languages in the Education of a Programmer," here's your chance. As my buddy George Paci puts it, "The title is clearly written by a professor; a marketing guy would have called it something like, 'Holy Crap! It's Amazing What You Can Do with Program Transformations!'"

Wednesday, April 30, 2008

Agent Less Than Zero

I just started a new job, and I'm writing a lot of Java all of a sudden. Hey, maybe I'll learn Scala and piss off my new colleagues just as I was pissing people off with F# a few weeks ago! Or maybe I'll put off blogging about F# for a few weeks until I get settled in. I have been using F# to solve Project Euler problems, but I feel guilty blogging about those, since I'd have to give away answers.

In the meantime, I have to agree with Charles Barkley, who said, "I think the Washington Wizards have got to be the dumbest team in the history of civilization." I think that Caron Butler is a smart player, mind you, and he *should* have had the ball in his hands at the end of tonight's game.

Wednesday, April 02, 2008

Memoization, etc.

I've been sweating over some problems at Project Euler, and I've verified that I'm no mathematician. Some enterprising Haskell programmers got there way before I did, and if I were inclined to cheat, I'd simply translate their solutions into F#. However, in the effort or out of the necessity to remain pure, they don't use for-loops over arrays, or their own memoization, which I'd almost certainly do for any production code that had to calculate the Fibonacci series...well, actually, I don't do that kind of work. Anyway...

I did find the Haskell samples instructive, though I don't want to get sidetracked from my project of learning F#. To this end I've once again resorted to Don Syme's thoughts on memoization: here on the CAML list, and here more recently, where he avails himself of some popular .NET collections. So a memoized Fibonacci function in Dr. Syme's older formulation looks like this:

> #light;;

> let cache f =

- let table = ref [] in

- fun n ->

- try

- List.assoc n !table

- with Not_found ->

- let f_n = f n in

- table := (n, f_n) :: !table;

- f_n

-

-

- let rec fib_mem =

- cache (function

- // You'd better return a bigint, not an int!

- ¦ 0 -> 0I

- ¦ 1 -> 1I

- ¦ n -> fib_mem (n - 1) + fib_mem (n - 2))

-

-

- ;;



val cache : ('a -> 'b) -> ('a -> 'b)

val fib_mem : (int -> bigint)


> fib_mem(1000);;

val it : bigint

= 434665576869374564356885276750406258025646605173717804024817290895365554179490

51890403879840079255169295922593080322634775209689623239873322471161642996440906

533187938298969649928516003704476137795166849228875I

val ns : seq
Several problems at Project Euler ask you to do something with the first 10 or last 10 digits of some big integer; for anyone who was wondering how to do that with a number such as the one above, here's one way:

> fib_mem(1000) |> Seq.unfold (fun n -> if n > 0I
- then
- let quotient,remainder = BigInt.divmod n 10I
- Some (remainder, quotient)
- else
- None);;
val it : seq = seq [5I; 7I; 8I; 8I; ...]
>
You'll notice that the least significant digits appear first, which is of course by design, since dividing by 10 is probably cheap. The last thing I'd want to do is reverse this sequence, i.e.

- |> List.of_seq
- |> List.rev;;
I could clean this up further by applying BigInt.to_Int32 to the generated elements, etc. To sum the first 10 digits (note that this is *not* the Project Euler problem; I'm not about to give away any answers) I could go further, like so:

- |> Seq.take 10
- |> Seq.fold1 (+);;
You get the idea.

Tuesday, March 11, 2008

Munging with Active Patterns

Have you checked out active patterns yet? C'mon! Here's what the code below might have looked like if I'd used them:

> open System;;

> open System.Text.RegularExpressions;;

> let re = new Regex("^(MON¦TUE¦WED¦THU¦FRI¦SAT¦SUN)", RegexOptions.Compiled);;

val re : Regex
> let (¦DataRow(¦_¦) (s : string) =
- if re.IsMatch s
- // Here I'd want to break the line down.
- then Some(1, 2, "Hello!")
- else None

val ( ¦DataRow¦_¦ ) : string -> (int * int * string) option
As I said earlier, Seq.choose will through out None, and I could then handle different kinds of data rows from the input:

> match "TUE" with
- ¦ DataRow(x, y, s) -> Console.WriteLine(s)
- // Additional cases here...
- ¦ _ -> Console.WriteLine("Got bupkes!");;
Hello!
val it : unit = ()

Monday, March 10, 2008

Munging Data in F#, Again

I couldn't help myself. I was recently asked to take a coding test to prove my fitness for a potential new job, and I figured I could do it more quickly in F#, interactively. I wouldn't have to fire up Visual Studio, I wouldn't have to create an executable project with NUnit tests and debuggability and all that ceremony. But really, my dabblings in functional programming have taught me that decomposition into classes is not always the natural way to decompose a problem. Why do I need any classes at all to do this exercise?

> #light;;
> open System.Text.RegularExpressions;;
> let r = new Regex("^(MON¦TUE¦WED¦THU¦FRI¦SAT¦SUN)", RegexOptions.Compiled);;

The line must start with one of these abbreviations. Hey, Blogger keeps taking out the vertical bars between the abbreviations; if they're missing, don't get distracted. Then I want to turn the file into a sequence of strings, i.e. IEnumerable<string>. A "use" as opposed to a "let" binding ensures that the Close() or Dispose() is finally called. See Don Syme's blog for examples (this example, actually, which I simply copied). It's more or less like a C# function that returns IEnumerable<_> by means of "yield return."



> let reader =
- { use reader = new StreamReader(File.OpenRead(@"C:\...\input.txt"))
- while not reader.EndOfStream do
- yield reader.ReadLine() };;
// Filter out the lines that aren’t data rows.
> let filtered = reader ¦> Seq.filter (fun line -> r.IsMatch line);;
> open System;;

Now decompose the line into the parts we’re interested in, based on the (assumed) fixed format. If you want to get your mind blown by "active patterns," dig my man Don Syme.



> let (¦Record¦) (line : string) =
- let transientSold = Int32.Parse(line.Substring(67, 7))
- let definite = Int32.Parse(line.Substring(25, 3))
- let tentative = Int32.Parse(line.Substring(19, 3))
- let date = line.Substring(5, 11)
- // Return the tuple of interesting values.
- date, transientSold, definite, tentative;;
// This should really be embedded in a function!
> open System.Xml;;
> let xml = XmlWriter.Create("output.xml");;
> xml.WriteStartDocument();;
> xml.WriteStartElement("Sample");;

Now pass the filtered lines through a function that decomposes the line into fields we care about, and output an XML element per. Seq.iter applies a function with side-effects but no return value to every element in a sequence.



> filtered ¦> Seq.iter (fun line ->
- match line with
- ¦ Record(date, ts, d, t) -> xml.WriteStartElement("Thing")
- xml.WriteElementString("Date", date)
- xml.WriteElementString("TransientSold", XmlConvert.ToString(ts))
- xml.WriteElementString("CommitmentsDefinite", XmlConvert.ToString(d))
- xml.WriteElementString("CommitmentsTentative", XmlConvert.ToString(t))
- xml.WriteEndElement()
- );;
val it : unit = ()
// Likewise, this should have been put into the function I inlined above.
> xml.WriteEndElement();;
val it : unit = ()
> xml.Close();;
val it : unit = ()

I could have made some different choices here. First of all, opening an XmlWriter in one function, or directly from the command line, then closing it the same way, is rather ugly. If a function is the unit of encapsulation in functional programming, then one function should really own the writer via a use-binding. So I could do something like this:


filtered ¦> (fun lines -> 
use xmlWriter = XmlWriter.Create("output.xml")
xmlWriter .WriteStartDocument()
xmlWriter .WriteStartElement("Sample")
// Yes, it's a loop!
for line in lines do
// Create each element...
)

The following may be too cute:


filtered  ¦> Seq.fold (fun () -> let xmlWriter = ...
xmlWriter.WriteStartDocument()
// additional initialization
xmlWriter
)
(fun writer line ->
// Add an element to the XmlWriter.
writer
)
¦> (fun writer -> writer.Close())

A significant change would have been to use an active pattern to both look for lines of interest and decompose them into interesting tuples. In this case I either keep the line or throw it out, but there might be several data row formats that I care about. In this event I could use active patterns to discriminate between them, and embed any regular expressions in the pattern function. Furthermore, I could use Seq.choose instead of Seq.filter, because the former passes over None. If I get a chance tonight, I'll write that code out.

Saturday, February 02, 2008

Munging some XML with F# & Linq

I recently wanted to change the format of a test script, and canonicalize a flattened listing of all the regression test models. So I set myself the goal of doing it in F#, which would force me to use some of the language features I'd only read about, and possibly prod me to use some LINQ as well. I could have done this transformation manually in about an hour, I admit. And I learned that if I really want to transform a lot of XML to XML, I'd rather do it via XML, i.e. XSLT. However, Microsoft doesn't offer XSLT 2.0 compliancy, so I have less to gain from learning to do it that way (meaning, I'm stuck with the .NET XsltCompiledTransform implementation at work). What I would have needed is the tokenize function (to split up pathnames on the '\\' character), XSLT functions that could return Boolean values, and the ability to select groups. Anyway, that's what I did below.

#light

open System
open System.Xml
open System.Linq
open System.Xml.Linq
open Microsoft.FSharp.Linq
open Microsoft.FSharp.Xml.Linq
open Microsoft.FSharp.Linq.SequenceOps
open Microsoft.FSharp.Xml.Linq.SequenceOps

let doc = XDocument.Load @"knownSolutions.xml"

// Simply tagged aliases for these various tuples. I want to create
// a tree of these, then finally convert them back to XML.
type Node =
| File of string * XElement
| Directory of string * Node list

// the temporary data structure: I'll create a list of these from
// the initial XML, then recursively group them into directories.
type Path = string list * XElement

let make_path (e : XElement) =
e.Element(xname "File") |> element_to_string |> String.split [ '\\' ], e

// I want to split up the path names, while carrying the original
// element around until I'm ready to use it.
let paths = doc.Root.Elements(xname "Model")
|> Seq.map make_path
|> Seq.to_list // There's no Seq.partition, so...

let is_file (p, _) =
(List.length p) = 1

let strip (p, e) =
// Strip the next part of the path, and return this tuple.
List.tl p, e

let separate (l : Path list) =
List.partition is_file l

let rec collect (l : Path list) =
match l with
| [] -> []
| _ -> let files, directories = separate l
let x = directories
|> Seq.groupBy (fun (l, _) -> List.hd l)
|> Seq.map make_directories
|> Seq.to_list
(make_files files) @ x
and make_directories (l, r) =
Directory ( l, r |> Seq.map strip |> Seq.to_list |> collect )
and make_files paths =
paths |> List.map (fun (l, e) -> File (List.hd l, e))

let collected = paths |> collect

let make_xml l =
// Yes, you need the "let rec" here, too. That took me a while to
// figure out. Otherwise these nested functions can't call each other.
let rec make_file name (xml : XElement) =
// If you want to mix XElement and XAttribute objects, you'll have
// to upcast them to XObject. F#'s type inference won't try to find the
// lowest common denominator between two classes in a heterogeneous
// collection, as far as I can tell.
XElement(xname "File", xargs [XAttribute(xname "Name", name); XAttribute(xname "ObjectiveValue", xml.Element(xname "Solution") |> element_to_string)])
and make_directory name nodes =
let element = XElement(xname "Directory", xargs (nodes |> List.map make_xml'))
element.SetAttributeValue(xname "Name", name)
element
and make_xml' e =
match e with
| File (name, element) -> make_file name element
| Directory (name, element) -> make_directory name element
match l with
| [] -> []
| _ -> List.map make_xml' l

let xml = XElement(xname "KnownSolutions", make_xml collected)
let s = xml.ToString()
Console.WriteLine(s)

What's an F# Tuple?


It's hardly a difficult question, but it's taken me a while to arrive at an answer. Reflector has been invaluable to me, and I can't recommend it strongly enough to anyone who wants or needs to understand what the F# compiler does. One often hears that functional programming more readily lets one find the proper level of abstraction, but sometimes one has to look at the IL, perhaps as a bridge to C# (the "canonical"
.NET language), to improve.


Let's say I define these tuple types:


type StringInt = string * int
type StringDouble = string * double



If I disassemble the resulting DLL, I find...nothing at all! There's no trace in the IL of these tuple types, or anything else:
public static void main()
{
}



Let me add some functions that operate on this code:
let get_int (t : StringInt) =
let _, n = t in n

let get_float (t : StringFloat) =
let _, f = t in f


Something interesting now shows up in Reflector's tree view of the
assembly:


main() then looks like this (screenshots made using Cropper):

I can't claim that all of this is comprehensible to me, but one thing is obvious: the compiled IL has no trace of the named tuple types. They're like structs insofar as they're equal when they have equal structure. Tuple types with the same structure are really the same type. They are not implemented as structs (presumably to avoid having to copy them to the stack on every recursion), but they're implemented as classes. Sealed classes, mind you:



The disassembled get_int() function looks like this; I don't know why a new instance is being allocated from the heap:
Interestingly, if I change my code thus, I can no longer call get_int() and get_float() with the resulting values, because the "discrimated union" Tuples is implemented as a class, rather like a C-style union with type tag and some smarter operations (the disassembled IL is much too verbose to reproduce here, but I recommend trying it):


type Tuples =
| One of StringInt
| Two of StringFloat

let si = One ("Carlo", 6)
let sf = Two ("Leo", 1.5)

Console.WriteLine (si)
Console.WriteLine (sf)


Note that I don't have to specify the tuple type; the compiler infers if from the tag "One" or "Two". What I've learned from Don Syme's Expert F# is that the tag most recently put in scope is the one that the compiler tries to use. Then the disassembled code looks like this, meaning that you can no longer get at the tuple except through the tag:



Tuesday, January 29, 2008

Cute Techie Kids!


I don't really have any excuse for posting this photo of Leo (19 mos. old) sleeping with a bottle dangling from his lips. You wanna make something of it?

I Couldn't Be Happier!

From slate.com:

A study suggests extreme happiness may be bad for you. Findings: 1) "The highest levels of income, education and political participation were reported not by the most satisfied individuals, but by moderately satisfied individuals." 2) Extremely happy people "earned significantly less money" and earned lower school grades than moderately happy people. 3) They "may not live as long," either. Theories: 1) Happiness makes you complacent and kills your drive. 2) It makes you slow to adapt. 3) It makes you too optimistic and insufficiently vigilant about your health. 4) It may overstimulate your cardiovascular system. Researchers' conclusions: 1) "Happiness may need to be moderated for success." 2) "Extremely high levels of happiness might not be a desirable goal." Human Nature's conclusions: 1) Success may need to be moderated for happiness. 2) Extremely high levels of success might not be a desirable goal.

Sunday, January 20, 2008

Huffman Coding in F#

#light // I.e. lightweight syntax

namespace Nassar
// An F# module is a "static class." All the declarations
// that follow are static members of this class.

module Huffman

type Incidence = HashMultiMap<char, int>
type CharSet = Set<char>

let addToCount (d : Incidence) (c : char) =
// For some reason, the "do" is necessary here.
do match d.TryFind c with
Some(n) -> d.Replace(c, n + 1)
_ -> d.Add(c, 1)
d // Return the dictionary.

let count (s : string) =
// I only just noticed that System.String implements
// IEnumerable. The F# compiler correctly inferred from
// the type of Incidence that s has to be a sequence of
// char.

s |> Seq.fold addToCount (Incidence.Create())

type HuffmanNode =
Leaf of HuffmanLeaf
Tree of HuffmanTree
and HuffmanLeaf = {Count: int;
Char: char;}
and HuffmanTree = {Count: int;
Chars: CharSet;
Left: HuffmanNode;
Right: HuffmanNode;}

let get_count node =
match node with
Leaf l -> l.Count
Tree t -> t.Count

let has_char node c =
match node with
Leaf l -> l.Char = c
Tree t -> t.Chars.Contains c

let get_set node =
match node with
Leaf leaf -> CharSet.Singleton leaf.Char
Tree tree -> tree.Chars

let merge_nodes left right =
Tree { Count = get_count left + get_count right;
Chars = (get_set left) + (get_set right);
Left = left;
Right = right }

let node_priority l r =
get_count l - get_count r

let rec reduce_list l =
// What I should really do is put all the new leaf nodes
// into a priority queue, and pull off two at a time as
// long as the length of the queue is > 1. Then
// make_tree wouldn't have to convert the sequence
// it creates from the dictionary into a list.

match l with
l::r::tail -> (merge_nodes l r)::tail
|> List.sort node_priority
// ...and recurse.
|> reduce_list
[tree] -> tree
[] -> failwith "Can't operate on empty list!"

let rec encode h s =
// Encode a character by branching to the leaf node
// with that character.

let rec encode' node c =
match node with
Leaf l -> []
// TODO Add an accumulator to make this tail-recursive,
// rather than prepending the new element.

Tree t -> if has_char t.Left c then 0::(encode' t.Left c) else 1::(encode' t.Right c)
s > Seq.map (fun c -> encode' h c) > Seq.concat

// Decode the int list l using the tree h.
let rec decode h l =
let rec decode' node l =
match node, l with
// If we've arrived at a leaf, this must be the character
// we're looking for. Add it to the sequence, and resume
// from the top of the tree.
// TODO Add an accumulator to make this tail-recursive.

Leaf leaf, _ -> leaf.Char :: (decode h l)
// A 0 in the stream means a left branch during
// the encoding, and v.v.

Tree tree, n::tail -> if n = 0 then decode' tree.Left tail else decode' tree.Right tail
// Something was wrong in the encoding.
_, _ -> failwith "Something has gone very wrong."
match l with
[] -> []
// Start navigating down the tree.
_ -> decode' h l

let make_tree s =
let d = count s
// IDictionary<_, _> implements
// ICollection<KeyValuePair<_,_>>, so enumerate the
// pairs and convert each into a leaf node.

let leaves = d |> Seq.map (fun pair ->
Leaf { Count = pair.Value;
Char = pair.Key }; )
|> Seq.to_list
|> List.sort node_priority
reduce_list leaves

Wednesday, January 16, 2008

Huffman Coding, SICP Exercise 2.69

As a convenience, I defined Incidence as System.Collections.Generic.Dictionary. F#'s Map type
is actually a functor, and each insertion or removal returns a new instance. What I'd really like is something like the FindOrAdd() method available in the C5 dictionary implementations, but I can live without it. Anyway, I intend to fold a sequence of characters into a dictionary, for which purpose I'll first define the function that will do the folding:

> let count (d : Incidence) (c : char) =
- let n = ref 0
- // I need a "ref" variable to serve as an "out" parameter.
- if d.TryGetValue(c, n) then
- // Assign to a mutable property...
- d.[c] <- !n + 1
- else
- d.Add(c, 1)
- d;;
val count : Incidence -> char -> Incidence

I'm getting good at this:

> let d = "Tony Nassar, Software Engineer" |> Seq.fold count (new Incidence());;

val d : Incidence

> d;;
val it : Incidence
= dict
[('T', 1); ('o', 2); ('n', 3); ('y', 1); (' ', 3); ('N', 1); ('a', 3);
('s', 2); ('r', 3); (',', 1); ('S', 1); ('f', 1); ('t', 1); ('w', 1);
('e', 3); ('E', 1); ('g', 1); ('i', 1)]
>
I did not know until today that System.String implements IEnumerable! And, fortunately, F#'s type inference lets me pipe that sequence of characters into my count() function. There are two steps that I need to take now: I need to create a sequence sorted by frequency (actually, a mapping from frequency to character, where the frequencies don't have to be unique) and turn it into a tree. This took me a while. Unlike C#, F# only allows mutually recursive type definitions if they're linked with "and." Here I'm declaring HuffmanNode as a "discriminated union," where Leaf and Tree are implemented as tags (you can verify this by using Reflector to examine the compiled DLL), and a leaf consists of the character to be encoded and its frequency (the latter is necessary during the construction of the tree), and a tree (or any non-leaf node) is...well, it's kinda obvious. If this were C, and someone were using a union and a type code, I'd say "Yuck!" But since it's functional programming, and this is a very handy way to build up a tree, it's suddenly cool again.

> type HuffmanNode =
- | Leaf of HuffmanLeaf
- | Tree of HuffmanTree
- and HuffmanLeaf = { Count : int; Char : char }
- and HuffmanTree = { Count : int; Chars : CharSet; Left : HuffmanNode; Right : HuffmanNode };;

type HuffmanNode =
| Leaf of HuffmanLeaf
| Tree of HuffmanTree
and HuffmanLeaf = {Count: int;
Char: char;}
and HuffmanTree = {Count: int;
Chars: CharSet;
Left: HuffmanNode;
Right: HuffmanNode;}

>

Tuesday, January 15, 2008

Eight Queens: I couldn't let it go!

If you're not used to functional programming, and I'm not, you may respond with irritation at an FPer's "How few lines of code can I use to solve this problem?" showpiece. A good example would be the textbook problem of transposing a matrix (for which the answer is posted on Jomo Fischer's blog). If you really need to transpose a lot of matrices, you're probably going to be working in an environment where such operations are optimized, e.g. Mathematica. In other words, you're not going to do it yourself. So if anyone is irritated with the following code, let me say that I had to ponder this (on and off, not continuously!) for a few months before I arrived at the "right" solution, and I do feel that it's right. However, trying to implement it in F# gave me some valuable insights into how it ought to be done in other languages. Here's my solution:
> let queens size =
- let range = [1..size]
- range |> Seq.fold (fun answers column ->
- let positions = [for row in range -> (column, row)]
- [for p in positions for a in answers when all_safe p a -> a@[p]]) [[]]
- ;;

val queens : int -> (int * int) list list

> queens 5;;
val it : (int * int) list list
= [[(1, 4); (2, 2); (3, 5); (4, 3); (5, 1)];
[(1, 3); (2, 5); (3, 2); (4, 4); (5, 1)];
[(1, 5); (2, 3); (3, 1); (4, 4); (5, 2)];
[(1, 4); (2, 1); (3, 3); (4, 5); (5, 2)];
[(1, 5); (2, 2); (3, 4); (4, 1); (5, 3)];
[(1, 1); (2, 4); (3, 2); (4, 5); (5, 3)];
[(1, 2); (2, 5); (3, 3); (4, 1); (5, 4)];
[(1, 1); (2, 3); (3, 5); (4, 2); (5, 4)];
[(1, 3); (2, 1); (3, 4); (4, 2); (5, 5)];
[(1, 2); (2, 4); (3, 1); (4, 3); (5, 5)]]
> queens 4;;
val it : (int * int) list list
= [[(1, 3); (2, 1); (3, 4); (4, 2)]; [(1, 2); (2, 4); (3, 1); (4, 3)]]

First, I did not use a for loop. Not because that would just be too imperative, but because I realized that the range (e.g. from 1..8, for an 8 x 8 chessboard) was actually an (immutable) object that I needed to use in several ways. To generate all the positions on a chessboard, I would then use a "sequence comprehension" such as { for column_number in range for row_number in range -> (column_number, row_number) }. I've begun to see this as more natural than nested for loops. I'm not doing with side effects in a loop; I'm asking for a single object (a collection of board positions) to be created for me. So the comprehension, a declarative construct, is more natural: "Here's what I want; give it to me. I'm not interested in putting it together piece by piece; that's your job." In comparison, the idiom 1, create a list, 2, go through several loops, appending on each innermost iteration, 3, finally do something with the collection (which is probably mutable, even though it really shouldn't be), obscures the intention, and is error-prone.

Then I iterate over that range (now treating it as the sequence of column numbers), and for every column I take the sequence of answers so far (i.e., each such answer is a sequence of one position each from the columns I've seen so far) and extend each one with each position in the current column (i.e. the current element of the range) that's safe. Below, I defined the functions safe() and all_safe(). I can easily generate board positions for the current column by means of another comprehension (let positions = etc.). Then I take the Cartesian product of the answers so far and this new column, filter all these new candidates through all_safe, and pass this new collection on to the next iteration, via fold(). For obvious reasons, Seq.fold starts with the collection of known 0-length answers, which is of course [[]].

If I were (re)doing it in Scheme, I'd rip off Kent Dybvig's comprehension macro, first off. I'd also rip off an implementation of fold, if I didn't already have one. I'd generate a range 1..n just as I do above. Then, on every iteration (I believe it is an iteration, if I'm folding from the left) I create the Cartesian product of the answers so far and the current column (which I just generated), (filter) these pairs through (all-safe), and continue. Something like that.

Eight Queens, and an End

Once again I was led astray by having done too much imperative programming. I hate nesting loops to create collections, but I didn't go all the way. If the columns and rows of the chessboard are represented as a range—the same range—then all the positions are produced by taking the Cartesian product of that range with itself. Then I can iterate across the columns, and, for each one, take the Cartesian product of the rows in that column and the answers so far, and fold the new answers into a collection that was empty to start with. Ah, fold, delight of all functional programmers. To wit:

> let queens size =
- // the row numbers *and* the column numbers
- let range = [1..size]
- let queens' columns =
- columns |> Seq.fold (fun answers column ->
- let positions = [for row in range -> (column, row)]
- [for p in positions for a in answers when all_safe p a -> a@[p]])
- [[]]
- queens' range;;

val queens : int -> (int * int) list list

> queens 4;;
val it : (int * int) list list
= [[(1, 3); (2, 1); (3, 4); (4, 2)]; [(1, 2); (2, 4); (3, 1); (4, 3)]]
>