Friday, August 21, 2009

Converting Oracle TIMESTAMP WITH TIME ZONE to Current TZ

I see answers to this question all over the Web, usually involving some arithmetic with abbreviations such as "EDT", the use of hacky Oracle extension functions such as NEW_TIME, etc. Forget about that. Here's how you do it: use CAST(). If this seems pedantic, it's not. I recently had to deal with an Oracle database in which times were stored as epoch times, i.e. milliseconds since January 1, 1970 UTC (e.g. new java.util.Date().getTime()). So you can get the beginning of the epoch in Oracle SQL like so:

SELECT TIMESTAMP '1970-01-01 00:00:00 +00:00' FROM DUAL;

That'll get you a timestamp in terms of Greenwich Mean Time (GMT or UTC):

SELECT TO_CHAR(TIMESTAMP '1970-01-01 00:00:00 +00:00', 'TZR') FROM DUAL;

If I've got "epoch time" in ms, I can easily convert it to a timestamp relative to UTC:

SELECT TIMESTAMP '1970-01-01 00:00:00 +00:00' + NUMTODSINTERVAL(epoch_time / 1000, 'SECOND') FROM my_table;

How do I then display, and especially export these values relative to my current time zone? Casting to TIMESTAMP WITH LOCAL TIME ZONE doesn't quite work; by default, the time zone isn't displayed for this column, and the 'TZR' format doesn't work on it (I get an Oracle error, which confused me. So CAST() them to TIMESTAMP WITH LOCAL TIME ZONE, then CAST() to TIMESTAMP WITH TIME ZONE. Oracle does all the work, and I don't have to hard-code any information in the query about my time zone, or rely on EXTRACT(). So:

SELECT CAST(CAST(TIMESTAMP '1970-01-01 00:00:00 +00:00' AS TIMESTAMP WITH LOCAL TIME ZONE) AS TIMESTAMP WITH TIME ZONE) FROM DUAL;





Wednesday, July 22, 2009

An XSLT to Extract Text from Word

Well, this turns out to be even easier. If you've got the text:


<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
version="1.0">

<xsl:output method="text" encoding="UTF-8" />

<xsl:template match="*">
<!-- Simply recurse over the children. -->
<xsl:apply-templates />
</xsl:template>

<!-- Any piece of text from the .docx is enclosed in a w:t element. -->
<xsl:template match="w:t">
<xsl:value-of select="."/>
<!-- Look for the xml:space attribute. A namespace is not necessary. -->
<!-- XPath 1.0 doesn't have the starts-with() and ends-with() functions, unfortunately. -->
<xsl:if test="@space = 'preserve'">
<xsl:text> </xsl:text>
</xsl:if>
</xsl:template>

<xsl:template match="w:p">
<xsl:apply-templates />
<xsl:text>&#xa;</xsl:text>
</xsl:template>

</xsl:stylesheet>

To get the text, here's some Groovy:



def createDocumentXml(f) {
def zip = new ZipFile(f)
def entry = zip.getEntry('word/document.xml')
assert entry
def xml = new File(f.absolutePath - '.docx' + '.xml')
// I want to copy the XML to a file, for purposes of visual inspection.
// In production code I'd simply return the stream.
zip.getInputStream(entry).withStream { i ->
xml.withOutputStream { o ->
o << i
}
}
return xml
}

To apply the XSLT to the document, see the attachments.

Saturday, May 02, 2009

Extracting the Text of an HTML Document

This is something I often have to do within an XSLT: there's some base-64 encoded HTML in a text node, and I'd like to extract the body text. Saxon offers saxon:parse() and saxon:base64Binary-to-string() that might be useful here. If you use the TagSoup parser to turn possibly nasty HTML into XHTML, then you can extract the text from the "thing" element with

<xsl:variable name="html" select="saxon:parse(saxon:base64-to-string(xs:base64(thing)))"/>
<xsl:value-of select="$html/xhtml:body//xhtml:*[local-name() != 'script']/text()" separator=""/>

Saxon's parse() function will fail if you don't direct Saxon (from the command line, or programmatically) to use the TagSoup parser. OTOH, you could resort to Groovy. Put TagSoup on your CLASSPATH, and:

import org.ccil.cowan.tagsoup.*

parser = new Parser()
// TagSoup offers loads of interesting options...check 'em out!
f = new File('C:\\Documents and Settings\\tnassar\\Desktop\\Efficient.html')

As we're reminded here, we can probably do without the namespace declarations. I quote:
  • name or "*:name" matches an element named "name" irrespective of the namespace it's in (i.e. this is the default mode of operation)
  • ":name" matches an element named "name" only id the element is not in a namespace
  • "prefix:name" matches an element names "name" only if it is in the namespace identified by the prefix "prefix" (and the prefix to namespace mapping was defined by a previous call to declareNamespace)
Anyway, force of habit:
html = new XmlSlurper(parser).parse(f).declareNamespace(xhtml: 'http://www.w3.org/1999/xhtml')

html.'xhtml:body'.depthFirst().findAll { it.name() != 'script' }*.text().join('\n')

Extracting the Text from a Word Document w/ Groovy

As I'm doing a lot of data munging these days, and often have to talk to Java APIs. Since I'm not writing production code, don't have lots of RAM or lots of good tools at my disposal and therefore would just as soon use a text editor (SciTE's my current preference...no, I can't get the Win32 port of emacs!), Groovy is definitely my best choice. Notwithstanding my preference for XSLT (over GPath) to handle XML, I can't deny that you can do some slick stuff w/ GPath. At another munger's request, I cooked this up in 5 minutes, and was almost shocked at how easy it was to get the text out of an Office 2007 docx file:

import java.util.zip.*

docx = new File('C:\\Documents and Settings\\tnassar\\My Documents\\Efficient.docx')
zip = new ZipFile(docx)
entry = zip.getEntry('word/document.xml')
stream = zip.getInputStream(entry)

// The namespace was gleaned from the decompressed XML.
wordMl = new XmlSlurper().parse(stream).declareNamespace(w: 'http://schemas.openxmlformats.org/wordprocessingml/2006/main')

// The outermost XML element node is assigned to the variable wordMl, so
// GPath expressions will start after that. To print out the concatenated
// descendant text nodes of w:body, you use:

text = wordMl.'w:body'.children().collect { it.text() }.join('')

println text
It would be nice if Groovy offered "raw strings" like Python--r'C:\Documents and Settings\...'--or C#, which lets you prepend a '@' to have backslashes treated literally--esp. when it comes to Windows pathnames, but whatever.

This will not work well for complex document formats (I can imagine that tables and such would be a disaster), but for me it was just enough.

Saturday, October 25, 2008

Efficient “Disjoint Sets” Implementation in XSLT

I actually had to revisit this problem with slightly different input, and once again I struggled. If I represent an undirected graph like this:

<?xml version="1.0"?>
<graph>
    <nodes>
        <node id="1"/>
        <node id="2"/>
        <node id="3"/>
        <node id="4"/>
        <node id="5"/>
    </nodes>
    <links>
        <link end1="1" end2="2"/>
        <link end1="1" end2="3"/>
        <link end1="1" end2="4"/>
        <link end1="1" end2="5"/>
    </links>
</graph>

…then it's a little harder to go from one node to its connected nodes, because the information I need is elsewhere in the XML document (i.e. not in a child node of the <node> element) . To index any node element, I'd have to navigate up the tree, and back down to the links. What's worse, I'd have to scan all the links for every node. So the solution involves some indirection. First I declare these two mappings:

<xsl:key name="left" match="/graph/links/link" use="@end1"/>
<xsl:key name="right" match="/graph/links/link" use="@end2"/>


 

Now I map each node (its ID, actually; remember that the key has to be of an atomic type) to its connected nodes by going through these mappings (such a recursive definition is not possible in XSLT 1.0, I believe). Note that for links that were indexed by @end1, I'm interested in the value of @end2, and v.v.


<xsl:key name="connected" match="node" use="key('left', @id)/@end2, key('right', @id)/@end1"/>


 

Finally, I can use this mapping:


 

<xsl:template match="/">
        <connectivity>
            <xsl:for-each select="graph/nodes/node">
                <node id="{@id}">
                    <connected-to>
                        <xsl:value-of select="key('connected', @id)/@id"/>
                    </connected-to>
                </node>
            </xsl:for-each>
        </connectivity>
    </xsl:template>


 

I believe that these are all the data structures I need. Now I can implement the "strongly connected components" algorithm as I did below, but using set operators on the nodes.

Friday, October 03, 2008

Graph Algorithms in XSLT

Notwithstanding that XSLT can be considered a functional programming language, you probably wouldn't choose to implement algorithms in it. Nonetheless, I recently had to split a very large XML document into smaller pieces (using xsl:result-document), and I wanted to redundancies between the outputs. So what I essentially wanted to do was separate elements in the input into "strongly-connected components." XSLT 2.0 provides lots of operators on sequences and sets, and allows you to define property functions, but what it does not lend itself to is the creation of new data structures. To be more precise, the data structures you'd need for depth-first search or Tarjan's disjoint sets would require you to copy elements into new structures. Easy, if you're working in a language that has pointer or reference equality; not possible in XLST. However, if you assume that the XSLT processor will use hash tables to represent sets, and you put that together with <xsl:key>, you can try a different approach.

Here's the input I want to process:

<?xml version="1.0"?>
<data>
<documents>
<document id="a">
<related-entities>1 2 3</related-entities>
</document>
<document id="b">
<related-entities>2 3 4</related-entities>
</document>
<document id="c">
<related-entities>5 6 7</related-entities>
</document>
<document id="d">
<related-entities>7 8 9</related-entities>
</document>
<document id="e">
<related-entities>10</related-entities>
</document>
</documents>
<entities>
<entity id="1"/>
<entity id="2"/>
<entity id="3"/>
<entity id="4"/>
<entity id="5"/>
<entity id="6"/>
<entity id="7"/>
<entity id="8"/>
<entity id="9"/>
<entity id="10"/>
</entities>
</data>

I'm going to create a hashmap (actually, a hashmultimap) from entity IDs to documents, and vice versa. Note that the keys have to be of an atomic value type, and the values have to be elements. You can't map, say, integers to strings. Alright: for any document I can get the list of entity IDs, and use them as keys pointing back to the document element. So if any two documents point to the same entity, those two documents (and, obviously, the entity as well) must belong in the same strongly-connected component. For a sequence of entity IDs I can create a set of document elements; I then have to determine if this set intersects with the set I've already accumulated. If yes, I union both sets and recurse; otherwise I spit out the subgraph, pop the stack, and resume with the next, as-yet-unselected document.

<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="subgraphs">
<xsl:key name="linked-documents" match="/data/documents/document" use="tokenize(related-entities, ' ')"/>
<xsl:key name="linked-entities" match="/data/entities/entity" use="key('linked-documents', @id)/@id"/>
<xsl:template match="document">
<document id="{@id}"/>
</xsl:template>
<xsl:template name="create-subgraph">
<xsl:param name="subgraph" as="element(document)*"/>
<xsl:param name="selection" as="element(document)*" />
<xsl:message select="string-join($subgraph/@id, ',')" />
<xsl:message select="string-join($selection/@id, ',')" />
<xsl:choose>
<!-- If there are no more elements to select from, the subgraph is complete (even if empty). -->
<xsl:when test="not($selection)">
<subgraph>
<xsl:apply-templates select="$subgraph"/>
</subgraph>
</xsl:when>
<xsl:otherwise>
<xsl:variable name="linked-entities" as="element(entity)*" select="for $d in $subgraph return key('linked-entities', $d/@id)"/>
<xsl:message select="string-join($linked-entities/@id, ',')" />
<xsl:variable name="subselection" select="$selection[key('linked-entities', @id) intersect $linked-entities]"/>
<xsl:message select="string-join($subselection/@id, ',')" />
<xsl:choose>
<xsl:when test="$subselection">
<xsl:call-template name="create-subgraph">
<xsl:with-param name="selection" select="$selection except $subselection"/>
<xsl:with-param name="subgraph" select="$subgraph|$subselection"/>
</xsl:call-template>
</xsl:when>
<xsl:otherwise>
<subgraph>
<xsl:apply-templates select="$subgraph"/>
</subgraph>
<xsl:variable name="new-selection" select="$selection except $subselection"/>
<xsl:call-template name="create-subgraph">
<xsl:with-param name="selection" select="$new-selection[position() gt 1]"/>
<xsl:with-param name="subgraph" select="$new-selection[1]"/>
</xsl:call-template>
</xsl:otherwise>
</xsl:choose>
</xsl:otherwise>
</xsl:choose>
</xsl:template>

<xsl:template match="/">
<subgraphs>
<xsl:call-template name="create-subgraph">
<!-- Make the first document element an SCC. -->
<xsl:with-param name="subgraph" select="/data/documents/document[1]"/>
<xsl:with-param name="selection" select="/data/documents/document[position() gt 1]"/>
</xsl:call-template>
</subgraphs>
</xsl:template>
</xsl:stylesheet>


In the end, sure enough, I get:

<subgraphs xmlns:fn="subgraphs" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<subgraph>
<document id="a"/>
<document id="b"/>
</subgraph>
<subgraph>
<document id="c"/>
<document id="d"/>
</subgraph>
<subgraph>
<document id="e"/>
</subgraph>
</subgraphs>

Yes, defining one mapping in terms of another is allowed in XSLT 2.0. The problem I had at work was actually more complicated: the entities were in entirely separate XML documents. The trick there, which cost me some sweat, was to define one mapping, which pointed from a document immediately to other documents. However, I'll put off that explanation until someone asks for it.

Sunday, August 24, 2008

Philosophy of Life

"My philosophy is as simple as ever. I love smoking, drinking, moderate sexual intercourse on a diminishing scale, reading and writing (not arithmetic). I have a selfless absorption in the well-being and achievements of Noel Coward." -- from the letters of Noel Coward

Friday, July 18, 2008

How to Convert Julian Dates to Gregorian

I'm writing Java / Jython / Groovy these days, not F#, at a new job, and haven't had time to maintain this blog, but I did want to post one little snippet that I was surprised not to find elsewhere. I've had to ingest 10s of 1000s of records from a customer's Microsoft Access database, and the dates are represented in Julian format. Not relative to the Julian calendar; I mean the number of days since a particular Day 0. For Microsoft Access, Day 0 is December 30, 1899. If Microsoft Access interests you, read up on it here.

Anyway, there are quite a few Web pages out there describing how to do it in Excel or T-SQL, but none describing how to do it in Java. Actually, it's ridiculously easy, which doesn't say much for me. Here's the solution in Jython:

from java.util import *

def convertJulian(julian):
calendar = Calendar.getInstance()
# Why in God's name are months in Java 0-based?
# This is December 30, 1899.
calendar.set(1899, 11, 30)
calendar.add(Calendar.DATE, julian)
return calendar.time

D'oh! Set Day 0, add the days, and you're good to go. Now in Java:

private static Date convertJulian(final int julian) {
Calendar calendar = Calendar.getInstance();
calendar.set(1899, 11, 30);
calendar.add(Calendar.DATE, julian);
return calendar.getTime();
}

Finally, as XSLT (you'll need an XSLT 2.0-compliant processor for this):

<xsl:function name="local:convert-access-date" as="xs:date">
<xsl:param name="access-date" as="xs:integer"/>
<xsl:variable name="days-since-zero" as="xs:dayTimeDuration"
select="xs:dayTimeDuration(concat('P', $access-date, 'D'))"/>
<xsl:sequence select="$zero + $days-since-zero"/>
</xsl:function>

Saturday, May 24, 2008

John Cage for Children

Imagine my surprise when I came home from work a few days ago to find this album blasting from my Meadowlark speakers. I did buy them to listen to art music, not Wiggles DVDs, but I didn't expect my toddler to rummage through my CDs and pick this one to play. Kids today grow up so fast!






Here's the album I want him to listen to though, because I want to train him to be normal. Komar and Melamid, those jokesters, have assembled musical material that people say they like, result in "a musical work that will be unavoidably and uncontrollably 'liked' by 72 ± 12% of listeners ((standard deviation; Kolmogorov-Smirnov statistic)." I have to admit that I prefer the composition that only 200 people in the world are supposed to like; it sounds to me like John Cage sounds to those other people. You can listen to recordings of both here. The unpleasant recording had to be fun to make; the pleasant one sounds like a bored Alicia Keys cover band playing at an airport lounge.

Sunday, May 04, 2008

Counting Change Again?

Problem 76 at Project Euler is really just the "change counting" problem: how many ways can you count out 100 cents if you have coins in denominations from 1 to 99? However, when I plug those values into my previous implementations of the change counting algorithm, it runs forever...well, maybe not forever, but for hours and hours, while using 100% of my RAM. OK, so that's not going to work. At first I thought that the problem was simply that too many of the answers were in memory (i.e. sequences of coins), so I changed the implementation, simply to count possible answers:

let rec countChange =
match coins, amount with
¦ _, 0 -> 1L
¦ [], _ -> 0L
¦ h::t, _ -> [ 0..h..amount ]
¦> List.map (fun amt' -> countChange t (amount - amt'))
¦> Seq.fold1 (+)
Let's review this. How many ways are there to make 0 in change, whatever the available coins? There's 1 way. How many ways are there to make some other quantity in change if you have no coins? 0. Finally, how many ways are there to make C in change if you have a a list of coins h::t? Partition the problem thus: use the coin h to pay out a partition of C (obviously you can use coin h up to C/h times), then make up the remaining partition of C with the rest of the list. If, to solve each subproblem, you're dividing C by the denomination of a coin, then the size of the problem space is exponential in the number of denominations: (C/denomination1) * (C/denomination2) * ... * (C/denominationn) = O(Cn). You can't exactly tell from this example, but the rate of growth here is clearly superpolynomial:

> countChange [ 4; 5 ] 20;;
val it : int64 = 2L
> countChange [ 3; 4; 5 ] 20;;
val it : int64 = 6L
> countChange [ 3; 4; 5; 6 ] 20;;
val it : int64 = 11L
> countChange [2; 3; 4; 5; 6 ] 20;;
val it : int64 = 47L
>
Anyway, it should be clear why there's no point in waiting for the solution to countChange [ 1..99 ] 100. However, I can memoize the change counting algorithm, since it's deterministic: for a given amount, and a given set of coins, the answer is always the same (commerce would be hopeless otherwise). This turns out to be fairly easy, notwithstanding the two arguments to the function. Here's my answer:

open System.Collections.Generic;;

let memoize f =
let cache = Dictionary<_,>()
fun x y ->
if cache.ContainsKey((x, y)) then cache.[(x, y)]
// Remember that f will always consult the memo
// for subproblems.
else let result = f x y
cache.[(x, y)] <- result
result

let rec countChange =
// Turn this into a function that returns a function.
// The *returned* function will be evaluated over the arguments.
let countChange' coins amount =
match coins, amount with
¦ _, 0 -> 1L
¦ [], _ -> 0L
¦ h::t, _ -> [ 0..h..amount ]
¦> List.map (fun amt' -> countChange t (amount - amt'))
¦> Seq.fold1 (+)
memoize countChange'
The answer shows up in a few seconds. Since F# lists are immutable, they can easily be compared for reference equality, rather than in O(n2) time. That is, the initial list [ 99..(-1)..1 ] is a 99 prepended to the list [ 98..(-1)..1 ], and so on. We partition the problem into subproblems corresponding to the head and tail of the list, but no additional lists are created beyond the argument to the function (there's only one list beginning with 50, ever, and it's the tail of the list beginning with 51). Microsoft.FSharp.Collections.List<_>.CompareTo() sees literally the same lists again and again when called from memoize(). Hence the lickety-split performance.

Perhaps this could all be done imperatively, though I don't how OOP is at all applicable to this sort of classic optimization problem (what could justify a Coin class here)? If you haven't read Daniel Friedman's "The Role of the Study of Programming Languages in the Education of a Programmer," here's your chance. As my buddy George Paci puts it, "The title is clearly written by a professor; a marketing guy would have called it something like, 'Holy Crap! It's Amazing What You Can Do with Program Transformations!'"

Wednesday, April 30, 2008

Agent Less Than Zero

I just started a new job, and I'm writing a lot of Java all of a sudden. Hey, maybe I'll learn Scala and piss off my new colleagues just as I was pissing people off with F# a few weeks ago! Or maybe I'll put off blogging about F# for a few weeks until I get settled in. I have been using F# to solve Project Euler problems, but I feel guilty blogging about those, since I'd have to give away answers.

In the meantime, I have to agree with Charles Barkley, who said, "I think the Washington Wizards have got to be the dumbest team in the history of civilization." I think that Caron Butler is a smart player, mind you, and he *should* have had the ball in his hands at the end of tonight's game.

Wednesday, April 02, 2008

Memoization, etc.

I've been sweating over some problems at Project Euler, and I've verified that I'm no mathematician. Some enterprising Haskell programmers got there way before I did, and if I were inclined to cheat, I'd simply translate their solutions into F#. However, in the effort or out of the necessity to remain pure, they don't use for-loops over arrays, or their own memoization, which I'd almost certainly do for any production code that had to calculate the Fibonacci series...well, actually, I don't do that kind of work. Anyway...

I did find the Haskell samples instructive, though I don't want to get sidetracked from my project of learning F#. To this end I've once again resorted to Don Syme's thoughts on memoization: here on the CAML list, and here more recently, where he avails himself of some popular .NET collections. So a memoized Fibonacci function in Dr. Syme's older formulation looks like this:

> #light;;

> let cache f =

- let table = ref [] in

- fun n ->

- try

- List.assoc n !table

- with Not_found ->

- let f_n = f n in

- table := (n, f_n) :: !table;

- f_n

-

-

- let rec fib_mem =

- cache (function

- // You'd better return a bigint, not an int!

- ¦ 0 -> 0I

- ¦ 1 -> 1I

- ¦ n -> fib_mem (n - 1) + fib_mem (n - 2))

-

-

- ;;



val cache : ('a -> 'b) -> ('a -> 'b)

val fib_mem : (int -> bigint)


> fib_mem(1000);;

val it : bigint

= 434665576869374564356885276750406258025646605173717804024817290895365554179490

51890403879840079255169295922593080322634775209689623239873322471161642996440906

533187938298969649928516003704476137795166849228875I

val ns : seq
Several problems at Project Euler ask you to do something with the first 10 or last 10 digits of some big integer; for anyone who was wondering how to do that with a number such as the one above, here's one way:

> fib_mem(1000) |> Seq.unfold (fun n -> if n > 0I
- then
- let quotient,remainder = BigInt.divmod n 10I
- Some (remainder, quotient)
- else
- None);;
val it : seq = seq [5I; 7I; 8I; 8I; ...]
>
You'll notice that the least significant digits appear first, which is of course by design, since dividing by 10 is probably cheap. The last thing I'd want to do is reverse this sequence, i.e.

- |> List.of_seq
- |> List.rev;;
I could clean this up further by applying BigInt.to_Int32 to the generated elements, etc. To sum the first 10 digits (note that this is *not* the Project Euler problem; I'm not about to give away any answers) I could go further, like so:

- |> Seq.take 10
- |> Seq.fold1 (+);;
You get the idea.

Tuesday, March 11, 2008

Munging with Active Patterns

Have you checked out active patterns yet? C'mon! Here's what the code below might have looked like if I'd used them:

> open System;;

> open System.Text.RegularExpressions;;

> let re = new Regex("^(MON¦TUE¦WED¦THU¦FRI¦SAT¦SUN)", RegexOptions.Compiled);;

val re : Regex
> let (¦DataRow(¦_¦) (s : string) =
- if re.IsMatch s
- // Here I'd want to break the line down.
- then Some(1, 2, "Hello!")
- else None

val ( ¦DataRow¦_¦ ) : string -> (int * int * string) option
As I said earlier, Seq.choose will through out None, and I could then handle different kinds of data rows from the input:

> match "TUE" with
- ¦ DataRow(x, y, s) -> Console.WriteLine(s)
- // Additional cases here...
- ¦ _ -> Console.WriteLine("Got bupkes!");;
Hello!
val it : unit = ()

Monday, March 10, 2008

Munging Data in F#, Again

I couldn't help myself. I was recently asked to take a coding test to prove my fitness for a potential new job, and I figured I could do it more quickly in F#, interactively. I wouldn't have to fire up Visual Studio, I wouldn't have to create an executable project with NUnit tests and debuggability and all that ceremony. But really, my dabblings in functional programming have taught me that decomposition into classes is not always the natural way to decompose a problem. Why do I need any classes at all to do this exercise?

> #light;;
> open System.Text.RegularExpressions;;
> let r = new Regex("^(MON¦TUE¦WED¦THU¦FRI¦SAT¦SUN)", RegexOptions.Compiled);;

The line must start with one of these abbreviations. Hey, Blogger keeps taking out the vertical bars between the abbreviations; if they're missing, don't get distracted. Then I want to turn the file into a sequence of strings, i.e. IEnumerable<string>. A "use" as opposed to a "let" binding ensures that the Close() or Dispose() is finally called. See Don Syme's blog for examples (this example, actually, which I simply copied). It's more or less like a C# function that returns IEnumerable<_> by means of "yield return."



> let reader =
- { use reader = new StreamReader(File.OpenRead(@"C:\...\input.txt"))
- while not reader.EndOfStream do
- yield reader.ReadLine() };;
// Filter out the lines that aren’t data rows.
> let filtered = reader ¦> Seq.filter (fun line -> r.IsMatch line);;
> open System;;

Now decompose the line into the parts we’re interested in, based on the (assumed) fixed format. If you want to get your mind blown by "active patterns," dig my man Don Syme.



> let (¦Record¦) (line : string) =
- let transientSold = Int32.Parse(line.Substring(67, 7))
- let definite = Int32.Parse(line.Substring(25, 3))
- let tentative = Int32.Parse(line.Substring(19, 3))
- let date = line.Substring(5, 11)
- // Return the tuple of interesting values.
- date, transientSold, definite, tentative;;
// This should really be embedded in a function!
> open System.Xml;;
> let xml = XmlWriter.Create("output.xml");;
> xml.WriteStartDocument();;
> xml.WriteStartElement("Sample");;

Now pass the filtered lines through a function that decomposes the line into fields we care about, and output an XML element per. Seq.iter applies a function with side-effects but no return value to every element in a sequence.



> filtered ¦> Seq.iter (fun line ->
- match line with
- ¦ Record(date, ts, d, t) -> xml.WriteStartElement("Thing")
- xml.WriteElementString("Date", date)
- xml.WriteElementString("TransientSold", XmlConvert.ToString(ts))
- xml.WriteElementString("CommitmentsDefinite", XmlConvert.ToString(d))
- xml.WriteElementString("CommitmentsTentative", XmlConvert.ToString(t))
- xml.WriteEndElement()
- );;
val it : unit = ()
// Likewise, this should have been put into the function I inlined above.
> xml.WriteEndElement();;
val it : unit = ()
> xml.Close();;
val it : unit = ()

I could have made some different choices here. First of all, opening an XmlWriter in one function, or directly from the command line, then closing it the same way, is rather ugly. If a function is the unit of encapsulation in functional programming, then one function should really own the writer via a use-binding. So I could do something like this:


filtered ¦> (fun lines -> 
use xmlWriter = XmlWriter.Create("output.xml")
xmlWriter .WriteStartDocument()
xmlWriter .WriteStartElement("Sample")
// Yes, it's a loop!
for line in lines do
// Create each element...
)

The following may be too cute:


filtered  ¦> Seq.fold (fun () -> let xmlWriter = ...
xmlWriter.WriteStartDocument()
// additional initialization
xmlWriter
)
(fun writer line ->
// Add an element to the XmlWriter.
writer
)
¦> (fun writer -> writer.Close())

A significant change would have been to use an active pattern to both look for lines of interest and decompose them into interesting tuples. In this case I either keep the line or throw it out, but there might be several data row formats that I care about. In this event I could use active patterns to discriminate between them, and embed any regular expressions in the pattern function. Furthermore, I could use Seq.choose instead of Seq.filter, because the former passes over None. If I get a chance tonight, I'll write that code out.

Sunday, February 10, 2008

I don't own an XBox

I also don't own a Blackberry or an iPod, and today I lost my cell phone, so I really am completely out of it. I'm happy with my new life as a geek, happier than I would be in literary academia, but my literary training really makes it impossible for me get into geek culture completely. Esp. gadgets. For example, I'm utterly uninterested in computer gaming. Utterly. I'm also uninterested in IMing. Or in sitting in an airport terminal, clicking away at a Blackberry. Whenever I see someone doing that, I almost want to buy them a novel from an airport bookstore: "Dude, you've been putting off that encounter with Proust. Here you go! Too many pages? Fine, I'll buy you The Da Vinci Code. Or Maxim. Whatever. Just stop making yourself look pathetic."

Changing the subject somewhat, if there even is a subject here, I'm addicted to the Arts and Letters Daily. I assume that Professor Dennis Dutton has graduate assistants find these links for him, or perhaps his attentions wander back and forth across the left/right divide. Right now there are a number of links to the Manhattan Institute's City Journal, the destination for "liberals who've been mugged," I suppose (i.e., become disenchanted with poor folks). Anyway, Kay Hymowitz has an interesting take on the new "lad culture" here: "Child-Man in the Promised Land." I'm 46, married, two children; the cultural space where I could sit around with my buddies all night, watching South Park and hitting the beer bong, had not yet opened up when I was 26. I don't regret that. Since I got a Ph.D. when I was 34, I obviously had the opportunity to enjoy my own period of arrested development. I'm not sure it helped me more than it hurt me. Not just socially, but intellectually. I'm just not confident that I could make the case for a lot of the stuff I read in graduate school over, say, World of Warcraft.

Saturday, February 02, 2008

Munging some XML with F# & Linq

I recently wanted to change the format of a test script, and canonicalize a flattened listing of all the regression test models. So I set myself the goal of doing it in F#, which would force me to use some of the language features I'd only read about, and possibly prod me to use some LINQ as well. I could have done this transformation manually in about an hour, I admit. And I learned that if I really want to transform a lot of XML to XML, I'd rather do it via XML, i.e. XSLT. However, Microsoft doesn't offer XSLT 2.0 compliancy, so I have less to gain from learning to do it that way (meaning, I'm stuck with the .NET XsltCompiledTransform implementation at work). What I would have needed is the tokenize function (to split up pathnames on the '\\' character), XSLT functions that could return Boolean values, and the ability to select groups. Anyway, that's what I did below.

#light

open System
open System.Xml
open System.Linq
open System.Xml.Linq
open Microsoft.FSharp.Linq
open Microsoft.FSharp.Xml.Linq
open Microsoft.FSharp.Linq.SequenceOps
open Microsoft.FSharp.Xml.Linq.SequenceOps

let doc = XDocument.Load @"knownSolutions.xml"

// Simply tagged aliases for these various tuples. I want to create
// a tree of these, then finally convert them back to XML.
type Node =
| File of string * XElement
| Directory of string * Node list

// the temporary data structure: I'll create a list of these from
// the initial XML, then recursively group them into directories.
type Path = string list * XElement

let make_path (e : XElement) =
e.Element(xname "File") |> element_to_string |> String.split [ '\\' ], e

// I want to split up the path names, while carrying the original
// element around until I'm ready to use it.
let paths = doc.Root.Elements(xname "Model")
|> Seq.map make_path
|> Seq.to_list // There's no Seq.partition, so...

let is_file (p, _) =
(List.length p) = 1

let strip (p, e) =
// Strip the next part of the path, and return this tuple.
List.tl p, e

let separate (l : Path list) =
List.partition is_file l

let rec collect (l : Path list) =
match l with
| [] -> []
| _ -> let files, directories = separate l
let x = directories
|> Seq.groupBy (fun (l, _) -> List.hd l)
|> Seq.map make_directories
|> Seq.to_list
(make_files files) @ x
and make_directories (l, r) =
Directory ( l, r |> Seq.map strip |> Seq.to_list |> collect )
and make_files paths =
paths |> List.map (fun (l, e) -> File (List.hd l, e))

let collected = paths |> collect

let make_xml l =
// Yes, you need the "let rec" here, too. That took me a while to
// figure out. Otherwise these nested functions can't call each other.
let rec make_file name (xml : XElement) =
// If you want to mix XElement and XAttribute objects, you'll have
// to upcast them to XObject. F#'s type inference won't try to find the
// lowest common denominator between two classes in a heterogeneous
// collection, as far as I can tell.
XElement(xname "File", xargs [XAttribute(xname "Name", name); XAttribute(xname "ObjectiveValue", xml.Element(xname "Solution") |> element_to_string)])
and make_directory name nodes =
let element = XElement(xname "Directory", xargs (nodes |> List.map make_xml'))
element.SetAttributeValue(xname "Name", name)
element
and make_xml' e =
match e with
| File (name, element) -> make_file name element
| Directory (name, element) -> make_directory name element
match l with
| [] -> []
| _ -> List.map make_xml' l

let xml = XElement(xname "KnownSolutions", make_xml collected)
let s = xml.ToString()
Console.WriteLine(s)

What's an F# Tuple?


It's hardly a difficult question, but it's taken me a while to arrive at an answer. Reflector has been invaluable to me, and I can't recommend it strongly enough to anyone who wants or needs to understand what the F# compiler does. One often hears that functional programming more readily lets one find the proper level of abstraction, but sometimes one has to look at the IL, perhaps as a bridge to C# (the "canonical"
.NET language), to improve.


Let's say I define these tuple types:


type StringInt = string * int
type StringDouble = string * double



If I disassemble the resulting DLL, I find...nothing at all! There's no trace in the IL of these tuple types, or anything else:
public static void main()
{
}



Let me add some functions that operate on this code:
let get_int (t : StringInt) =
let _, n = t in n

let get_float (t : StringFloat) =
let _, f = t in f


Something interesting now shows up in Reflector's tree view of the
assembly:


main() then looks like this (screenshots made using Cropper):

I can't claim that all of this is comprehensible to me, but one thing is obvious: the compiled IL has no trace of the named tuple types. They're like structs insofar as they're equal when they have equal structure. Tuple types with the same structure are really the same type. They are not implemented as structs (presumably to avoid having to copy them to the stack on every recursion), but they're implemented as classes. Sealed classes, mind you:



The disassembled get_int() function looks like this; I don't know why a new instance is being allocated from the heap:
Interestingly, if I change my code thus, I can no longer call get_int() and get_float() with the resulting values, because the "discrimated union" Tuples is implemented as a class, rather like a C-style union with type tag and some smarter operations (the disassembled IL is much too verbose to reproduce here, but I recommend trying it):


type Tuples =
| One of StringInt
| Two of StringFloat

let si = One ("Carlo", 6)
let sf = Two ("Leo", 1.5)

Console.WriteLine (si)
Console.WriteLine (sf)


Note that I don't have to specify the tuple type; the compiler infers if from the tag "One" or "Two". What I've learned from Don Syme's Expert F# is that the tag most recently put in scope is the one that the compiler tries to use. Then the disassembled code looks like this, meaning that you can no longer get at the tuple except through the tag:



Tuesday, January 29, 2008

Cute Techie Kids!


I don't really have any excuse for posting this photo of Leo (19 mos. old) sleeping with a bottle dangling from his lips. You wanna make something of it?

I Couldn't Be Happier!

From slate.com:

A study suggests extreme happiness may be bad for you. Findings: 1) "The highest levels of income, education and political participation were reported not by the most satisfied individuals, but by moderately satisfied individuals." 2) Extremely happy people "earned significantly less money" and earned lower school grades than moderately happy people. 3) They "may not live as long," either. Theories: 1) Happiness makes you complacent and kills your drive. 2) It makes you slow to adapt. 3) It makes you too optimistic and insufficiently vigilant about your health. 4) It may overstimulate your cardiovascular system. Researchers' conclusions: 1) "Happiness may need to be moderated for success." 2) "Extremely high levels of happiness might not be a desirable goal." Human Nature's conclusions: 1) Success may need to be moderated for happiness. 2) Extremely high levels of success might not be a desirable goal.

Sunday, January 20, 2008

Huffman Coding in F#

#light // I.e. lightweight syntax

namespace Nassar
// An F# module is a "static class." All the declarations
// that follow are static members of this class.

module Huffman

type Incidence = HashMultiMap<char, int>
type CharSet = Set<char>

let addToCount (d : Incidence) (c : char) =
// For some reason, the "do" is necessary here.
do match d.TryFind c with
Some(n) -> d.Replace(c, n + 1)
_ -> d.Add(c, 1)
d // Return the dictionary.

let count (s : string) =
// I only just noticed that System.String implements
// IEnumerable. The F# compiler correctly inferred from
// the type of Incidence that s has to be a sequence of
// char.

s |> Seq.fold addToCount (Incidence.Create())

type HuffmanNode =
Leaf of HuffmanLeaf
Tree of HuffmanTree
and HuffmanLeaf = {Count: int;
Char: char;}
and HuffmanTree = {Count: int;
Chars: CharSet;
Left: HuffmanNode;
Right: HuffmanNode;}

let get_count node =
match node with
Leaf l -> l.Count
Tree t -> t.Count

let has_char node c =
match node with
Leaf l -> l.Char = c
Tree t -> t.Chars.Contains c

let get_set node =
match node with
Leaf leaf -> CharSet.Singleton leaf.Char
Tree tree -> tree.Chars

let merge_nodes left right =
Tree { Count = get_count left + get_count right;
Chars = (get_set left) + (get_set right);
Left = left;
Right = right }

let node_priority l r =
get_count l - get_count r

let rec reduce_list l =
// What I should really do is put all the new leaf nodes
// into a priority queue, and pull off two at a time as
// long as the length of the queue is > 1. Then
// make_tree wouldn't have to convert the sequence
// it creates from the dictionary into a list.

match l with
l::r::tail -> (merge_nodes l r)::tail
|> List.sort node_priority
// ...and recurse.
|> reduce_list
[tree] -> tree
[] -> failwith "Can't operate on empty list!"

let rec encode h s =
// Encode a character by branching to the leaf node
// with that character.

let rec encode' node c =
match node with
Leaf l -> []
// TODO Add an accumulator to make this tail-recursive,
// rather than prepending the new element.

Tree t -> if has_char t.Left c then 0::(encode' t.Left c) else 1::(encode' t.Right c)
s > Seq.map (fun c -> encode' h c) > Seq.concat

// Decode the int list l using the tree h.
let rec decode h l =
let rec decode' node l =
match node, l with
// If we've arrived at a leaf, this must be the character
// we're looking for. Add it to the sequence, and resume
// from the top of the tree.
// TODO Add an accumulator to make this tail-recursive.

Leaf leaf, _ -> leaf.Char :: (decode h l)
// A 0 in the stream means a left branch during
// the encoding, and v.v.

Tree tree, n::tail -> if n = 0 then decode' tree.Left tail else decode' tree.Right tail
// Something was wrong in the encoding.
_, _ -> failwith "Something has gone very wrong."
match l with
[] -> []
// Start navigating down the tree.
_ -> decode' h l

let make_tree s =
let d = count s
// IDictionary<_, _> implements
// ICollection<KeyValuePair<_,_>>, so enumerate the
// pairs and convert each into a leaf node.

let leaves = d |> Seq.map (fun pair ->
Leaf { Count = pair.Value;
Char = pair.Key }; )
|> Seq.to_list
|> List.sort node_priority
reduce_list leaves

Wednesday, January 16, 2008

Huffman Coding, SICP Exercise 2.69

As a convenience, I defined Incidence as System.Collections.Generic.Dictionary. F#'s Map type
is actually a functor, and each insertion or removal returns a new instance. What I'd really like is something like the FindOrAdd() method available in the C5 dictionary implementations, but I can live without it. Anyway, I intend to fold a sequence of characters into a dictionary, for which purpose I'll first define the function that will do the folding:

> let count (d : Incidence) (c : char) =
- let n = ref 0
- // I need a "ref" variable to serve as an "out" parameter.
- if d.TryGetValue(c, n) then
- // Assign to a mutable property...
- d.[c] <- !n + 1
- else
- d.Add(c, 1)
- d;;
val count : Incidence -> char -> Incidence

I'm getting good at this:

> let d = "Tony Nassar, Software Engineer" |> Seq.fold count (new Incidence());;

val d : Incidence

> d;;
val it : Incidence
= dict
[('T', 1); ('o', 2); ('n', 3); ('y', 1); (' ', 3); ('N', 1); ('a', 3);
('s', 2); ('r', 3); (',', 1); ('S', 1); ('f', 1); ('t', 1); ('w', 1);
('e', 3); ('E', 1); ('g', 1); ('i', 1)]
>
I did not know until today that System.String implements IEnumerable! And, fortunately, F#'s type inference lets me pipe that sequence of characters into my count() function. There are two steps that I need to take now: I need to create a sequence sorted by frequency (actually, a mapping from frequency to character, where the frequencies don't have to be unique) and turn it into a tree. This took me a while. Unlike C#, F# only allows mutually recursive type definitions if they're linked with "and." Here I'm declaring HuffmanNode as a "discriminated union," where Leaf and Tree are implemented as tags (you can verify this by using Reflector to examine the compiled DLL), and a leaf consists of the character to be encoded and its frequency (the latter is necessary during the construction of the tree), and a tree (or any non-leaf node) is...well, it's kinda obvious. If this were C, and someone were using a union and a type code, I'd say "Yuck!" But since it's functional programming, and this is a very handy way to build up a tree, it's suddenly cool again.

> type HuffmanNode =
- | Leaf of HuffmanLeaf
- | Tree of HuffmanTree
- and HuffmanLeaf = { Count : int; Char : char }
- and HuffmanTree = { Count : int; Chars : CharSet; Left : HuffmanNode; Right : HuffmanNode };;

type HuffmanNode =
| Leaf of HuffmanLeaf
| Tree of HuffmanTree
and HuffmanLeaf = {Count: int;
Char: char;}
and HuffmanTree = {Count: int;
Chars: CharSet;
Left: HuffmanNode;
Right: HuffmanNode;}

>

Tuesday, January 15, 2008

Eight Queens: I couldn't let it go!

If you're not used to functional programming, and I'm not, you may respond with irritation at an FPer's "How few lines of code can I use to solve this problem?" showpiece. A good example would be the textbook problem of transposing a matrix (for which the answer is posted on Jomo Fischer's blog). If you really need to transpose a lot of matrices, you're probably going to be working in an environment where such operations are optimized, e.g. Mathematica. In other words, you're not going to do it yourself. So if anyone is irritated with the following code, let me say that I had to ponder this (on and off, not continuously!) for a few months before I arrived at the "right" solution, and I do feel that it's right. However, trying to implement it in F# gave me some valuable insights into how it ought to be done in other languages. Here's my solution:
> let queens size =
- let range = [1..size]
- range |> Seq.fold (fun answers column ->
- let positions = [for row in range -> (column, row)]
- [for p in positions for a in answers when all_safe p a -> a@[p]]) [[]]
- ;;

val queens : int -> (int * int) list list

> queens 5;;
val it : (int * int) list list
= [[(1, 4); (2, 2); (3, 5); (4, 3); (5, 1)];
[(1, 3); (2, 5); (3, 2); (4, 4); (5, 1)];
[(1, 5); (2, 3); (3, 1); (4, 4); (5, 2)];
[(1, 4); (2, 1); (3, 3); (4, 5); (5, 2)];
[(1, 5); (2, 2); (3, 4); (4, 1); (5, 3)];
[(1, 1); (2, 4); (3, 2); (4, 5); (5, 3)];
[(1, 2); (2, 5); (3, 3); (4, 1); (5, 4)];
[(1, 1); (2, 3); (3, 5); (4, 2); (5, 4)];
[(1, 3); (2, 1); (3, 4); (4, 2); (5, 5)];
[(1, 2); (2, 4); (3, 1); (4, 3); (5, 5)]]
> queens 4;;
val it : (int * int) list list
= [[(1, 3); (2, 1); (3, 4); (4, 2)]; [(1, 2); (2, 4); (3, 1); (4, 3)]]

First, I did not use a for loop. Not because that would just be too imperative, but because I realized that the range (e.g. from 1..8, for an 8 x 8 chessboard) was actually an (immutable) object that I needed to use in several ways. To generate all the positions on a chessboard, I would then use a "sequence comprehension" such as { for column_number in range for row_number in range -> (column_number, row_number) }. I've begun to see this as more natural than nested for loops. I'm not doing with side effects in a loop; I'm asking for a single object (a collection of board positions) to be created for me. So the comprehension, a declarative construct, is more natural: "Here's what I want; give it to me. I'm not interested in putting it together piece by piece; that's your job." In comparison, the idiom 1, create a list, 2, go through several loops, appending on each innermost iteration, 3, finally do something with the collection (which is probably mutable, even though it really shouldn't be), obscures the intention, and is error-prone.

Then I iterate over that range (now treating it as the sequence of column numbers), and for every column I take the sequence of answers so far (i.e., each such answer is a sequence of one position each from the columns I've seen so far) and extend each one with each position in the current column (i.e. the current element of the range) that's safe. Below, I defined the functions safe() and all_safe(). I can easily generate board positions for the current column by means of another comprehension (let positions = etc.). Then I take the Cartesian product of the answers so far and this new column, filter all these new candidates through all_safe, and pass this new collection on to the next iteration, via fold(). For obvious reasons, Seq.fold starts with the collection of known 0-length answers, which is of course [[]].

If I were (re)doing it in Scheme, I'd rip off Kent Dybvig's comprehension macro, first off. I'd also rip off an implementation of fold, if I didn't already have one. I'd generate a range 1..n just as I do above. Then, on every iteration (I believe it is an iteration, if I'm folding from the left) I create the Cartesian product of the answers so far and the current column (which I just generated), (filter) these pairs through (all-safe), and continue. Something like that.

Eight Queens, and an End

Once again I was led astray by having done too much imperative programming. I hate nesting loops to create collections, but I didn't go all the way. If the columns and rows of the chessboard are represented as a range—the same range—then all the positions are produced by taking the Cartesian product of that range with itself. Then I can iterate across the columns, and, for each one, take the Cartesian product of the rows in that column and the answers so far, and fold the new answers into a collection that was empty to start with. Ah, fold, delight of all functional programmers. To wit:

> let queens size =
- // the row numbers *and* the column numbers
- let range = [1..size]
- let queens' columns =
- columns |> Seq.fold (fun answers column ->
- let positions = [for row in range -> (column, row)]
- [for p in positions for a in answers when all_safe p a -> a@[p]])
- [[]]
- queens' range;;

val queens : int -> (int * int) list list

> queens 4;;
val it : (int * int) list list
= [[(1, 3); (2, 1); (3, 4); (4, 2)]; [(1, 2); (2, 4); (3, 1); (4, 3)]]
>

Saturday, December 08, 2007

Eight Queens in F#

Man, I'm really going back to school here. The last time I did this problem was when I was learning Prolog. Let's say you're trying to place a queen in the first column. Prolog's select function will pick one of the rows, and create a branch to explore the remaining rows. If row 3 was selected, say, you don't have to write any code to concatenate [1..2] and [4..8], in order to pass that smaller set of unoccupied rows to the next iteration / recursion. Very handy! No loops, no list operations, etc.

> #light;;
> type Position = int * int;;

type Position = int * int

> let safe p q =
- fst p <> fst q && snd p <> snd q && fst p + snd p <> fst q + snd q && fst p - snd p <> fst q - snd q;;

val safe : int * int -> int * int -> bool

> let all_safe p positions =
- Seq.for_all (safe p) positions;;

val all_safe : int * int -> #seq -> bool

> all_safe (1, 2) [(1, 2)];;
val it : bool = false
> all_safe (1, 2) [(3, 3)];;
val it : bool = true

> let range = [1..8];;

val range : int list

> range;;
val it : int list = [1; 2; 3; 4; 5; 6; 7; 8]
> let queens size =
- let range = [1..size]
- let rec queens' columns answers =
- match columns with
- | [] -> answers
- | column::tail -> queens' tail [for answer in answers for position in [for row in range -> (column, row)] when all_safe position answer -> position::answer]
- queens' range [[]];;

val queens : int -> (int * int) list list

> queens 4;;
val it : (int * int) list list
= [[(4, 3); (3, 1); (2, 4); (1, 2)]; [(4, 2); (3, 4); (2, 1); (1, 3)]]
>

Implemented recursively, it's:

> let queens size =
- let range = [1..size]
- let rec queens' columns =
- match columns with
- | column::tail -> [for answer in (queens' tail) for position in [for row in range -> (column, row)] when all_safe position answer -> position::answer]
- | _ -> [[]]
- queens' range;;

val queens : int -> (int * int) list list

> queens 4
- ;;
val it : (int * int) list list
= [[(1, 3); (2, 1); (3, 4); (4, 2)]; [(1, 2); (2, 4); (3, 1); (4, 3)]]
>

Friday, December 07, 2007

F#, finally!

I've been champing at the bit to actually write some F#, but haven't been willing to risk sneaking it into production at the office, and unable to put in extra time at home to learn it. I have read Robert Pickering's Foundations of F# while commuting, but that's not verifiable experience in functional programming. As to why I would care whether I learn functional programming or not, let me plug the FringeDC interest group, founded by the estimable Dr. Conrad Barski. Not only have I found the members to be charming and personable, but they also seem much more incisive about what programming is than most developers I meet. They're not, for the most part, doing enterprise-scale stuff, where decisions about programming languages, database vendors, and so on would probably be taken out of their hands. Unless they're Paul Graham, who got rich writing LISP. Good for him! Anyway, lots of smart people write Java, or even VB. Nonetheless, the kicks in the butt that I've gotten at FringeDC, as well as from graduate classes at George Mason U., have led me to feel that what would do me the most good, and what I would most enjoy, is something more fundamental, theoretical, than spending every weekend preparing for MCAD exams. Different strokes.

Alright, enough about that. Languages that offer a REPL seem really suited to scripting. Yes, my friends at FringeDC would probably do everything in Emacs Scheme, but...I just don't know how anymore. Which I'm not proud of. Visual Studio has actually spoiled me to the point where I wouldn't know how to run some XML through an XSLT from the command line. I would actually have to take time out to figure it out. OK, really enough. Here's my task: I've got some XML describing the expected return values from some linear programming models, and it's tacked on top of some older regression testing scripts that just list the files to be solved by particular solvers. The XML file is called knownSolutions.xml, and the scripts are called solveBatch.cplex.txt and solveBatch.lpSolve.txt. Here's what I want to do: load the XML file, and, when a particular solver is not supposed to solve a model, add a corresponding attribute to the element. I've changed the schema:
<xs:attribute name="SkipSolvers" type="solverTypeList" use="optional">
...
<xs:simpleType name="solverType">
<xs:restriction base="xs:string">
<xs:enumeration value="lpSolve" />
<xs:enumeration value="Cplex" />
</xs:restriction>
</xs:simpleType>
<xs:simpleType name="solverTypeList">
<xs:list itemType="solverType" />
</xs:simpleType>
So if we have the XML:
<KnownSolutions>
<Model skipsolvers="lpSolve">
<File>TestModels\JasonCh3Net.xml</File>
<ObjectiveValue>196200</ObjectiveValue>
</Model>
</KnownSolutions>
...then I'd like to add attributes like so:
<Model SkipSolvers="lpSolve" .../<

So I need to load the XML from the F# console:

> open System.Xml;; # Import the assembly and the corresponding namespace.
# The F# console waits for two semicolons before it interprets
# your statement.

> let document = new XmlDocument();; # Infer the variable's type.

val document : XmlDocument

> document.Load @"knownSolutions.xml";;
val it : unit = () # The Load() method returns void, in other words.

Now at this point I've got to make some decisions about, you know, algorithms and data structures. I've got a tree—XML—and I've got a list of files that I can't assume are sorted. So why don't I read them in, filter out the empty lines, sort them, and then iterate across the XML's elements looking for files that aren't in the list for Cplex. Since I'm going to be looking things up in the list, I'll make a set out of it first. I should point out that, in contrast to imperative programming, we do not want to open the file and loop over the lines. We want to treat the file as a sequence of lines, no different than IEnumerable<string>. I say "we," but I mean "I": I just don't want to write nested loops anymore. But how do I deal with the EOF? I use a "generator" that has a special way of "terminating" the sequence of lines it reads. This is the "pipes and filters" idiom familier to UNIX folks. F#'s Seq class provides a generate_using function that thoughtfully calls IDispose() on the object created at the beginning of the generation.
> let opener () = File.OpenText("something.txt");;

val opener : unit -> StreamReader
> > opener;;
val it : (unit -> StreamReader) = <fun:clo@0_3>
What that means is that opener is a function with no arguments (if I'd left off the parentheses before the equals sign, F# would assign the return value of OpenText() to opener, which is not what I want), that knows how to return a StreamReader. generate_using takes that residual value (I don't think it's called a return value, per se) and passes it repeatedly to the actual generator, ergo:

> let liner (reader : StreamReader) =
- if reader.EndOfStream then None
- else Some(reader.ReadLine());;

val liner : StreamReader -> string option

option means "either a string, or nothing." Don't think of it as null; F# doesn't really have null. None will mark the end of the generated sequence. generate_using keeps sending the stream as an argument to the "closure" liner, and liner decides whether it's got something worth returning; if not, it signals to the next thing in line, "I'm done." I don't need to point out how parallelizable this "pipes and filters" idiom is; whatever is waiting for a line of text only needs one line to do its job. Anyway, now I'll filter out the empty strings, and put all of them in a hashed set for fast lookup (I could also just remove the empty string from the hashed set after populating it). I'll create the filter separately, rather than inline:

> let filter s = (s <> "");;

val filter : string -> bool

The static Seq.filter method requires a function that takes a sequence element and returns a Boolean value, in this case false if the string is empty. In order to create a set that's initialized to include all the values in the generated sequence, I do:

> let lpSolveSet = HashSet<string>.Create(
- Seq.generate_using opener liner
- |> Seq.filter (fun s -> (s <> ""))
- );;

val lpSolveSet : HashSet<string>

Whew! I don't know if F#'s type inference would have let me leave out the <string>. Putting it together, as I suppose I could have done in 5 minutes if I were a PowerShell guru:

> Seq.generate_using (fun () -> File.OpenText @"solveBatch.lp_solve.txt")
- (fun reader -> if reader.EndOfStream then None else Some(reader.ReadLine()))
- |> Seq.filter (fun s -> (s <> ""))
- |> Seq.map (fun s -> lpSolveSet.Add(s));;
val it : seq<unit> = seq [null; null; null; null; ...]
> lpSolveSet;;
val it : HashSet<string>
= seq
["JasonCh3Net.xml"; "JohnCh4Net.xml"; "OffsetNet.xml";
"WithinClauseTestNet.xml"; ...]
>

In other words, generate a sequence of lines from a file , pass them through a filter that removes empty strings, and send the resulting sequence to the HashSet's constructor. I could also have piped the filtered sequence to a closure that would add each string to the set in turn. Whatever. Now...

> let models = document.SelectNodes "/KnownSolutions/Model";;

val models : XmlNodeList

> models;;
val it : XmlNodeList
= seq
[seq [seq [seq []]; seq [seq []]]; seq [seq [seq []]; seq [seq []]];
seq [seq [seq []]; seq [seq []]]; seq [seq [seq []]; seq [seq []]]; ...]
>

Oh, so it's a sequence! Now I'm rolling, so:

> Seq.hd models;;

Seq.hd models;;
-------^^^^^^^

stdin(131,7): error: FS0001: The type XmlNodeList is not compatible with the type seq<'a>
stopped due to error
Seq.hd ("head") is like LISP's car, btw. Hm, it's not a sequence. Embarrassing. I scratched my head for a few minutes, then realized that I should use a list comprehension:

> { for o in models -> o :?> XmlElement };;
val it : seq
= seq
[seq [seq [seq []]; seq [seq []]]; seq [seq [seq []]; seq [seq []]];
seq [seq [seq []]; seq [seq []]]; seq [seq [seq []]; seq [seq []]]; ...]
>
I apply the downcast operator to each element; now I have a strongly typed sequence, so F# can use type inference on the next element in the chain:

> let attrAdder (e : XmlElement) =
- if not (lpSolveSet.Contains(e.InnerText)) then begin
- let attr = document.CreateAttribute("SkipSolvers") in begin
- attr.Value <- "lpSolve";
- e.Attributes.Append(attr)
- end
- end;;

e.Attributes.Append(attr)
-----------------^^^^^^^^^^^^^

stdin(87,17): error: Type constraint mismatch. The type
XmlAttribute
is not compatibile with type
unit.
The type XmlAttribute is not compatible with the type unit
stopped due to error
What? I scratched my head over this one. In the end, the error message makes perfect sense: because there's no "alternative," the "consequent" is not allowed to return a value, and XmlAttributeCollection.Append() does. So we have to pipe it into oblivion, hence:

> let attrAdder (e : XmlElement) =
- if not (lpSolveSet.Contains(e.InnerText)) then begin
- let attr = document.CreateAttribute("SkipSolvers") in begin
- attr.Value &lt- "lpSolve";
- e.Attributes.Append(attr) |> ignore
- end
- end;;

val attrAdder : XmlElement -> unit
So this is the little function object that will add the required attribute to the element if the lpSolve solver is not supposed to solve it. Then:

> { for o in models -> o :?> XmlElement } |> Seq.iter adder;;
val it : unit = ()

And lo and behold, I get this:

> {for node in document.SelectNodes "/KnownSolutions/Model" -> node :?> XmlElement } |> Seq.iter adder;;
System.ArgumentException: The named node is from a different document context.
at System.Xml.XmlAttributeCollection.Append(XmlAttribute node)
at FSI_0014.adder(XmlElement e)
at FSI_0074.it@150.Invoke(XmlElement e@51_6)
at Microsoft.FSharp.Collections.Seq.iter@868.Invoke(IEnumerator`1 e@90_1)
at Microsoft.FSharp.Core.Operators.using[T,U](T ie, FastFunc`2 f)
at Microsoft.FSharp.Collections.Seq.iter[T,U](FastFunc`2 f, U ie)
at <StartupCode>.FSI_0074._main()
stopped due to error
After about an hour, I realized that the function object that's adding the attributes is a closure: that means that it encapsulates a reference to a previous value of document. That's what "closure" means, silly! I create a new closure that refers to the new XML document, rerun, and I'm done, almost:

> document.Save @"knownSolutionsImproved.xml"