Saturday, October 25, 2008

Efficient “Disjoint Sets” Implementation in XSLT

I actually had to revisit this problem with slightly different input, and once again I struggled. If I represent an undirected graph like this:

<?xml version="1.0"?>
<graph>
    <nodes>
        <node id="1"/>
        <node id="2"/>
        <node id="3"/>
        <node id="4"/>
        <node id="5"/>
    </nodes>
    <links>
        <link end1="1" end2="2"/>
        <link end1="1" end2="3"/>
        <link end1="1" end2="4"/>
        <link end1="1" end2="5"/>
    </links>
</graph>

…then it's a little harder to go from one node to its connected nodes, because the information I need is elsewhere in the XML document (i.e. not in a child node of the <node> element) . To index any node element, I'd have to navigate up the tree, and back down to the links. What's worse, I'd have to scan all the links for every node. So the solution involves some indirection. First I declare these two mappings:

<xsl:key name="left" match="/graph/links/link" use="@end1"/>
<xsl:key name="right" match="/graph/links/link" use="@end2"/>


 

Now I map each node (its ID, actually; remember that the key has to be of an atomic type) to its connected nodes by going through these mappings (such a recursive definition is not possible in XSLT 1.0, I believe). Note that for links that were indexed by @end1, I'm interested in the value of @end2, and v.v.


<xsl:key name="connected" match="node" use="key('left', @id)/@end2, key('right', @id)/@end1"/>


 

Finally, I can use this mapping:


 

<xsl:template match="/">
        <connectivity>
            <xsl:for-each select="graph/nodes/node">
                <node id="{@id}">
                    <connected-to>
                        <xsl:value-of select="key('connected', @id)/@id"/>
                    </connected-to>
                </node>
            </xsl:for-each>
        </connectivity>
    </xsl:template>


 

I believe that these are all the data structures I need. Now I can implement the "strongly connected components" algorithm as I did below, but using set operators on the nodes.

Friday, October 03, 2008

Graph Algorithms in XSLT

Notwithstanding that XSLT can be considered a functional programming language, you probably wouldn't choose to implement algorithms in it. Nonetheless, I recently had to split a very large XML document into smaller pieces (using xsl:result-document), and I wanted to redundancies between the outputs. So what I essentially wanted to do was separate elements in the input into "strongly-connected components." XSLT 2.0 provides lots of operators on sequences and sets, and allows you to define property functions, but what it does not lend itself to is the creation of new data structures. To be more precise, the data structures you'd need for depth-first search or Tarjan's disjoint sets would require you to copy elements into new structures. Easy, if you're working in a language that has pointer or reference equality; not possible in XLST. However, if you assume that the XSLT processor will use hash tables to represent sets, and you put that together with <xsl:key>, you can try a different approach.

Here's the input I want to process:

<?xml version="1.0"?>
<data>
<documents>
<document id="a">
<related-entities>1 2 3</related-entities>
</document>
<document id="b">
<related-entities>2 3 4</related-entities>
</document>
<document id="c">
<related-entities>5 6 7</related-entities>
</document>
<document id="d">
<related-entities>7 8 9</related-entities>
</document>
<document id="e">
<related-entities>10</related-entities>
</document>
</documents>
<entities>
<entity id="1"/>
<entity id="2"/>
<entity id="3"/>
<entity id="4"/>
<entity id="5"/>
<entity id="6"/>
<entity id="7"/>
<entity id="8"/>
<entity id="9"/>
<entity id="10"/>
</entities>
</data>

I'm going to create a hashmap (actually, a hashmultimap) from entity IDs to documents, and vice versa. Note that the keys have to be of an atomic value type, and the values have to be elements. You can't map, say, integers to strings. Alright: for any document I can get the list of entity IDs, and use them as keys pointing back to the document element. So if any two documents point to the same entity, those two documents (and, obviously, the entity as well) must belong in the same strongly-connected component. For a sequence of entity IDs I can create a set of document elements; I then have to determine if this set intersects with the set I've already accumulated. If yes, I union both sets and recurse; otherwise I spit out the subgraph, pop the stack, and resume with the next, as-yet-unselected document.

<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="subgraphs">
<xsl:key name="linked-documents" match="/data/documents/document" use="tokenize(related-entities, ' ')"/>
<xsl:key name="linked-entities" match="/data/entities/entity" use="key('linked-documents', @id)/@id"/>
<xsl:template match="document">
<document id="{@id}"/>
</xsl:template>
<xsl:template name="create-subgraph">
<xsl:param name="subgraph" as="element(document)*"/>
<xsl:param name="selection" as="element(document)*" />
<xsl:message select="string-join($subgraph/@id, ',')" />
<xsl:message select="string-join($selection/@id, ',')" />
<xsl:choose>
<!-- If there are no more elements to select from, the subgraph is complete (even if empty). -->
<xsl:when test="not($selection)">
<subgraph>
<xsl:apply-templates select="$subgraph"/>
</subgraph>
</xsl:when>
<xsl:otherwise>
<xsl:variable name="linked-entities" as="element(entity)*" select="for $d in $subgraph return key('linked-entities', $d/@id)"/>
<xsl:message select="string-join($linked-entities/@id, ',')" />
<xsl:variable name="subselection" select="$selection[key('linked-entities', @id) intersect $linked-entities]"/>
<xsl:message select="string-join($subselection/@id, ',')" />
<xsl:choose>
<xsl:when test="$subselection">
<xsl:call-template name="create-subgraph">
<xsl:with-param name="selection" select="$selection except $subselection"/>
<xsl:with-param name="subgraph" select="$subgraph|$subselection"/>
</xsl:call-template>
</xsl:when>
<xsl:otherwise>
<subgraph>
<xsl:apply-templates select="$subgraph"/>
</subgraph>
<xsl:variable name="new-selection" select="$selection except $subselection"/>
<xsl:call-template name="create-subgraph">
<xsl:with-param name="selection" select="$new-selection[position() gt 1]"/>
<xsl:with-param name="subgraph" select="$new-selection[1]"/>
</xsl:call-template>
</xsl:otherwise>
</xsl:choose>
</xsl:otherwise>
</xsl:choose>
</xsl:template>

<xsl:template match="/">
<subgraphs>
<xsl:call-template name="create-subgraph">
<!-- Make the first document element an SCC. -->
<xsl:with-param name="subgraph" select="/data/documents/document[1]"/>
<xsl:with-param name="selection" select="/data/documents/document[position() gt 1]"/>
</xsl:call-template>
</subgraphs>
</xsl:template>
</xsl:stylesheet>


In the end, sure enough, I get:

<subgraphs xmlns:fn="subgraphs" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<subgraph>
<document id="a"/>
<document id="b"/>
</subgraph>
<subgraph>
<document id="c"/>
<document id="d"/>
</subgraph>
<subgraph>
<document id="e"/>
</subgraph>
</subgraphs>

Yes, defining one mapping in terms of another is allowed in XSLT 2.0. The problem I had at work was actually more complicated: the entities were in entirely separate XML documents. The trick there, which cost me some sweat, was to define one mapping, which pointed from a document immediately to other documents. However, I'll put off that explanation until someone asks for it.