Monday, February 05, 2007

Generating LOC from NAnt, Part D'oh!

At some point I'd like to make all the code available, and perhaps even contribute it NAntContrib, but at the moment it's really butt-ugly. I'm glad that C# 2.0 has better support for sort-of-functional programming constructs, but this is unpleasantly verbose:

public static TreeSorter Sort(string[] paths)
{
if (paths.Length == 0)
return new TreeSorter("");
List<string> pathsList = new List<string>(paths);
pathsList.Sort();
List<List<string>> splitPaths = pathsList.ConvertAll<List<string>>(delegate(string path)
{
return
new List<string>(
path.Split(
System.IO.Path.DirectorySeparatorChar));
});

int index = 0;
for (; index < splitPaths[0].Count; ++index)
{
if (!splitPaths.TrueForAll(delegate(List<string> input)
{
return index < input.Count && input[index] == splitPaths[0][index];
}))
break;
}

IEnumerable<string> commonPath = splitPaths[0].GetRange(0, index);
IList<IEnumerable<string>> truncatedPaths = splitPaths.ConvertAll<IEnumerable<string>>(delegate(List<string> input)
{
return
input.GetRange(index,
input.Count -
index);
});
return new TreeSorter(commonPath, truncatedPaths);
}

Saturday, February 03, 2007

Generating LOC from NAnt, Part 1

Last March I my team took over a failing project for our most important client. The domain to which the software applied is very interesting, and very difficult, and I can't deny that the previous contractor had employed some smart people...at some point in the project's history! The code was a mess. Simply passing FxCop over it revealed that about 5% of it was dead code. After a few months of working with it, we realized that another significant part of it would only be called if certain events fired, and they never did. This is often not so easy to prove, unfortunately. Since I use ReSharper, I found it easier simply to delete unused methods, but the overuse of dynamic binding (!) in the code base meant that I could not be sure, without regression tests (which of course didn't exist until I wrote them) that certain classes were never instantiated. And so on.

Anyway, we began by simply counting the lines of text in the entire code base. This measure decreased daily, and the project manager finally suggested that we do this more systematically so that the client could cite these figures when trying to justify his additional expenditures on this already "working" code to his superiors. I want to make clear that we are not somehow charging for KLOC. Notwithstanding that we're doing this work for a gov't agency, we're not working under some burdensome methodology. The distinction between physical and logical LOC is not so important to us. We are supposed to be making the code more maintainable; a steady decrease in the relative LOC counts as success. I imposed the requirement on myself that the LOC metric be generated from CruiseControl.NET via NAnt. None of the tools that I could use in this way were entirely satisfactory, however. NDepend measured the LOC in files that didn't interest me (e.g., MyForm.Designer.cs); GeroneSoft's Code Counter likewise. Both utilites allow me to create include and exclude lists, but none of them as neatly as NAnt! I would really like to be able to edit the .build file, quickly create another <fileset>, say for ".cs files generated by the Windows Forms Designer," measure the LOC in that set, and arrange for those results to be published by CruiseControl.NET.

NAntContrib does supply a <codestats> task, which accepts a fileset as input. Then, unfortunately, it outputs the statistics as a flat list, which is not what I wanted. What I need is to turn that list into a tree again. This turns out to be unspeakably difficult without XSLT 2.0. Don't recommend Muenchian groupings, please! My initial thought that it would be neat to write this transform cost me at least a day. Then I set about writing a custom NAnt task in C#, but once again I was sidetracked by C#'s relatively poor support for list comprehensions. Moreover, one is discouraged by "performance considerations" from repeatedly splitting and rejoining strings, slicing lists, etc. That's what happens to you when you write too much C, as opposed to learning LISP at a good school. After about two wasted days, on the bus going home, I pulled out a pad of paper and wrote out some Python. I tried it out when I got home, and it worked right out of the box. Then, I admit with some embarrassment, I retrofitted the unit tests, and implemented the special functions such as __len__(), for the sake of the tests.

import os
import itertools
import unittest

class Node:
def __init__(self, path = [], children = []):
assert isinstance(path, list)
assert isinstance(children, list)
self.path = path
self.children = {}
for child in children:
self.add(child)

def __len__(self):
return 1 + sum([len(child) for child in self.children.values()])

def __getitem__(self, key):
return self.children.__getitem__(key)

def __setitem__(self, key):
return self.children.__setitem__(key)

def __iter__(self):
yield self.path
for child in self.children.values():
for i in child:
yield self.path + i

def __repr__(self):
return os.sep.join(self.path)

def add(self, path):
assert len(path), "Cannot add an empty path."
if len(path) == 1:
self.children[path[0]] = Node(path)
elif self.children.has_key(path[0]):
self.children[path[0]].add(path[1:])
else:
self.children[path[0]] = Node([path[0]], [path[1:]])

def allsame(seq):
# An empty sequence is true because is has no mismatches.
# A 1-length sequence should always be true, i.e., imap(...) will
# return an empty list, of which False is not a member (duh!).
# Otherwise, compare all elements to the first. The iteration
# over the sequence should stop as soon as False becomes
# a member of the output sequence.
return not seq or False not in itertools.imap(lambda x: x == seq[0], itertools.islice(seq, 1, None))

def createNodeSorter(paths):
assert paths, "Cannot create a base node for 0 paths"
# Split the paths along the directory separator character.
splitFiles = [f.split(os.sep) for f in paths]
# Now "pivot" these lists in order to find out if they share a common
# base path. Note that os.commonprefix() is character-based; it is
# not suitable.
zipped = zip(*splitFiles)
# How many sequences consist of identical elements?
# Concatenate those, and you've got the common base path.
common = [f[0] for f in itertools.takewhile(allsame, zipped)]
print "Common base path: " + str(common)
# Now strip the common base path.
splitFiles = [f[len(common):] for f in splitFiles if f != common]
print "\tchildren: " + str(splitFiles)
return Node(common, splitFiles)

class AllSameTest(unittest.TestCase):
def testEmptySequence(self):
self.assertTrue(allsame([]))

def testSingleElement(self):
self.assertTrue(allsame([1]))
self.assertTrue(allsame(['test']))

def testTwoIntegers(self):
self.assertTrue(allsame([1, 1]))
self.assertFalse(allsame([0, 1]))

def testTwoStrings(self):
self.assertTrue(allsame(['test', 'test']))
self.assertFalse(allsame(['test', 'fail']))

class CreatorTest(unittest.TestCase):
def testEmptyFileSet(self):
self.assertRaises(AssertionError, lambda: createNodeSorter([]))

def testSinglePath(self):
n = createNodeSorter([r'D:\trunk\Projects\Build\NAnt\LineCounter\LineCount.cs'])
self.assertEqual(1, len(n))

def testTwoPaths(self):
files = [
r'D:\trunk\Projects\Build\NAnt\LineCounter\LineCount.cs',
r'D:\trunk\Projects\Build\NAnt\LineCounter\LineCountCollection.cs']
n = createNodeSorter(files)
self.assertEqual(3, len(n))
self.assertEqual(['D:', 'trunk', 'Projects', 'Build', 'NAnt', 'LineCounter'], n.path)

def testOverlappingPaths(self):
files = [
r'D:\trunk\Projects\Build\NAnt\LineCounter',
r'D:\trunk\Projects\Build\NAnt\LineCounter\LineCountCollection.cs']
n = createNodeSorter(files)
self.assertEqual(2, len(n))
self.assertEqual(['D:', 'trunk', 'Projects', 'Build', 'NAnt', 'LineCounter'], n.path)

class NodeSorterTest(unittest.TestCase):
def testBlankNode(self):
n = Node()
self.assertEqual('', repr(n))
self.assertEqual(1, len(n))

def testSingleChild(self):
n = Node(['base'], [['child1']])
self.assertEqual(2, len(n))
self.assertEqual('base', `n`)

def testNoChildren(self):
n = Node(['C:'])
self.assertEqual(1, len(n))

def testInvalidPath(self):
self.assertRaises(AssertionError, lambda: Node('base'))

def testIteratorBlankNode(self):
n = Node()
nodes = [node for node in n]
self.assertEqual([[]], nodes)

def testIteratorNodeWithPath(self):
n = Node(['C:', 'Program Files'])
nodes = [node for node in n]
self.assertEqual([['C:', 'Program Files']], nodes)

def testIteratorWithOneChild(self):
n = Node(['C:'], [['Program Files']])
nodes = [node for node in n]
self.assertEqual(['C:'], nodes[0])
self.assertEqual(['C:', 'Program Files'], nodes[1])

def testThreeLevels(self):
n = Node(['C:'], [['Program Files', 'Adobe', 'Acrobat 7.0'], ['Program Files', "CruiseControl.NET"]])
self.assertEqual(5, len(n))

def testNoCommonBase(self):
n = Node([], [['C:', 'Program Files', 'Adobe', 'Acrobat 7.0'], ['D:', 'Program Files', "CruiseControl.NET"]])
self.assertEquals(8, len(n))

if __name__ == '__main__':
unittest.main()