Wednesday, August 02, 2006

Generating A Regular Expression

A bioinformaticist recently posed an interesting problem to the RegexAdvice forum, where I like to lurk. Can a regular expression let me search for a pattern such as ATGGCACAGGTTATCCATTATCAGACCTTTACAAAAATCAGATAA, allowing exactly 1 mismatched character? Trust me, the patterns get a lot longer than this! Anyway, inexact matching algorithms would be too inefficient here: we know exactly how many mismatches we're allowed. On the other hand, I'm not sure that any solution could take advantage of optimized exact matching algorithms, whether those were part of a regex implementation or a custom implementation. At first, I considered how to write an expression that allowed a single mismatch somewhere in the pattern; then I realized that the general case, allowing m mismatches, is actually easier to write. I would like to present this work as part of a prospective talk on "test-driven development and recursive algorithms," or something like that, but to be honest, this provisional implementation was preceded by a lot of trial and error. At some point, however, I knew how the algorithm had to work, and how the expression had to look, and at that point, of course, it was easy to employ TDD.

OK, here it is, in Python:

import unittest

def expand(s, mismatches):
assert mismatches <= len(s)
# Only exact matching from this point on.
if mismatches == 0:
return s
# All mismatches allowed.
if mismatches == len(s):
return '.' * mismatches
# Branch: match the next character exactly, or mismatch.
return [[s[0], expand(s[1:], mismatches)],
['.', expand(s[1:], mismatches - 1)]]

def collapse(l):
if type(l) == str:
return l
if type(l[0]) == str:
# They're all strings, in this case.
return ''.join(map(collapse, l))
return '(' + '|'.join(map(collapse, l)) + ')'

class TestExpansion(unittest.TestCase):
def testZeroMismatches(self):
self.assertEquals('string', expand('string', 0))

def testOneCharOneMismatch(self):
self.assertEqual('.', expand('a', 1))

def testMultipleCharsAllMismatches(self):
self.assertEqual('..', expand('ab', 2))

def testTwoCharsOneMismatch(self):
self.assertEqual([['a', '.'], ['.', 'b']], expand('ab', 1))

def testCollapseString(self):
self.assertEqual('a', collapse('a'))

def testCollapseTwoStrings(self):
self.assertEqual('(a.|.b)', collapse([['a', '.'], ['.', 'b']]))

def testThreeCharsOneMismatch(self):
l = expand('abc', 1)
self.assertEqual( [['a', [['b', '.'], ['.', 'c']]], ['.', 'bc']], l)
r = collapse(l)
self.assertEqual('(a(b.|.c)|.bc)', r)

if __name__ == '__main__':
unittest.main()


And it seems to work, according to my understanding of what it should be able to do! I'd like to turn on "explicit capture only", to spare the expense caused by all those parentheses, but Python doesn't have that option:

>>> r.findall('tony bony tiny toby tonk toaa blah')
[('tony', 'ony', 'ny'),
('bony', '', ''),
('tiny', 'iny', ''),
('toby', 'oby', 'by'),
('tonk', 'onk', 'nk')]
>>>

1 comment:

Anonymous said...
This comment has been removed by a blog administrator.