Thursday, September 08, 2011

Tokenizing a String with Oracle SQL

This problem actually comes up pretty frequently for me. Audit log records at my place of employment are written to the DB. I often get requests to pull out and aggregate the objects IDs in a set of rows. The IDs are space-separated within a VARCHAR2 column. The details aren't that interesting, though.

The first trick to know is the by-now conventional way of generating a sequence of integers in Oracle SQL:

SELECT ROWNUM i FROM DUAL CONNECT BY LEVEL <= 10;

The 10 above is a necessary but arbitrary cutoff. My sample data happens to have < 10 tokens per row; if there were more, I'd boost the cutoff. Anyway, the next trick to know concerns Oracle's REGEXP_SUBSTR() function, namely that it has an optional argument for the match. You can see where this is going, right? If I JOIN each row of the audit log to the sequence of integers, then I can use the latter integers as match indexes.

Since Oracle's regex implementation doesn't include look-ahead operators, the token separator will be part of the match, and I'll have to remove it, hence TRIM(). If your data is comma-separated, your SQL will look a little different. But enough:

SELECT * FROM (
SELECT s, i, TRIM(REGEXP_SUBSTR(s, '\d+( |$)', 1, i)) token FROM (
SELECT '123 456' FROM DUAL
UNION
SELECT '789 101112' FROM DUAL
), (
SELECT ROWNUM i FROM DUAL CONNECT BY LEVEL <= 10
)
) WHERE token IS NOT NULL;

1 comment:

SlideGuitarist said...

Natch, Tom Kyte has suggested a similar method of tokenizing a comma-separated string at http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:15258974323143.