Wednesday, April 30, 2008

Regular Expressions Are Fun

Miriam and I were debugging a regular expression, and it was an educational experience.

The platform is Java, and our problem is:

  • Input is a long string
  • We want to replace occurrences of "PRE[SEARCHSTRING]POST" with "PREreplacedPOST"
  • "PRE" and "POST" are patterns
  • "[SEARCHSTRING]" is a string that contains a lot of special regex character like "[", "\", and "$".

We found java.util.regex.Pattern.quote to help with the special characters in "[SEARCHSTRING]":

String quotedPattern = Pattern.quote(searchString);

Now, how do we match "PRE" and "POST" without losing them? They disappear if we try:

return input.replace("PRE" + quotedPattern + "POST", "replaced")
;

After some fiddling we came up with:

return input.replaceAll("(PRE)" + quotedPattern + "(POST)", "$1replaced$2");

This matches "PRE" and "POST", but references the captured subgroups with "$1" and "$2" so they don't disappear.

But we then tried the following instead:

return input.replaceAll("(?<=PRE)" + quotedPattern + "(?=POST)", "replaced");

"(?=POST)" is a zero-width positive lookahead, which is an incredibly cool technical name. It matches a pattern that does appear ahead ("positive lookahead") but this pattern won't be regarded when checking what parts of the string matched the regular expression ("zero-width"). "(?<=PRE)" is, similarly, a zero-width positive lookbehind. There are also negative lookaheads and lookbehinds that make sure the pattern doesn't appear.

Is there a more elegant way to do this?