Wednesday, April 30, 2008

Regular Expressions Are Fun

Miriam and I were debugging a regular expression, and it was an educational experience.

The platform is Java, and our problem is:

  • Input is a long string
  • We want to replace occurrences of "PRE[SEARCHSTRING]POST" with "PREreplacedPOST"
  • "PRE" and "POST" are patterns
  • "[SEARCHSTRING]" is a string that contains a lot of special regex character like "[", "\", and "$".

We found java.util.regex.Pattern.quote to help with the special characters in "[SEARCHSTRING]":

String quotedPattern = Pattern.quote(searchString);

Now, how do we match "PRE" and "POST" without losing them? They disappear if we try:

return input.replace("PRE" + quotedPattern + "POST", "replaced")
;

After some fiddling we came up with:

return input.replaceAll("(PRE)" + quotedPattern + "(POST)", "$1replaced$2");

This matches "PRE" and "POST", but references the captured subgroups with "$1" and "$2" so they don't disappear.

But we then tried the following instead:

return input.replaceAll("(?<=PRE)" + quotedPattern + "(?=POST)", "replaced");

"(?=POST)" is a zero-width positive lookahead, which is an incredibly cool technical name. It matches a pattern that does appear ahead ("positive lookahead") but this pattern won't be regarded when checking what parts of the string matched the regular expression ("zero-width"). "(?<=PRE)" is, similarly, a zero-width positive lookbehind. There are also negative lookaheads and lookbehinds that make sure the pattern doesn't appear.

Is there a more elegant way to do this?

2 comments:

  1. Lookaround can indeed provide some elegant solutions, but it tends to be less efficient, performance-wise (especially lookbehind). If you can avoid lookbehind (e.g. by matching then putting back parts of the match, as in your earlier example), your regexes will tend to perform better. Also note that java.util.regex only supports finite-length lookbehind, so although you can use e.g. ? or {n,n} inside a lookbehind pattern, you can't use *, +, or {n,}. (There are no such restrictions on lookahead patterns.)

    ReplyDelete
  2. Thanks Steve!

    I didn't know lookbehinds were expensive, or that they're finite-length in Java.

    Do you know if there's a regular expression optimization guide that's aware of the differences between different engines?

    ReplyDelete