Possibly the most malicious Regular Expression

This is the next of my episode on regular expressions. Today, we look at the worst regex you can possibly come up with, although it looks innocent and simple. You will learn about this backtracking trap that let’s you easily wait for 10^30 steps, as an example of an errant email regex will illustrate. One possible solution we investigate is the use of possessive quantifiers.

Well, some of you might think: All regular expressions are malicious.

If you are one of them: You have no idea how bad they can become!

Greedily Looking for ‘aa‘

Have a look at this short example:

public static void main (String[] args) {

    final String pattern = "(aa|aab?)*";

    // two 'a'
    final String a002 = "aa";
    // three 'a'
    final String a003 = "aaa";
    // 102 'a'
    final String a102 = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
    // 103 'a'
    final String a103 = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";

    System.out.println(a002.matches(pattern)) ;
    System.out.println(a003.matches(pattern)) ;
    System.out.println(a102.matches(pattern)) ;
    System.out.println(a103.matches(pattern)) ;
}

What does it print?

  1. true false true false
  2. Throws an exception
  3. Does not compile
  4. None of the above

The regular expression “(aa|aab?)*” matches all strings that are made up of “aa” or “aab“. I.e., it matches for even numbers of ‘a’ and does not match for odd number of ‘a’. (I know. The same can be done with a simple “(aab?)*” but just follow me.)

So in theory it should print “true false true false” (answer 1 above). But actually, it prints “true false true ” (answer 4). The final “false” will come some billion years later if you are patient and your computer does not want you to reboot in the meantime.

It’s not a bug, It’s a feature

We got caught in a backtracking trap. Let’s have a look at a shorter example: “aaaaa“. The following table shows what the java regular engine does to see if it matches to our pattern.

Step Group 1 Group 2 Group 3 Result
1 (aa) (aa) (aa) fail
2 (aa) (aa) (aab?) fail
3 (aa) (aab?) (aa) fail
4 (aa) (aab?) (aab?) fail
5 (aab?) (aa) (aa) fail
6 (aab?) (aa) (aab?) fail
7 (aab?) (aab?) (aa) fail
8 (aab?) (aab?) (aab?) fail

Explanation:

  • Step 1: The engine finds two “aa“, tries to find a third “aa” and fails.
  • Step 2: The engine finds two “aa“, tries to find a “aab?” next and fails.
  • Step 3: The engine starts backtracking and replaces the second “aa” with “aab” and fails again for both “aa” …
  • Step 4: and “aab?“.
  • And so on.

The result is: “aaaaa” does not match “(aa|aab?)+“.

In our example with 103 ‘a’ the engine will execute 2^100 = 10^30 steps, that is 1000 billion billion billion. The regex “(aa|aab?)*” looks so innocent and simple but is probably the worst regex I ever saw.

But backtracking is useful. If you match “<a><b></b></a>” to the regex “<(.*)>(.*)<\1>” you will find a match of opening and closing tags. I think there is no regular expression for a regex engine without backtracking that can do this match. Please comment on this!

Switching to Possessive Quantifiers

What will change if we use a slightly different regex? We change the quantifier from a greedy “*” to a possessive “*+“.

final String pattern = "(aa|aab?)*+";

What does it print?

  1. true false true false
  2. Throws an exception
  3. Does not compile
  4. None of the above

Since we use a possessive quantifier the regular expression engine will not do backtracking. So the process stops after step 2 and the “aa” in ‘Group 2′ won’t be replaced for “aab?“. The first result is printed at once.

A Case Study with Email Addresses

Let’s take the example of email address. Let’s look at two similar regular expressions for it:

[a-z0-9](([\-.]|[_]+)?([a-z0-9]+))*@[a-z0-9]+[.](([a-z]{2,3})|([a-z]{2,3}[.]{1}[a-z]{2,3}))
[a-z0-9_\+-]+(?:\.[a-z0-9_\+-]+)*@[a-z0-9-]+(?:\.[a-z0-9-]+)*\.(?:[a-z]{2,4})

Do you spot the difference? The first expression is as evil as “(aa|aab?)*” and gets stuck if you try to match e.g. "aaaaaaaaaaaaaaaaaaaaa!" while the second is fine. I.e. if you use the first regex to validate email addresses in your Java code your thread gets stuck for a few million years and your system is vulnerable to a “regular expression denial of service” attack. In principle all regular expression engines that allow backtracking will have the same problem. Imagine what will happen to your database system if there are a few thousand invalid email addresses to be stored and you use an malicious regex in the DB!

There is no simple way to recognize regular expressions which fall into the evil category. But here are some hints:

  • Be careful with options that imply each other: “aa” vs. “aab?“.
  • Try to avoid cascading quantifiers like in “(a+)+“.
  • Try to be possessive like in “(aa|aab?)*+“.

Have a look at the Wikipedia article on Redos for more information and additional examples for malicious regular expressions.

I’m curious to hear your “war” stories: Did you actually have this problem in your systems on production? What did you do to find it and fix it?

Share

Leave a Reply

You must be logged in to post a comment.

4 Responses to “Possibly the most malicious Regular Expression”

  1. Tobi says:

    Interesting example. But that seems to be an Java related issue. If you test your expression on http://regexpal.com/ or in C# you will have no problems with that regex. Therefore not the regex is bad, just the implementation in Java is bad. Maybe there are even bad regex in C# but I never got problems with it.

    For C# I got the following example that runs without any problems:

    Regex ex = new Regex(“(aa|aab?)*”);

    // two ‘a’
    String a002 = “aa”;
    Console.WriteLine(ex.IsMatch(a002)); // True
    // three ‘a’
    String a003 = “aaa”;
    Console.WriteLine(ex.IsMatch(a003)); // True
    // 102 ‘a’
    String a102 = “aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa”;
    Console.WriteLine(ex.IsMatch(a102)); // True

    // 103 ‘a’
    String a103 = “aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa”;
    Console.WriteLine(ex.IsMatch(a103)); // True

    // The expected False
    Console.WriteLine(ex.Match(a103).Value == a103);

    Console.ReadKey();

    • tbudde says:

      Late reply. But important:
      There is nothing ‘bad’ about the Java regex engine. It just uses backtracking. http://regexpal.com/ doesn’t and fails to match ‘aabaa’ for ‘(aa|aab?)*’

      A regex engine can either be fast OR support backtracking. I (and Java) perfer the later.

  2. msm says:

    @tbudde – it is just not true.

    http://swtch.com/~rsc/regexp/regexp1.html

    Regular Expression Matching Can Be Simple And Fast
    (but is slow in Java, Perl, PHP, Python, Ruby, …)

    It is possible to match any regular expression in /linear/ ( O(n) ) time (with respect to input size). I have once implemented regular expression compiler (regex -> NFA -> DFA -> jitted code) using that article.

    • Tobias Budde says:

      A quote from the summary of that article:

      ‘With the exception of backreferences, the features provided by the slow backtracking implementations can be provided by the automata-based implementations at dramatically faster, more consistent speeds.’

      Is it true that an engine can be faster than the Java implementation and support backtracking (but not backreferences)? Does your implementation work that way?