Regular Expressions: Splitting Pipes

It’s a common saying in IT: “I had a problem and used regular expressions to solve it. Now I have two problems.” With this series of mgm “Hacking Java Puzzler” blog entries we want to help and demonstrate that regular expressions can be useful anyway. This first episode focuses on splitting CSV lines.

CSV is a very old file format, and parsing its simplest form is easy. There are dedicated libraries for this, but Java has built-in support as well. The String class offers the split() method, which takes a regular expression and splits the string around its matches. According to its Javadoc, it yields the same result as the split() method of the java.util.regex.Pattern class.

Have a look at this short example:

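import java.util.Arrays;

// SplitExample is only a placeholder class name so that the snippet compiles
public class SplitExample {
    public static void main(String[] args) {
        final String text = "a|b|c";
        // Escaped pipe: matches a literal '|'
        final String delimiterPattern = "\\|";

        final String[] columns = text.split(delimiterPattern);

        System.out.println(Arrays.toString(columns));
    }
}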

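Our values in the CSV are delimited by the pipe character ‘|’. This real-world example is taken from a legacy application that represents lists as pipe-concatenated strings, like "a|b|c" in the code above. Because split() takes a regular expression and the pipe is the alternation operator there, the delimiter is written as "\\|". The example works the same way for any other delimiter character, as long as regex metacharacters are escaped.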

So, what does it print? Yes, it’s “[a, b, c]”. Well done.
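For other delimiters, escaping by hand is easy to get wrong. A minimal sketch, assuming the delimiter should always be taken literally, uses Pattern.quote() from the JDK; the class name QuotedDelimiterDemo and the helper splitLiteral are made up for this illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and the helper are illustrative only.
class QuotedDelimiterDemo {

    // Splits text on a literal delimiter, whatever characters it contains.
    static String[] splitLiteral(String text, String delimiter) {
        // Pattern.quote() turns the delimiter into a literal pattern, so regex
        // metacharacters such as '|' or '.' need no manual escaping.
        return text.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitLiteral("a|b|c", "|"))); // [a, b, c]
        System.out.println(Arrays.toString(splitLiteral("a.b.c", "."))); // [a, b, c]
        // Unescaped, "|" is an empty alternation: it matches between characters,
        // so the result is not the three columns.
        System.out.println(Arrays.toString("a|b|c".split("|")));
    }
}
```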

Puzzler 1: Warming up

Now for the first simple Java puzzler. We want to do the same, but our first two columns are empty:

    final String text = "||c";

What does it print?

  1. [, , c]
  2. [null, null, c]
  3. An exception is thrown
  4. None of the above

As expected it’s “[, , c]” and the first two columns are empty strings.

Puzzler 2: You’ll be surprised

Now it gets a little harder. We change the text so that all columns are empty:

    final String text = "||";

What does it print?

  1. [, , ]
  2. [null, null, null]
  3. An exception is thrown
  4. None of the above

Well, you’ll be surprised: it’s number 4, and what will be printed is this: “[]”.
One might want to shout out loud: “That’s *%$#& stupid. I’ll never understand regular expressions!”

What just happened?

If you blame regular expressions for this unexpected result you are actually barking up the wrong tree. Yes, it is implemented in the regex package, but let’s read the JavaDoc:

  • String.split(): [...] trailing empty strings will be discarded.
  • Pattern.split(): Trailing empty strings are [...] not included in the resulting array.

If you have a look at the code of the Pattern.split() method, you will find something like this:

// Taken from JDK's Pattern class

int resultSize = matchList.size();
if (limit == 0)
    while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
        resultSize--;
String[] result = new String[resultSize];
return matchList.subList(0, resultSize).toArray(result);

These lines actively remove trailing empty strings, as documented. In other words, the result you expected is deliberately discarded.
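The effect of this trimming can be sketched with a few inputs. Only trailing empty strings are dropped; leading and interior ones survive. The class name TrailingEmptyDemo and the helper columns are made up for this illustration.

```java
import java.util.Arrays;

// Hypothetical demo class; the name and the helper are illustrative only.
class TrailingEmptyDemo {

    // Same call as in the puzzlers: one-argument split(), i.e. limit == 0.
    static String[] columns(String text) {
        return text.split("\\|");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(columns("||c")));  // [, , c]  leading empties kept
        System.out.println(Arrays.toString(columns("a||c"))); // [a, , c] interior empty kept
        System.out.println(Arrays.toString(columns("a||")));  // [a]      trailing empties dropped
        System.out.println(Arrays.toString(columns("||")));   // []       everything is trailing
    }
}
```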

What’s the reason for this API design? Does anybody have a clue? I don’t.

How to Get it Right

The workaround is quite simple: just make sure that limit != 0 when Pattern.split() runs. How? Luckily, there’s a variant of the split() method that takes the limit as a second parameter, and a negative limit keeps all trailing empty strings. The following small change does the job (note the -1 as the second parameter):

final String[] columns = text.split(delimiterPattern, -1);

In my opinion this should have been the default behavior.
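To see how the limit parameter changes the result, here is a small sketch that splits the same text with different limits. The class name SplitLimitDemo, the helper splitWithLimit and the sample text "a|b||" are made up for this illustration.

```java
import java.util.Arrays;

// Hypothetical demo class; the name and the helper are illustrative only.
class SplitLimitDemo {

    // Splits on a literal pipe with the given limit.
    static String[] splitWithLimit(String text, int limit) {
        return text.split("\\|", limit);
    }

    public static void main(String[] args) {
        final String text = "a|b||";
        // limit == 0: same as split(regex), trailing empty strings are discarded
        System.out.println(Arrays.toString(splitWithLimit(text, 0)));  // [a, b]
        // limit < 0: pattern applied as often as possible, nothing is discarded
        System.out.println(Arrays.toString(splitWithLimit(text, -1))); // [a, b, , ]
        // limit > 0: at most 'limit' elements, the last one holds the unsplit rest
        System.out.println(Arrays.toString(splitWithLimit(text, 2)));  // [a, b||]
    }
}
```

A positive limit also keeps trailing empty strings, but it caps the number of columns, so a negative value is the safer choice when the column count is not known in advance.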

Another solution is to directly use the regex API:

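// Requires java.util.ArrayList, java.util.List, java.util.regex.Matcher and java.util.regex.Pattern
final String text = "||";
// The lookbehind lets a match start only at the beginning of the text or right after a pipe
Pattern pattern = Pattern.compile("(?<=^|\\|)[^|]*");
Matcher matcher = pattern.matcher(text);
List<String> columns = new ArrayList<String>();

while (matcher.find()){
    columns.add(matcher.group());
}

The sub-expression “[^|]*” matches any run of characters that are not a pipe symbol. This includes the empty words in our sample text. The lookbehind “(?<=^|\|)” in front of it ensures that a match can only begin at the start of the text or directly after a pipe. Without it, find() would also report an extra empty match after every non-empty column of a text like "a|b|c".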

Using the regex API is a little more work, but it is the natural fit for a related problem: extracting only the non-empty words from a CSV line. Using split() will always return leading and interior empty words (as you can see in the first simple puzzler). With the regex API the pattern simply becomes “[^|]+”, because the asterisk means ‘zero or more’ while the plus quantifier means ‘one or more’, so empty columns no longer match.
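The non-empty variant can be sketched as follows. The class name NonEmptyColumnsDemo and the helper nonEmptyColumns are made up for this illustration; only the pattern “[^|]+” comes from the text above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and the helper are illustrative only.
class NonEmptyColumnsDemo {

    // One or more characters that are not a pipe: empty columns never match.
    private static final Pattern NON_EMPTY = Pattern.compile("[^|]+");

    // Collects every non-empty column of a pipe-delimited line.
    static List<String> nonEmptyColumns(String text) {
        final List<String> columns = new ArrayList<>();
        final Matcher matcher = NON_EMPTY.matcher(text);
        while (matcher.find()) {
            columns.add(matcher.group());
        }
        return columns;
    }

    public static void main(String[] args) {
        System.out.println(nonEmptyColumns("||c"));   // [c]
        System.out.println(nonEmptyColumns("a||c|")); // [a, c]
        System.out.println(nonEmptyColumns("||"));    // []
    }
}
```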

One Response to “Regular Expressions: Splitting Pipes”

  1. seth says:

    Concerning “What’s the reason for this API design? Does anybody have a clue? I don’t.”

    Probably Java just wants to be compatible with Perl 5. The Perl manual says:
    “In time-critical applications, it is worthwhile to avoid splitting into more fields than necessary.”[http://perldoc.perl.org/functions/split.html]

    (Maybe this behavior will change in Perl6[http://www.perl6.org/archive/rfc/361.html]. If so, then it will become interesting whether Java will follow that new idea, too. ;-) )