50

Regular expressions are a powerful tool in a programmer's arsenal, but there are cases where they are not the best choice, or are even outright harmful.

The canonical example is parsing HTML with a regexp, a well-known road to numerous bugs. The same probably applies to parsing in general.

But are there other clear no-go areas for regular expressions?


P.S.: "The question you're asking appears subjective and is likely to be closed." So I want to emphasize that I am interested in examples where the use of regexps is known to cause problems.

c69

5 Answers

60

Don't use regular expressions:

  • When there are parsers.

This isn't limited to HTML. Even simple, valid XML cannot reasonably be parsed with a regular expression, even if you know the schema and know it will never change.

Don't try, for example, to pick apart C# source code with regular expressions. Parse it instead, to get a meaningful tree structure or the tokens.
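
To illustrate the XML point, here is a minimal Python sketch. The sample document and the `item` element name are made up for illustration; `xml.etree.ElementTree` is Python's standard-library parser, and most other languages ship something comparable.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: an <item> element nested inside another <item>.
doc = "<root><item id='1'><item id='2'>inner</item></item></root>"

# A naive pattern pairs the first opening tag with the first closing tag,
# so the "content" of the outer item is cut off in the middle.
naive = re.search(r"<item[^>]*>(.*?)</item>", doc).group(1)

# A real parser returns a tree that keeps the nesting intact.
root = ET.fromstring(doc)
outer = root.find("item")    # the item with id='1'
inner = outer.find("item")   # the item with id='2'
```

Here `naive` ends up holding an unbalanced fragment of markup, while `outer` and `inner` are proper elements whose attributes and text can be read directly.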

  • More generally, when you have better tools to do your job.

What if you must search for a letter in both lowercase and uppercase? If you love regular expressions, you'll use them. But isn't it easier, faster, and more readable to do two plain searches, one after the other? Chances are that in most languages you'll get better performance and more readable code.

For instance, the sample code in Ingo's answer is a good example of where you should not use regular expressions. Just search for foo, then for bar.

  • When parsing human writing.

A good example is an obscenity filter. Not only is it a bad idea to implement one in general, but you may also be tempted to do it with regular expressions, and you'll do it wrong. There are plenty of ways a human can write a word, a number, or a sentence that another human will understand but your regular expression will not. So instead of catching real obscenity, your regular expression will spend its time hurting other users.
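
A small Python sketch of how this goes wrong. The one-word blocklist and the sample messages are invented for illustration; this is not a recommendation for how to build a filter.

```python
import re

# Hypothetical one-word blocklist, for illustration only.
naive_filter = re.compile(r"hell", re.IGNORECASE)

def is_flagged(message: str) -> bool:
    """Return True if the naive pattern matches anywhere in the message."""
    return naive_filter.search(message) is not None

# Innocent messages that the pattern blocks anyway.
innocent = ["Hello, Michelle!", "Try the shell script"]
false_positives = [m for m in innocent if is_flagged(m)]

# Trivially disguised spellings that slip through.
disguised = ["h e l l", "h3ll", "he11"]
false_negatives = [m for m in disguised if not is_flagged(m)]
```

Both innocent messages are flagged and all three disguised spellings pass. Tightening the pattern (word boundaries, character classes for look-alike digits) fixes individual cases but tends to grow without ever covering what a human reader would understand.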

  • When validating some types of data.

For example, don't validate an e-mail address with a regular expression. In most cases, you'll do it wrong. In the rare case you do it right, you'll end up with a 6,343-character coding horror.

Without the right tools, you will make mistakes, and you will notice them at the last moment, or maybe never. If you don't care about clean code, you'll write a twenty-line string with no comments, no spaces, and no newlines.

  • When your code will be read. And then read again, and again and again, every time by different developers.

Seriously, if I take over your code and must review or modify it, I don't want to spend a week trying to understand a twenty-line string full of symbols.

ChrisF
18

The most important thing: when the language you are parsing is not a regular language.

HTML is not a regular language, so parsing it with a regular expression is not merely difficult or a road to buggy code: it is not possible.

Matteo
12

On Stack Overflow one often sees people ask for regexes that determine whether a given string does not contain this or that. This is, IMHO, reversing the purpose of regular expressions. Even if a solution exists (using negative lookbehind assertions or the like), it is often much better to use the regex for what it was made for and handle the negative case with program logic.

Example:

# bad
if (/complicated regex that assures the string does NOT contain foo|bar/) {
    # do something
}

# appropriate
if (/foo|bar/) {
    # error handling
} else {
    # do something
}
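
For comparison, a runnable Python rendition of the same idea. The negated pattern here uses a negative lookahead and is only one possible way to write the "bad" version; the function names are made up for this sketch.

```python
import re

# "Bad": one pattern that matches only strings containing neither word.
NEGATED = re.compile(r"^(?!.*(?:foo|bar)).*$", re.DOTALL)

# "Appropriate": match the positive case, negate in program logic.
FORBIDDEN = re.compile(r"foo|bar")

def is_clean_negated(text: str) -> bool:
    """The whole decision is hidden inside the pattern."""
    return NEGATED.match(text) is not None

def is_clean_plain(text: str) -> bool:
    """The pattern says what to look for; the code says what to do about it."""
    return FORBIDDEN.search(text) is None
```

Both functions agree on inputs such as "hello" and "a foo b", but the second one can be read at a glance.
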
Ingo
5

Two cases:

When there is an easier way

  • Most languages provide a simple function like INSTR to determine whether one string is a substring of another. If that's what you want to do, use the simpler function. Don't write your own regular expression.

  • If there is a library available for performing a complex string manipulation, use it rather than writing your own regular expression.
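
In Python, for instance, the rough counterparts of INSTR are the `in` operator and `str.find`. This is a minimal sketch with made-up sample strings:

```python
haystack = "regular expression"

# Membership test: no pattern syntax, no escaping to worry about.
has_exp = "exp" in haystack

# Position of the first occurrence, or -1 if there is none.
position = haystack.find("exp")
missing = haystack.find("regex")

# Characters that are special in a regex are just text here.
has_dot = "." in "file.txt"
```

Nothing needs escaping, which removes a whole class of mistakes that appear when the search text happens to contain characters like `.` or `*`.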

When regular expressions are not sufficiently powerful

  • If you need a parser, use a parser.
Kramii
0

Regular expressions cannot identify recursive structures. This is the fundamental limitation.

Take JSON: it is a pretty simple format, but since an object may contain other objects as member values (arbitrarily deep), the syntax is recursive and cannot be parsed by a regex. CSV, on the other hand, can be parsed by regexes since it does not contain any recursive structures.
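
A short Python sketch of what "arbitrarily deep" means in practice. The sample document and the `depth` helper are made up for illustration; `json.loads` is the standard-library parser.

```python
import json

def depth(value) -> int:
    """Nesting depth of a parsed JSON value; scalars count as 0."""
    if isinstance(value, dict):
        return 1 + max((depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((depth(v) for v in value), default=0)
    return 0

# Hypothetical document with objects and arrays nested inside each other.
data = json.loads('{"a": {"b": {"c": [1, 2, {"d": null}]}}}')
nesting = depth(data)
```

The helper has to call itself to follow the structure, which is exactly the capability the answer says a regular expression lacks.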

In short, regular expressions do not allow the pattern to refer to itself. You cannot say: at this point in the syntax, match the whole pattern again. To put it another way, a regular expression only matches linearly; it has no stack that would allow it to keep track of how deep it is in a nested pattern.
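
The missing stack is easy to show with a bracket checker. This is a minimal Python sketch; the function name and the set of bracket pairs are chosen for illustration.

```python
# Closing bracket -> the opening bracket it must match.
PAIRS = {")": "(", "]": "[", "}": "{"}

def is_balanced(text: str) -> bool:
    """Return True if every bracket in `text` is closed in the right order."""
    stack = []
    for ch in text:
        if ch in PAIRS.values():
            stack.append(ch)              # remember what is currently open
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False              # wrong or unexpected closer
    return not stack                      # nothing may be left open
```

The list `stack` is the piece of state that records how deep the scan currently is, and it can grow without bound as the input nests deeper.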

Note that this has nothing to do with how complex or convoluted the format is otherwise. S-expressions are really simple, but cannot be parsed with a regex. CSS2, on the other hand, is a pretty complex language, but it does not contain recursive structures and therefore can be parsed with a regex. (Although this is not true for CSS3 due to CSS expressions, which have a recursive syntax.)

So the problem is not that parsing HTML using only regex is ugly, complex, or error-prone. It is simply not possible.

If you need to parse a format which contains recursive structures, you need to at least supplement the regular expressions with a stack that keeps track of the level of nesting. This is typically how a parser works. Regular expressions are used to recognize the "linear" parts, while custom code outside the regex keeps track of the nested structures.

Usually parsing like this is split into separate phases. Tokenization is the first phase where regular expressions are used to split the input into a sequence of "tokens" like words, punctuation, brackets etc. Parsing is the next phase where these tokens are parsed into a hierarchical structure, a syntax tree.
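
The two phases can be sketched in Python for the S-expressions mentioned earlier. This is a simplified illustration, not a complete reader: atoms are kept as plain strings, there is no support for quoted strings or comments, and the function names are made up.

```python
import re

def tokenize(text: str) -> list:
    """Phase 1: a regex splits the input into parentheses and atoms."""
    return re.findall(r"[()]|[^\s()]+", text)

def _parse(tokens: list, pos: int):
    """Phase 2: recursive descent; returns (tree, next position)."""
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        items = []
        pos += 1
        while pos < len(tokens) and tokens[pos] != ")":
            item, pos = _parse(tokens, pos)   # nested lists recurse here
            items.append(item)
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        return items, pos + 1                 # skip the closing ')'
    if tok == ")":
        raise ValueError("unexpected ')'")
    return tok, pos + 1                       # a plain atom

def parse_sexpr(text: str):
    """Parse exactly one S-expression into nested Python lists."""
    tokens = tokenize(text)
    tree, pos = _parse(tokens, 0)
    if pos != len(tokens):
        raise ValueError("trailing tokens")
    return tree
```

For example, `parse_sexpr("(a (b c) d)")` yields the nested list `["a", ["b", "c"], "d"]`. The regex only ever sees flat tokens; the recursion in `_parse` (with the call stack doing the bookkeeping) is what tracks the nesting.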

So when you hear that HTML or C# cannot be parsed by regular expressions, be aware that regular expressions are still a critical part of the parsers. You just cannot parse such a language using only regular expressions and no helper code.

JacquesB