32

Say you have a document containing an essay. You want to parse the essay to select only certain words. Cool.

Is using a regular expression faster than parsing the file line by line and word by word looking for a match? If so, how does it work? How can you go faster than looking at each word?

Casey Patton
  • 5,241
lazeR
  • 439

7 Answers

49

How does it work?

Take a look at automata theory.

In short, each regular expression has an equivalent finite automaton and can be compiled and optimized to a finite automaton. The involved algorithms can be found in many compiler books. These algorithms are used by unix programs like awk and grep.
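To make the idea concrete, here is a minimal Python sketch of a hand-built finite automaton equivalent to the regex ab*c; the state numbering and table layout are just for illustration, since a real engine generates such tables from the pattern automatically:

```python
import re

# Hand-built DFA equivalent to the regular expression a b* c (full match).
# Transition table: state -> {input character -> next state}
DFA = {
    0: {'a': 1},
    1: {'b': 1, 'c': 2},
}
ACCEPT = {2}

def dfa_match(text):
    """Run the automaton once over the input: one table lookup per character."""
    state = 0
    for ch in text:
        state = DFA.get(state, {}).get(ch)
        if state is None:
            return False        # no transition for this character: reject
    return state in ACCEPT

pattern = re.compile(r'ab*c')   # the same language, written as a regex

for s in ['ac', 'abbbc', 'abx', 'bc']:
    assert dfa_match(s) == bool(pattern.fullmatch(s))
```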

However, most modern programming languages (Perl, Python, Ruby, Java (and JVM-based languages), C#) do not use this approach. They use a recursive backtracking approach, which compiles a regular expression into a tree or a sequence of constructs representing the various sub-chunks of the expression. Most modern "regular expression" syntaxes also offer backreferences, which fall outside the class of regular languages (they have no finite-automaton representation) but are trivial to implement with recursive backtracking.
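For example (a small Python illustration; the pattern here is my own), a backreference lets you match "the same word twice", a language no finite automaton can recognize, which is why engines that support it fall back to backtracking:

```python
import re

# \1 must repeat whatever group 1 matched, which requires remembering
# arbitrary text -- something a finite automaton cannot do.
repeated = re.compile(r'\b(\w+) \1\b')

print(repeated.search('the the cat'))   # matches 'the the'
print(repeated.search('the cat sat'))   # None
```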

The optimization usually yields a more efficient state machine. For example, consider aaaab|aaaac|aaaad: a normal programmer can get a simple but less efficient implementation (comparing the three strings separately) right in ten minutes; but realizing the expression is equivalent to aaaa[bcd], a better search can be done by looking for the first four 'a's and then testing the fifth character against [b, c, d]. This optimization process was one of my compiler homework assignments many years ago, so I assume it is also in most modern regular expression engines.
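A sketch of the two approaches in Python (assuming a hypothetical scanning task; exact timings depend on the engine and the input):

```python
import re

text = 'x' * 1_000_000 + 'aaaac'

def naive_search(s):
    # The 'ten-minute' implementation: three independent substring scans.
    hits = [i for i in (s.find('aaaab'), s.find('aaaac'), s.find('aaaad')) if i != -1]
    return min(hits) if hits else -1

# The factored form: look for 'aaaa' once, then test the fifth character.
factored = re.compile(r'aaaa[bcd]')

assert naive_search(text) == factored.search(text).start()
```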

On the other hand, state machines do have an advantage when accepting strings, at the cost of using more space than a "trivial implementation": the extra states remember the input history. Consider a program to un-escape quoting in SQL strings, that is, strings that 1) start and end with single quotation marks, and 2) escape a single quotation mark with two consecutive single quotation marks. So input ['a'''] should yield output [a']. With a state machine, the consecutive single quotation marks are handled by two states. These two states remember the input history, such that each input character is processed exactly once, as illustrated below:

...
S1->'->S2
S1->*->S1, output *, * can be any other character 
S2->'->S1, output '
S2->*->END, end the current string
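A direct Python transcription of that machine (names and error handling are kept minimal; this is only meant to show that each character is inspected once):

```python
def unescape_sql(s):
    """Un-escape a SQL string literal, e.g. 'a'''  ->  a'"""
    assert s.startswith("'")
    out = []
    state = 'S1'                 # S1 = inside the literal
    for ch in s[1:]:             # skip the opening quote
        if state == 'S1':
            if ch == "'":
                state = 'S2'     # could be the closing quote or an escape
            else:
                out.append(ch)   # ordinary character
        else:                    # S2: the previous character was a quote
            if ch == "'":
                out.append("'")  # '' collapses to a single quote
                state = 'S1'
            else:
                break            # the quote actually closed the literal
    return ''.join(out)

print(unescape_sql("'a'''"))     # a'
```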

So, in my opinion, regular expressions may be slower in some trivial cases, but they are usually faster than a manually crafted search algorithm, given that the optimization cannot be reliably done by hand.

(Even in trivial cases like searching a string, a smart engine can recognize the single path in the state map and reduce that part to a simple string comparison and avoid managing states.)

A particular engine from a framework/library may be slow because it does a bunch of other things a programmer usually doesn't need. Example: the Regex class in .NET creates a number of objects, including Match, Groups and Captures.
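The same concern shows up in other languages; as a rough Python analogue (not the .NET API), asking the engine for a full match result builds an object you may not need, while a plain substring test for a fixed word avoids the engine entirely:

```python
import re

haystack = 'some long line of text mentioning ERROR somewhere'

# Full-featured call: builds a Match object carrying groups, spans, etc.
m = re.search(r'(ERROR|WARN)', haystack)
if m:
    print(m.group(1), m.span())

# If all you need is a yes/no for a fixed word, a substring test
# skips both the regex engine and the result objects.
if 'ERROR' in haystack:
    print('found')
```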

Robert Harvey
  • 200,592
Codism
  • 1,213
17

Regular expressions just look fast because you have fast computers.

Back in the 1980s, when 1 MIPS was a fast computer, regular expressions were a fairly big area of worry, concern and research because they were slow, ugly and compute-intensive. Clever algorithm development followed and helped, but for all practical purposes these days you are seeing the miracle of fast machines papering over the cracks.

quickly_now
  • 15,060
7

Why do you think they are quicker than searching the document?

There are some tricks you can do. For example, if you are searching for a 10-letter word beginning with A and ending with B, then if you find an A and the character 9 positions further on isn't a B, you can skip ahead; see the Boyer–Moore string-search algorithm. A sketch follows below.
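A minimal Python sketch of that skip idea (a simplified Boyer–Moore–Horspool search; production libraries implement this far more carefully, usually in C):

```python
def horspool_find(text, word):
    """Check the character at the end of the current window first, and when it
    cannot line up with the pattern, skip several positions at once."""
    n, m = len(text), len(word)
    if m == 0:
        return 0
    # How far we may safely shift when character c appears at the window's end.
    skip = {c: m - 1 - i for i, c in enumerate(word[:-1])}
    i = 0
    while i + m <= n:
        if text[i + m - 1] == word[-1] and text[i:i + m] == word:
            return i
        i += skip.get(text[i + m - 1], m)
    return -1

print(horspool_find('the quick brown fox', 'brown'))   # 10
```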

6

Regexes are generally faster than code you might write yourself, because most libraries are the result of many developers spending many years optimizing them to squeeze out every last bit of performance possible. It's difficult for a single individual to duplicate that in their own search code.

GrandmasterB
  • 39,412
5

What makes a regular expression fast?

Actually, they're not. Not that much. It's just that they're not slow enough for most of us to notice. Back in the old, slow days, it was much more noticeable.

They're also not the right tool for every job; treat them as one and they become the proverbial hammer that makes every problem look like a nail.

Rook
  • 19,947
4

Your basic premise is wrong.

Regular expressions are not always faster than a simple search. It all depends on context. It depends on the complexity of the expression, the length of the document being searched, and a whole host of factors.

What happens is that the regular expression will be compiled into a simple parser (which takes time). Thus, if the document is small, this extra time will outweigh any advantage. Also, if the expression is simple, then the regular expression will not give you any advantage.
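For example (a Python sketch; Python's re module also caches recently compiled patterns, so the effect is smaller there than in a library without such a cache), compiling once and reusing the result amortizes that setup cost across many documents:

```python
import re

documents = ['first short note', 'another line of text', 'error: disk full']

# Compiling (or at least looking up) the pattern on every call:
def search_each_time(docs):
    return [bool(re.search(r'\berror\b', d)) for d in docs]

# Compiling once up front and reusing the compiled object:
ERROR_RE = re.compile(r'\berror\b')

def search_precompiled(docs):
    return [bool(ERROR_RE.search(d)) for d in docs]

assert search_each_time(documents) == search_precompiled(documents)
```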

If the expression is complex and the document large enough, then you can gain some benefit. Whether this is significant enough to consider regular expressions to be faster will depend a lot on how much effort you want to put into the search (also regular expressions may have some optimizations that a library could provide that you would not have thought of yourself).

What I am trying to say is that there is no generalized, blanket answer. If you had a specific expression (and a known document size), then you could derive a yes/no answer as to whether the expression will be quicker than a simple search (and why).

The real advantage of regular expressions is the ability, once you understand how to write them, to express a complex search concisely. Because they are a generalized form, you can build tools that make searching useful in the general case; such a search is usually at least as fast as a simple search on documents above some minimum size (on smaller documents it would not matter, since even if it is slower, it is still fast enough).

JimmyJames supports Canada
  • 30,578
Loki Astari
  • 11,190
1

It is plausible that in some high-level languages (perhaps JavaScript), using a regex library implemented in a low-level language (perhaps C) would be faster than writing parser logic in the high-level language.

Plausible - I have no idea if this is ever actually the case.
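One way to check it for a given workload is to measure both directly; a hedged Python sketch (the word list and input size here are made up, and the numbers will vary by machine):

```python
import re
import timeit

text = 'the quick brown fox jumps over the lazy dog ' * 10_000
wanted = {'fox', 'dog'}

def manual(text):
    # Parser logic written in the high-level language itself.
    return [w for w in text.split() if w in wanted]

word_re = re.compile(r'\b(?:fox|dog)\b')

def with_regex(text):
    # The matching loop runs inside the C-implemented regex engine.
    return word_re.findall(text)

assert manual(text) == with_regex(text)
print('manual:', timeit.timeit(lambda: manual(text), number=20))
print('regex :', timeit.timeit(lambda: with_regex(text), number=20))
```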