23

Exhibit 1, Exhibit 2. I guess you won't find it hard to recall other examples.

Thing is: if there is more than one way to solve a problem, the PHP programmer (I usually browse the PHP tag on StackOverflow) will ask for help on the solution involving regular expressions.

Even when it is less economical, and even when the PHP manual suggests (link) using str_replace instead of any preg_* or ereg_* function when no fancy substitution rules are required.
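To make the manual's advice concrete, here is a minimal sketch. It is written in Python rather than PHP, so `str.replace` and `re.sub` stand in for `str_replace` and `preg_replace`. The helper names and the sample string are made up for illustration.

```python
import re


def swap_plain(text: str, old: str, new: str) -> str:
    # Literal substitution: no pattern syntax, nothing to escape.
    return text.replace(old, new)


def swap_regex(text: str, old: str, new: str) -> str:
    # The regex route only matches literally if the needle is escaped first.
    # The replacement is passed as a function so that backslashes in `new`
    # are not interpreted as group references.
    return re.sub(re.escape(old), lambda _match: new, text)


price = "Total: $5.00 (was $5.00)"
print(swap_plain(price, "$5.00", "$4.50"))
print(swap_regex(price, "$5.00", "$4.50"))

# Forgetting the escape silently changes the meaning: "$" anchors at the end
# of the string and "." matches any character, so nothing is replaced here.
print(re.sub("$5.00", "$4.50", price))
```

Both helpers give the same result, but the regex version needs two extra precautions just to behave like the plain one.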

Does somebody have a clue about why this happens?

Don't get me wrong, some of my best friends are regular expressions and I don't despise Perl. What I don't get is why nobody looks for alternatives at all, even when the overkill is obvious (regex to switch strings) or the code complexity rises exponentially (regex for getting data from HTML in PHP).

cbrandolino

17 Answers

49

When the only tool you have is a regex, every problem looks like ^((?>[a-zA-Z\d!#$%&'*+\-/=?^_{|}~]+\x20*|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*"\x20*)*(?<angle><))?((?!\.)(?>\.?[a-zA-Z\d!#$%&'*+\-/=?^_{|}~]+)+|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*")@(((?!-)[a-zA-Z\d\-]+(?<!-)\.)+[a-zA-Z]{2,}|\[(((?(?<!\[)\.)(25[0-5]|2[0-4]\d|[01]?\d?\d)){4}|[a-zA-Z\d\-]*[a-zA-Z\d]:((?=[\x01-\x7f])[^\\\[\]]|\\[\x01-\x7f])+)\])(?(angle)>)$

glenatron
23

I think it's because:

  1. They are fantastically concise (when used properly) compared to the equivalent code, and
  2. They are widely supported across programming languages, so most developers are familiar with them.

hallidave
23

In earlier phases of my career (i.e. pre-PHP), I was a Perl guru, and one major aspect of Perl gurudom is mastery of regular expressions.

On my current team, I'm literally the only one of us who reaches for regex before other (usually nastier) tools. Seems like to the rest of the team they're pure magic. They'll wheel over to my desk and ask for a regex that takes me literally ten seconds to put together, and then be blown away when it works. I don't know--I've worked with them so long, it's just natural at this point.

In the absence of regex fluency, you're left with combinations of flow-control statements wrapping strstr and strpos calls, which get ugly and hard to run in your head. I'd much rather craft one elegant regex than thirty lines of plodding string searching.
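As a rough illustration of that contrast, here is a sketch in Python rather than PHP, with `str.find` playing the role of strpos. The task (pulling a version number out of a header-like string) and both function names are hypothetical. The two versions agree on well-formed input but are not guaranteed to agree on every malformed one.

```python
import re
from typing import Optional


def version_with_find(header: str) -> Optional[str]:
    # Manual search: locate the marker, then walk forward while the
    # characters still look like part of a dotted version number.
    marker = "PHP/"
    start = header.find(marker)
    if start == -1:
        return None
    start += len(marker)
    end = start
    while end < len(header) and (header[end].isdigit() or header[end] == "."):
        end += 1
    candidate = header[start:end].rstrip(".")
    if candidate and candidate[0].isdigit():
        return candidate
    return None


def version_with_regex(header: str) -> Optional[str]:
    # The same intent stated as one pattern: "PHP/" followed by dotted digits.
    match = re.search(r"PHP/(\d+(?:\.\d+)*)", header)
    return match.group(1) if match else None


print(version_with_find("X-Powered-By: PHP/5.3.2"))
print(version_with_regex("X-Powered-By: PHP/5.3.2"))
```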

Dan Ray
20

Why are regular expressions so morbidly attractive?

Because on the subconscious level they feel like an entire smart program that can accomplish a lot of its own accord while being encompassing and self-adjusting (think patterns).

This is why people immediately believe regular expressions will solve any of their text-based tasks, somehow not thinking it might be overkill and not realizing it might be underkill (parsing languages with it).

A tiny thing containing magic power. You can't say no, can you?

15

On the contrary. People are parroting the "regex are evil" meme way too often IMO. It's obvious that preg_match is overused in PHP, but it's less obvious that it's oftentimes sensible to do so (in PHP).

I would go so far as to conjecture that using the string functions is yet another micro-optimization in PHP land. There are many of them, many are useful, and they are usually the better choice. But you shouldn't shun preg_match in favour of multiple strpos calls and if chains, because in practice libpcre is often faster than PHP can execute a loop looking for string alternatives.

As a recent example made me realize, testing if a string is all-lowercase:

 if ($string == strtolower($string))

is more readable than:

 if (!preg_match("/[A-Z]/", $string))

And you would assume the first must be faster, since it's all-PHP. But in reality the regex only looks over the string once, and can abort the negated condition as soon as it finds an uppercase letter. The strtolower() approach, however, looks over the string twice. First strtolower() makes a duplicate of the string by iterating over each letter, comparing and lowercasing it. Then the == iterates over the original and the copy again, comparing them once more.

So that's not an obvious case. And to be objective the first one is often faster, since you normally just compare short strings. But it's imperative to not go blindly by the assumption that PHP string functions are always advisable over regular expressions.
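The two checks can be put side by side in a small sketch. It uses Python rather than PHP, so `str.lower` and `re` stand in for strtolower() and preg_match, and the function names are made up. It makes no timing claim, and it shows that the two tests are not strictly equivalent once non-ASCII letters are involved (in this Python version; PHP's behaviour depends on its own locale and encoding rules).

```python
import re

HAS_ASCII_UPPER = re.compile(r"[A-Z]")


def is_lower_by_copy(text: str) -> bool:
    # Builds a lowercased copy, then compares it with the original:
    # two passes over the data.
    return text == text.lower()


def is_lower_by_regex(text: str) -> bool:
    # Scans once and stops at the first ASCII uppercase letter it finds.
    return HAS_ASCII_UPPER.search(text) is None


for sample in ("hello world", "hello World", "École"):
    print(sample, is_lower_by_copy(sample), is_lower_by_regex(sample))
```

The last sample is where they disagree: the copy-and-compare version notices the accented capital, while the `[A-Z]` class does not.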

(I'm tempted to add another rant about @bobince's fun answer regarding xhtml-regexes, and how it's recently often linked in a very unhelpful manner. And the more objective answers below go ignored.)

mario
8

Regular expressions are very attractive because they are the best tool for parsing a regular language.

They have the following advantages:

  • They are concise. It generally takes a lot more code to parse a specific regular language using a specific algorithm that you have come up with than with a regexp.
  • They are quick to use. It generally takes a lot more time to write a parser for a specific regular language using a specific algorithm that you have come up with than with a regexp.
  • They are easy. Once you learn the set of special characters and their meanings, it is easy to compose a regexp (although a little harder to read them). Regexps are languages themselves - a useful trait because our species has evolved to be very good at language.
  • They are fast. Once compiled to a finite automaton, they can match a string of length N in O(N) time.
  • They are flexible. They can match any regular language and a lot of our data is expressed as a regular language.
  • They are ubiquitous. Most programming languages have basic regexp support - either through external libraries or embedded into the language itself. There is also not too much variation between the regexp languages themselves.

This makes them attractive for situations to which they are suited, but people may use them in contexts where they are not the best tool, because they:

  • Don't understand that what they are matching can't be expressed using a regexp (e.g. HTML).
  • Are lazy (in a bad way) - they know a tool and recognise that it isn't the best tool for what they are doing, but it will work without problems 95% of the time and takes 5% of the effort of learning a particular parser or writing one from scratch.
  • Are unaware that better tools exist.
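A small sketch of the HTML case, in Python rather than PHP: the standard-library `html.parser` module is the "better tool" here, and the class name, function names and sample markup are all made up for illustration. The regex is deliberately the naive kind people tend to write first, so this shows one way such a pattern goes wrong rather than a proof about regular languages.

```python
import re
from html.parser import HTMLParser
from typing import List


class LinkCollector(HTMLParser):
    """Collects the href of every <a> start tag the parser reports."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_with_parser(html: str) -> List[str]:
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def links_with_regex(html: str) -> List[str]:
    # Naive pattern: assumes double quotes and href as the first attribute,
    # and knows nothing about comments.
    return re.findall(r'<a href="([^"]*)"', html)


sample = """<a href="/one">1</a>
<a class="x" href='/two'>2</a>
<!-- <a href="/three">3</a> -->"""

print(links_with_parser(sample))
print(links_with_regex(sample))
```

On this sample the parser reports the two real links, while the regex misses the single-quoted one and picks up the link that is commented out.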
david4dev
6

Hmmm, I can only guess. Maybe some people have experienced that 30 lines of their code were replaced by a 20-character-long regex, so it feels wrong to them to use anything else instead when regexes can be used.

user281377
4

It fits with how some people think. I don't like them, but I have friends who seem to think in regexps. I guess the pattern matching part of their brain is more exposed than the formal logic one. :-)

3

I think the ubiquity of regex is due to the ubiquity of strings. The string is the simplest data structure, the first one that most of us learn. Since all of our code is written in symbolic form, it is natural for a programmer to consider modelling something in symbolic form. But if our programming language offers any resistance when we try to extend its syntax for our clever new symbolic forms, they all end up between quotes. The relational data model has SQL. The XML data model has XQuery. But what about the humble string data model? Regex!

Just yesterday, I was looking over the API for a shiny new JavaScript framework that supports HTML5 game development. It has a declarative mechanism for describing the main subsystems that your game would need. How does one specify those features? JSON? Fluent dot notation? An array? Nope -- a string containing a comma- and whitespace-separated list of feature names. I wonder how it parses that list...?

WReach
2

Because you can see the whole thing at once. By being able to see the whole thing, it can be easier to work with, and that's always nice. It's sort of like the reason that many C++ programmers still use printf-type statements: It's not typesafe (though gcc at least can check types on printf statements), and it's not pretty, but boy is it compact and usable.

If a regex is simple enough, it often IS the best way to do things - its compact form and many capabilities make it perfect for certain tasks. The problem comes when you make the regex so complicated that you can't read it anymore, or when you're using a complex regex to do something that could be more quickly done via simple string operations.

Regex, like any other powerful tool, must be used in proper moderation - not too much, not too little. And unless performance is a big concern, a single regex may at times be quicker to write and easier to debug than a series of string operations.

Michael Kohne
2

Hmm, the current answers center too much on technical aspects, and the readability pros/cons (which is an important point). So let me try to shift it a bit more onto the PHP environment/community:

  • PHP is Perl's little stepsister. And regular expressions are an integral part of Perl (they invented that stuff, didn't they?). That's one reason why regexps are pervasive in PHP too.
  • The use case of PHP is coincidentally not much unlike the use case for regular expressions. PHP is structurally used for gluing together HTML pages. And regexps work on text. (what WReach said)
  • Micro-optimization. As mentioned before: people frequently choose between regexps and PHP string functions based on perceived speed. A core problem in PHP circles, not specific to regexps.
  • Regular expressions are built-in. In Python, Java, C# (and Ruby?) they are available, but having to load an extra module is a deterrent. And see how the usage pattern differs in PHP or JavaScript, where it's a core feature. Another exhibit: CSS, where it's getting more frequently used.
  • The PHP manual is at fault. It often is. Regular expressions are easily discoverable, and I postponed this fun fact because it's boring in its obviousness: all the damn tutorials and PHP introduction books always teach about regular expressions, but fail to educate on use cases.
  • The string API in PHP was designed by the same people that brought you magic quotes and the namespace \ separator. It's encompassing, better than Java, but not glamorous in its entirety. Particularly if strings could double as objects (see Python), string functions might outdo regexps.

But those are just side notes. I believe it's mostly perceptual and technical reasons anyway that lead to overusing and/or shunning regular expressions in general. Yet PHP and its userbase have a few properties which compound it, which is why we see more questions on SO about it [citation needed!] and why they are "morbidly attractive" there.

mario
1

Why are regular expressions so morbidly attractive?

They're not. They're actually ugly as hell. And incomprehensible. They're an abomination that should be killed as soon as possible.

Now, this being said, I'm going back to debugging a little Perl app. Can't help it; unfortunately, they're still the best tool for the job sometimes.

Rook
1

In general I like regular expressions: I find them easier to read and understand than the 20 lines of code I would have to replace them with. Short regular expressions are quickly read and understood, and they are relatively easy to maintain (if the expression changes you only have one line to change, versus looking through the 20 lines of code to make the change). There are times when they are misused, but so are many other things.

The reason you probably see so much abuse of them is that you're browsing the PHP section of StackOverflow; as I am sure you are aware, there are a lot of, umm, immature PHP programmers out there.

stoj
0

In my experience, regexes are like an ancient art, something obscure. Some people resent them because they can't understand the sorcery involved, and maybe because nobody will explain them to you. I haven't heard of universities teaching them for something less trivial than matching an e-mail. Then there are the mystical inner workings: since most people don't understand them, they must be slow. And getting them to work fine on the first try is always a challenge for newcomers.

The same thing can be said about Perl, awk, Linux, and everything that has no shiny buttons or nice colored syntax. So, it's like added complexity to "trivial tasks": just throw in some loops, splits, a switch, some magic and that's it, something that might work. But well, if you are on the other side of the road, regexes are beautiful cookie cutters that look like signal noise, without any nasty loops or more stuff to debug. I like them also for the flexibility they provide. When the pattern to match changes, you just change the regex, not the algorithm or tool/whatever, and it's nice and working again. And since they are a magical string, you can put it outside the source code if you wish. And another thing that makes me think of Perl: if you write a regex that's 20+ chars long, it feels like you accomplished a lot. At least for me, it's just so neat and compact. I'm a lazy programmer also; I don't like writing a lot of code with nice indentation and comments and adding some bugs to the mix.

alfa64
0

Man is a tool-using creature, and regular expressions are powerful tools. A nice metaphor for regular expressions is a meat slicer from a deli. If you want paper-thin slices of turkey, corned beef, etc., it's just the thing. However, you need skilled hands to use it, because you can cut yourself really badly with it and you won't feel a thing until you see the blood. What I mean by this is that the big problem with regular expressions is getting them slightly off means that you match something you shouldn't, or vice versa, and you don't find out until it causes an issue further along in the process.
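A tiny sketch of that kind of cut, in Python (the pattern names and inputs are made up for illustration): one unescaped dot is enough to make a "decimal number" pattern accept things it shouldn't, and nothing complains when it does.

```python
import re

# Intended: digits, a literal dot, digits. The first pattern forgets to
# escape the dot, so "." matches any single character.
LOOSE = re.compile(r"\d+.\d+")
STRICT = re.compile(r"\d+\.\d+")


def looks_like_decimal(pattern: "re.Pattern[str]", text: str) -> bool:
    return pattern.fullmatch(text) is not None


for candidate in ("3.14", "3x14", "314"):
    print(candidate,
          looks_like_decimal(LOOSE, candidate),
          looks_like_decimal(STRICT, candidate))
```

The loose pattern accepts all three inputs, which only shows up later, when "3x14" reaches code that expected a number.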

0

Regular expressions are very attractive because they wield power. You can do a very complicated piece of work in very few characters.

The problem is that the standard regular expression construct is not Turing-complete, which means that there are programs you simply cannot implement with a regular expression, and people don't KNOW that when they are lured by the apparent power of regular expressions.

This - I guess - is the reason for the jwz-quote of "now they have two problems".

I would guess that Perl regular expressions are Turing-complete, but apparently it has not been decisively proved or disproved yet.

0

Because it's an efficient way to program a finite state machine, which is a powerful tool when it applies. It's basically its own language for programming FSMs, which is helpful if you know the language, annoying if you don't.
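To show what the regex is shorthand for, here is a sketch in Python of a hand-written state machine next to the pattern it corresponds to. The pattern (an unsigned decimal number) and the state names are made up for illustration, and this is not a claim about how any particular regex engine is implemented internally.

```python
import re

# One line of "FSM language".
PATTERN = re.compile(r"[0-9]+(?:\.[0-9]+)?")

# The same machine spelled out by hand.
START, INT, DOT, FRAC, REJECT = range(5)
ACCEPTING = {INT, FRAC}


def step(state: int, ch: str) -> int:
    # Transition function: current state plus one character gives next state.
    is_digit = "0" <= ch <= "9"
    if state == START:
        return INT if is_digit else REJECT
    if state == INT:
        if is_digit:
            return INT
        return DOT if ch == "." else REJECT
    if state in (DOT, FRAC):
        return FRAC if is_digit else REJECT
    return REJECT


def accepts(text: str) -> bool:
    # One pass over the input, one transition per character.
    state = START
    for ch in text:
        state = step(state, ch)
    return state in ACCEPTING


for sample in ("42", "3.14", "", ".5", "5.", "1.2.3"):
    print(repr(sample), accepts(sample), PATTERN.fullmatch(sample) is not None)
```

Changing what counts as a match means editing one string in the first version and redrawing the transition table in the second.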

DanTilkin