Perl: Greedy and Ungreedy Matching

Greedy and Ungreedy Matching

Perl regular expressions normally match the longest string possible. For instance:

my($text) = "mississippi";
$text =~ m/(i.*s)/;
print $1 . "\n";

Run the preceding code, and here’s what you get:


It matches the first i, the last s, and everything in between them. But what if you want to match the first i to the s most closely following it? Use this code:

my($text) = "mississippi";
$text =~ m/(i.*?s)/;
print $1 . "\n";

Now look what the code produces:


Clearly, the use of the question mark makes the match ungreedy. But theres another problem in that regular expressions always try to match as early as possible.



One thing to watch out for in non-greedy matching:


One thing to watch out for: given a pattern such as /^(.*?)%(.*?)/ one
could match and extract the first two fields of a like of % separated

#!/usr/bin/perl -w
use strict;
$_ = ‘Johnson%Andrew%AX321%37’;
print “$2 $1\n”;

And one can easily begin to think of each subexpression as
meaning ‘match up to the next % symbol’, but that isn’t exactly what it
means. Let’s say that the third field represents an ID tag and we want
to extract only those names of people with ID tags starting with ‘A’.
We might be tempted to do this:

#!/usr/bin/perl -w
use strict;
while () {
print “$2 $1\n” if m/^(.*?)%(.*?)%A/;

This would print out:

Andrew Johnson
John%BC142 Smith

But that isn’t what we wanted at all — what happened? Well, the second
half of the regex does not say match up to the next % symbol and then
match an ‘A’, it says, match up to the next % symbol that is followed
by an ‘A’. The pattern ‘(.*?)’ part is not prevented from matching and
proceeding past a % character if that is what is necessary to cause the
whole regex to succeed. What we really wanted in this case was a
negated character class:

#!/usr/bin/perl -w
use strict;
while () {
print “$2 $1\n” if m/^([^%]*)%([^%]*)%A/;

Now we are saying exactly what we want: the first subexpression grabs
zero or more of anything except a % character, then we match a %
character, then the second subexpression also grabs zero or more of
anything but a % character, and finally we match ‘%A’ or we fail.

To summarize, a greedy quantifier takes as much as it can get, and a
non-greedy quantifier takes as little as possible (in both cases only
while still allowing the entire regex to succeed). Take care in how you
use non-greedy quantifiers — it is easy to get fooled into using one
where a negated character class is more appropriate.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s