From: Tom Christiansen <tchrist@mox.perl.com>
Subject: SRC: Matching multiple patterns
Date: 22 May 1998 13:28:53 GMT
Ever want to check a variable number of matches against one string?
It runs about 10x slower per match than if it were a constant string.
For example, let say we wanted print out lines that contain any
of a number of strings -- at word boundaries.
#!/usr/bin/grep
# popgrep1 - grep for abbreviations of some places in the
# North American heartland where they say "pop"
# version 1: slow but obvious way
@popstates = qw(CO ON MI WI MN);
LINE: while (defined($line = <>)) {
for $state (@popstates) {
if ($line =~ /\b$state\b/) {
print; next LINE;
}
}
}
Here's the normal workaround:
#!/usr/bin/perl
# popgrep2 - grep for abbreviations of places that say "pop"
# version 2: eval strings; fast but hard to quote
@popstates = qw(CO ON MI WI MN);
$code = 'while (defined($line = <>)) {';
for $state (@popstates) {
$code .= "\tif (\$line =~ /\\b$state\\b/) { print \$line; next; }\n";
}
$code .= '}';
print "CODE IS\n----\n$code\n----\n" if 0; # turn on to debug
eval $code;
die if $@;
Or, using the fixerupper notion on here docs presented in a previous
posting:
#!/usr/bin/perl
# popgrep2 - grep for abbreviations of places that say "pop"
# version 2: eval strings; fast but hard to quote
@popstates = qw(CO ON MI WI MN);
sub fix {
local $_ = shift;
s/^\s*<NEWCODE> ?//gm;
return $_;
}
$code = fix <<"START_CODE";
<NEWCODE> while (defined(\$line = <>)) {
START_CODE
for $state (@popstates) {
$code .= fix <<" MIDDLE_CODE";
<NEWCODE> if (\$line =~ /\\b$state\\b/) {
<NEWCODE> print \$line;
<NEWCODE> next;
<NEWCODE> }
MIDDLE_CODE
}
$code .= '}';
print "CODE IS\n----\n$code\n----\n" if 1; # turn on to debug
eval $code;
die if $@;
That's got a problem in that the patterns can't have slashes.
It's also hard to quote and understand. Here's a solution:
#!/usr/bin/perl
# popgrep3 - grep for abbreviations of places that say "pop"
# version 3: use jfriedl's build_match_func algorithm
@popstates = qw(CO ON MI WI MN);
$expr = join('||', map { "m/\\b\$popstates[$_]\\b/o" } 0..$#popstates);
$match_any = eval "sub { $expr }";
die if $@;
while (<>) {
print if &$match_any;
}
That solves the problems of 1) speed 2) slashes in pattern 3) quoting
difficulties, but not the 4) hard to understand part.
There's also a RegExp module from CPAN:
#!/usr/bin/perl
# popgrep4 - grep for abbreviations of places that say "pop"
# version 4: use Regexp module
use Regexp;
@popstates = qw(CO ON MI WI MN);
@poppats = map { Regexp->new( '\b' . $_ . '\b') } @popstates;
while (defined($line = <>)) {
for $patobj (@poppats) {
print $line if $patobj->match($line);
}
}
You might wonder about the comparative speeds of these approaches.
When run against the 22,000 line text file (the Jargon File, to be exact),
version 1 ran in 7.92 seconds, version 2 in merely 0.53 seconds, version 3
in just 0.79 seconds, and version 4 in 1.74 seconds. The last technique
is a lot easier to understand than the others, although it does run a
little bit slower than they do. It's also more flexible. Something like
it will someday make its way into the standard release of Perl.
--tom
--
There's going to be no serious problem after this. --Ken Thompson