From: Tom Christiansen <tchrist@mox.perl.com> Subject: SRC: Matching multiple patterns Date: 22 May 1998 13:28:53 GMT Ever want to check a variable number of matches against one string? It runs about 10x slower per match than if it were a constant string. For example, let say we wanted print out lines that contain any of a number of strings -- at word boundaries. #!/usr/bin/grep # popgrep1 - grep for abbreviations of some places in the # North American heartland where they say "pop" # version 1: slow but obvious way @popstates = qw(CO ON MI WI MN); LINE: while (defined($line = <>)) { for $state (@popstates) { if ($line =~ /\b$state\b/) { print; next LINE; } } } Here's the normal workaround: #!/usr/bin/perl # popgrep2 - grep for abbreviations of places that say "pop" # version 2: eval strings; fast but hard to quote @popstates = qw(CO ON MI WI MN); $code = 'while (defined($line = <>)) {'; for $state (@popstates) { $code .= "\tif (\$line =~ /\\b$state\\b/) { print \$line; next; }\n"; } $code .= '}'; print "CODE IS\n----\n$code\n----\n" if 0; # turn on to debug eval $code; die if $@; Or, using the fixerupper notion on here docs presented in a previous posting: #!/usr/bin/perl # popgrep2 - grep for abbreviations of places that say "pop" # version 2: eval strings; fast but hard to quote @popstates = qw(CO ON MI WI MN); sub fix { local $_ = shift; s/^\s*<NEWCODE> ?//gm; return $_; } $code = fix <<"START_CODE"; <NEWCODE> while (defined(\$line = <>)) { START_CODE for $state (@popstates) { $code .= fix <<" MIDDLE_CODE"; <NEWCODE> if (\$line =~ /\\b$state\\b/) { <NEWCODE> print \$line; <NEWCODE> next; <NEWCODE> } MIDDLE_CODE } $code .= '}'; print "CODE IS\n----\n$code\n----\n" if 1; # turn on to debug eval $code; die if $@; That's got a problem in that the patterns can't have slashes. It's also hard to quote and understand. Here's a solution: #!/usr/bin/perl # popgrep3 - grep for abbreviations of places that say "pop" # version 3: use jfriedl's build_match_func algorithm @popstates = qw(CO ON MI WI MN); $expr = join('||', map { "m/\\b\$popstates[$_]\\b/o" } 0..$#popstates); $match_any = eval "sub { $expr }"; die if $@; while (<>) { print if &$match_any; } That solves the problems of 1) speed 2) slashes in pattern 3) quoting difficulties, but not the 4) hard to understand part. There's also a RegExp module from CPAN: #!/usr/bin/perl # popgrep4 - grep for abbreviations of places that say "pop" # version 4: use Regexp module use Regexp; @popstates = qw(CO ON MI WI MN); @poppats = map { Regexp->new( '\b' . $_ . '\b') } @popstates; while (defined($line = <>)) { for $patobj (@poppats) { print $line if $patobj->match($line); } } You might wonder about the comparative speeds of these approaches. When run against the 22,000 line text file (the Jargon File, to be exact), version 1 ran in 7.92 seconds, version 2 in merely 0.53 seconds, version 3 in just 0.79 seconds, and version 4 in 1.74 seconds. The last technique is a lot easier to understand than the others, although it does run a little bit slower than they do. It's also more flexible. Something like it will someday make its way into the standard release of Perl. --tom -- There's going to be no serious problem after this. --Ken Thompson