a/tct-pattern

From: Tom Christiansen <tchrist@mox.perl.com>
Subject: SRC: Matching multiple patterns
Date: 22 May 1998 13:28:53 GMT

Ever want to check a variable number of matches against one string?
It runs about 10x slower per match than if it were a constant string.
For example, let say we wanted print out lines that contain any 
of a number of strings -- at word boundaries.

    #!/usr/bin/grep
    # popgrep1 - grep for abbreviations of some places in the 
    # 		 North American heartland where they say "pop"
    # version 1: slow but obvious way
    @popstates = qw(CO ON MI WI MN);
    LINE: while (defined($line = <>)) {
	for $state (@popstates) { 
	    if ($line =~ /\b$state\b/) {
		print; next LINE;
	    } 
	}
    } 

Here's the normal workaround:

    #!/usr/bin/perl
    # popgrep2 - grep for abbreviations of places that say "pop"
    # version 2: eval strings; fast but hard to quote
    @popstates = qw(CO ON MI WI MN);
    $code = 'while (defined($line = <>)) {';
    for $state (@popstates) { 
	$code .= "\tif (\$line =~ /\\b$state\\b/) { print \$line; next; }\n";
    } 
    $code .= '}';
    print "CODE IS\n----\n$code\n----\n" if 0;  # turn on to debug
    eval $code;
    die if $@;

Or, using the fixerupper notion on here docs presented in a previous 
posting:

    #!/usr/bin/perl
    # popgrep2 - grep for abbreviations of places that say "pop"
    # version 2: eval strings; fast but hard to quote
    @popstates = qw(CO ON MI WI MN);
    sub fix {
	local $_ = shift;
	s/^\s*<NEWCODE> ?//gm;
	return $_;
    } 

    $code = fix <<"START_CODE";
	    <NEWCODE> while (defined(\$line = <>)) {
    START_CODE

    for $state (@popstates) {
	$code .= fix <<"    MIDDLE_CODE";
	     <NEWCODE>      if (\$line =~ /\\b$state\\b/) {
	     <NEWCODE>          print \$line;
	     <NEWCODE>          next;
	     <NEWCODE>      }
	MIDDLE_CODE
    }
    $code .= '}';
    print "CODE IS\n----\n$code\n----\n" if 1;  # turn on to debug
    eval $code;
    die if $@;

That's got a problem in that the patterns can't have slashes.
It's also hard to quote and understand.  Here's a solution:

    #!/usr/bin/perl
    # popgrep3 - grep for abbreviations of places that say "pop"
    # version 3: use jfriedl's build_match_func algorithm
    @popstates = qw(CO ON MI WI MN);
    $expr = join('||', map { "m/\\b\$popstates[$_]\\b/o" } 0..$#popstates);
    $match_any = eval "sub { $expr }";
    die if $@;
    while (<>) {
	print if &$match_any;
    }

That solves the problems of 1) speed 2) slashes in pattern 3) quoting
difficulties, but not the 4) hard to understand part.

There's also a RegExp module from CPAN:

    #!/usr/bin/perl
    # popgrep4 - grep for abbreviations of places that say "pop"
    # version 4: use Regexp module
    use Regexp;
    @popstates = qw(CO ON MI WI MN);
    @poppats   = map { Regexp->new( '\b' . $_ . '\b') } @popstates;
    while (defined($line = <>)) {
	for $patobj (@poppats) {
	    print $line if $patobj->match($line);
	}
    }

You might wonder about the comparative speeds of these approaches.
When run against the 22,000 line text file (the Jargon File, to be exact),
version 1 ran in 7.92 seconds, version 2 in merely 0.53 seconds, version 3
in just 0.79 seconds, and version 4 in 1.74 seconds.  The last technique
is a lot easier to understand than the others, although it does run a
little bit slower than they do.  It's also more flexible.  Something like
it will someday make its way into the standard release of Perl.

--tom
-- 
There's going to be no serious problem after this.  --Ken Thompson