Regular Expressions

Beyond the fundamentals

Peter Makholm

Copenhagen.pm

Disclaimer

The example in this talk
MUST NOT[0]
be viewed as an endorsement of doing calendar calculations with regular expressions




[0] This is to be interpreted as described in RFC 2119

Beyond the fundamentals

So you learned the basics, what's next?

In a 40-minute time slot I can't give you a lot of practice, so my main focus will be on the other parts

Advanced features

Most of the advanced features regarding captures and backreferences are new in Perl 5.10

General syntax

Look around assertions

- examples
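
The slide's own examples are not reproduced here. As a minimal sketch of the four look-around forms (the sample strings and variable names are made up for illustration):

```perl
use strict;
use warnings;

# Positive lookahead: match "foo" only when "bar" follows; "bar" is not consumed.
my @followed     = "foobar foobaz" =~ /foo(?=bar)/g;

# Negative lookahead: match "foo" only when "bar" does not follow.
my @not_followed = "foobar foobaz" =~ /foo(?!bar)/g;

# Positive lookbehind: digits that are preceded by a dollar sign.
my ($price) = 'cost: $42, weight: 17' =~ /(?<=\$)(\d+)/;

# Negative lookbehind: digits preceded by neither a dollar sign nor a digit.
my ($other) = 'cost: $42, weight: 17' =~ /(?<![\$\d])(\d+)/;

print "price=$price other=$other\n";
```

The assertions only test the surrounding text; they never become part of the matched string.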

Captures and backreferences

Perl 5.10 provides some new options for captures and back references:
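
A short sketch of three of the 5.10 additions: named captures read through %+, named backreferences with \k<name>, and relative backreferences with \g{-1}. The sample strings and variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use 5.010;

# Named captures are read back through the %+ hash.
my %parts;
if ( "2009-05-23" =~ /(?<year>\d{4})-(?<month>\d\d)-(?<day>\d\d)/ ) {
    %parts = %+;    # copy, because the next successful match overwrites %+
    say "year=$parts{year} month=$parts{month} day=$parts{day}";
}

# Named backreference: \k<quote> must match the same text as (?<quote>...).
my $body;
if ( q{say "hello" now} =~ /(?<quote>["'])(?<body>.*?)\k<quote>/ ) {
    $body = $+{body};
}

# Relative backreference: \g{-1} is the group opened most recently before it,
# so the pattern keeps working when it is embedded in a larger regexp.
my $doubled = qr/(\w)\g{-1}/;
my ($pair)  = "bookkeeper" =~ /($doubled)/;

say "body=$body pair=$pair";
```

With an absolute \1 the last pattern would break as soon as $doubled is wrapped in another capture group; the relative form does not care how many groups come before it.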

Global matches

With global matches enabled, Perl keeps track of the position it has reached while performing matches on a string
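
A minimal sketch of that position tracking, using pos() to inspect it (the sample string is made up):

```perl
use strict;
use warnings;

my $text = "aa bb cc";
my @seen;

# In scalar context each successful /g match resumes where the last one ended.
while ( $text =~ /(\w+)/g ) {
    push @seen, "$1 ends at " . pos($text);
}
print "$_\n" for @seen;

# A failed match normally resets the position; the /c modifier keeps it.
my $ok = $text =~ /\w+/g;     # matches "aa", position is now 2
$ok    = $text =~ /\d+/gc;    # fails, but /c leaves the position alone
my $kept = pos($text);
$ok    = $text =~ /\d+/g;     # fails without /c, position is reset
my $reset = pos($text);       # undef
```

The combination of \G and /gc is what lets a lexer try several patterns in turn at the same position.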

- A simple lexer


while(<>) {
    print(" lowercase") and redo if /\G[a-z]+\s+/gc;
    print(" uppercase") and redo if /\G[A-Z]+\s+/gc;
    print(" digits")    and redo if /\G[0-9]+\s+/gc;
    print(" mixed")     and redo if /\G[a-zA-Z0-9]+\s+/gc;
    print(" noise")     and redo if /\G[^a-zA-Z0-9]+\s+/gc;
    print "\n";
}

Independent subexpressions

This is also called "atomic matching"
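
As a sketch of what "independent" means: once (?>...) has matched, the regexp engine will not backtrack into it. The sample string is an assumption for illustration.

```perl
use strict;
use warnings;

my $str = "aaab";

# Ordinary group: a+ first takes "aaa", then gives one "a" back so "ab" can match.
my $backtracking = $str =~ /a+ab/ ? 1 : 0;

# Independent subexpression: (?>a+) keeps all the "a"s it took, so "ab" cannot match.
my $atomic = $str =~ /(?>a+)ab/ ? 1 : 0;

print "backtracking=$backtracking atomic=$atomic\n";
```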

Possessive quantifiers

The possessive quantifiers are a special case of independent subexpressions:
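
A small sketch of the equivalence, assuming Perl 5.10 or later: a++ behaves like (?>a+), and likewise for *+ and ?+. The sample strings are made up.

```perl
use strict;
use warnings;
use 5.010;

# a++ never gives back what it matched, exactly like (?>a+).
my $possessive = "aaab" =~ /a++ab/    ? 1 : 0;
my $atomic     = "aaab" =~ /(?>a+)ab/ ? 1 : 0;

# A typical use: a quoted string that fails quickly when the closing quote is missing.
my $quoted       = qr/"[^"]*+"/;
my $terminated   = '"abc"' =~ $quoted ? 1 : 0;
my $unterminated = '"abc'  =~ $quoted ? 1 : 0;

say "possessive=$possessive atomic=$atomic";
say "terminated=$terminated unterminated=$unterminated";
```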

The qr// operator

The qr// operator turns the regexp into a value you can store and pass around

@regexp = ( qr/foo/, qr/bar/, qr/baz/ );
while(defined( $line = <> )) {
    # print the line if at least one of the stored regexps matches it
    print $line if grep { $line =~ $_ } @regexp;
}
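
Stored regexps can also be interpolated into larger ones. A minimal sketch, with pattern and variable names invented for illustration:

```perl
use strict;
use warnings;

my $num  = qr/\d+/;
my $unit = qr/(?:px|em|%)/i;           # the /i flag travels with the object

# Build a larger pattern out of the two pieces.
my $dimension = qr/^($num)($unit)$/;

my ($n, $u) = "12EM" =~ $dimension;
print "number=$n unit=$u\n";
```

Each qr// object keeps its own modifiers, so the case-insensitive unit does not make the rest of the pattern case-insensitive.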

Keep It Simple, Stupid

It is easy to try to solve everything with a single regular expression but ...

KISS, the alternative

However, don't end up like this

qr/
   (((0[48]|[2468][048]|[13579][26])00|\d\d(0[48]|[2468]
   [048]| [13579] [26])) |(( [02468] [1235679]|  [13579]
   [01345789])00|\d\d([02468][1235679]|[13579][01345789]
   ))(?!(-|)02\g{-1}29))(?<sep>-|)(02(?!\g{sep}3)|0[469]
   (?!\g{sep}31)|11(?!\g{sep} 31)|0[13578]|1[02])\g{sep}
   (0[1-9]|[12][0-9]|3[01])
  /x;

Add spaces and comments

For improved readability, use /x to add whitespace and comments:


$isLeapYear =
  qr/(
    # Either century divisible by 400:
      ( 0[48] | [2468][048] | [13579][26] ) 00
    |
    # Or year divisible by 4, but not a century
      \d\d ( 0[48] | [2468][048] | [13579][26] )
     )/x;

The missing and operator

Often people want to match "this" and "that" in a single regexp. It's possible, but ...

Which one is easier to read?

A benchmark shows the latter to be up to 3 times faster
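
The two variants from the slide are not reproduced here. As an assumption about what such a comparison might look like, here is a sketch with a single regexp built from lookaheads next to two simple matches joined with &&:

```perl
use strict;
use warnings;

my @lines = ( "this and that", "that before this", "only this", "neither" );

# One regexp: two lookaheads anchored at the start, each scanning the whole string.
my @single = grep { /^(?=.*this)(?=.*that)/s } @lines;

# Two regexps: plain matches combined with Perl's own "and".
my @two = grep { /this/ && /that/ } @lines;

print "single: @single\n";
print "two:    @two\n";
```

Both select the same lines; the second form says what it means without any look-around.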

Extract and validate in Perl

Remember this?

qr/
   (((0[48]|[2468][048]|[13579][26])00|\d\d(0[48]|[2468]
   [048]| [13579] [26])) |(( [02468] [1235679]|  [13579]
   [01345789])00|\d\d([02468][1235679]|[13579][01345789]
   ))(?!(-|)02\g{-1}29))(?<sep>-|)(02(?!\g{sep}3)|0[469]
   (?!\g{sep}31)|11(?!\g{sep} 31)|0[13578]|1[02])\g{sep}
   (0[1-9]|[12][0-9]|3[01])
  /x;

It validates dates of the form 'YYYY-MM-DD' or 'YYYYMMDD'.
Exercise: Add support for 'dd/mm/yyyy'

- Validating dates

Try this instead:

my @daysInMonth = (  0, 31, 28, 31, 30, 31,
                    30, 31, 31, 30, 31, 30, 31);

# Meant to run inside a validation sub; leapYear() is defined elsewhere.
if( /(?<y>\d\d\d\d)-(?<m>\d\d)-(?<d>\d\d)/ ) {
  return if $+{m} == 0
         || $+{d} == 0
         || $+{m} > 12;
  return 1 if $+{d} <= $daysInMonth[ $+{m} ];
  return 1 if $+{m} == 2
           && $+{d} == 29
           && leapYear( $+{y} );
  return;
}
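
The leapYear() helper is not shown on the slide. A minimal sketch of what it could look like, assuming the usual Gregorian rule:

```perl
use strict;
use warnings;

# Hypothetical helper: true for Gregorian leap years.
sub leapYear {
    my ($year) = @_;
    return 1 if $year % 400 == 0;   # 1600, 2000, ...
    return 0 if $year % 100 == 0;   # 1900, 2100, ...
    return $year % 4 == 0 ? 1 : 0;  # every other fourth year
}

print "$_: ", leapYear($_), "\n" for 1900, 2000, 2008, 2009;
```
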

Of course, there are modules which are even easier...

I've benchmarked the two solutions. For ordinary valid dates the regexp seems to be 3 times faster. For 2000-02-29 the regexp seems to be 10 times faster. The named captures seem to account for a lot of the slowdown: with plain old numbered captures the two solutions perform equally on ordinary valid dates.
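
The talk does not say which modules it has in mind. As one possibility, the core module Time::Local range-checks the day and month and dies on impossible dates, so a sketch could lean on that. The function name valid_date is hypothetical, and since Time::Local treats years below 1000 specially, this is only meant for years from 1000 on.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Hypothetical helper: extract with a simple regexp, let the module validate.
sub valid_date {
    my ($date) = @_;
    my ( $y, $m, $d ) = $date =~ /^(\d{4})-(\d\d)-(\d\d)$/
        or return 0;
    # timegm() croaks when the day or month is out of range.
    return eval { timegm( 0, 0, 0, $d, $m - 1, $y ); 1 } ? 1 : 0;
}

for my $date (qw( 2000-02-29 2001-02-29 2009-04-31 2009-12-31 )) {
    print "$date: ", ( valid_date($date) ? "ok" : "invalid" ), "\n";
}
```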

Never parse HTML or XML

Regexp::Common: Don't reinvent

Ready-to-use regexps for a lot of common cases:

Test::Regexp: Testing Regular expressions


use Test::Regexp 'no_plan';

match    subject      => "Foo bar",
         keep_pattern => qr /(?<first_word>\w+)\s+(\w+)/,
         captures     => [[first_word => 'Foo'], ['bar']];

no_match subject      => "Baz",
         pattern      => qr /Quux/;

Abigail gave a talk about this a few hours before my talk was scheduled.

Questions?

?