perl =~ operator interprets a RHS expression at run-time

This article starts with a conversation about adjusting compiled perl regular expressions and a sudden realisation – which went a little along the lines of:

OtherPerson> I can force a regular expression to change, for example:
OtherPerson> DB<1> $a = qr{Abc}
OtherPerson> DB<2> $a =~ s/A/a/
OtherPerson> DB<3> x $a
OtherPerson> 0  '(?^:abc)'
Me> No that doesn't work does it, oh wait it does, why?

It took me a few moments of poking at this myself in the debugger to notice the key change, which was that the compiled regular expression had become a string. In the example below, $a starts as a Regexp object and then becomes a plain string (at which point the ref function returns the empty string):

DB<7> $a = qr{Abc}
DB<8> x ref $a
0 'Regexp'
DB<9> x $a
0 (?^:Abc)
 -> qr/(?^:Abc)/
DB<10> $a =~ s/A/a/g
DB<11> x ref $a
0  ''
DB<12> x $a
0 '(?^:abc)'

This isn’t actually surprising. In “DB<10>”, when $a is modified using the s/// substitution operator, perl needs a string to operate on, so it stringifies the regular expression and stores the modified string back in $a. As a string, substituting A with a clearly results in ‘(?^:abc)’.
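The same sequence can be run outside the debugger. The following is a minimal sketch, assuming Perl 5.14 or later (where the stringified form uses the ‘(?^:…)’ syntax seen above); the variable names are only illustrative.

```perl
use strict;
use warnings;

my $re     = qr{Abc};
my $before = ref $re;    # a compiled pattern is a 'Regexp' object

# s/// edits strings, so the Regexp object is stringified and the
# modified string is stored back into the same variable.
$re =~ s/A/a/;
my $after = ref $re;     # a plain string is not a reference

print "ref before: '$before'\n";
print "ref after:  '$after'\n";
print "value now:  $re\n";
```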

But the surprising bit was that this then worked as a regular expression:

DB<21> print "match\n" if "Abc" =~ $a
DB<22> print "match\n" if "abc" =~ $a
match

So what’s happening here? It looks like perl is interpreting the string at run-time as a regular expression, and that’s exactly what is happening. If you read the documentation on the =~ operator (see extract below), it turns out that “If the right argument is an expression rather than a search pattern, substitution, or transliteration, it is interpreted as a search pattern at run time.”

So the string ‘(?^:abc)’ is literally being re-interpreted as a regular expression and then used.
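To see this without a qr// object involved at all, here is a small sketch that starts from an ordinary string literal. It assumes Perl 5.14 or later for the ‘(?^:…)’ syntax, and $pattern is just an illustrative name.

```perl
use strict;
use warnings;

my $pattern = '(?^:abc)';    # an ordinary string, never a qr// object

# The right-hand side of =~ is an expression here, so perl compiles
# its current value as a regular expression when the match runs.
my $matches_lower = ( "abc" =~ $pattern ) ? 1 : 0;
my $matches_upper = ( "Abc" =~ $pattern ) ? 1 : 0;

print "abc matches: $matches_lower\n";
print "Abc matches: $matches_upper\n";
```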

This works because a compiled regular expression stringifies to legal pattern syntax that reproduces the original, including any pattern modifiers that were applied at the end of the regular expression. For example, a case-insensitive regex like qr/abc/i becomes ‘(?^i:abc)’.
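A quick way to check the round trip is to stringify a pattern and compile the string again. This is only a sketch: it assumes Perl 5.14 or later, and it deliberately avoids pragmas such as "use v5.14" or "use utf8", which I would expect to add further flags to the stringified form.

```perl
use strict;
use warnings;

my $nocase     = qr/abc/i;
my $nocase_str = "$nocase";    # stringified form carries the /i flag inline

# Compile the string again; the inline flag keeps the match
# case-insensitive even though no /i is written here.
my $rebuilt      = qr/$nocase_str/;
my $still_nocase = ( "ABC" =~ $rebuilt ) ? 1 : 0;

print "stringified:  $nocase_str\n";
print "ABC matches:  $still_nocase\n";
```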

This little bit of silent interpretation has a number of implications – some of these being:

  • While it’s possible to use this to make all sorts of alterations to a regular expression, it’s not a good idea unless there’s a very good reason. It’s possible to take a regular expression which was syntax checked at compile time, make an invalid change to it, and then end up with an error when the modified string version is interpreted at run time:
    DB<25> $a =~ s/:/Z/
    DB<26> p $a
    (?^Zabc)
    DB<27> print "match\n" if "abc" =~ $a
    Sequence (?^Z...) not recognized in regex; marked by <-- HERE in m/(?^Z <-- HERE abc)/ at (eval 31)[/usr/share/perl/5.14/perl5db.pl:640] line 2.
  • If precompiled regular expressions are used as hash keys (which is legal and raises no errors or problems), you gain no performance from precompiling them; in fact the opposite. Your code compiles all the regular expressions, stringifies them to use them as hash keys, and then every time a key is used as a pattern it is re-interpreted as a regular expression. Note that using compiled regular expressions as hash values (not keys) does work as you would intend.
  • The right-hand argument (if it is not already a pattern) is interpreted as an expression, not just a string, so for example it’s possible to use a function which returns a regular expression (or a string which gets interpreted as a regular expression!):
    DB<31> sub pat { return rand() < 0.5 ? qr/a/ : qr/b/ }
    DB<32> p pat
    (?^:b)
    DB<33> p pat
    (?^:b)
    DB<34> p pat
    (?^:a)
    DB<35> print "yes\n" if "a" =~ pat
    yes
    DB<36> print "yes\n" if "a" =~ pat
    yes
    DB<37> print "yes\n" if "a" =~ pat
    yes
    DB<38> print "yes\n" if "a" =~ pat()
    
    DB<39> print "yes\n" if "a" =~ pat()
    
    DB<40> print "yes" if "a" =~ pat()
    
    DB<41> print "yes" if "a" =~ pat()
    yes
    DB<42> print "yes" if "a" =~ pat()
    yes
    DB<43> print "yes" if "a" =~ pat()
    yes
    DB<44> print "yes" if "a" =~ pat()
    yes
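The first two points above can be checked in a short script. This is a sketch rather than a recommended pattern: the variable and hash names are made up for illustration, and the exact wording of the error message may differ between perl versions, so the script only records that an error occurred.

```perl
use strict;
use warnings;

# 1. A bad edit is only detected when the string is next used as a pattern.
my $broken = qr/abc/;
$broken =~ s/:/Z/;    # stringifies, then corrupts the flag section
my $ok = eval { my $m = ( "abc" =~ $broken ); 1 };
my $error = $ok ? '' : $@;

# 2. Hash keys are always strings, so a qr// object used as a key is
#    stringified, while one stored as a value stays a Regexp object.
my $re       = qr/abc/i;
my %as_key   = ( $re => 'stored as key' );
my %as_value = ( name => $re );

my ($key)     = keys %as_key;
my $key_ref   = ref $key;
my $value_ref = ref $as_value{name};

print "run-time error: ", ( $error ? "yes" : "no" ), "\n";
print "ref of key:   '$key_ref'\n";
print "ref of value: '$value_ref'\n";
```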

perldoc perlop extract

See the perldoc perlop “Binding Operators” section on the =~ operator:

Binding Operators

Binary “=~” binds a scalar expression to a pattern match. Certain operations search or modify the string $_ by default. This operator makes that kind of operation work on some other string. The right argument is a search pattern, substitution, or transliteration. The left argument is what is supposed to be searched, substituted, or transliterated instead of the default $_. When used in scalar context, the return value generally indicates the success of the operation. The exceptions are substitution (s///) and transliteration (y///) with the “/r” (non-destructive) option, which cause the return value to be the result of the substitution. Behavior in list context depends on the particular operator. See “Regexp Quote-Like Operators” for details and perlretut for examples using these operators.

If the right argument is an expression rather than a search pattern, substitution, or transliteration, it is interpreted as a search pattern at run time. Note that this means that its contents will be interpolated twice, so

'\\' =~ q'\\';

is not ok, as the regex engine will end up trying to compile the pattern “\”, which it will consider a syntax error.
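The double interpolation can be seen by putting the pattern in a variable, so that it is compiled when the match runs and the failure can be trapped with eval. This is a sketch; using quotemeta to build a working pattern is my suggestion and is not part of the documentation extract.

```perl
use strict;
use warnings;

my $backslash = '\\';     # a string holding one backslash
my $bad_pat   = q'\\';    # also one backslash: not a valid regex by itself

# Compiling a lone backslash as a pattern dies, so trap it with eval.
my $bad_ok = ( eval { my $m = ( $backslash =~ $bad_pat ); 1 } ) ? 1 : 0;

# quotemeta escapes the backslash so the pattern matches it literally.
my $good_pat = quotemeta $backslash;
my $good_ok  = ( $backslash =~ $good_pat ) ? 1 : 0;

print "lone backslash compiles: $bad_ok\n";
print "escaped pattern matches: $good_ok\n";
```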

 
