perlhaq3::Regexps

Sweet regular expressions, more efficient regular expressions and haquer regular expressions.


Recommendation
A lot of this may be better explained in perlre or perlfaq6 (Regexps), but if you like examples that go from simple to advanced, you should still check this out.

IPv4 Address Matching
Let's look at a basic IPv4 address: 127.0.0.1
Here is the final regular expression we'll end up with: /^(?:0*(?:2(?:[0-4]\d|5[0-5])|1?\d{1,2})(?:\.|$)){4}/
We break the IP down to different parts: <127> <.> <0> <.> <0> <.> <1>
Since we know what an IP is shaped like (from 0.0.0.0 to 255.255.255.255), we can break the numbers down more to fit a regular expression:
<0 - 255> <.> <0 - 255> <.> <0 - 255> <.> <0 - 255>
Looking at it more like a regexp: ((<2> (<0 - 4> <0 - 9> or <5> <0 - 5>)) or <0 - 1> <0 - 9> <0 - 9>) <.> and so on for each of the four numbers.
Now let's convert this to a regular expression. We'll assume the regexp we're making goes in the code below where it says 'REGEXP':
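To convince yourself that the number part of the breakdown really covers 0 to 255 and nothing else, here is a small sketch that tries it against every number from 0 to 999. It uses only the number part of the final regexp shown above; the names $octet and @accepted, and the extra ^ and $ anchors, are just for this test and are not part of the original.

```perl
use strict;
use warnings;

# Only the number part of the final regexp, anchored at both ends
# so that it is tested on its own (the anchors are added for this test).
my $octet = qr/^(?:2(?:[0-4]\d|5[0-5])|1?\d{1,2})$/;

# Collect every number from 0 to 999 that the pattern accepts.
my @accepted = grep { $_ =~ $octet } 0 .. 999;

print "lowest: $accepted[0], highest: $accepted[-1], count: ",
      scalar(@accepted), "\n";
# prints "lowest: 0, highest: 255, count: 256"
```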

$ip = "127.0.0.1";
if ($ip =~ /REGEXP/x) {
 print "$ip is a valid IPv4 address.\n";
}
else {
 print "$ip is not a valid IPv4 address.\n";
}
The reason there is an 'x' at the end of the /REGEXP/ is so we can allow white space and comments inside of the regular expression.
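Here is a rough sketch of what the 'x' changes. The two patterns are identical except for the modifier; the variable names $plain and $spaced are made up for this example.

```perl
use strict;
use warnings;

# Without x, the spaces in the pattern must match real spaces in the text.
my $plain  = qr/^\d+ \. \d+$/;

# With x, the spaces in the pattern are ignored.
my $spaced = qr/^\d+ \. \d+$/x;

for my $text ('10.5', '10 . 5') {
    printf "%-8s without x: %-3s with x: %s\n", "'$text'",
        ($text =~ $plain  ? 'yes' : 'no'),
        ($text =~ $spaced ? 'yes' : 'no');
}
```

Without the 'x' only '10 . 5' matches, and with it only '10.5' matches.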
To take a better look at the regular expression, here it is with its pieces spaced out:
^ (?:0* (?:2 (?:[0-4] \d | 5 [0-5]) | 1? \d{1,2}) (?: \. | $)){4}
Here's a basic explanation of how this regular expression works. The main cluster matches a number from 0 to 255, with any number of 0's before it, followed by a period or the end of the string. The {4} at the end makes the entire cluster match 4 times.

The number matching works like this. The first alternative matches a 2, and if it does, it then tries to match 0 - 4 followed by any digit (200 - 249), or a 5 followed by 0 - 5 (250 - 255). This keeps the number from going over the maximum possible value, which is 255. If that alternative fails, the second one optionally matches a 1 (for a 1xx) and then matches 1 or 2 digits (0 - 99).

The end of the cluster checks for either a period or the end of the string. This matches every valid IP address. Note that nothing anchors the end of the pattern, so a valid address followed by extra text, such as 1.2.3.4.5, matches too.
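The explanation can be written straight into the regexp as comments, thanks to the 'x'. The sketch below does that for the pattern from this section. It also shows $ipv4_strict, a variation that is not part of the original: it assumes you want to reject anything after the fourth number, so it adds a $ at the very end and only allows a period when it is not the last character.

```perl
use strict;
use warnings;

# The pattern from the text, one comment per piece.
my $ipv4 = qr/
    ^                        # start of the string
    (?:                      # main cluster: one number plus what follows it
        0*                   # any number of leading zeros
        (?:
            2 (?: [0-4] \d   # 200 to 249
                | 5 [0-5]    # 250 to 255
              )
          | 1? \d{1,2}       # 0 to 199
        )
        (?: \. | $)          # a period or the end of the string
    ){4}                     # the main cluster, four times
/x;

# Hypothetical stricter variation (an assumption, not the original pattern).
my $ipv4_strict = qr/
    ^
    (?:
        0*
        (?: 2 (?: [0-4] \d | 5 [0-5] ) | 1? \d{1,2} )
        (?: \. (?!$) | $)    # a period that is not the last character, or the end
    ){4}
    $                        # nothing may follow the fourth number
/x;

for my $ip ('127.0.0.1', '255.255.255.255', '256.0.0.1', '1.2.3', '1.2.3.4.5') {
    printf "%-16s original: %-3s strict: %s\n", $ip,
        ($ip =~ $ipv4        ? 'yes' : 'no'),
        ($ip =~ $ipv4_strict ? 'yes' : 'no');
}
```

The two patterns agree on everything in this list except 1.2.3.4.5, which only the original accepts.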

$ matches the end of the line.
* matches 0 or more times.
? matches 0 or 1 times.
+ (not used in this) matches 1 or more times.
{} is a quantifier: it says how many times the thing before it must match.
{n} matches n times.
{n,} matches at least n times.
{n,m} matches at least n times but not more than m times.
^ matches at the beginning of the line.
() is for grouping (capturing and clustering data).
?: at the beginning of () clusters but doesn't capture the data.
[] defines a character class, which matches any one of the characters listed inside it.
- in a character class matches anything from the character before it to the character after it.
| is for alternation. The alternatives are grouped because we only want to alternate what's inside the cluster, not the entire regular expression.
\. is just the period character. The backslash quotes the metacharacter after it, in this case the period. It must be backslashed because an unescaped . in a regular expression matches any character (except newline, unless the s option is added at the end).
\d matches a digit.
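A couple of the items above are easier to see in action. This is only a quick sketch: the patterns are simplified (they don't check the 0 to 255 range), and the names @parts, $loose and $exact are made up for the example.

```perl
use strict;
use warnings;

# () captures what it matches; (?:) only clusters.
# In list context a match returns the captured pieces.
my @parts = '127.0.0.1' =~ /^(\d+)\.(?:\d+)\.(?:\d+)\.(\d+)$/;
print "captured: @parts\n";    # prints "captured: 127 1"

# An unescaped . matches any character, so letters pass as separators.
my $loose = qr/^\d+.\d+.\d+.\d+$/;

# A backslashed \. only matches a real period.
my $exact = qr/^\d+\.\d+\.\d+\.\d+$/;

for my $text ('127.0.0.1', '127a0b0c1') {
    printf "%-10s unescaped: %-3s escaped: %s\n", $text,
        ($text =~ $loose ? 'yes' : 'no'),
        ($text =~ $exact ? 'yes' : 'no');
}
```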

Execution of Code in a Regexp


Changes

August 29, 2001 - I began writing this.


Author

That would be me, Samy Kamkar. You can reach me at CommPort5@LucidX.com or on IRC on SUIDnet in #suid with the IRC name 'CommPort5'.