Programming Perl [Chapter 2] 2.4 Pattern Matching

2.4 Pattern Matching

The two main pattern matching operators are m//, the match operator, and s///, the substitution operator. There is also a split operator, which takes an ordinary match operator as its first argument but otherwise behaves like a function, and is therefore documented in Chapter 3, Functions.

Although we write m// and s/// here, you'll recall that you can pick your own quote characters. On the other hand, for the m// operator only, the m may be omitted if the delimiters you pick are in fact slashes. (You'll often see patterns written this way, for historical reasons.)

Now that we've gone to all the trouble of enumerating these weird, quote-like operators, you might wonder what it is we've gone to all the trouble of quoting. The answer is that the string inside the quotes specifies a regular expression. We'll discuss regular expressions in the next section, because there's a lot to discuss.

The matching operations can have various modifiers, some of which affect the interpretation of the regular expression inside:

Modifier  Meaning

i         Do case-insensitive pattern matching.
m         Treat string as multiple lines (^ and $ match internal \n).
s         Treat string as single line (^ and $ ignore \n, but . matches \n).
x         Extend your pattern's legibility with whitespace and comments.

These are usually written as "the /x modifier", even though the delimiter in question might not actually be a slash. In fact, any of these modifiers may also be embedded within the regular expression itself using the (?...) construct. See the section "Regular Expression Extensions" later in this chapter.

The /x modifier itself needs a little more explanation. It tells the regular expression parser to ignore whitespace that is not backslashed or within a character class. You can use this modifier to break up your regular expression into (slightly) more readable parts. The # character is also treated as a metacharacter introducing a comment, just as in ordinary Perl code. Taken together, these features go a long way toward making Perl a readable language.
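As a small illustration of /x, the sketch below writes one pattern twice, first compactly and then spread out with comments. The sample string and variable names are invented for the example. Note that a literal space has to be protected under /x, here by putting it in a character class.

```perl
use strict;
use warnings;

my $line = "Version: 5.003";    # sample input, made up for this sketch

# Compact form.
my ($compact) = $line =~ /Version: *([0-9.]+)/;

# The same pattern under the x modifier.  Bare whitespace is ignored,
# so the space after the colon is written as a character class.
my ($spaced) = $line =~ /
    Version:        # literal text
    [ ]*            # optional spaces
    ( [0-9.]+ )     # capture the digits and dots
/x;

print "$compact $spaced\n";
```

Both forms should capture the same substring; the second is simply easier to annotate.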

Regular Expressions

The regular expressions used in the pattern matching and substitution operators are syntactically similar to those used by the UNIX egrep program. When you write a regular expression, you're actually writing a grammar for a little language. The regular expression interpreter (which we'll call the Engine) takes your grammar and compares it to the string you're doing pattern matching on. If some portion of the string can be parsed as a sentence of your little language, it says "yes". If not, it says "no".

What happens after the Engine has said "yes" depends on how you invoked it. An ordinary pattern match is usually used as a conditional expression, in which case you don't care where it matched, only whether it matched. (But you can also find out where it matched if you need to know that.) A substitution command will take the part that matched and replace it with some other string of your choice. And the split operator will return (as a list) all the places your pattern didn't match.

Regular expressions are powerful, packing a lot of meaning into a short space. They can therefore be quite daunting if you try to intuit the meaning of a large regular expression as a whole. But if you break it up into its parts, and if you know how the Engine interprets those parts, you can understand any regular expression.

The regular expression bestiary

Before we dive into the rules for interpreting regular expressions, let's take a look at some of the things you'll see in regular expressions. First of all, you'll see literal strings. Most characters[23] in a regular expression simply match themselves. If you string several characters in a row, they must match in order, just as you'd expect. So if you write the pattern match:

/Fred/

you can know that the pattern won't match unless the string contains the substring "Fred" somewhere.

[23] In this section we are misusing the term "character" to mean "byte". So far, Perl only knows about byte-sized characters, but this will change someday, at which point "character" will be a more appropriate word.

Other characters don't match themselves, but are metacharacters. (Before we explain what metacharacters do, we should reassure you that you can always match such a character literally by putting a backslash in front of it. For example, backslash is itself a metacharacter, so to match a literal backslash, you'd backslash the backslash: \\.) The list of metacharacters is:

\ | ( ) [  {  ^ $ * + ? .

We said that backslash turns a metacharacter into a literal character, but it does the opposite to an alphanumeric character: it turns the literal character into a sort of metacharacter or sequence. So whenever you see a two-character sequence:

\b \D \t \3 \s

you'll know that the sequence matches something strange. A \b matches a word boundary, for instance, while \t matches an ordinary tab character. Notice that a word boundary is zero characters wide, while a tab character is one character wide. Still, they're alike in that they both assert that something is true about a particular spot in the string. Most of the things in a regular expression fall into the class of assertions, including the ordinary characters that simply assert that they match themselves. (To be precise, they also assert that the next thing will match one character later in the string, which is why we talk about the tab character being "one character wide". Some assertions eat up some of the string as they match, and others don't. But we usually reserve the term "assertion" for the zero-width assertions. We'll call these assertions with nonzero width atoms.) You'll also see some things that aren't assertions. Alternation is indicated with a vertical bar:

/Fred|Wilma|Barney|Betty/

That means that any of those strings can trigger a match. Grouping of various sorts is done with parentheses, including grouping of alternating substrings within a longer regular expression:

/(Fred|Wilma|Pebbles) Flintstone/

Another thing you'll see are what we call quantifiers. They say how many of the previous thing should match in a row. Quantifiers look like:

* + ? *? {2,5}

Quantifiers only make sense when attached to atoms, that is, assertions that have width. Quantifiers attach only to the previous atom, which in human terms means they only quantify one character. So if you want to match three copies of "moo" in a row, you need to group the "moo" with parentheses, like this:

/(moo){3}/

That will match "moomoomoo". If you'd said /moo{3}/, it would only have matched "moooo".
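The difference is easy to check. In this sketch the patterns are anchored with ^ and $ (an addition for the test, not part of the patterns above) so that each one has to account for the whole string.

```perl
use strict;
use warnings;

my (%grouped, %ungrouped);
for my $s ("moomoomoo", "moooo") {
    # The quantifier applies to the whole parenthesized group.
    $grouped{$s}   = ($s =~ /^(moo){3}$/) ? 1 : 0;
    # The quantifier applies only to the final "o".
    $ungrouped{$s} = ($s =~ /^moo{3}$/)   ? 1 : 0;
    print "$s: grouped=$grouped{$s} ungrouped=$ungrouped{$s}\n";
}
```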

Since patterns are processed as double-quoted strings, the normal double-quoted interpolations will work. (See "String Literals" earlier in this chapter.) These are applied before the string is interpreted as a regular expression. One caveat though: any $ immediately followed by a vertical bar, closing parenthesis, or the end of the string will be interpreted as an end-of-line assertion rather than a variable interpolation. So if you say:

$foo = "moo";
/$foo$/;

it's equivalent to saying:

/moo$/;

You should also know that interpolating variables into a pattern slows down the pattern matcher considerably, because it feels it needs to recompile the pattern each time through, since the variable might have changed.

The rules of regular expression matching

Now that you've seen some of the things you'll be seeing, we'll lay out the rules that the Engine uses to match your pattern against the string. The Perl Engine uses a nondeterministic finite-state automaton (NFA) to find a match. That just means that it keeps track of what it has tried and what it hasn't, and when something doesn't pan out, it backs up and tries something else. This is called backtracking. The Perl Engine is capable of trying a million things at one spot, then giving up on all those, backing up to within one choice of the beginning, and trying the million things again at a different spot. If you're cagey, you can write efficient patterns that don't do a lot of silly backtracking.

The order of the rules below specifies which order the Engine tries things. So when someone trots out a stock phrase like "left-most, longest match", you'll know that overall Perl prefers left-most over longest. But the Engine doesn't realize it's preferring anything at that level. The global preferences result from a lot of localized choices. The Engine thinks locally and acts globally.

Rule 1. The Engine tries to match as far left in the string as it can, such that the entire regular expression matches under Rule 2.

In order to do this, its first choice is to start just before the first character (it could have started anywhere), and to try to match the entire regular expression at that point. The regular expression matches if and only if the Engine reaches the end of the regular expression before it runs off the end of the string. If it matches, it quits immediately--it doesn't keep looking for a "better" match, even though the regular expression could match in many different ways. The match only has to reach the end of the regular expression; it doesn't have to reach the end of the string, unless there's an assertion in the regular expression that says it must. If it exhausts all possibilities at the first position, it realizes that its very first choice was wrong, and proceeds to its second choice. It goes to the second position in the string (between the first and second characters), and tries all the possibilities again. If it succeeds, it stops. If it fails, it continues on down the string. The pattern match as a whole doesn't fail until it has tried to match the entire regular expression at every position in the string, including after the last character in the string.

Note that the positions it's trying to match at are between the characters of the string. This rule sometimes surprises people when they write a pattern like /x*/ that can match zero or more x's. If you try the pattern on a string like "fox", it will match the null string before the "f" in preference to the "x" that's later in the string. If you want it to match one or more x's, you need to tell it that by using /x+/ instead. See the quantifiers under Rule 5.

A corollary to this rule is that any regular expression that can match the null string is guaranteed to match at the leftmost position in the string.
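Here is a minimal sketch of the "fox" case. It assumes a Perl recent enough to provide the @- array, whose first element holds the offset at which the last match started; that array is not described in this section.

```perl
use strict;
use warnings;

my $string = "fox";

# x* is satisfied by zero x's, so it succeeds at the first position tried.
$string =~ /x*/;
my ($star_at, $star_len) = ($-[0], length($&));

# x+ needs at least one x, so the Engine has to move down the string.
$string =~ /x+/;
my ($plus_at, $plus_len) = ($-[0], length($&));

print "x* matched at offset $star_at, length $star_len\n";
print "x+ matched at offset $plus_at, length $plus_len\n";
```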

Rule 2. For this rule, the whole regular expression is regarded as a set of alternatives (where the degenerate case is just a set with one alternative). If there are two or more alternatives, they are syntactically separated by the | character (usually called a vertical bar). A set of alternatives matches a string if any of the alternatives match under Rule 3. It tries the alternatives left-to-right (according to their position in the regular expression), and stops on the first match that allows successful completion of the entire regular expression. If none of the alternatives matches, it backtracks to the Rule that invoked this Rule, which is usually Rule 1, but could be Rule 4 or 6. That rule will then look for a new position at which to apply Rule 2.

If there's only one alternative, then either it matches or it doesn't, and the rule still applies. (There's no such thing as zero alternatives, because a null string can always match something of zero width.)
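The left-to-right ordering of alternatives can be seen in a short sketch. The sample name and variable names are invented for the illustration.

```perl
use strict;
use warnings;

my $name = "Freddy";

# The first alternative that lets the whole pattern succeed wins,
# even if a later alternative would have matched more text.
my ($first)  = $name =~ /(Fred|Freddy)/;
my ($second) = $name =~ /(Freddy|Fred)/;

# With anchors, "Fred" no longer allows the pattern to complete,
# so the Engine backtracks and tries the next alternative.
my ($forced) = $name =~ /^(Fred|Freddy)$/;

print "$first $second $forced\n";
```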

Rule 3. Any particular alternative matches if every item in the alternative matches sequentially according to Rules 4 and 5 (such that the entire regular expression can be satisfied). An item consists of either an assertion, which is covered in Rule 4, or a quantified atom, which is covered by Rule 5. Items that have choices on how to match are given "pecking order" from left to right. If the items cannot be matched in order, the Engine backtracks to the next alternative under Rule 2.

Items that must be matched sequentially aren't separated in the regular expression by anything syntactic--they're merely juxtaposed in the order they must match. When you ask to match /^foo/, you're actually asking for four items to be matched one after the other. The first is a zero-width assertion, and the other three are ordinary letters that must match themselves, one after the other.

The left-to-right pecking order means that in a pattern like:

/x*y*/

x gets to pick one way to match, and then y tries all its ways. If that fails, then x gets to pick its second choice, and make y try all of its ways again. And so on. The items to the right vary faster, to borrow a phrase from multi-dimensional arrays.

Rule 4. An assertion must match according to this table. If the assertion does not match at the current position, the Engine backtracks to Rule 3 and retries higher-pecking-order items with different choices.

Assertion  Meaning

^          Matches at the beginning of the string (or line, if /m used)
$          Matches at the end of the string (or line, if /m used)
\b         Matches at word boundary (between \w and \W)
\B         Matches except at word boundary
\A         Matches at the beginning of the string
\Z         Matches at the end of the string
\G         Matches where previous m//g left off
(?=...)    Matches if engine would match ... next
(?!...)    Matches if engine wouldn't match ... next

The $ and \Z assertions can match not only at the end of the string, but also one character earlier than that, if the last character of the string happens to be a newline.

The positive (?=...) and negative (?!...) lookahead assertions are zero-width themselves, but assert that the regular expression represented above by ... would (or would not) match at this point, were we to attempt it. In fact, the Engine does attempt it. The Engine goes back to Rule 2 to test the subexpression, and then wipes out any record of how much string was eaten, returning only the success or failure of the subexpression as the value of the assertion. We'll show you some examples later.

Rule 5. A quantified atom matches only if the atom itself matches some number of times allowed by the quantifier. (The atom is matched according to Rule 6.) Different quantifiers require different numbers of matches, and most of them allow a range of numbers of matches. Multiple matches must all match in a row, that is, they must be adjacent within the string. An unquantified atom is assumed to have a quantifier requiring exactly one match. Quantifiers constrain and control matching according to the table below. If no match can be found at the current position for any allowed quantity of the atom in question, the Engine backtracks to Rule 3 and retries higher-pecking-order items with different choices.

Quantifiers are:

Maximal  Minimal  Allowed Range

{n,m}    {n,m}?   Must occur at least n times but no more than m times
{n,}     {n,}?    Must occur at least n times
{n}      {n}?     Must match exactly n times
*        *?       0 or more times (same as {0,})
+        +?       1 or more times (same as {1,})
?        ??       0 or 1 time (same as {0,1})

If a brace occurs in any other context, it is treated as a regular character. n and m are limited to integral values less than 65,536.

If you use the {n} form, then there is no choice, and the atom must match exactly that number of times or not at all. Otherwise, the atom can match over a range of quantities, and the Engine keeps track of all the choices so that it can backtrack if necessary. But then the question arises as to which of these choices to try first. One could start with the maximal number of matches and work down, or the minimal number of matches and work up.

The quantifiers in the left column above try the biggest quantity first. This is often called "greedy" matching. To find the greediest match, the Engine doesn't actually count down from the maximum value, which after all could be infinity. What actually happens in this case is that the Engine first counts up to find out how many atoms it's possible to match in a row in the current string, and then it remembers all the shorter choices and starts out from the longest one. This could fail, of course, in which case it backtracks to a shorter choice.

If you say /.*foo/, for example, it will try to match the maximal number of "any" characters (represented by the dot) clear out to the end of the line before it ever tries looking for "foo", and then when the "foo" doesn't match there (and it can't, because there's not enough room for it at the end of the string), the Engine will back off one character at a time until it finds a "foo". If there is more than one "foo" in the line, it'll stop on the last one, and throw away all the shorter choices it could have made.

By placing a question mark after any of the greedy quantifiers, they can be made to choose the smallest quantity for the first try. So if you say /.*?foo/, the .*? first tries to match 0 characters, then 1 character, then 2, and so on until it can match the "foo". Instead of backtracking backward, it backtracks forward, so to speak, and ends up finding the first "foo" on the line instead of the last.
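A quick sketch of the two behaviors, using a made-up line that contains "foo" twice. The parentheses are added only so we can see how much the dot swallowed before "foo" matched.

```perl
use strict;
use warnings;

my $line = "a foo b foo c";    # sample text, invented for this sketch

# Greedy: .* grabs everything, then backs off to the last "foo".
my ($greedy)  = $line =~ /(.*)foo/;

# Minimal: .*? starts with nothing and grows to the first "foo".
my ($minimal) = $line =~ /(.*?)foo/;

print "greedy prefix:  <$greedy>\n";
print "minimal prefix: <$minimal>\n";
```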

Rule 6. Each atom matches according to its type, listed below. If the atom doesn't match (or doesn't allow a match of the rest of the regular expression), the Engine backtracks to Rule 5 and tries the next choice for the atom's quantity.

Atoms match according to the following types:

A regular expression in parentheses, (...), matches whatever the regular expression (represented by ...) matches according to Rule 2. Parentheses therefore serve as a grouping operator for quantification. Parentheses also have the side effect of remembering the matched substring for later use in a backreference (to be discussed later). This side effect can be suppressed by using (?:...) instead, which has only the grouping semantics--it doesn't store anything in $1, $2, and so on.
A "." matches any character except \n. (It also matches \n if you use the /s modifier.) The main use of dot is as a vehicle for a minimal or maximal quantifier. A .* matches a maximal number of don't-care characters, while a .*? matches a minimal number of don't-care characters. But it's also sometimes used within parentheses for its width: /(..):(..):(..)/ matches three colon-separated fields, each of which is two characters long.
A list of characters in square brackets (called a character class) matches any one of the characters in the list. A caret at the front of the list causes it to match only characters that are not in the list. Character ranges may be indicated using the a-z notation. You may also use any of \d, \w, \s, \n, \r, \t, \f, or \nnn, as listed below. A \b means a backspace in a character class. You may use a backslash to protect a hyphen that would otherwise be interpreted as a range delimiter. To match a right square bracket, either backslash it or place it first in the list. To match a caret, don't put it first. Note that most other metacharacters lose their meta-ness inside square brackets. In particular, it's meaningless to specify alternation in a character class, since the characters are interpreted individually. For example, [fee|fie|foe] means the same thing as [feio|].

A backslashed letter matches a special character or character class:

Code Matches

\a Alarm (beep)

\n Newline

\r Carriage return

\t Tab

\f Formfeed

\e Escape

\d A digit, same as [0-9]

\D A nondigit

\w A word character (alphanumeric), same as [a-zA-Z_0-9]

\W A nonword character

\s A whitespace character, same as [ \t\n\r\f]

\S A non-whitespace character

Note that \w matches a character of a word, not a whole word. Use \w+ to match a word.

A backslashed single-digit number matches whatever the corresponding parentheses actually matched (except that \0 matches a null character). This is called a backreference to a substring. A backslashed multi-digit number such as \10 will be considered a backreference if the pattern contains at least that many substrings prior to it, and the number does not start with a 0. Pairs of parentheses are numbered by counting left parentheses from the left.
A backslashed two- or three-digit octal number such as \033 matches the character with the specified value, unless it would be interpreted as a backreference.
A backslashed x followed by one or two hexadecimal digits, such as \x7f, matches the character having that hexadecimal value.
A backslashed c followed by a single character, such as \cD, matches the corresponding control character.
Any other backslashed character matches that character.
Any character not mentioned above matches itself.

The fine print

As mentioned above, \1, \2, \3, and so on, are equivalent to whatever the corresponding set of parentheses matched, counting opening parentheses from left to right. (If the particular pair of parentheses had a quantifier such as * after it, such that it matched a series of substrings, then only the last match counts as the backreference.) Note that such a backreference matches whatever actually matched for the subpattern in the string being examined; it's not just a shorthand for the rules of that subpattern. Therefore, (0|0x)\d*\s\1\d* will match "0x1234 0x4321", but not "0x1234 01234", since subpattern 1 actually matched "0x", even though the rule 0|0x could potentially match the leading 0 in the second number.
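The two strings just mentioned can be tried directly. This is only a sketch; the hash name is made up.

```perl
use strict;
use warnings;

my %matched;
for my $s ("0x1234 0x4321", "0x1234 01234") {
    # \1 must repeat the text that group 1 actually matched ("0x" here),
    # not merely something the subpattern 0|0x could have matched.
    $matched{$s} = ($s =~ /(0|0x)\d*\s\1\d*/) ? 1 : 0;
    print "<$s> ", ($matched{$s} ? "matches" : "does not match"), "\n";
}
```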

Outside of the pattern (in particular, in the replacement of a substitution operator) you can continue to refer to backreferences by using $ instead of \ in front of the number. The variables $1, $2, $3 . . . are automatically localized, and their scope (and that of $`, $&, and $' below) extends to the end of the enclosing block or eval string, or to the next successful pattern match, whichever comes first. (The \1 notation sometimes works outside the current pattern, but should not be relied upon.) $+ returns whatever the last bracket match matched. $& returns the entire matched string. $` returns everything before the matched string.[24] $' returns everything after the matched string. For more explanation of these magical variables (and for a way to write them in English), see the section "Special Variables" at the end of this chapter.

[24] In the case of something like s/pattern/length($`)/eg, which does multiple replacements if the pattern occurs multiple times, the value of $` does not include any modifications done by previous replacement iterations. To get the other effect, say:
1 while s/pattern/length($`)/e;
For example, to change all tabs to the corresponding number of spaces, you could say:
1 while s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;

You may have as many parentheses as you wish. If you have more than nine pairs, the variables $10, $11, . . . refer to the corresponding substring. Within the pattern, \10, \11, and so on, refer back to substrings if there have been at least that many left parentheses before the backreference. Otherwise (for backward compatibility) \10 is the same as \010, a backspace, and \11 the same as \011, a tab. And so on. (\1 through \9 are always backreferences.)

Examples:

s/^([^ ]+) +([^ ]+)/$2 $1/;   # swap first two words
/(\w+)\s*=\s*\1/;             # match "foo = foo"
/.{80,}/;                     # match line of at least 80 chars
/^(\d+\.?\d*|\.\d+)$/;        # match valid number
if (/Time: (..):(..):(..)/) { # pull fields out of a line
        $hours   = $1;
        $minutes = $2;
        $seconds = $3;
}

Hint: instead of writing patterns like /(...)(..)(.....)/, use the unpack function. It's more efficient.

A word boundary (\b) is defined as a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W. (Within character classes \b represents backspace rather than a word boundary.)
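To make the definition concrete, here is a small sketch with an invented sample string. It counts matches with and without \b, and checks that the ends of the string count as boundaries.

```perl
use strict;
use warnings;

my $text = "cat concatenate cat's";    # sample text, made up for this sketch

my @any   = ($text =~ /cat/g);         # every "cat", even inside a word
my @whole = ($text =~ /\bcat\b/g);     # only "cat" with \W (or nothing) on both sides

# The imaginary characters off either end of the string count as \W.
my $edges = ("cat" =~ /^\bcat\b$/) ? 1 : 0;

print scalar(@any), " anywhere, ", scalar(@whole), " as whole words, edges=$edges\n";
```

The apostrophe in "cat's" is a \W character, so the "cat" in front of it still ends at a word boundary.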

Normally, the ^ character is guaranteed to match only at the beginning of the string, the $ character only at the end (or before the newline at the end), and Perl does certain optimizations with the assumption that the string contains only one line. Embedded newlines will not be matched by ^ or $. However, you may wish to treat a string as a multi-line buffer, such that the ^ will also match after any newline within the string, and $ will also match before any newline. At the cost of a little more overhead, you can do this by using the /m modifier on the pattern match operator. (Older programs did this by setting $*, but this practice is now deprecated.) \A and \Z are just like ^ and $ except that they won't match multiple times when the /m modifier is used, while ^ and $ will match at every internal line boundary. To match the actual end of the string, not ignoring newline, you can use \Z(?!\n). There's an example of a negative lookahead assertion.
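The following sketch runs the anchors over a three-line buffer. The buffer contents and variable names are invented for the illustration.

```perl
use strict;
use warnings;

my $buffer = "one\ntwo\nthree\n";

my @plain = ($buffer =~ /^(\w+)$/g);     # no /m: no single word spans ^ to $
my @multi = ($buffer =~ /^(\w+)$/mg);    # /m: ^ and $ match at every line
my @first = ($buffer =~ /\A(\w+)$/mg);   # \A still matches only at the start
my @last  = ($buffer =~ /^(\w+)\Z/mg);   # \Z still matches only at the end

# \Z tolerates a trailing newline; adding (?!\n) does not.
my $lenient = ($buffer =~ /three\Z/)       ? 1 : 0;
my $exact   = ($buffer =~ /three\Z(?!\n)/) ? 1 : 0;

print "plain=(@plain) multi=(@multi) first=(@first) last=(@last)\n";
print "lenient=$lenient exact=$exact\n";
```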

To facilitate multi-line substitutions, the . character never matches a newline unless you use the /s modifier, which tells Perl to pretend the string is a single line--even if it isn't. (The /s modifier also overrides the setting of $*, in case you have some (badly behaved) older code that sets it in another module.) In particular, the following leaves a newline on the $_ string:

$_ = <STDIN>;
s/.*(some_string).*/$1/;

If the newline is unwanted, use any of these:

s/.*(some_string).*/$1/s;
s/.*(some_string).*\n/$1/;
s/.*(some_string)[^\0]*/$1/;
s/.*(some_string)(.|\n)*/$1/;
chop; s/.*(some_string).*/$1/;
/(some_string)/ && ($_ = $1);

Note that all backslashed metacharacters in Perl are alphanumeric, such as \b, \w, and \n. Unlike some regular expression languages, there are no backslashed symbols that aren't alphanumeric. So anything that looks like \\, \(, \), \<, \>, \{, or \} is always interpreted as a literal character, not a metacharacter. This makes it simple to quote a string that you want to use for a pattern but that you are afraid might contain metacharacters. Just quote all the non-alphanumeric characters:

$pattern =~ s/(\W)/\\$1/g;

You can also use the built-in quotemeta function to do this. An even easier way to quote metacharacters right in the match operator is to say:

/$unquoted\Q$quoted\E$unquoted/
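The three approaches can be compared side by side. In this sketch the sample string and variable names are invented; quotemeta and \Q...\E are the built-in facilities named above.

```perl
use strict;
use warnings;

my $literal = "a.b*c";              # text we want to match literally

my $by_hand = $literal;
$by_hand =~ s/(\W)/\\$1/g;          # backslash every non-word character
my $builtin = quotemeta($literal);  # the built-in does the same job

# Unquoted, the dot and star act as metacharacters.
my $loose_hit  = ("axbbc" =~ /^$literal$/)       ? 1 : 0;
# Quoted with \Q...\E, only the literal text matches.
my $quoted_hit = ("axbbc" =~ /^\Q$literal\E$/)   ? 1 : 0;
my $exact_hit  = ($literal =~ /^\Q$literal\E$/)  ? 1 : 0;

print "$by_hand $builtin loose=$loose_hit quoted=$quoted_hit exact=$exact_hit\n";
```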

Remember that the first and last alternatives (before the first | and after the last one) tend to gobble up the other elements of the regular expression on either side, out to the ends of the expression, unless there are enclosing parentheses. A common mistake is to ask for:

/^fee|fie|foe$/

when you really mean:

/^(fee|fie|foe)$/

The first matches "fee" at the beginning of the string, or "fie" anywhere, or "foe" at the end of the string. The second matches any string consisting solely of "fee" or "fie" or "foe".
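A short sketch shows how differently the two patterns treat a few made-up strings.

```perl
use strict;
use warnings;

my (%ungrouped, %grouped);
for my $s ("feeble", "alfie", "fie") {
    # ^ belongs only to "fee" and $ only to "foe".
    $ungrouped{$s} = ($s =~ /^fee|fie|foe$/)   ? 1 : 0;
    # The parentheses make both anchors apply to every alternative.
    $grouped{$s}   = ($s =~ /^(fee|fie|foe)$/) ? 1 : 0;
    print "$s: ungrouped=$ungrouped{$s} grouped=$grouped{$s}\n";
}
```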

Regular expression extensions

Perl defines a consistent extension syntax for regular expressions. You've seen some of them already. The syntax is a pair of parentheses with a question mark as the first thing within the parentheses.[25] The character after the question mark gives the function of the extension. Several extensions are already supported:

[25] This was a syntax error in older versions of Perl. If you try to use this and have problems, upgrade to the newest version.

(?#text)

A comment. The text is ignored. If the /x switch is used to enable whitespace formatting, a simple # will suffice.

(?:...)

This groups things like "(...)" but doesn't make backreferences like "(...)" does. So:

split(/\b(?:a|b|c)\b/)

is like:

split(/\b(a|b|c)\b/)

but doesn't actually save anything in $1, which means that the first split doesn't spit out extra delimiter fields as the second one does.

(?=...)

A zero-width positive lookahead assertion. For example, /\w+(?=\t)/ matches a word followed by a tab, without including the tab in $&.

(?!...)

A zero-width negative lookahead assertion. For example, /foo(?!bar)/ matches any occurrence of "foo" that isn't followed by "bar". Note, however, that lookahead and lookbehind are not the same thing. You cannot use this for lookbehind: /(?!foo)bar/ will not find an occurrence of "bar" that is preceded by something that is not "foo". That's because the (?!foo) is just saying that the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will match. You would have to do something like /(?!foo)...bar/ for that. We say "like" because there's the case of your "bar" not having three characters before it. You could cover that this way: /(?:(?!foo)...|^.{0,2})bar/. Sometimes it's still easier just to say:

if (/bar/ and $` !~ /foo$/)

(?imsx)

One or more embedded pattern-match modifiers. This is particularly useful for patterns that are specified in a table somewhere, some of which want to be case-sensitive, and some of which don't. The case-insensitive ones merely need to include (?i) at the front of the pattern. For example:

# hardwired case insensitivity
$pattern = "buffalo";
if ( /$pattern/i ) { }

# data-driven case insensitivity
$pattern = "(?i)buffalo";
if ( /$pattern/ ) { }
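The extensions described here can be exercised together in one short sketch. All sample strings and variable names are invented for the illustration. The careful lookahead pattern uses ^.{0,2} for the short-prefix case, so that a "bar" at the very start of the string is covered too.

```perl
use strict;
use warnings;

# (?:...) groups without capturing, so split returns no delimiter fields.
my $text     = "x a y b z";
my @plain    = split(/\b(?:a|b|c)\b/, $text);
my @captured = split(/\b(a|b|c)\b/,   $text);

# (?!...) looks ahead, not behind.
my (%naive, %careful);
for my $s ("foobar", "bazbar", "bar") {
    $naive{$s}   = ($s =~ /(?!foo)bar/)                 ? 1 : 0;
    $careful{$s} = ($s =~ /(?:(?!foo)...|^.{0,2})bar/) ? 1 : 0;
}

# (?i) lets a pattern stored as data carry its own modifier.
my $input = "BUFFALO and BISON";
my %hit;
for my $pattern ("buffalo", "(?i)bison") {
    $hit{$pattern} = ($input =~ /$pattern/) ? 1 : 0;
}

print scalar(@plain), " fields vs ", scalar(@captured), " fields\n";
print "$_: naive=$naive{$_} careful=$careful{$_}\n" for sort keys %naive;
print "$_ => $hit{$_}\n" for sort keys %hit;
```

The naive lookahead accepts "foobar", which is exactly the trap described above; the careful version rejects it while still accepting the other two strings.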

We chose to use the question mark for this (and for the new minimal matching construct) because (1) question mark is pretty rare in older regular expressions, and (2) whenever you see one, you should stop and question exactly what is going on. That's psychology.

Pattern-Matching Operators

Now that we've got all that out of the way, here finally are the quotelike operators (er, terms) that perform pattern matching and related activities.

m/PATTERN/gimosx
/PATTERN/gimosx

This operator searches a string for a pattern match, and in a scalar context returns true (1) or false (""). If no string is specified via the =~ or !~ operator, the $_ string is searched. (The string specified with =~ need not be an lvalue--it may be the result of an expression evaluation, but remember the =~ binds rather tightly, so you may need parentheses around your expression.)

Modifiers are:

Modifier Meaning

g Match globally, that is, find all occurrences.

i Do case-insensitive pattern matching.

m Treat string as multiple lines.

o Only compile pattern once.

s Treat string as single line.

x Use extended regular expressions.

If / is the delimiter then the initial m is optional. With the m you can use any pair of non-alphanumeric, non-whitespace characters as delimiters. This is particularly useful for matching filenames that contain "/", thus avoiding LTS (leaning toothpick syndrome).

PATTERN may contain variables, which will be interpolated (and the pattern recompiled) every time the pattern search is evaluated. (Note that $) and $| will not be interpolated because they look like end-of-line tests.) If you want such a pattern to be compiled only once, add a /o after the trailing delimiter. This avoids expensive run-time recompilations, and is useful when the value you are interpolating won't change during execution. However, mentioning /o constitutes a promise that you won't change the variables in the pattern. If you do change them, Perl won't even notice.

If the PATTERN evaluates to a null string, the last successfully executed regular expression not hidden within an inner block (including split, grep, and map) is used instead.

If used in a context that requires a list value, a pattern match returns a list consisting of the subexpressions matched by the parentheses in the pattern--that is, ($1, $2, $3 . . . ). (The variables are also set.) If the match fails, a null list is returned. If the match succeeds, but there were no parentheses, a list value of (1) is returned.
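The three list-context outcomes look like this in practice. The sample strings are made up for the sketch.

```perl
use strict;
use warnings;

my $line = "Time: 12:34:56";

my @fields  = ($line =~ /Time: (..):(..):(..)/);             # one element per pair of parentheses
my @noparen = ($line =~ /Time:/);                            # success without parentheses
my @failed  = ("no clock here" =~ /Time: (..):(..):(..)/);   # failure

print "fields:  (@fields)\n";
print "noparen: (@noparen)\n";
print "failed:  (@failed)\n";
```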

Examples:

# case insensitive matching
open(TTY, '/dev/tty');
<TTY> =~ /^y/i and foo();    # do foo() if they want it
# pulling a substring out of a line
if (/Version: *([0-9.]+)/) { $version = $1; }
# avoiding Leaning Toothpick Syndrome
next if m#^/usr/spool/uucp#;
# poor man's grep
$arg = shift;
while (<>) {
    print if /$arg/o;       # compile only once
}
# get first two words and remainder as a list
if (($F1, $F2, $Etc) = ($foo =~ /^\s*(\S+)\s+(\S+)\s*(.*)/))

This last example splits $foo into the first two words and the remainder of the line, and assigns those three fields to $F1, $F2, and $Etc. The conditional is true if any variables were assigned, that is, if the pattern matched. Usually, though, one would just write the equivalent split:

if (($F1, $F2, $Etc) = split(' ', $foo, 3))
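The list-context return values described above can be checked directly. This is only a sketch with an invented input string; it also shows the match and the split producing the same three fields for this particular input.

```perl
use strict;
use warnings;

my $foo = "  alpha beta gamma delta";            # sample input, made up

my @by_match  = ($foo =~ /^\s*(\S+)\s+(\S+)\s*(.*)/);
my @by_split  = split(' ', $foo, 3);

my @no_parens = ("abc" =~ /b/);   # success without parentheses gives (1)
my @failed    = ("abc" =~ /z/);   # failure gives the null list

print join("|", @by_match), "\n";
print join("|", @by_split), "\n";
print scalar(@no_parens), " ", scalar(@failed), "\n";
```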

The /g modifier specifies global pattern matching--that is, matching as many times as possible within the string. How it behaves depends on the context. In a list context, it returns a list of all the substrings matched by all the parentheses in the regular expression. If there are no parentheses, it returns a list of all the matched strings, as if there were parentheses around the whole pattern.

In a scalar context, m//g iterates through the string, returning true each time it matches, and false when it eventually runs out of matches. (In other words, it remembers where it left off last time and restarts the search at that point. You can find the current match position of a string using the pos function--see Chapter 3, Functions.) If you modify the string in any way, the match position is reset to the beginning. Examples:

# list context--extract three numeric fields from uptime command
($one,$five,$fifteen) = (`uptime` =~ /(\d+\.\d+)/g);
# scalar context--count sentences in a document by recognizing
# sentences ending in [.!?], perhaps with quotes or parens on 
# either side.  Observe how dot in the character class is a literal
# dot, not merely any character.
$/ = "";  # paragraph mode
while ($paragraph = <>) {
    while ($paragraph =~ /[a-z]['")]*[.!?]+['")]*\s/g) {
        $sentences++;
    }
}
print "$sentences\n";
# find duplicate words in paragraphs, possibly spanning line boundaries.
#   Use /x for space and comments, /i to match both `is'
#   in "Is is this ok?", and use /g to find all dups.
$/ = "";        # paragraph mode again
while (<>) {
    while ( m{
                \b            # start at a word boundary
                (\w\S+)       # find a wordish chunk
                ( 
                    \s+       # separated by some whitespace
                    \1        # and that chunk again
                ) +           # repeat ad lib
                \b            # until another word boundary
             }xig
         ) 
    {
        print "dup word `$1' at paragraph $.\n";
    } 
}
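For a smaller look at the two behaviors of /g, here is a sketch using invented strings. The first two matches run in list context; the loop runs in scalar context and uses pos to report where each match left off.

```perl
use strict;
use warnings;

my $load = "load average: 0.15, 0.20, 1.05";     # made-up uptime-like text

my @fields = ($load =~ /(\d+\.\d+)/g);           # list context, with parentheses
my @words  = ("to be or not" =~ /\w+/g);         # no parentheses: whole matches

my $str = "aXbXc";
my @ends;
while ($str =~ /X/g) {                           # scalar context: one match per pass
    push @ends, pos($str);                       # offset just past the current match
}

print "@fields\n@words\n@ends\n";
```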

?PATTERN?: This is just like the /PATTERN/ search, except that it matches only once between calls to the reset operator. This is a useful optimization when you only want to see the first occurrence of something in each file of a set of files, for instance. Only ?? patterns local to the current package are reset.

This usage is vaguely deprecated, and may be removed in some future version of Perl. Most people just bomb out of the loop when they get the match they want.
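Here is a sketch of the match-once behavior. Note one assumption that goes beyond the text above: current versions of Perl accept this operator only when it is spelled with a leading m, as m?PATTERN?, so that is the form used here. The subroutine name and input lines are invented.

```perl
use strict;
use warnings;

sub count_first_hits {
    my $hits = 0;
    for my $line (@_) {
        $hits++ if $line =~ m?foo?;    # succeeds at most once until reset
    }
    return $hits;
}

my $first  = count_first_hits("foo 1", "foo 2");  # only "foo 1" counts
my $second = count_first_hits("foo 3");           # already matched once
reset;                                            # re-arm ?? searches in this package
my $third  = count_first_hits("foo 4");           # matches again

print "$first $second $third\n";
```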

s/PATTERN/REPLACEMENT/egimosx: This operator searches a string for PATTERN, and if found, replaces that match with the REPLACEMENT text and returns the number of substitutions made, which can be more than one with the /g modifier. Otherwise it returns false (0).

If no string is specified via the =~ or !~ operator, the $_ variable is searched and modified. (The string specified with =~ must be a scalar variable, an array element, a hash element, or an assignment to one of those, that is, an lvalue.)
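A short sketch of the return value and of the $_ default, using invented strings:

```perl
use strict;
use warnings;

my $paragraph = "Mister Smith met Mister Jones.";
my $count = ($paragraph =~ s/Mister\b/Mr./g);    # number of substitutions made
my $none  = ($paragraph =~ s/Doctor\b/Dr./g);    # nothing to replace, so false

$_ = "green wintergreen";
s/\bgreen\b/mauve/g;                             # no =~, so $_ is searched and modified

print "$count ", ($none ? "true" : "false"), " $_\n";
```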

If the delimiter you choose happens to be a single quote, no variable interpolation is done on either the PATTERN or the REPLACEMENT. Otherwise, if the PATTERN contains a $ that looks like a variable rather than an end-of-string test, the variable will be interpolated into the PATTERN at run-time. If you want the PATTERN compiled only once, when the variable is first interpolated, use the /o option. If the PATTERN evaluates to a null string, the last successfully executed regular expression is used instead. The REPLACEMENT pattern also undergoes variable interpolation, but it does so each time the PATTERN matches, unlike the PATTERN, which just gets interpolated once when the operator is evaluated. (The PATTERN can match multiple times in one evaluation if you use the /g option below.)

Modifiers are:

Modifier Meaning

e Evaluate the right side as an expression.

g Replace globally, that is, all occurrences.

i Do case-insensitive pattern matching.

m Treat string as multiple lines.

o Only compile pattern once.

s Treat string as single line.

x Use extended regular expressions.

Any non-alphanumeric, non-whitespace delimiter may replace the slashes. If single quotes are used, no interpretation is done on the replacement string (the /e modifier overrides this, however). If the PATTERN is contained within naturally paired delimiters (such as parentheses), the REPLACEMENT has its own pair of delimiters, which may or may not be the same ones used for PATTERN--for example, s(foo)(bar) or s<foo>/bar/. A /e will cause the replacement portion to be interpreted as a full-fledged Perl expression instead of as a double-quoted string. (It's kind of like an eval, but its syntax is checked at compile-time.)

Examples:

# don't change wintergreen
s/\bgreen\b/mauve/g;
# avoid LTS with different quote characters
$path =~ s(/usr/bin)(/usr/local/bin);
# interpolated pattern and replacement
s/Login: $foo/Login: $bar/;
# modifying a string "en passant"
($foo = $bar) =~ s/this/that/;
# counting the changes
$count = ($paragraph =~ s/Mister\b/Mr./g);
# using an expression for the replacement
$_ = 'abc123xyz';
s/\d+/$&*2/e;               # yields 'abc246xyz'
s/\d+/sprintf("%5d",$&)/e;  # yields 'abc  246xyz'
s/\w/$& x 2/eg;             # yields 'aabbcc  224466xxyyzz'
# how to default things with /e
s/%(.)/$percent{$1}/g;            # change percent escapes; no /e
s/%(.)/$percent{$1} || $&/ge;     # expr now, so /e
s/^=(\w+)/&pod($1)/ge;            # use function call
# /e's can even nest; this will expand simple embedded variables in $_
s/(\$\w+)/$1/eeg;
# delete C comments
$program =~ s {
    /\*     # Match the opening delimiter.
    .*?     # Match a minimal number of characters.
    \*/     # Match the closing delimiter.
} []gsx;
# trim white space
s/^\s*(.*?)\s*$/$1/;
# reverse 1st two fields
s/([^ ]*) *([^ ]*)/$2 $1/;

Note the use of $ instead of \ in the last example. Some people get a little too used to writing things like:

$pattern =~ s/(\W)/\\\1/g;

This is grandfathered for the right-hand side of a substitution to avoid shocking the sed addicts, but it's a dirty habit to get into.[26] That's because in PerlThink, the right-hand side of a s/// is a double-quoted string. In an ordinary double-quoted string, \1 would mean a control-A, but for s/// the customary UNIX meaning of \1 is kludged in. (The lexer actually translates it to $1 on the fly.) If you start to rely on that, however, you get yourself into trouble if you then add an /e modifier:

[26] Or to not get out of, depending on how you look at it.

s/(\d+)/ \1 + 1 /eg;   # a scalar reference plus one?

Or if you try to do:

s/(\d+)/\1000/;        # "\100" . "0" == "@0"?

You can't disambiguate that by saying \{1}000, whereas you can fix it with ${1}000. Basically, the operation of interpolation should not be confused with the operation of matching a backreference. Certainly, interpolation and matching mean two different things on the left side of the s///.
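As a sketch of the $1 spelling at work (the sample string is made up), both the braces form and the /e form behave as you would hope:

```perl
use strict;
use warnings;

my $order = "7 apples";

# Braces keep the variable name $1 separate from the digits that follow it.
(my $thousands = $order) =~ s/(\d+)/${1}000/;

# Under /e the replacement is ordinary Perl code, where $1 is just a variable.
(my $bumped = $order) =~ s/(\d+)/$1 + 1/e;

print "$thousands | $bumped\n";
```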

Occasionally, you can't just use a /g to get all the changes to occur, either because the substitutions have to happen right-to-left, or because you need the length of $` to change between matches. In this case you can usually do what you want by calling the substitution repeatedly. Here are two common cases:

# put commas in the right places in an integer
1 while s/(\d)(\d\d\d)(?!\d)/$1,$2/;
# expand tabs to 8-column spacing
1 while s/\t+/' ' x (length($&)*8 - length($`)%8)/e;
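To try these two loops on real data, you might wrap them in subroutines. The names commify and expand_tabs are our own, chosen for this sketch; the substitutions are the ones shown above.

```perl
use strict;
use warnings;

sub commify {
    local $_ = shift;
    # Each pass inserts the leftmost comma still missing; stop when none is left.
    1 while s/(\d)(\d\d\d)(?!\d)/$1,$2/;
    return $_;
}

sub expand_tabs {
    local $_ = shift;
    # length($`) changes after every replacement, so /g would not do here.
    1 while s/\t+/' ' x (length($&)*8 - length($`)%8)/e;
    return $_;
}

my $number   = commify("1234567");
my $expanded = expand_tabs("ab\tc");

print "$number\n";
print index($expanded, "c"), "\n";    # column at which "c" ends up
```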

tr/SEARCHLIST/REPLACEMENTLIST/cds, y/SEARCHLIST/REPLACEMENTLIST/cds: Strictly speaking, this operator doesn't belong in a section on pattern matching because it doesn't use regular expressions. Rather, it scans a string character by character, and replaces all occurrences of the characters found in the SEARCHLIST with the corresponding character in the REPLACEMENTLIST. It returns the number of characters replaced or deleted. If no string is specified via the =~ or !~ operator, the $_ string is translated. (The string specified with =~ must be a scalar variable, an array element, a hash element, or an assignment to one of those, that is, an lvalue.) For sed devotees, y is provided as a synonym for tr///. If the SEARCHLIST is contained within naturally paired delimiters (such as parentheses), the REPLACEMENTLIST has its own pair of delimiters, which may or may not be naturally paired ones--for example, tr[A-Z][a-z] or tr(+\-*/)/ABCD/. (The hyphen is backslashed in the second example so that it is taken literally rather than as a range operator.)

Modifiers:

Modifier Meaning

c Complement the SEARCHLIST.

d Delete found but unreplaced characters.

s Squash duplicate replaced characters.

If the /c modifier is specified, the SEARCHLIST character set is complemented; that is, the effective search list consists of all the characters not in SEARCHLIST. If the /d modifier is specified, any characters specified by SEARCHLIST but not given a replacement in REPLACEMENTLIST are deleted. (Note that this is slightly more flexible than the behavior of some tr programs, which delete anything they find in the SEARCHLIST, period.) If the /s modifier is specified, sequences of characters that were translated to the same character are squashed down to a single instance of the character.

If the /d modifier is used, the REPLACEMENTLIST is always interpreted exactly as specified. Otherwise, if the REPLACEMENTLIST is shorter than the SEARCHLIST, the final character is replicated until it is long enough. If the REPLACEMENTLIST is null, the SEARCHLIST is replicated. This latter is useful for counting characters in a class or for squashing character sequences in a class.

Examples:

$ARGV[1] =~ tr/A-Z/a-z/;    # canonicalize to lower case
$cnt = tr/*/*/;             # count the stars in $_
$cnt = $sky =~ tr/*/*/;     # count the stars in $sky
$cnt = tr/0-9//;            # count the digits in $_
tr/a-zA-Z//s;               # bookkeeper -> bokeper
($HOST = $host) =~ tr/a-z/A-Z/;
tr/a-zA-Z/ /cs;             # change non-alphas to single space
tr [\200-\377]
   [\000-\177];             # delete 8th bit
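The rules for the modifiers and for a short REPLACEMENTLIST are easier to believe once you've watched them run. The following sketch uses made-up strings and copies each one before translating it:

```perl
use strict;
use warnings;

(my $deleted  = "hello world") =~ tr/lo//d;      # /d with nothing to map to: delete
(my $padded   = "abcd")        =~ tr/a-d/xy/;    # final "y" is replicated for c and d
(my $exact    = "abcd")        =~ tr/a-d/xy/d;   # with /d, c and d are deleted instead
(my $squashed = "a1b22c")      =~ tr/a-z/#/cs;   # non-letters to "#", runs squashed

my $mixed  = "a1b22c";
my $digits = ($mixed =~ tr/0-9//);               # null REPLACEMENTLIST: just count

print "$deleted|$padded|$exact|$squashed|$digits\n";
```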

If multiple translations are given for a character, only the first one is used:

tr/AAA/XYZ/

will translate any A to X.

Note that because the translation table is built at compile time, neither the SEARCHLIST nor the REPLACEMENTLIST is subject to double quote interpolation. That means that if you want to use variables, you must use an eval:

# either check $@ after the eval...
eval "tr/$oldlist/$newlist/";
die $@ if $@;
# ...or have the eval return a true value on success
eval "tr/$oldlist/$newlist/, 1" or die $@;
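Here is a self-contained sketch of the eval approach applied to a named variable instead of $_. The lists and the text are invented, and the sketch assumes the lists come from a trusted source and contain no slashes, since they are pasted straight into Perl code.

```perl
use strict;
use warnings;

my ($oldlist, $newlist) = ("abc", "xyz");   # hypothetical run-time lists
my $text = "aabbcc";

# The eval string becomes:  $text =~ tr/abc/xyz/, 1
# The trailing 1 makes the eval true even when no characters were translated.
eval "\$text =~ tr/$oldlist/$newlist/, 1" or die $@;

print "$text\n";
```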

One more note: if you want to change your text to uppercase or lowercase, it's better to use the \U or \L sequences in a double-quoted string, since they will pay attention to locale information, but tr/a-z/A-Z/ won't.
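A minimal sketch of the case-shifting escapes, with made-up strings:

```perl
use strict;
use warnings;

my $host  = "example.org";
my $upper = "\U$host";               # everything after \U is uppercased
my $mixed = "\LHELLO\E World";       # \E ends the effect of \L

print "$upper | $mixed\n";
```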
