
Topic 6: Regular expressions

CSE2395/CSE3395 Perl Programming

Llama3 chapters 7-9, pages 98-127; Camel3 pages 139-195; perlre manpage

In this topic
Regular expressions
performing pattern matching

Original Slides by Debbie Pickett, Modified by David Abramson, 2006, Copyright Monash University

Matching strings
Can find one string within another using the index function
    returns position of start of substring, or -1 on failure

    $needle = "tac";
    print index "haystack", $needle;    # 4

Only works for constant substrings
    not usually sufficient for common pattern-matching uses

Llama3 pages 208-209; Camel3 page 731; perlfunc manpage
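A minimal sketch of both outcomes of index, the found case and the -1 failure case. The helper name describe_position is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the slides): report where a fixed
# substring starts, or say that it is absent.
sub describe_position {
    my ($haystack, $needle) = @_;
    my $position = index $haystack, $needle;    # -1 means "not found"
    return $position == -1
        ? "'$needle' not found in '$haystack'"
        : "'$needle' found at position $position of '$haystack'";
}

print describe_position("haystack", "tac"), "\n";
print describe_position("haystack", "tic"), "\n";
```

Positions count from 0, which is why "tac" is reported at 4 rather than 5.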

Regular expressions
Regular expressions are a mini-language used to describe patterns of characters
    e.g., look for a t, followed by any vowel, followed by any letter

Some strings satisfy a given regular expression
    haystack
    taciturn (twice)
    settee
    top

Some strings can't satisfy it
    mouse
    cattle
    bite me (has space where consonant needed to be)
    empty string

Llama3 pages 98-99
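The example strings above can be checked mechanically. This is only a sketch: it uses the pattern t[aeiou][a-z], plus the =~ operator and the g modifier, all of which are introduced later in this topic. The hash name %matches is an assumption made for the illustration.

```perl
use strict;
use warnings;

# Strings taken from the slide.
my @candidates = ("haystack", "taciturn", "settee", "top",
                  "mouse", "cattle", "bite me", "");

my %matches;    # string => number of non-overlapping matches
foreach my $string (@candidates) {
    # In list context with the g modifier, a match returns every match found.
    my @found = $string =~ /t[aeiou][a-z]/g;
    $matches{$string} = scalar @found;
    printf "%-12s %s\n", "'$string'",
        @found ? "matches (@found)" : "no match";
}
```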

Regular expressions
Several Unix programs have support for regular expressions
    usually programs which manipulate text
    grep (print lines matching a pattern)
    sed and awk (stream editors)
    vi and emacs (text editors)
    lex (tokenizer)
    procmail (mail filter)
    perl (some programming language)

Share a (reasonably) common format
    some minor differences in capabilities and dialects
    previous slide's example written t[aeiou][a-z]

Unix grep program


grep prints out any line in its input that matches a regular expression
    only distantly related to Perl's grep function

    % grep 't[aeiou][a-z]' /usr/dict/words
    abated
    abetted
    abolition
    ... lots more words here ...
    yesterday
    youngster
    ytterbium

Llama3 page 99

Regular expressions in Perl


Perl tries to match regular expression patterns to the string in the variable $_
    if successful anywhere inside string, result is true
    otherwise (unsuccessful everywhere), result is false

Pattern is written between two forward slashes
    /t[aeiou][a-z]/
    /.../ called match operator
    boolean value returned
        usually used inside if or while condition
        if (/t[aeiou][a-z]/) { ... }

Llama3 page 100; Camel3 pages 140, 145-150, 218; perldoc manpage

Timeout
# Find occurrences of a pattern in the named files.
# Read lines of input into $_, one at a time.
while (<>) {
    # Check for the pattern in $_.
    if (/t[aeiou][a-z]/) {
        # Success. Print out this line.
        print;
    }
}

Patterns: literal characters


Alphanumeric characters match themselves
    /abc/ matches substring "abc"
    /123/ matches substring "123"

Most other characters require a backslash in order to match themselves
    /\[a\]/ matches substring "[a]"
    /\/usr\/bin/ matches substring "/usr/bin"
    if in doubt, backslash all non-alphanumerics

Backslashes before alphanumerics are special
    /\n/ matches newline character
    /\b/ matches word boundary
    /\d/ is shorthand for /[0-9]/
    /\1/ is a backreference

Llama3 page 100; Camel3 page 158; perlre manpage
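The advice "if in doubt, backslash all non-alphanumerics" can also be applied by Perl itself. The sketch below uses the built-in quotemeta function, which is not mentioned on the slides, together with =~ and pattern interpolation from later in this topic. The sample strings are made up for the illustration.

```perl
use strict;
use warnings;

my $path = "/usr/bin";
# quotemeta is a Perl built-in that backslashes every non-word character.
my $escaped = quotemeta $path;                  # \/usr\/bin
my $line    = "PATH includes /usr/bin and /bin";
my $found   = $line =~ /$escaped/ ? 1 : 0;

# Without backslashes, [a] is a character class, not the literal text "[a]".
my $class_hit   = "a"   =~ /[a]/   ? 1 : 0;     # class matches a bare "a"
my $literal_hit = "a"   =~ /\[a\]/ ? 1 : 0;     # needs real brackets
my $bracket_hit = "[a]" =~ /\[a\]/ ? 1 : 0;

print "$escaped $found $class_hit $literal_hit $bracket_hit\n";
```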

Patterns: character classes


[letters] matches exactly one of the enclosed letters
    /[abc]/ matches substrings "a" or "b" or "c"
    can specify ranges with hyphen
    /[0-9]/ matches any single digit

Inverted classes: [^letters] matches any one character except any of those enclosed
    /[^abc]/ matches substring "x" but not "a"
    /[^0-9]/ matches any one non-digit

Some common character classes have shorthand forms
    /\d/ (digit) same as /[0-9]/
    /\s/ (space) same as /[ \t\n\r\f]/
    /\w/ (word letter) same as /[a-zA-Z0-9_]/
    inverted shortcuts /\D/ (non-digit), /\S/ (non-space), /\W/ (non-word)

Llama3 pages 105-107; Camel3 pages 159, 165-167; perlre manpage
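One way to see what each shorthand class covers is to count the characters it picks out of a sample string. This is a sketch only: the sample text is made up, and the =~ operator and g modifier it relies on are introduced later in this topic.

```perl
use strict;
use warnings;

my $text = "Room 42,\tLevel 3";    # sample string, made up for this sketch

# With the g modifier in list context, a match returns every match found.
my @digits     = $text =~ /\d/g;         # same as /[0-9]/g
my @spaces     = $text =~ /\s/g;         # two spaces and one tab
my @word_chars = $text =~ /\w/g;         # letters, digits, underscore
my @others     = $text =~ /[^\w\s]/g;    # neither word character nor space

print scalar(@digits), " digits, ", scalar(@spaces), " spaces, ",
      scalar(@word_chars), " word characters, other: @others\n";
```

Note that shorthands such as \w and \s can be used inside a bracketed class, as in the last pattern.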

Patterns: any character


. (full stop) shorthand for [^\n] (any character but newline)
    effectively any character because $_ seldom contains newline
        except perhaps unchomped one at very end

/d.g/ matches substrings "dog", "dig", "d g", "d!g"
/...../ matches substring containing any five characters
    true when $_ contains at least five characters
/.\../ matches any character, a dot, then any character
    true when $_ contains a dot that isn't the first or last character of the line

Llama3 page 100; Camel3 page 159; perlre manpage

Timeout
Write regular expressions to match strings containing:
    the word dog in any form of capitalization
    a car's number plate
    a phone number
    a four-letter word beginning with s
    s at the beginning of the line
    no text at all (an empty line)
    a double letter

Multipliers
Multipliers allow the previous part of the pattern to repeat
    write multiplier after part of pattern to repeat
    by default, applies to previous letter or character class
        can group using parentheses

* (asterisk) means 0 or more times
    /at*e/ matches strings "Caesar", "fate", "matter"
    /.*/ matches zero or more of any character
        by itself, matches any string

+ (plus) means one or more times
    /at+e/ matches "fate", "matter" but not "Caesar"

? (question mark) means 0 or 1 times
    /colou?r/ matches substrings "color" and "colour"

Llama3 page 100; Camel3 pages 176-178; perlre manpage
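The difference between * and + shows up on "Caesar", where a is followed directly by e. A small sketch, assuming the hash names %star and %plus and the extra test word "colouur", none of which come from the slides:

```perl
use strict;
use warnings;

my @words = ("Caesar", "fate", "matter");
my (%star, %plus);
foreach my $word (@words) {
    $star{$word} = $word =~ /at*e/ ? 1 : 0;    # a, zero or more t, then e
    $plus{$word} = $word =~ /at+e/ ? 1 : 0;    # a, one or more t, then e
    print "$word: at*e=$star{$word} at+e=$plus{$word}\n";
}

# ? allows at most one u, so "colouur" is rejected.
my @spellings = grep { /colou?r/ } ("color", "colour", "colouur");
print "colou?r matches: @spellings\n";
```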

Alternation and grouping


| (vertical bar) separates alternatives
    more flexible than character classes
    /cat|dog/ matches substrings "cat" and "dog"
    /a|b|c/ means same as /[abc]/

( parentheses ) used to group part of pattern
    to apply multiplier to more than one character
        /c(er)+s/ matches strings "saucers" and "sorcerers"
    to factor out common parts of a pattern
        /(cat|sel)fish/ matches substrings "catfish" and "selfish"
    to use backreferences and capture strings
        see later

Llama3 page 102; Camel3 pages 187-188, 182-185; perlre manpage

Anchors
Sometimes want a pattern to match only at beginning or end of string
    called anchoring a pattern

^ (caret) means beginning of string
    /^s/ matches beginning of string followed by s
        i.e., any string that starts with s

$ (dollar) means end of string
    /r$/ matches r followed by end of string
        i.e., any string that ends with r
    works even if string has not been chomped

Both can be used in same regular expression
    /^dog$/ matches only if entire string is "dog"

\b means boundary between word (\w) and non-word (\W) characters

Llama3 pages 108-109; Camel3 pages 178-180; perlre manpage
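A short sketch of the three anchors in action, including the unchomped case. The variable names and sample strings are assumptions for the illustration, and =~ is covered later in this topic.

```perl
use strict;
use warnings;

my $line = "dog\n";                                  # deliberately not chomped
my $starts_with_d = $line  =~ /^d/    ? 1 : 0;
my $ends_with_g   = $line  =~ /g$/    ? 1 : 0;      # $ allows one trailing newline
my $exactly_dog   = $line  =~ /^dog$/ ? 1 : 0;
my $dogs_exact    = "dogs" =~ /^dog$/ ? 1 : 0;      # extra s, so no match

# \b keeps "cat" from matching inside a longer word.
my $in_category = "category"    =~ /\bcat\b/ ? 1 : 0;
my $in_sentence = "the cat sat" =~ /\bcat\b/ ? 1 : 0;

print "$starts_with_d $ends_with_g $exactly_dog $dogs_exact ",
      "$in_category $in_sentence\n";
```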

Timeout
# Mail headers revisited: verify mail header format.
# Mail headers look like either of these lines:
# word: anything after the colon
#     continuation lines are indented
while (<>) {
    # Stop when blank line reached; end of headers.
    last if /^$/;
    # Patterns match if line starts with either
    # - at least one non-space, then colon, or
    # - a space
    unless (/^(\S+:|\s)/) {
        print "Bad header line:\n$_";
    }
}

split and join


split function breaks a string up into pieces
    takes regular expression to specify how pieces are to be separated; returns the pieces as a list
    @threeparts = split / /, "cat and mouse";
    foreach (split /\s+/, $line) { ... }
    @fields = split /,/, $record;    # CSV

join function joins a list into a string
    takes string to specify what goes between pieces; returns the pieces glued together into a string
    $phrase = join " and ", "cat", "mouse", "fish";
    print join " ", @words;
    $record = join ",", @fields;    # CSV

Llama3 pages 125-127; Camel3 pages 794-796, 733; perlfunc manpage
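split and join are roughly inverses of each other, which a round trip makes visible. A minimal sketch with a made-up record; it ignores quoting and embedded commas, so it is not a real CSV parser.

```perl
use strict;
use warnings;

my $record  = "cat,mouse,fish";           # made-up comma-separated record
my @fields  = split /,/, $record;         # ("cat", "mouse", "fish")
my $phrase  = join " and ", @fields;      # "cat and mouse and fish"
my $rebuilt = join ",", @fields;          # same text as $record

# \s+ treats any run of whitespace as a single separator.
my @words = split /\s+/, "cat   and\tmouse";

print "$phrase\n", scalar(@words), " words\n";
```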

Timeout
# Iterate over every word in an input stream.
# Read each line of input
while (<STDIN>) {
    foreach (split /\s+/, $_) {
        next if /^$/;    # Skip blank words.
        do_something($_);
    }
}

sub do_something {
    print "Saw word ", shift, "\n";
}

Timeout
Write regular expressions to match strings containing:
    the word dog in any form of capitalization
    a car's number plate
    a phone number
    a four-letter word beginning with s
    s at the beginning of the line
    no text at all (an empty line)
    a double letter

Advanced regular expressions


Most languages can process regular expressions of complexity seen so far

Perl has many more advanced features which use regular expressions
    case-insensitive matching
    interpolating patterns
    backreferences
    capturing matched strings
    substitution
    matching variables other than $_
    greedy and lazy multipliers

Case-insensitive matches
Regular expressions normally sensitive to case
    /a/ doesn't match substring "A"

Can make pattern case-insensitive using i modifier
    put i character immediately after end of match operator
    /a/i matches substrings "a" or "A"

Llama3 page 116; Camel3 pages 147-178; perlre manpage

Interpolating into patterns


Variables can be interpolated into regular expressions
    like double-quoted strings

    $pattern = 'fish(es)?';
    /cat$pattern/
        same as /catfish(es)?/

Llama3 page 118; Camel3 pages 190-191; perlre manpage
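Interpolation combines with the i modifier from the previous slide. The sketch below filters a made-up list with Perl's grep function; the array names are assumptions for the illustration.

```perl
use strict;
use warnings;

my $pattern = 'fish(es)?';
my @inputs  = ("catfish", "CATFISHES", "cat fish");

# $pattern is interpolated before the regular expression is compiled,
# so both matches below behave like /catfish(es)?/.
my @sensitive   = grep { /cat$pattern/ }  @inputs;
my @insensitive = grep { /cat$pattern/i } @inputs;

print "case-sensitive:   @sensitive\n";
print "case-insensitive: @insensitive\n";
```

The i modifier applies to the whole pattern, including the interpolated part.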

Timeout
# Perl implementation of Unix grep program
# Pattern is first command-line argument
$pattern = shift;
while (<>) {
    # Print the line if it matches the pattern.
    # o ("once") modifier tells Perl to assume that
    # the pattern never changes; this allows Perl
    # to re-use the compiled regular expression,
    # making the program run faster.
    print if /$pattern/o;
}

Backreferences
So far, cannot write pattern to match double letter
    /[a-z][a-z]/ matches any two letters, even if different

Need pattern that says: match any letter, calling the matched string 1, then match string 1 again

Backreferences refer to the substrings matched by previous parts of the pattern
    put parentheses around part of pattern to remember
        first ( and its matching ) become string 1
        second ( and its matching ) become string 2
    write backreference as \1, \2, etc.
    /([a-z])\1/ matches substring composed of any double letter
    /\b(\w+)\b.*\b\1\b/ matches any string containing the same word twice

Llama3 pages 109-111; Camel3 pages 182-184; perlre manpage
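Both backreference patterns from the slide can be tried on a few strings. This is a sketch with made-up word and sentence lists, again using Perl's grep function to keep only the matching items.

```perl
use strict;
use warnings;

# Words containing any double letter.
my @words   = ("bookkeeper", "haystack", "settee");
my @doubled = grep { /([a-z])\1/ } @words;

# Sentences containing the same whole word twice.
my @sentences = ("this is the the end", "the cat sat");
my @repeated  = grep { /\b(\w+)\b.*\b\1\b/ } @sentences;

print "double letters: @doubled\n";
print "repeated word:  $_\n" foreach @repeated;
```

The \b anchors matter in the second pattern: without them, the "is" inside "this" would count as a repeat of the word "is".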

Capturing strings
Matched backreference substrings are available after the match succeeds
    backreference \1 is available in special variable $1
    backreference \2 is available in special variable $2
    etc.

Allows code to find out what strings matched which parts of a pattern
    $_ = "haystack";
    /(t([aeiou])[a-z])/ puts "tac" in $1 and "a" in $2
    if match fails, variables are not set (they keep whatever the last successful match left in them)

Captured strings are available until next match succeeds

Llama3 pages 109-111; Camel3 pages 182-185; perlre, perlvar manpages
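Because a failed match leaves $1 and $2 holding older values, it is usually safest to read them only after checking that the match succeeded. A minimal sketch; the variable names and the sample header line are assumptions for the illustration.

```perl
use strict;
use warnings;

$_ = "haystack";
my ($whole, $vowel);
if (/(t([aeiou])[a-z])/) {
    # Only trust $1 and $2 inside the branch where the match succeeded.
    ($whole, $vowel) = ($1, $2);
}

# In list context a successful match returns its captures directly,
# which avoids touching $1 and $2 at all.
my ($name, $value) = "Subject: hello" =~ /^(\S+):\s?(.*)$/;

print "$whole $vowel $name $value\n";
```

Parentheses are numbered by the position of their opening bracket, so the outer group is $1 and the nested vowel group is $2.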

Timeout
# Identify mail headers
while (<STDIN>) {
    last if /^$/;
    # Extract name of header (before colon) into $1
    # and content of header (after colon to end of
    # line) into $2.
    # Match fails on continuation lines, so
    # $1 and $2 variables not set.
    if (/^(\S+):\s?(.*)$/) {
        print "Header name is $1, contains $2\n";
    }
}

Timeout
# Fancy Unix grep that identifies where a match was.
$pattern = shift;
# ANSI terminal escapes
$bold = "\033[1m";
$norm = "\033[0m";
while (<>) {
    # Look for pattern, capture it into $2.
    # Also capture all previous text on line into $1
    # and all following text to $3.
    if (/^(.*)($pattern)(.*)$/o) {
        # (.*)$ stops before the line's newline, so add one back.
        print "$1$bold$2$norm$3\n";
    }
}

Multiplier greediness
Multipliers *, + and ? are normally greedy
    if there are two ways to successfully match a string, they will try to match the longest substring
    $_ = "mississippi";
    /m.*ss/ matches up to second ss
        because /.*/ would rather match "issi" than just "i"

Non-greedy (lazy) multipliers *?, +? and ?? exist
    will try to match the shortest substring
    $_ = "mississippi";
    /m.*?ss/ matches up to first ss

If only one way to match, greedy and lazy multipliers match same way

Greediness only important if need to know which part of string matched a pattern
    if using \1, \2, $1, $2, etc.
    if using s/.../.../

Camel3 pages 177-178; perlre manpage
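Capturing the whole match makes the difference between the greedy and lazy forms visible. A small sketch, with variable names chosen for the illustration:

```perl
use strict;
use warnings;

my $river = "mississippi";

# Parentheses around the whole pattern capture exactly what it matched.
my ($greedy) = $river =~ /(m.*ss)/;     # .* takes as much as it can
my ($lazy)   = $river =~ /(m.*?ss)/;    # .*? takes as little as it can

print "greedy: $greedy\n";    # mississ
print "lazy:   $lazy\n";      # miss
```

Both patterns succeed on the same string; only the extent of the match differs.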

Substitution
To replace a matched substring with a new substring, use s/pattern/replacement/ operator
    pattern is a regular expression to find in the $_ variable
    replacement is the string to replace the matching part of $_
        not a regular expression
        may contain $1, $2, etc. captured strings

    s/colou?r/hue/;    # Make a synonym

If pattern not found, no change is made to $_

Llama3 pages 122-123; Camel3 pages 152-155; perlop manpage

Substitution
Variables are interpolated into both pattern and replacement
    s/$regex/$new/;

Substitution normally only occurs for the first match in a string
    use g (global) modifier to make substitution repeat as often as possible on the string
    s/cat/dog/g;

Substitution also takes i (case-insensitive) modifier

Llama3 pages 123, 124; Camel3 page 153; perlop manpage
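The effect of each modifier is easiest to see by applying the same substitution to copies of one string. In this sketch the sample text and variable names are assumptions; each one-element for loop simply aliases $_ to a copy so that s/// can work on its default variable.

```perl
use strict;
use warnings;

my $text = "cat, Cat and catfish";
my ($first_only, $global, $any_case) = ($text) x 3;    # three copies

for ($first_only) { s/cat/dog/ }     # first match only
for ($global)     { s/cat/dog/g }    # every lowercase "cat"

my $changes;
for ($any_case)   { $changes = s/cat/dog/gi }    # s/// returns the count

# The replacement may use captured strings.
my $swapped = "hello world";
for ($swapped) { s/(\w+) (\w+)/$2 $1/ }

print "$first_only\n$global\n$any_case ($changes changes)\n$swapped\n";
```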

Timeout
# Censor: change some words in input to others.
%swearwords = (
    'Micro[s\$]oft' => 'M.......t',
    'Windows( (95|98|ME))?' => 'Windoze',
    'Python' => 'anti-Perl'
);
$count = 0;    # So the total prints as 0 when nothing is changed.
while (<>) {
    while (($bad, $euphemism) = each %swearwords) {
        # s/// returns number of times succeeded
        $count += s/$bad/$euphemism/gi;
    }
    print;
}
print "$count words changed\n";

Binding operator =~
Match /.../ and substitution s/.../.../ operators match against $_ variable by default

Can match against any variable with binding operator =~
    put variable on left of operator
    put match/substitution on right of operator

    if ($string =~ /pattern/) { ... }
    $changeme =~ s/cat/dog/g;
    ($copy = $orig) =~ s/cat/dog/g;
    if ($_ =~ /pattern/)    # Redundant

Llama3 pages 117-118; Camel3 pages 93-94; perlop manpage
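The ($copy = $orig) =~ s/.../.../ form deserves a closer look: the assignment happens first, and the substitution is then bound to the copy, so the original is left alone. A minimal sketch with made-up strings:

```perl
use strict;
use warnings;

my $orig = "cat and cat";

# Assign first, then substitute in the copy; $orig is untouched.
(my $copy = $orig) =~ s/cat/dog/g;

# A match bound to a named variable does not involve $_ at all.
my $string  = "haystack";
my $matched = $string =~ /t[aeiou][a-z]/ ? 1 : 0;

print "orig=$orig copy=$copy matched=$matched\n";
```

Without the parentheses, $copy = $orig =~ s/cat/dog/g would change $orig and store the number of substitutions in $copy.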

Covered in this topic


Regular expressions
    Character classes
        [...], ., \s, \S, \d, etc.
    Multipliers
        *, +, ?, non-greedy versions *?, +?, ??
    Anchors
        ^, $
    Alternation and grouping
    Backreferences and capturing substrings
        \1, \2, $1, $2, etc.

Match operator /.../
Interpolation
split and join
Substitution operator s/.../.../
Binding operator =~

Going further
Advanced regular expressions
    look-ahead, look-behind, evaluating Perl expressions as regular expressions, etc.
    Camel3 pages 195-216
    Mastering Regular Expressions, by Jeffrey Friedl, O'Reilly & Associates

tr/.../.../
    transliteration operator, like Unix tr program

sed, awk, grep, vi, ...
    some of Unix's more powerful pattern-matching tools
    man sed, man awk, ...

Next topic

File I/O
    Opening and closing files
    Reading from and writing to files
    Manipulating files and directories
    Communicating with processes

Llama3 chapter 6, pages 86-97, chapters 11-14, pages 148-207; Camel3 pages 20-22, 28-29, 97-100, 426-428, 747-755, 770; perlfunc, perlopentut manpages