Exegesis 5
Regular Expressions
by Damian Conway
August 22, 2002
- Exegesis 5
- What's the diff?
- Starting gently
- Lay it out for me
- Interpolate ye not ...
- The incredible $hunk
- Modified modifiers
- Take no prisoners
- Meanwhile, back at the $hunk...
- This or nothing
- Failing with style
- Home, home on the (line) range
- What's my line?
- The final frontier
- Match-maker, match-maker ...
- A cleaner approach
- What's in a name?
- Bad line! No match!
- Thinking ahead
- What you match is what you get
- A hypothetical solution to a very real problem
- The nesting instinct
- Extracting the insertions
- Don't just match there; do something!
- Smarter alternatives
- Rearranging the deck chairs
- Deriving a benefit
- Different diffs
- Let's get cooking
Exegesis 5
Come gather round Mongers, whatever you code
And admit that your forehead's about to explode
'Cos Perl patterns induce complete brain overload
If there's source code, you should be maintainin'
Then you better start learnin' Perl 6 patterns soon
For the regexes, they are a-changin'
Apocalypse 5 marks a significant departure in the ongoing design of Perl 6.
Previous Apocalypses took an evolutionary approach to changing Perl's general syntax, data structures, control mechanisms and operators. New features were added, old features removed, and existing features were enhanced, extended and simplified. But the changes described were remedial, not radical.
Larry could have taken the same approach with regular expressions. He could
have tweaked some of the syntax, added new (?...) constructs, cleaned
up the rougher edges, and moved on.
Fortunately, however, he's taking a much broader view of Perl's future
than that. And he saw that the problem with regular expressions was not
that they lacked a (?$var:...) extension to do named captures, or
that they needed a \R metatoken to denote a recursive subpattern,
or that there was a [:YourNamedCharClassHere:] mechanism missing.
He saw that those features, laudable as they were individually, would just compound the real problem, which was that Perl 5 regular expressions were already groaning under the accumulated weight of their own metasyntax. And that a decade of accretion had left the once-clean notation arcane, baroque, inconsistent and obscure.
It was time to throw away the prototype.
Even more importantly, as powerful as Perl 5 regexes are, they are not nearly powerful enough. Modern text manipulation is predominantly about processing structured, hierarchical text. And that's just plain painful with regular expressions. The advent of modules like Parse::Yapp and Parse::RecDescent reflects the community's widespread need for more sophisticated parsing mechanisms. Mechanisms that should be native to Perl.
As Piers Cawley has so eloquently misquoted: “It is a truth universally acknowledged that any language in possession of a rich syntax must be in want of a rewrite.” Perl regexes are such a language. And Apocalypse 5 is precisely that rewrite.
What's the diff?
So let's take a look at some of those new features. To do that, we'll consider a series of examples structured around a common theme: recognizing and manipulating data in the Unix diff format.
A classic diff consists of zero-or-more text transformations, each of
which is known as a “hunk”. A hunk consists of a modification specifier,
followed by one or more lines of context. Each hunk is either an append,
a delete, or a change, and the type of hunk is specified by a single
letter ('a', 'd', or 'c'). Each of these single-letter specifiers is
prefixed by the line numbers of the lines in the original document it
affects, and followed by the equivalent line numbers in the transformed
file. The context information consists of the lines of the original file
(each preceded by a '<' character), then the lines of the
transformed file (each preceded by a '>'). Deletes omit the
transformed context, appends omit the original context. If both contexts
appear, then they are separated by a line consisting of three hyphens.
Phew! You can see why natural language isn't the preferred way of specifying data formats.
The preferred way is, of course, to specify such formats as patterns. And, indeed, we could easily throw together a few Perl 6 patterns that collectively would match any data conforming to that format:
$file = rx/ ^  <$hunk>*  $ /;

$hunk = rx :i {
    [ <$linenum> a :: <$linerange> \n
      <$appendline>+
    |
      <$linerange> d :: <$linenum> \n
      <$deleteline>+
    |
      <$linerange> c :: <$linerange> \n
      <$deleteline>+
      --- \n
      <$appendline>+
    ]
  |
    (\N*) ::: { fail "Invalid diff hunk: $1" }
};

$linerange = rx/ <$linenum> , <$linenum>
               | <$linenum>
               /;

$linenum = rx/ \d+ /;

$deleteline = rx/^^ \< <sp> (\N* \n) /;
$appendline = rx/^^ \> <sp> (\N* \n) /;

# and later...

my $text is from($*ARGS);

print "Valid diff"
    if $text =~ /<$file>/;
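For readers who'd like to poke at the format with something that runs today, here is a rough Python analogue of the same grammar. It is only a sketch under some assumptions: the names `LINENUM`, `LINERANGE`, `DELETELINE`, `APPENDLINE`, `HUNK`, `FILE`, `SAMPLE` and `is_valid_diff` are invented for illustration, the sample diff is made up, and the `::` backtracking controls and the `fail` branch of the Perl 6 version are left out, because Python's `re` module has no direct equivalent for them.

```python
import re

# Building blocks, mirroring $linenum, $linerange, $deleteline and $appendline.
LINENUM = r"\d+"
LINERANGE = rf"{LINENUM}(?:,{LINENUM})?"
DELETELINE = r"<[ ].*\n"   # '<', a space, then the rest of the line
APPENDLINE = r">[ ].*\n"   # '>', a space, then the rest of the line

# One hunk: an append, a delete, or a change (verbose mode ignores the layout).
HUNK = rf"""
    (?: {LINENUM} a {LINERANGE} \n
        (?: {APPENDLINE} )+
      | {LINERANGE} d {LINENUM} \n
        (?: {DELETELINE} )+
      | {LINERANGE} c {LINERANGE} \n
        (?: {DELETELINE} )+
        --- \n
        (?: {APPENDLINE} )+
    )
"""

# A whole diff is zero or more hunks; IGNORECASE plays the role of :i.
FILE = re.compile(rf"(?: {HUNK} )*", re.VERBOSE | re.IGNORECASE)


def is_valid_diff(text):
    """Return True if the entire text is a sequence of classic diff hunks."""
    return FILE.fullmatch(text) is not None


# A made-up diff containing one hunk of each kind.
SAMPLE = (
    "2a3\n"
    "> inserted line\n"
    "5,6d6\n"
    "< first removed line\n"
    "< second removed line\n"
    "9c9\n"
    "< old text\n"
    "---\n"
    "> new text\n"
)

if __name__ == "__main__":
    print("Valid diff" if is_valid_diff(SAMPLE) else "Invalid diff")
```

Notice that the Python version has to paste the subpatterns together as strings before compiling, so every building block must be defined before it is used. The Perl 6 rules above have no such ordering constraint.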
Starting gently
There's a lot of new syntax there, so let's step through it slowly, starting with:
$file = rx/ ^ <$hunk>* $ /;
This statement creates a pattern object. Or, as it's known in Perl 6, a
“rule”. People will probably still call them “regular expressions” or
“regexes” too (and the keyword rx reflects that), but Perl patterns
long ago ceased being anything like “regular”, so we'll try to avoid
those terms.
In any case, the rx constructor builds a new rule, which is then
stored in the $file variable. The Perl 5 equivalent would be:
# Perl 5
my $file = qr/ ^ (??{$hunk})* $ /x;
This illustrates quite nicely why the entire syntax needed to change.
The name of the rule constructor has changed from qr to rx,
because in Perl 6 rule constructors aren't quotelike contexts.
In particular, variables don't interpolate into rx constructors
in the way they do for a qq or a qx. That's why we can embed the
$hunk variable before it's actually initialized.
In Perl 6, an embedded variable becomes part of the rule's implementation rather than part of its “source code”. As we'll see shortly, the pattern itself can determine how the variable is treated (i.e., whether to interpolate it literally, treat it as a subpattern or use it as a container).
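The difference between interpolating a variable's text when the pattern is built and consulting the variable when the pattern is used can be imitated, very loosely, in Python. The sketch below is an analogy only and says nothing about how Perl 6 actually implements rules: the `Rule` class and the `rules` registry are hypothetical names, and the `<name>` placeholder syntax is invented for the example.

```python
import re

# Hypothetical registry standing in for the variables that hold Perl 6 rules.
rules = {}


class Rule:
    """A pattern whose <name> placeholders are looked up only at match time."""

    def __init__(self, source):
        self.source = source

    def expand(self):
        # Replace each <name> with the current definition of that rule,
        # recursively, wrapping it in a non-capturing group.
        return re.sub(
            r"<(\w+)>",
            lambda m: f"(?:{rules[m.group(1)].expand()})",
            self.source,
        )

    def matches(self, text):
        return re.fullmatch(self.expand(), text) is not None


# 'file' mentions 'hunk' before 'hunk' has been defined, and that's fine,
# because nothing is looked up until matches() is called.
rules["file"] = Rule(r"(?:<hunk>)*")
rules["hunk"] = Rule(r"<linenum>a<linenum>\n")
rules["linenum"] = Rule(r"\d+")

if __name__ == "__main__":
    print(rules["file"].matches("2a3\n10a11\n"))
```

Because the lookup happens inside `matches`, the order in which the three rules are registered doesn't matter, which is the property the `$hunk` example relies on.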

