Posted by Nick Gammon (Forum Administrator, Australia)
I have been experimenting a bit with LPeg, a pattern-matching library for Lua based on Parsing Expression Grammars (PEGs). It was written by Roberto Ierusalimschy, one of the developers of Lua.
You can read about the Lua implementation at:
http://www.inf.puc-rio.br/~roberto/lpeg/
and also:
http://www.inf.puc-rio.br/~roberto/lpeg/re.html
Plus more information in a PDF file at:
http://www.inf.puc-rio.br/~roberto/docs/peg.pdf
Effectively what lpeg does (as far as I can tell) is give you the tools to make something similar to a regular expression parser, but with the pattern (and its components) being "first class values". That is, patterns are ordinary Lua values which can be stored in variables and combined piece by piece.
I am pretty confident that David Haley could explain these grammars a lot better than I can, but I will make a stab at it.
If you want to try these examples at home, I have taken the Lua library distributed by Roberto, compiled it under Cygwin to get a Windows DLL, and released that here (84 Kb):
http://www.gammon.com.au/files/mushclient/lpeg-0-1.8.1_b.zip
Take the file lpeg.dll from the .zip file and place it in the same directory as MUSHclient.exe, and you are ready to roll.
To get started, let's look at matching on something like "You gain 300 hp.".
require "lpeg"
digits = lpeg.R ("09")
hp = lpeg.P ("You gain ") * digits^1 * " hp."
print (lpeg.match (hp, "You gain 300 hp.")) --> 17
What we have done here is make a variable (actually a Lua userdatum) called "digits" which is a pattern matching a single character in the range 0-9.
Then the main pattern consists of:
- lpeg.P ("You gain ") --> match on this literal string
- digits^1 --> match on the digits pattern, one or more times
- " hp." --> match on " hp." literally (including the leading space)
The "*" symbol is the concatenation operator for lpeg. That is, this followed by that.
We now pass that to lpeg.match, along with our target string, and get a non-nil result (actually the position of the first character after the match), which confirms we matched.
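To see the return values in action, here is a small sketch. The failing test strings are my own inventions, not real MUD output. It shows that lpeg.match returns nil when the pattern fails, and that (as far as I can tell) the match is anchored at the start of the subject but does not have to reach the end of it.

```lua
local lpeg = require "lpeg"

local digits = lpeg.R ("09")
local hp = lpeg.P ("You gain ") * digits^1 * " hp."

-- Successful match: returns the position just past the matched text
local ok_pos = lpeg.match (hp, "You gain 300 hp.")
print (ok_pos)  --> 17

-- No digits where they are expected, so the pattern fails
local no_digits = lpeg.match (hp, "You gain some hp.")
print (no_digits)  --> nil

-- The match is anchored at the start of the subject string
local not_at_start = lpeg.match (hp, "Wow! You gain 300 hp.")
print (not_at_start)  --> nil

-- Trailing text is fine, the pattern only has to match a prefix
local with_tail = lpeg.match (hp, "You gain 300 hp. Nice!")
print (with_tail)  --> 17
```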
We can do something similar to match on mana:
mana = lpeg.P ("You gain ") * digits^1 * " mana."
print (lpeg.match (mana, "You gain 500 mana.")) --> 19
Now we probably want the "capture" results back (that is, what *was* our HP value?). So we can put lpeg.C (capture) around the part in question:
hp = lpeg.P ("You gain ") * lpeg.C (digits^1) * " hp."
print (lpeg.match (hp, "You gain 300 hp.")) --> 300
mana = lpeg.P ("You gain ") * lpeg.C (digits^1) * " mana."
print (lpeg.match (mana, "You gain 500 mana.")) --> 500
Things start to get interesting now. We have two patterns, one matches on hp, and the other on mana. We can concatenate the patterns together to get a larger pattern:
print (lpeg.match (hp * mana, "You gain 300 hp.You gain 500 mana.")) --> 300 500
Both results are returned, and we are matching on the concatenated string.
We can swap around the order:
print (lpeg.match (mana * hp, "You gain 500 mana.You gain 300 hp.")) --> 500 300
Or, we can do one *or* the other:
print (lpeg.match (hp + mana, "You gain 123 hp.")) --> 123
print (lpeg.match (hp + mana, "You gain 456 mana.")) --> 456
The "+" operator is the choice operator. We will match on the hp string, or the mana string.
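Choice and repetition can be combined. As a rough sketch (the "gains" name is mine), putting ^1 around the choice matches one or more gain messages in any order, so we no longer need a separate pattern for each ordering:

```lua
local lpeg = require "lpeg"

local digits = lpeg.R ("09")
local hp   = lpeg.P ("You gain ") * lpeg.C (digits^1) * " hp."
local mana = lpeg.P ("You gain ") * lpeg.C (digits^1) * " mana."

-- One or more gains, hp or mana, in any order
local gains = (hp + mana)^1

print (lpeg.match (gains, "You gain 300 hp.You gain 500 mana."))  --> 300 500
print (lpeg.match (gains, "You gain 500 mana.You gain 300 hp."))  --> 500 300
print (lpeg.match (gains, "You gain 1 hp.You gain 2 hp.You gain 3 mana."))  --> 1 2 3
```

Bear in mind that the captures alone do not tell you which value was hp and which was mana.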
Next, let's parse a more complex string. Something like "You see exits leading north, up, down, west and south".
This can be broken into components. The first component is the possible exit directions:
directions = lpeg.C (lpeg.P "north" + "south" + "east" + "west" + "up" + "down")
This shows that the possible directions are (in each case) a choice of north, south etc., which we want captured.
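One thing to watch with the choice operator is that it is ordered: the alternatives are tried left to right and the first one that matches wins. That does not matter for the six directions above, but it would if one alternative were a prefix of another. The sketch below uses "northeast", which is a hypothetical extra direction added purely for illustration.

```lua
local lpeg = require "lpeg"

-- "northeast" is a hypothetical extra direction, used only for illustration
local short_first = lpeg.C (lpeg.P "north" + "northeast")
local long_first  = lpeg.C (lpeg.P "northeast" + "north")

-- "north" is tried first and succeeds, so "northeast" is never tried
print (lpeg.match (short_first, "northeast"))  --> north

-- Putting the longer alternative first gives the expected result
print (lpeg.match (long_first, "northeast"))   --> northeast
print (lpeg.match (long_first, "north"))       --> north
```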
The next part of our exits string is everything between the introduction and the final direction, namely "north, up, down, west and " in:
You see exits leading north, up, down, west and south
We can see that what we have here is one initial direction (north in this case), followed by zero or more extra directions (up, down, west). These extra directions are preceded by a comma. Finally we have the word "and". This can all be expressed like this:
comma_directions = directions * (", " * directions)^0 * " and "
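Finally, the whole string starts with "You see exits leading ", then optionally has the comma_directions part (the ^-1 means "at most once", which allows for a room with only one exit), and finishes off with the final direction: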
patt = lpeg.P ("You see exits leading ") * (comma_directions^-1) * directions
We want all the directions nicely placed in a table, so we can use lpeg.Ct to gather all the submatches together:
result = lpeg.match (lpeg.Ct (patt), "You see exits leading north, up, down, west and south")
require "tprint"
if result then
  tprint (result)
else
  print "no match"
end -- if
Output
1="north"
2="up"
3="down"
4="west"
5="south"
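To check that the pattern copes with shorter exit lists as well, here is a small test sketch. The "exits" helper function and the shorter test lines are my own, made up for the purpose; the helper just joins the captured directions with commas so they can be printed on one line.

```lua
local lpeg = require "lpeg"

local directions = lpeg.C (lpeg.P "north" + "south" + "east" + "west" + "up" + "down")
local comma_directions = directions * (", " * directions)^0 * " and "
local patt = lpeg.Ct (lpeg.P ("You see exits leading ") * comma_directions^-1 * directions)

-- Hypothetical helper: returns the exits as a comma-separated string, or nil if no match
local function exits (line)
  local t = lpeg.match (patt, line)
  return t and table.concat (t, ",")
end

print (exits ("You see exits leading north"))          --> north
print (exits ("You see exits leading east and west"))  --> east,west
print (exits ("You see exits leading north, up, down, west and south"))  --> north,up,down,west,south
print (exits ("You see exits leading sideways"))       --> nil
```

In the single-exit case the optional comma_directions part is tried, fails at " and ", and is skipped, so "north" is only captured once.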
Another way of expressing the same idea is with an "exit line grammar". This might look like this:
directions <- "north" | "south" | "east" | "west" | "up" | "down"
comma_directions <- directions (", " directions)* " and "
exitline <- "You see exits leading " comma_directions? directions
In the notation above "|" means "or", "*" means "zero or more" and "?" means "zero or one".
We can express that grammar in lpeg like this:
exitgrammar = lpeg.P {
"exitline",
directions = lpeg.C (lpeg.P "north" + "south" + "east" + "west" + "up" + "down"),
comma_directions = lpeg.V "directions" * (", " * lpeg.V "directions")^0 * " and ",
exitline = "You see exits leading " * lpeg.V "comma_directions"^-1 * lpeg.V "directions",
}
result = lpeg.match (lpeg.Ct (exitgrammar), "You see exits leading north, up, down, west and south")
This gives the same results as before.
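The "re" module described at the second link above lets you write such a grammar as a string, much closer to the notation shown earlier. The following is only a sketch, assuming that re.lua from the LPeg distribution is somewhere on your Lua path. In re syntax the choice operator is "/" rather than "|", and curly braces mark a capture. I have renamed the rules (direction, commadirs) to keep them short.

```lua
local lpeg = require "lpeg"
local re = require "re"  -- assumes re.lua from the LPeg distribution is on the Lua path

-- The first rule is the start rule; { } marks a capture; "/" is ordered choice
local exitline = re.compile [[
  exitline  <- "You see exits leading " commadirs? direction
  commadirs <- direction (", " direction)* " and "
  direction <- { "north" / "south" / "east" / "west" / "up" / "down" }
]]

-- re.compile returns an ordinary lpeg pattern, so lpeg.Ct can wrap it
local result = lpeg.match (lpeg.Ct (exitline), "You see exits leading north, up, down, west and south")
print (table.concat (result, " "))  --> north up down west south
```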
There is considerably more power in lpeg than I have described here (or, indeed, been able to understand). The links given above give more details. There are examples there of patterns that parse comma-separated strings, and even evaluate whole arithmetic expressions, returning the result.
- Nick Gammon
www.gammon.com.au, www.mushclient.com