Gammon Forum : MUSHclient : Lua : LPeg pattern parsing

LPeg pattern parsing

Posted by Nick Gammon, Australia (23,173 posts), Forum Administrator
Date: Mon 02 Jun 2008 06:01 AM (UTC)
Amended on Mon 02 Jun 2008 09:34 PM (UTC) by Nick Gammon

I have been experimenting a bit with LPeg - a pattern matching library for Lua, based on Parsing Expression Grammars. This is written by Roberto Ierusalimschy, the developer of Lua.

You can read about the Lua implementation at:

http://www.inf.puc-rio.br/~roberto/lpeg/

and also:

http://www.inf.puc-rio.br/~roberto/lpeg/re.html

Plus more information in a PDF file at:

http://www.inf.puc-rio.br/~roberto/docs/peg.pdf

Effectively what lpeg does (as far as I can tell) is give you the tools to make something similar to a regular expression parser, but with the regular expression pattern (and its components) being "first class values". That is, they can be operated on in pieces.

I am pretty confident that David Haley could explain these grammars a lot better than I can, but I will make a stab at it.

If you want to try these examples at home, I have taken the Lua library distributed by Roberto, compiled it under Cygwin to get a Windows DLL, and released that here (84 Kb):

http://www.gammon.com.au/files/mushclient/lpeg-0-1.8.1_b.zip

Take the file lpeg.dll from the .zip file and place it in the same directory as MUSHclient.exe, and you are ready to roll.

To get started, let's match on something like "You gain 300 hp.".


require "lpeg"

digits = lpeg.R ("09")

hp = lpeg.P ("You gain ") * digits^1 * " hp."

print (lpeg.match (hp, "You gain 300 hp."))  --> 17

What we have done here is make a variable (actually a Lua userdatum) called "digits" which is the pattern for the range 0-9.

Then the main pattern consists of:

lpeg.P ("You gain ") --> match on this literal string
digits^1 --> match on the digits pattern, one or more times
" hp." --> match on " hp." literally (note the leading space)

The "*" symbol is the concatenation operator for lpeg. That is, this followed by that.

We now pass that to lpeg.match, along with our target string, and get a non-nil result (actually the position in the string just after the match ends), which confirms we matched.
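One thing worth knowing here: lpeg.match only tries the pattern at the start of the subject string, so you get nil if the line does not begin with the pattern. The sketch below shows that, plus one common way to search anywhere in a line by first skipping characters that do not start a match. The variable names and the "Wow!" test line are made up for illustration.

```lua
local lpeg = require "lpeg"

local digits = lpeg.R ("09")
local hp = lpeg.P ("You gain ") * digits^1 * " hp."

-- lpeg.match only tries the pattern at the start of the subject
local at_start = lpeg.match (hp, "You gain 300 hp.")           -- position after the match
local not_at_start = lpeg.match (hp, "Wow! You gain 300 hp.")  -- nil, pattern is not at the start

-- skip any characters that do not begin a match, then match hp
local anywhere = (1 - hp)^0 * hp
local found = lpeg.match (anywhere, "Wow! You gain 300 hp.")

print (at_start, not_at_start, found)  --> 17  nil  22
```

The "1 - hp" part means "any single character, provided hp does not match here", so the loop stops as soon as hp can match.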

We can do something similar to match on mana:


mana = lpeg.P ("You gain ") * digits^1 * " mana."

print (lpeg.match (mana, "You gain 500 mana."))  --> 19

Now we probably want the "capture" results back (that is, what *was* our HP value?). So we can put lpeg.C (capture) around the part in question:


hp = lpeg.P ("You gain ") * lpeg.C (digits^1) * " hp."
print (lpeg.match (hp, "You gain 300 hp."))  --> 300

mana = lpeg.P ("You gain ") * lpeg.C (digits^1) * " mana."
print (lpeg.match (mana, "You gain 500 mana."))  --> 500
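Note that the captures come back as strings. If you want to do arithmetic on them, one option (a sketch, certainly not the only way) is a function capture: dividing a pattern by a function passes the captured text through that function. Here tonumber is the standard Lua function, and the names "number" and "gained" are just for illustration.

```lua
local lpeg = require "lpeg"

local digits = lpeg.R ("09")

-- capture the digits, then hand the captured string to tonumber
local number = lpeg.C (digits^1) / tonumber

local hp = lpeg.P ("You gain ") * number * " hp."

local gained = lpeg.match (hp, "You gain 300 hp.")
print (type (gained), gained + 1)  --> number  301
```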

Things start to get interesting now. We have two patterns, one matches on hp, and the other on mana. We can concatenate the patterns together to get a larger pattern:


print (lpeg.match (hp * mana, "You gain 300 hp.You gain 500 mana."))  --> 300 500

Both results are returned, and we are matching on the concatenated string.

We can swap around the order:


print (lpeg.match (mana * hp, "You gain 500 mana.You gain 300 hp."))  --> 500 300

Or, we can do one *or* the other:


print (lpeg.match (hp + mana, "You gain 123 hp."))  --> 123
print (lpeg.match (hp + mana, "You gain 456 mana."))  --> 456

The "+" operator is the choice operator. We will match on the hp string, or the mana string.
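As far as I can tell from the documentation, the choice is ordered: the alternatives are tried left to right and the first one that matches wins, even if a later one would have matched more text. It makes no difference for hp and mana above, but it matters when one alternative is a prefix of another. A small sketch (the words "up" and "upstairs" are just test data I made up):

```lua
local lpeg = require "lpeg"

-- "up" is tried first and succeeds, so "upstairs" is never attempted
local short_first = lpeg.P ("up") + "upstairs"

-- put the longer alternative first so it gets its chance
local long_first = lpeg.P ("upstairs") + "up"

print (lpeg.match (short_first, "upstairs"))  --> 3
print (lpeg.match (long_first, "upstairs"))   --> 9
print (lpeg.match (long_first, "up"))         --> 3
```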

Next, let's parse a more complex string. Something like "You see exits leading north, up, down, west and south".

This can be broken into components. The first component is the possible exit directions:


directions = lpeg.C (lpeg.P "north" + "south" + "east" + "west" + "up" + "down")

This shows that the possible directions are (in each case) a choice of north, south etc., which we want captured.

The next part of our exits string is everything between the opening phrase and the final direction, that is, "north, up, down, west and" in:

You see exits leading north, up, down, west and south

We can see that what we have here is one initial direction (north in this case), followed by zero or more extra directions (up, down, west). These extra directions are preceded by a comma. Finally we have the word "and". This can all be expressed like this:


comma_directions = directions * (", " * directions)^0 * " and "

Finally the whole string starts with "You see exits leading" and finishes off with the final direction.


patt = lpeg.P ("You see exits leading ") * (comma_directions^-1) * directions

We want all the directions nicely placed in a table, so we can use lpeg.Ct to gather all the submatches together:


result = lpeg.match (lpeg.Ct (patt), "You see exits leading north, up, down, west and south")

require "tprint"

if result then
  tprint (result)
else
  print "no match"
end -- if

Output

1="north"
2="up"
3="down"
4="west"
5="south"
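To check that the optional part behaves, here is a self-contained sketch that runs the same pattern against shorter exit lines. It uses table.concat rather than tprint, so it should also run outside MUSHclient. The helper function "exits" and the extra test lines are my own inventions for illustration.

```lua
local lpeg = require "lpeg"

local directions = lpeg.C (lpeg.P "north" + "south" + "east" + "west" + "up" + "down")
local comma_directions = directions * (", " * directions)^0 * " and "
local patt = lpeg.P ("You see exits leading ") * comma_directions^-1 * directions

-- hypothetical helper: returns the exits joined with "|", or nil if no match
local function exits (line)
  local result = lpeg.match (lpeg.Ct (patt), line)
  return result and table.concat (result, "|")
end

print (exits ("You see exits leading north"))          --> north
print (exits ("You see exits leading east and west"))  --> east|west
print (exits ("You see exits leading north, up, down, west and south"))
                                                       --> north|up|down|west|south
print (exits ("You see no exits"))                     --> nil
```

In the single-exit case comma_directions fails when it cannot find " and ", so the ^-1 matches nothing and the final directions picks up "north" on its own.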

Another way of expressing the same idea is with an "exit line grammar". This might look like this:


directions       <- "north" | "south" | "east" | "west" | "up" | "down"

comma_directions <- directions (", " directions)*  " and "

exitline         <- "You see exits leading "  comma_directions? directions

In the notation above "|" means "or", "*" means "zero or more" and "?" means "zero or one".

We can express that grammar in lpeg like this:


exitgrammar = lpeg.P {
           "exitline",
           directions = lpeg.C (lpeg.P "north" + "south" + "east" + "west" + "up" + "down"),
           comma_directions = lpeg.V "directions" * (", " * lpeg.V "directions")^0 * " and ",
           exitline = "You see exits leading " * lpeg.V "comma_directions"^-1 * lpeg.V "directions",
           }
           
result = lpeg.match (lpeg.Ct (exitgrammar),  "You see exits leading north, up, down, west and south")

This gives the same results as before.
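The "re" module linked at the top lets you write the grammar as text, much closer to the notation above. This is only a sketch, assuming re.lua (shipped with the lpeg distribution) is somewhere on your Lua path. Note that re writes choice as "/" rather than "|", and uses { } for a capture.

```lua
local lpeg = require "lpeg"
local re = require "re"

-- the first rule is the starting rule; { ... } captures the matched text
local exitline = re.compile [[
  exitline         <- 'You see exits leading ' comma_directions? directions
  comma_directions <- directions (', ' directions)* ' and '
  directions       <- { 'north' / 'south' / 'east' / 'west' / 'up' / 'down' }
]]

-- re.compile returns an ordinary lpeg pattern, so lpeg.Ct can wrap it
local result = lpeg.match (lpeg.Ct (exitline),
                           "You see exits leading north, up, down, west and south")

print (table.concat (result, ","))  --> north,up,down,west,south
```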

There is considerably more power in lpeg than I have described here (or, indeed, been able to understand). The links given above give more details. There are examples there of patterns that parse comma-separated strings, and even evaluate whole arithmetic expressions, returning the result.

- Nick Gammon

www.gammon.com.au, www.mushclient.com