Introduction

This post is going to explore the different ways of parsing complex expressions in Lua.

Examples are intended to be used in a script file or a plugin “script” section, not in “send to script” in MUSHclient, because “send to script” would need the “%” and “\” symbols to be doubled to work correctly.

Note: I will use the abbreviation regexp for a regular expression and LPEG for an LPEG construction.

Also note: Strictly speaking, in Lua they are patterns rather than regexps. However, they are close enough that I’ll stick to the word “regexp” here.


Why use regular expressions?

Regular expressions help match on complex lines. A normal string compare suffices for things like:

The door is closed.

But if there is variable data, you need to be able to use some sort of wildcards, eg.

You killed the kobold and got 10 experience.

In MUSHclient triggers and aliases, you can use “simple” wildcards like this:

You killed the * and got * experience.

However internally they get turned into a regexp, so I won’t cover them here.


Simple examples, getting started

We will assume that you have loaded the “tprint” (table print) module like this:

require "tprint"

Let’s try to match the example line:

You killed the kobold and got 10 experience.

We’ll make that test string into a Lua variable:

target = "You killed the kobold and got 10 experience."

Lua regexp

print (string.match (target, "You killed the .+ and got .+ experience%."))
--> Output: You killed the kobold and got 10 experience.

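In regexps, the “.” symbol matches any single character, and the “+” symbol means “one or more of the previous item”.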

The final period has to be “escaped” because we want to literally match a period, not “anything”. In Lua regexps, the “%” character escapes the character after it.

The output is the entire matching text. If there was no match we would get nil as a result.
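Since a failed match returns nil, the result can be tested directly in an “if”. Here is a minimal sketch of that idea; the helper name describe is made up for illustration and is not part of Lua or MUSHclient:

```lua
target  = "You killed the kobold and got 10 experience."
pattern = "You killed the .+ and got .+ experience%."

-- hypothetical helper: report whether a line matched the pattern
function describe (line)
  local matched = string.match (line, pattern)
  if matched then
    return "matched: " .. matched
  end
  return "no match"   -- string.match returned nil
end -- describe

print (describe (target))                 --> matched: You killed the kobold and got 10 experience.
print (describe ("The door is closed."))  --> no match
```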


PCRE regexp

line = rex.new ("You killed the .+ and got .+ experience\\.")
s, e = line:match (target)
print (s, e)
--> Output: 1 44

In this case we know the start and end matching columns. If there was no match we would get nil as a result.

The final period has to be “escaped” because we want to literally match a period, not “anything”. In PCRE regexps, the “\” character escapes the character after it. Since it is in a Lua string we have to double it, because otherwise Lua interprets that as escaping the next symbol (the period). If you are using a regexp inside a trigger or alias “match” field you don’t need to double the backslashes.
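If the doubled backslashes get hard to read, one option in a script file is a Lua long-bracket string, which does not process escape sequences. This small sketch only compares the two spellings of the same regexp text (it does not call PCRE); the variable names are just for illustration:

```lua
-- the same regexp text written two ways
doubled    = "You killed the .+ and got .+ experience\\."   -- quoted string, backslash doubled
longString = [[You killed the .+ and got .+ experience\.]]  -- long string, backslash as-is

print (doubled == longString)  --> true
print (#"\\.")                 --> 2  (a backslash and a period)
```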


LPEG

Things get a little more complex with LPEG. First, let’s pull in some functions and table items as local variables, to save typing:

require "lpeg"    -- get the LPEG module - not needed for MUSHclient which has it built in

-- save typing function names with "lpeg" in front of them:
local P, V, Cg, Ct, Cc, S, R, C, Cf, Cb, Cs =
  lpeg.P, lpeg.V, lpeg.Cg, lpeg.Ct, lpeg.Cc, lpeg.S, lpeg.R, lpeg.C, lpeg.Cf, lpeg.Cb, lpeg.Cs

-- character classes
lpeg.locale (lpeg)  -- get digit, alpha, etc.
local alpha, cntrl, digit, graph, lower, punct, space, upper, alnum, xdigit =
   lpeg.alpha, lpeg.cntrl, lpeg.digit, lpeg.graph, lpeg.lower, lpeg.punct,
   lpeg.space, lpeg.upper, lpeg.alnum, lpeg.xdigit

Now, the “P” function makes a pattern according to what you supply it. The simple case is a string literal, so that P"foo" matches “foo”.

Matching the variable words like “kobold” and “10” is a bit more complex. We can use the pattern P(1) to match a single character (any character). But we might have more than one character (indeed, “kobold” is 6 characters). Now in LPEG we can do this to match “one or more” characters:

P(1)^1

The trouble is, that is a “greedy” match so it will consume the rest of the line.

Since LPEG does not do backtracking, we have to do this a different way.
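To see the problem in action, here is a small sketch (the variable names everything and greedy are just for illustration). Once P(1)^1 has swallowed the rest of the line there is nothing left for " and got " to match, and LPEG does not give any characters back:

```lua
local lpeg = require "lpeg"   -- not needed for MUSHclient which has it built in
local P = lpeg.P

target = "You killed the kobold and got 10 experience."

-- P(1)^1 consumes everything up to the end of the line
everything = P"You killed the " * P(1)^1
print (lpeg.match (everything, target))  --> 45

-- so the following literal can never match, and the whole pattern fails
greedy = everything * P" and got "
print (lpeg.match (greedy, target))      --> nil
```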

We either need to:

  1. Know what to match (eg. letters or numbers, etc.)
  2. Know what follows the thing we are trying to match (eg. “and got”). The Lua and PCRE regexps do this automatically, however LPEG doesn’t.

We will use method #1 first, and match alpha (letters) (for “kobold”) and digits (numbers) (for the experience):

line = P"You killed the " * alpha^1 * " and got " * digit^1 * P" experience."

print (lpeg.match (line, target))
--> Output: 45

LPEG returns the first column past the match. If there was no match we would get nil as a result.

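In LPEG, the “*” operator joins patterns in sequence (“followed by”), and “^1” after a pattern means “one or more” of that pattern.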


Now we’ll try method #2, and look for “anything that is not ‘and got’”. That effectively matches the word “kobold”.

line = P"You killed the " * (1 - P" and got")^1 * " and got " * digit^1 * P" experience."

print (lpeg.match (line, target))
--> Output: 45

However this looks a bit weird. It makes more sense to decide what you need to match on (eg. some letters) rather than skip everything that is not what follows.

That approach is tedious because we need to put “and got” twice into the expression. If we need to change that to “and received” then we have to change two places. So, we can make a helper function to do it for us:

function upto (what)
  return ((P(1) - P(what))^1) * P(what)
end -- upto

line = P("You killed the ") * upto(" and got ") * upto(" experience.")

The upto function takes a pattern. At each position it checks that the stopping pattern does not match there and, if so, consumes one character. It repeats this until it reaches the stopping pattern, which it then matches. So in other words, it consumes all the characters up to and including the stopping pattern.

An alternative would be to put the word you don’t want first, and then look for one character, like this:

function upto (what)
  return ((-P(what) * P(1))^1) * P(what)
end -- upto
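As a quick check that the two versions behave the same way, this sketch runs both against the test line. The names uptoMinus and uptoNot are made up here so that both versions can exist side by side:

```lua
local lpeg = require "lpeg"   -- not needed for MUSHclient which has it built in
local P = lpeg.P

target = "You killed the kobold and got 10 experience."

-- "any character, minus the stopping pattern"
function uptoMinus (what)
  return ((P(1) - P(what))^1) * P(what)
end -- uptoMinus

-- "not the stopping pattern, then any character"
function uptoNot (what)
  return ((-P(what) * P(1))^1) * P(what)
end -- uptoNot

for _, upto in ipairs { uptoMinus, uptoNot } do
  line = P"You killed the " * upto (" and got ") * upto (" experience.")
  print (lpeg.match (line, target))  --> 45 (both times)
end -- for
```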

re module

The LPEG “re” (regular expression) module lets you describe LPEG in a more “regexp” way, like this:

require "re"

line = re.compile [[
'You killed the ' %a+ ' and got ' %d+ ' experience.'
]]

print (lpeg.match (line, target))
--> Output: 45

The underlying matching is still the same as LPEG, so you need to explicitly describe what you want to match (eg. “%a+” for the word “kobold”).


Greediness

Both PCRE and Lua regular expressions can match “greedily” or not. What does “greedy matching” do? Consider wanting to match on the regexp:

a+

And imagine an input of:

aaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbb

Greedy matching would be:

aaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbb
^^^^^^^^^^^^^^^^^^
match

Non-greedy matching would be:

aaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbb
^
match

Greedy matching matches as much as it can while still satisfying that part of the regexp. Non-greedy matching matches as little as it can.


Lua regexp

Greedy

print (string.match ("aaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbb", "a+"))
--> Output: aaaaaaaaaaaaaaaaaa

Non-greedy

print (string.match ("aaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbb", "a-"))
--> Output: (nothing)

In this case, “as little as it can” is nothing at all! In Lua the “-” symbol means “zero or more, non-greedy” so in this case the minimum is nothing.
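The non-greedy “-” becomes more useful when something has to follow it. A small sketch with a made-up test string:

```lua
s = "aXbYb"

print (string.match (s, "a.*b"))  --> aXbYb  (greedy: runs to the last "b")
print (string.match (s, "a.-b"))  --> aXb    (non-greedy: stops at the first "b")
```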


PCRE regexp

Greedy

line = rex.new ("a+")
s, e = line:match ("aaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbb")
print (s, e)
--> Output: 1 18

Non-greedy

line = rex.new ("a+?")
s, e = line:match ("aaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbb")
print (s, e)
--> Output: 1 1

With PCRE regexps you append a “?” to the match counter to get non-greedy, so in this case we get a single “a” (because we have to match on one or more).


LPEG

LPEG only does greedy matches.


Captures

It’s all very well matching on a string like:

You killed the kobold and got 10 experience.

But what if you want to know what the variable words are (“kobold” and “10”)?

This is where “captures” are useful. You place in the regexp symbols to tell it that we want the matching part returned.


Lua regexp

mobName, experience = string.match (target, "You killed the (.+) and got (.+) experience%.")

print (mobName)
print (experience)
--> Output: kobold
            10

Things we want to capture are placed in round brackets. If we need to match on round brackets we have to put a “%” in front of them.
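Captures always come back as strings, so if you want to do arithmetic on the experience it is safer to convert it. The sketch below is one way of doing that; using “%d+” for the number and anchoring with “^” and “$” are my own choices here, not something the original example requires:

```lua
target = "You killed the kobold and got 10 experience."

mobName, experience = string.match (target,
    "^You killed the (.+) and got (%d+) experience%.$")

if mobName then                        -- nil means the line did not match
  experience = tonumber (experience)   -- "10" (string) becomes 10 (number)
  print (mobName, experience * 2)      --> kobold  20
end -- if
```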


PCRE regexp

line = rex.new ("You killed the (.+) and got (.+) experience\\.")
s, e, matches = line:match (target)
print (s, e)
tprint (matches)
--> Output: 1 44
            1="kobold"
            2="10"

Things we want to capture are placed in round brackets. If we need to match on round brackets we have to put a “\” in front of them.

In other words, we got the same start and end columns as before, but also a table of captures, where capture #1 is the first set of round brackets, and capture #2 is the second set.


LPEG

line = P"You killed the " * C(alpha^1) * " and got " * C(digit^1) * P" experience."

mobName, experience = lpeg.match (line, target)
print (mobName)
print (experience)
--> Output: kobold
            10

That format returns each capture as a result from the lpeg.match() call. If you want a table of matches surround the pattern by a Ct() function call:

line = Ct (P"You killed the " * C(alpha^1) * " and got " * C(digit^1) * P" experience.")

tprint (lpeg.match (line, target))
--> Output: 1="kobold"
            2="10"

This required two changes. First we put C( ) around things we want to capture. Second, we put Ct( ) around the whole expression. The C (capture) functions mark those parts as needing to be captured. The Ct (capture table) function places the captures into a table.
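If you would rather have named fields in the table than numbered ones, LPEG's group capture (lpeg.Cg) can give each capture a name inside Ct. This is only a sketch; the field names “mob” and “xp” are invented for the example:

```lua
local lpeg = require "lpeg"   -- not needed for MUSHclient which has it built in
lpeg.locale (lpeg)            -- get digit, alpha, etc.
local P, C, Cg, Ct = lpeg.P, lpeg.C, lpeg.Cg, lpeg.Ct

target = "You killed the kobold and got 10 experience."

-- Cg (pattern, name) stores the capture under that name in the Ct table
line = Ct (P"You killed the " * Cg (C (lpeg.alpha^1), "mob") *
           " and got " * Cg (C (lpeg.digit^1), "xp") * P" experience.")

result = lpeg.match (line, target)
print (result.mob, result.xp)  --> kobold  10
```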

Now, using the method of stopping on “and got” we can modify our upto helper function slightly to return a capture:

function upto (what)
  return C((P(1) - P(what))^1) * P(what)
end -- upto

line = P("You killed the ") * upto(" and got ") * upto(" experience.")

mobName, experience = lpeg.match (line, target)
print (mobName)
print (experience)
--> Output: kobold
            10

re module

line = re.compile [[
'You killed the ' { %a+ } ' and got ' { %d+ } ' experience.'
]]

mobName, experience = lpeg.match (line, target)
print (mobName)
print (experience)
--> Output: kobold
            10

The above returns the captures as results from the lpeg.match() call.

If the mob name could consist of any characters (except " and got " of course) then you can use the technique described earlier: check that the string " and got " does not start at the current position, consume a single character, and repeat until " and got " is reached, like this:

line = re.compile [[ 'You killed the '
                       {(!' and got ' .)*}
                       ' and got '
                       {[0-9]+}
                       ' experience.'
                  ]]

If you want a table returned, do this:

line = re.compile [[
{| 'You killed the ' { %a+ } ' and got ' { %d+ } ' experience.' |}
]]

tprint (lpeg.match (line, target))
--> Output: 1="kobold"
            2="10"

The {…} syntax replicates the lpeg.C (capture) call. The {|…|} syntax replicates the lpeg.Ct (capture table) call.


Anchoring

A normal Lua or PCRE regexp is not anchored.

For example, in Lua:

target = "You saw a dog and a cat"
print (string.match (target, "dog"))
--> Output: dog

That matched “dog” even though it wasn’t at the start of the line.

LPEG would not match in that situation.

target = "You saw a dog and a cat."
print (lpeg.match ("dog", target))
--> Output: nil

Force Lua and PCRE regexps to be anchored

To anchor to the start of the line, we put a “^” symbol at the start, for example:

target = "You saw a dog and a cat"
print (string.match (target, "^dog"))
--> Output: nil

However it matches “You saw”:

target = "You saw a dog and a cat"
print (string.match (target, "^You saw"))
--> Output: You saw

To anchor to the end of the line, we put a “$” symbol at the end, for example:

target = "You saw a dog and a cat"
print (string.match (target, "dog$"))
--> Output: nil

However it matches “cat”:

target = "You saw a dog and a cat"
print (string.match (target, "cat$"))
--> Output: cat

To match the exact regular expression you use both:

target = "You saw a dog and a cat"
print (string.match (target, "^dog$"))
--> Output: nil

However:

target = "dog"
print (string.match (target, "^dog$"))
--> Output: dog

Force LPEG to be anchored at the end

LPEG is already anchored at the start, so how to anchor at the end? We add a pattern of P(-1) to the end, which is the same as:

"" - P(1)

In other words, it matches the empty string, providing there is nothing following the empty string. This can only happen at the end of the line.

target = "You saw a dog and a cat"
print (lpeg.match ("You saw a" * P(-1), target))
--> Output: nil

print (lpeg.match (P"You saw a dog and a cat" * P(-1), target))
--> Output: 24

Or to see the matching string add a capture around the pattern:

print (lpeg.match (C("You saw a dog and a cat" * P(-1)), target))
 --> Output: You saw a dog and a cat

Force re to be anchored at the end

Similarly to what you do with LPEG, you can anchor an re pattern by finishing with “!.”. For example:

require "re"
target = "You saw a dog and a cat"
print (lpeg.match (re.compile ("'You saw a' !."), target))
--> Output: nil

print (lpeg.match (re.compile ("'You saw a dog and a cat' !."), target))
--> Output: 24

Force LPEG to find the pattern anywhere in the line

function anywhere (p)
  return lpeg.P { p + 1 * lpeg.V(1) }
end

print (lpeg.match (anywhere ("dog"), target))
--> Output: 14

The helper function anywhere accomplishes this. This actually sets up a “grammar” which is what you get when you give LPEG a table (note the curly braces).

The grammar could be written like this:

grammar = {
   [1] =  p + 1 * V(1)  -- rule #1
   }

So, the grammar has one rule, named 1.

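Looking at the grammar, we can see that rule 1 first tries to match p. If that fails, the “+” (ordered choice) falls through to the alternative, which skips one character (the 1) and then applies rule 1 again (the V(1)).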

Effectively you could say it recurses, and tries to match one position in from the start. If that fails, it repeats, and matches two positions in, and so on until it runs out of things to match, or gets a match.

This might sound slow, trying the pattern over and over, but in most cases the attempt to match (on “dog” in this case) fails immediately on the first letter, so only one character needs to be tested at each position.
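If you want to know where the match was found, rather than just where it ended, position captures can be added on either side of the pattern. This sketch uses lpeg.Cp, which is a standard LPEG function but was not among those pulled in as locals earlier; the name anywherePos is made up:

```lua
local lpeg = require "lpeg"   -- not needed for MUSHclient which has it built in
local I = lpeg.Cp ()          -- position capture: returns the current position

function anywherePos (p)
  -- rule 1: position, pattern, position; otherwise skip a character and retry
  return lpeg.P { I * p * I + 1 * lpeg.V(1) }
end -- anywherePos

target = "You saw a dog and a cat"
print (lpeg.match (anywherePos ("dog"), target))  --> 11 14
```

The first number is the column where “dog” starts, and the second is the first column past it.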

If you wanted to capture the matching word you can add a capture to the anywhere function, eg.

function anywhere (p)
  return lpeg.P { C(p) + 1 * lpeg.V(1) }
end

target = "You see 666 dogs and a cat"
print (lpeg.match (anywhere (digit^1), target))
--> Output: 666

Force LPEG to find the pattern at the end of the line

If you want to scan the line for the pattern, but have it anchored to the end, then we can add “* P(-1)” to the end of the pattern, like this:

function anywhere (p)
  return lpeg.P { C(p) + 1 * lpeg.V(1) }
end

target = "You see 666 dogs and a cat"
print (lpeg.match (anywhere ("cat" * P(-1)), target))
--> Output: cat

Make a grammar

For more complex strings you can make a “grammar” - that is, a set of rules for parsing the line.

Take this for example:

You see exits leading north, up, down, west and south

We can break that down into parts like this:

Directions       <- "north" | "south" | "east" | "west" | "up" | "down"
CommaDirections  <- Directions (", " Directions)*  " and "
ExitLine         <- "You see exits leading "  CommaDirections? Directions

In the notation above “|” means “or”, “*” means “zero or more” and “?” means “zero or one”.

We can express that grammar in LPEG like this:

exitgrammar = P {
           "ExitLine",   --> this tells LPEG which rule to process first
           Directions       = C (P"north" + "south" + "east" + "west" + "up" + "down"),
           CommaDirections  = V"Directions" * (", " * V"Directions")^0 * " and ",
           ExitLine         = "You see exits leading " * V"CommaDirections"^-1 * V"Directions",
           }

result = lpeg.match (Ct (exitgrammar),  "You see exits leading north, up, down, west and south")

tprint (result)

This gives a table of matches, like this:

1="north"
2="up"
3="down"
4="west"
5="south"
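It is worth checking a grammar like this against the edge cases as well. The sketch below repeats the grammar so that it can be run on its own, and tries a single exit, two exits, and a line with an unknown direction; the helper name exits is made up for the example:

```lua
local lpeg = require "lpeg"   -- not needed for MUSHclient which has it built in
local P, V, C, Ct = lpeg.P, lpeg.V, lpeg.C, lpeg.Ct

exitgrammar = P {
  "ExitLine",
  Directions      = C (P"north" + "south" + "east" + "west" + "up" + "down"),
  CommaDirections = V"Directions" * (", " * V"Directions")^0 * " and ",
  ExitLine        = "You see exits leading " * V"CommaDirections"^-1 * V"Directions",
  }

-- hypothetical helper: returns a table of directions, or nil if no match
function exits (line)
  return lpeg.match (Ct (exitgrammar), line)
end -- exits

print (#exits ("You see exits leading north"))            --> 1
print (#exits ("You see exits leading north and south"))  --> 2
print (exits ("You see exits leading sideways"))          --> nil
```

In the single-exit case the optional CommaDirections rule fails (there is no " and "), so its capture of “north” is thrown away and the final Directions rule captures it instead.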

Using the “re” module for describing a grammar

The same grammar as above can be expressed in a more natural way using the “re” module:

require "re"

exits = re.compile[[
  ExitLine         <- {| "You see exits leading "  CommaDirections? Directions |}
  CommaDirections  <- Directions (", " Directions)*  " and "
  Directions       <- { "north" / "south" / "east" / "west" / "up" / "down" }
]]

tprint ( exits:match ("You see exits leading north, up, down, west and south") )

This gives results like this:

1="north"
2="up"
3="down"
4="west"
5="south"

re syntax

Syntax              Description
( p )               grouping
'string'            literal string
"string"            literal string
[class]             character class
.                   any character
%name               pattern defs[name] or a pre-defined pattern
name                non terminal
<name>              non terminal
{}                  position capture
{ p }               simple capture
{: p :}             anonymous group capture
{:name: p :}        named group capture
{~ p ~}             substitution capture
{| p |}             table capture
=name               back reference
p ?                 optional match
p *                 zero or more repetitions
p +                 one or more repetitions
p^num               exactly n repetitions
p^+num              at least n repetitions
p^-num              at most n repetitions
p -> 'string'       string capture
p -> "string"       string capture
p -> num            numbered capture
p -> name           function/query/string capture equivalent to p / defs[name]
p => name           match-time capture equivalent to lpeg.Cmt(p, defs[name])
& p                 and predicate
! p                 not predicate
p1 p2               concatenation
p1 / p2             ordered choice
(name <- p)+        grammar

And now for a fancier example. I wanted to match lines with colour codes in them (like Aardwolf uses) but not include the colour codes in word matching. For example, the word “jumped” should match even if preceded by “@x2” (as in “@x2jumped”).

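To do this I made up a grammar where we had a rule for the colour codes. This can be either “@x” followed by up to three digits (eg. “@x009”), or “@” followed by a single colour letter (eg. “@r”).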

The “word” to match on was just alphabetic (ie. “%a+”) but you could include underscores or numbers if you wanted.

require "re"

local target = "the quick @r@g@Wbrown@g, fox@x2jumped, @x009over the lazy frog helicopter"

local grammar = re.compile[[
  line              <- {| (wordWithColour+ / .)* |}
  wordWithColour    <- colourCode* {} {word} colourCode*
  word              <- %a+
  colourCode        <- "@" (("x" %d^-3) / colourLetters)
  colourLetters     <- [bBcCrRmMgGwWyYdD]
]]

-- run grammar on target text
local resultTable = grammar:match (target)

tprint (resultTable)

The output is a table of positions and matching words, like this:

1=1
2="the"
3=5
4="quick"
5=17
6="brown"
7=26
8="fox"
9=32
10="jumped"
11=45
12="over"
13=50
14="the"
15=54
16="lazy"
17=59
18="frog"
19=64
20="helicopter"

You can then do something with that (like substitute another word for the matching ones).

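You can control match counts with the “^” operator: “p^num” matches exactly num repetitions, “p^+num” matches at least num, and “p^-num” matches at most num.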

So in this case “%d^-3” matches a maximum of 3 digits.
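The three repetition forms from the re syntax table can be tried out directly with re.match. This is a small sketch using made-up strings of digits; the number returned is the first column past the match:

```lua
local re = require "re"

print (re.match ("12345", "%d^3"))   --> 4    (exactly 3 digits)
print (re.match ("12345", "%d^+3"))  --> 6    (at least 3, so it takes all 5)
print (re.match ("12345", "%d^-3"))  --> 4    (at most 3)
print (re.match ("12",    "%d^3"))   --> nil  (fewer than 3 digits available)
```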

Calling functions for a pattern

Expanding on the above example, a modified version calls a function for each matching word.

In this example the grammar calls gotWord (notice the table as the second argument to re.compile). When a pattern matches a word, gotWord is called which then optionally substitutes a different word. The entire match is then put into the result table, which can be concatenated to reconstruct the original line, with substitutions.

The rule “%a+ -> gotWord” means that matches get sent to the gotWord function, and whatever it returns is used as the final capture value.

require "re"

print (string.rep ("-", 60))

-- words they want replaced
local wantedReplacements = {
  ['quick']      = 'slow',
  ['jumped']     = 'hopped',
  ['brown']      = 'green',
  ['the']        = "THE",
  ['helicopter'] = 'bus',

-- and so on
  } -- end of wantedReplacements

function gotWord (x)
  return wantedReplacements [x] or x
end -- gotWord

local target = "the quick @r@g@Wbrown@g, @@fox@x2jumped, @x009over the lazy frog helicopter"

local grammar = re.compile ([[
  line              <- {| (wordWithColour+ / {.} )* |}
  wordWithColour    <- colourCode* word colourCode*
  word              <-  %a+ -> gotWord
  colourCode        <- { ("@" (("x" %d^-3) / colourLetters)) }
  colourLetters     <- [bBcCrRmMgGwWyYdD]
]], { gotWord = gotWord } )

-- run grammar on target text
result = grammar:match (target)

-- debug
require "tprint"
tprint (result)

print (table.concat (result))

Output is:

1="THE"
2=" "
3="slow"
4=" "
5="@r"
6="@g"
7="@W"
8="green"
9="@g"
10=","
11=" "
12="@"
13="@"
14="fox"
15="@x2"
16="hopped"
17=","
18=" "
19="@x009"
20="over"
21=" "
22="THE"
23=" "
24="lazy"
25=" "
26="frog"
27=" "
28="bus"
THE slow @r@g@Wgreen@g, @@fox@x2hopped, @x009over THE lazy frog bus

Substitutions

LPEG substitution using inbuilt string function

The function lpeg.Cs returns a string with the values for captures replacing what they capture. We can call a function to do the replacements, for example:

pattern = lpeg.R"am"
pattern = lpeg.Cs((pattern / string.upper + 1)^0)
print (pattern:match ("the quick brown fox jumped over the lazy dog"))

Output is:

tHE quICK Brown Fox JuMpED ovEr tHE LAzy DoG

In this case the pattern matches letters in the range “a” to “m”. The second line repeatedly matches the pattern, or advances one character (if there is no match).
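The replacement does not have to be a function. Dividing a pattern by a string replaces each match with that string, which gives a simple “replace all” along the same lines. A sketch, with a made-up helper name replaceAll:

```lua
local lpeg = require "lpeg"   -- not needed for MUSHclient which has it built in

-- hypothetical helper: replace every occurrence of "what" by "with"
function replaceAll (s, what, with)
  -- match the text and substitute it, or else step over one character
  local pattern = lpeg.Cs ((lpeg.P (what) / with + 1)^0)
  return lpeg.match (pattern, s)
end -- replaceAll

print (replaceAll ("the lazy dog chased another dog", "dog", "cat"))
--> the lazy cat chased another cat
```

Note that a “%” in the replacement string has a special meaning to LPEG, so this sketch assumes a plain replacement.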


“re” substitution using inbuilt string function

We can do a similar thing using the “re” module:

require "re"
pattern = re.compile ("{~ ([a-m] -> upper / .)* ~}", { upper = string.upper } )
print (pattern:match ("the quick brown fox jumped over the lazy dog"))

In this case the “{~ … ~}” sequence indicates a substitution capture (like lpeg.Cs). Inside that we look for the set “a” to “m” and if found send it to the function “upper” (supplied in a table as the second argument to re.compile). Otherwise we skip one character and repeat.


LPEG substitution using custom function

We can supply our own function for transforming a match on the pattern. It takes arguments, one for each capture (in this case there are two captures):

lpeg.locale (lpeg)  -- get digit, alpha, etc.

-- match on alphas followed by digits (eg. abc123) and capture each
pattern = lpeg.C (lpeg.alpha^1) * lpeg.C (lpeg.digit^1)

function f (a, b)
  return b .. a
end -- f

pattern = lpeg.Cs((pattern / f + 1)^0)

print (pattern:match ("I am testing abc123 and def567"))

Output is:

I am testing 123abc and 567def

The function “f” reverses the two captures, so that “abc123” becomes “123abc”.
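For comparison, a swap as simple as this one can also be done with an ordinary Lua regexp and string.gsub, where “%1” and “%2” in the replacement refer to the first and second captures. A sketch:

```lua
-- string.gsub also returns the number of substitutions; we keep only the string
result = string.gsub ("I am testing abc123 and def567", "(%a+)(%d+)", "%2%1")
print (result)  --> I am testing 123abc and 567def
```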


“re” substitution using custom function

We can do a similar thing using the “re” module:

require "re"

pattern = re.compile ("{~ ( ( {%alpha+} {%digit+} ) -> reverse / .)* ~}",
                      { reverse = function (a, b) return b .. a end } )

print (pattern:match ("I am testing abc123 and def567"))

The pattern again contains a substitution capture sequence: “{~ … ~}”. Inside that we look for one or more alpha characters (%alpha) which are the first capture (indicated by the braces) followed by one or more digit characters (%digit) which are the second capture. If found, they are passed to the “reverse” function to have the order reversed. If not, we skip one character and try again.


Information and images on this site are licensed under the Creative Commons Attribution 3.0 Australia License unless stated otherwise.

Written by Nick Gammon