Now that MUSHclient has a regular expression helper for use when editing aliases and triggers, it might be time to revisit what the various things do in regular expressions.
The examples below will illustrate the major things available in regular expressions (regexps), along with an example showing the results. The matching text will be underlined.
Testing site
The site https://regex101.com/ lets you type in a regexp and test it interactively. It also has good documentation about the various constructs you can use inside a regexp.
Ordinary text
Ordinary text just matches itself...
Regexp: patter
Match : You hear the patter of little feet
Any character
A period (".") matches any character except newline (a line break). You can use it to match unknown letters ...
Regexp: p.tter
Match : You hear the patter of little feet
This matched the same as above, however any single character can be substituted where the dot is ...
Regexp: p.tter
Match : You hear the pitter of little feet
To match "." literally, you need to escape it with a backslash ...
Regexp: mailbox\.
Match : You see a mailbox. It is closed.
Note: if you need to include line breaks in a "dot" match (for example, in a multi-line trigger), then you need to set the "dot-all" option "(?s)". See below for more information about that.
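As a rough illustration of the dot examples above, here is a minimal sketch using Python's built-in re module. Python is an assumption made purely for demonstration: MUSHclient itself uses the PCRE library, and the two engines differ in some details, although these basic constructs should behave the same way in both.

```python
import re

line = "You hear the pitter of little feet"

# The dot matches any single character, so "p.tter" finds "pitter".
dot_match = re.search(r"p.tter", line)
print(dot_match.group(0))  # pitter

# An escaped dot only matches a literal full stop.
literal = re.search(r"mailbox\.", "You see a mailbox. It is closed.")
print(literal.group(0))  # mailbox.

# By default the dot does not match a newline ...
two_lines = "first\nsecond"
no_dotall = re.search(r"first.second", two_lines)
# ... unless the dot-all option (?s) is set at the start of the pattern.
with_dotall = re.search(r"(?s)first.second", two_lines)
print(no_dotall is None, with_dotall is not None)  # True True
```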
Character in set or range
You can specify a set of matching characters ...
Regexp: p[aeiou]tter
Match : You hear the pitter of little feet
This example matches a vowel (aeiou).
However a non-vowel does not match ...
Regexp: p[aeiou]tter
Match : You hear the pxtter of little feet
(No match)
By using a hyphen you can specify a range:
Regexp: [A-G]olf
Match : I see a Golf cart
However a letter out of the range A-G does not match:
Regexp: [A-G]olf
Match : I see a Molf cart
(No match)
If you want to use "-" itself in the set, it must be in the first or last position (that is, just after the opening [ or just before the closing ] ) or it is considered a range specifier.
Some useful sets
- [A-Z] - upper-case letters
- [a-z] - lower-case letters
- [A-Za-z] - letters
- [0-9] - digits (same as \d - see below)
- [0-9+-] - digits including a plus or minus sign
- [0-9A-Fa-f] - hexadecimal numbers
- [A-Za-z0-9] - letters or digits
Character not in set or range
If we use a set (square brackets) but start it with a caret (^) then we get the inverse of the set ...
This will match any character followed by "ad", unless that character is "b" or "s".
Regexp: [^bs]ad
Match : He was sad
(No match)
Regexp: [^bs]ad
Match : He was bad
(No match)
However "mad" matches:
Regexp: [^bs]ad
Match : He was mad
If you want to use "^" itself in the set, then simply put it anywhere except in the first position.
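The set, range and negated-set examples above can be tried out with a short sketch. It uses Python's re module as a stand-in (an assumption, since MUSHclient uses PCRE), and the "Temperature" test line is made up for illustration.

```python
import re

# A set matches exactly one character from the list.
vowel = re.search(r"p[aeiou]tter", "You hear the pitter of little feet")
no_vowel = re.search(r"p[aeiou]tter", "You hear the pxtter of little feet")

# A range: any capital letter from A to G.
in_range = re.search(r"[A-G]olf", "I see a Golf cart")
out_of_range = re.search(r"[A-G]olf", "I see a Molf cart")

# A negated set: any character except "b" or "s", followed by "ad".
sad = re.search(r"[^bs]ad", "He was sad")
mad = re.search(r"[^bs]ad", "He was mad")

# A hyphen in the last position is a literal hyphen, not a range.
signed = re.search(r"[0-9+-]+", "Temperature: -12 degrees")

print(vowel.group(0), no_vowel)         # pitter None
print(in_range.group(0), out_of_range)  # Golf None
print(sad, mad.group(0))                # None mad
print(signed.group(0))                  # -12
```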
Special pre-defined character sets
- \d - any decimal digit
- \D - any character that is not a decimal digit
- \s - any whitespace character (eg. space, tab, linefeed, carriage-return)
- \S - any character that is not a whitespace character
- \w - any "word" character
- \W - any "non-word" character
A "word" character is a letter, digit or underscore (in other words, the set: [A-Za-z0-9_] )
Regexp: \d+
Match : You see 20 coins
The "+" symbol is explained just below. It means "one or more of the previous item".
You can use the pre-defined sets inside other sets, like this:
Regexp: [\d,-]+
Match : You get -123,456 experience for doing that
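Here is a small sketch of the pre-defined sets in use, again assuming Python's re module as a convenient test bed rather than MUSHclient's own PCRE engine. The "hit_points" line is an invented sample.

```python
import re

# \d+ matches a run of one or more digits.
digits = re.search(r"\d+", "You see 20 coins")
print(digits.group(0))  # 20

# A pre-defined set can be combined with other characters inside [ ].
number = re.search(r"[\d,-]+", "You get -123,456 experience for doing that")
print(number.group(0))  # -123,456

# \w matches letters, digits and underscore, so the colon and space split the words.
words = re.findall(r"\w+", "hit_points: 42")
print(words)  # ['hit_points', '42']
```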
Quantifiers
The general way of expressing a number of matches is to use curly brackets, like this:
Regexp: a{4}
Match : baaaaaaaa
This matched exactly 4 of "a".
You can specify a range, eg. 1 to 3 matches:
Regexp: ta{1,3}d
Match : tadpole
Regexp: ta{1,3}d
Match : taadpole
Regexp: ta{1,3}d
Match : taaadpole
Regexp: ta{1,3}d
Match : taaaadpole
(No match)
The above matches between 1 and 3 of "a". The last example fails because there are 4 "a" between "t" and "d".
You can also specify zero matches:
Regexp: ta{0,3}d
Match : td
Regexp: ta{0,3}d
Match : tad
The first example matched even with zero "a" between "t" and "d".
A sequence like "{5,}" means "5 or more matches".
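The curly-bracket quantifiers above can be checked with a sketch like the following. Python's re module is used as an assumption; the quantifier syntax shown here is the same as in PCRE.

```python
import re

# Exactly four "a" are matched, even though more are available.
exact = re.search(r"a{4}", "b" + "a" * 8)
print(exact.group(0))  # aaaa

# Between 1 and 3 "a" between "t" and "d".
samples = ["tadpole", "taadpole", "taaadpole", "taaaadpole"]
one_to_three = [re.search(r"ta{1,3}d", s) is not None for s in samples]
print(one_to_three)  # [True, True, True, False]

# A minimum of zero lets "td" match as well.
zero_ok = re.search(r"ta{0,3}d", "td")
print(zero_ok.group(0))  # td

# An open-ended range: 5 or more.
five_plus = re.search(r"a{5,}", "b" + "a" * 8)
print(len(five_plus.group(0)))  # 8
```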
Shortcut quantifiers
There are three special characters that can be used for common quantifier cases:
* is equivalent to {0,}
+ is equivalent to {1,}
? is equivalent to {0,1}
Example of a shortcut:
Regexp: ta?p
Match : tp
Regexp: ta?p
Match : tap
Regexp: ta?p
Match : taap
(No match)
In this case the a? sequence matches zero or one occurrence of "a".
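To see that the shortcuts really are equivalent to their curly-bracket forms, here is a minimal sketch (assuming Python's re module for testing) that runs both spellings over the same inputs.

```python
import re

samples = ["tp", "tap", "taap"]

# "?" and "{0,1}" give the same results on every sample.
short_form = [re.search(r"ta?p", s) is not None for s in samples]
long_form = [re.search(r"ta{0,1}p", s) is not None for s in samples]
print(short_form, long_form)  # [True, True, False] [True, True, False]

# "*" allows zero occurrences, "+" requires at least one.
star = re.search(r"ta*d", "td")
plus = re.search(r"ta+d", "td")
print(star is not None, plus is not None)  # True False
```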
Greedy matches
The default is for sequences to be "greedy" which basically means they will take as many characters as they can. For example:
Regexp: ta{1,8}
Match : taaaaaaaaa da!
However, if you follow the quantifier with a question mark, it matches as few characters as it can:
Regexp: ta{1,8}?
Match : taaaaaaaaa da!
In that example we specified a minimum of 1 "a" and that is what we got (rather than the maximum).
Note that the "non-greedy" character "?" has a different meaning to the "?" which means "0 or 1 quantity".
This rather contrived example shows the different uses in a single regexp:
Regexp: ba??
Match : baaaa
Regexp: ba?
Match : baaaa
The first example above matches "a??" which means "0 or 1 of a, non-greedy". Well, zero occurrences is the least greedy, so it simply returns "b" as the match.
The second example above matches "a?" which means "0 or 1 of a, greedy". The greediest it can get is to match once, so it returns "ba" as the match.
You can also make the entire regular expression (from a particular point) non-greedy by using the (?U) configuration sequence. For example:
Regexp: (?U)ta{1,8}
Match : taaaaaaaaa da!
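The difference between greedy and non-greedy quantifiers can be sketched as below. This assumes Python's re module, which does not support the (?U) option, so only the per-quantifier "?" form is shown.

```python
import re

text = "t" + "a" * 9 + " da!"

# Greedy: takes as many "a" as allowed (8 of the 9 available).
greedy = re.search(r"ta{1,8}", text)
print(greedy.group(0))  # taaaaaaaa

# Non-greedy: takes the minimum (1).
lazy = re.search(r"ta{1,8}?", text)
print(lazy.group(0))  # ta

# "a??" is "0 or 1 of a, non-greedy"; "a?" is "0 or 1 of a, greedy".
lazy_optional = re.search(r"ba??", "baaaa")
greedy_optional = re.search(r"ba?", "baaaa")
print(lazy_optional.group(0), greedy_optional.group(0))  # b ba
```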
Escape next character
If you need to literally match one of the special characters (like ".", "[", "(", "{" and so on) then you need to precede it with a backslash:
Regexp: You go north\.
Match : You go north. You see a bear.
In this case we "escape" the dot with a backslash so we literally match the dot.
The following characters need to be escaped, outside square brackets (sets):
- \ - general escape character with several uses
- ^ - assert start of string (or line, in multiline mode)
- $ - assert end of string (or line, in multiline mode)
- . - match any character except newline (by default)
- [ - start character class definition (set)
- | - start of alternative branch (this OR that)
- ( - start subpattern (wildcard or group)
- ) - end subpattern (wildcard or group)
- ? - extends the meaning of "(", also 0 or 1 quantifier, also quantifier minimizer
- * - 0 or more quantifier
- + - 1 or more quantifier, also "possessive quantifier"
- { - start min/max quantifier
Inside square brackets (sets) the only characters that need to be escaped are:
- \ - general escape character
- ^ - negate the class, but only if the first character
- - - indicates character range, unless first or last character
- [ - POSIX character class (only if followed by POSIX syntax)
- ] - terminates the character class
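The following sketch shows what goes wrong when a special character is not escaped, and one way of escaping a whole string automatically. It assumes Python's re module; re.escape is a Python helper, not something provided by MUSHclient, and the "Cost" string is an invented sample.

```python
import re

# Unescaped, the dot matches any character, so "northX" is accepted.
unescaped = re.search(r"You go north.", "You go northX")
# Escaped, only a literal full stop will do.
escaped = re.search(r"You go north\.", "You go northX")
print(unescaped is not None, escaped is not None)  # True False

# re.escape() adds the backslashes needed to match a string literally.
literal_text = "Cost: $5 (approx.)"
pattern = re.escape(literal_text)
matches_itself = re.fullmatch(pattern, literal_text) is not None
matches_other = re.search(pattern, "Cost: $5 (approxX)") is not None
print(matches_itself, matches_other)  # True False

# Inside a set, characters such as . $ ( lose their special meaning.
in_set = re.findall(r"[.$(]", "a.b$c(d")
print(in_set)  # ['.', '$', '(']
```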
Groups and tagged expressions
You can group things together by putting them into round brackets. The default behaviour is for each group to be returned as a "wildcard" for the trigger or alias...
Regexp: tell (.+) (.+)
Match : tell Nick hi there
Wildcard '1' = 'Nick hi'
Wildcard '2' = 'there'
However you can see a problem here: the first wildcard is "Nick hi" where we really want it to be just the name. So we need to make it less greedy:
Regexp: tell (.+?) (.+)
Match : tell Nick hi there
Wildcard '1' = 'Nick'
Wildcard '2' = 'hi there'
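The two "tell" examples can be reproduced with a short sketch. It assumes Python's re module, where the groups of a match play the role of MUSHclient's numbered wildcards.

```python
import re

line = "tell Nick hi there"

# Greedy first group: it swallows everything up to the last space.
greedy = re.search(r"tell (.+) (.+)", line)
print(greedy.groups())  # ('Nick hi', 'there')

# Non-greedy first group: it stops at the first space.
lazy = re.search(r"tell (.+?) (.+)", line)
print(lazy.groups())  # ('Nick', 'hi there')
```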
Named wildcards
You can name your wildcards to make it easier to use them later (and not worry about what number each one is) by using the ?P<name> syntax, like this:
Regexp: tell (?P<who>.+?) (?P<what>.+)
Match : tell Nick hi there
Wildcard '1' = 'Nick'
Wildcard '2' = 'hi there'
Wildcard 'who' = 'Nick'
Wildcard 'what' = 'hi there'
In this example we still have wildcards numbered 1 and 2, but they are also saved as the names "who" and "what".
From version 4.03 onwards of MUSHclient, there are two alternatives for named wildcards, namely: ?<name> and ?'name'.
Regexp: tell (?<who>.+?) (?'what'.+)
Match : tell Nick hi there
Wildcard '1' = 'Nick'
Wildcard '2' = 'hi there'
Wildcard 'who' = 'Nick'
Wildcard 'what' = 'hi there'
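Named wildcards can be sketched as follows, assuming Python's re module for testing. Python only accepts the ?P<name> spelling, so the ?<name> and ?'name' alternatives are not shown.

```python
import re

match = re.search(r"tell (?P<who>.+?) (?P<what>.+)", "tell Nick hi there")

# The groups are still available by number ...
print(match.group(1), "/", match.group(2))  # Nick / hi there

# ... and also by name.
print(match.group("who"), "/", match.group("what"))  # Nick / hi there

# All named groups at once, as a dictionary.
named = match.groupdict()
print(named)  # {'who': 'Nick', 'what': 'hi there'}
```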
Choices
You can use the "or" symbol ("|") to choose between alternatives, usually inside a group ...
Regexp: You are being chased by a (bat|dog|gorilla) which looks hungry!
Match : You are being chased by a gorilla which looks hungry!
Wildcard '1' = 'gorilla'
Regexp: You are being chased by a (bat|dog|gorilla) which looks hungry!
Match : You are being chased by a bat which looks hungry!
Wildcard '1' = 'bat'
Quantities of groups
Groups can also be followed by a quantifier:
Regexp: You see a (dog|cat){1,3} inside
Match : You see a dog inside
Wildcard '1' = 'dog'
Regexp: You see a (dog|cat){1,3} inside
Match : You see a dogdogcat inside
Wildcard '1' = 'cat'
Interestingly, in the second example the wildcard is still "cat" (not "dogdogcat") because it has remembered the last match for the group. If you want to capture the whole group (ie. all three occurrences) then make a second group around the quantifier, like this:
Regexp: You see a ((dog|cat){1,3}) inside
Match : You see a dogdogcat inside
Wildcard '1' = 'dogdogcat'
Wildcard '2' = 'cat'
Now the first wildcard (the outer one) returns the full matching group, and the second wildcard just returns the last one.
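Here is a minimal sketch of choices and repeated groups, assuming Python's re module. It shows that a repeated group only remembers its last repetition, and that an outer group captures the whole run.

```python
import re

# A choice inside a group: the wildcard is whichever alternative matched.
chase = re.search(
    r"You are being chased by a (bat|dog|gorilla) which looks hungry!",
    "You are being chased by a gorilla which looks hungry!",
)
print(chase.group(1))  # gorilla

line = "You see a dogdogcat inside"

# A repeated group keeps only the last repetition.
inner_only = re.search(r"You see a (dog|cat){1,3} inside", line)
print(inner_only.group(1))  # cat

# Wrapping the quantified group in another group captures all repetitions.
with_outer = re.search(r"You see a ((dog|cat){1,3}) inside", line)
print(with_outer.groups())  # ('dogdogcat', 'cat')
```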
Assertions
Assertions are interesting - they let you test for things without those things actually appearing in the match. Here is an example: we want to match the word "dog" on its own.
Regexp: dog
Match : I see a doggy here
This hasn't worked, because we have matched "dog" inside "doggy".
We can put in the word break ourselves, but then it becomes part of the match, which we might not want:
Regexp: dog[ ]
Match : I see a dog here
Notice how the space after the word dog is also part of the match. This might not seem so bad, but what if the character after "dog" is not a space? For example:
Regexp: dog[ ]
Match : I see a dog.
(No match)
So we fix up the match to match on "dog" followed by anything that is not a word character, like this:
Regexp: dog[^\w]
Match : I see a dog.
Regexp: dog[^\w]
Match : I see a dog
(No match)
Our first test isn't so great - we get the period as part of the match, which we didn't really want. The second test is worse: it doesn't match at all, because [^\w] has to match an actual character and there is nothing after "dog" at the end of the line.
The solution is to use a "word boundary" assertion, like this:
Regexp: dog\b
Match : I see a dog.
Regexp: dog\b
Match : I see a dog
This has worked in both cases. The \b asserts that we are at a word boundary (here, the end of a word), but the boundary is not actually part of the match. It also works at the end of the line.
Strictly speaking, to find the word "dog" on its own you need to assert for word boundary at the start as well. For example:
Regexp: dog\b
Match : I see a hounddog.
Regexp: \bdog\b
Match : I see a dog
Regexp: \bdog\b
Match : I see a hounddog
(No match)
The second and third examples only matched "dog" on its own, because it required both ends to be on a word boundary.
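The progression from "dog" to "\bdog\b" can be checked with a sketch like this one, which assumes Python's re module as the test engine.

```python
import re

# A consuming character class fails at end of line and includes the period.
with_period = re.search(r"dog[^\w]", "I see a dog.")
at_end = re.search(r"dog[^\w]", "I see a dog")
print(with_period.group(0), at_end)  # dog. None

# \b asserts a boundary without consuming anything.
boundary_period = re.search(r"dog\b", "I see a dog.")
boundary_end = re.search(r"dog\b", "I see a dog")
print(boundary_period.group(0), boundary_end.group(0))  # dog dog

# Only a boundary at both ends rejects "hounddog" and "doggy".
hounddog_tail = re.search(r"dog\b", "I see a hounddog")
hounddog_word = re.search(r"\bdog\b", "I see a hounddog")
doggy = re.search(r"\bdog\b", "I see a doggy here")
print(hounddog_tail is not None, hounddog_word, doggy)  # True None None
```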
Start of line, end of line
Other common assertions are:
- ^ - assert start of line
Regexp: ^You see
Match : You see strange things here
Regexp: ^You see
Match : You are in a large room. You see bread here.
(No match)
- $ - assert end of line
Regexp: exits (north|south|east|west)$
Match : Juliet exits east
Wildcard '1' = 'east'
Regexp: exits (north|south|east|west)$
Match : There are exits east of here
(No match)
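The start-of-line and end-of-line assertions can be sketched as follows, assuming Python's re module. The last part uses the multi-line option (?m) on an invented two-line string to show "^" matching after a line break.

```python
import re

# ^ only matches at the start of the text.
at_start = re.search(r"^You see", "You see strange things here")
mid_line = re.search(r"^You see", "You are in a large room. You see bread here.")
print(at_start is not None, mid_line)  # True None

# $ only matches at the end of the text.
exits_re = r"exits (north|south|east|west)$"
at_end = re.search(exits_re, "Juliet exits east")
not_at_end = re.search(exits_re, "There are exits east of here")
print(at_end.group(1), not_at_end)  # east None

# In multi-line mode, ^ also matches just after a line break.
two_lines = "You are in a large room.\nYou see bread here."
single_mode = re.search(r"^You see", two_lines)
multi_mode = re.search(r"(?m)^You see", two_lines)
print(single_mode, multi_mode is not None)  # None True
```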
Lookahead assertions
Extending this idea a bit, we can assert that text is to follow our match, but not be part of the match. For example:
Regexp: foo(?=bar)
Match : foobar
Regexp: foo(?=bar)
Match : foot
(No match)
In this example we are asserting we want "foo" to be followed by "bar", however "bar" is not to be part of the match.
We can turn that around with a negative assertion ...
Regexp: foo(?!bar)
Match : foobar
(No match)
Regexp: foo(?!bar)
Match : foot
In this case we want "foo" provided it is not followed by "bar". (So the example of "foot" matches, however only "foo" is the matching part).
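A short sketch of both lookahead forms, assuming Python's re module. Checking where the match ends confirms that the lookahead text is tested but not consumed.

```python
import re

# Positive lookahead: "foo" must be followed by "bar".
positive_hit = re.search(r"foo(?=bar)", "foobar")
positive_miss = re.search(r"foo(?=bar)", "foot")
# The match is only "foo" and ends at offset 3, before "bar".
print(positive_hit.group(0), positive_hit.end(), positive_miss)  # foo 3 None

# Negative lookahead: "foo" must NOT be followed by "bar".
negative_miss = re.search(r"foo(?!bar)", "foobar")
negative_hit = re.search(r"foo(?!bar)", "foot")
print(negative_miss, negative_hit.group(0))  # None foo
```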
Lookbehind assertions
A lookbehind assertion tests that the characters preceding the current point in the regular expression match some string. For example:
Regexp: (?<=John).+ leaves the room
Match : John Smith leaves the room
In this case we are matching on someone whose name starts with "John" leaving the room, however the "John" part is not part of the match.
Negative lookbehind assertions
A negative lookbehind assertion is useful for excluding things. For example, you might want to match on "<someone> wins a prize" but only if the someone is not John.
Regexp: (?<!John) wins a prize
Match : Nick wins a prize
Regexp: (?<!John) wins a prize
Match : John wins a prize
(No match)
This worked, but the person who won the prize is not part of the match. To fix that we will add another wildcard to the front, and then use the negative lookbehind assertion to test that it wasn't John...
Regexp: (.+)(?<!John) wins a prize
Match : John wins a prize
(No match)
Regexp: (.+)(?<!John) wins a prize
Match : Nick wins a prize
Wildcard '1' = 'Nick'
Note: lookahead and lookbehind assertions can be collectively referred to as "lookaround" assertions. That is, the condition for matching the text is governed by what is around it, in addition to what the text itself is.
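The lookbehind examples can be tried with the sketch below. It assumes Python's re module, which requires lookbehind patterns to have a fixed length; "John" satisfies that.

```python
import re

# Positive lookbehind: the match starts just after "John".
leaves = re.search(r"(?<=John).+ leaves the room", "John Smith leaves the room")
print(repr(leaves.group(0)))  # ' Smith leaves the room'

# Negative lookbehind: reject the line if "John" comes just before " wins".
prize_re = r"(.+)(?<!John) wins a prize"
nick = re.search(prize_re, "Nick wins a prize")
john = re.search(prize_re, "John wins a prize")
print(nick.group(1), john)  # Nick None
```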
Comments
When writing a complex regular expression it might be helpful to put comments in it, as a reminder of what each part is for, like this:
Regexp: (?# Match hitpoints)(\d{1,4}) hp (?# Match mana)(\d{1,4}) m
Match : 13 hp 24 m
Wildcard '1' = '13'
Wildcard '2' = '24'
In this example the sequence (?# ...) is a comment.
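As a quick check that comments do not affect matching, here is a minimal sketch assuming Python's re module, which accepts the same (?# ...) syntax.

```python
import re

prompt_re = r"(?# Match hitpoints)(\d{1,4}) hp (?# Match mana)(\d{1,4}) m"

# The comments are ignored; only the two digit groups are captured.
prompt = re.search(prompt_re, "13 hp 24 m")
print(prompt.groups())  # ('13', '24')
```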