
Gammon Forum


Regular expression tips



Posted by Nick Gammon   Australia  (23,072 posts)  Bio   Forum Administrator
Date Fri 31 Dec 2004 10:38 PM (UTC)

Amended on Wed 10 Aug 2016 08:02 PM (UTC) by Nick Gammon

Message
This page can be quickly reached from the link: http://www.gammon.com.au/regexp


Now that MUSHclient has a regular expression helper for use when editing aliases and triggers, it might be time to revisit what the various things do in regular expressions.

The examples below will illustrate the major things available in regular expressions (regexps), along with an example showing the results. The matching text will be underlined.

Testing site


The site https://regex101.com/ lets you type in a regexp and test it interactively. It also has good documentation about the various constructs you can use inside a regexp.

Ordinary text


Ordinary text just matches itself...


Regexp: patter
Match : You hear the patter of little feet


Any character


A period (".") matches any character except newline (a line break). You can use it to match unknown letters ...


Regexp: p.tter
Match : You hear the patter of little feet


This matches the same text as above, but any single character can be substituted where the dot is ...


Regexp: p.tter
Match : You hear the pitter of little feet


To match "." literally, you need to escape it with a backslash ...


Regexp: mailbox\.
Match : You see a mailbox. It is closed.


Note: if you need to include line breaks in a "dot" match (for example, in a multi-line trigger), then you need to set the "dot-all" option "(?s)". See below for more information about that.
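
The dot examples above can be tried outside MUSHclient. This is a minimal sketch using Python's re module, which is an assumption made for illustration (MUSHclient itself uses PCRE, but the constructs shown here behave the same way in both):

```python
import re

# "." stands for any one character except a newline.
patter = re.search(r"p.tter", "You hear the patter of little feet").group()
pitter = re.search(r"p.tter", "You hear the pitter of little feet").group()

# An escaped dot matches only a literal full stop.
mailbox = re.search(r"mailbox\.", "You see a mailbox. It is closed.").group()

# By default "." stops at a line break; (?s) lets it cross one.
no_dotall = re.search(r"a.b", "a\nb")
dotall = re.search(r"(?s)a.b", "a\nb")

print(patter, pitter, mailbox, no_dotall, dotall is not None)
```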


Character in set or range


You can specify a set of matching characters ...


Regexp: p[aeiou]tter
Match : You hear the pitter of little feet


This example matches a vowel (aeiou).

However a non-vowel does not match ...


Regexp: p[aeiou]tter
Match : You hear the pxtter of little feet
(No match)


By using a hyphen you can specify a range:


Regexp: [A-G]olf
Match : I see a Golf cart


However a letter out of the range A-G does not match:


Regexp: [A-G]olf
Match : I see a Molf cart
(No match)


If you want to use "-" itself in the set, it must be in the first or last position (that is, just after the opening [ or just before the closing ] ) or it is considered a range specifier.
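
As a rough check of the set and range rules above, here is a short Python sketch (Python's re module is assumed here only for illustration; the third sample sentence is made up):

```python
import re

# A set matches exactly one character out of those listed.
vowel = re.search(r"p[aeiou]tter", "You hear the pitter of little feet")
non_vowel = re.search(r"p[aeiou]tter", "You hear the pxtter of little feet")

# A range: any one letter from A to G.
golf = re.search(r"[A-G]olf", "I see a Golf cart")
molf = re.search(r"[A-G]olf", "I see a Molf cart")

# "-" in the last position of the set is a literal hyphen, not a range.
signed = re.search(r"[0-9+-]+", "Temperature: -12 degrees").group()

print(vowel.group(), non_vowel, golf.group(), molf, signed)
```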

Some useful sets



  • [A-Z] - upper-case letters
  • [a-z] - lower-case letters
  • [A-Za-z] - letters
  • [0-9] - digits (same as \d - see below)
  • [0-9+-] - digits including a plus or minus sign
  • [0-9A-Fa-f] - hexadecimal numbers
  • [A-Za-z0-9] - letters or digits


Character not in set or range


If we use a set (square brackets) but start it with a caret (^) then we get the inverse of the set ...

This will match any character followed by "ad", unless that character is "b" or "s".


Regexp: [^bs]ad
Match : He was sad
(No match)

Regexp: [^bs]ad
Match : He was bad
(No match)


However "mad" matches:


Regexp: [^bs]ad
Match : He was mad


If you want to use "^" itself in the set, then simply put it anywhere except in the first position.
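
The negated set can be sketched like this in Python (an illustrative assumption; the "2^8" sample is invented to show a literal caret):

```python
import re

pattern = re.compile(r"[^bs]ad")

# True where some character other than "b" or "s" is followed by "ad".
results = {text: bool(pattern.search(text))
           for text in ("He was sad", "He was bad", "He was mad")}

# A "^" that is not the first character of the set is an ordinary character.
caret = re.search(r"[a^]", "2^8").group()

print(results, caret)
```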

Special pre-defined character sets



  • \d - any decimal digit
  • \D - any character that is not a decimal digit
  • \s - any whitespace character (eg. space, tab, linefeed, carriage-return)
  • \S - any character that is not a whitespace character
  • \w - any "word" character
  • \W - any "non-word" character

    A "word" character is a letter, digit or underscore (in other words, the set: [A-Za-z0-9_] )



Regexp: \d+
Match:  You see 20 coins


The "+" symbol is explained just below. It means "one or more of the previous item".

You can use the pre-defined sets inside other sets, like this:


Regexp: [\d,-]+
Match:  You get -123,456 experience for doing that 
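
Since the underlining may not be visible, the matched parts of the two examples above can be printed with a short Python sketch (Python's re module is assumed for illustration; the "hit_points" sample is made up):

```python
import re

# \d+ picks out the run of digits.
coins = re.search(r"\d+", "You see 20 coins").group()

# A pre-defined set used inside another set: digits, commas and minus signs.
xp = re.search(r"[\d,-]+", "You get -123,456 experience for doing that").group()

# \w covers letters, digits and the underscore.
words = re.findall(r"\w+", "hit_points: 42")

print(coins, xp, words)
```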


Quantifiers


The general way of expressing a number of matches is to use curly brackets, like this:


Regexp: a{4}
Match : baaaaaaaa


This matched exactly 4 of "a".

You can specify a range, eg. 1 to 3 matches:


Regexp: ta{1,3}d
Match : tadpole

Regexp: ta{1,3}d
Match : taadpole

Regexp: ta{1,3}d
Match : taaadpole

Regexp: ta{1,3}d
Match : taaaadpole
(No match)


The above would match up to 3 of "a". The last example fails because there are 4 "a" between "t" and "d".

You can also specify zero matches:


Regexp: ta{0,3}d
Match : td

Regexp: ta{0,3}d
Match : tad


The first example matched even with zero "a" between "t" and "d".

A sequence like "{5,}" means "5 or more matches".
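
The curly-bracket quantifiers above can be exercised with this small sketch (Python's re module, assumed for illustration only):

```python
import re

text = "b" + "a" * 8          # the "baaaaaaaa" sample from above

# {4} takes exactly four, {5,} takes five or more (here, all eight).
exactly_four = re.search(r"a{4}", text).group()
five_plus = re.search(r"a{5,}", text).group()

# {1,3} allows one to three "a" between "t" and "d".
tad = [bool(re.search(r"ta{1,3}d", word))
       for word in ("tadpole", "taadpole", "taaadpole", "taaaadpole")]

# {0,3} also allows none at all.
zero_ok = re.search(r"ta{0,3}d", "td").group()

print(exactly_four, five_plus, tad, zero_ok)
```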

Shortcut quantifiers


There are three special characters that can be used for common quantifier cases:


 *    is equivalent to {0,}
 +    is equivalent to {1,}
 ?    is equivalent to {0,1}


Example of a shortcut:


Regexp: ta?p
Match : tp

Regexp: ta?p
Match : tap

Regexp: ta?p
Match : taap
(No match)


In this case the a? sequence matches zero or one occurrence of "a".
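
A quick way to convince yourself that the shortcuts and the curly-bracket forms are interchangeable is to run both over the same samples. A sketch in Python (assumed for illustration):

```python
import re

samples = ["tp", "tap", "taap"]

# "?" and "{0,1}" should agree on every sample.
short = [bool(re.search(r"ta?p", s)) for s in samples]
braces = [bool(re.search(r"ta{0,1}p", s)) for s in samples]

# "*" accepts zero occurrences, "+" insists on at least one.
star = bool(re.search(r"ta*p", "tp"))
plus = bool(re.search(r"ta+p", "tp"))

print(short, braces, star, plus)
```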

Greedy matches


The default is for sequences to be "greedy", which basically means they will take as many characters as they can. For example:


Regexp: ta{1,8}
Match : taaaaaaaaa da!


However, if you follow the quantifier with a question mark, it matches as little as it can:


Regexp: ta{1,8}?
Match : taaaaaaaaa da!


In that example we specified a minimum of 1 "a" and that is what we got (rather than the maximum).

Note that the "non-greedy" character "?" has a different meaning to the "?" which means "0 or 1 quantity".

This rather contrived example shows the different uses in a single regexp:


Regexp: ba??
Match : baaaa

Regexp: ba?
Match : baaaa


The first example above matches "a??" which means "0 or 1 of a, non-greedy". Well, zero occurrences is the least greedy, so it simply returns "b" as the match.

The second example above matches "a?" which means "0 or 1 of a, greedy". The greediest it can get is to match once, so it returns "ba" as the match.

You can also make the entire regular expression (from a particular point) non-greedy by using the (?U) configuration sequence. For example:


Regexp: (?U)ta{1,8}
Match : taaaaaaaaa da!
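
The difference between greedy and non-greedy is easiest to see by printing what was actually matched. This sketch uses Python's re module as an assumption for illustration; note that Python has no (?U) option, so only the "?" suffix form is shown:

```python
import re

text = "t" + "a" * 9 + " da!"   # the "taaaaaaaaa da!" sample from above

# Greedy: takes the maximum allowed, eight "a".
greedy = re.search(r"ta{1,8}", text).group()

# Non-greedy: takes the minimum allowed, one "a".
lazy = re.search(r"ta{1,8}?", text).group()

# "a??" is "0 or 1 of a, non-greedy"; "a?" is the greedy version.
b_lazy = re.search(r"ba??", "baaaa").group()
b_greedy = re.search(r"ba?", "baaaa").group()

print(greedy, lazy, b_lazy, b_greedy)
```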


Escape next character


If you need to literally match one of the special characters (like ".", "[", "(", "{" and so on) then you need to precede it with a backslash:


Regexp: You go north\.
Match : You go north. You see a bear.


In this case we "escape" the dot with a backslash so we literally match the dot.

The following characters need to be escaped, outside square brackets (sets):


  • \ - general escape character with several uses
  • ^ - assert start of string (or line, in multiline mode)
  • $ - assert end of string (or line, in multiline mode)
  • . - match any character except newline (by default)
  • [ - start character class definition (set)
  • | - start of alternative branch (this OR that)
  • ( - start subpattern (wildcard or group)
  • ) - end subpattern (wildcard or group)
  • ? - extends the meaning of "(", also 0 or 1 quantifier, also quantifier minimizer
  • * - 0 or more quantifier
  • + - 1 or more quantifier, also "possessive quantifier"
  • { - start min/max quantifier


Inside square brackets (sets) the only characters that need to be escaped are:


  • \ - general escape character
  • ^ - negate the class, but only if the first character
  • - - indicates character range, unless first or last character
  • [ - POSIX character class (only if followed by POSIX syntax)
  • ] - terminates the character class


Groups and tagged expressions


You can group things together by putting them into round brackets. The default behaviour is for each group to be returned as a "wildcard" for the trigger or alias...


Regexp: tell (.+) (.+)
Match : tell Nick hi there
Wildcard '1' = 'Nick hi'
Wildcard '2' = 'there'


However you can see a problem here: the first wildcard is "Nick hi" where we really want it to be the name. So we need to make the first group less greedy:


Regexp: tell (.+?) (.+)
Match : tell Nick hi there
Wildcard '1' = 'Nick'
Wildcard '2' = 'hi there'
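
The same two patterns can be tried in Python, where the "wildcards" are called groups (this is an illustrative sketch, not how MUSHclient delivers wildcards to a script):

```python
import re

line = "tell Nick hi there"

# Greedy first group swallows as much as it can.
greedy = re.search(r"tell (.+) (.+)", line).groups()

# Non-greedy first group stops at the first space.
lazy = re.search(r"tell (.+?) (.+)", line).groups()

print(greedy, lazy)
```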


Named wildcards


You can name your wildcards to make it easier to use them later (and not worry about what number each one is) by using the ?P<name> syntax, like this:


Regexp: tell (?P<who>.+?) (?P<what>.+)
Match : tell Nick hi there
Wildcard '1' = 'Nick'
Wildcard '2' = 'hi there'
Wildcard 'who' = 'Nick'
Wildcard 'what' = 'hi there'


In this example we still have wildcards numbered 1 and 2, but they are also saved as the names "who" and "what".

From version 4.03 onwards of MUSHclient, there are two alternatives for named wildcards, namely: ?<name> and ?'name'.


Regexp: tell (?<who>.+?) (?'what'.+)
Match : tell Nick hi there
Wildcard '1' = 'Nick'
Wildcard '2' = 'hi there'
Wildcard 'who' = 'Nick'
Wildcard 'what' = 'hi there'
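
For comparison, here is a sketch of named groups in Python. Python's re module is an assumption for illustration, and it only accepts the ?P<name> spelling, not the ?<name> or ?'name' alternatives:

```python
import re

m = re.search(r"tell (?P<who>.+?) (?P<what>.+)", "tell Nick hi there")

# The groups are still available by number ...
by_number = m.groups()

# ... and also by name.
by_name = m.groupdict()

print(by_number, by_name)
```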


Choices


You can use the "or" symbol ("|") to choose between alternatives, usually inside a group ...


Regexp: You are being chased by a (bat|dog|gorilla) which looks hungry!
Match : You are being chased by a gorilla which looks hungry!
Wildcard '1' = 'gorilla'

Regexp: You are being chased by a (bat|dog|gorilla) which looks hungry!
Match : You are being chased by a bat which looks hungry!
Wildcard '1' = 'bat'


Quantities of groups


Groups can also be followed by a quantifier:


Regexp: You see a (dog|cat){1,3} inside
Match : You see a dog inside
Wildcard '1' = 'dog'

Regexp: You see a (dog|cat){1,3} inside
Match : You see a dogdogcat inside
Wildcard '1' = 'cat'



Interestingly, in the second example the wildcard is still "cat" (not "dogdogcat") because it has remembered the last match for the group. If you want to capture the whole group (ie. all three occurrences) then make a second group around the quantifier, like this:


Regexp: You see a ((dog|cat){1,3}) inside
Match : You see a dogdogcat inside
Wildcard '1' = 'dogdogcat'
Wildcard '2' = 'cat'


Now the first wildcard (the outer one) returns the full matching group, and the second wildcard just returns the last one.
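
The choice and repeated-group behaviour described above can be reproduced with a short Python sketch (assumed for illustration; groups play the role of wildcards):

```python
import re

chase = re.compile(r"You are being chased by a (bat|dog|gorilla) which looks hungry!")
animal = chase.search("You are being chased by a gorilla which looks hungry!").group(1)

line = "You see a dogdogcat inside"

# A repeated group only remembers its last repetition.
inner_only = re.search(r"You see a (dog|cat){1,3} inside", line).groups()

# Wrapping the quantified group in an outer group keeps the whole run too.
nested = re.search(r"You see a ((dog|cat){1,3}) inside", line).groups()

print(animal, inner_only, nested)
```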

Assertions


Assertions are interesting - they let you test for things without those things actually appearing in the match. For example, suppose we want to match the word "dog" on its own:


Regexp: dog
Match : I see a doggy here


This hasn't worked, because we have matched "dog" inside "doggy".

We can put in the word break ourselves, but then it becomes part of the match, which we might not want:


Regexp: dog[ ]
Match : I see a dog here


Notice how the space after the word dog is also part of the match. This might not seem so bad, but what if the character after "dog" is not a space? For example:


Regexp: dog[ ]
Match : I see a dog.
(No match)


So we fix up the match to match on "dog" followed by anything that is not a word character, like this:


Regexp: dog[^\w]
Match : I see a dog.

Regexp: dog[^\w]
Match : I see a dog
(No match)


Our first test isn't so great: we get the period as part of the match, which we didn't really want. The second test is worse: it doesn't match at all, because [^\w] has to match an actual character and there is nothing after "dog" at the end of the line.

The solution is to use a "word boundary" assertion, like this:


Regexp: dog\b
Match : I see a dog.

Regexp: dog\b
Match : I see a dog


This has worked in both cases. The \b asserts that we are at a word boundary (here, the end of a word), but the boundary is not actually part of the match. It works at the end of the line too.

Strictly speaking, to find the word "dog" on its own you need to assert for word boundary at the start as well. For example:


Regexp: dog\b
Match : I see a hounddog.

Regexp: \bdog\b
Match : I see a dog

Regexp: \bdog\b
Match : I see a hounddog
(No match)


The second and third examples only matched "dog" on its own, because it required both ends to be on a word boundary.
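
To see exactly which text each "dog" pattern picks up, the examples above can be run through a small helper. This is a sketch in Python (assumed for illustration); the helper name "matched" is made up:

```python
import re

def matched(pattern, text):
    """Return the matched text, or None if the pattern does not match."""
    m = re.search(pattern, text)
    return m.group() if m else None

results = [
    matched(r"dog[^\w]", "I see a dog."),      # period becomes part of the match
    matched(r"dog[^\w]", "I see a dog"),       # nothing after "dog" to consume
    matched(r"dog\b", "I see a dog."),         # boundary is not part of the match
    matched(r"dog\b", "I see a doggy here"),   # no boundary after "dog"
    matched(r"dog\b", "I see a hounddog."),    # still matches inside "hounddog"
    matched(r"\bdog\b", "I see a hounddog"),   # both ends must be boundaries
]

print(results)
```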

Start of line, end of line


Other common assertions are:


  • ^ - assert start of line

    
    Regexp: ^You see
    Match : You see strange things here
    
    Regexp: ^You see
    Match : You are in a large room. You see bread here.
    (No match)
    


  • $ - assert end of line

    
    Regexp: exits (north|south|east|west)$
    Match : Juliet exits east
    Wildcard '1' = 'east'
    
    Regexp: exits (north|south|east|west)$
    Match : There are exits east of here
    (No match)
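
The start-of-line and end-of-line examples above can be checked with this sketch (Python's re module, assumed for illustration):

```python
import re

# "^" only succeeds at the start of the line.
start_ok = re.search(r"^You see", "You see strange things here")
start_fail = re.search(r"^You see", "You are in a large room. You see bread here.")

# "$" only succeeds at the end of the line.
exits = r"exits (north|south|east|west)$"
end_ok = re.search(exits, "Juliet exits east")
end_fail = re.search(exits, "There are exits east of here")

print(bool(start_ok), start_fail, end_ok.group(1), end_fail)
```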
    



Lookahead assertions


Extending this idea a bit, we can assert that text is to follow our match, but not be part of the match. For example:


Regexp: foo(?=bar)
Match : foobar

Regexp: foo(?=bar)
Match : foot
(No match)


In this example we are asserting we want "foo" to be followed by "bar", however "bar" is not to be part of the match.

We can turn that around with a negative assertion ...


Regexp: foo(?!bar)
Match : foobar
(No match)

Regexp: foo(?!bar)
Match : foot


In this case we want "foo" provided it is not followed by "bar". (So the example of "foot" matches, however only "foo" is the matching part).
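
A short Python sketch (assumed for illustration) makes it visible that the lookahead text is tested but never included in the match:

```python
import re

# Positive lookahead: "bar" must follow, but only "foo" is matched.
positive = re.search(r"foo(?=bar)", "foobar")
positive_miss = re.search(r"foo(?=bar)", "foot")

# Negative lookahead: "bar" must not follow.
negative = re.search(r"foo(?!bar)", "foot")
negative_miss = re.search(r"foo(?!bar)", "foobar")

print(positive.group(), positive.span(), positive_miss,
      negative.group(), negative_miss)
```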

Lookbehind assertions


A lookbehind assertion tests that the characters preceding the current point in the regular expression match some string. For example:


Regexp: (?<=John).+ leaves the room
Match : John Smith leaves the room


In this case we match when someone whose name starts with "John" leaves the room; however, the "John" part is not part of the match.

Negative lookbehind assertions


A negative lookbehind assertion is useful for excluding things. For example, you might want to match on "<someone> wins a prize" but only if the someone is not John.


Regexp: (?<!John) wins a prize
Match : Nick wins a prize

Regexp: (?<!John) wins a prize
Match : John wins a prize
(No match)


This worked, but the person who won the prize is not part of the match. To fix that we will add another wildcard to the front, and then use the negative lookbehind assertion to test that it wasn't John...


Regexp: (.+)(?<!John) wins a prize
Match : John wins a prize
(No match)

Regexp: (.+)(?<!John) wins a prize
Match : Nick wins a prize
Wildcard '1' = 'Nick'


Note: lookahead and lookbehind assertions can be collectively referred to as "lookaround" assertions. That is, the condition for matching the text is governed by what is around it, in addition to what the text itself is.
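
The lookbehind examples can be tried with this sketch. It uses Python's re module as an assumption for illustration; Python requires the lookbehind text to have a fixed length, which "John" does:

```python
import re

# Positive lookbehind: "John" must precede, but is not part of the match.
leaves = re.search(r"(?<=John).+ leaves the room",
                   "John Smith leaves the room").group()

# Negative lookbehind after a group: capture the winner unless it is John.
prize = r"(.+)(?<!John) wins a prize"
nick = re.search(prize, "Nick wins a prize")
john = re.search(prize, "John wins a prize")

print(repr(leaves), nick.group(1), john)
```

Note that the first match starts with the space after "John", because the lookbehind itself consumes nothing.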

Comments


When doing a complex regular expression it might be helpful to put comments in it, as guidance to yourself what each part is for, like this:


Regexp: (?# Match hitpoints)(\d{1,4}) hp (?# Match mana)(\d{1,4}) m
Match : 13 hp 24 m
Wildcard '1' = '13'
Wildcard '2' = '24'


In this example the sequence (?# ...) is a comment.
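
As a quick check that the comments really are ignored by the matcher, the same regexp can be run in Python (an illustrative assumption; Python's re module happens to accept the same comment syntax):

```python
import re

prompt = r"(?# Match hitpoints)(\d{1,4}) hp (?# Match mana)(\d{1,4}) m"

# The comment groups contribute nothing; only the two numbered groups remain.
hp, mana = re.search(prompt, "13 hp 24 m").groups()

print(hp, mana)
```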

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Nick Gammon   Australia  (23,072 posts)  Bio   Forum Administrator
Date Reply #1 on Fri 31 Dec 2004 10:57 PM (UTC)

Amended on Wed 10 Aug 2016 08:03 PM (UTC) by Nick Gammon

Message
Forum code fixer


In the process of writing the above I needed to programmatically "fix up" forum codes (that is, replace [ by \[, ] by \], and \ by \\). This small Lua snippet did the job:


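function fix_forum_codes (s)
  -- Lua pattern: a set holding \, [ and ] (the brackets are escaped with %).
  -- The extra parentheses drop gsub's second result (the replacement count).
  return (string.gsub (s, "([\\%[%]])", "\\%1"))
end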


This fixes [, ] and \ by putting a \ in front of them.

It may look a bit obscure, but it also uses a regexp (although the native Lua kind) which uses % as its escape character. Also the backslashes had to be doubled to get a literal backslash inside the strings.

It is looking for a set of the above three characters (hence the square brackets inside the match string), however what is confusing is that the things we are searching for are themselves square brackets, so they have a % in front of them. Finally the whole expression is grouped with round brackets, so it can be used in the replacement string as %1.
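
For readers more used to PCRE-style syntax, here is a rough Python port of the same idea. It is a sketch only, not the code used for the forum: the set and the group are the same, but backslash is the escape character instead of %.

```python
import re

def fix_forum_codes(s):
    # Put a backslash in front of every "[", "]" or backslash in s.
    # Pattern: one group holding a set of the three characters, each escaped.
    # Replacement: a literal backslash followed by whatever group 1 matched.
    return re.sub(r"([\\\[\]])", r"\\\1", s)

fixed = fix_forum_codes("[b]bold[/b]")
print(fixed)
```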


Posted by Nick Gammon   Australia  (23,072 posts)  Bio   Forum Administrator
Date Reply #2 on Fri 31 Dec 2004 11:48 PM (UTC)

Amended on Wed 10 Aug 2016 08:03 PM (UTC) by Nick Gammon

Message
Posix character classes


You can use various Posix character classes instead of the normal sets. For example, to match on alphanumeric sequences:


Regexp: [[:alnum:]]+
Match : You see food

Regexp: [[:alnum:]]+
Match : 22 fish


Note the double square brackets. The first (outer) ones define a set, and inside the set you are specifying the Posix class.

Next, straight alpha (that is, A-Z or a-z):


Regexp: [[:alpha:]]+
Match : You see food

Regexp: [[:alpha:]]+
Match : 22 fish


This is different from the earlier example: the numbers no longer match.


The "punct" class matches punctuation:


Regexp: [[:punct:]]+
Match : I see --many-- things


You can negate a Posix class by using the caret again, like this:


Regexp: [[:^alpha:]]+
Match : blah^%$#blah


That example matches the non-alphabetic characters following "blah".


Posted by Nick Gammon   Australia  (23,072 posts)  Bio   Forum Administrator
Date Reply #3 on Sat 01 Jan 2005 01:17 AM (UTC)

Amended on Wed 10 Aug 2016 08:03 PM (UTC) by Nick Gammon

Message
Back references


You can also make a regular expression match on part of what it has already matched. For example:


Regexp: (.+) and \1
Match : fish and fish
Wildcard '1' = 'fish'

Regexp: (.+) and \1
Match : fish and chips
(No match)


This example matches "x and x" where "x" is the same word in both cases (eg. "fish and fish").

You can also use named subpatterns for back references:


Regexp: (?P<who>\w+) went East\. (?P=who) looks silly
Match : Peter went East. Peter looks silly
Wildcard '1' = 'Peter'
Wildcard 'who' = 'Peter'

Regexp: (?P<who>\w+) went East\. (?P=who) looks silly
Match : Peter went East. John looks silly
(No match)


In this case the named subpattern (who) is used to test for another instance of it later on.
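
Both kinds of back reference can be tried with this sketch (Python's re module, assumed for illustration; it accepts the same \1 and (?P=name) syntax):

```python
import re

# Numbered back reference: the second word must repeat the first.
same = re.search(r"(.+) and \1", "fish and fish")
different = re.search(r"(.+) and \1", "fish and chips")

# Named back reference: the same name must appear again.
silly = r"(?P<who>\w+) went East\. (?P=who) looks silly"
peter = re.search(silly, "Peter went East. Peter looks silly")
mixed = re.search(silly, "Peter went East. John looks silly")

print(same.group(1), different, peter.group("who"), mixed)
```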


Posted by Nick Gammon   Australia  (23,072 posts)  Bio   Forum Administrator
Date Reply #4 on Sat 01 Jan 2005 02:01 AM (UTC)

Amended on Wed 10 Aug 2016 08:04 PM (UTC) by Nick Gammon

Message
Internal options


You can set options on-the-fly as shown below. If the option is at the "outer" level it affects everything from that point onwards, however inside a group it only affects the group it is in.


Caseless matching



Regexp: (?i)fish
Match : fish and chips

Regexp: (?i)fish
Match : FISH and chips


The (?i) option sets caseless matching.

Extended syntax



Regexp: (?x) a b c
Match : abcdef


The (?x) option sets the extended syntax, which basically lets you use whitespace within reason inside your regexps, to make them more readable.

Ungreedy



Regexp: (?U)ba+
Match : baaad


The (?U) option lets you specify ungreedy matches (see above for a more detailed explanation about ungreedy matches).

Dot-all


The (?s) option makes the "dot" character match everything (including newlines). It is useful inside a multi-line trigger, because if you want to match multiple lines of anything just using ".*" will not work, as that will stop at a newline. However "(?s).*" would match multiple lines.
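
Three of the options above can be tried in Python, which is assumed here purely for illustration. Python has no (?U) option, and it only accepts a bare (?i) at the very start of a pattern, so the group-scoped case is written with the (?i:...) form:

```python
import re

# (?i): caseless matching.
caseless = re.search(r"(?i)fish", "FISH and chips").group()

# (?x): whitespace in the pattern is ignored.
spaced = re.search(r"(?x) a b c", "abcdef").group()

# (?s): "." also matches the newline, so ".*" runs across both lines.
two_lines = "line one\nline two"
one_line = re.search(r".*", two_lines).group()
both_lines = re.search(r"(?s).*", two_lines).group()

# Scoped to a group: only "fish" is caseless, the rest is not.
scoped = re.search(r"(?i:fish) and chips", "FISH and chips")
scoped_miss = re.search(r"(?i:fish) and chips", "FISH AND CHIPS")

print(caseless, spaced, repr(one_line), repr(both_lines))
```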

Duplicate names


If you are using named patterns you can now have the same name multiple times, provided you put the "(?J)" option in your pattern. For example:


Regexp: (?J)(?P<name>Ni.+)|(?P<name>Fr.+)


This would match "Nick" or "Fred", and whichever one matched, the named pattern "name" would be set to it.

Unsetting options


An option can be unset by putting a "-" in front of it. For example:

(?-i) --> turn off caseless matching


Recursive patterns


If you want to get really fancy you can experiment with recursive and conditional patterns. Read the PCRE documentation for full details, but this will give you the flavour of it.

Say you want to match some nested pattern, like this:


foo (x (y (z) ) ) a (b)
    ^^^^^^^^^^^^^


The objective here is to match the opening bracket, and then find the appropriate closing bracket.

Our first attempt is not a big success:


Regexp: (?x) \( .+ \)
Match : foo (x (y (z) ) ) a (b)


This matches a bracket, with stuff in between, and a closing bracket, but it goes too far. This is because it is using a greedy match. We'll try again with a non-greedy one:


Regexp: (?x) \( .+? \)
Match : foo (x (y (z) ) ) a (b)


Unfortunately, this matches too little. It stops at the first closing bracket.

To make this work we need to use the nested (recursive) match:


Regexp: (?x) \( (?: (?>[^()]+) | (?R) )* \)
Match : foo (x (y (z) ) ) a (b)


First it matches an opening parenthesis. Then it matches any number of substrings which can either be a sequence of non-parentheses, or a recursive match of the pattern itself (that is a correctly parenthesized substring). Finally there is a closing parenthesis. (This paragraph was taken from the PCRE documentation.)
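
Python's re module does not support (?R), so the recursive regexp itself cannot be shown in Python. The sketch below prints what the two failed attempts match, and then gets the intended result with a plain depth counter instead of a regexp. The function name is made up, and unlike the regexp it gives up if the first opening bracket is never closed.

```python
import re

text = "foo (x (y (z) ) ) a (b)"

# Greedy goes too far; non-greedy stops at the first closing bracket.
too_much = re.search(r"\(.+\)", text).group()
too_little = re.search(r"\(.+?\)", text).group()

def first_balanced(s):
    """Return the first balanced (...) run in s, or None."""
    start = s.find("(")
    if start == -1:
        return None
    depth = 0
    for i in range(start, len(s)):
        if s[i] == "(":
            depth += 1
        elif s[i] == ")":
            depth -= 1
            if depth == 0:
                return s[start:i + 1]
    return None  # ran out of text before the brackets balanced

balanced = first_balanced(text)
print(too_much, "|", too_little, "|", balanced)
```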


Subroutines


Say you want to match a complex pattern a few times in a row. We earlier saw how you can use a backreference to match the same thing more than once, but this time we want to match the same pattern, not the same literal string.

Here is an example that might match a prompt line:


Regexp: ([0-9,]{1,4})hp (?1)m (?1)mv
Match : 24hp 14m 23mv
Wildcard '1' = '24'


The first pattern is the complicated one: ([0-9,]{1,4})

This matches between one and four characters, each of which is a digit or a comma.

However next time (mana) we simply want to apply the same match again, so we just write (?1) to "call" the same pattern again, and yet again for the movement points.

If we want to get all three values into named wildcards, we can still use the subroutine call, but add around it the named wildcards:


Regexp: (?P<hp>[0-9,]{1,4})hp (?P<mana>(?1))m (?P<move>(?1))mv
Match : 24hp 14m 23mv
Wildcard '1' = '24'
Wildcard '2' = '14'
Wildcard '3' = '23'
Wildcard 'mana' = '14'
Wildcard 'hp' = '24'
Wildcard 'move' = '23'
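
Python's re module has no (?1) subroutine call, but a similar effect can be had by keeping the complicated sub-pattern in a string and pasting it in wherever it is needed. This is a different technique (plain string reuse, done before the regexp is compiled), sketched here with made-up names NUM and PROMPT:

```python
import re

# The "complicated" part, written once.
NUM = r"[0-9,]{1,4}"

# Reused three times, each time wrapped in its own named group.
PROMPT = re.compile(rf"(?P<hp>{NUM})hp (?P<mana>{NUM})m (?P<move>{NUM})mv")

stats = PROMPT.search("24hp 14m 23mv").groupdict()
print(stats)
```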


However if you don't care what the numbers are, a simpler method is to simply match on a repeated group. Here is an example that will match on typical prompts:


Regexp: ^<(\d+/\d+[a-zA-Z]+\s*)+>
Match : <1000/1000hp 100/100m 110/110mv 2000/31581xp> go north
Wildcard '1' = '2000/31581xp'


This is looking for "<" followed by one or more of:

(numbers)/(numbers)(letters)(spaces)

followed by ">". Because the regexp is not anchored at the end, anything may follow on the line.

This one is a good choice for using the extended syntax on, so we can see the regexp a bit better:


Regexp: (?x) ^ < ( \d+ / \d+ [a-zA-Z]+ \s* )+ >
Match : <1000/1000hp 100/100m 110/110mv 2000/31581xp> go north
Wildcard '1' = '2000/31581xp'


By putting (?x) at the start we can use spaces inside the regexp to make it clearer what each part is.
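
The extended-syntax version can also be spread over several lines with a comment. The sketch below does that in Python (assumed for illustration), and then uses a second, separate regexp to pull out every stat, since the repeated group only keeps the last one:

```python
import re

PROMPT = re.compile(r"""(?x)
    ^ <
    ( \d+ / \d+ [a-zA-Z]+ \s* )+   # one or more "current/max" stats
    >
""")

line = "<1000/1000hp 100/100m 110/110mv 2000/31581xp> go north"
m = PROMPT.search(line)
whole = m.group(0)   # the prompt, without " go north"
last = m.group(1)    # only the last repetition of the group

# A second pass over the prompt collects all of the stats.
all_stats = re.findall(r"(\d+)/(\d+)([a-zA-Z]+)", whole)
print(last, all_stats)
```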


Posted by Nick Gammon   Australia  (23,072 posts)  Bio   Forum Administrator
Date Reply #5 on Mon 16 Apr 2007 11:25 PM (UTC)

Amended on Wed 15 Jul 2009 06:48 AM (UTC) by Nick Gammon

Message
The full documentation for the PCRE regular expressions is here:

http://mushclient.com/pcre/pcrepattern.html

This is an HTML version of the file RegularExpressions.txt that ships with MUSHclient in the docs subdirectory.

Another page describes common patterns in a tabular form:

http://www.gammon.com.au/mushclient/regexp.htm




If you are using the Lua string functions (eg. string.match) you need to use the Lua form of regular expressions, described here:

http://www.gammon.com.au/scripts/doc.php?general=lua_string

In particular, the Lua regular expression syntax is listed under string.find:

http://www.gammon.com.au/scripts/doc.php?lua=string.find


Posted by Nick Gammon   Australia  (23,072 posts)  Bio   Forum Administrator
Date Reply #6 on Sun 22 Apr 2007 01:29 AM (UTC)

Amended on Sun 22 Apr 2007 01:30 AM (UTC) by Nick Gammon

Message
The regular expression matcher available in MUSHclient 4.05 onwards has more sophisticated support for matching Unicode sequences.

To quote from the PCRE page at http://mushclient.com/pcre/pcrepattern.html:




Three additional escape sequences to match character properties are available when UTF-8 mode is selected. They are:


  \p{xx}   a character with the xx property
  \P{xx}   a character without the xx property
  \X       an extended Unicode sequence



The property names represented by xx above are limited to the Unicode script names, the general category properties, and "Any", which matches any character (including newline). Other properties such as "InMusicalSymbols" are not currently supported by PCRE. Note that \P{Any} does not match any characters, so always causes a match failure.

Sets of Unicode characters are defined as belonging to certain scripts. A character from one of these sets can be matched using a script name. For example:


  \p{Greek}
  \P{Han}


Those that are not part of an identified script are lumped together as "Common". The current list of scripts is:


Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin, Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.


Each character has exactly one general category property, specified by a two-letter abbreviation. For compatibility with Perl, negation can be specified by including a circumflex between the opening brace and the property name. For example, \p{^Lu} is the same as \P{Lu}.

If only one letter is specified with \p or \P, it includes all the general category properties that start with that letter. In this case, in the absence of negation, the curly brackets in the escape sequence are optional; these two examples have the same effect:

\p{L}
\pL

The following general category property codes are supported:


  C     Other
  Cc    Control
  Cf    Format
  Cn    Unassigned
  Co    Private use
  Cs    Surrogate

  L     Letter
  Ll    Lower case letter
  Lm    Modifier letter
  Lo    Other letter
  Lt    Title case letter
  Lu    Upper case letter

  M     Mark
  Mc    Spacing mark
  Me    Enclosing mark
  Mn    Non-spacing mark

  N     Number
  Nd    Decimal number
  Nl    Letter number
  No    Other number

  P     Punctuation
  Pc    Connector punctuation
  Pd    Dash punctuation
  Pe    Close punctuation
  Pf    Final punctuation
  Pi    Initial punctuation
  Po    Other punctuation
  Ps    Open punctuation

  S     Symbol
  Sc    Currency symbol
  Sk    Modifier symbol
  Sm    Mathematical symbol
  So    Other symbol

  Z     Separator
  Zl    Line separator
  Zp    Paragraph separator
  Zs    Space separator


The special property L& is also supported: it matches a character that has the Lu, Ll, or Lt property, in other words, a letter that is not classified as a modifier or "other".

The long synonyms for these properties that Perl supports (such as \p{Letter}) are not supported by PCRE, nor is it permitted to prefix any of these properties with "Is".

No character that is in the Unicode table has the Cn (unassigned) property. Instead, this property is assumed for any code point that is not in the Unicode table.

Specifying caseless matching does not affect these escape sequences. For example, \p{Lu} always matches only upper case letters.




For example, you could match one or more currency symbols:


Regexp: \p{Sc}+


Or you could match on Latin text:


Regexp: \p{Latin}+


Note that "dog" is considered Latin text but "1234" is not.
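
Python's re module does not support \p{..}, so the regexps above cannot be run in it directly. The general category codes themselves can still be looked up with Python's standard unicodedata module, which gives a rough feel for what \p{Sc} would select. This is a sketch with a made-up function name; unicodedata has no lookup for script names such as Latin.

```python
import unicodedata

def currency_symbols(text):
    """Return the characters in text whose general category is Sc."""
    return [ch for ch in text if unicodedata.category(ch) == "Sc"]

found = currency_symbols("$5, €7 or £3")

# Letters are Ll/Lu, decimal digits are Nd.
categories = [unicodedata.category(ch) for ch in "dG4"]
print(found, categories)
```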


The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).


Information and images on this site are licensed under the Creative Commons Attribution 3.0 Australia License unless stated otherwise.