Gammon Forum : MUSHclient : Lua : Using the string library: find, gfind and gsub


Posted by Nick Gammon Australia (22,982 posts) bio Forum Administrator

Date

Sat 29 Oct 2005 04:07 AM (UTC)

Amended on Thu 08 Mar 2007 02:55 AM (UTC) by Nick Gammon

Message

I found the Lua manual a little confusing when it described the find and find-and-replace functions in the Lua string library, so I am writing my own documentation. :)

string.find

This is your basic "string within a string" finder. You pass it a source string, a pattern to look for, and an optional starting point (defaulting to 1).

eg.



print (string.find ("mend both legs at once", "legs")) --> 11 14

This example returns the start and end points of the word "legs" (columns 11 to 14).



print (string.find ("mend both legs at once", "goats")) --> nil

A return value of nil means the pattern was not found. Since nil is considered false in an if test you can simply write something like this:


test = "mend both legs at once"

if string.find (test, "legs") then
  print "found!"
else
  print "not found"
end -- if

You can also specify a starting column, if you want to skip part of the initial string:



print (string.find ("mend both legs at once", "legs", 5)) --> 11 14

print (string.find ("mend both legs at once", "legs", 15)) --> nil

The first example still returned exactly the same numbers, since "legs" was found past column 5. However supplying column 15 meant it wasn't found.

Plain matches

You can also specify a fourth argument, the "plain" argument (true or false). If true, then the search pattern is considered a plain pattern, not a regular expression. To specify this you must also give the third argument (start position).



print (string.find ("I see %a here", "%a"))  --> 1 1

print (string.find ("I see %a here", "%a", 1, true)) --> 7 8

In the first case we are searching for %a, but %a is a special pattern (see below) meaning "all letters". Hence it matched on column 1, since that had a letter in it.

In the second case we have turned "plain match" on (and also specified column 1 as the starting position). Now it matches literally %a at column 7.

The "plain" argument would be very handy for situations where you let the user specify a search pattern, where the thing they are searching for is quite likely to contain periods, brackets, question marks, and so on.
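A minimal sketch of that situation: the helper below (the name "contains" is made up for this example, it is not part of the string library) wraps string.find with the plain flag, so user input containing magic characters is matched literally.

```lua
-- Hypothetical helper: true if `needle` occurs literally anywhere in `haystack`.
local function contains (haystack, needle)
  -- the start position (1) must be supplied in order to pass the "plain" flag
  return string.find (haystack, needle, 1, true) ~= nil
end

print (contains ("Cost: $5.00 (approx.)", "$5.00"))     --> true
print (contains ("Cost: $5.00 (approx.)", "(approx.)")) --> true
print (contains ("Cost: $5.00 (approx.)", "5x00"))      --> false
```

Without the plain flag, the "." in the last search would match any character, and the "$" and "(" in the first two would be treated as magic characters.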

Returning captured strings

If you set up "captures" (see below under "Patterns") the captured string(s) are also returned:



print (string.find ("mend both legs at once", "(l..s)")) --> 11 14 legs

print (string.find ("sword hits Nick", "(%a+) hits (%a+)")) --> 1 15 sword Nick

If you are mainly interested in what was captured (if anything) rather than where it is, you can use a dummy variable (like _ ) to discard the columns and simply retrieve the captured data:


_, _, what = string.find ("You are struck (glancing)", "(%b())")
print (what) --> (glancing)

Note for Lua 5.1

Under Lua 5.1, you can use string.match which only returns the matching text (and not the columns), so this example could be written as:


what = string.match ("You are struck (glancing)", "(%b())")
print (what) --> (glancing)
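As a further sketch (assuming Lua 5.1 or later, where string.match exists), several captures come back as separate results, and a failed match returns nil:

```lua
-- two captures are returned as two results
local attacker, victim = string.match ("sword hits Nick", "(%a+) hits (%a+)")
print (attacker, victim) --> sword Nick

-- no match: string.match returns nil
print (string.match ("sword misses Nick", "(%a+) hits (%a+)")) --> nil
```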

Patterns

Before I move on, let's look at searching for other patterns. We can use regular expressions inside find and replace calls; however, these are Lua patterns, not the ones MUSHclient users are accustomed to.



print (string.find ("mend both legs at once", "l..s")) --> 11 14

In this example the "." character matches any single character.

Now let's try a repeated sequence:



print (string.find ("balls bells bills", "b.+s")) --> 1 17

A problem here is that we don't necessarily want the match to span the entire line. This is called a "greedy" match, as it matched as much as it could. By using "-" instead of "+" we have a non-greedy match.



print (string.find ("balls bells bills", "b.-s")) --> 1 5

The standard patterns you can search for are:


 . --- (a dot) represents all characters. 
%a --- represents all letters. 
%c --- represents all control characters. 
%d --- represents all digits. 
%l --- represents all lowercase letters. 
%p --- represents all punctuation characters. 
%s --- represents all space characters. 
%u --- represents all uppercase letters. 
%w --- represents all alphanumeric characters. 
%x --- represents all hexadecimal digits. 
%z --- represents the character with representation 0.

Important - the uppercase versions of the above represent the complement of the class. eg. %U represents everything except uppercase letters, %D represents everything except digits.
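A few illustrative calls (the sample strings are made up) showing a class and its complement in use:

```lua
-- %d+ finds the first run of digits
print (string.find ("room 42", "%d+")) --> 6 7

-- %D is the complement of %d: delete everything that is not a digit
print (string.gsub ("tel: 555-1234", "%D", "")) --> 5551234 6

-- %S is the complement of %s: find the first non-space character
print (string.find ("  indented", "%S")) --> 3 3
```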

There are some "magic characters" (such as %) that have special meanings. These are:



^ $ ( ) % . [ ] * + - ?

If you want to use those in a pattern (as themselves) you must precede them by a % symbol.

eg. %% would match a single %

As with normal MUSHclient regular expressions you can build your own pattern classes by using square brackets, eg.



[abc] ---> matches a, b or c

[a-z] ---> matches lowercase letters (same as %l)

[^abc] ---> matches anything except a, b or c

[%a%d] ---> matches all letters and digits

[%a%d_] ---> matches all letters, digits and underscore

[%[%]] ---> matches square brackets (had to escape them with %)

The repetition characters are:


+  ---> 1 or more repetitions (greedy)
*  ---> 0 or more repetitions (greedy)
-  ---> 0 or more repetitions (non greedy)
?  ---> 0 or 1 repetition only

The standard "anchor" characters apply:


^  ---> anchor to start of subject string
$  ---> anchor to end of subject string
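To make the anchors and the "?" repetition concrete, here is a small sketch using made-up strings:

```lua
print (string.find ("hello world", "^hello")) --> 1 5
print (string.find ("hello world", "^world")) --> nil
print (string.find ("hello world", "world$")) --> 7 11

-- with both anchors the whole string must match; "u?" makes the "u" optional
print (string.find ("colour", "^colou?r$"))  --> 1 6
print (string.find ("color", "^colou?r$"))   --> 1 5
print (string.find ("colouur", "^colou?r$")) --> nil
```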

You can also use round brackets to specify "captures", similar to normal MUSHclient regular expressions:



You see (.*) here

Here, whatever matches (.*) becomes the first capture.

You can also refer to matched substrings (captures) later on in an expression:



print (string.find ("You see dogs and dogs", "You see (.*) and %1")) --> 1 21 dogs

print (string.find ("You see dogs and cats", "You see (.*) and %1")) --> nil

This example shows how you can look for a repetition of a word matched earlier, whatever that word was ("dogs" in this case).
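Another common use of a back-reference, sketched here with a made-up string, is finding quoted text where the closing quote must be the same character as the opening one:

```lua
local s = [[say "it's ok" now]]

-- capture 1 is the opening quote (single or double), %1 demands the same one to close
local first, last, quote, text = string.find (s, "([\"'])(.-)%1")

print (first, last, quote, text) --> 5 13 " it's ok
```

The apostrophe inside the text does not end the match, because the first capture was a double quote.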

As a special case, an empty capture, written (), captures the current position in the string (a number) rather than a substring. eg.



print (string.find ("You see dogs and cats", "You .* ()dogs .*")) --> 1 21 9

What this is saying is that the word "dogs" starts at column 9.

Finally you can look for nested "balanced" things (such as parentheses) by using %b, like this:


print (string.find ("I see a (big fish (swimming) in the pond) here",
       "%b()"))  --> 9 41

After %b you put 2 characters, which indicate the start and end of the balanced pair. If it finds a nested version it keeps processing until we are back at the top level. In this case the matching string was "(big fish (swimming) in the pond)".
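The same idea works with other delimiter pairs, and inside gsub. A small sketch with made-up strings:

```lua
-- balanced curly braces instead of parentheses
print (string.find ("call f{a, {b, c}} now", "%b{}")) --> 7 17

-- remove every parenthesised remark, together with the spaces before it
print (string.gsub ("You are struck (glancing) by the orc (again)", "%s*%b()", ""))
  --> You are struck by the orc 2
```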

string.gsub

The simple use of gsub is to replace one thing by another, eg.



print (string.gsub ("nick eats fish", "fish", "chips")) --> nick eats chips 1

The "1" at the end is the 2nd result returned from gsub, which tells us how many substitutions it did. eg.



print (string.gsub ("fish eats fish", "fish", "chips")) --> chips eats chips 2

Of course, since the matching string can be a pattern we can do something like replace all vowels with a dot:



print (string.gsub ("nick eats fish", "[AEIOUaeiou]", ".")) --> n.ck ..ts f.sh 4

Here we see that 4 vowels have been replaced.

We can also set up a capture (with round brackets) and refer to the captured data in the replacement string:



print (string.gsub ("nick eats fish", "([AEIOUaeiou])", "(%1)")) --> n(i)ck (e)(a)ts f(i)sh 4

In this case we are putting all vowels into brackets.
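With more than one capture, %1, %2 and so on can be used in any order in the replacement string. A short sketch with made-up strings:

```lua
-- swap two words
print (string.gsub ("hello world", "(%a+) (%a+)", "%2 %1")) --> world hello 1

-- reuse the captures inside some new text
print (string.gsub ("name=nick", "(%a+)=(%a+)", "%2 is the %1")) --> nick is the name 1
```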

We can discard the replacement string, and simply use string.gsub to count things for us:



_, n = string.gsub ("nick eats fish", "[AEIOUaeiou]", "")

print (n) --> 4

In this case we use the short variable name "_" as a dummy variable, and concentrate on the 2nd returned result "n", which is the count of substitutions.

Replacement function

Next we can pass a function to gsub rather than a simple string. In this case the function is called for each matched instance in the source string. Starting simply:


function f (s)
  print ("found " .. s)
end -- f

string.gsub ("Nick is taking a walk today", "%a+", f)

Output

found Nick
found is
found taking
found a
found walk
found today

In the above example I am searching for one or more alphabetic characters (words in other words), and for each one found the function "f" is called, which prints the found word.

Since Lua supports anonymous inline functions the above example can be written more concisely:


string.gsub ("Nick is taking a walk today", "%a+", 
  function (s)
    print ("found " .. s)
  end
  )

This has the same output, but saves having to define a function "f" in advance.

Given this capability, we can start getting fancy, by doing a lookup inside the supplied function. For each call of the function, you are expected to return a replacement value for the matching string. So I will make an example that does a table lookup, and replaces "nice" by "windy", and "walk" by "stroll".


replacements = { 
   ["nice"] = "windy",
   ["walk"] = "stroll",
   }
   
s = "a nice long walk"

result = string.gsub (s, "%a+", 
  function (str)
  return replacements [str] or str
  end
  )

print (result) --> a windy long stroll

This example looks up in the replacements table, and if a match is found returns that, otherwise (by using the short-circuit boolean evaluation) returns the original string instead.

Note for Lua 5.1

Under Lua 5.1 you can simply provide a table of target/replacement strings, so the gsub could be written more neatly like this:


result = string.gsub (s, "%a+", replacements)
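A sketch of the table form used for simple template expansion (assuming Lua 5.1 or later; the table contents and the "$name" convention are made up for this example). When the pattern has a capture, the first capture is used as the table key, and a key that is missing from the table leaves the matched text unchanged:

```lua
local values = { name = "Nick", count = 3 }
local template = "Hello $name, you have $count new messages from $sender"

-- "%$" is a literal dollar sign; (%a+) captures the name used as the table key
local result = string.gsub (template, "%$(%a+)", values)

print (result) --> Hello Nick, you have 3 new messages from $sender
```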

Replacement count

Finally we can supply a fourth argument, the maximum number of replacements we want done (which might be one, of course). For example:



print (string.gsub ("I see a see saw", "see", "view")) --> I view a view saw 2

In this case all instances of "see" have become "view". Now let's limit it to the first one:



print (string.gsub ("I see a see saw", "see", "view", 1)) --> I view a see saw 1

string.gfind

Finally, gfind offers a way of looping over a source string and doing something with each matching instance, assuming we don't want to actually modify the string.

For example, to take a string and build each word in it into a table:


words = {}
for w in string.gfind ("nick takes a stroll", "%a+") do
  table.insert (words, w)
end -- for

tprint (words)

Output

1=nick
2=takes
3=a
4=stroll

Effectively gfind is an iterator that can be used in a for loop.

By using captures we can return more than one thing from gfind, so we can write something like this which might be used to decode configuration parameters:


config = {}
for key, value in string.gfind ("name=nick, height=100", 
                                "(%a+)=([%a%d]+)") do
  config [key] = value
end -- for

tprint (config)

Output

height=100
name=nick

Note for Lua 5.1

The function string.gfind has been renamed string.gmatch under Lua 5.1.
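If a script has to run under both versions, one possible approach (a sketch, not something the forum post prescribes) is to pick whichever function exists and use it through a local name:

```lua
-- use string.gmatch where it exists (Lua 5.1 onwards), otherwise fall back to gfind
local gmatch = string.gmatch or string.gfind

local config = {}
for key, value in gmatch ("name=nick, height=100", "(%a+)=([%a%d]+)") do
  config [key] = value
end -- for

print (config.name, config.height) --> nick 100
```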

- Nick Gammon

www.gammon.com.au, www.mushclient.com


Posted by Nick Gammon Australia (22,982 posts) bio Forum Administrator

Date

Reply #1 on Sat 29 Oct 2005 04:20 AM (UTC)

Amended on Mon 14 Nov 2005 03:01 AM (UTC) by Nick Gammon

Message

Breaking a string into lines

Last time I looked at gsub I was trying to find a quick way of breaking a string into table entries at newlines (for speedwalks I think), and resorted to using the rex library.

However now I think it can easily also be done with gfind or gsub. Here is a snippet that illustrates both ways:


s = [[This is my test string
with multiple lines
in it]]


-- method A - gsub

t1 = {}
string.gsub (s, "[^\n]+", 
  function (str)
    table.insert (t1, str)
  end
  )

tprint (t1)

-- method B - gfind

t2 = {}
for w in string.gfind (s, "[^\n]+") do
  table.insert (t2, w)
end -- for

tprint (t2)

Output

1=This is my test string
2=with multiple lines
3=in it
1=This is my test string
2=with multiple lines
3=in it

Both methods work, although the gfind one is arguably faster as it doesn't try to do replacements. They both look for a sequence of one or more characters excluding a newline. This means the matched text will be whatever precedes the newline. This is passed to the function (or the for loop in the 2nd case) and the resulting line is inserted into a table.
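One thing to be aware of is that "[^\n]+" needs at least one character per match, so empty lines are silently skipped. If empty lines matter, a possible variation (a sketch assuming Lua 5.1 or later for string.gmatch; the name split_lines is made up) is to append a newline and capture everything up to each newline:

```lua
-- Hypothetical helper: split on newlines, keeping empty lines as empty strings.
local function split_lines (s)
  local lines = {}
  -- appending "\n" guarantees the last line is terminated too
  for line in string.gmatch (s .. "\n", "(.-)\n") do
    table.insert (lines, line)
  end -- for
  return lines
end

local t = split_lines ("first\n\nthird")
print (#t)     --> 3
print (t [2] == "") --> true
```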

[EDIT]

Also see:

http://www.gammon.com.au/forum/bbshowpost.php?bbsubject_id=6079

This describes a new utils.split function that is designed to do exactly that.


Posted by Nick Gammon Australia (22,982 posts) bio Forum Administrator

Date

Reply #2 on Sat 29 Oct 2005 06:37 AM (UTC)

Amended on Sat 29 Oct 2005 08:03 AM (UTC) by Nick Gammon

Message

Using gsub to fix up patterns

As mentioned earlier you can use the "plain" argument to string.find to do an exact match, but what about if you want to use gsub to find and replace an arbitrary text without worrying about special characters? Since gsub and gfind don't have a "plain" parameter we need to be able to fix up their patterns in advance.

We can use gsub itself to do this. A simple replacement will fix up non alpha-numeric sequences by putting % in front of them, like this:


match = "%a"
replace_with = "%z%1"

-- fix find and replace patterns

pattern = string.gsub (match, "(%W)", "%%%1")
repl = string.gsub (replace_with, "(%W)", "%%%1")

-- do the gsub

print (string.gsub ("%a %b %c", pattern, repl)) --> %z%1 %b %c 1

Admittedly this is a bit of a contrived example, but it shows that we can search for %a and replace it with %z%1 without worrying about the %a or %1 being treated as special cases.

If we print pattern and repl we can see what became of applying gsub to their original arguments:


print (pattern) --> %%a
print (repl) --> %%z%%1
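The fix-up can be wrapped into reusable helpers. The sketch below uses made-up function names. It also escapes only "%" in the replacement string, rather than every non-alphanumeric character: as far as I know, Lua 5.2 and later raise an error for a "%" followed by anything other than a digit or another "%" in a replacement string, so "%." would fail there, whereas "%%" works in every version.

```lua
-- Hypothetical helpers for a "plain" gsub.
local function escape_pattern (s)
  -- put % in front of every non-alphanumeric character
  return (string.gsub (s, "(%W)", "%%%1"))
end

local function escape_replacement (s)
  -- only % is special in a replacement string
  return (string.gsub (s, "%%", "%%%%"))
end

local function plain_gsub (s, find, repl, n)
  return string.gsub (s, escape_pattern (find), escape_replacement (repl), n)
end

print (plain_gsub ("Price: $5.00 (was $5.00)", "$5.00", "100%"))
  --> Price: 100% (was 100%) 2
```

The extra parentheses around the inner string.gsub calls discard the second result (the count), so each helper returns just the fixed-up string.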


Posted by Nick Gammon Australia (22,982 posts) bio Forum Administrator

Date

Reply #3 on Sat 29 Oct 2005 08:39 AM (UTC)

Amended on Wed 07 Dec 2005 04:28 AM (UTC) by Nick Gammon

Message

Frontier pattern

Whilst browsing through the Lua source I noticed there was another type of pattern match you can do: %f[...]

This is annotated in the source as a "frontier" match. This is not documented, and it was hard to find what it did. But after perusing the Lua mailing list I got an explanation that the frontier match finds a transition from "not in set" to "in set".

This is probably not fantastically useful, but for the sake of completeness, and in case I ever wonder about it again, I'll do an example. :)


string.gsub ("goats blood", "()%f[%w]", 
  function (s)
    print ("found " .. s)
  end
  )

Output

found 1
found 7

Let's analyse what that means. I put in a () match to find where the match was occurring. Looking at the source string:


goats blood
12345678901

It seems it is matching on the boundaries between "not alphanumeric" and "alphanumeric" characters (that is what %w means), which is what you would expect from the description. Now, taking the inverse:


string.gsub ("goats blood", "()%f[%W]", 
  function (s)
    print ("found " .. s)
  end
  )

Output

found 6
found 12

Now the boundaries from "alphanumeric" to "not alphanumeric" are at columns 6 and 12 (column 12 not actually being in the string as it is only 11 characters long).

I can't offhand think of a use for this, which is probably why it is undocumented. :)

[EDIT]

See further down for a practical use for the frontier pattern: detecting word boundaries, even where words are at the start or end of the string to be matched.
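A minimal sketch of a whole-word replacement built on that idea (the sample sentence is made up; the frontier pattern needs Lua 5.1 or later as far as I know):

```lua
-- replace "cat" only where it stands as a whole word
print (string.gsub ("cat concatenate cat", "%f[%a]cat%f[%A]", "dog"))
  --> dog concatenate dog 2

-- without the frontiers, the "cat" inside "concatenate" is replaced as well
print (string.gsub ("cat concatenate cat", "cat", "dog"))
  --> dog condogenate dog 3
```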


Posted by Nick Gammon Australia (22,982 posts) bio Forum Administrator

Date

Reply #4 on Sun 30 Oct 2005 09:17 PM (UTC)

Amended on Sun 30 Oct 2005 10:11 PM (UTC) by Nick Gammon

Message

Timing comparisons

After all of this I was wondering which method is faster for reasonably simple finds - Lua's internal find, or the PCRE find which is in the rex (regular expression) library?

I have set up a fairly simple regular expression, the sort you might want to match on fairly often, and some text to match against ...


match = "(.*) says, (.*)"
target = "Nick Gammon says, I wonder what's happening now?"
count = 2000000

Lua's string.find


start = os.time ()
for i = 1, count do
  test = string.find (target, match)
end -- for
period = os.time () - start

print ("Time taken (string.find) = ", period, " secs")

Output

Time taken (string.find) =  10  secs

PCRE library, compiling each time

Lua's find is interpretive, that is, you don't precompile the pattern. However the PCRE (rex library) find is done in 2 stages - compile regular expression and execute it. Thus you would expect it to be faster if you compile once and execute many times.

However, to compare initially with combining them, in case precompiling isn't practical, let's try that first ...


start = os.time ()
for i = 1, count do
 result = rex.new (match):match (target)
end -- for
period = os.time () - start

print ("Time taken (rex.new + re:match) = ", period, " secs")

Output

Time taken (rex.new + re:match) =  23 secs

Interestingly, that takes over twice as long as the internal Lua find.

PCRE library, compiling once only

Finally, let's precompile the PCRE regexp, to see if that speeds things up ...


re = rex.new (match) -- precompile regexp

start = os.time ()
for i = 1, count do
  result = re:match (target)  -- test for match
end -- for
period = os.time () - start

print ("Time taken (re:match) = ", period, " secs")

Output

Time taken (re:match) =  11  secs

Still slightly slower than the internal method.

Summary

For the regular expression I tested, and the target string, in order of speed we had:

Lua internal find - 10 seconds for 2,000,000 matches
PCRE find, precompiled - 11 seconds for 2,000,000 matches
PCRE find, compiled each time - 23 seconds for 2,000,000 matches

So it would seem that if the internal Lua find-and-replace fits your needs it is definitely faster, especially if you can't precompile the regular expression. However to put those figures into context, 10 seconds for 2,000,000 matches is 200,000 matches per second, probably many more than you would do in practice.

However if you need the PCRE matching for its extra power, you definitely get a speed improvement by precompiling the regexp, and reusing it.
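Since os.time only counts whole seconds, shorter runs can be timed with os.clock instead, which returns CPU time as a fractional number of seconds. A sketch of a small timing helper (the name time_it and the iteration count are made up for this example, and only the built-in string.find is timed so that it runs without the rex library):

```lua
-- Hypothetical helper: call fn `count` times and report the CPU time used.
local function time_it (label, count, fn)
  local start = os.clock ()
  for i = 1, count do
    fn ()
  end -- for
  local elapsed = os.clock () - start
  print (string.format ("Time taken (%s) = %.3f secs", label, elapsed))
  return elapsed
end

local match = "(.*) says, (.*)"
local target = "Nick Gammon says, I wonder what's happening now?"

time_it ("string.find", 100000, function ()
  return string.find (target, match)
end)
```

The figures will differ from machine to machine, so they are only useful for comparing methods run on the same system.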


Posted by Nick Gammon Australia (22,982 posts) bio Forum Administrator

Date

Reply #5 on Sun 30 Oct 2005 10:26 PM (UTC)

Message

Timing breaking a string into a table

Now let's time the example shown earlier of using string.gfind to break a string containing newlines into table entries.

rex library using gmatch


count = 1000000

sw = EvaluateSpeedwalk ("6n 3e 4s 3(ne)")

start = os.time ()
for i = 1, count do
  
  lines = {}

  rex.new ("(.+)"):gmatch (sw, 

  -- build speedwalk lines into a table

  function (m) 
    table.insert (lines, m)
  end)

end -- for 
period = os.time () - start

print ("Time taken (re:gmatch) = ", period, " secs")

Output

Time taken (re:gmatch) =  69  secs

Lua string.gfind


count = 1000000

sw = EvaluateSpeedwalk ("6n 3e 4s 3(ne)")

start = os.time ()
for i = 1, count do
  
  lines = {}
  
  for w in string.gfind (sw, "[^\n]+") do
    table.insert (lines, w)
  end -- for

end -- for 
period = os.time () - start

print ("Time taken (string.gfind) = ", period, " secs")

Output

Time taken (string.gfind) =  34  secs

This test shows that the Lua matching is about twice as fast in situations like this (34 seconds versus 69).


Posted by Nick Gammon Australia (22,982 posts) bio Forum Administrator

Date

Reply #6 on Tue 01 Nov 2005 09:36 PM (UTC)

Message

For information about the PCRE regular expression library see the forum post for Regular Expressions.


Posted by Nick Gammon Australia (22,982 posts) bio Forum Administrator

Date

Reply #7 on Mon 14 Nov 2005 03:05 AM (UTC)

Message

Thanks to Ked's suggestion, there is now a utils.split function that splits strings. See:

http://www.gammon.com.au/forum/bbshowpost.php?bbsubject_id=6079

On the timing test I did further up, this took 10 seconds to do 1000000 iterations, compared to 34 seconds using the string.gfind method described earlier.


Posted by Nick Gammon Australia (22,982 posts) bio Forum Administrator

Date

Reply #8 on Wed 07 Dec 2005 02:38 AM (UTC)

Amended on Wed 07 Dec 2005 04:27 AM (UTC) by Nick Gammon

Message

Nice use of the Frontier pattern match

Finally, I found a use for the %f frontier pattern!

In the course of documenting the global find-and-replace in the notepad, I wanted to show how you might fix up all caps words into lower-case, but leave everything else alone. For example, change:



AAA aaa BBB aaaBBB BBBaaa CCCCC

to:



aaa aaa bbb aaaBBB BBBaaa ccccc

Let's assume we are calling a function that changes the matched text to lower-case (see below for that function).

The first attempt is not a big success:



Find: %u+

Result:



aaa aaa bbb aaabbb bbbaaa ccccc

Why? Because it matches all upper-case sequences, even those inside other words.

Let's add an extra test, that the word must be followed by a non-letter:



Find: %u+%A

Result:



aaa aaa bbb aaabbb BBBaaa CCCCC

This is better, but the word that started in lower-case and changed to upper-case still got converted.

So we throw in a second test, that the word must be preceded by a non-letter:



Find: %A%u+%A

Result:



AAA aaa bbb aaaBBB BBBaaa CCCCC

This is better again, only the all upper-case word in the middle got converted. But! The two words on the ends are not converted (AAA and CCCCC). This is because AAA at the start is not preceded by a non-letter. It is not preceded by anything. Similarly CCCCC at the end is not followed by a non-letter.

Enter the frontier pattern:



Find: %f[%a]%u+%f[%A]

Result:



aaa aaa bbb aaaBBB BBBaaa ccccc

Finally! Perfect results.

We detect the transition from non-letters to letters with %f[%a], which also matches at the start of the string, and finish off with %f[%A], which is the transition from letters to non-letters (and which also matches at the end of the string).

If you want to experiment yourself, the other things in the global replace box that need to be filled in are:


Find: %f[%a]%u+%f[%A]
Replace: string.lower
Line by Line: no
Call Function: yes
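Outside the notepad dialog, the same replacement can be sketched in plain Lua by passing string.lower to string.gsub as the replacement function (this assumes Lua 5.1 or later for the frontier pattern; the variable names are made up):

```lua
local text = "AAA aaa BBB aaaBBB BBBaaa CCCCC"

-- each all-caps word is passed to string.lower, and the result replaces it
local fixed, count = string.gsub (text, "%f[%a]%u+%f[%A]", string.lower)

print (fixed) --> aaa aaa bbb aaaBBB BBBaaa ccccc
print (count) --> 3
```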


The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).


Information and images on this site are licensed under the Creative Commons Attribution 3.0 Australia License unless stated otherwise.