Gammon Forum : Programming : General : lpeg code translate to lpeg re

Entire forum ➜ Programming ➜ General ➜ lpeg code translate to lpeg re

lpeg code translate to lpeg re

It is now over 60 days since the last post. This thread is closed. Refresh page

Pages: 1 2 3 4 5

Posted by Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date Reply #30 on Wed 24 Jan 2018 04:27 AM (UTC)

Message

Albert Chan said:

My re patterns with split does no lookahead, but has a bigger big O constant
Bypassing lpeg match-time-capture overhead might make it faster:

...

Yes, but then you may as well do a greedy string.match. :)

It's also faster:


Test                        string.match   Albert
------------------------------------------------
foo and bar and whatever          0.528    1.288
foo and bar                       0.338    0.830
XandY                             0.279    0.863
foo                               0.170    0.344
Xand                              0.243    0.696
andY                              0.246    0.696
and                               0.226    0.665
                                  0.120    0.313
cccccccccccc ... bbbbbbbbbbbb    98.661  112.866

That final test string was:


 string.rep ("c", 5000) .. " and " .. string.rep ("a", 5000) .. " and " .. string.rep ("b", 5000)

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Top

Posted by

Albert Chan (55 posts) Bio

Date

Reply #31 on Wed 24 Jan 2018 05:12 AM (UTC)

Amended on Wed 24 Jan 2018 06:48 AM (UTC) by Albert Chan

Message

"(g <- 'and' / . g)+" is tail recursive. It just advance the position.

my guess is after last 'and', it will recurse to end-of-string, then FAIL.
So it will return position of last 'and' (or nil if there was no 'and' anywhere)

it does no backtrack, thus no need to worry backtrack overflow limit.
(as far as i know, i am still a newbie ...)

"g <- . g / 'and'" is much more elegant to describe greedy match.
On average it need only to backtrack half the string length.
The top pattern, however, ALWAYS need to examine the whole string.
Your benchmark suggested lua might do greedy matches by backtacking.

Sadly lpeg 1.0.1 have this backtrack overflow limit problem.
And worse, its performance is very bad for long string even if no overflow.

you are correct that string.match is much faster for simple matching.
All of my code is for learning lpeg re, since many claimed that lpeg
can do everything lua pattern can do.

Top

Posted by Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #32 on Wed 24 Jan 2018 05:28 AM (UTC)

Amended on Wed 24 Jan 2018 05:34 AM (UTC) by Nick Gammon

Message

Quote:

it does no backtrack, thus no need to worry backtrack overflow limit.

It must backtrack. In order to know it has the last "and" it therefore has to overshoot it and look for another one.

Therefore the loop finally fails when you get to ". g" where "g" does not end in "and" and therefore it has to backtrack to the last pattern (g) which had an "and" in it.

In other words, it matches either "and" or ". g". If g doesn't have an "and" in it then it would therefore need to be ". ." which would not be true at the end of the subject.

Therefore at end of subject, it has to backtrack to the last g which had an "and" thus that match is:


<some stuff>and

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Top

Posted by Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date Reply #33 on Wed 24 Jan 2018 05:49 AM (UTC)

Message

To put it another way, say the string is:


"foo and " .. string.rep ("a", 1000)

The only way the match can know that there is not a final "and" is to check the entire string. Thus it has to backtrack to the last one it found.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Top

Posted by

Albert Chan (55 posts) Bio

Date

Reply #34 on Wed 24 Jan 2018 06:18 AM (UTC)

Amended on Wed 24 Jan 2018 06:24 AM (UTC) by Albert Chan

Message

you may be right about the 1 backtracking back to last valid g.

my original re pattern "{ g <- .g / &'and' } 'and' {.*}" fail with string
length of only 200. From what I've read, Roberto refer this as
non-blind greedy repetition. My guess is since every backtrack is
non-blind, it must be saved, thus filling the backtrack stack.

my newer re pattern "(g <- 'and' / .g)+" might be so-called blind ?
it will only need 1 backtrack after a FAIL, since previous g is valid
He named this restricted backtracking.

say, the string has 2 'and', at pos (after 'and') 10, 20

backtrack stack = [10] -> [20] (replace 10 instead of growing stack)

Top

Posted by

Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #35 on Wed 24 Jan 2018 06:39 AM (UTC)

Message

I think the important thing is that your later pattern does right-recursion which can be turned into tail recursion, thus avoiding overflowing the stack.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Top

Posted by Albert Chan (55 posts) Bio

Date

Reply #36 on Wed 24 Jan 2018 02:10 PM (UTC)

Amended on Wed 24 Jan 2018 10:24 PM (UTC) by Albert Chan

Message

Here is another re pattern that does NOT use recursive grammar.
It has similar performance as my split version (no lookahead)

Without recursive grammar, backtrack overhead may be smaller.
I optimize it furthur by not directly matching 'and'


pat = re.compile( 
      "((!'and' .)* ...)+ -> drop3 {.*}",
      { drop3 = function(s) return string.sub(s, 1, -4) end }
)

Top

Posted by

Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #37 on Wed 24 Jan 2018 11:25 PM (UTC)

Message

Instead of "..." you could use ".^3".

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Top

Posted by Albert Chan (55 posts) Bio

Date

Reply #38 on Wed 24 Jan 2018 11:41 PM (UTC)

Amended on Thu 25 Jan 2018 02:33 AM (UTC) by Albert Chan

Message

just think of a faster lpeg re string (if not start with 'a', cannot be 'and')

"((!'and' . [^a]*)* .^3 )+ -> drop3 {.*}"

or, in recursive grammar (fastest for my luajit setup)

"(g <- 'and' / . [^a]* g)+ -> drop3 {.*}"

Top

Posted by

Albert Chan (55 posts) Bio

Date

Reply #39 on Thu 25 Jan 2018 02:05 AM (UTC)

Amended on Thu 25 Jan 2018 03:24 AM (UTC) by Albert Chan

Message

some questions answered with a debug version of lpeg.dll:

RE: "(g <- 'and' / .g)+" effect on backtrack stack
-- it is tail call with jmp, will not overflow backtrack stack

RE: "(g <- .g / &'and')"
-- as expected, it recursive, filling backtrack stacks
-- the pcode is very short 14 lines vs above 26 lines
-- there is a instruction "behind 3" to not consume the 'and'',
--> will be nice if i can use behind 3 instead of drop3 function.

RE: efficiency of ... vs .^3
-- both generate: lpeg.P(1) * 1 * 1 == lpeg.P(3)

Top

Posted by Albert Chan (55 posts) Bio

Date

Reply #40 on Thu 25 Jan 2018 02:12 PM (UTC)

Amended on Thu 25 Jan 2018 04:53 PM (UTC) by Albert Chan

Message

this match_nd version "remove" all a's, and match only 'nd':

"( [^a]*.'a'* (g <- 'nd' / .  [^a]*.'a'* g)+ ) -> drop3 {.*}"

Top

Posted by Albert Chan (55 posts) Bio

Date

Reply #41 on Fri 26 Jan 2018 04:36 PM (UTC)

Amended on Fri 26 Jan 2018 05:27 PM (UTC) by Albert Chan

Message

without recursive grammar (and no helper function!), this is VERY efficient:
i introduce &. to undo the last and, "lookahead" without wasting time


"{ [^a]* (!'and'.)* (.^3 [^a]* (!'and'.)* &.)* } .^3 {.*}"

doing the same benchmark in reply 24, this take 5.0 us per iteration.
my old lookahead (which validate only 1 char advance) take 6.7 us

Top

Posted by

Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #42 on Fri 26 Jan 2018 09:40 PM (UTC)

Message

Yes, but you will wonder in a year's time when you review your code why you did it that way. :)

I tend to find with LPEG and RE that if I come back to my examples after a few months (as I did in this thread) that it takes the brain a little while to remember what all those symbols mean, and how some of the more obscure constructions work.

Having said that, I'm working on improving my post about LPEG to explain some of the ideas you introduced, like the grammar to match "this and that and whatever".

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Top

Posted by Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date Reply #43 on Sat 27 Jan 2018 12:27 AM (UTC)

Message

One of the things that troubled me with "drop3" was you would have to adjust that every time you changed the word "and" to something else. With this pattern you can avoid that:


pat = re.compile( 
      "{ (g <-  { 'and' }  / . g)+ } -> drop {.*}",
       { drop = function(s, cap) return s:sub (1, -#cap -1) end } )

By putting a capture around our delimiter ('and') we can then determine its length inside the function and trim that many characters from the capture.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Top

Posted by Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date Reply #44 on Sat 27 Jan 2018 12:33 AM (UTC)

Message

So now you can match multiple delimiters with a single "drop" function, like this:


require "re"

target = "this fubar that grand canyon whatever"

pat = re.compile( [[
       { (g <-  { 'fubar' }         / . g)+ } -> drop 
       { (g <-  { 'grand canyon' }  / . g)+ } -> drop 
       {.*}
       ]],
       { drop = function(s, cap) 
                return s:sub (1, -#cap -1) 
                end } )

print (lpeg.match (pat, target))

Output:


this   that   whatever

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Top

The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).

To show them in your local time you can join the forum, and then set the 'time correction' field in your profile to the number of hours difference between your location and UTC time.

216,153 views.

This is page 3, subject is 5 pages long: 1 2 3 4 5

It is now over 60 days since the last post. This thread is closed. Refresh page

Go to topic: Search the forum

top