Notice: Any messages purporting to come from this site telling you that your password has expired, or that you need to verify your details, confirm your email, resolve issues, making threats, or asking for money, are
spam. We do not email users with any such messages. If you have lost your password you can obtain a new one by using the
password reset link.
Due to spam on this forum, all posts now need moderator approval.
Entire forum
➜ Programming
➜ General
➜ lpeg code translate to lpeg re
lpeg code translate to lpeg re
|
It is now over 60 days since the last post. This thread is closed.
Refresh page
Pages: 1
2
3 4
5
Posted by
| Nick Gammon
Australia (23,133 posts) Bio
Forum Administrator |
Date
| Reply #30 on Wed 24 Jan 2018 04:27 AM (UTC) |
Message
|
Albert Chan said:
My re patterns with split does no lookahead, but has a bigger big O constant
Bypassing lpeg match-time-capture overhead might make it faster:
...
Yes, but then you may as well do a greedy string.match. :)
It's also faster:
Test string.match Albert
------------------------------------------------
foo and bar and whatever 0.528 1.288
foo and bar 0.338 0.830
XandY 0.279 0.863
foo 0.170 0.344
Xand 0.243 0.696
andY 0.246 0.696
and 0.226 0.665
0.120 0.313
cccccccccccc ... bbbbbbbbbbbb 98.661 112.866
That final test string was:
string.rep ("c", 5000) .. " and " .. string.rep ("a", 5000) .. " and " .. string.rep ("b", 5000)
|
- Nick Gammon
www.gammon.com.au, www.mushclient.com | Top |
|
Posted by
| Albert Chan
(55 posts) Bio
|
Date
| Reply #31 on Wed 24 Jan 2018 05:12 AM (UTC) Amended on Wed 24 Jan 2018 06:48 AM (UTC) by Albert Chan
|
Message
| "(g <- 'and' / . g)+" is tail recursive. It just advance the position.
my guess is after last 'and', it will recurse to end-of-string, then FAIL.
So it will return position of last 'and' (or nil if there was no 'and' anywhere)
it does no backtrack, thus no need to worry backtrack overflow limit.
(as far as i know, i am still a newbie ...)
"g <- . g / 'and'" is much more elegant to describe greedy match.
On average it need only to backtrack half the string length.
The top pattern, however, ALWAYS need to examine the whole string.
Your benchmark suggested lua might do greedy matches by backtacking.
Sadly lpeg 1.0.1 have this backtrack overflow limit problem.
And worse, its performance is very bad for long string even if no overflow.
you are correct that string.match is much faster for simple matching.
All of my code is for learning lpeg re, since many claimed that lpeg
can do everything lua pattern can do. | Top |
|
Posted by
| Nick Gammon
Australia (23,133 posts) Bio
Forum Administrator |
Date
| Reply #32 on Wed 24 Jan 2018 05:28 AM (UTC) Amended on Wed 24 Jan 2018 05:34 AM (UTC) by Nick Gammon
|
Message
|
Quote:
it does no backtrack, thus no need to worry backtrack overflow limit.
It must backtrack. In order to know it has the last "and" it therefore has to overshoot it and look for another one.
Therefore the loop finally fails when you get to ". g" where "g" does not end in "and" and therefore it has to backtrack to the last pattern (g) which had an "and" in it.
In other words, it matches either "and" or ". g". If g doesn't have an "and" in it then it would therefore need to be ". ." which would not be true at the end of the subject.
Therefore at end of subject, it has to backtrack to the last g which had an "and" thus that match is:
|
- Nick Gammon
www.gammon.com.au, www.mushclient.com | Top |
|
Posted by
| Nick Gammon
Australia (23,133 posts) Bio
Forum Administrator |
Date
| Reply #33 on Wed 24 Jan 2018 05:49 AM (UTC) |
Message
| To put it another way, say the string is:
"foo and " .. string.rep ("a", 1000)
The only way the match can know that there is not a final "and" is to check the entire string. Thus it has to backtrack to the last one it found. |
- Nick Gammon
www.gammon.com.au, www.mushclient.com | Top |
|
Posted by
| Albert Chan
(55 posts) Bio
|
Date
| Reply #34 on Wed 24 Jan 2018 06:18 AM (UTC) Amended on Wed 24 Jan 2018 06:24 AM (UTC) by Albert Chan
|
Message
| you may be right about the 1 backtracking back to last valid g.
my original re pattern "{ g <- .g / &'and' } 'and' {.*}" fail with string
length of only 200. From what I've read, Roberto refer this as
non-blind greedy repetition. My guess is since every backtrack is
non-blind, it must be saved, thus filling the backtrack stack.
my newer re pattern "(g <- 'and' / .g)+" might be so-called blind ?
it will only need 1 backtrack after a FAIL, since previous g is valid
He named this restricted backtracking.
say, the string has 2 'and', at pos (after 'and') 10, 20
backtrack stack = [10] -> [20] (replace 10 instead of growing stack) | Top |
|
Posted by
| Nick Gammon
Australia (23,133 posts) Bio
Forum Administrator |
Date
| Reply #35 on Wed 24 Jan 2018 06:39 AM (UTC) |
Message
| I think the important thing is that your later pattern does right-recursion which can be turned into tail recursion, thus avoiding overflowing the stack. |
- Nick Gammon
www.gammon.com.au, www.mushclient.com | Top |
|
Posted by
| Albert Chan
(55 posts) Bio
|
Date
| Reply #36 on Wed 24 Jan 2018 02:10 PM (UTC) Amended on Wed 24 Jan 2018 10:24 PM (UTC) by Albert Chan
|
Message
| Here is another re pattern that does NOT use recursive grammar.
It has similar performance as my split version (no lookahead)
Without recursive grammar, backtrack overhead may be smaller.
I optimize it furthur by not directly matching 'and'
pat = re.compile(
"((!'and' .)* ...)+ -> drop3 {.*}",
{ drop3 = function(s) return string.sub(s, 1, -4) end }
)
| Top |
|
Posted by
| Nick Gammon
Australia (23,133 posts) Bio
Forum Administrator |
Date
| Reply #37 on Wed 24 Jan 2018 11:25 PM (UTC) |
Message
| Instead of "..." you could use ".^3". |
- Nick Gammon
www.gammon.com.au, www.mushclient.com | Top |
|
Posted by
| Albert Chan
(55 posts) Bio
|
Date
| Reply #38 on Wed 24 Jan 2018 11:41 PM (UTC) Amended on Thu 25 Jan 2018 02:33 AM (UTC) by Albert Chan
|
Message
| just think of a faster lpeg re string (if not start with 'a', cannot be 'and')
"((!'and' . [^a]*)* .^3 )+ -> drop3 {.*}"
or, in recursive grammar (fastest for my luajit setup)
"(g <- 'and' / . [^a]* g)+ -> drop3 {.*}"
| Top |
|
Posted by
| Albert Chan
(55 posts) Bio
|
Date
| Reply #39 on Thu 25 Jan 2018 02:05 AM (UTC) Amended on Thu 25 Jan 2018 03:24 AM (UTC) by Albert Chan
|
Message
| some questions answered with a debug version of lpeg.dll:
RE: "(g <- 'and' / .g)+" effect on backtrack stack
-- it is tail call with jmp, will not overflow backtrack stack
RE: "(g <- .g / &'and')"
-- as expected, it recursive, filling backtrack stacks
-- the pcode is very short 14 lines vs above 26 lines
-- there is a instruction "behind 3" to not consume the 'and'',
--> will be nice if i can use behind 3 instead of drop3 function.
RE: efficiency of ... vs .^3
-- both generate: lpeg.P(1) * 1 * 1 == lpeg.P(3) | Top |
|
Posted by
| Albert Chan
(55 posts) Bio
|
Date
| Reply #40 on Thu 25 Jan 2018 02:12 PM (UTC) Amended on Thu 25 Jan 2018 04:53 PM (UTC) by Albert Chan
|
Message
| this match_nd version "remove" all a's, and match only 'nd':
"( [^a]*.'a'* (g <- 'nd' / . [^a]*.'a'* g)+ ) -> drop3 {.*}"
| Top |
|
Posted by
| Albert Chan
(55 posts) Bio
|
Date
| Reply #41 on Fri 26 Jan 2018 04:36 PM (UTC) Amended on Fri 26 Jan 2018 05:27 PM (UTC) by Albert Chan
|
Message
| without recursive grammar (and no helper function!), this is VERY efficient:
i introduce &. to undo the last and, "lookahead" without wasting time
"{ [^a]* (!'and'.)* (.^3 [^a]* (!'and'.)* &.)* } .^3 {.*}"
doing the same benchmark in reply 24, this take 5.0 us per iteration.
my old lookahead (which validate only 1 char advance) take 6.7 us | Top |
|
Posted by
| Nick Gammon
Australia (23,133 posts) Bio
Forum Administrator |
Date
| Reply #42 on Fri 26 Jan 2018 09:40 PM (UTC) |
Message
| Yes, but you will wonder in a year's time when you review your code why you did it that way. :)
I tend to find with LPEG and RE that if I come back to my examples after a few months (as I did in this thread) that it takes the brain a little while to remember what all those symbols mean, and how some of the more obscure constructions work.
Having said that, I'm working on improving my post about LPEG to explain some of the ideas you introduced, like the grammar to match "this and that and whatever". |
- Nick Gammon
www.gammon.com.au, www.mushclient.com | Top |
|
Posted by
| Nick Gammon
Australia (23,133 posts) Bio
Forum Administrator |
Date
| Reply #43 on Sat 27 Jan 2018 12:27 AM (UTC) |
Message
| One of the things that troubled me with "drop3" was you would have to adjust that every time you changed the word "and" to something else. With this pattern you can avoid that:
pat = re.compile(
"{ (g <- { 'and' } / . g)+ } -> drop {.*}",
{ drop = function(s, cap) return s:sub (1, -#cap -1) end } )
By putting a capture around our delimiter ('and') we can then determine its length inside the function and trim that many characters from the capture. |
- Nick Gammon
www.gammon.com.au, www.mushclient.com | Top |
|
Posted by
| Nick Gammon
Australia (23,133 posts) Bio
Forum Administrator |
Date
| Reply #44 on Sat 27 Jan 2018 12:33 AM (UTC) |
Message
| So now you can match multiple delimiters with a single "drop" function, like this:
require "re"
target = "this fubar that grand canyon whatever"
pat = re.compile( [[
{ (g <- { 'fubar' } / . g)+ } -> drop
{ (g <- { 'grand canyon' } / . g)+ } -> drop
{.*}
]],
{ drop = function(s, cap)
return s:sub (1, -#cap -1)
end } )
print (lpeg.match (pat, target))
Output:
|
- Nick Gammon
www.gammon.com.au, www.mushclient.com | Top |
|
The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).
To show them in your local time you can join the forum, and then set the 'time correction' field in your profile to the number of hours difference between your location and UTC time.
147,262 views.
This is page 3, subject is 5 pages long:
1
2
3 4
5
It is now over 60 days since the last post. This thread is closed.
Refresh page
top