lpeg code translate to lpeg re
|
It is now over 60 days since the last post. This thread is closed.
Posted by
| Nick Gammon
Australia (23,133 posts) Bio
Forum Administrator |
Date
| Reply #45 on Sat 27 Jan 2018 12:37 AM (UTC) Amended on Sat 27 Jan 2018 12:40 AM (UTC) by Nick Gammon
|
Message
| Interestingly, in the above example, the delimiters are not returned as captures (which may well be a good thing). However you can get them like this:
require "re"

target = "this fubar that grand canyon whatever"

pat = re.compile( [[
  {|                                            -- table capture
    { (g <- { 'fubar' } / . g)+ } -> drop
    { (g <- { 'grand canyon' } / . g)+ } -> drop
    {.*}                                        -- rest of line
  |}                                            -- end table capture
]],
  { drop = function(s, cap)
      return s:sub (1, -#cap -1), cap  -- trim first capture, return second capture
    end } )

t = lpeg.match (pat, target)

for k, v in ipairs (t) do
  print (k, v)
end -- for
This prints:
1 this
2 fubar
3 that
4 grand canyon
5 whatever
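As a cross-check of what the pattern above produces, here is a minimal plain-Lua sketch of the same split, using only the string library. The helper name split_keep is hypothetical (it is not part of lpeg or re), and it assumes the delimiters are literal text that must all be present, in order.

```lua
-- Plain-Lua sketch: split a line around literal delimiters, keeping the delimiters.
-- split_keep is a hypothetical helper, not part of lpeg or re.
function split_keep (target, delimiters)
  local parts, pos = {}, 1
  for _, delim in ipairs (delimiters) do
    -- plain find (4th argument true) treats delim as literal text
    local first, last = target:find (delim, pos, true)
    if not first then
      return nil  -- like the lpeg pattern, fail if a delimiter is missing
    end
    parts [#parts + 1] = target:sub (pos, first - 1)  -- text before the delimiter
    parts [#parts + 1] = delim                        -- the delimiter itself
    pos = last + 1
  end
  parts [#parts + 1] = target:sub (pos)  -- rest of line
  return parts
end

t = split_keep ("this fubar that grand canyon whatever", { "fubar", "grand canyon" })

for k, v in ipairs (t) do
  print (k, v)
end -- for
```

This yields the same five items as the lpeg version, which makes it handy for checking the captures by eye.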
|
- Nick Gammon
www.gammon.com.au, www.mushclient.com | Top |
|
Posted by
| Albert Chan
(55 posts) Bio
|
Date
| Reply #46 on Sat 27 Jan 2018 01:45 AM (UTC) Amended on Sat 27 Jan 2018 03:11 AM (UTC) by Albert Chan
|
Message
| To make the above even more general, you may want to use the
last capture, instead of the second capture, to drop characters.
function drop(s, ...)
  local last = select(select('#', ...), ...)
  return s:sub(1, #s - #last), last
end
pat = re.compile( "{(g<- {%A+}/ .g)+} -> drop", {drop=drop})
= pat:match "this-is--a----test"
this-is--a
---- | Top |
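To see what drop receives, it can be called by hand with the captures the pattern would collect for "this-is--a----test": the whole outer match first, then each separator run in turn. This is only a sketch of the argument handling; it assumes at least one inner capture, since select(0, ...) raises an error.

```lua
function drop (s, ...)
  local n = select ('#', ...)   -- number of inner captures
  local last = select (n, ...)  -- arguments from position n onward; only the first is kept
  return s:sub (1, #s - #last), last
end

-- outer capture, followed by the three separator captures
head, sep = drop ("this-is--a----", "-", "--", "----")
print (head)  --> this-is--a
print (sep)   --> ----
```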
|
Posted by
| Nick Gammon
Australia (23,133 posts) Bio
Forum Administrator |
Date
| Reply #47 on Sat 27 Jan 2018 01:51 AM (UTC) |
Message
| Would it necessarily be the last capture? |
- Nick Gammon
www.gammon.com.au, www.mushclient.com | Top |
|
Posted by
| Albert Chan
(55 posts) Bio
|
Date
| Reply #48 on Sat 27 Jan 2018 02:13 AM (UTC) Amended on Sat 27 Jan 2018 02:14 AM (UTC) by Albert Chan
|
Message
| I read somewhere that captures are numbered in the order of their opening parentheses
(or braces, in the lpeg re case).
If that is true, the last capture is the last argument to drop. | Top |
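For Lua's own string patterns the ordering can be checked directly, as in this small sketch (the subject string is made up). For lpeg, the manual says that when a capture contains other captures, their values are returned after the enclosing one, which gives the same outer-first order.

```lua
-- Captures are numbered by the position of their opening parenthesis
c1, c2, c3, c4 = string.match ("abc", "((a)(b(c)))")
print (c1)  --> abc  (1st parenthesis: the whole match)
print (c2)  --> a    (2nd parenthesis)
print (c3)  --> bc   (3rd parenthesis)
print (c4)  --> c    (4th parenthesis, nested inside the 3rd)
```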
|
Posted by
| Albert Chan
(55 posts) Bio
|
Date
| Reply #49 on Sat 27 Jan 2018 05:31 AM (UTC) Amended on Thu 08 Feb 2018 07:55 PM (UTC) by Albert Chan
|
Message
| I patched lpeg, so now lpeg.B(-n) means n unconditional backtracks.
For convenience, the re pattern %b = lpeg.B(-1).
I can now do a true greedy match that is still tail recursive.
pat = re.compile "{.* %b^3 (g <- &'and' / %b g)}"
All my previous code used multiple non-greedy matches to
simulate a greedy match. This re pattern does a true greedy match.
For long strings, it beats string.match in performance!
(correction: only if 'and' is close to the end of the string)
:-) | Top |
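For comparison, this is the stock string.match baseline for greedy versus non-greedy splitting on 'and'. It does not use the patched lpeg, and the subject string is made up for illustration.

```lua
s = "black and white and grey"

-- greedy: (.*) grabs as much as possible, so the split is at the LAST 'and'
g_head, g_tail = string.match (s, "(.*)and(.*)")
print (g_head)  --> "black and white "
print (g_tail)  --> " grey"

-- non-greedy: (.-) grabs as little as possible, so the split is at the FIRST 'and'
n_head, n_tail = string.match (s, "(.-)and(.*)")
print (n_head)  --> "black "
print (n_tail)  --> " white and grey"
```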
|
Posted by
| Albert Chan
(55 posts) Bio
|
Date
| Reply #50 on Sat 27 Jan 2018 09:47 PM (UTC) Amended on Sun 28 Jan 2018 01:26 AM (UTC) by Albert Chan
|
Message
| The drop function idea is very nice!
Lua's string.match has a tough time doing a greedy match ON the separator %A+
t = "this-is--a----test"
= string.match(t, "^(.*)(%A+)%a*$") -- (.*) too greedy
this-is--a---
-
The Lua string library can do it, but in multiple steps, which is slower and ugly.
lpeg has the nice property of letting us adjust how greedy the match is.
pat = re.compile("{(g <- {%A+}/ .g)+} -> drop", {drop=drop})
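One possible multi-step version with the plain string library looks like this. It is a sketch, not the code that was benchmarked, and split_last_run is a hypothetical name: it walks every separator run and remembers the last one.

```lua
-- Hypothetical helper: split t just before its last run of non-letters
function split_last_run (t)
  local start, sep
  -- "()" is a position capture; each iteration overwrites the previous run
  for pos, run in t:gmatch ("()(%A+)") do
    start, sep = pos, run
  end
  if not start then
    return nil  -- no separator at all
  end
  return t:sub (1, start - 1), sep
end

print (split_last_run ("this-is--a----test"))
```

The loop visits all three separator runs before returning "this-is--a" and "----", which is the extra work the lpeg drop version avoids writing out by hand.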
| Top |
|
Posted by
| Nick Gammon
Australia (23,133 posts) Bio
Forum Administrator |
Date
| Reply #51 on Sun 28 Jan 2018 04:06 AM (UTC) |
Message
| You can achieve that with a frontier pattern however:
a, b = string.match(t, "^(.*)(%f[%A]%A+%f[^%A])%a*$")
print (a)
print (b)
Results:
this-is--a
----
The frontier pattern asserts a change from "not in set" to "in set". Thus we detect the first hyphen. The inverse frontier pattern asserts the end of the hyphens. |
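The same "not in set" to "in set" transition is often used as a word-boundary test. A minimal sketch, with made-up subject strings:

```lua
-- %f[%a] : previous character is not a letter, current one is (start of a word)
-- %f[%A] : previous character is a letter, current one is not (end of a word)
r1, n1 = string.gsub ("the cat catalog", "%f[%a]cat%f[%A]", "dog")
print (r1, n1)  --> the dog catalog    1

-- the end of the subject counts as 0x00, which is not a letter, so a final word still matches
r2, n2 = string.gsub ("catalog cat", "%f[%a]cat%f[%A]", "dog")
print (r2, n2)  --> catalog dog    1
```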
- Nick Gammon
www.gammon.com.au, www.mushclient.com | Top |
|
Posted by
| Albert Chan
(55 posts) Bio
|
Date
| Reply #52 on Sun 28 Jan 2018 12:54 PM (UTC) Amended on Mon 29 Jan 2018 12:31 AM (UTC) by Albert Chan
|
Message
| Wow, Lua has a %f pattern?
It almost reaches the performance of the lpeg drop version!
With a tiny change, the Lua pattern even beats lpeg (again!)
benchmark: "this-is--a----test" -> "this-is--a", "----"
time(us)  match function  pattern
5.66      string.match    "(.*)%f[%A](%A+)"
7.20      string.match    "^(.*)(%f[%A]%A+%f[%a])%a*$"
6.57      lpeg.match      "{(g <- {%A+} / %a+ g)+} -> drop ", {drop = drop}
Sadly, Lua's %f is undocumented and buggy (on luajit 1.1.8 anyway)
= string.match("----whatever", "(.*)%f[%A](%A+)")
nil | Top |
|
Posted by
| Nick Gammon
Australia (23,133 posts) Bio
Forum Administrator |
Date
| Reply #53 on Sun 28 Jan 2018 08:57 PM (UTC) Amended on Sun 28 Jan 2018 09:06 PM (UTC) by Nick Gammon
|
Message
| I've documented it here: http://www.gammon.com.au/scripts/doc.php?lua=string.find
And here: https://www.gammon.com.au/forum/?id=6034&reply=3#reply3
And here: https://www.gammon.com.au/forum/?id=6034&reply=8#reply8
Quote:
Sadly, lua %f is undocumented and buggy (on luajit 1.1.8 anyway)
= string.match("----whatever", "(.*)%f[%A](%A+)")
nil
I think this is behaving as intended. Looking at the source, a pattern match is doing this:
previous = (s == ms->src_init) ? '\0' : *(s-1);
if (matchbracketclass(uchar(previous), p, ep-1) ||
    !matchbracketclass(uchar(*s), p, ep-1)) return NULL;
In other words, it is testing for a transition from "not in set" (the previous character) to "in set" (the current character).
Also, the first line says that if we are at the start of the subject then the previous character (which doesn't exist) is treated as 0x00.
Since you specified %f[%A], "not in set" means an alphabetic character. 0x00 is not alphabetic, so the test fails at the start of the subject.
If you turn the logic around then it works:
print (string.match("whatever----", "(.*)%f[%a](%a+)")) --> "", "whatever"
Now we are looking for a transition to a letter from "not a letter" and 0x00 is not a letter. |
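The C test quoted above can be mimicked in Lua to try individual positions. This is a sketch of the logic only, not the library's code; the function name frontier and its arguments are made up, and set must be a bracket class such as "[%A]".

```lua
-- Hypothetical emulation of the %f check at position pos of subject s
function frontier (s, pos, set)
  -- before the start and past the end, the character is taken to be 0x00
  local previous = (pos == 1) and "\0" or s:sub (pos - 1, pos - 1)
  local current  = (pos > #s) and "\0" or s:sub (pos, pos)
  -- succeed only if previous is NOT in the set and current IS in the set
  return previous:find (set) == nil and current:find (set) ~= nil
end

print (frontier ("----whatever", 1, "[%A]"))  --> false (0x00 is itself in %A)
print (frontier ("whatever----", 1, "[%a]"))  --> true  (0x00 is not a letter, 'w' is)
print (frontier ("whatever----", 9, "[%A]"))  --> true  ('r' then '-')
```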
- Nick Gammon
www.gammon.com.au, www.mushclient.com | Top |
|
Posted by
| Nick Gammon
Australia (23,133 posts) Bio
Forum Administrator |
Date
| Reply #54 on Sun 28 Jan 2018 09:04 PM (UTC) |
Message
| You can make your pattern work by allowing for the 0x00:
print (string.match("----whatever", "(.*)%f[^%a%z](%A+)"))
Rather than using [%A] I used [^%a%z] which will allow for the 0x00 character. |
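Since %z is marked deprecated from Lua 5.2 onwards (patterns there may contain "\0" directly), a different approach is to keep %f[%A] and handle the start-of-subject case separately. This is a sketch of that alternative, not the pattern above; split_last_separator is a hypothetical name, and it assumes the subject has no embedded zero bytes.

```lua
-- Hypothetical helper: split s before its last separator run, without using %z
function split_last_separator (s)
  local head, sep = string.match (s, "(.*)%f[%A](%A+)")
  if head then
    return head, sep
  end
  -- no letter is followed by a separator, so the only possible run is a leading one
  sep = string.match (s, "^%A+")
  if sep then
    return "", sep
  end
  return nil
end

print (split_last_separator ("----whatever"))        -- empty head, then ----
print (split_last_separator ("this-is--a----test"))  -- this-is--a, then ----
```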
- Nick Gammon
www.gammon.com.au, www.mushclient.com | Top |
|
Posted by
| Albert Chan
(55 posts) Bio
|
Date
| Reply #55 on Sun 28 Jan 2018 09:21 PM (UTC) Amended on Mon 29 Jan 2018 10:23 AM (UTC) by Albert Chan
|
Message
| The %f[^%a%z] trick works!
Parsing the above class range takes time, however, which hurts performance.
Maybe use pre-built class ranges?
= lpeg.pcode( re.compile "[^%a\x00]" ) -- with my patched lpeg
[]
00: set [\x01-\x40\x5b-\x60\x7b-\xff] --> just cut and paste
09: end
benchmark: "this-is--a----test" -> "this-is--a", "----"
time(us)  match function  pattern
4.65      string.match    "(.*)%f[\x01-\x40\x5b-\x60\x7b-\xff](%A+)"
6.67      string.match    "(.*)%f[^%a%z](%A+)"
8.26      string.match    "^(.*)(%f[^%a%z]%A+%f[%a])%a*$"
6.57      lpeg.match      "{(g <- {%A+} / %a+ g)+} -> drop ", {drop = drop}
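A quick way to convince yourself that the pre-built range covers exactly "not a letter and not 0x00" is to test all 256 byte values. This sketch writes the range with decimal escapes (\1 for \x01, \64 for \x40, and so on) so that it also loads on Lua 5.1, and it assumes the default "C" locale, where only A-Z and a-z are letters.

```lua
-- same ranges as \x01-\x40, \x5b-\x60, \x7b-\xff, written with decimal escapes
set = "[\1-\64\91-\96\123-\255]"

mismatches = 0
for c = 0, 255 do
  local ch = string.char (c)
  local in_set = ch:find (set) ~= nil
  local expected = (c ~= 0) and (ch:find ("%a") == nil)  -- not 0x00, not a letter
  if in_set ~= expected then
    mismatches = mismatches + 1
  end
end

print (mismatches)  --> 0
```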
| Top |
|
Posted by
| Albert Chan
(55 posts) Bio
|
Date
| Reply #56 on Thu 08 Feb 2018 06:58 PM (UTC) Amended on Thu 08 Feb 2018 11:26 PM (UTC) by Albert Chan
|
Message
| I discovered an lpeg re pattern that beats string.match, even with the pre-built frontier pattern class.
The trick is to use the frontier pattern idea in lpeg (backtrack to before %A+).
benchmark: "this-is--a----test" -> "this-is--a", "----"
3.63 pat = re.compile "{ %a* (%A+ %a* &%A)* } {%A+}"
The same idea can be used for the previous "(.*)and(.*)" case, without the drop function:
pat = re.compile("{ >&%z (%z >&%z)* } %z {.*}", {z='and'})
Note: ">&'and' " is shorthand for "(g <- &'and' / .[^a]* g)" | Top |
|
Posted by
| Nick Gammon
Australia (23,133 posts) Bio
Forum Administrator |
Date
| Reply #57 on Fri 09 Feb 2018 02:53 AM (UTC) |
Message
|
pat = re.compile("{ >&%z (%z >&%z)* } %z {.*}", {z='and'})
That doesn't compile for me. The symbol ">" is not defined in the re documentation. |
- Nick Gammon
www.gammon.com.au, www.mushclient.com | Top |
|
Posted by
| Albert Chan
(55 posts) Bio
|
Date
| Reply #58 on Fri 09 Feb 2018 03:03 AM (UTC) Amended on Fri 09 Feb 2018 06:38 PM (UTC) by Albert Chan
|
Message
| '>' is just shorthand for the grammar (g <- patt / . [^head-chars-of-patt]* g).
It needs a search and replace (see the note in reply 56). | Top |
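Read as a loop, the expanded grammar says: succeed if patt starts here, otherwise skip one character plus everything that cannot be the first character of patt, and try again. The plain-Lua sketch below follows that reading for a literal word. It illustrates the search strategy only, not how LPeg executes the grammar, and search_word is a hypothetical name.

```lua
-- Hypothetical helper: position where word starts in s, or nil
function search_word (s, word)
  local head = word:sub (1, 1)
  local pos = 1
  while pos <= #s do
    -- patt : does the word start at this position?
    if s:sub (pos, pos + #word - 1) == word then
      return pos
    end
    -- . [^head]* : skip one character, then jump to the next possible first character
    pos = s:find (head, pos + 1, true) or (#s + 1)
  end
  return nil
end

print (search_word ("black and white", "and"))  --> 7
```

Skipping straight to the next 'a' is what saves time compared with advancing one character per step.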
|
Posted by
| Albert Chan
(55 posts) Bio
|
Date
| Reply #59 on Fri 09 Feb 2018 06:37 PM (UTC) Amended on Fri 09 Feb 2018 06:40 PM (UTC) by Albert Chan
|
Message
| The above code (reply 57) will compile with my patched lpeg, which does the expansion automatically.
I just posted the source on GitHub (comments welcome):
https://github.com/achan001/LPeg-anywhere
| Top |
|
The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).