
Gammon Forum


lpeg code translate to lpeg re


Posted by Nick Gammon   Australia  (23,133 posts)  Bio   Forum Administrator
Date Reply #45 on Sat 27 Jan 2018 12:37 AM (UTC)

Amended on Sat 27 Jan 2018 12:40 AM (UTC) by Nick Gammon

Message
Interestingly, in the above example, the delimiters are not returned as captures (which may well be a good thing). However you can get them like this:


local lpeg = require "lpeg"  -- lpeg.match is used below
local re = require "re"

target = "this fubar that grand canyon whatever"

pat = re.compile( [[
       {|  -- table capture
       { (g <-  { 'fubar' }         / . g)+ } -> drop 
       { (g <-  { 'grand canyon' }  / . g)+ } -> drop 
       {.*}  -- rest of line
       |}  -- end table capture
       ]],
       { drop = function(s, cap) 
                return s:sub (1, -#cap -1), cap  -- trim first capture, return second capture
                end } )

t = lpeg.match (pat, target)
for k, v in ipairs (t) do
  print (k, v)
end -- for


This prints:


1 this 
2 fubar
3  that 
4 grand canyon
5  whatever

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Albert Chan   (55 posts)  Bio
Date Reply #46 on Sat 27 Jan 2018 01:45 AM (UTC)

Amended on Sat 27 Jan 2018 03:11 AM (UTC) by Albert Chan

Message
To make the above even more general, you may want to use the
last capture, instead of the second capture, to drop characters.

function drop(s, ...)
  local last = select(select('#', ...), ...)
  return s:sub(1, #s - #last), last
end

pat = re.compile( "{(g<- {%A+}/ .g)+} -> drop", {drop=drop})

= pat:match "this-is--a----test"
this-is--a
----
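
The drop helper can also be tried on its own, without lpeg. This is only a sketch: the argument list below is an assumption about what the pattern passes to drop for this subject (the outer capture first, then each inner {%A+} capture in order).

```lua
-- Same helper as posted above: trims the last capture off the end of s.
local function drop(s, ...)
  local last = select(select('#', ...), ...)  -- the final extra argument
  return s:sub(1, #s - #last), last
end

-- Assumed arguments for "this-is--a----test": outer capture, then each separator run.
local head, sep = drop("this-is--a----", "-", "--", "----")
print(head, sep)
```

If drop were ever called with no inner captures, select would raise an "index out of range" error, so the helper relies on the pattern producing at least one.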

Posted by Nick Gammon   Australia  (23,133 posts)  Bio   Forum Administrator
Date Reply #47 on Sat 27 Jan 2018 01:51 AM (UTC)
Message
Would it necessarily be the last capture?

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Albert Chan   (55 posts)  Bio
Date Reply #48 on Sat 27 Jan 2018 02:13 AM (UTC)

Amended on Sat 27 Jan 2018 02:14 AM (UTC) by Albert Chan

Message
I read somewhere that captures are ordered by the position of their opening parenthesis
(or opening brace, in the lpeg re case).

If that is true, the last capture is the last argument passed to drop.
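
For stock Lua string patterns the rule is easy to check, since captures there are numbered by their left parentheses. The snippet below shows only that; whether lpeg re orders nested captures the same way is the assumption being discussed here, not something this snippet proves.

```lua
-- Nested captures in a stock Lua pattern: the outer group opens first,
-- so it is returned first, followed by the inner groups left to right.
local outer, first, second = string.match("ab", "((a)(b))")
print(outer, first, second)
```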

Posted by Albert Chan   (55 posts)  Bio
Date Reply #49 on Sat 27 Jan 2018 05:31 AM (UTC)

Amended on Thu 08 Feb 2018 07:55 PM (UTC) by Albert Chan

Message
I patched lpeg so that lpeg.B(-n) now performs n unconditional backtracks.

For convenience, the re pattern %b = lpeg.B(-1).
I can now do a true greedy match that is still tail recursive.

pat = re.compile "{.* %b^3 (g <- &'and' / %b g)}"

All my previous code used multiple non-greedy matches to
simulate a greedy match. This re pattern does a true greedy match.

For long strings, it beats string.match in performance!
(Correction: only if 'and' is close to the end of the string.)

:-)

Posted by Albert Chan   (55 posts)  Bio
Date Reply #50 on Sat 27 Jan 2018 09:47 PM (UTC)

Amended on Sun 28 Jan 2018 01:26 AM (UTC) by Albert Chan

Message
The drop function idea is very nice!
Lua's string.match has a tough time doing a greedy match ON the separator %A+

t = "this-is--a----test"
= string.match(t, "^(.*)(%A+)%a*$")      -- (.*) too greedy
this-is--a---
-

The Lua string library can do it, but only in multiple steps, which is slower and uglier.

lpeg has the nice property of letting us adjust how greedy the match is.

pat = re.compile("{(g <- {%A+}/ .g)+} -> drop", {drop=drop})

Posted by Nick Gammon   Australia  (23,133 posts)  Bio   Forum Administrator
Date Reply #51 on Sun 28 Jan 2018 04:06 AM (UTC)
Message
You can achieve that with a frontier pattern however:


a, b = string.match(t, "^(.*)(%f[%A]%A+%f[^%A])%a*$")
print (a)
print (b)


Results:


this-is--a
----


The frontier pattern asserts a change from "not in set" to "in set". Thus we detect the first hyphen. The inverse frontier pattern asserts the end of the hyphens.
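
A small sketch of the two frontier assertions at work on the same subject. The variable names are just for illustration; the second loop uses a position capture to list where each run of separators starts.

```lua
local t = "this-is--a----test"

-- %f[%A]  : previous character is a letter, current one is not (start of a separator run)
-- %f[^%A] : previous character is not a letter, current one is (end of the run)
local head, sep = string.match(t, "^(.*)(%f[%A]%A+%f[^%A])%a*$")
print(head, sep)

-- List every separator run together with its starting position.
local runs = {}
for start, run in t:gmatch("()%f[%A](%A+)") do
  runs[#runs + 1] = start .. ":" .. run
end
print(table.concat(runs, " "))
```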

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Albert Chan   (55 posts)  Bio
Date Reply #52 on Sun 28 Jan 2018 12:54 PM (UTC)

Amended on Mon 29 Jan 2018 12:31 AM (UTC) by Albert Chan

Message
Wow, Lua has a %f pattern?
It almost matches the performance of the lpeg drop version!
With a tiny change, the Lua pattern even beats lpeg (again!)

benchmark: "this-is--a----test" -> "this-is--a", "----"

time(us)  match function  pattern
5.66      string.match    "(.*)%f[%A](%A+)"
7.20      string.match    "^(.*)(%f[%A]%A+%f[%a])%a*$"

6.57      lpeg.match      "{(g <- {%A+} / %a+ g)+} -> drop ", {drop = drop}

Sadly, lua %f is undocumented and buggy (on luajit 1.1.8 anyway)
= string.match("----whatever", "(.*)%f[%A](%A+)")
nil
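
The thread does not say how the timings above were taken. One rough way to get comparable numbers is sketched below; time_us is a hypothetical helper, the iteration count is arbitrary, and os.clock measures CPU time, so absolute figures will differ from the table.

```lua
-- Hypothetical helper: average microseconds per call of f(...) over n iterations.
local function time_us(n, f, ...)
  local start = os.clock()
  for _ = 1, n do
    f(...)
  end
  return (os.clock() - start) / n * 1e6
end

local t = "this-is--a----test"
local patterns = {
  "(.*)%f[%A](%A+)",
  "^(.*)(%f[%A]%A+%f[%a])%a*$",
}
for _, p in ipairs(patterns) do
  print(string.format("%8.2f us  %s", time_us(100000, string.match, t, p), p))
end
```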

Posted by Nick Gammon   Australia  (23,133 posts)  Bio   Forum Administrator
Date Reply #53 on Sun 28 Jan 2018 08:57 PM (UTC)

Amended on Sun 28 Jan 2018 09:06 PM (UTC) by Nick Gammon

Message
I've documented it here: http://www.gammon.com.au/scripts/doc.php?lua=string.find

And here: https://www.gammon.com.au/forum/?id=6034&reply=3#reply3

And here: https://www.gammon.com.au/forum/?id=6034&reply=8#reply8

Quote:

Sadly, lua %f is undocumented and buggy (on luajit 1.1.8 anyway)
= string.match("----whatever", "(.*)%f[%A](%A+)")
nil


I think this is behaving as intended. Looking at the source, a pattern match is doing this:


      previous = (s == ms->src_init) ? '\0' : *(s-1);
      if (matchbracketclass(uchar(previous), p, ep-1) ||
         !matchbracketclass(uchar(*s), p, ep-1)) return NULL;


In other words, it is testing for previously not "in set" to "in set" now.

Also, the first line says that if we are at the start of the subject then we replace the previous character (which doesn't exist) with 0x00.

So since you specified %f[%A] then "not in set" would be an alphabetic character. And 0x00 is not alphabetic, thus it fails that test.

If you turn the logic around then it works:


print (string.match("whatever----", "(.*)%f[%a](%a+)"))  --> "" (empty first capture) and "whatever"


Now we are looking for a transition to a letter from "not a letter" and 0x00 is not a letter.
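
The test in the C source can be mirrored in a few lines of Lua to see why the start of the subject behaves this way. This is only a sketch: frontier_holds is a hypothetical name, and treating the position past the end of the subject as 0x00 is an assumption of the sketch.

```lua
-- Sketch of the %f[set] test, written in Lua. pos is a 1-based index into subject.
local function frontier_holds(subject, pos, set)
  -- Before the first character there is no previous one, so 0x00 stands in for it.
  local prev = (pos == 1) and "\0" or subject:sub(pos - 1, pos - 1)
  -- Assumption: past the end, the current character is also taken to be 0x00.
  local cur = (pos > #subject) and "\0" or subject:sub(pos, pos)
  local function in_set(ch)
    return ch:find("^" .. set) ~= nil
  end
  -- Frontier: previous character not in the set, current character in the set.
  return (not in_set(prev)) and in_set(cur)
end

print(frontier_holds("----whatever", 1, "[%A]"))  -- 0x00 is itself in [%A]
print(frontier_holds("whatever----", 1, "[%a]"))  -- 0x00 is not a letter, 'w' is
print(frontier_holds("this-is", 5, "[%A]"))       -- 's' followed by '-'
```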

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Nick Gammon   Australia  (23,133 posts)  Bio   Forum Administrator
Date Reply #54 on Sun 28 Jan 2018 09:04 PM (UTC)
Message
You can make your pattern work by allowing for the 0x00:


print (string.match("----whatever", "(.*)%f[^%a%z](%A+)"))


Rather than using [%A] I used [^%a%z] which will allow for the 0x00 character.
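
One portability caveat, sketched below: %z is the Lua 5.1 spelling of the 0x00 class, and later Lua versions dropped it in favour of writing "\0" directly in the pattern. The NUL variable is a hypothetical workaround that picks whichever form the running interpreter understands.

```lua
-- If %z matches a 0x00 byte, this interpreter still supports it; otherwise
-- assume it accepts an embedded "\0" inside the pattern instead.
local NUL = ("\0"):find("%z") and "%z" or "\0"
local pattern = "(.*)%f[^%a" .. NUL .. "](%A+)"

-- Leading separators now match, with an empty first capture.
local a1, b1 = string.match("----whatever", pattern)
print(a1, b1)

-- The earlier subject still splits at the last run of separators.
local a2, b2 = string.match("this-is--a----test", pattern)
print(a2, b2)
```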

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Albert Chan   (55 posts)  Bio
Date Reply #55 on Sun 28 Jan 2018 09:21 PM (UTC)

Amended on Mon 29 Jan 2018 10:23 AM (UTC) by Albert Chan

Message
The %f[^%a%z] trick works!
Parsing the class range above takes time, however, which kills performance.
Maybe use pre-built class ranges?

= lpeg.pcode( re.compile "[^%a\x00]" ) -- with my patched lpeg
[]
00: set [\x01-\x40\x5b-\x60\x7b-\xff] --> just cut and paste
09: end

benchmark: "this-is--a----test" -> "this-is--a", "----"

time(us)  match function  pattern
4.65      string.match    "(.*)%f[\x01-\x40\x5b-\x60\x7b-\xff](%A+)"
6.67      string.match    "(.*)%f[^%a%z](%A+)"
8.26      string.match    "^(.*)(%f[^%a%z]%A+%f[%a])%a*$"

6.57      lpeg.match      "{(g <- {%A+} / %a+ g)+} -> drop ", {drop = drop}
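
As a sanity check, the pre-built ranges can be compared against [^%a%z] byte by byte. This is a sketch assuming the default C locale (so %a is just A-Z and a-z); the set is assembled with string.char so it does not depend on \x escapes, and NUL is a hypothetical helper that picks %z or "\0" depending on the interpreter.

```lua
local c = string.char
local NUL = ("\0"):find("%z") and "%z" or "\0"
local generic = "[^%a" .. NUL .. "]"
-- Bytes 0x01-0x40, 0x5b-0x60 and 0x7b-0xff, as in the pcode listing above.
local prebuilt = "[" .. c(0x01) .. "-" .. c(0x40)
                     .. c(0x5b) .. "-" .. c(0x60)
                     .. c(0x7b) .. "-" .. c(0xff) .. "]"

-- Count the bytes on which the two classes disagree.
local mismatches = 0
for byte = 0, 255 do
  local ch = c(byte)
  local in_generic = ch:find("^" .. generic) ~= nil
  local in_prebuilt = ch:find("^" .. prebuilt) ~= nil
  if in_generic ~= in_prebuilt then
    mismatches = mismatches + 1
  end
end
print("mismatches:", mismatches)

local head, sep = string.match("this-is--a----test", "(.*)%f" .. prebuilt .. "(%A+)")
print(head, sep)
```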

Posted by Albert Chan   (55 posts)  Bio
Date Reply #56 on Thu 08 Feb 2018 06:58 PM (UTC)

Amended on Thu 08 Feb 2018 11:26 PM (UTC) by Albert Chan

Message
I discovered an lpeg re pattern that beats string.match, even with the pre-built frontier pattern class.

The trick is to use the frontier pattern idea in lpeg (backtrack before %A+).
benchmark: "this-is--a----test" -> "this-is--a", "----"

3.63       pat = re.compile "{ %a* (%A+ %a* &%A)* } {%A+}"

The same idea can be used for the previous "(.*)and(.*)", without the drop function:

pat = re.compile("{ >&%z (%z >&%z)* } %z {.*}", {z='and'})

note: ">&'and' " is shorthand for "(g <- &'and' / .[^a]* g)"
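
For anyone without the patched lpeg, here is a sketch of the same pattern with the shorthand expanded by hand, following the note above. It assumes an unpatched LPeg with its re module on the package path and skips that part if re cannot be loaded; split_last_and is a hypothetical name for the plain string.match version used as the reference.

```lua
-- Reference behaviour with stock Lua patterns: split around the last "and".
local function split_last_and(s)
  return string.match(s, "(.*)and(.*)")
end

-- Hand-expanded form of "{ >&%z (%z >&%z)* } %z {.*}" with z = 'and':
-- each >&'and' is replaced by (g <- &'and' / . [^a]* g).
local ok, re = pcall(require, "re")
if ok then
  local pat = re.compile [[
    { (g <- &'and' / . [^a]* g) ('and' (g <- &'and' / . [^a]* g))* } 'and' {.*}
  ]]
  print(pat:match("salt and pepper and sugar"))
end

print(split_last_and("salt and pepper and sugar"))
```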

Posted by Nick Gammon   Australia  (23,133 posts)  Bio   Forum Administrator
Date Reply #57 on Fri 09 Feb 2018 02:53 AM (UTC)
Message

pat = re.compile("{ >&%z (%z >&%z)* } %z {.*}", {z='and'})


That doesn't compile for me. The symbol ">" is not defined in the re documentation.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Albert Chan   (55 posts)  Bio
Date Reply #58 on Fri 09 Feb 2018 03:03 AM (UTC)

Amended on Fri 09 Feb 2018 06:38 PM (UTC) by Albert Chan

Message
'>' is just a shorthand for the grammar (g <- patt / . [^head-chars-of-patt]* g)

It needs a search and replace (see the note in reply 56).

Posted by Albert Chan   (55 posts)  Bio
Date Reply #59 on Fri 09 Feb 2018 06:37 PM (UTC)

Amended on Fri 09 Feb 2018 06:40 PM (UTC) by Albert Chan

Message
The above code (reply 57) will compile with my patched lpeg, which does the expansion automatically.

I just posted the source on GitHub (comments welcome):
https://github.com/achan001/LPeg-anywhere

The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).


Information and images on this site are licensed under the Creative Commons Attribution 3.0 Australia License unless stated otherwise.