lpeg code translate to lpeg re

Posted by Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #45 on Sat 27 Jan 2018 12:37 AM (UTC)

Amended on Sat 27 Jan 2018 12:40 AM (UTC) by Nick Gammon

Message

Interestingly, in the above example, the delimiters are not returned as captures (which may well be a good thing). However you can get them like this:


lpeg = require "lpeg"  -- lpeg.match is called below
re = require "re"

target = "this fubar that grand canyon whatever"

pat = re.compile( [[
       {|  -- table capture
       { (g <-  { 'fubar' }         / . g)+ } -> drop 
       { (g <-  { 'grand canyon' }  / . g)+ } -> drop 
       {.*}  -- rest of line
       |}  -- end table capture
       ]],
       { drop = function(s, cap) 
                return s:sub (1, -#cap -1), cap  -- trim first capture, return second capture
                end } )

t = lpeg.match (pat, target)
for k, v in ipairs (t) do
  print (k, v)
end -- for

This prints:


1 this 
2 fubar
3  that 
4 grand canyon
5  whatever
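
The index arithmetic in drop can be checked without lpeg. A minimal sketch, assuming the function receives the whole match followed by the delimiter capture, as the pattern above arranges (the name drop_delim is only for illustration):

```lua
-- Standalone copy of the drop logic, so it can be tried without lpeg.
-- s   : the whole text matched by the outer capture
-- cap : the delimiter captured at the end of s
function drop_delim(s, cap)
  -- s:sub(1, -#cap - 1) keeps everything except the last #cap characters
  return s:sub(1, -#cap - 1), cap
end

head, delim = drop_delim("this fubar", "fubar")
print(#head, delim)   --> 5   fubar   (head is "this " with its trailing space)
```

Because the delimiter is always the tail of the outer match, trimming by length is enough; no second search is needed.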

- Nick Gammon

www.gammon.com.au, www.mushclient.com


Posted by Albert Chan (55 posts) Bio

Date

Reply #46 on Sat 27 Jan 2018 01:45 AM (UTC)

Amended on Sat 27 Jan 2018 03:11 AM (UTC) by Albert Chan

Message

To make the above even more general, you may want to use the
last capture, instead of the second capture, to decide how many characters to drop.


function drop(s, ...)
  local last = select(select('#', ...), ...)
  return s:sub(1, #s - #last), last
end

pat = re.compile( "{(g<- {%A+}/ .g)+} -> drop", {drop=drop})

= pat:match "this-is--a----test"
this-is--a
----
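
To see what select is doing here, the function can be called by hand. A small sketch, assuming the pattern passes the whole match first and then one capture per separator run (the argument list below is written out manually for illustration):

```lua
-- Same function as above, called directly with hand-written arguments.
function drop(s, ...)
  -- select('#', ...) is the number of extra arguments;
  -- select(n, ...) with n equal to that count returns only the last one.
  local last = select(select('#', ...), ...)
  return s:sub(1, #s - #last), last
end

head, sep = drop("this-is--a----", "-", "--", "----")
print(head, sep)   --> this-is--a   ----
```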


Posted by Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #47 on Sat 27 Jan 2018 01:51 AM (UTC)

Message

Would it necessarily be the last capture?

- Nick Gammon

www.gammon.com.au, www.mushclient.com


Posted by Albert Chan (55 posts) Bio

Date

Reply #48 on Sat 27 Jan 2018 02:13 AM (UTC)

Amended on Sat 27 Jan 2018 02:14 AM (UTC) by Albert Chan

Message

I read somewhere that captures are numbered in the order of their opening parentheses
(or opening braces, in the lpeg re case).

If that is true, the last capture is the last argument passed to drop.
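
For plain Lua patterns the ordering is easy to check. A short sketch (string library only; for LPeg, the manual says that when a simple capture encloses other captures, their values are returned after the enclosing one, which is why the separators arrive after the whole match in drop):

```lua
-- Captures come back in the order their opening parentheses appear,
-- so the outer group is first even though it closes last.
a, b, c = string.match("abc", "((a)(b))c")
print(a, b, c)   --> ab   a   b
```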


Posted by Albert Chan (55 posts) Bio

Date

Reply #49 on Sat 27 Jan 2018 05:31 AM (UTC)

Amended on Thu 08 Feb 2018 07:55 PM (UTC) by Albert Chan

Message

I patched lpeg so that lpeg.B(-n) now means n unconditional backtracks.

For convenience, the re pattern %b is lpeg.B(-1).
I can now do a true greedy match that is still tail recursive.


pat = re.compile "{.* %b^3 (g <- &'and' / %b g)}"

All my previous code was using multiple non-greedy matches to
simulate a greedy match. This re pattern does a true greedy match.

For long strings, it beats string.match in performance!
(correction: only if 'and' is close to the end of the string)

:-)
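
Since %b needs the patched lpeg, the difference between a greedy and a non-greedy search for 'and' is shown here with the standard string library instead. This is only an illustration of the two behaviours, not of the patched pattern itself:

```lua
s = "this and that and the rest"

-- Greedy: .* runs to the end and backtracks, so it stops at the LAST "and".
greedy = s:match("^(.*)and")
-- Non-greedy: .- grows from the start, so it stops at the FIRST "and".
lazy = s:match("^(.-)and")

print("[" .. greedy .. "]")   --> [this and that ]
print("[" .. lazy .. "]")     --> [this ]
```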


Posted by Albert Chan (55 posts) Bio

Date

Reply #50 on Sat 27 Jan 2018 09:47 PM (UTC)

Amended on Sun 28 Jan 2018 01:26 AM (UTC) by Albert Chan

Message

The drop function idea is very nice!
Lua's string.match has a tough time doing a greedy match ON the separator %A+


t = "this-is--a----test"
= string.match(t, "^(.*)(%A+)%a*$")      -- (.*) too greedy
this-is--a---
-

The Lua string library can do it, but only in multiple steps, which is slower and ugly.

lpeg has the nice property of letting us adjust how greedy the match is.


pat = re.compile("{(g <- {%A+}/ .g)+} -> drop", {drop=drop})


Posted by Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #51 on Sun 28 Jan 2018 04:06 AM (UTC)

Message

You can achieve that with a frontier pattern however:


a, b = string.match(t, "^(.*)(%f[%A]%A+%f[^%A])%a*$")
print (a)
print (b)

Results:


this-is--a
----

The frontier pattern asserts a change from "not in set" to "in set". Thus we detect the first hyphen. The inverse frontier pattern asserts the end of the hyphens.
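
A quick way to see where a frontier matches is to pair it with a position capture. A minimal sketch using only the string library, on the same test string:

```lua
t = "this-is--a----test"
starts = {}

-- "()" captures the current position; %f[%a] matches the empty string
-- wherever the previous character is not a letter and the next one is.
for pos in t:gmatch("()%f[%a]") do
  starts[#starts + 1] = pos
end

print(table.concat(starts, " "))   --> 1 6 10 15
```

Position 1 is included because the character "before" the start of the subject is treated as 0x00, which is not a letter.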

- Nick Gammon

www.gammon.com.au, www.mushclient.com


Posted by Albert Chan (55 posts) Bio

Date

Reply #52 on Sun 28 Jan 2018 12:54 PM (UTC)

Amended on Mon 29 Jan 2018 12:31 AM (UTC) by Albert Chan

Message

Wow, Lua has a %f pattern?
It almost reaches the performance of the lpeg drop version!
With a tiny change, the Lua pattern even beats lpeg (again!)


benchmark: "this-is--a----test" -> "this-is--a", "----"

time(us)  match function  pattern
5.66      string.match    "(.*)%f[%A](%A+)"
7.20      string.match    "^(.*)(%f[%A]%A+%f[%a])%a*$"

6.57      lpeg.match      "{(g <- {%A+} / %a+ g)+} -> drop ", {drop = drop}

Sadly, lua %f is undocumented and buggy (on luajit 1.1.8 anyway)
= string.match("----whatever", "(.*)%f[%A](%A+)")
nil
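
The post does not say how the timings were taken. One way to get comparable per-call numbers is sketched below; the helper name time_us and the iteration count are assumptions, and os.clock measures CPU time, so absolute figures will differ from the table:

```lua
-- Hypothetical timing helper: average microseconds per call of f(...).
function time_us(f, n, ...)
  local start = os.clock()
  for _ = 1, n do
    f(...)
  end
  return (os.clock() - start) / n * 1e6
end

t = "this-is--a----test"
us = time_us(string.match, 100000, t, "(.*)%f[%A](%A+)")
print(string.format("%.2f us per call", us))
```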


Posted by Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #53 on Sun 28 Jan 2018 08:57 PM (UTC)

Amended on Sun 28 Jan 2018 09:06 PM (UTC) by Nick Gammon

Message

I've documented it here: http://www.gammon.com.au/scripts/doc.php?lua=string.find

And here: https://www.gammon.com.au/forum/?id=6034&reply=3#reply3

And here: https://www.gammon.com.au/forum/?id=6034&reply=8#reply8

Quote:

Sadly, lua %f is undocumented and buggy (on luajit 1.1.8 anyway)
= string.match("----whatever", "(.*)%f[%A](%A+)")
nil

I think this is behaving as intended. Looking at the source, a pattern match is doing this:


      previous = (s == ms->src_init) ? '\0' : *(s-1);
      if (matchbracketclass(uchar(previous), p, ep-1) ||
         !matchbracketclass(uchar(*s), p, ep-1)) return NULL;

In other words, it is testing for a transition from "not in set" (the previous character) to "in set" (the current character).

Also, the first line says that if we are at the start of the subject then we replace the previous character (which doesn't exist) with 0x00.

So since you specified %f[%A] then "not in set" would be an alphabetic character. And 0x00 is not alphabetic, thus it fails that test.

If you turn the logic around then it works:


print (string.match("whatever----", "(.*)%f[%a](%a+)"))  --> "" (empty first capture) and "whatever"

Now we are looking for a transition to a letter from "not a letter" and 0x00 is not a letter.
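
The C test can be modelled in a few lines of Lua to try individual positions. This is a rough sketch, not the library's code: frontier_at is a hypothetical helper, and set is a bracket class written as a Lua pattern.

```lua
-- Rough Lua model of the frontier test (illustration only).
function frontier_at(s, i, set)
  -- Before the first character, the "previous" character is 0x00.
  local prev = (i == 1) and "\0" or s:sub(i - 1, i - 1)
  -- Past the last character, C sees the string's terminating 0x00.
  local cur = (i > #s) and "\0" or s:sub(i, i)
  return prev:find(set) == nil and cur:find(set) ~= nil
end

print(frontier_at("----whatever", 1, "[%A]"))   --> false (0x00 is already in the set)
print(frontier_at("whatever----", 1, "[%a]"))   --> true  (0x00 is not a letter)
print(frontier_at("whatever----", 9, "[%A]"))   --> true  (letter, then hyphen)
```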

- Nick Gammon

www.gammon.com.au, www.mushclient.com


Posted by Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #54 on Sun 28 Jan 2018 09:04 PM (UTC)

Message

You can make your pattern work by allowing for the 0x00:


print (string.match("----whatever", "(.*)%f[^%a%z](%A+)"))

Rather than using [%A] I used [^%a%z] which will allow for the 0x00 character.
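
Putting the two patterns side by side on the same subject shows the effect. A small check using only the string library (note that %z is marked deprecated from Lua 5.2 onwards, although it is still accepted):

```lua
s = "----whatever"

-- With [%A] the 0x00 before the subject is already "in set", so no frontier.
print(string.match(s, "(.*)%f[%A](%A+)"))      --> nil

-- With [^%a%z] the 0x00 is excluded from the set, so position 1 qualifies.
a, b = string.match(s, "(.*)%f[^%a%z](%A+)")
print(#a, b)                                   --> 0   ----
```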

- Nick Gammon

www.gammon.com.au, www.mushclient.com


Posted by Albert Chan (55 posts) Bio

Date

Reply #55 on Sun 28 Jan 2018 09:21 PM (UTC)

Amended on Mon 29 Jan 2018 10:23 AM (UTC) by Albert Chan

Message

The %f[^%a%z] trick works!
Parsing the class range above takes time, however, which hurts performance.
Maybe use pre-built class ranges?

= lpeg.pcode( re.compile "[^%a\x00]" ) -- with my patched lpeg
[]
00: set [\x01-\x40\x5b-\x60\x7b-\xff] --> just cut and paste
09: end

benchmark: "this-is--a----test" -> "this-is--a", "----"

time(us)  match function  pattern
4.65      string.match    "(.*)%f[\x01-\x40\x5b-\x60\x7b-\xff](%A+)"
6.67      string.match    "(.*)%f[^%a%z](%A+)"
8.26      string.match    "^(.*)(%f[^%a%z]%A+%f[%a])%a*$"

6.57      lpeg.match      "{(g <- {%A+} / %a+ g)+} -> drop ", {drop = drop}
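
Before swapping in the explicit ranges, it is worth confirming that they select the same bytes as [^%a%z]. A minimal sketch, assuming the default "C" locale (in other locales %a may include bytes above 0x7f); decimal escapes are used so that it also loads on Lua 5.1:

```lua
-- Compare the two frontier sets byte by byte.
class_a = "[^%a%z]"
class_b = "[\1-\64\91-\96\123-\255]"   -- same ranges as \x01-\x40 \x5b-\x60 \x7b-\xff

mismatches = 0
for byte = 0, 255 do
  local ch = string.char(byte)
  local in_a = ch:find(class_a) ~= nil
  local in_b = ch:find(class_b) ~= nil
  if in_a ~= in_b then
    mismatches = mismatches + 1
  end
end

print(mismatches)   --> 0 in the "C" locale
```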


Posted by Albert Chan (55 posts) Bio

Date

Reply #56 on Thu 08 Feb 2018 06:58 PM (UTC)

Amended on Thu 08 Feb 2018 11:26 PM (UTC) by Albert Chan

Message

I discovered an lpeg re pattern that beats string.match, even with the pre-built frontier pattern class.

The trick is to use the frontier pattern idea in lpeg (backtrack before %A+).

benchmark: "this-is--a----test" -> "this-is--a", "----"

3.63       pat = re.compile "{ %a* (%A+ %a* &%A)* } {%A+}"

The same idea can be used for the previous "(.*)and(.*)" pattern, without the drop function:


pat = re.compile("{ >&%z (%z >&%z)* } %z {.*}", {z='and'})

note: ">&'and' " is shorthand for "(g <- &'and' / .[^a]* g)"
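
For readers without the patched lpeg, the shorthand can be expanded by hand. The sketch below assumes stock LPeg with its re module installed, writes 'and' literally instead of using %z, and simply reports if the module cannot be loaded:

```lua
-- Hand expansion of the ">" shorthand (assumes stock LPeg and re).
ok, re = pcall(require, "re")

if ok then
  -- Each (g <- &'and' / . [^a]* g) skips ahead until 'and' is next,
  -- without consuming it. The loop then steps over every 'and' that
  -- is followed by another one, leaving the last 'and' as the split.
  pat = re.compile [[
    { (g <- &'and' / . [^a]* g) ('and' (g <- &'and' / . [^a]* g))* }
    'and' {.*}
  ]]
  head, tail = pat:match("this and that and the rest")
  print("[" .. head .. "]")   --> [this and that ]
  print("[" .. tail .. "]")   --> [ the rest]
else
  print("LPeg/re not available; nothing to run")
end
```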


Posted by Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #57 on Fri 09 Feb 2018 02:53 AM (UTC)

Message


pat = re.compile("{ >&%z (%z >&%z)* } %z {.*}", {z='and'})

That doesn't compile for me. The symbol ">" is not defined in the re documentation.

- Nick Gammon

www.gammon.com.au, www.mushclient.com


Posted by Albert Chan (55 posts) Bio

Date

Reply #58 on Fri 09 Feb 2018 03:03 AM (UTC)

Amended on Fri 09 Feb 2018 06:38 PM (UTC) by Albert Chan

Message

'>' is just shorthand for the grammar (g <- patt / . [^head-chars-of-patt]* g)

It needs a search and replace (see the note in reply 56).


Posted by Albert Chan (55 posts) Bio

Date

Reply #59 on Fri 09 Feb 2018 06:37 PM (UTC)

Amended on Fri 09 Feb 2018 06:40 PM (UTC) by Albert Chan

Message

The above code (reply 57) will compile with my patched lpeg, which does the expansion automatically.

I just posted the source on GitHub (comments welcome):

https://github.com/achan001/LPeg-anywhere


The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).
