Gammon Forum : Programming : General : lpeg code translate to lpeg re

Entire forum ➜ Programming ➜ General ➜ lpeg code translate to lpeg re

lpeg code translate to lpeg re

It is now over 60 days since the last post. This thread is closed. Refresh page

Pages: 1 2 3 4 5

Posted by Albert Chan (55 posts) Bio

Date

Reply #15 on Sun 21 Jan 2018 07:02 AM (UTC)

Amended on Mon 22 Jan 2018 06:41 AM (UTC) by Nick Gammon

Message

redoing your benchmark on my machine, my version is slightly faster.
I guess it depends on many factors, machine setups ...

i now have lpeg re pattern that return without the extra "and"

lpeg re pattern :


"{ g <- . g / &'and' } 'and' {.*}"

Top

Posted by Albert Chan (55 posts) Bio

Date

Reply #16 on Sun 21 Jan 2018 01:37 PM (UTC)

Amended on Mon 22 Jan 2018 06:41 AM (UTC) by Nick Gammon

Message

To summarize what I have learn so far, upto function have 2 versions:


local C, P, V = lpeg.C, lpeg.P, lpeg.V

function upto_first(what)            -- non-greedy version
  return C( (-P(what) * 1)^0 ) * P(what)
end

function upto_last(what)             -- greedy version
  return C {1 * V(1) + #P(what)} * P(what)
end

Top

Posted by Albert Chan (55 posts) Bio

Date

Reply #17 on Sun 21 Jan 2018 05:26 PM (UTC)

Amended on Tue 23 Jan 2018 07:44 PM (UTC) by Albert Chan

Message

this is a response to reply #3: lpeg upto version in lpeg re:


function upto(what)    -- non-greedy match
  return string.format( [[ {(! %q .)*} %q ]], what, what)
end

pat = re.compile( upto('and') .. "{.*}" )


= pat:match("this and that and whatever")
this
that and whatever

Top

Posted by Albert Chan (55 posts) Bio

Date

Reply #18 on Mon 22 Jan 2018 01:33 AM (UTC)

Amended on Mon 22 Jan 2018 06:43 AM (UTC) by Nick Gammon

Message

discovered a way to combine reply #16 2 upto functions as one:


function upto(what, greedy)
    local t1, t2 = #P(what), 1 * V(1)
    if greedy then t1, t2 = t2, t1 end
    return C {t1 + t2} * P(what)
end

function all() return C(P(1)^0) end


text = 'this or that of whatever'

= (upto('and', false) * all()):match(text)    -- lua pattern "(.-)and(.*)
this
that and whatever

= (upto('and', true) * all()):match(text)     -- lua pattern "(.*)and(.*)
this and that
whatever

Top

Posted by Albert Chan (55 posts) Bio

Date

Reply #19 on Mon 22 Jan 2018 04:00 PM (UTC)

Amended on Tue 23 Jan 2018 07:45 PM (UTC) by Albert Chan

Message

My previous upto have problem with long strings in greedy mode.

= upto('and', true):match( ('*and'):rep(50) )
backtrack stack overflow (current limit is 400)
...

Is this a lpeg bug ? only handle max string length of 199 ?
Above example only need 3 backtracks !

below is same upto function that avoid backtrack overflow:


function upto(what, greedy)
    what = P(what)
    local head = 1 - what
    if greedy then head = what * -(head^0 * -1) + head^1 end
    return C(head^0) * what
end

Top

Posted by

Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #20 on Mon 22 Jan 2018 09:15 PM (UTC)

Message

See LPEG reference

Quote:

lpeg.setmaxstack (max)

Sets a limit for the size of the backtrack stack used by LPeg to track calls and choices. (The default limit is 400.) Most well-written patterns need little backtrack levels and therefore you seldom need to change this limit; before changing it you should try to rewrite your pattern to avoid the need for extra space. Nevertheless, a few useful patterns may overflow. Also, with recursive grammars, subjects with deep recursion may also need larger limits.

You have a recursing grammar, so this can be an issue for you.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Top

Posted by

Albert Chan (55 posts) Bio

Date

Reply #21 on Mon 22 Jan 2018 10:05 PM (UTC)

Amended on Mon 22 Jan 2018 10:27 PM (UTC) by Albert Chan

Message

retry recursive upto_last with bigger stack indeed solve the problem

lpeg.setmaxstack(201 * 2) -- max string length 200 ???

= upto_last('and'):match( ('*and'):rep(50) )
*and*and*and ...

Top

Posted by Albert Chan (55 posts) Bio

Date

Reply #22 on Tue 23 Jan 2018 03:15 PM (UTC)

Amended on Tue 23 Jan 2018 07:48 PM (UTC) by Albert Chan

Message

the benchmark numbers in reply 12 might be flawed

my re pattern looks for 'and' from end-of-string, whereas your re grammar start from the beginning.
The timings will be affected by location of last 'and'

you might want to try benchmark with this non-backtract pattern:
(this new re pattern is fast and can handle long strings)


pat = re.compile "{g <- &('and' (!'and' .)* !.) / .g} 'and' {.*}"

= pat:match "this and that and whatever"
this and that
whatever

Top

Posted by Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date Reply #23 on Tue 23 Jan 2018 10:29 PM (UTC)

Message

I added a table capture ( "{| ... |}" ) to your pattern to I could check more accurately what the matches were. For the first match (with the two "and"s) your pattern is still slower. Perhaps if you post exactly how you are timing yours?


====================
Testing: foo and bar and whatever
----------
Nick
1="foo and bar "
2=" whatever"
Time taken = 11.454 us
----------
Albert
1="foo and bar "
2=" whatever"
Time taken = 43.860 us
====================
Testing: foo and bar
----------
Nick
1="foo "
2=" bar"
Time taken = 5.029 us
----------
Albert
1="foo "
2=" bar"
Time taken = 4.749 us
====================
Testing: XandY
----------
Nick
1="X"
2="Y"
Time taken = 4.749 us
----------
Albert
1="X"
2="Y"
Time taken = 5.587 us
====================
Testing: foo
----------
Nick
no match
Time taken = 3.073 us
----------
Albert
no match
Time taken = 3.632 us
====================
Testing: Xand
----------
Nick
1="X"
2=""
Time taken = 4.749 us
----------
Albert
1="X"
2=""
Time taken = 4.470 us
====================
Testing: andY
----------
Nick
1=""
2="Y"
Time taken = 5.029 us
----------
Albert
1=""
2="Y"
Time taken = 6.425 us
====================
Testing: and
----------
Nick
1=""
2=""
Time taken = 4.470 us
----------
Albert
1=""
2=""
Time taken = 4.749 us
====================
Testing: 
----------
Nick
no match
Time taken = 3.073 us
----------
Albert
no match
Time taken = 3.352 us

I tried averaging out the results over 20 trials, and removing the table capture. This time I get somewhat faster results:


====================
Testing: foo and bar and whatever
----------
Nick
foo and bar 
 whatever
Time taken = 1.355 us
----------
Albert
foo and bar 
 whatever
Time taken = 2.500 us
====================
Testing: foo and bar
----------
Nick
foo 
 bar
Time taken = 0.726 us
----------
Albert
foo 
 bar
Time taken = 0.629 us
====================
Testing: XandY
----------
Nick
X
Y
Time taken = 0.615 us
----------
Albert
X
Y
Time taken = 0.545 us
====================
Testing: foo
----------
Nick
no match
Time taken = 0.461 us
----------
Albert
no match
Time taken = 0.405 us
====================
Testing: Xand
----------
Nick
X

Time taken = 0.601 us
----------
Albert
X

Time taken = 0.531 us
====================
Testing: andY
----------
Nick

Y
Time taken = 0.629 us
----------
Albert

Y
Time taken = 0.573 us
====================
Testing: and
----------
Nick


Time taken = 0.824 us
----------
Albert


Time taken = 1.257 us
====================
Testing: 
----------
Nick
no match
Time taken = 0.475 us
----------
Albert
no match
Time taken = 0.405 us

(Time is average time).

Code to reproduce:


require "re"
require "tprint"
local ITERATIONS = 20
local result1, result2

print (string.rep ("*", 80))

c = re.compile [[
   parse     <- {noDelim} lastDelim  -- look for all up to the last delimiter followed by the last part
   delim     <- 'and'                      -- our delimiter
   noDelim   <- (!lastDelim .)*            -- zero or more characters without the last delimiter
   lastDelim <- delim {(!delim .)*} !.     -- the delimiter without any more delimiters and then end of subject
]]

pat = re.compile "{g <- &('and' (!'and' .)* !.) / .g} 'and' {.*}"  -- Albert Chan pattern

function showResults (result1, result2, start, finish)
  if not result1 then
    print ("no match")
  else
    if type (result1) == 'table' then
      tprint (result1)
    else
      print (result1)
      print (result2)
    end -- if
  end -- if
  print (string.format ("Time taken = %0.3f us", ((finish - start) / ITERATIONS) * 1e6))
end -- showResults 

function test (which)

  print (string.rep ("=", 20))
  print ("Testing:", which)

  print (string.rep ("-", 10))
  print "Nick"

  start = utils.timer ()
  for i = 1, ITERATIONS do
    result1, result2 = lpeg.match (c, which)
  end  -- for
  finish = utils.timer ()

  showResults (result1, result2, start, finish)

  print (string.rep ("-", 10))
  print "Albert"

  start = utils.timer ()
  for i = 1, ITERATIONS do
    result1, result2 = lpeg.match (pat, which)
  end -- for
  finish = utils.timer ()

  showResults (result1, result2, start, finish)

end -- test

tests = {
 "foo and bar and whatever",
 "foo and bar",
 "XandY",
 "foo",
 "Xand",
 "andY",
 "and",
 "",
}

for _, v in ipairs (tests) do
  test (v)
end -- for

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Top

Posted by Albert Chan (55 posts) Bio

Date

Reply #24 on Tue 23 Jan 2018 11:27 PM (UTC)

Amended on Wed 24 Jan 2018 02:19 AM (UTC) by Albert Chan

Message

This was unexpected. Multiple returns benchmark suggest both patterns are similar.

my machine is pentium 3 866Mhz, WinXP, luajit 1.1.8, lpeg 1.0.1


"foo and bar and whatever"
                        my pentium        your machine
iterations              1 million            20
time in us           Albert   Nick       Albert    Nick
return tables         14.4    15.9        43.9     11.5
return value           6.7     8.2         2.5      1.4
cost of table          7.7     7.7        41.4     10.1

i have since discovered a version without lookahead,
but for your really short string test, the timings is likely worse.
Algorithmetically this is faster, but there is an extra split call


pat = re.compile(
   "(g <- 'and' / .g)+ => split",
   { split = function(s,i) return true, s:sub(1, i-4), s:sub(i) end }
)

Top

Posted by Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date Reply #25 on Wed 24 Jan 2018 01:05 AM (UTC)

Message

Good questions. I've modified the test to summarize results.

With 100 iterations, and my test first:


Test                               Nick   Albert
------------------------------------------------
foo and bar and whatever          0.933    1.011
foo and bar                       0.629    0.548
XandY                             0.799    0.494
foo                               0.363    0.341
Xand                              0.673    0.886
andY                              0.503    0.615
and                               0.483    0.455
                                  0.413    0.467

And your test first:


Test                             Albert     Nick
------------------------------------------------
foo and bar and whatever          1.109    0.914
foo and bar                       0.556    0.626
XandY                             0.489    0.525
foo                               0.352    0.369
Xand                              0.478    0.511
andY                              0.478    0.520
and                               0.472    0.503
                                  0.338    0.344

Overall it seems faster the more iterations in testing, however it seems that for the ones where my grammar is faster, it is still faster, and vice-versa.

The differences seem less over more iterations.

With your newest version without lookahead:


Test                               Nick   Albert
------------------------------------------------
foo and bar and whatever          0.967    1.193
foo and bar                       0.584    0.765
XandY                             0.489    0.701
foo                               0.369    0.346
Xand                              0.106    0.735
andY                              0.517    1.137
and                               0.508    0.729
                                  0.508    0.335

I suspect the differences may be in how you are testing. Are you using MUSHclient or running a test using a standalone program? If you are using C or C++ with a more modern compiler than what I have, that might alter the results.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Top

Posted by

Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #26 on Wed 24 Jan 2018 01:05 AM (UTC)

Message

I see you are using luajit which in itself will probably give somewhat different results to me because of the way it works.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Top

Posted by Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date Reply #27 on Wed 24 Jan 2018 01:10 AM (UTC)

Message

Testing with a longer string:


string.rep ("c", 1000) .. " and " .. string.rep ("a", 1000) .. " and " .. string.rep ("b", 1000)

I get timings:


Nick: 49.353   Albert: 22.584

Yours seems better for longer strings.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Top

Posted by Albert Chan (55 posts) Bio

Date

Reply #28 on Wed 24 Jan 2018 02:13 AM (UTC)

Amended on Wed 24 Jan 2018 02:40 AM (UTC) by Albert Chan

Message

My re patterns with split does no lookahead, but has a bigger big O constant
Bypassing lpeg match-time-capture overhead might make it faster:


do
    local sub, match = string.sub, lpeg.match
    local pat = re.compile "(g <- 'and' / .g)+"
    function split_last_and(text)
        local i = match(pat, text)
        if i==nil then return nil end
        return sub(text, 1, i-4), sub(text, i)
    end
end

= split_last_and "this and that and whatever"
this and that
whatever

Top

Posted by Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date Reply #29 on Wed 24 Jan 2018 04:21 AM (UTC)

Message

I'm curious to know how your grammar works so well without overflowing the recursion stack. I have to assume that the grammar:


(g <- 'and' / . g)+

... does tail-recursion so that effectively turns the recursion into an iteration. Do you agree?

(In other words, the right-recursion into g can be replaced by a tail call, which simply goes to the start of g without adding to the call stack).

In assembler the equivalent would be:


foo     equ  *   # start of foo
# do stuff, and under certain conditions:
        jsr foo  # call foo again
        ret      # leave foo

This can be replaced by:


foo     equ  *   # start of foo
# do stuff, and under certain conditions:
        jmp foo  # call foo again

The first version adds to the call stack for each recursion, the second version does not, but is functionally equivalent.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Top

The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).

To show them in your local time you can join the forum, and then set the 'time correction' field in your profile to the number of hours difference between your location and UTC time.

216,142 views.

This is page 2, subject is 5 pages long: 1 2 3 4 5

It is now over 60 days since the last post. This thread is closed. Refresh page

Go to topic: Search the forum

top