[Home] [Downloads] [Search] [Help/forum]


Register forum user name Search FAQ

Gammon Forum

Notice: Any messages purporting to come from this site telling you that your password has expired, or that you need to "verify" your details, making threats, or asking for money, are spam. We do not email users with any such messages. If you have lost your password you can obtain a new one by using the password reset link.
[Folder]  Entire forum
-> [Folder]  MUSHclient
. -> [Folder]  General
. . -> [Subject]  RegExp VM

RegExp VM

It is now over 60 days since the last post. This thread is closed.     [Refresh] Refresh page


Pages: 1 2  3  

Posted by Shadowfyr   USA  (1,787 posts)  [Biography] bio
Date Sat 24 Apr 2004 06:04 PM (UTC)

Amended on Sat 24 Apr 2004 07:15 PM (UTC) by Shadowfyr

Message
I am breaking this off from the thread "General: Triggers across lines", since it is getting a tad insane over there. lol

Having had a bit of a sleepless night, where I kept running to the computer having remembered something I forgot, I eventually came up with this:

VM operation/parts:

CodeArray - Holds the array of 'commands' that get executed.
RegExpArray - Holds the chunks of RegExp the preparser seperates out.
PC - Integer holding the 'program counter' that keeps track of the current instruction to be executed.

The PC always starts at 0 initially, but increases by 1 for each successful RegExp match. When such a match happens, the VM returns with a TRUE value and whatever wildcards where generated for the line. The next time the trigger is tested, it starts execution and the 'current' PC or if the PC is now greater than the number of instructions, it is reset to 0 and testing starts over from there.

Instructions and logic:

test RegExpPtr
  if *matched* then
    PC = PC + 1
    return TRUE
  else
    PC = 0
    return FALSE
  end if

opt RegExpPtr, <bailout>
  if *matched* then
    if <bailout> then
      PC = 0
    else
      PC = PC + 1
    end if
    return TRUE
  else
    PC = PC + 1
  end if

rst
  count = 0

lp <max>, <AddrChng>, RegExpPtr
  if RegExpPtr <> 0 then
    if *matched* then
      PC = PC + 1
      return TRUE
    else
      count = count + 1
      if count > <max> and <max> > 0 then
        PC = 0
        return FALSE
      else
        PC = PC + <AddrChng>
      end if
    end if
  else
    count = count + 1
    if count > <max> then
      PC = 0
    else
      PC = PC + <AddrChng>
    end if
  end if

Examples of how the pre-parser would build the execution code for these:

Example 1 "^You see\:$(.*\n){1,10}^$">

test 1
rst
opt 2, 0
lp 10, -1, 3

Example 2 "^Fred says: .*">

test 1

Example 3 "^You see\:$(.*\n){1,}^$">

test 1
test 2
opt 2, 0
test 3

Example 4 "^You see\:$(.*\n){5,}^$">

test 1
rst
test 2
lp 5, -1, 0
opt 2, 0
test 3

Example 5 "^You see\:$(.*\n){5,10}^$">

test 1
rst
test 2
lp 5, -1, 0
rst
opt 2, 0
lp 10, -1, 3

Example 6 "^You see\:$(.*\n){,10}">

test 1
rst
opt 2, 0
lp 10, -1, 0

Example 7 "^You see\:$(Nothing\n)|(.*\n){1,}">

test 1
opt 2, 1
opt 3, 0
lp 0, -2, 0

Now.. If I could just find someone to design the needed pre-parser. lol But this does do both what you want Nick and follows the actual syntax, without gludging extra settings and a look-back buffer on to make it work, which is what I don't like. Unfortunately designing the pre-parser is much more complicated when you have to deal with situations like example 7. :( The rest are relatively straight forward.

Oh.. And I also considered using a set of 'opcodes' that included "test, inc, cmp, beg, bne, lp, ret <boolean>, rst", but realized that this wasn't really necessary, since all the logic really needed could be incorperated directly into the test, opt, inc and lp functions themselves.
[Go to top] top

Posted by Nick Gammon   Australia  (23,045 posts)  [Biography] bio   Forum Administrator
Date Reply #1 on Sat 24 Apr 2004 09:26 PM (UTC)
Message
I know you are trying to solve a problem here, but I don't think writing our own regexp machine will be a great idea. Apart from the "reinventing the wheel" aspect, I see from the PCRE changelog that the first documented item was in September 1997, and the most recent one in May 2003. That is almost 6 years of development to get a nice, stable, powerful routine.

The problem is, the only thing I think you really solve is not having to "count the lines". Other issues, like omitting from log, or output, would still apply, unless you are going to defer *all* trigger processing if a multi-line trigger happens to be active.

For the trivial case of (say):

You see Nick\nHe is hungry.

Sure, I can count the \n characters and automatically compute 2 as the number of lines, but as soon as it changes to:

You see (Nick\n)*He is hungry.

... I can't. I don't know how many times the \n will repeat.

Then if you want to write a regexp parser that takes streaming text "on the fly" then you have the problem it may never terminate. Here is an example:

(?s)^You see:(.*)

The (?s) says to consider the dot character to include newlines.

Now since this never terminates, it could absorb 1000 lines of output and still be going strong. With my approach you at least have to put a "limiting count". Say you chose 20, then it terminates after 20 lines, no matter what is in the output.


- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Nick Gammon   Australia  (23,045 posts)  [Biography] bio   Forum Administrator
Date Reply #2 on Sat 24 Apr 2004 09:28 PM (UTC)
Message
If anyone is looking for it, the previous thread was Triggers across lines.

- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Shadowfyr   USA  (1,787 posts)  [Biography] bio
Date Reply #3 on Sat 24 Apr 2004 11:22 PM (UTC)
Message
Umm. I am not looking to replace the PCRE. PCRE simply isn't able to deal with text on the fly, which is why you are messing with a look-back in the first place. As far as normal regexp triggers are concerned there is 'no' change, it just makes the call with the current line and the only regexp segment the VM can retrieve, instead of calling it directly.

For triggers that match multiple lines, each *line* gets split out and treated as a seperate call to the existing PCRE system, the only difference is the breaking up of stuff that matches on multiple lines into a set of smaller regular expression that don't need to test all 10, 20, 50, 5000 or how ever many lines they match in order to succeed. Each pass through the triggers functions as though the trigger was only looking for 'that specific line' which the VM knows it needs to test. Yours can only do as many lines as you ask, since by definition, it is limited to the number of lines in the look-back buffer you can test against, the bigger the buffer, the more overhead and the slower the client gets. This isn't practical.

This doesn't replace the PRCE code you already use, it just does a bit of prework, so you can get around the fact that the mud text isn't a static file, while yours 'forces' it to be static, but cannot work unless the number of lines is >= to the number you tell it to expect. Mine adds a tiny bit of everhead to every regexp as they are tested, lets just say 10 clock cycles to make a guess. Assuming that testing each line in yours also took 10 cycles, yours automatically adds 10 * N, where N is the number of lines in the buffer it has to recheck every single time the trigger is tested.

The clear difference is in how the data is treated. From the perspective of users, mine works like any other trigger. Each time it matches something it make a call to the script engine or whatever. Yes, as you pointed out this does mean you have to deal with the data 'as it arrives' and store it in some cases, but that is the way people already do it. Their is no real reason why you couldn't, instead of testing the buffer every single time, just add to one that would get returned as the entire matched text. Then you get the same result as yours, but people can *choose* if they want to deal with the data when it arrives, store it themselves or use a split command to pull the stuff out.

In most cases I wouldn't even care what the entire matched text was. I am probably looking for specific information on each line and I want to deal with them when they happen, not after the fact, when I may not even have a clue which wilcard the 55th item is even in, given a complex case where there may be say 4 different types of lines, each with 2-4 wildcards, depending on the line matched and maybe in the worst case, no way to check the wildcards themselves to 'guess' at which ones came from what line.

Example:

cloves
shield of cloves
10 torches
quiver, with 10 arrows

All these 'could' concievably exist in the same inventory, they all require radically different wild cards to match and it is possible that another item called just 'shield' could end up in there. Exactly which out of the 9 wildcards returned by:

((shield) of (.*)|(\d+) (.*)|(.*), with (\d+) (.*)|(.*))*

would each of the resulting 9 wildcards (counting an item called shield in there) belongs to which line? There is no way to tell without paring the entire set of lines in the script all over again, which completely defeats the purpose of using wildcards in the first place to try to seperate them out. (though this does suggest a minor flaw in my opt function, it needs to jump to the lp line, instead of continuing with the next segment...) Now it may be stupid to even use such a wildcard system in the first place, but mine would do it right, since it doesn't wait to find every single line before passing the wildcards on to be dealt with by a script or anything else you may use them for. My concept also doesn't have the line coloring or omit from output issues.
[Go to top] top

Posted by Nick Gammon   Australia  (23,045 posts)  [Biography] bio   Forum Administrator
Date Reply #4 on Sun 25 Apr 2004 12:03 AM (UTC)
Message
OK, so you want to break up a regexp (as a pre-processing step) into mini-regexps, and apply them on the fly to lines as they arrive? Ingenious, certainly, but I'm not sure the boundary will always be clear.

I don't see how it solves the colouring (etc.) problems, because you still don't know until you get the last line if the whole thing will match or not.

eg.

You have
.*
.*
.*
in your inventory.

Now until we get "in your inventory" we don't know if the expression is going to match, and thus we can't start colouring the first line. Meanwhile, another trigger might have omitted it.

Also, the regexp parser has things like recursion, look-back, negative assertions etc., that don't fit neatly into the "let's decide to move on a line" concept.


- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Shadowfyr   USA  (1,787 posts)  [Biography] bio
Date Reply #5 on Sun 25 Apr 2004 12:35 AM (UTC)

Amended on Sun 25 Apr 2004 12:38 AM (UTC) by Shadowfyr

Message
Hmm.. Yeah. On second thought you are right. It doesn't exactly fix the issue with those things.

As for recursion.. If it is recursion within a line, then that wouldn't require spliting, same with negative assertions. The division step is for dividing them into logical groups that can be parsed seperately. I doubt recursion would occur across lines, or maybe I just don't know what you mean by that in this case. Look-backs... Is an issue (maybe), but its one of those things that again, doesn't exactly lend itself to non-static information. But you are right, one reason I am reluctant to even attempt to design the code to break one into pieces is because I am afraid I would miss some odd, but critical instance where it needs to be divided, but doesn't happen. I have trouble everytime I need a binary search and can't find an old copy of one I created previously, this is a lot more complicated. lol
[Go to top] top

Posted by Nick Gammon   Australia  (23,045 posts)  [Biography] bio   Forum Administrator
Date Reply #6 on Sun 25 Apr 2004 12:38 AM (UTC)
Message
I think I see the difference in what we are trying to do.

You are designing a "state machine" - effectively like the existing system of multiple triggers (say for doing a stat roller, or catching an inventory), where you have something like:

state A
state B
state C
done

Thus, your idea is that this can be expressed along these lines:

state A$state B$state C

The trigger system would break that up into 3 "states" (effectively one trigger that transitions from A to B to C when necessary) and gradually matches incoming text.

Basically like you currently do with multiple triggers that enable and disable each other.

This is fine, and probably workable - although there are still lingering issues about whether or not you colour line A, however in your system once state A matches you probably go ahead and colour, omit, or whatever you plan to do.

My system is more of "take a batch of lines as a block and apply a regular expression to it to see if it matches".

I think my system is conceivably more flexible for obscure cases (eg. if the "overall match" depends on whether line C arrives or not), however possibly slower as it has to try rematching every line. Mind you, the regexp system should quickly exit if it can't find a match on the start of the block.

Assuming I am right about this, I still like the idea of the new system (especially as it is now coded). It makes minimal changes to the existing behaviour, and does give you an easy way of doing things like stat rollers.

Any problems with speed can be managed by simply disabling the multi-line trigger until you think it is needed. For instance, keep it disabled until you do an inventory list, or disable it after stats are rolled, or whatever.

- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Nick Gammon   Australia  (23,045 posts)  [Biography] bio   Forum Administrator
Date Reply #7 on Sun 25 Apr 2004 12:41 AM (UTC)
Message
I haven't actually attempted a recursive regexp, but if you read Regular expressions supported by PCRE they give examples of using recursion to match indefinitely nested brackets (something that probably won't happen on a MUD).

- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Shadowfyr   USA  (1,787 posts)  [Biography] bio
Date Reply #8 on Sun 25 Apr 2004 12:53 AM (UTC)

Amended on Sun 25 Apr 2004 12:59 AM (UTC) by Shadowfyr

Message
Yep. Basically what I mean. I don't really agree entirely with your definition of yours "It makes minimal changes to the existing behaviour". I think that it breaks that behaviour in that you have to give it more information than normal triggers need to successfully process the same lines. The fact that it is also, due to this same issue, not directly translatable from other clients or systems that use such triggers, but don't need the extra "make sure you eat all your peas" step of telling it how many lines to expect, I see as a minor annoyance at best and a complete pain in some nastier cases. For example, a trigger that could concievably match output from another prior list if it looks 'too far' back. Ok, this probably means the original trigger was really badly designed, but it could happen in cases where you have 5-6 variable length lists, you are only capturing the actual items lines in the trigger, there is no ovbvious "begin with this line" part to the trigger and you end up needing to test between 3 to 100 lines in the last list.

I think both our designs probably do have obscure cases like this which will make them fail miserably. Or at least I know there are ones your will fail at and suspect mine may have a few. I don't know quite enough about what could go wrong with regexp to be sure in my case. Maybe at some point I can build such a state machine that works and we can see which one blows up. lol For now.. I suppose yours does well enough, though its one flaw is that most of the cases people have asked about multiline are ones you "want" to leave turned on. The fact that these are 2-3 lines total needed to match would likely to not be too far off what my own design would cause in added overhead, so.... I still don't like it, but for now, I guess I can live with it. lol
[Go to top] top

Posted by Shadowfyr   USA  (1,787 posts)  [Biography] bio
Date Reply #9 on Sun 25 Apr 2004 12:57 AM (UTC)
Message
Hmm. To tell you the truth.. Recursion in regexp scares me. lol I tend to agree, it doesn't look like something you can expect to have pop up a lot on muds either.
[Go to top] top

Posted by Nick Gammon   Australia  (23,045 posts)  [Biography] bio   Forum Administrator
Date Reply #10 on Sun 25 Apr 2004 01:38 AM (UTC)
Message
Quote:

For example, a trigger that could concievably match output from another prior list if it looks 'too far' back.


Yes, I found that, but using \z at the end of the regexp "anchors" it to the tail-end of the text block. Thus it will always find the most recent.

- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Nick Gammon   Australia  (23,045 posts)  [Biography] bio   Forum Administrator
Date Reply #11 on Sun 25 Apr 2004 01:47 AM (UTC)
Message
Quote:

I think that it breaks that behaviour in that you have to give it more information than normal triggers need to successfully process the same lines.


Well, single line triggers by definition don't need to be told the number "1", however multiple line triggers need to be told just how "multiple" they are. I think that is reasonably natural.

Say if someone tells you to buy "a box of matches" you know to buy one box, however if they tell you to buy "many" boxes of matches you might reply "how many?".

- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Shadowfyr   USA  (1,787 posts)  [Biography] bio
Date Reply #12 on Sun 25 Apr 2004 06:01 PM (UTC)

Amended on Sun 25 Apr 2004 06:04 PM (UTC) by Shadowfyr

Message
True enough. This is really only true in rare cases where they don't specify a specific number. Your example is one case where a user may want more than a predetermined number of lines. This however is a rare case. In such cases, limiting it makes sense and is 100% necessary. However, in cases where there is an explicit and undeniable number of lines pre-defined by the triggers match text, you still require the user to specify the exact number of lines to look in to find it. That is what I mean by broken, since its should already know in those cases how many to match and they user can actually cause the perfectly functional trigger to fail completely, just by inadvertently specifying one line less than it would have match on its own. It is a minor flaw, but definitely a flaw.
[Go to top] top

Posted by Nick Gammon   Australia  (23,045 posts)  [Biography] bio   Forum Administrator
Date Reply #13 on Sun 25 Apr 2004 09:46 PM (UTC)

Amended on Sun 25 Apr 2004 09:47 PM (UTC) by Nick Gammon

Message
Well, if I have to count \n characters for you I need to know *whether* to count them or not. In this example it will work:

You see X \n You see Y

That gives me 2 lines (because of one \n).

However this will not work:

You see X (.*\n)* You see Y

Because the \n is inside a repeat group the counting will be wrong. Now to know that I have to parse the regular expression, which could be harder than it looks, and even then it doesn't tell me how many lines to insert, I just discover that it *isn't* 2.

Or this case:

You see (?s)(.*) kobolds

This doesn't even have a \n in it. However the (?s) tells it to include newlines in the "dot" character.

Or what about this:

You see [^a]* here.

According to the specs [^a] *will* match newlines (it matches everything except "a" *including* newlines).

How do you plan to work around that? Have another check box?

Count lines for me? [X]
If not, how many lines: [ 15 ]

Seems a bit silly. I've added another question, and I still need you to specify a line count for those other cases.

- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Shadowfyr   USA  (1,787 posts)  [Biography] bio
Date Reply #14 on Sun 25 Apr 2004 10:25 PM (UTC)
Message
Heh. I said it was a flaw, I didn't imply that I had a clue how to fix it, short of not doing it that way in the first place. lol I know why you need it, I just consider it an annoyance you have to live with for your method.
[Go to top] top

The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).

To show them in your local time you can join the forum, and then set the 'time correction' field in your profile to the number of hours difference between your location and UTC time.


69,437 views.

This is page 1, subject is 3 pages long: 1 2  3  [Next page]

It is now over 60 days since the last post. This thread is closed.     [Refresh] Refresh page

Go to topic:           Search the forum


[Go to top] top


Written by Nick Gammon - 5K   profile for Nick Gammon on Stack Exchange, a network of free, community-driven Q&A sites   Marriage equality

Comments to: Gammon Software support
[RH click to get RSS URL] Forum RSS feed ( https://gammon.com.au/rss/forum.xml )

[Best viewed with any browser - 2K]    [Hosted at HostDash]