Register forum user name Search FAQ

Gammon Forum

Notice: Any messages purporting to come from this site telling you that your password has expired, or that you need to verify your details, confirm your email, resolve issues, making threats, or asking for money, are spam. We do not email users with any such messages. If you have lost your password you can obtain a new one by using the password reset link.

Due to spam on this forum, all posts now need moderator approval.

 Entire forum ➜ Programming ➜ General ➜ Lexical Trie class

Lexical Trie class

It is now over 60 days since the last post. This thread is closed.     Refresh page


Pages: 1 2  

Posted by David Haley   USA  (3,881 posts)  Bio
Date Thu 05 Oct 2006 08:10 AM (UTC)
Message
Hi everybody,

I've just released a class that implements a lexical trie. From the documentation:

Quote:
A trie is a tree-like data structure to store words, where each branch of the tree is tagged with a letter and indicates the addition of that vector. At tree nodes, a boolean flag indicates if the word up to that point is in fact contained in the trie.


I found a website that describes it fairly well. It provides an implementation in C and an "older" (sic) one in C++. I don't know what features the interface has; I don't think it has things like regular expressions and so forth. In any case I just thought the page was interesting to explain what a trie is:
http://linux.thai.net/~thep/datrie/datrie.html

The graph should show fairly clearly what a trie looks like in memory.



In brief, the advantage to a trie is that it stores a list of words with a look-up time proportional not to the number of words, but to the length of the word you're looking up. Contrast with a binary search array which has log(n) time, where n is the number of words -- if you have 2^12 (=4096) words, you will need -- in the worst case -- 12 steps to look up any word. In a trie, looking up 'hello' will cost you 5 steps.

Furthermore, the way a trie is structured makes it very easy (and efficient) to do things like prefix matching or regular expression matching. If you had a sorted array, this would be much more complicated; for regular expressions in particular it would be much less efficient to find all matches.


Anyhow, it's late, I'm tired, so enough of all that. :-) Without further ado, here is the website with a download link:
http://david.the-haleys.org/index.php?page=lexical-trie



Please note that the provided license:
http://david.the-haleys.org/dev/lexical-trie/LICENSE
does not allow the trie to be used in commercial applications without my permission.

I hereby give permission to Gammon Software Solutions (and/or Nick Gammon, as appropriate) to use my lexical trie class in his commercial software products, including but not limited to the MUSHclient application, at no fee or cost.
But, it would be nice -- but not required! -- to have any eventual enhancements fed back to my source code. :-)

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org
Top

Posted by Nick Gammon   Australia  (23,166 posts)  Bio   Forum Administrator
Date Reply #1 on Thu 05 Oct 2006 08:49 AM (UTC)
Message
Quote:

I hereby give permission to Gammon Software Solutions (and/or Nick Gammon, as appropriate) to use my lexical trie class in his commercial software products ...


Thanks for that!

I'll take a look, it might save a considerable amount of memory.

- Nick Gammon

www.gammon.com.au, www.mushclient.com
Top

Posted by Nick Gammon   Australia  (23,166 posts)  Bio   Forum Administrator
Date Reply #2 on Thu 05 Oct 2006 08:55 AM (UTC)
Message
It's too late to play tonight, but my first bit of feedback is the first function in your file: StringToLower.

From my page:

http://www.gammon.com.au/forum/?id=2896

The following function might be faster as it uses a single STL function (transform) operating over a string, rather than working on individual letters.


// returns a lower case version of the string 
string tolower (const string & s)
  {
string d (s);

  transform (d.begin (), d.end (), d.begin (), (int(*)(int)) tolower);
  return d;
  }  // end of tolower

- Nick Gammon

www.gammon.com.au, www.mushclient.com
Top

Posted by David Haley   USA  (3,881 posts)  Bio
Date Reply #3 on Thu 05 Oct 2006 09:27 PM (UTC)
Message
Hmm. I might be confused about how STL's transform worked; I was under the impression that it iterated over the sequence given (str.begin() to str.end()), applying the transformation function (tolower in this case), saving the results by iterating over the 'result' iterator (in this case, str.begin() again).

Isn't that basically the same thing? If I remember correctly, I use the 'reserve' function which guarantees that the string is of the right size, to avoid needless allocations.

Still, I suspect you might be right because the STL probably is optimized for this kind of stuff, so when I get the chance I will benchmark this and see what happens. :-)

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org
Top

Posted by Nick Gammon   Australia  (23,166 posts)  Bio   Forum Administrator
Date Reply #4 on Thu 05 Oct 2006 10:50 PM (UTC)

Amended on Thu 05 Oct 2006 10:51 PM (UTC) by Nick Gammon

Message
It is doing the same thing, although transform takes a begin and end iterator (the source) and an output iterator (the destination).

Your version, which I will reproduce here so people know what we are talking about, will be calling the operator[] for the string, which is a function call each time around the loop. Also, it evaluates str.length () each time through. I was suggesting that letting STL do that inside its own loop might be faster:


#include <ctype.h>
#include <string>

// Helper function: StringToLower
static std::string StringToLower(const std::string & str)
{
  std::string result;
  result.reserve(str.length());
  
  for (std::string::size_type i = 0; i < str.length(); i++)
    {
    result[i] = tolower(str[i]);
    }
    
  return result;
}

int main (void)
  {
  std::string s ("STL is VERY useful");

  for (int i = 0; i < 10000000; i++)
    StringToLower (s);

  return 0;
  } // end of main

test
$ g++ -O3 test1.cpp -o test1
$ time ./test1

real    0m13.735s
user    0m13.690s
sys     0m0.010s



And the one using transform:


#include <ctype.h>
#include <string>

// returns a lower case version of the string
std::string tolower (const std::string & s)
  {
std::string d (s);

  std::transform (d.begin (), d.end (), d.begin (), (int(*)(int)) tolower);
  return d;
  }  // end of tolower

int main (void)
  {
  std::string s ("STL is VERY useful");

  for (int i = 0; i < 10000000; i++)
    tolower (s);

  return 0;
  } // end of main

test
$ g++ -O3 test2.cpp -o test2
$ time ./test2

real    0m10.510s
user    0m10.510s
sys     0m0.000s



The "transform" version ran 3 seconds faster, or in 77% of the time of the original. Not stunning results, but a useful margin.

- Nick Gammon

www.gammon.com.au, www.mushclient.com
Top

Posted by David Haley   USA  (3,881 posts)  Bio
Date Reply #5 on Fri 06 Oct 2006 12:06 AM (UTC)
Message
Ah, I wasn't thinking about the [] operator, silly me. That would definitely make a difference. The optimizer should realize that str.length() is a constant number -- after all str itself is constant and the method is a constant method.

I made a version of the test where the call to str.length() was replaced by a pre-computed call, but I got very strange benchmarks...


test1 -- my original
real    1m2.187s
user    1m1.265s
sys     0m0.046s


test2 -- my original with pre-computed str.length
real    0m59.016s
user    0m58.234s
sys     0m0.046s


test3 -- your version
real    1m36.719s
user    1m35.593s
sys     0m0.061s


I find it very surprising that my computer is showing such slower speeds than yours (roughly six times slower). My computer is pretty fast; it's a 3 GHz hyper-threading processor.

Still, my modified version appears to be slightly faster than my unmodified version. My assumption would be that it's due to the cost of iteration that the STL version has.

I'm also surprised that my computer shows your version as being slower than my versions; maybe my computer is biased. :-) I was so surprised that I ran all the tests twice and got the same results, modulo a second or two.

What version of g++ are you using? For my current computer, my laptop:

g++ (GCC) 3.4.4 (cygming special) (gdc 0.12, using dmd 0.125)
Copyright (C) 2004 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


You might have g++ 4+. I have 4.1.1 (I think) on my new desktop; I will test this on that system when I get home today.

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org
Top

Posted by Nick Gammon   Australia  (23,166 posts)  Bio   Forum Administrator
Date Reply #6 on Fri 06 Oct 2006 01:01 AM (UTC)

Amended on Fri 06 Oct 2006 01:02 AM (UTC) by Nick Gammon

Message
My processor isn't particularly fast, I am running on my "old PC", which I use for Linux, because you need much faster PCs to run Windows <grin>.


$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 6
cpu MHz         : 864.484
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips        : 1723.59


As you can see, it is much slower than 3 Ghz, maybe a quarter of that speed.

My gcc version:


$ gcc --version
gcc (GCC) 3.2.2 20030222 (Red Hat Linux 3.2.2-5)


Did you compile like I did with the -O3 (optimization) on?

I tried moving the str.length () out of the loop in my test, didn't make much difference (went down to 13.68 seconds).


And, it is strange your test shows my version runs slower than your version.

- Nick Gammon

www.gammon.com.au, www.mushclient.com
Top

Posted by David Haley   USA  (3,881 posts)  Bio
Date Reply #7 on Fri 06 Oct 2006 05:25 PM (UTC)
Message
Yes, I used the exact same compilation string that you posted. I haven't had the chance to run this on my Linux machine, but I will do that in about 4 hours.

The fact that moving the str.length() out of the loop did not change much is reassuring; it's a good indication that the optimizer is probably smart enough to realize that it doesn't need to make that function call every single run of the loop.

You know, it occurs to me that I might have been running this test when my laptop was on battery, in which case it dramatically slashes the CPU performance. I will try this again on the laptop when it's plugged in and see what happens. I find it astonishing that my computer takes about 6 times as long as yours does!

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org
Top

Posted by Nick Gammon   Australia  (23,166 posts)  Bio   Forum Administrator
Date Reply #8 on Fri 06 Oct 2006 09:46 PM (UTC)
Message
Quote:

I might have been running this test when my laptop was on battery, in which case it dramatically slashes the CPU performance


That wouldn't surprise me. It also might account for why your test returned results which were the opposite of mine.

- Nick Gammon

www.gammon.com.au, www.mushclient.com
Top

Posted by David Haley   USA  (3,881 posts)  Bio
Date Reply #9 on Fri 06 Oct 2006 10:18 PM (UTC)
Message
I ran all of this on my Linux server, here are the results:

Quote:

[david@binarygoblin ~]$ g++ -O3 -Wall --pedantic test1.cpp -o test1     
[david@binarygoblin ~]$ g++ -O3 -Wall --pedantic test2.cpp -o test2 
[david@binarygoblin ~]$ g++ -O3 -Wall --pedantic test3.cpp -o test3 
[david@binarygoblin ~]$ time ./test1

real    0m4.493s
user    0m4.476s
sys     0m0.012s
[david@binarygoblin ~]$ time ./test2

real    0m4.424s
user    0m4.420s
sys     0m0.000s
[david@binarygoblin ~]$ time ./test3

real    0m6.435s
user    0m6.428s
sys     0m0.000s
[david@binarygoblin ~]$ g++ --version & uname -a & cat /proc/cpuinfo 
[1] 2493
[2] 2494
g++ (GCC) 4.1.1 20060525 (Red Hat 4.1.1-1)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Celeron(R) CPU 2.40GHz
stepping        : 9
cpu MHz         : 2393.243
cache size      : 128 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips        : 4794.42

[1]-  Done                    g++ --version
[david@binarygoblin ~]$ Linux binarygoblin.stanford.edu 2.6.16-1.2096_FC5 #1 Wed Apr 19 05:14:36 EDT 2006 i686 i686 i386 GNU/Linux

[2]+  Done                    uname -a
[david@binarygoblin ~]$ 




Just for kicks I ran it again on the laptop and got basically the same results:
Quote:
David@ksilyan ~
$ time ./test1

real    0m48.172s
user    0m47.686s
sys     0m0.061s

David@ksilyan ~
$ time ./test2

real    0m47.812s
user    0m47.530s
sys     0m0.015s

David@ksilyan ~
$ time ./test3

real    1m34.156s
user    1m33.327s
sys     0m0.077s


These results are somewhat astonishing to me. I wonder if the Cygwin/Windows pairing is doing something very strange. If I get around to it I'll try using VC++ and see how long that takes.



In any case, on my Linux server it seems that my two versions are faster, but not by very much. I am not sure at all how to explain the discrepancy we are seeing. One thought is that the optimizer changed dramatically between 3.2 and 4.1, so that it now optimizes my code better than your code. Otherwise I have no explanation, it's just strange. :-)

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org
Top

Posted by Nick Gammon   Australia  (23,166 posts)  Bio   Forum Administrator
Date Reply #10 on Fri 06 Oct 2006 10:45 PM (UTC)
Message
Now I am getting confused. I tried using Cygwin on my laptop (plugged into the power) and got these results:


$ time ./test1

real    0m27.750s
user    0m27.421s
sys     0m0.031s

$ time ./test2

real    0m27.062s
user    0m26.812s
sys     0m0.000s

$ time ./test3

real    0m38.719s
user    0m38.202s
sys     0m0.000s


Like you, my version (test3) was slower, and much slower too than my old vintage Linux box. Shows Cygwin must be introducing quite an overhead, even for pure computations.

This laptop is a 1.83 GHz Pentium.

However my version of Cygwin was still version 3 of gcc:


$ gcc --version
gcc (GCC) 3.4.4 (cygming special) (gdc 0.12, using dmd 0.125)
Copyright (C) 2004 Free Software Foundation, Inc.


Looks like you can make a case for either version, depending on which platform you are on. :)

I think the transform looks slightly neater, that is about all I can say for it now.


- Nick Gammon

www.gammon.com.au, www.mushclient.com
Top

Posted by David Haley   USA  (3,881 posts)  Bio
Date Reply #11 on Fri 06 Oct 2006 10:53 PM (UTC)
Message
Hmm. Cygwin must be doing something very strange, but I'm not sure what, because both of us are seeing very dramatic slow-downs using Cygwin.

I completely agree, by the way, that the transform version looks a lot neater. I like using the abstract concepts like that to make things look nice and tidy. That's why, for now, the trie uses STL vectors to store each node's list of children.
But, since these operations in the trie need to be really, really fast, I'm wondering if it might not be better to sacrifice a little niceness for the sake of efficiency. After all the niceness of the interface is preserved; who cares what the trie does internally as long as it does what the interface says it does? :-) But what people do care about is going fast, especially if you're storing tens of thousands of words and need to do complex look-ups rather often.

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org
Top

Posted by Nick Gammon   Australia  (23,166 posts)  Bio   Forum Administrator
Date Reply #12 on Fri 06 Oct 2006 11:06 PM (UTC)

Amended on Fri 06 Oct 2006 11:08 PM (UTC) by Nick Gammon

Message
At the risk of worrying this topic to death, I tried again on another PC which I normally use for Windows XP. However I booted using the Knoppix boot CD, which is a useful way of trying Linux on a PC without actually installing it. These are my results this time:


$ time ./test1

real    0m3.692s
user    0m3.689s
sys     0m0.003s
$ time ./test2

real    0m3.675s
user    0m3.674s
sys     0m0.002s
$ time ./test3

real    0m4.251s
user    0m4.249s
sys     0m0.002s

$ gcc --version
gcc (GCC) 3.3.6 (Debian 1:3.3.6-5)
Copyright (C) 2003 Free Software Foundation, Inc.

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 3
model name      : Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping        : 4
cpu MHz         : 3001.603
cache size      : 1024 KB
physical id     : 0
siblings        : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
...
bogomips        : 5996.54


Again, my version is slower, again it is much faster on Linux than Cygwin. This was on a 3 GHz Pc.

- Nick Gammon

www.gammon.com.au, www.mushclient.com
Top

Posted by Nick Gammon   Australia  (23,166 posts)  Bio   Forum Administrator
Date Reply #13 on Sun 08 Oct 2006 05:39 AM (UTC)

Amended on Sun 08 Oct 2006 09:51 PM (UTC) by Nick Gammon

Message
Now for a proper test ...

First I make your Trie class have a Lua interface, as follows (added this to the bottom of the .cpp file):


//----------------------- begin Lua stuff ----------------------------

const char trie_handle[] = "trie_handle";

// make string table item
static void MakeTableItem (lua_State *L, const char * name, const string & str)
  {
  lua_pushstring (L, name);
  lua_pushstring (L, str.c_str ());
  lua_rawset(L, -3);
  }

// make number table item
static void MakeTableItem (lua_State *L, const char * name, const int n)
  {
  lua_pushstring (L, name);
  lua_pushnumber (L, n);
  lua_rawset(L, -3);
  }

static LexicalTrie * Ltrie_gettrie  (lua_State *L)
{
  LexicalTrie **ud = (LexicalTrie **) luaL_checkudata (L, 1, trie_handle);
  luaL_argcheck(L, *ud != NULL, 1, "trie userdata expected");
  return *ud;
  }

static int Ltrie_add(lua_State *L)
  {
  LexicalTrie *pTrie = Ltrie_gettrie (L);

  string word (luaL_checkstring (L, 2));
  pTrie->addWord (word);
  return 0;
  } // end of Ltrie_add

static int Ltrie_remove(lua_State *L)
  {
  LexicalTrie *pTrie = Ltrie_gettrie (L);
  string word (luaL_checkstring (L, 2));
  pTrie->removeWord (word);
  return 0;
  } // end of Ltrie_remove

static int Ltrie_contains(lua_State *L)
  {
  LexicalTrie *pTrie = Ltrie_gettrie (L);
  string word (luaL_checkstring (L, 2));
  lua_pushboolean (L, pTrie->containsWord (word));
  return 1;
  } // end of Ltrie_contains

static int Ltrie_prefix(lua_State *L)
  {
  LexicalTrie *pTrie = Ltrie_gettrie (L);
  string word (luaL_checkstring (L, 2));
  lua_pushboolean (L, pTrie->containsPrefix (word));
  return 1;
  } // end of Ltrie_prefix

static int Ltrie_save(lua_State *L)
  {
  LexicalTrie *pTrie = Ltrie_gettrie (L);
  std::set<std::string> out;

  pTrie->writeToSet (out);

  lua_createtable (L, out.size (), 0);
  
  int i = 1;

  // copy set into table
  for (std::set<std::string>::const_iterator it = out.begin (); 
       it != out.end (); it++)
         {
         lua_pushstring (L, it->c_str ());
         lua_rawseti (L, -2, i++);
         }

  return 1;    // the table
  } // end of Ltrie_save

static int Ltrie_load(lua_State *L)
  {
  LexicalTrie *pTrie = Ltrie_gettrie (L);
  luaL_checktype (L, 2, LUA_TTABLE);
  lua_pushnil (L);  // first key
  while (lua_next (L, 2) != 0)
    {
    string word (luaL_checkstring (L, -1));
    pTrie->addWord (word);
    lua_pop (L, 1);   // remove word, leave key    
    }
  return 0;    
  } // end of Ltrie_load

static int Ltrie_count(lua_State *L)
  {
  LexicalTrie *pTrie = Ltrie_gettrie (L);
  lua_pushinteger (L, pTrie->numWords ());
  return 1;    // the count
  } // end of Ltrie_count

static int Ltrie_match(lua_State *L)
  {
  LexicalTrie *pTrie = Ltrie_gettrie (L);
  string regexp = (luaL_checkstring (L, 2));
  std::set<std::string> results;

  pTrie->matchRegExp (regexp, results);
  
  lua_createtable (L, results.size (), 0);
  
  int i = 1;

  // copy set into table
  for (std::set<std::string>::const_iterator it = results.begin (); 
       it != results.end (); it++)
         {
         lua_pushstring (L, it->c_str ());
         lua_rawseti (L, -2, i++);
         }


  return 1;    // the table
  } // end of Ltrie_match

static int Ltrie_corrections(lua_State *L)
  {
  LexicalTrie *pTrie = Ltrie_gettrie (L);
  string regexp = (luaL_checkstring (L, 2));
  int distance = (luaL_checkinteger (L, 3));
  std::set<LexicalTrie::Correction> results;

  pTrie->suggestCorrections (regexp, distance, results);
  
  lua_createtable (L, results.size (), 0);
  
  int i = 1;

  // copy set into table
  for (std::set<LexicalTrie::Correction>::const_iterator it = results.begin (); 
       it != results.end (); it++)
         {
         lua_createtable (L, 0, 2);
         MakeTableItem (L, "word", it->suggestedWord_);
         MakeTableItem (L, "distance", it->editDistance_);
         lua_rawseti (L, -2, i++);
         }

  return 1;    // the table
  } // end of Ltrie_corrections

// done with the trie, delete it
static int Ltrie_gc (lua_State *L) {
  LexicalTrie *pTrie = Ltrie_gettrie (L);
  delete pTrie;
  // set userdata to NULL, so we don't try to use it now
  LexicalTrie **ud = (LexicalTrie **) luaL_checkudata (L, 1, trie_handle);
  *ud = NULL;
  return 0;
  }  // end of Ltrie_gc

// tostring helper
static int Ltrie_tostring (lua_State *L) 
  {
  LexicalTrie *pTrie = Ltrie_gettrie (L);
  lua_pushstring(L, "trie");
  return 1;
}  // end of Ltrie_tostring

//----------------------- create a new trie object ----------------------------

static int Ltrie_new(lua_State *L)
{
  LexicalTrie *pTrie = new LexicalTrie (luaL_optnumber (L, 1, 1)); // case-sensitive flag
  LexicalTrie **ud = (LexicalTrie **)lua_newuserdata(L, sizeof (LexicalTrie *));
  luaL_getmetatable(L, trie_handle);
  lua_setmetatable(L, -2);
  *ud = pTrie;    // store pointer to this trie in the userdata

  return 1;
}  // end of Ltrie_new



static const luaL_reg triemeta[] = {
  {"add",        Ltrie_add},         // add one word
  {"remove",     Ltrie_remove},      // remove one word
  {"contains",   Ltrie_contains},    // is word there?
  {"prefix",     Ltrie_prefix},      // is prefix there?
  {"save",       Ltrie_save},        // save trie to table
  {"load",       Ltrie_load},        // load trie from table
  {"count",      Ltrie_count},       // number of words in the trie
  {"match",      Ltrie_match},       // match regular expression
  {"corrections",Ltrie_corrections}, // return table of corrections
  {"__gc",       Ltrie_gc},
  {"__tostring", Ltrie_tostring},
  {NULL, NULL}
};


/* Open the library */

static const luaL_reg trielib[] = {
  {"new",     Ltrie_new},
  {NULL, NULL}
};

static void createmeta(lua_State *L, const char *name)
{
  luaL_newmetatable(L, name);   /* create new metatable */
  lua_pushliteral(L, "__index");
  lua_pushvalue(L, -2);         /* push metatable */
  lua_rawset(L, -3);            /* metatable.__index = metatable */
}

LUALIB_API int luaopen_trie(lua_State *L)
{
  createmeta(L, trie_handle);
  luaL_register (L, NULL, triemeta);
  lua_pop(L, 1);
  luaL_register (L, "trie", trielib);
  return 1;
}




I also added a couple of extra methods, because when outputting the trie I wanted to put it into a table, not a ostream. These are the new methods:


void LexicalTrie::writeToSet(std::set<std::string> & out)
  {
  root_->writeToSet(out);
  }

void LexicalTrie::TrieNode::writeToSet(std::set<std::string> & output, const std::string & soFar)
{
  if ( isWord_ )
    output.insert(soFar);

  for ( std::vector<LetterTriePair>::iterator it = letters_.begin();
      it != letters_.end(); it++ )
  {
    it->trie_->writeToSet(output, soFar + it->letter_);
  }
}



These were very similar to your writeToStream methods.


The code above illustrates how you can interface a C++ class to Lua fairly simply.

To use it, you make a new instance of the trie object, like this:


tr = trie.new ()


Then you use that new object, and call its methods (implemented by a metatable, and using __index to call the functions in the metatable):


tr:add ("nick")
tr:add ("gammon")
tr:load {"all", "good", "boys", "deserve", "fruit" }

print ("is nick there", tr:contains ("nick")) --> true
print ("is andrew there", tr:contains ("andrew")) --> false

require "tprint"

tprint (tr:save ())


The code above illustrates adding to the trie, loading a table, testing for membership, and saving back as a table.

I'm not quite sure how the "corrections" part is supposed to work. Given the above entries, which save as:


1="all"
2="boys"
3="deserve"
4="fruit"
5="gammon"
6="good"
7="nick"


If I enter:


tprint (tr:corrections("go", 3))


I get:


1:
  "distance"=2
  "word"="good"
2:
  "distance"=3
  "word"="all"
3:
  "distance"=3
  "word"="boys"
4:
  "distance"=3
  "word"="good"


The word "good" appears twice, I'm not quite sure why.




Speed and memory usage

The other question is, does this save time or space over the straight Lua tables? I'm not sure that it does.

In my implementation of the spell checker test code, I wanted to be able to lookup by metaphone code, so I made a trie per metaphone - that is, 9,251 tries.

The time taken to load the dictionary was about double, compared to straight Lua tables, and the memory usage was over double as well.

This doesn't totally surprise me, because for each letter stored in a trie, you need an instance of TrieNode, namely 16 bytes. Admittedly the trie structure means you don't need a node for each letter in the dictionary, but for short runs of words, the space saved by the trie is probably lost by the extra space for the node structures.

- Nick Gammon

www.gammon.com.au, www.mushclient.com
Top

Posted by Nick Gammon   Australia  (23,166 posts)  Bio   Forum Administrator
Date Reply #14 on Sun 08 Oct 2006 07:40 AM (UTC)
Message
Quote:

In brief, the advantage to a trie is that it stores a list of words with a look-up time proportional not to the number of words, but to the length of the word you're looking up. Contrast with a binary search array which has log(n) time, where n is the number of words -- if you have 2^12 (=4096) words, you will need -- in the worst case -- 12 steps to look up any word. In a trie, looking up 'hello' will cost you 5 steps.


When you first posted this, the reference web site was down, but it is up now and I took a look at it.

I want to point out here that the Lua table lookups are also quite fast, being a hash lookup.

Also, I think that the trie lookup (eg. for "hello") is not quite 5 steps (depending on how you define "step") because at each node there are alternatives. So, for instance if there was "hi", "ha" and "hello", it was to walk the alternatives following the "h" (which are "a", "i", and "e") in order to move onto the next letter.

- Nick Gammon

www.gammon.com.au, www.mushclient.com
Top

The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).

To show them in your local time you can join the forum, and then set the 'time correction' field in your profile to the number of hours difference between your location and UTC time.


47,403 views.

This is page 1, subject is 2 pages long: 1 2  [Next page]

It is now over 60 days since the last post. This thread is closed.     Refresh page

Go to topic:           Search the forum


[Go to top] top

Information and images on this site are licensed under the Creative Commons Attribution 3.0 Australia License unless stated otherwise.