Notice: Any messages purporting to come from this site telling you that your password has expired, or that you need to verify your details, confirm your email, resolve issues, making threats, or asking for money, are
spam. We do not email users with any such messages. If you have lost your password you can obtain a new one by using the
password reset link.
Due to spam on this forum, all posts now need moderator approval.
Entire forum
➜ Programming
➜ General
➜ Lexical Trie class
It is now over 60 days since the last post. This thread is closed.
Refresh page
Pages: 1
2
| Posted by
| David Haley
USA (3,881 posts) Bio
|
| Date
| Reply #15 on Sun 08 Oct 2006 10:10 PM (UTC) |
| Message
|
Quote: This doesn't totally surprise me, because for each letter stored in a trie, you need an instance of TrieNode, namely 16 bytes. Admittedly the trie structure means you don't need a node for each letter in the dictionary, but for short runs of words, the space saved by the trie is probably lost by the extra space for the node structures.
Yes, the trie is not a good data structure for small numbers of words. The big savings come from sharing word prefixes. (In fact you can also make 'super-tries' that share prefixes and suffixes. This gives very good savings due to very common suffixes like 'ing', 'ion', etc.)
Quote: Also, I think that the trie lookup (eg. for "hello") is not quite 5 steps (depending on how you define "step") because at each node there are alternatives. So, for instance if there was "hi", "ha" and "hello", it was to walk the alternatives following the "h" (which are "a", "i", and "e") in order to move onto the next letter.
I was speaking in terms of algorithmic complexity. At most you have 26 (or 52, with case sensitivity) potential children at every node. This number is not dependent on the size of the problem, whether the 'problem' is considered to be the number of words or the length of the word to find (or both). Therefore the complexity of word lookup is linear in the length of the word, in terms of algorithmic complexity.
Of course in practice these constant costs can be considerable. There is lots of potential for speed-up in my trie implementation. I don't think I do a binary search of the children array, for example, despite having them sorted.
The trie really shines when you have very large numbers of words to store and you need very quick lookup. Of course a trie uses more memory than a straight array of the words, but the lookup is much faster because it does not depend on the number of words in your dictionary but on the length of the word to search for. As for Lua's hash lookups, I'm not sure how that compares in practice, because I suspect they'd have collisions and have to resolve those. Have you done any lookup speed tests? |
David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone
http://david.the-haleys.org | | Top |
|
| Posted by
| Nick Gammon
Australia (23,166 posts) Bio
Forum Administrator |
| Date
| Reply #16 on Mon 09 Oct 2006 06:50 AM (UTC) |
| Message
| Using the Lua table lookup, I found the word "aerodynamically" out of a table of 9251 metaphones 10,000,000 times in 2.5 seconds.
The thing is, it will be the same figure for the trie, because I was storing each trie as a sub-table per metaphone.
Even looking each word up directly, I loaded up the dictionary of 76,821 words into a straight Lua table (ie. no metaphones or anything), and it still looked up "aerodynamically" 10,000,000 times in 2 seconds.
So, for a common paragraph of, say, 20 words, the lookup time will be minimal. |
- Nick Gammon
www.gammon.com.au, www.mushclient.com | | Top |
|
| Posted by
| Nick Gammon
Australia (23,166 posts) Bio
Forum Administrator |
| Date
| Reply #17 on Mon 09 Oct 2006 06:55 AM (UTC) |
| Message
|
Quote:
I was speaking in terms of algorithmic complexity. At most you have 26 (or 52, with case sensitivity) potential children at every node.
Yes you are right that the trie will favour fast lookups for short words. A binary search would be proportional to the log of the number of elements in the table, regardless of their length.
However the trie will be slightly slower with a large table because at a particular node, you may have to scan an average of 52/2 (26) potential children, particularly near the start.
It is a nice data structure, I admit that. It probably lends itself to situations where you have many prefixes that are the same. |
- Nick Gammon
www.gammon.com.au, www.mushclient.com | | Top |
|
| Posted by
| David Haley
USA (3,881 posts) Bio
|
| Date
| Reply #18 on Mon 09 Oct 2006 07:03 PM (UTC) |
| Message
| Well, it's not so much the number of words that matters in the trie, it's the branching factor at each node. Or, in other words, the disparity of the prefixes. Even if you have a few 'bottlenecks', that's not necessarily bad because one switch might then lead you down more or less 'straight' paths. The initial scan can be the tough one, especially if you really do use those 52 letters. (That is, though, somewhat unlikely!)
Another thing worth noting is that the Lua tables have been very heavily optimized over several years. My poor trie class has been minimally adapted after an assignment and hasn't undergone any optimization other than g++'s. :-) I'm actually rather happy that my performance isn't that different from the Lua tables, and that makes me feel better about the whole paradigm of using a trie. |
David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone
http://david.the-haleys.org | | Top |
|
| Posted by
| Nick Gammon
Australia (23,166 posts) Bio
Forum Administrator |
| Date
| Reply #19 on Tue 10 Oct 2006 01:14 AM (UTC) |
| Message
| It would be interesting to store a value at trie nodes - particularly the end-of-word nodes. That way the trie could actually be used as a table-lookup for information.
Right now, it is really like an STL set, but by having an information item, it could be more like an STL map. |
- Nick Gammon
www.gammon.com.au, www.mushclient.com | | Top |
|
| Posted by
| David Haley
USA (3,881 posts) Bio
|
| Date
| Reply #20 on Tue 10 Oct 2006 05:52 PM (UTC) |
| Message
| | Yes, this is one of the enhancements I was thinking of. It shouldn't really be that hard, either, except that the simple boolean case has some extra trickeries (i.e. false vs. true vs. not-in-trie). I'll explain more in a bit, have to run to class. :-) |
David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone
http://david.the-haleys.org | | Top |
|
| Posted by
| Nick Gammon
Australia (23,166 posts) Bio
Forum Administrator |
| Date
| Reply #21 on Tue 10 Oct 2006 11:58 PM (UTC) |
| Message
| | Well, you need to do something like Lua, and have nil for no value, and true/false for booleans. |
- Nick Gammon
www.gammon.com.au, www.mushclient.com | | Top |
|
| Posted by
| David Haley
USA (3,881 posts) Bio
|
| Date
| Reply #22 on Wed 11 Oct 2006 12:19 AM (UTC) |
| Message
| That's right. You would have to realize that saying trie["foobar"] = false is not the same thing as saying trie.erase("foobar"). In the first case, the word is still "present" in the trie, and if you retrieve it you will get false. But you haven't removed it from the trie, so notably the memory is still hanging around. The second case would actually remove the memory.
It's not really the trie's problem, though; it's the end-user who will have to define their own semantics.
This change will not be too hard to make. I'll templatize the trie sometime in the next few days or over the weekend. |
David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone
http://david.the-haleys.org | | Top |
|
| Posted by
| Nick Gammon
Australia (23,166 posts) Bio
Forum Administrator |
| Date
| Reply #23 on Wed 11 Oct 2006 04:52 AM (UTC) |
| Message
| It starts to look like you are making another set/map class, like is already available in STL.
Could be efficient on large trees, probably not much in it for small ones. |
- Nick Gammon
www.gammon.com.au, www.mushclient.com | | Top |
|
| Posted by
| David Haley
USA (3,881 posts) Bio
|
| Date
| Reply #24 on Wed 11 Oct 2006 07:21 AM (UTC) |
| Message
| Right. Tries have two properties that make them useful:
- Prefix sharing. This saves memory if you have lots of words in your dictionary. (lots of words.)
- Lookup time. Lookup time is a function of word length, not the number of words in your set. This is also very useful when you have a large number of words.
So a trie is a poor structure for small sets, where "small" is roughly less than a thousand, I would suppose. But it can be very useful in certain areas, namely when you need to do a lot of lookups over a large set of words quickly. Hence my original suggestion to use it in the auto-completion feature. (I'm not sure it's well suited at all to the way you are approaching spell-checking, although with some clever tricks your algorithm could probably be incorporated into the trie recursive walk, which would make it good again.) |
David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone
http://david.the-haleys.org | | Top |
|
The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).
To show them in your local time you can join the forum, and then set the 'time correction' field in your profile to the number of hours difference between your location and UTC time.
47,402 views.
This is page 2, subject is 2 pages long:
1
2
It is now over 60 days since the last post. This thread is closed.
Refresh page
top