Full text search of a directory tree with boolean matching ... plugin


Posted by Nick Gammon (Australia), Forum Administrator
Date Sun 04 Nov 2012 03:27 AM (UTC)

Amended on Mon 05 Nov 2012 02:59 AM (UTC) by Nick Gammon

Message
Introduction


I've noticed recently that finding a particular file amongst the hundreds (or thousands) of source files, plugins, world files, Lua files, and so on, can be quite tedious.

There are times when I want to find something like a mapper but can't remember its exact name. Or, if I can remember the name, I can't remember where I put the file.

Some editors (such as Crimson Editor) have a "find in files" capability, but this tends to return a lot of results (try searching for the word "define" in C source). And even if the results are reasonably relevant, you have to open one file after another to find the one you really want. Plus you can't do things like "I want two words which are fairly close together".


Hence was born this plugin, the "source scanner".

Installation


Grab the latest copy from here:

https://github.com/nickgammon/plugins/blob/master/Source_scanner.xml

(To download, right-click on the Raw button on the GitHub page, and save the file Source_scanner.xml to your plugins directory.)

You may need to install the windows_utils.dll file available from here:

http://www.gammon.com.au/files/mushclient/lua5.1_extras/windows_utils.zip

That is used to open the text editor and bring its window to the front.


What it does


The source (file) scanner has two modes of operation.


  1. Scan a directory tree, indexing all files matching certain file types (eg. C files, XML files, Lua files) based on the file suffix. This takes a few seconds.

  2. Query the resulting database for a boolean match. This is pretty fast.


The scanning process builds an SQLite3 database using the FTS (Full Text Search) type of table. This is the same general idea (an index from each word to the files that contain it) that search engines like Google use to let you find one or more words amongst many files.
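To make the "index once, query many times" idea concrete, here is a toy word index in plain Lua. It is only a sketch of the general technique, not how SQLite's FTS is implemented and not the plugin's code; the names add_file and find_all and the sample file contents are made up for illustration.

```lua
-- Toy inverted index: word -> set of file names containing that word.
-- Illustration only; SQLite FTS does this far more efficiently.

local index = {}

-- split the contents into lower-case alphanumeric words and record them
local function add_file (name, contents)
  for word in contents:lower ():gmatch ("%w+") do
    index [word] = index [word] or {}
    index [word] [name] = true
  end -- for each word
end -- add_file

-- return a sorted list of the files containing ALL of the given words
local function find_all (...)
  local words = {...}
  local results = {}
  local first = index [words [1]:lower ()] or {}
  for name in pairs (first) do
    local ok = true
    for i = 2, #words do
      local set = index [words [i]:lower ()]
      if not (set and set [name]) then
        ok = false
        break
      end -- if this word is not in the file
    end -- for each remaining word
    if ok then
      table.insert (results, name)
    end -- if
  end -- for each candidate file
  table.sort (results)
  return results
end -- find_all

-- hypothetical file contents, standing in for the "scan" step
add_file ("a.lua", "SetVariable ('hp', 10) -- save the hit points")
add_file ("b.lua", "print (GetVariable ('hp'))")
add_file ("c.lua", "SetVariable ('mana', 5) print ('mana saved')")

print (table.concat (find_all ("setvariable"), ", "))        --> a.lua, c.lua
print (table.concat (find_all ("setvariable", "hp"), ", "))  --> a.lua
```

The slow part (reading every file) happens once in add_file; each query afterwards is just a few table lookups, which is why the second mode is so much faster than the first.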

As an example, first I index my Plugins folder (by typing "index"):



I get a confirmation:


Loaded 38 files in 0.1 seconds.


I try searching for "setvariable" ...


find setvariable




The results list each file containing setvariable (the match is not case-sensitive) by name, along with a snippet highlighting the searched-for word.

The snippet is very handy because you can quickly see which file you really want to edit. If you click on the file name (in blue) it opens the appropriate file in your desired editor (in my case, Crimson Editor).

Boolean searches


The full power comes from being able to specify boolean operations (eg. AND and OR).

Examples (from the help):


cat AND dog              --> both words 
cat dog                  --> both words, the "AND" is implied
fish OR bicycle          --> one or the other
cat NOT food             --> one word but not the other
bite NEAR me             --> one near the other (within 10 words)
disk NEAR/3 drive        --> one within 3 words of the other
"trouble brewing"        --> exact phrase
chip*                    --> prefix query, matches chip, chips, chipping etc.
fish NOT (bacon OR eggs) --> brackets can be used to clarify groupings
    
The words AND / OR / NEAR / NOT must be in upper case or they just match those words literally.

Note: words with underscores (eg. BUFFER_LENGTH) should be quoted because they are treated as two words.

find name <wildcard>     --> filter on file name, not contents
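The upper-case rule is easy to trip over, so here is a small sketch that picks out which words of a query would be treated as operators. It is an illustration written for this post, not part of the plugin or of SQLite; operators_in is a made-up name, and the function deliberately ignores quoting and brackets (it just splits on spaces).

```lua
-- Words that act as operators, but only when written in upper case
local OPERATORS = { AND = true, OR = true, NOT = true, NEAR = true }

-- return a list of the words in the query that are boolean operators
-- (simplified: splits on whitespace, ignores quotes and brackets)
local function operators_in (query)
  local found = {}
  for word in query:gmatch ("%S+") do
    if OPERATORS [word] or word:match ("^NEAR/%d+$") then
      table.insert (found, word)
    end -- if an operator
  end -- for each word
  return found
end -- operators_in

print (table.concat (operators_in ("cat AND dog"), " "))        --> AND
print (table.concat (operators_in ("disk NEAR/3 drive"), " "))  --> NEAR/3
print (#operators_in ("cat and dog"))                           --> 0
```

In the last case "and" is just another search word, so the query looks for files containing all three of "cat", "and" and "dog".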


Here is another example, searching some Arduino source files for the word "spi" near the word "transfer", in files that also contain the word "begin":


f spi NEAR transfer begin




You can see that all the words we are looking for are highlighted.


Configuration


Near the start of the plugin are various configuration options. For example, you can search for files with two or more words in them, or one word but not another one. You can also just search for file names, if you know the name, but can't remember what directory you put it into.

You probably want to change this option:


-- file types, separate by spaces, commas, semicolons, whatever. 
-- We assume suffixes are alphanumeric
SUFFIXES = "cpp,c,h,xml,lua"


That controls what file types are included in the database. You might add "html" for example, for web pages. Or "mcl" for "MUSHclient worlds".
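As a rough idea of how a suffix list like this can be applied, the sketch below turns the string into a lookup table and tests file names against it. This is an assumption about the general approach, not a copy of the plugin's code; wanted and wanted_file are names invented here.

```lua
SUFFIXES = "cpp,c,h,xml,lua"

-- build a set of wanted suffixes; any non-alphanumeric character separates them
local wanted = {}
for suffix in SUFFIXES:gmatch ("%w+") do
  wanted [suffix:lower ()] = true
end -- for each suffix

-- true if the file name ends in a dot followed by one of the wanted suffixes
local function wanted_file (filename)
  local suffix = filename:match ("%.(%w+)$")
  return suffix ~= nil and wanted [suffix:lower ()] == true
end -- wanted_file

print (wanted_file ("Source_scanner.xml"))  --> true
print (wanted_file ("main.CPP"))            --> true
print (wanted_file ("notes.txt"))           --> false
print (wanted_file ("README"))              --> false
```

Because the pattern only picks out alphanumeric runs, it does not matter whether you separate the suffixes with commas, spaces or semicolons.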

Another useful option to change is the number of tokens shown in a snippet. For example, changing from the default of 7 to 20, like this:


-- number of tokens to display around the snippet
SNIPPET_SIZE = -20


Now the same search as before shows a lot more detail around the chosen words:



That has its good and bad points. Good to see more detail, bad because it takes more room in the output window.
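For the curious, a setting like this would typically end up as the last argument of SQLite's snippet() function. The sketch below builds such a query as a string; it is a guess at the general shape, not the plugin's actual SQL. The table name "source" comes from the plugin, while make_search_sql and the "[", "]" and "..." markers are assumptions chosen for illustration. See the SQLite FTS documentation for what the sign of the token count means.

```lua
-- number of tokens to display around the snippet
SNIPPET_SIZE = -20

-- build an FTS query that returns the file name plus a snippet
-- (illustrative only: the markers and column choice are assumptions)
local function make_search_sql (match_text)
  local escaped = match_text:gsub ("'", "''")  -- double single quotes for SQL
  return string.format (
    "SELECT name, snippet(source, '[', ']', '...', -1, %d) " ..
    "FROM source WHERE source MATCH '%s'",
    SNIPPET_SIZE, escaped)
end -- make_search_sql

print (make_search_sql ("spi NEAR transfer begin"))
```

The -1 tells snippet() that the text may be taken from any column of the table.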

If you aren't using Crimson Editor, you could change these lines:


-- viewer program (backslashes are doubled inside a Lua string)
TEXT_VIEWER = "C:\\Program Files\\Crimson Editor\\cedt.exe"
EDITOR_WINDOW_NAME = "Crimson Editor"


to:


TEXT_VIEWER = "C:\\Windows\\notepad.exe"
EDITOR_WINDOW_NAME = "Notepad"


Or to some other text editor of your choice.
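One thing to watch when typing a Windows path into a Lua string: the backslash is Lua's escape character, so a single backslash followed by "n" (as in \notepad) would become a newline. Either double each backslash or use a long-bracket string. A quick check in plain Lua (the variable names are just for this example):

```lua
-- two equivalent ways of writing the same Windows path in Lua
local escaped = "C:\\Windows\\notepad.exe"  -- each backslash doubled
local literal = [[C:\Windows\notepad.exe]]  -- long string, no escapes processed

assert (escaped == literal)
print (escaped)  --> C:\Windows\notepad.exe
```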


Indexing


Indexing is done by typing the alias "index" which brings up a directory picker (illustrated above). Navigate to the "top" folder you want to index and click OK. If you look at the status bar of MUSHclient you will see each file name as it is indexed. This should only take a few seconds.

If you index, any existing data is discarded. This lets you re-index for searching a totally different folder, or take into account changes you might have made to your source.


Search for name


If your first search word is "name" then it looks in the file names rather than the file contents, eg.


find name ethernet




If you happen to want to search for files with "name" in them you could always quote "name".
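The rule can be sketched in a few lines of Lua: if the text after "find" starts with the bare word "name", treat the rest as a file-name filter, otherwise treat the whole lot as a contents search. This is an illustration of the rule as described above, not the plugin's code; parse_find is a made-up name, and whether the real plugin is case-sensitive here is not something this sketch claims to know.

```lua
-- decide whether a search line is a file-name search or a contents search
-- returns the mode ("name" or "contents") and the text to search for
local function parse_find (line)
  local pattern = line:match ("^%s*name%s+(.+)$")
  if pattern then
    return "name", pattern
  end -- if first word is "name"
  return "contents", line
end -- parse_find

print (parse_find ("name ethernet"))     --> name      ethernet
print (parse_find ('"name" ethernet'))   --> contents  "name" ethernet
print (parse_find ("namespace std"))     --> contents  namespace std
```

Quoting "name" works because the line then starts with a quote character rather than the bare word.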


How word searching works


The FTS algorithm used by SQLite3 indexes by words (not partial words). However the tokenization breaks words apart at punctuation, so that, for example BUFFER_SIZE is really considered two words, BUFFER and SIZE.

You can work around that somewhat by quoting words like BUFFER_SIZE (eg. "BUFFER_SIZE").

[EDIT] Found a way of avoiding that problem, see next post.

You can also search for word prefixes, that is SETVAR* would match SETVARIABLE. However you can't find the middle of a word (eg. *VAR* won't work as expected).
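Here is a small Lua imitation of that behaviour, to show why BUFFER_SIZE becomes two words and why a prefix matches but the middle of a word does not. It only mimics the default tokenizer for plain ASCII text and is not SQLite's code; tokenize and has_prefix are names invented for this sketch.

```lua
-- break text into lower-case words at anything that is not a letter or digit
local function tokenize (text)
  local tokens = {}
  for word in text:lower ():gmatch ("[a-z0-9]+") do
    tokens [#tokens + 1] = word
  end -- for each word
  return tokens
end -- tokenize

-- true if any token starts with the given prefix (like searching for PREFIX*)
local function has_prefix (tokens, prefix)
  prefix = prefix:lower ()
  for _, token in ipairs (tokens) do
    if token:sub (1, #prefix) == prefix then
      return true
    end -- if found
  end -- for each token
  return false
end -- has_prefix

print (table.concat (tokenize ("char buf [BUFFER_SIZE];"), " "))
  --> char buf buffer size

local tokens = tokenize ("SetVariable ('target', name)")
print (has_prefix (tokens, "SETVAR"))  --> true
print (has_prefix (tokens, "VAR"))     --> false
```

The index is organised by whole tokens, so it can look up words that begin with some text, but it has no quick way to find "var" buried inside "setvariable".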


File types


This (fairly simple) plugin is designed for text files, not Word files, PowerPoint presentations, PDF files and so on. It just reads the files in as pure text, so it would be most suited to .C, .CPP, .H, .XML, .LUA, .TXT and similar file types.

The MUSHclient world files are just straight text, so you could index those if you wanted to (.MCL files).

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Nick Gammon (Australia), Forum Administrator
Date Reply #1 on Mon 05 Nov 2012 02:57 AM (UTC)
Message
Underscores now part of words


After extensive searching I found a little-known feature of SQLite3 that lets you alter the delimiters for an FTS search.

Changing the creation of the database to this:


  -- omit the "tokenize" line to get normal tokenization
  -- that line leaves underscore out of the delimiter list,
  -- so underscores stay part of a word (for C source)
  -- the "X" is there because of SQLITE3 bug:
  -- http://sqlite.1065341.n5.nabble.com/FTS-simple-tokenizer-with-custom-delimeters-td43926.html
  
  dbcheck (db:exec [[
    DROP TABLE IF EXISTS source;
    CREATE VIRTUAL TABLE source USING FTS4(name, contents, size, date_written, 
    tokenize=simple X ' !"#$%&''()*+,-./:;<=>?@[\]^`{|}~'
    );
   ]])



The "tokenize" line specifies what are considered word breaks. I found all the non alpha-numeric characters, and then omitted underscore from the list.
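If you want to check the list (or build a variation, say one that also keeps "$" as part of a word), it can be generated rather than typed. The sketch below walks the printable ASCII characters, keeps everything that is not a letter, digit or underscore, and doubles the single quote so it is safe inside an SQL string. It is written for this explanation, not taken from the plugin; delimiters is a made-up name.

```lua
-- build the delimiter list: printable ASCII minus letters, digits and underscore
local function delimiters ()
  local chars = {}
  for code = 32, 126 do
    local c = string.char (code)
    if not c:match ("[A-Za-z0-9_]") then
      if c == "'" then
        c = "''"  -- a single quote must be doubled inside an SQL string
      end -- if quote
      chars [#chars + 1] = c
    end -- if a delimiter
  end -- for each printable character
  return table.concat (chars)
end -- delimiters

print (delimiters ())
```

The printed string can be compared character by character with the one in the tokenize line above.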

The result is, now the searcher works properly with C (and C++) data names. Also it would work better with variable names and other tokens in MUSHclient world and other files.

The version of the plugin on GitHub has now been updated.

- Nick Gammon



Information and images on this site are licensed under the Creative Commons Attribution 3.0 Australia License unless stated otherwise.