A Bayesian filter in Lua

Posted by Nick Gammon   Australia  (23,158 posts)  Bio   Forum Administrator
Date Fri 15 Sep 2006 11:08 PM (UTC)

Amended on Sat 16 Sep 2006 07:50 AM (UTC) by Nick Gammon

Message
Below is some Lua code (Lua 5.1, but could be adapted to Lua 5.0 with minor changes) that demonstrates analyzing a text string (or file) to see if it is "spam".

The general technique here is to build up a "corpus": a dictionary of words, each with a count of how often it appeared in text that you have previously decided is spam or not spam.

In the code below I generally refer to the words being "red" or "black" as the general idea can be applied to any sets of words that can be divided into groups, for example:


  • spam / not spam
  • profanity, or not
  • English or French
  • Technical writing / everyday speech


You process a file (or batch of text) and indicate whether this particular file (or string) is "red" or "black". Each word in it is counted under that group, and the word's probability in the corpus is then based on how many times it has occurred in each group.

For example, the word "and" or "the" might occur in either group, but "make", "more" and "money" might happen more often in the spam group.

Once the corpus has been seeded, you can then supply any text and have it analyzed to see if the combined probability of every word leads us to believe it is spam or not.

First, the code:


--[[
  Routine to demonstrate dividing sentences into red/black groups
    (eg. spam, not spam)

  Based on "A plan for Spam" by Paul Graham:
 
    http://www.paulgraham.com/spam.html

  Some C excerpts based on publicly released code by Craig Morrison.

    http://sourceforge.net/users/craigbayes/

  Also see "The Spam-Filtering Accuracy Plateau at 99.9% Accuracy and 
            How to Get Past It."   by William S. Yerazunis, PhD

    http://crm114.sourceforge.net/Plateau_Paper.pdf

  Author: Nick Gammon 
  Date:   16th. September 2006

  --]]
  
-- red/black analyzer in Lua
local corpus = {}
local word_regexp = "([%w]+)"

-- read in the corpus file
function ReadCorpus (name)
  for line in io.lines (name) do
    local word, red, black = string.match (line, 
            word_regexp .. ",%s+(%d+),%s+(%d+)")
    if word then
      -- captures are strings, so convert the counts to numbers
      corpus [word] = { red = tonumber (red), black = tonumber (black) }
    end -- corpus line
  end -- read loop
end -- ReadCorpus

-- save the corpus file (or print it to standard output if no name is given)
function WriteCorpus (name)
  local f = io.stdout
  if name then
    f = assert (io.open (name, "w"))
  end -- if writing to a named file
  for k, v in pairs (corpus) do
    f:write (string.format ("%s, %d, %d, %1.3f\n", 
             k, v.red, v.black, 
             CalcProbability (v.red, v.black)))
  end -- writing all
  if name then
    f:close ()  -- close the file we opened, never standard output
  end -- if we opened a file
end -- WriteCorpus

-- add a string to the corpus
function AddToCorpus (s, red, black)
  for w in string.gmatch (s, word_regexp) do
     if corpus [w] then  -- already in corpus?
        corpus [w].red = corpus [w].red + red
        corpus [w].black = corpus [w].black + black
     else  -- add to corpus    
        corpus [w] = { red = red, black = black }
     end 
  end -- for
end -- AddToCorpus

local C1 = 2   -- weightings
local C2 = 1
local weight = 1
local MAX_WEIGHT = 2.0

-- calculate the probability that one word is black (ham)
-- (below 0.5 leans red, above 0.5 leans black)
function CalcProbability (red, black)
  local pResult = ( (black - red) * weight )
                  / (C1 * (black + red + C2) * MAX_WEIGHT)
  return 0.5 + pResult
end -- CalcProbability

-- load a named file into the corpus
function LoadFile (name, red, black)
  -- open the file directly rather than replacing (and closing) the default input
  local f = assert (io.open (name, "r"))
  local s = f:read ("*a")
  f:close ()
  AddToCorpus (s, red, black)
end -- LoadFile

-- load red words (spam)
function LoadRed (name)
  LoadFile (name, 1, 0)
end -- LoadRed

-- load black words (ham)
function LoadBlack (name)
  LoadFile (name, 0, 1)
end -- LoadBlack

--   See: 
--     http://www.paulgraham.com/naivebayes.html
--   For a good explanation of the background, see:
--     http://www.mathpages.com/home/kmath267.htm.

-- calculate the probability a bunch of words are ham (black)
function SetProbability (probs, count)
  local n, inv = 1, 1
  count = math.min (count or #probs, #probs)
  -- numeric loop, as pairs does not guarantee the sorted order is followed
  for i = 1, count do
    local v = probs [i]
    n = n * v
    inv = inv * (1 - v)
  end 
  return  n / (n + inv)
end -- SetProbability

-- analyze a string for its probability of being ham (black)
-- (a low result suggests spam, that is, red)
function Analyze (s)
  local words = {}
  
  -- break string into words, put into local table
  for w in string.gmatch (s, word_regexp) do
    words [w] = true
  end -- for
    
  -- pull out unique words, calculate probability each one is black
  local interesting = {}
  for k, v in pairs (words) do
    if corpus [k] then
      table.insert (interesting, CalcProbability (corpus [k].red, corpus [k].black))
    else
      table.insert (interesting, 0.5)  -- default if not in corpus
    end --  if in corpus or not
  end -- for

  -- sort so the "more interesting" ones are at the top,
  -- that is, either very low probability (eg. 0.1) or very high (eg. 0.9)
      
  table.sort (interesting, function (a, b)
               return math.abs (0.5 - a) > math.abs (0.5 - b)
               end -- sequence function
              )
 -- return SetProbability (interesting, math.min (#interesting, 30))
  return SetProbability (interesting)
end -- Analyze
 
-- analyze a file
function AnalyzeFile (name)
  -- open the file directly rather than replacing (and closing) the default input
  local f = assert (io.open (name, "r"))
  local s = f:read ("*a")
  f:close ()
  print (string.format ("File %s is %2.3f %% likely to be ham",
         name, Analyze (s) * 100))
end -- AnalyzeFile
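
For reference, the two formulas used by CalcProbability and SetProbability in the listing can be written out as follows. This is only a restatement of the code, with the constants substituted as they are set there (C1 = 2, C2 = 1, weight = 1, MAX_WEIGHT = 2.0); the symbols b, r, p and P are my own shorthand, not names from the code.

```latex
% b = black count, r = red count for a word w
p(w) = \frac{1}{2} + \frac{(b - r)\,\mathrm{weight}}{C_1\,(b + r + C_2)\,\mathrm{MAX\_WEIGHT}}
     = \frac{1}{2} + \frac{b - r}{4\,(b + r + 1)}

% combined probability over the words w_1 .. w_n of a message
P = \frac{\prod_{i=1}^{n} p(w_i)}
         {\prod_{i=1}^{n} p(w_i) + \prod_{i=1}^{n} \bigl(1 - p(w_i)\bigr)}
```

Because |b - r| is always smaller than b + r + 1, the per-word value stays strictly between 0.25 and 0.75 with these constants, so no single word can decide the outcome on its own.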

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Nick Gammon   Australia  (23,158 posts)  Bio   Forum Administrator
Date Reply #1 on Fri 15 Sep 2006 11:19 PM (UTC)
Message
Now to test it.

I had a couple of files obtained from spam filter sites, one was spam and one wasn't. So we load up the corpus, and print it:

LoadRed ("spam.txt")
LoadBlack ("ham.txt")
WriteCorpus ()


methods, 1, 0, 0.375
sending, 6, 0, 0.286
what, 8, 1, 0.325
These, 2, 0, 0.333
important, 1, 0, 0.375
that, 26, 13, 0.419
And, 2, 0, 0.333
EMAILS, 1, 0, 0.375
doesn, 0, 1, 0.625
WORKING, 1, 0, 0.375
about, 2, 0, 0.333
confidence, 1, 0, 0.375
MarkUnseen, 0, 1, 0.625
fold, 1, 0, 0.375
constant, 0, 1, 0.625

... and so on


A low probability (third number) indicates probably red (spam), and a high probability (closer to 1) indicates probably black (ham).
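
The figures in that dump can be checked by hand. The sketch below is a standalone copy of the word-probability formula with the same constants as the listing (the name calc_probability and the samples table are just for this illustration), fed with a few of the counts shown above.

```lua
-- Standalone copy of the word-probability formula, using the constants
-- from the listing (C1 = 2, C2 = 1, weight = 1, MAX_WEIGHT = 2.0).
local C1, C2, weight, MAX_WEIGHT = 2, 1, 1, 2.0

local function calc_probability (red, black)
  return 0.5 + ((black - red) * weight)
               / (C1 * (black + red + C2) * MAX_WEIGHT)
end

-- word, red count, black count: taken from the corpus dump
local samples = {
  { "methods", 1, 0 },
  { "sending", 6, 0 },
  { "what",    8, 1 },
  { "doesn",   0, 1 },
}

-- print each word in the same layout that WriteCorpus uses
for _, s in ipairs (samples) do
  print (string.format ("%s, %d, %d, %1.3f", s [1], s [2], s [3],
         calc_probability (s [2], s [3])))
end
```

A word seen only in spam drifts further below 0.5 the more often it is seen (compare "methods" with "sending"), while a word seen only in ham moves above 0.5.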

Now we can analyze those same files:


AnalyzeFile ("spam.txt") --> File spam.txt is 0.000 % likely to be ham
AnalyzeFile ("ham.txt") --> File ham.txt is 100.000 % likely to be ham


So far, so good, it seems to recognise its own test data.

Now we can try smaller strings:


print (string.format (" = %2.3f %% ham", 
Analyze ("You have just received information that") * 100))

--> = 4.084 % ham
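
To see how individual word probabilities turn into one figure like that, here is a minimal sketch of the combining step on its own. It is a standalone copy of the formula in SetProbability; the name combine and the input values are assumptions chosen for illustration.

```lua
-- Combine per-word "ham" probabilities into one overall probability.
local function combine (probs)
  local n, inv = 1, 1
  for _, p in ipairs (probs) do
    n = n * p            -- product of the probabilities
    inv = inv * (1 - p)  -- product of their complements
  end
  return n / (n + inv)
end

-- 0.375 and 0.625 are equally far from 0.5 on opposite sides, so they
-- cancel out and the result is decided by the remaining word (2/7)
local mixed = combine { 0.375, 2/7, 0.625 }
print (string.format ("%1.3f", mixed))

-- twenty mildly spammy words in a row push the result close to zero
local spammy = {}
for i = 1, 20 do
  spammy [i] = 0.375
end
print (combine (spammy))
```

This is consistent with the 0.000 % and 100.000 % results for the two training files: each file contains many words that all lean the same way, so the combined figure ends up at one extreme.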


To save reprocessing the corpus every time, you can save it to disk:


WriteCorpus ("corpus.txt")


And read it back in next time:


ReadCorpus ("corpus.txt")
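
Each line of the saved file has the form "word, red, black, probability". As a rough sketch of what reading one line back involves (using the same pattern as ReadCorpus, with a sample line taken from the dump earlier in this post):

```lua
-- One line as written by WriteCorpus
local line = "sending, 6, 0, 0.286"

-- Same pattern as ReadCorpus: a word, then the red and black counts.
-- The trailing probability is ignored, as it can be recalculated.
local word, red, black = string.match (line, "([%w]+),%s+(%d+),%s+(%d+)")

-- string.match returns strings, so convert the counts to numbers
red, black = tonumber (red), tonumber (black)

print (word, red, black)
```

Lines that do not match the pattern give nil for word, which is why ReadCorpus tests "if word then" before storing anything.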





How is this useful in MUD games you might ask? Well, you could use it to classify any kind of text for which you have examples of each category in advance.

For example, a profanity filter might use that approach to have words in its corpus that generally indicate swearing.

You might also use it in MUSHclient to try to work out whether text that arrived from the MUD was a room description, or list of exits. (A list of exits would probably have words like "north", "east", "out" and so on in it.)
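
As a rough illustration of the exits idea, here is a small self-contained sketch. Everything in it is hypothetical: the training lines are invented rather than taken from a real MUD, the function names are my own, and it lower-cases words before counting them, which the code in the first post does not do. "Red" stands for exit lists and "black" for room descriptions.

```lua
-- Hypothetical training lines; real ones would come from MUD output.
local exit_lines = {
  "Obvious exits: north, east and out",
  "Exits: south west up",
  "You can go north, south or down",
}
local desc_lines = {
  "You are standing in a dusty old library",
  "A large oak table fills the middle of the room",
  "The walls are covered in faded tapestries",
}

local corpus = {}

-- count every word of s as red and/or black (lower-cased: an assumption)
local function add (s, red, black)
  for w in string.gmatch (string.lower (s), "%w+") do
    local entry = corpus [w] or { red = 0, black = 0 }
    entry.red, entry.black = entry.red + red, entry.black + black
    corpus [w] = entry
  end
end

for _, s in ipairs (exit_lines) do add (s, 1, 0) end  -- red = exits
for _, s in ipairs (desc_lines) do add (s, 0, 1) end  -- black = description

-- same formula as CalcProbability with its constants substituted
local function word_probability (red, black)
  return 0.5 + (black - red) / (4 * (black + red + 1))
end

-- probability that s is a room description (black)
local function probability_description (s)
  local seen, n, inv = {}, 1, 1
  for w in string.gmatch (string.lower (s), "%w+") do
    if not seen [w] then  -- use each distinct word once
      seen [w] = true
      local e = corpus [w]
      local p = e and word_probability (e.red, e.black) or 0.5
      n, inv = n * p, inv * (1 - p)
    end
  end
  return n / (n + inv)
end

print (probability_description ("Exits: north and west"))
print (probability_description ("The room is old and dusty"))
```

With this tiny corpus the first line scores well below 0.5 (probably exits) and the second well above it (probably a description). A real trigger would need far more training text than this before its answers could be trusted.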


