Lua utility to compare two directories


Posted by Nick Gammon   Australia  (23,159 posts)  Bio   Forum Administrator
Date Wed 06 Feb 2008 02:53 AM (UTC)

Amended on Wed 06 Feb 2008 03:25 AM (UTC) by Nick Gammon

Message
I recently upgraded my network storage system (where you store files on a network drive rather than on individual PCs). The advantage of such a system is that you can access a file (eg. a photo) from any PC on the network, whether it is Mac, Windows or Linux.

However during the upgrade process I have been copying tens of thousands of files from the old folders to the new ones, and occasionally (after, like, an hour of copying) the copy will fail with some sort of error message.

When this happens the annoying thing is not knowing which files have been copied, and which ones still need copying, without manually inspecting hundreds of directories.

Hence, a quick Lua utility was born. (I know, there are no doubt many nice GUI utilities that will do this, but where is the fun if you can't re-invent the wheel from time to time?).

I wanted this to run under stand-alone Lua (that is, not under MUSHclient), so the first thing was to make a "scan the directory" utility. This reads a disk directory and returns a table with one entry for each file or folder in it.

This utility already exists in the MUSHclient source:

http://www.gammon.com.au/scripts/doc.php?lua=utils.readdir

Thus, I got the source and pulled out the relevant bits. I also left in the code to calculate MD5 hashes, in case one day I wanted to actually hash each file and check they were absolutely identical.
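As a side note on the MD5 function: going by the C source in this post, it pushes the raw 16 bytes of the digest rather than printable text. A minimal sketch of a helper to turn such a binary string into hex follows; the name to_hex is made up for illustration and is not part of lua_utils.

```lua
-- Convert a binary string (such as a 16-byte MD5 digest) to lower-case hex.
-- 'to_hex' is a hypothetical helper name, not something lua_utils provides.
function to_hex (s)
  local parts = {}
  for i = 1, #s do
    parts [i] = string.format ("%02x", s:byte (i))
  end -- for each byte
  return table.concat (parts)
end -- to_hex

-- Made-up 4-byte value, just to show the conversion
print (to_hex ("\0\255AZ"))   --> 00ff415a
```

With the DLL loaded, something like print (to_hex (utils.md5 ("some text"))) should then show the digest in the usual 32-character form.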

The directory scanner was placed in a file lua_utils.c as follows:


// To compile:  gcc  -mno-cygwin -shared -o lua_utils.dll lua_utils.c md5.c -llua

#define LUA_BUILD_AS_DLL
#define LUA_LIB

#include "lua.h"
#include "lauxlib.h"

#include "md5.h"
#include <io.h>
#include <errno.h>
#include <stdint.h>   // for intptr_t

typedef unsigned char UC;

// MD5 128-bit hashing algorithm
// see: http://www.cr0.net:8040/code/crypto/md5/

static int utils_md5 (lua_State *L)
  {
  unsigned char digest [16];
  // get text to hash
  size_t textLength;
  const char * text = luaL_checklstring (L, 1, &textLength);

  md5_context ctx;
  md5_starts (&ctx);
  md5_update (&ctx, (UC *) text, textLength);
  md5_finish (&ctx, digest);

  // push the raw 16-byte digest (binary, not hex)
  lua_pushlstring (L, (const char *) digest, sizeof digest);

  return 1;  // number of result fields
  } // end of utils_md5

// make number table item
static void MakeNumberTableItem (lua_State *L, const char * name, const double n)
  {
  lua_pushstring (L, name);
  lua_pushnumber (L, n);
  lua_rawset(L, -3);
  }

// make boolean table item (only stored if true)
static void MakeBoolTableItem (lua_State *L, const char * name, const int b)
  {
  if (b)
    {
    lua_pushstring (L, name);
    lua_pushboolean (L, b != 0);
    lua_rawset(L, -3);
    }
  }

static int getdirectory (lua_State *L)
  {
  // get directory name (eg. C:\mushclient\*.doc)
  size_t dirLength;
  const char * dirname = luaL_checklstring (L, 1, &dirLength);

  struct _finddatai64_t fdata;

  // the handle is pointer-sized, so an int could truncate it on 64-bit builds
  intptr_t h = _findfirsti64 (dirname, &fdata); // get handle

  if (h == -1)    // no good?
    {
    lua_pushnil (L);

    switch (errno)
      {
      case EINVAL: lua_pushliteral (L, "Invalid filename specification"); break;
      default:     lua_pushliteral (L, "File specification could not be matched"); break;
      }
    return 2;   // return nil, error message
    }

  lua_newtable(L);    // table of entries
  
  do
    {

    lua_pushstring (L, fdata.name); // file name (will be key)
    lua_newtable(L);                // table of attributes

    // inside this new table put the file attributes

    MakeNumberTableItem (L, "size", (double) fdata.size);
    if (fdata.time_create != -1)    // except FAT
      MakeNumberTableItem (L, "create_time", fdata.time_create);
    if (fdata.time_access != -1)    // except FAT
      MakeNumberTableItem (L, "access_time", fdata.time_access);
    MakeNumberTableItem (L, "write_time",  fdata.time_write);
    MakeBoolTableItem   (L, "archive", fdata.attrib & _A_ARCH);
    MakeBoolTableItem   (L, "hidden", fdata.attrib & _A_HIDDEN);
    MakeBoolTableItem   (L, "normal", fdata.attrib & _A_NORMAL);
    MakeBoolTableItem   (L, "readonly", fdata.attrib & _A_RDONLY);
    MakeBoolTableItem   (L, "directory", fdata.attrib & _A_SUBDIR);
    MakeBoolTableItem   (L, "system", fdata.attrib & _A_SYSTEM);

    lua_rawset(L, -3);              // set key of table item (ie. file name)

    } while (_findnexti64 ( h, &fdata ) == 0);

  _findclose  (h);

  return 1;  // one table of entries
  } // end of getdirectory

// table of operations
static const struct luaL_reg utilslib [] = 
  {

  {"md5", utils_md5},
  {"readdir", getdirectory},

  {NULL, NULL}
  };

// register library

LUALIB_API int luaopen_utils(lua_State *L)
  {
  luaL_register (L, "utils", utilslib);
  return 1;
  }



The comment on the first line shows what to type under Cygwin to compile this file and get a DLL.
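For reference, the table that readdir hands back to Lua has one entry per name, keyed by file name, and each value is a table of attributes. The sketch below uses a hand-written stand-in for that table (the file names and numbers are invented) so that it runs without the DLL. It relies on the fact, visible in MakeBoolTableItem above, that boolean attributes such as directory are only present when they are true. The function name summarize is made up for this example.

```lua
-- Hand-written stand-in for what utils.readdir ("c:/example/*") might return.
-- The names and values here are invented for illustration.
local listing = {
  ["."]         = { size = 0,     write_time = 1202266380, directory = true },
  [".."]        = { size = 0,     write_time = 1202266380, directory = true },
  ["notes.txt"] = { size = 1234,  write_time = 1202266380, archive = true },
  ["photos"]    = { size = 0,     write_time = 1202266380, directory = true },
  ["setup.exe"] = { size = 98765, write_time = 1202266380, readonly = true },
}

-- Count plain files and subdirectories, skipping the "." and ".." entries.
function summarize (t)
  local files, folders, bytes = 0, 0, 0
  for name, attr in pairs (t) do
    if name ~= "." and name ~= ".." then
      if attr.directory then   -- absent (nil) for ordinary files
        folders = folders + 1
      else
        files = files + 1
        bytes = bytes + attr.size
      end -- if directory
    end -- not special directories
  end -- for each entry
  return files, folders, bytes
end -- summarize

print (summarize (listing))
```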


Next we need the actual Lua utility to scan the directories and report on what it finds:


-- Directory scanner 
-- Author: Nick Gammon
-- Date: 6 February 2008

ORIGINAL_ROOT = "x:/"
ORIGINAL_PATH = ""

COPY_ROOT = "y:/"
COPY_PATH = ORIGINAL_PATH

RESULTS_FILE = "results.txt"


assert (package.loadlib ("lua_utils.dll", "luaopen_utils")) ()

-- root is not stored (eg. z:/)
-- path is directory under root (eg. documents)
-- store is table to put results in
function process_dir (root, path, store)

  local filecount, foldercount, bytes = 0, 1, 0  -- we have one folder here
  
  print ("  -->", root .. path, "...")
  
  -- don't add slash to empty name
  local path_with_slash = path
  if path ~= "" then
    path_with_slash = path .. "/"
  end -- if slash needed

  local t = assert (utils.readdir (root .. path_with_slash .. "*"))
  
  for k, v in pairs (t) do
   
    if k ~= "." and k ~= ".." then
     
      -- recurse if directory
      if v.directory then
        local a, b, c = process_dir (root, path_with_slash .. k, store)
        filecount = filecount + a
        foldercount = foldercount + b
        bytes = bytes + c
      else
        store [path_with_slash .. k] = v.size 
        filecount = filecount + 1
        bytes = bytes + v.size
      end -- not directory
    
    end -- not special directories
  
  end -- each file

  return filecount, foldercount, bytes
end -- function process_dir

function show_table (t, heading, f)
  
  f:write (string.rep ("-", 70), "\n")
  f:write (heading, "\n")
  f:write "\n"
  
  if next (t) == nil then
    f:write " (none)\n"
  else
    for k, v in ipairs (t) do
      f:write  (" " .. v .. "\n")
    end -- for loop
  end -- if empty
  
  f:write  "\n"
end -- show_table

local tstart = os.time ()

local original = {}
local copy = {}

-- do original files
local original_count, original_folders, original_size = process_dir (ORIGINAL_ROOT, ORIGINAL_PATH, original)

-- do my supposed copy
local copy_count, copy_folders, copy_size = process_dir (COPY_ROOT, COPY_PATH, copy)

-- check all OK

local not_in_copy = {}
local not_in_original = {}
local wrong_size = {}

for k, v in pairs (original) do
  if copy [k] then
  
    -- check size same
    if v ~= copy [k] then
      table.insert (wrong_size, k)
    end -- wrong size
    
    -- remove from both tables - this file exists in both places
    original [k] = nil
    copy [k] = nil
  else
  
    -- found in original list but not in the copy
    table.insert (not_in_copy, k)
  end
end -- for loop

-- any left over were in the copy but weren't in the original
for k, v in pairs (copy) do
  table.insert (not_in_original, k)
end -- for loop

print "\n\nScanning done.\n\n"

print ("Original file count =", original_count)
print ("Original folder count =", original_folders)
print ("Bytes in original files =", original_size)

print ("Copy file count =", copy_count)
print ("Copy folder count =", copy_folders)
print ("Bytes in copies =", copy_size)

print ("\nDifference in file count =", original_count - copy_count)
print ("\nDifference in folder count =", original_folders - copy_folders)
print ("\nDifference in file sizes =", original_size - copy_size, "(bytes)")

print "\n\nSorting ...\n\n"

-- get in order to make scanning easier
table.sort (not_in_copy)
table.sort (not_in_original)
table.sort (wrong_size)

-- file for the results
local f = assert (io.open (RESULTS_FILE, "w"))  -- leaves the default output alone

f:write ("Analysis of original directory: ", ORIGINAL_ROOT, ORIGINAL_PATH, "\n")
f:write ("Original directory had ", 
          original_count, " files, ", 
          original_folders, " folders, ", 
          original_size, " bytes.\n\n")
f:write ("Compared to copy directory:     ", COPY_ROOT, COPY_PATH, "\n")
f:write ("Copy directory had ", 
          copy_count, " files, ", 
          copy_folders, " folders, ", 
          copy_size, " bytes.\n")



print "\n\nAnalyzing...\n\n"

show_table (not_in_copy , "Files not in the copy (" .. COPY_ROOT .. COPY_PATH .. "):", f)
show_table (not_in_original , "Files not in the original (" .. ORIGINAL_ROOT .. ORIGINAL_PATH .. "):", f)
show_table (wrong_size , "Files which are different sizes:", f)

f:close ()  -- close that file now

print ("Done. Results in file:", RESULTS_FILE)

local tend = os.time ()
print ("Time taken for scan = " .. os.difftime (tend , tstart) .. " second(s).")




I saved this as dirscan.lua. (To use it, just type: lua dirscan.lua)

There are a few constants in upper case at the start of this file. These control what directories are scanned. The "root" ones (ORIGINAL_ROOT and COPY_ROOT) are intended to be the parts of the file system that will be different (eg. x:/somedir/somefile and y:/somedir/somefile). In this case the directory and file names are the same, but the x: and y: indicate we are looking at different drives.

The next part (ORIGINAL_PATH) is the directory to start in (eg. "documents", "music", "photos" etc.). COPY_PATH is set to the same value, so the same subdirectory is scanned under both roots.

Finally RESULTS_FILE is the name of the file to write results to. A file is used in case the output is so lengthy it scrolls off the screen and disappears.
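To make the role of the root and path constants concrete, here is a small sketch of how the wildcard passed to the directory reader is put together. It mirrors the logic inside process_dir, but the function name make_pattern is made up for this example.

```lua
-- Build the wildcard handed to the directory reader for a given root and path.
-- An empty path must not gain a slash, or "x:/" would become "x://*".
-- 'make_pattern' is a hypothetical name used only in this sketch.
function make_pattern (root, path)
  local path_with_slash = path
  if path ~= "" then
    path_with_slash = path .. "/"
  end -- if slash needed
  return root .. path_with_slash .. "*"
end -- make_pattern

print (make_pattern ("x:/", ""))             --> x:/*
print (make_pattern ("x:/", "photos"))       --> x:/photos/*
print (make_pattern ("y:/", "photos/2007"))  --> y:/photos/2007/*
```

Because only the root differs between the two scans, the names stored in the two tables (eg. photos/2007/somefile) can be compared directly.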

What the utility does is first scan ORIGINAL_ROOT/ORIGINAL_PATH and build a list of every file in it by recursing when it hits a subdirectory. This is stored in a table (original).

Then it scans COPY_ROOT/COPY_PATH and builds a second table (copy). During the scan it counts files, folders and file sizes.

Once both scans are finished it goes through the table of original files (by name) and checks that each one is in the "copy" table. It also checks the file size is the same.

The names of files that are not present in the copy are saved in another table (ready for sorting later on). Each matching file is deleted from both tables, ready for a check on which files are present in the copy but not the original.

Then a second scan is done of the copy table to see if some files are in the copy directory tree but not the original.

Finally a report is done, showing which files are only on one side or the other, or are the wrong sizes. The reports are sorted into alphabetic order, to make it obvious if a whole lot of files from a single directory are missing.
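The comparison step can be tried on its own with a couple of small hand-made tables. The sketch below is a variation on what dirscan.lua does rather than a copy of it: instead of deleting matched entries from both tables, it does a second membership test, so the two input tables are left untouched. The function name compare_listings and the sample data are made up.

```lua
-- Compare two tables mapping relative file name -> size in bytes.
-- Returns three sorted lists: missing from the copy, missing from the
-- original, and present in both but with different sizes.
-- Unlike dirscan.lua, this version leaves its two arguments unmodified.
function compare_listings (original, copy)
  local not_in_copy, not_in_original, wrong_size = {}, {}, {}

  for name, size in pairs (original) do
    if copy [name] == nil then
      table.insert (not_in_copy, name)
    elseif copy [name] ~= size then
      table.insert (wrong_size, name)
    end -- if
  end -- for each original file

  for name in pairs (copy) do
    if original [name] == nil then
      table.insert (not_in_original, name)
    end -- if
  end -- for each copied file

  table.sort (not_in_copy)
  table.sort (not_in_original)
  table.sort (wrong_size)
  return not_in_copy, not_in_original, wrong_size
end -- compare_listings

-- Made-up sample data
local original = { ["a.txt"] = 10, ["docs/b.txt"] = 20, ["docs/c.txt"] = 30 }
local copy     = { ["a.txt"] = 10, ["docs/b.txt"] = 25, ["extra.txt"] = 5 }

local missing, extra, differ = compare_listings (original, copy)
print ("Not in copy:     ", table.concat (missing, ", "))
print ("Not in original: ", table.concat (extra, ", "))
print ("Different size:  ", table.concat (differ, ", "))
```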

I found this a quick way of verifying a copy had been done without missing anything, or in the case of an error message, to work out where to start copying from.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Nick Gammon   Australia  (23,159 posts)  Bio   Forum Administrator
Date Reply #1 on Wed 06 Feb 2008 03:06 AM (UTC)
Message
If anyone wants to play with this without having to recompile, this download makes the relevant files available:

http://www.gammon.com.au/files/utils/dirscan.zip (98 Kb)

Inside that is the source for lua_utils.c, the file dirscan.lua, which does the actual scanning, and the precompiled lua_utils.dll.

All you need to make it work is an installed copy of Lua 5.1. One copy is at:

http://www.gammon.com.au/files/mushclient/lua5.1_extras/lua5.1.zip




Posted by Nick Gammon   Australia  (23,159 posts)  Bio   Forum Administrator
Date Reply #2 on Wed 06 Feb 2008 03:13 AM (UTC)
Message
You could take the general idea of recursive directory scanning, and modify it to find something more specific (eg. all .jpg files anywhere on your hard disk).
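As a rough sketch of that idea: once process_dir has filled a table of file names, picking out one file type is a simple filter. The function name find_by_extension and the sample names below are made up, and the match ignores case since Windows file names are not case-sensitive.

```lua
-- Return a sorted list of the names in 'store' that end with 'extension'
-- (compared without regard to case). 'store' maps file name -> size,
-- the same shape that process_dir fills in.
-- 'find_by_extension' is a hypothetical name used only in this sketch.
function find_by_extension (store, extension)
  local wanted = extension:lower ()
  local found = {}
  for name in pairs (store) do
    if name:lower ():sub (-#wanted) == wanted then
      table.insert (found, name)
    end -- if matches
  end -- for each file
  table.sort (found)
  return found
end -- find_by_extension

-- Made-up sample data
local store = {
  ["holiday/beach.JPG"] = 2048000,
  ["holiday/notes.txt"] = 512,
  ["cat.jpg"]           = 1024000,
}

for _, name in ipairs (find_by_extension (store, ".jpg")) do
  print (name)
end -- for
```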

Or, by taking MD5 checksums of each file (which would take a while of course) you could locate files which were identical, but perhaps salted away in different places.
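A sketch of how such a duplicate finder might group files is below. To keep it runnable without the DLL or any real files, the file reader and the hash are passed in as functions, and the demonstration uses stand-ins for both (fake contents, and a "hash" that just returns its input). The names find_duplicates, fake_read and fake_hash are all made up.

```lua
-- Group file names by a digest of their contents and return only the groups
-- with more than one member. 'read_file' and 'hash' are supplied by the
-- caller, so the grouping logic can be tried without touching the disk.
function find_duplicates (names, read_file, hash)
  local by_digest = {}
  for _, name in ipairs (names) do
    local digest = hash (read_file (name))
    by_digest [digest] = by_digest [digest] or {}
    table.insert (by_digest [digest], name)
  end -- for each file

  local duplicates = {}
  for _, group in pairs (by_digest) do
    if #group > 1 then
      table.sort (group)
      table.insert (duplicates, group)
    end -- if more than one file has this digest
  end -- for each digest

  -- order the groups by their first member so the output is predictable
  table.sort (duplicates, function (a, b) return a [1] < b [1] end)
  return duplicates
end -- find_duplicates

-- Stand-ins for illustration only: fake file contents and an identity "hash".
local contents = { ["a/x.jpg"] = "AAA", ["b/y.jpg"] = "BBB", ["c/copy of x.jpg"] = "AAA" }
local function fake_read (name) return contents [name] end
local function fake_hash (data) return data end

local names = { "a/x.jpg", "b/y.jpg", "c/copy of x.jpg" }
for _, group in ipairs (find_duplicates (names, fake_read, fake_hash)) do
  print (table.concat (group, "  ==  "))
end -- for
```

In real use the hash would presumably be utils.md5 from lua_utils.dll, and read_file would read each file in binary mode. Since that md5 function takes the whole text as one string, very large files would need to fit in memory.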


Posted by Nick Gammon   Australia  (23,159 posts)  Bio   Forum Administrator
Date Reply #3 on Wed 06 Feb 2008 03:16 AM (UTC)

Amended on Wed 06 Feb 2008 03:17 AM (UTC) by Nick Gammon

Message
An example of output is (in results.txt):


Analysis of original directory: x:/general/
Original directory had 20348 files, 1731 folders, 15146925697 bytes.

Compared to copy directory:     z:/
Copy directory had 8560 files, 788 folders, 8111655461 bytes.
----------------------------------------------------------------------
Files not in the copy (z:/):

 AreaEditor134.zip
 BitTorrent-6.0.exe
 GoogleEarthMac.dmg
 HP Printer Installer.dmg

... and so on for quite a few lines ...

----------------------------------------------------------------------
Files not in the original (x:/general/):

 (none)

----------------------------------------------------------------------
Files which are different sizes:

 (none)




Meanwhile, during the scan, on the console (ie. stdout) you see the name of each directory as it is processed, followed by a summary:


Scanning done.


Original file count =   20348
Original folder count = 1731
Bytes in original files =       15146925697
Copy file count =       8560
Copy folder count =     788
Bytes in copies =       8111655461

Difference in file count =      11788

Difference in folder count =    943

Difference in file sizes =      7035270236      (bytes)


Sorting ...




Analyzing...


Done. Results in file:  results.txt
Time taken for scan = 104 second(s).



In this example you can see that the copy terminated prematurely, and 11788 files were not copied. Looking inside results.txt shows where to start copying to get those missing files.



Posted by Nick Gammon   Australia  (23,159 posts)  Bio   Forum Administrator
Date Reply #4 on Wed 06 Feb 2008 04:18 AM (UTC)
Message
There is a minor problem in the output file: the file names are sorted in alphabetic order, but case-sensitively, whereas Windows lists files without regard to case.

To fix that (and demonstrate the power of Lua) you can replace the 3 lines that do table.sort with:


local function sort_function (k1, k2)
  return k1:lower () < k2:lower ()
end -- sort_function

-- get in order to make scanning easier (case-independent)

table.sort (not_in_copy, sort_function)
table.sort (not_in_original, sort_function)
table.sort (wrong_size, sort_function)


This still does a sort, but supplies a custom comparison function. The comparison function forces its arguments to lower case before comparing, thus making the sort case independent.
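To see the difference, here is a small stand-alone demonstration with made-up file names. The comparison is the same as in sort_function above, but given the name compare_ignoring_case (chosen just for this example) so it can be run by itself.

```lua
-- Same comparison as sort_function, under a made-up name for this demo
function compare_ignoring_case (k1, k2)
  return k1:lower () < k2:lower ()
end -- compare_ignoring_case

local names = { "cherry.txt", "Banana.txt", "apple.txt" }

-- default sort: upper-case letters come before lower-case ones
table.sort (names)
print (table.concat (names, ", "))   --> Banana.txt, apple.txt, cherry.txt

-- case-independent sort
table.sort (names, compare_ignoring_case)
print (table.concat (names, ", "))   --> apple.txt, Banana.txt, cherry.txt
```

The first result assumes the default "C" locale, which is what stand-alone Lua starts in.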



Information and images on this site are licensed under the Creative Commons Attribution 3.0 Australia License unless stated otherwise.