Wednesday, June 1, 2011

Finding duplicate files in a dired buffer



picture by Donald MacLeod

This is an example of programming emacs in emacs-lisp, just to give an idea of what you can put together in an hour or two. I was looking at a dired buffer with a bunch of photos in it, and some were the same photo that I'd downloaded twice. So I started thinking about writing a utility in emacs to automatically find and remove the duplicate files. In this post I'll just show the code for finding the duplicates and displaying their filenames in a buffer.

I've put the source on google code.

After downloading, you can load the source into emacs and call `eval-buffer', then open up a dired buffer to try it out. For this to be useful you need some duplicated files, so make some if you need to.

Mark the files you want to check for duplicates. For example, to mark all jpg files you would type % m (mark files matching a regexp) and enter .*\.jpg

Now execute the command `dired-show-marked-duplicate-files' and after a short delay (in my test, 80 jpg photos took about 5 seconds) you'll see a buffer called 'Duplicated files' containing a list of the files that have identical contents.

The next step for this little project will be to give you an interactive way to delete the duplicated files. I haven't decided quite how I'd like that to work, so drop me an email if you have an idea. I've been thinking about perhaps resetting which files are marked so that only the duplicates are marked. At that point you could hit R to move them to another spot, or D to delete them.

Now some comments about the code involved...

Most of the work is done in the function `dired-show-marked-duplicate-files'. The first line, "(interactive)", makes it an interactive function, meaning the user of emacs can invoke it directly (for example via M-x).

"(if (eq major-mode 'dired-mode)" checks that we're in the right kind of buffer, because it makes no sense to run this in another mode.

In order to find the duplicate files I just need to walk the list of marked files, generate the md5 value of the contents of each one and add it to a hash table. The keys in the hash table will be the md5 checksums, and the values will be lists of the files with that md5. Once we've done that, finding duplicates is a simple matter of walking the hash table keys and displaying any entry whose value has multiple files.
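In outline, that collection step could look something like this (a sketch, not the exact code from the post; `md5-map' and `md5-file' are the names used later, and the enclosing let bindings are assumed):

(dolist (filename (dired-get-marked-files))
  (let ((checksum (md5-file filename)))
    ;; Prepend this filename to the list already stored under its md5,
    ;; or start a new list if this is the first file with that checksum
    ;; (gethash returns nil for a missing key).
    (puthash checksum (cons filename (gethash checksum md5-map)) md5-map)))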

"(let ((md5-map (make-hash-table :test 'equal :size 40)))" creates the hash table. The :test 'equal matters because the keys are md5 strings, and strings need to be compared by content (the default test, eql, would compare them by identity).

"(let ((filenames (dired-get-marked-files)))" gets the marked files as a list of filenames.

The next little bit of code just stores the item in the hash table after getting the md5. There's no built-in function in emacs to get the md5 of a file, but you can get the md5 of a string or buffer, so I wrote a helper function that reads the contents of the file into a temporary buffer first.

(defun md5-file (filename)
  "Open FILENAME, load it into a buffer and generate the md5 of its contents."
  (interactive "f")
  (with-temp-buffer
    (insert-file-contents filename)
    (md5 (current-buffer))))

Finally I want to display the results, so I create a buffer and then use `maphash' (which walks the entries of a hash table, calling a function on each key/value pair) with a helper function `show-duplicate' which simply writes the filenames from the hash table entry into that buffer.
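A sketch of that reporting step, assuming `md5-map' is the hash table bound earlier (in the real code it's in scope inside the enclosing let; the buffer name is the one the post uses):

(defun show-duplicate (key value)
  "If more than one file is listed in VALUE, insert the filenames."
  (when (> (length value) 1)
    (dolist (filename value)
      (insert filename "\n"))
    ;; Blank line between groups of duplicates.
    (insert "\n")))

(with-current-buffer (get-buffer-create "Duplicated files")
  (erase-buffer)
  (maphash 'show-duplicate md5-map)
  (display-buffer (current-buffer)))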

7 comments:

Alex said...

Nice example snippet of code that :-)

Anonymous said...

You could use (shell-command-to-string "md5sum FILENAME") for a huge speed up.

No need to copy everything in a buffer.

Justin said...

@anonymous I thought about that, but I want it to work out of the box in any environment, and I don't care about speed

Mathias said...

I would definitely remove all marks and then mark every dupe as you were thinking about. I would use the normal mark (what you get with `m', and what is shown with an asterisk) and not "delete marks", for having all options open (do I want to move, delete, rename or what).

Justin said...

@Mathias Thanks for the feedback! It seems like the way to go.

mark bun said...

I'm using "Duplicate Files Deleter", an easy fix for duplicates.

Danewan Williams said...

In this condition I used DuplicateFilesDeleter effectively. This software will let you get a huge amount of space for your use by deleting the files that were at multiple locations.