Wednesday, June 1, 2011

Finding duplicate files in a dired buffer



picture by Donald MacLeod

This is an example of programming emacs in emacs-lisp, just to give an idea of what you can put together in an hour or two. I was looking at a dired buffer with a bunch of photos in it, and some were the same photo that I'd downloaded twice. So I started thinking about writing a utility in emacs to automatically find and remove the duplicate files. In this post I'll just show the code for finding the files and displaying their filenames in a buffer.

I've put the source on google code.

After downloading you can load the source into emacs and call `eval-buffer', then open up a dired buffer to try it out. For this to be useful you need some duplicated files, so make some if you need to.

Mark the files you want to check for duplicates. For example, to mark all jpg files, type % m (mark files matching a regexp) and then enter .*\.jpg

Now execute the command `dired-show-marked-duplicate-files' and after a short delay (in my test 80 jpg photos took about 5 seconds) you'll see a buffer called 'Duplicated files' which contains a list of the files which have the same contents.

Next steps for this little project will be to give you an interactive way to delete the duplicated files. I haven't decided quite how I'd like that to work, so drop me an email if you have an idea. I've been thinking about perhaps resetting which files are marked so that only the duplicates are marked. At that point you can hit R to move them to another spot, or delete them with D (x only deletes files flagged for deletion with d).

Now some comments about the code involved...

Most of the work is done in the function dired-show-marked-duplicate-files. The first line, "(interactive)", makes it a command, meaning the user of emacs can invoke it with M-x.

"(if (eq major-mode 'dired-mode)" will check that we're in the right kind of buffer, because it makes no sense to run this in another mode.

In order to find the duplicate files I just need to walk the list of marked files, generate the md5 value of the contents of each one and add it to a hash table. The keys in the hash table will be the md5, and the values will be a list of files with that md5. Once we've done that, finding duplicates is a simple matter of walking the hash table keys and displaying any where the value has multiple entries.

"(let ((md5-map (make-hash-table :test 'equal :size 40)))" creates the hash table, using 'equal as the test so that the md5 strings used as keys are compared by content.

"(let ((filenames (dired-get-marked-files)))" gets the marked files as a list of filenames.

The next little bit of code just stores the item in the hash table after getting the md5. There's no function in emacs to get the md5 of a file directly, but `md5' works on a string or a buffer, so I wrote a helper function that reads the contents of a file into a temporary buffer first.

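(defun md5-file (filename)
  "Open FILENAME, load it into a buffer and generate the md5 of its contents."
  (interactive "f")
  ;; Read the file into a scratch buffer that is discarded afterwards
  (with-temp-buffer
    (insert-file-contents filename)
    ;; `md5' accepts a buffer as well as a string
    (md5 (current-buffer))))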

Finally I want to display the results, so I create a buffer and then use maphash (walks the keys of a hash table executing a function on each) with a helper function `show-duplicate' which simply writes the values of the hash table entry into that buffer.
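
For readers who don't have emacs handy, the same idea (hash every file, group the names by hash, keep the groups with more than one entry) can be sketched in Python. This is only an illustration of the approach described above, not the emacs-lisp source; the function names md5_of_file and find_duplicates are made up for this sketch.

```python
import hashlib
from collections import defaultdict


def md5_of_file(path):
    """Return the hex MD5 digest of the contents of the file at PATH."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large photos are not loaded into memory at once
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Group PATHS by content hash; return only the groups with 2+ files."""
    by_hash = defaultdict(list)  # md5 string -> list of filenames
    for path in paths:
        by_hash[md5_of_file(path)].append(path)
    return [group for group in by_hash.values() if len(group) > 1]
```

Calling find_duplicates on a list of filenames returns a list of lists, one per set of identical files, which plays the role of the 'Duplicated files' buffer.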

Tuesday, April 26, 2011

Programmer tips for Mac OSX

Some tips for programmers on the Mac.

emacs: the best place to get emacs for Mac seems to be here http://emacsformacosx.com/ which is also the most no nonsense website design ever

Not really a Mac tip, but I stick my .emacs configuration file in a Dropbox folder, along with any emacs libraries and emacs lisp code I write. Then wherever I install emacs I make a simple .emacs that points to the one in the Dropbox folder. This also forces me to make sure any platform-specific emacs stuff is properly handled.

Clipboard: copy and paste between the terminal and other apps can be done with pbcopy and pbpaste. For example, if you have a long complicated command line that you want to email to yourself, just do:

echo "long complicated bash command line you don't want to retype" | pbcopy

And then you can Command-V that into your email window. Going the other way is just as simple; Command-C the text you want and pop it into the terminal window with pbpaste.

Open: If you want to open an application from the command line you can do it like this:

open -a SomeApp /Users/yourname/SomeFile.hai

You can open a folder in Finder:

open /Folder/

or

open /Folder/SomeFile.hai

to open that file with its default application.

Check out the help with 'man open' to see other stuff, like how to pipe stdout into an application.

Finally, check out this guy's OpenTerminalHere script. This pops an icon in Finder that lets you open a terminal window in the highlighted folder.








MovieRatings

A little side project I did when learning about Clojure was to grab movie ratings from Rotten Tomatoes, which I did a post about here:


This is just an update that I've posted the whole leiningen project onto github




Monday, April 25, 2011

Talking to mysql from Python on Mac OS X 10.6




image by Sam Pullara

Here's a problem I couldn't solve with Google, although it seems to be a moving target so YMMV.

I wanted to do some work driving a MySQL database with Python. On Windows and Linux I've used MySQLdb so I decided to do the same on Mac.

For reference, I got mysql (client and server) through MacPorts.

mysql5 --version
mysql5 Ver 14.14 Distrib 5.1.45, for apple-darwin10.4.0 (i386) using readline 6.1

and Python is the stock version:

Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin

First download from here http://sourceforge.net/projects/mysql-python/ and extract the file somewhere...

mkdir ~/pythondb
cd ~/pythondb
tar -vxf ~/Downloads/MySQL-python-1.2.3.tar.gz

Then you need to open up the site.cfg file and make the change below. Your mysql_config5 may be in a different spot. You can find out with the command 'which mysql_config5'.

# The path to mysql_config.
# Only use this if mysql_config is not on your PATH, or you have some weird
# setup that requires it.
mysql_config = /opt/local/bin/mysql_config5

Now execute the following:

python setup.py build
python setup.py install

If everything works you can now import MySQLdb in your python program and start interacting with mysql.
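
As a quick smoke test, something like the following sketch should work once the module is installed. The helper server_version only relies on the standard DB-API cursor calls; the connection details in demo (host, user, password and database name) are placeholders I made up, so substitute your own.

```python
def server_version(conn):
    """Return the server version string using a DB-API connection."""
    cursor = conn.cursor()
    try:
        cursor.execute("SELECT VERSION()")
        row = cursor.fetchone()
        return row[0]
    finally:
        cursor.close()


def demo():
    """Connect with MySQLdb and print the server version."""
    import MySQLdb  # the module built and installed above

    # Hypothetical connection details: replace with your own
    conn = MySQLdb.connect(host="localhost", user="testuser",
                           passwd="secret", db="test")
    try:
        print(server_version(conn))
    finally:
        conn.close()
```

Calling demo() from the interpreter should print a version string matching the one reported by mysql5 --version.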

Saturday, January 15, 2011

Grabbing Rotten Tomatoes movie ratings with Clojure


Flickr pic by Gammelmark

Currently I'm teaching myself Clojure from Stuart Halloway's excellent book Programming Clojure. Here's my first program that does something: a simple web page scraper to get the critics and audience ratings for movies off Rotten Tomatoes. Here's how it looks at the REPL:

rottentomatoes.core> (pmap-get-movie-ratings "lord of the rings")
movie url: http://www.rottentomatoes.com/m/lord_of_the_rings_the_return_of_the_king/
Audience 83
Critics 94
movie url: http://www.rottentomatoes.com/m/lord_of_the_rings_the_fellowship_of_the_ring/
Audience 92
Critics 92
movie url: http://www.rottentomatoes.com/m/lord_of_the_rings_the_two_towers/
Audience 92
Critics 96
movie url: http://www.rottentomatoes.com/m/lord_of_the_rings/
Audience 74
Critics 50
movie url: http://www.rottentomatoes.com/m/master_of_the_rings_the_unauthorized_story_behind_jrr_tolkiens_the_lord_of_the_rings/
Audience 34
Critics null
movie url: http://www.rottentomatoes.com/m/jrr-tolkien-and-the-birth-of-the-lord-of-the-rings/
Audience 93
Critics null
movie url: http://www.rottentomatoes.com/m/jrr_tolkien_and_the_birth_of_the_lord_of_the_rings/
Audience 32
Critics null
movie url: http://www.rottentomatoes.com/m/more_at_imdbpro_creating_the_lord_of_the_rings_symphony_a_composers_journey_through_middle_earth/
Audience 100
Critics null
nil


I use leiningen to develop with Clojure (it's like Maven for Java), so if you want to build the project here's my project configuration that includes the dependencies used. I'm using swank-clojure which enables the REPL to function with emacs slime. http.async.client is a clojure API that builds on Netty and I use that for the GET requests to the Rotten Tomatoes server.


(defproject rottentomatoes "1.0.0-SNAPSHOT"
  :description "Clojure code to grab movie ratings from Rotten Tomatoes"
  :dependencies [[org.clojure/clojure "1.2.0"]
                 [org.clojure/clojure-contrib "1.2.0"]
                 [http.async.client "0.2.1"]]
  :main rottentomatoes.core
  :dev-dependencies [[swank-clojure "1.2.1"]])

And here's the code:

(ns rottentomatoes.core
  (:gen-class)
  (:require
   [clojure.contrib.str-utils2 :as s]
   [http.async.client :as c]))

(import [java.net URLEncoder]
        [java.lang.Character])

(def *base-url* "http://www.rottentomatoes.com")
(def *search-end-point* "/search/full_search.php?search=")

(defn first-match-after
  "Splits the sequence SEQ using RE1 then searches after the first match and before the next match for the first occurrence of RE2"
  [re1 re2 seq]
  (let [[_ _ after] (s/partition seq re1)]
    (re-find re2 after)))

(defn response-status-code [resp]
  (:code (c/status resp)))

(defn scoop-url
  "Use the http client to do a GET on the url"
  [url]
  (let [resp (c/GET url)]
    (c/await resp)
    [(response-status-code resp) (c/string resp)]))

;; Get movie urls
;; Does a search of Rotten Tomatoes for the search text, then scrapes the results
;; for the page for each movie. Returns a collection of the movie urls

(defn get-movie-urls [search-text]
  (let [encoded-search-text (URLEncoder/encode search-text)
        [code body] (scoop-url (str *base-url* *search-end-point* encoded-search-text))]
    (when (= code 200)
      (let [[_ _ after] (s/partition body #"<span>Title</span>")]
        (let [[_ & results] (s/partition after #"\"(/m/.*/)\"")]
          (map #(str *base-url* (second %)) (take-nth 2 results)))))))

;; Given a movie url GET the page then scrape it for the critic and audience rating

(defn get-movie-rating [movie-url]
  (let [[code body] (scoop-url movie-url)]
    (if (= code 200)
      {:critics (second
                 (first-match-after #"class=\"critic_side_container" #">([0-9]+)<" body))
       :audience (second
                  (first-match-after #"class=\"fan_side" #">([0-9]+)<" body))})))

;; Finds the ratings for all Rotten Tomatoes movies that match the search string and prints them out

(defn get-movie-ratings [search-str]
  (let [urls (get-movie-urls search-str)]
    (when (> (count urls) 0)
      (doseq [url urls]
        (let [ratings (get-movie-rating url)]
          (printf "movie url: %s\n\tAudience %s\n\tCritics %s\n" url (:audience ratings) (:critics ratings)))))))

;; Slight variant on above that uses pmap so that the requests are done in parallel

(defn pmap-get-movie-ratings [search-str]
  (let [urls (get-movie-urls search-str)]
    (when (> (count urls) 0)
      (let [ratings (pmap #(get-movie-rating %) urls)
            url-and-ratings (map vector urls ratings)]
        (doseq [[url ratings] url-and-ratings]
          (printf "movie url: %s\n\tAudience %s\n\tCritics %s\n" url (:audience ratings) (:critics ratings)))))))

I'm using the str-utils2 module for its regex function partition, which splits a string into the pieces between regex matches, interleaved with the matches themselves. This made it easy to write the function `first-match-after', which finds a regex then finds the first occurrence of some text after that regex.
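
To make the behaviour of `first-match-after' concrete, here is a rough Python equivalent. It is a sketch of the same idea using the standard re module rather than a translation of the Clojure code, and the parameter names anchor_re and target_re are my own.

```python
import re


def first_match_after(anchor_re, target_re, text):
    """Search for TARGET_RE after the first ANCHOR_RE match in TEXT.

    Only the text between the first anchor match and the next anchor
    match (if any) is searched. Returns a match object or None.
    """
    anchor = re.search(anchor_re, text)
    if anchor is None:
        return None
    rest = text[anchor.end():]
    # Stop at the next occurrence of the anchor, like the partition version
    next_anchor = re.search(anchor_re, rest)
    if next_anchor is not None:
        rest = rest[:next_anchor.start()]
    return re.search(target_re, rest)
```

The rating itself is capture group 1 of the returned match, which corresponds to the `second' call in get-movie-rating.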

It was so easy to parallelize the requests. My first attempt at get-movie-ratings retrieved each movie page synchronously. By using pmap I was able to make it do the requests via thread pools, and thus return in a few seconds for many movie matches.
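
For comparison, the nearest thing to pmap in Python is probably the map method of a thread pool. The sketch below assumes a fetch function that takes a url and returns its ratings (a stand-in for get-movie-rating); the name fetch_all and the default of 8 workers are arbitrary choices for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetch, urls, max_workers=8):
    """Call fetch(url) for each url on a thread pool.

    Returns a list of (url, result) pairs in the same order as URLS,
    much like (map vector urls (pmap f urls)) in the Clojure code.
    """
    if not urls:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, not completion order
        return list(zip(urls, pool.map(fetch, urls)))
```

As with pmap, the win comes from the requests spending most of their time waiting on the network, so running them on several threads overlaps the waiting.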

The code is much shorter than it would have been in Common Lisp, at least the way I program. I love the destructuring syntax, and that maps, vectors and lists can be returned from functions and manipulated without much effort.

I'm still new to Clojure so if you feel you can improve the code or have any feedback please let me know.

Tuesday, January 11, 2011

View Data from the Clojure REPL

Here's a nice debugging feature in Clojure. The clojure.inspector namespace lets you look at data in a popup JFrame. The two examples below show how you can quickly view data in a tree or table format, which is really handy when working at the REPL.

(require 'clojure.inspector)

(clojure.inspector/inspect-tree '(1 (a b) 2 (c d) 3 (e f)))

(clojure.inspector/inspect-table '((1 2 3) (a b c) (e f g)))


Wednesday, December 15, 2010

F# vs C#

Nice article directly comparing some code written in C# with the same code written in F#:

http://sharp-gamedev.blogspot.com/2010/12/on-performance-of-f-on-xbox-360.html

What's interesting is that now that VMs are starting to become the new platforms, we are starting to be less restricted by language choice. When it's native code with hand-crafted memory management you want, it has to be C++.

But when you start to look at the JVM and the .NET VM, the language choice has far less impact on performance... after all you're using the same garbage collector, same base libraries and so on.

This is great news for those of us with more peculiar tastes in language (I like Clojure and Common Lisp for example).

That said, I think it will be a few years until AAA console games run on VMs, if ever, due to the nature of that business. Memory is always at a premium, and the goal is to squeeze every last Hz of performance out of the CPU.