Saturday, January 15, 2011

Grabbing Rotten Tomatoes movie ratings with Clojure


Flickr pic by Gammelmark

I'm currently teaching myself Clojure from Stuart Halloway's excellent book Programming Clojure. Here's my first program that does something: a simple web page scraper that gets the critics and audience ratings for movies from Rotten Tomatoes. Here's how it looks at the REPL:

rottentomatoes.core> (pmap-get-movie-ratings "lord of the rings")
movie url: http://www.rottentomatoes.com/m/lord_of_the_rings_the_return_of_the_king/
Audience 83
Critics 94
movie url: http://www.rottentomatoes.com/m/lord_of_the_rings_the_fellowship_of_the_ring/
Audience 92
Critics 92
movie url: http://www.rottentomatoes.com/m/lord_of_the_rings_the_two_towers/
Audience 92
Critics 96
movie url: http://www.rottentomatoes.com/m/lord_of_the_rings/
Audience 74
Critics 50
movie url: http://www.rottentomatoes.com/m/master_of_the_rings_the_unauthorized_story_behind_jrr_tolkiens_the_lord_of_the_rings/
Audience 34
Critics null
movie url: http://www.rottentomatoes.com/m/jrr-tolkien-and-the-birth-of-the-lord-of-the-rings/
Audience 93
Critics null
movie url: http://www.rottentomatoes.com/m/jrr_tolkien_and_the_birth_of_the_lord_of_the_rings/
Audience 32
Critics null
movie url: http://www.rottentomatoes.com/m/more_at_imdbpro_creating_the_lord_of_the_rings_symphony_a_composers_journey_through_middle_earth/
Audience 100
Critics null
nil


I use Leiningen to develop with Clojure (it's like Maven for Java), so if you want to build the project, here's my project configuration, which includes the dependencies used. I'm using swank-clojure, which lets the REPL work with Emacs SLIME. http.async.client is a Clojure API that builds on Netty, and I use it for the GET requests to the Rotten Tomatoes server.


(defproject rottentomatoes "1.0.0-SNAPSHOT"
  :description "Clojure code to grab movie ratings from Rotten Tomatoes"
  :dependencies [[org.clojure/clojure "1.2.0"]
                 [org.clojure/clojure-contrib "1.2.0"]
                 [http.async.client "0.2.1"]]
  :main rottentomatoes.core
  :dev-dependencies [[swank-clojure "1.2.1"]])

And here's the code:

(ns rottentomatoes.core
  (:gen-class)
  (:require
    [clojure.contrib.str-utils2 :as s]
    [http.async.client :as c])
  ;; Java classes are imported inside the ns form; java.lang is always available
  (:import [java.net URLEncoder]))

(def *base-url* "http://www.rottentomatoes.com")
(def *search-end-point* "/search/full_search.php?search=")

(defn first-match-after
  "Splits TEXT using RE1, then searches after the first match and before
  the next match for the first occurrence of RE2. Returns nil if RE1 never matches."
  [re1 re2 text]
  (let [[_ _ after] (s/partition text re1)]
    (when after
      (re-find re2 after))))

(defn response-status-code [resp]
  (:code (c/status resp)))

(defn scoop-url
  "Use the http client to do a GET on the url. Returns [status-code body]."
  [url]
  (let [resp (c/GET url)]
    (c/await resp)
    [(response-status-code resp) (c/string resp)]))

;; Get movie urls
;; Does a search of Rotten Tomatoes for the search text, then scrapes the results
;; for the page for each movie. Returns a collection of the movie urls

(defn get-movie-urls [search-text]
  ;; Pass the charset explicitly; the one-argument encode is deprecated
  (let [encoded-search-text (URLEncoder/encode search-text "UTF-8")
        [code body] (scoop-url (str *base-url* *search-end-point* encoded-search-text))]
    (when (= code 200)
      (let [[_ _ after] (s/partition body #"<span>Title</span>")
            [_ & results] (s/partition after #"\"(/m/.*/)\"")]
        ;; results alternates between matches and the text between them
        (map #(str *base-url* (second %)) (take-nth 2 results))))))

;; Given a movie url, GET the page, then scrape it for the critic and audience ratings

(defn get-movie-rating [movie-url]
  (let [[code body] (scoop-url movie-url)]
    (when (= code 200)
      {:critics (second
                  (first-match-after #"class=\"critic_side_container" #">([0-9]+)<" body))
       :audience (second
                   (first-match-after #"class=\"fan_side" #">([0-9]+)<" body))})))

;; Finds the ratings for all Rotten Tomatoes movies that match the search string and prints them out

(defn get-movie-ratings [search-str]
  (let [urls (get-movie-urls search-str)]
    (when (seq urls)
      (doseq [url urls]
        (let [ratings (get-movie-rating url)]
          (printf "movie url: %s\n\tAudience %s\n\tCritics %s\n" url (:audience ratings) (:critics ratings)))))))

;; Slight variant on above that uses pmap so that the requests are done in parallel

(defn pmap-get-movie-ratings [search-str]
  (let [urls (get-movie-urls search-str)]
    (when (seq urls)
      (let [ratings (pmap get-movie-rating urls)
            url-and-ratings (map vector urls ratings)]
        (doseq [[url ratings] url-and-ratings]
          (printf "movie url: %s\n\tAudience %s\n\tCritics %s\n" url (:audience ratings) (:critics ratings)))))))

I'm using the str-utils2 module for its regex function partition, which splits a string into alternating pieces: the text between matches and the matches themselves. This made it easy to write the function `first-match-after', which finds a match for one regex and then finds the first occurrence of a second regex in the text that follows it.
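
For anyone without clojure-contrib on the classpath, the same idea can be sketched with the regex functions in clojure.core. This is only an illustration, not the code used above: first-match-after* and sample-html are hypothetical names, and the HTML snippet is made up rather than copied from a real Rotten Tomatoes page.

```clojure
;; Sketch only: mimics first-match-after without clojure.contrib.
;; first-match-after* and sample-html are hypothetical, for illustration.
(defn first-match-after*
  "Returns the first match of re2 in the text between the first and second
  matches of re1 (or the end of the text). Returns nil if re1 never matches."
  [re1 re2 text]
  (let [m (re-matcher re1 text)]
    (when (.find m)
      (let [start (.end m)                               ; just past the first re1 match
            end   (if (.find m) (.start m) (count text))] ; next re1 match, or end of text
        (re-find re2 (subs text start end))))))

;; Made-up markup with the two class names the scraper looks for.
(def sample-html
  (str "<div class=\"critic_side_container\"><span>94</span></div>"
       "<div class=\"fan_side\"><span>83</span></div>"))

(first-match-after* #"class=\"critic_side_container" #">([0-9]+)<" sample-html)
;; => [">94<" "94"]
```

Because the second regex has a capture group, re-find returns a vector of the whole match and the group, which is why the scraper calls second on the result.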

Parallelizing the requests was easy. My first attempt, get-movie-ratings, retrieved each movie page one after another. By switching to pmap, the requests run concurrently on a thread pool, so the results come back in a few seconds even when many movies match.
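
The difference between map and pmap can be tried at the REPL without touching the network. In this rough sketch, fake-fetch and sample-urls are hypothetical stand-ins: fake-fetch just sleeps to imitate a slow GET, so the timings only hint at what a real scrape would do.

```clojure
;; Sketch only: fake-fetch stands in for a real HTTP GET and simply sleeps.
;; fake-fetch and sample-urls are hypothetical, for illustration.
(defn fake-fetch [url]
  (Thread/sleep 200)                       ; pretend network latency
  {:url url :length (count url)})

(def sample-urls
  (map #(str "http://example.com/m/movie_" %) (range 8)))

;; map and pmap are both lazy, so doall forces the work inside the timing.
(def serial-results   (time (doall (map fake-fetch sample-urls))))
(def parallel-results (time (doall (pmap fake-fetch sample-urls))))

;; Same results in the same order; only the elapsed time should differ.
(= serial-results parallel-results)
```

pmap only keeps a limited number of calls in flight (the number of processors plus two), so the speedup depends on the machine. When running as a standalone program rather than at the REPL, calling (shutdown-agents) at the end lets the JVM exit promptly, since pmap runs its work on the agent thread pool.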

The code is much shorter than it would have been in Common Lisp, at least the way I program. I love the destructuring syntax, and that maps, vectors and lists can be returned from functions and manipulated without much effort.
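
As a small illustration of the destructuring mentioned here, this sketch unpacks a vector and a map in the same shapes the scraper uses. fake-scoop-url and describe-ratings are hypothetical helpers written for this example, and the ratings values are made up.

```clojure
;; Sketch only: fake-scoop-url is a hypothetical stand-in that returns a
;; [status-code body] vector, the same shape scoop-url returns.
(defn fake-scoop-url [url]
  [200 (str "<html>page for " url "</html>")])

;; Vector destructuring names both elements in one step.
(def body-length
  (let [[code body] (fake-scoop-url "http://example.com/m/some_movie/")]
    (when (= code 200)
      (count body))))

;; Map destructuring does the same for keyed results such as the ratings map.
(defn describe-ratings [{:keys [critics audience]}]
  (str "Critics " critics ", Audience " audience))

(describe-ratings {:critics "94" :audience "83"})
;; => "Critics 94, Audience 83"
```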

I'm still new to Clojure, so if you can improve the code or have any feedback, please let me know.

4 comments:

Dave Kincaid said...

Thanks for posting this. I'm just starting to learn Clojure too and it helps to read other people's code. Yours is easy to read and gave me a good understanding of several different techniques.

Justin said...

Thank you sir

Film Mesum said...

Thank you, Your article is very useful for me
