Saturday, February 7, 2009

Fun with RSS

Today I've been playing around with Emacs, and one of the things I've wanted to write is something to grab a web page, do some manipulations to it and dump the result to a buffer.

So what I've got going is two pieces that are not quite joined up yet. The first grabs a URL and sticks its contents in a buffer:


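(defun url-to-buffer (url)
  "Fetch URL with wget and put the output in the *url-to-buffer* buffer."
  (interactive "sEnter site url: ")
  (let ((buffer (get-buffer-create "*url-to-buffer*")))
    ;; Quote the URL so characters such as & or ? survive the shell.
    (shell-command (format "c:/coreutils/bin/wget.exe -q -O - %s"
                           (shell-quote-argument url))
                   buffer)))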


As you can see I'm using Windows here, but as long as you have a path to wget this should work anywhere. I'm simply running wget and capturing its output in a buffer.
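If you don't have wget to hand, roughly the same thing can be sketched with the url library that ships with Emacs. This is not what I used above, just an alternative under a couple of assumptions: the function name my-url-to-buffer-builtin is made up for illustration, and the response body is assumed to be UTF-8.

```elisp
(require 'url)

;; Hypothetical variant of url-to-buffer that avoids the external wget binary.
(defun my-url-to-buffer-builtin (url)
  "Fetch URL with the built-in url library and show the body in *url-to-buffer*."
  (interactive "sEnter site url: ")
  (let ((response (url-retrieve-synchronously url))
        (buffer (get-buffer-create "*url-to-buffer*")))
    (unless response
      (error "Could not retrieve %s" url))
    (with-current-buffer buffer
      (erase-buffer)
      (insert
       (with-current-buffer response
         ;; The response buffer starts with the HTTP headers, which end
         ;; at the first blank line; skip past them.
         (goto-char (point-min))
         (re-search-forward "\r?\n\r?\n" nil 'move)
         ;; Assumption: the body is UTF-8 encoded.
         (decode-coding-string (buffer-substring (point) (point-max))
                               'utf-8))))
    (kill-buffer response)
    (display-buffer buffer)))
```

Call it with M-x my-url-to-buffer-builtin, just like the wget version.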

I ran url-to-buffer on the BBC's news RSS feed and saved the result to a file called bbc.xml.

Since this is an XML file I can parse it without any effort using xml.el. Here is some code that walks through the items in the RSS file and prints them in human-readable form (i.e. not XML) into a buffer.

There is some helpful info about xml.el here

http://www.emacswiki.org/emacs/XmlParserExamples

which was necessary to figure out the syntax for grabbing the text from a tag. I wrote a helper function to go from a node to the text for that node, since the syntax is quite verbose:


(require 'xml)

(defun get-item (tag node)
  "Return the first child (normally the text) of NODE's first TAG child."
  (let ((child-node (car (xml-get-children node tag))))
    (car (xml-node-children child-node))))


This lets you go from a tag name (e.g. title) and a node parsed from the XML file to the text inside that tag.
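To see the helper in action without a real feed, here is a small sketch that parses a made-up item from a temporary buffer. The variable my-sample-item and its contents are invented for illustration; get-item is repeated only so the snippet runs on its own.

```elisp
(require 'xml)

;; Same helper as above, repeated so this snippet is self-contained.
(defun get-item (tag node)
  (let ((child-node (car (xml-get-children node tag))))
    (car (xml-node-children child-node))))

;; Hypothetical sample item; real feeds carry more tags than this.
(defvar my-sample-item
  (with-temp-buffer
    (insert "<item><title>Hello</title><pubDate>Sat, 07 Feb 2009</pubDate></item>")
    ;; xml-parse-region returns a list of top-level nodes; take the first.
    (car (xml-parse-region (point-min) (point-max)))))

(get-item 'title my-sample-item)        ; returns "Hello"
(get-item 'description my-sample-item)  ; returns nil, the tag is absent
```

Note that a missing tag gives nil rather than an error, which is why a feed item without a description would print as "nil" in the output of the code below.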


(defun parse-rss (filename)
  "Parse the RSS file FILENAME and list its items in a new buffer."
  (interactive "fRss file: ")
  (let ((parsed-xml (xml-parse-file filename))
        (buffer (get-buffer-create (format "%s-parsed.txt" filename))))
    ;; Work inside the output buffer rather than relying on goto-line
    ;; to switch buffers as a side effect.
    (with-current-buffer buffer
      (erase-buffer)
      (let* ((rss (car parsed-xml))
             (channel (xml-get-children rss 'channel))
             (items (xml-get-children (car channel) 'item)))
        (dolist (item items)
          (let ((title (get-item 'title item))
                (description (get-item 'description item))
                (date (get-item 'pubDate item)))
            (insert (format "*%s*\n%s\n%s\n\n" title description date))))))
    (pop-to-buffer buffer)))


xml-parse-file is our one-shot call to parse the XML file; it returns a nested Lisp structure representing the XML document. I grab the channel, then the items within the channel, and print three fields from each item: the title, description and date.
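To make the "nested Lisp structure" a bit more concrete, here is a minimal sketch that does the same rss, channel, item walk on a string instead of a file. The function name my-rss-titles and the tiny feed are made up for illustration; xml-parse-region is the buffer-based sibling of xml-parse-file.

```elisp
(require 'xml)

;; Each parsed node has the shape (NAME ATTRIBUTES CHILD...), so
;; "<rss><channel><item><title>T</title></item></channel></rss>"
;; parses to ((rss nil (channel nil (item nil (title nil "T"))))).

(defun my-rss-titles (rss-string)
  "Return the item titles found in RSS-STRING (a hypothetical helper)."
  (with-temp-buffer
    (insert rss-string)
    (let* ((rss (car (xml-parse-region (point-min) (point-max))))
           (channel (car (xml-get-children rss 'channel)))
           (items (xml-get-children channel 'item)))
      ;; For each item, take the text of its first title child.
      (mapcar (lambda (item)
                (car (xml-node-children
                      (car (xml-get-children item 'title)))))
              items))))

(my-rss-titles
 "<rss><channel><item><title>One</title></item><item><title>Two</title></item></channel></rss>")
;; returns ("One" "Two")
```

Parsing from a buffer like this is also one way the two halves of this post could eventually be joined up, since the fetched feed already sits in a buffer.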

So each item output looks like this:

*Obama defends economic stimulus*
The US president defends his economic stimulus plan as "absolutely necessary", and urges Congress to approve it quickly.
Sat, 07 Feb 2009 22:23:20 GMT

Note that if you just want to read RSS feeds in Emacs, look up the built-in Newsticker package. Everything I'm writing about here is really useful if you want to code up something custom, though.
