MONTHLY CATALOG

Newsracker

For most of my career in technology, I’ve worked primarily with front-end technologies: HTML and CSS. I consider it much more of a designer role, only I do my work with markup instead of Photoshop. I won’t go so far as to say that I skip the Photoshop phase altogether, but I do try to break out of it as soon as possible and get to tangible code.

I’ve also worked a little bit with the tools that “programmers” use. PHP, *nix system administration, blah blah blah. For the most part it’s boring. PHP is a pain in the ass to use, and I haven’t been able to wrap my head around the automagical frameworks like Ruby on Rails or Django. I don’t know either Ruby or Python particularly well, and these frameworks abstract so much that I don’t ever feel particularly comfortable using them to learn either language.

These days, however, HTML/CSS isn’t enough. You need to be well versed in JavaScript, something I’ve been able to avoid for a while, and at least know a server-side technology well enough to build prototypes. I could do this in PHP, but even the thought of working with PHP is enough to make me want to repeatedly bash my dick with a hammer.

I’ve been messing around with Python for a while and ran through a couple of tutorials, but I needed a project of my own to really learn how to use it. I found a lightweight web framework called web.py that makes me code all the important parts by hand without having to worry about the really low-level stuff like getting the page to actually display in a browser.

While browsing around about a week ago, I came across this page on Newseum’s website that collects the front pages of newspapers from around the world.

newseum's daily covers

It seemed like a gold mine of untapped data. Surely there would be some benefit from being able to analyze what’s on the front pages of newspapers around the world. Newseum provides all sorts of data along with the image: the name of the publication; the city, state, and country of publishing; and the date of the newspaper. If I could somehow glean the context of each cover, which stories were given prominence, and what words were being used, I would be able to visualize what issues are being covered and how they are being reported.

Now that I had my project, I needed to figure out what I wanted to do with it. Initially, I was planning on downloading the full-size JPG, finding some sort of OCR library, indexing the words, and drawing conclusions based on the gleaned data. This plan ran into a few problems:

  1. I knew next to nothing about Python.
  2. Even if I was comfortable with Python, it wouldn’t matter because I’m a horrible programmer.
  3. There aren’t a lot of OCR options for Python. Google open-sourced Tesseract, which is based on work done at HP during the mid-90s, but it’s a bit of a pain in the ass anyways.
  4. Any output I could get out of Tesseract would be useless, because the JPGs weren’t of sufficient quality to perform OCR.

I solved the last problem to a certain extent by using the PDF files Newseum provides for each of the newspapers, but I still couldn’t, and I’m sure it’s entirely my fault, get any output from Tesseract. Instead of continuing to bang my head against the wall, I decided to just change the short-term goal of the project and get something running as soon as possible.

After massaging the data into SQLite and grabbing longitude and latitude coordinates for the publishing location of each newspaper (US only for now), I decided I would plot each cover onto a blank canvas based on its latitude/longitude. The end result would be a daily, dynamically generated graphic representing the most important daily issues throughout the country based on local media attention.
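For the curious, the latitude/longitude-to-canvas step can be sketched roughly like this. This is a minimal illustration, not the actual script: the bounding box for the contiguous US, the canvas size, and the `to_canvas` name are all assumptions made up for the example, and it uses a plain linear (equirectangular) mapping rather than a proper map projection.

```python
# Approximate bounding box for the contiguous US.
# These numbers are an assumption for illustration, not taken from the project.
LON_MIN, LON_MAX = -125.0, -66.0
LAT_MIN, LAT_MAX = 24.0, 50.0


def to_canvas(lat, lon, width=1200, height=700):
    """Map a latitude/longitude pair to (x, y) pixels on a blank canvas.

    Uses a simple linear mapping; canvas size defaults are hypothetical.
    """
    x = (lon - LON_MIN) / (LON_MAX - LON_MIN) * width
    # Canvas y grows downward while latitude grows northward, so flip the axis.
    y = (LAT_MAX - lat) / (LAT_MAX - LAT_MIN) * height
    return round(x), round(y)


# Example: a point in the middle of the bounding box lands mid-canvas.
center_x, center_y = to_canvas(37.0, -95.5)
```

A linear mapping like this distorts distances a bit, but for dropping thumbnails on a rough map of the country it is probably close enough.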

I was able to cobble together some Python code to access my SQLite database, fetch both the JPGs and PDFs (I still want to analyze the content at some point), and resize them for plotting purposes. I tried getting away with only downloading the PDF and converting that, but PIL kept choking.
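A rough sketch of the database and resizing side of that, using only the standard library. The `papers` table, its columns, the sample rows, and the function names are all hypothetical, invented for the example; the real schema may look nothing like this. The download and the actual image resize (which PIL handled) are left out, so this only works out the target thumbnail dimensions.

```python
import sqlite3


def load_papers(conn):
    """Return (name, latitude, longitude) for every paper that has coordinates."""
    cur = conn.execute(
        "SELECT name, latitude, longitude FROM papers "
        "WHERE latitude IS NOT NULL AND longitude IS NOT NULL"
    )
    return cur.fetchall()


def thumbnail_size(width, height, max_side=120):
    """Scale (width, height) so the longer side equals max_side, keeping the aspect ratio."""
    scale = float(max_side) / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


# Demo against an in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (name TEXT, latitude REAL, longitude REAL)")
conn.executemany(
    "INSERT INTO papers VALUES (?, ?, ?)",
    [
        ("Example Gazette", 40.0, -100.0),   # hypothetical paper
        ("Sample Tribune", 35.0, -90.0),     # hypothetical paper
        ("No Location Times", None, None),   # skipped: no coordinates
    ],
)
papers = load_papers(conn)
```

Filtering out rows without coordinates in the query keeps the plotting loop from having to care about papers that can't be placed on the map.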

screen capture of python script running

Sadly, Python either lacks good visualization libraries or I couldn’t figure out how to use the ones that exist. After some searching, I came across NodeBox, a Python-based visualization tool. I was able to load up my database, access the newly downloaded images, and create a composite map using the latitude/longitude coordinates. The end result looks a little something like this (click for full size version):

newsracker map for June 9th, 2008

I’m pretty happy with the results, but there’s still a lot more work to do. I’d like to be able to size the newspapers by relative circulation numbers so that more popular papers become more prominent. I also haven’t been able to figure out how to get the NodeBox library running on my web server, so any web app I would currently be able to build would involve far too many manual steps to keep updated. I’m toying with doing the visualization on the client side using jQuery, but I really would like to generate it on the server using Python.

Nerdy, I know.

— Alex Cabrera, Jun 10, 2008.
