Pay no attention to the code behind the curtain: the tech behind tldr.it
My last post over on my personal blog talked a bit about the story behind the application and showed what it was and how to use it. This time, I want to give you guys and gals a little bit of detail on the tech behind the application.
The application is a Rails 3.0.1 application. It has 3 controllers and 2 models, with almost 1000 lines of application code. I'm making use of about 10 third-party gems, mostly for fetching and parsing tasks.
The general architecture of the app centers around two distinct pieces. The Rails application really just kind of accepts and displays data: the real magic happens in the background jobs (currently powered by delayed_job).
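To make that split a bit more concrete, here is a minimal sketch of what a job object could look like. delayed_job can run any enqueued object that responds to `perform`; everything else here (the `SummarizeJob` name, the `RATIOS` values, the hash standing in for a model record, and the injected `fetcher` and `summarizer` lambdas) is hypothetical and only illustrates the flow, not the actual tldr.it code.

```ruby
# Hypothetical job object, not the real tldr.it job.
# delayed_job only needs the object to respond to #perform.
class SummarizeJob
  # Assumed ratios; the post only says three summaries are produced.
  RATIOS = [25, 50, 75].freeze

  def initialize(record, fetcher:, summarizer:)
    @record = record          # a Hash standing in for a model record
    @fetcher = fetcher        # callable: url -> content
    @summarizer = summarizer  # callable: (content, ratio) -> summary
  end

  # Fetch the content, summarize it at each ratio, store the results.
  def perform
    content = @fetcher.call(@record[:url])
    @record[:original] = content
    @record[:summaries] = RATIOS.map { |ratio| [ratio, @summarizer.call(content, ratio)] }.to_h
    @record[:fetched] = true
    @record
  end
end
```

In a real app the job would more likely carry just a record id so it can be serialized into the queue, and the web side would only ever read the `fetched` flag and the stored summaries.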
The background jobs
Once a job is fired off, a job runner grabs it and fetches the content. If it's a feed, then the feed is fetched by Feedzirra, summarized (more details on this in a bit), and stored back in the record. I persist all 3 summarized versions of the feed along with the original content. I did that because I intended to show the length differences on the page (e.g., "This feed is 70% shorter!"), but I ran out of time.
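Since the original and the summaries are both stored, the "70% shorter" label would be simple arithmetic. A rough sketch, assuming lengths are compared in characters (the helper names are made up, and this feature never shipped):

```ruby
# Hypothetical helpers for the unbuilt "This feed is 70% shorter!" label.
# Compares character counts of the original and the summary.
def percent_shorter(original, summary)
  return 0 if original.empty? # avoid dividing by zero
  100 - (summary.length * 100.0 / original.length).round
end

def shorter_label(original, summary)
  "This feed is #{percent_shorter(original, summary)}% shorter!"
end
```

For example, a 30-character summary of a 100-character original gives "This feed is 70% shorter!".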
If it's a web page, then the content is fetched by RestClient. I then use Nokogiri to extract the main content of the page. The algorithm I'm using is pretty complex and clever, but since it's sort of half the "secret sauce" of tldr.it, I'm not going to describe it in detail. I will say that it uses some things from my own research, some refinements of the Readability bookmarklet's techniques, and some HTML-specific (and HTML5-specific) additions. It's nowhere near perfect, but then again, I did build it in 48 hours. 🙂
Next, it's passed to the summarizer. The summarizer is largely powered by libots, an open-source, word-frequency-powered text summarizer. This library works quite well, but I hit the obstacle that it was written in C. I had planned to just pipe out to its command line utility, but its utility doesn't take input from stdin very well (and by not very well, I mean it segfaulted every time). So, at that point I wanted to just write a Ruby extension or use ffi. Neither of those approaches worked out (good C programmer, I am not), so I just opted to write my own small C command-line app that I could pipe text to and read the summary back from.
The way the summarizer works is to use the Ruby standard library's Shell class (I bet you've never heard of that one!) to pipe out the text content of the page (with some smart additions and such from my code) to my C summarizer, with the summarization ratio as an argument to the utility. It captures the output on stdout (if there's an error for any reason, like encoding, then it just returns blank) and places that back in the record. I do this 3 times for each web page and each feed.
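The pipe-out step can be sketched roughly like this. Note that this swaps in `Open3.capture2` from the standard library rather than the Shell class mentioned above, and that the `summarize_via_pipe` name and the `command:` parameter are made up for illustration; the real C utility's name and exact arguments aren't shown here.

```ruby
require "open3"

# Hypothetical helper: pipes text to an external summarizer on stdin,
# passes the ratio as the last command-line argument, and returns
# whatever the command prints on stdout.
# Returns "" if the command can't be run or exits with an error.
def summarize_via_pipe(text, ratio, command:)
  stdout, status = Open3.capture2(*command, ratio.to_s, stdin_data: text)
  status.success? ? stdout : ""
rescue SystemCallError
  # e.g. the binary is missing or not executable
  ""
end
```

Running the three passes is then just a matter of mapping over three ratios and storing each result on the record.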
Once the summarized text is captured, then the record is updated by the background job and the action that's polled by the user's browser returns the right JSON and HTML to update the user's view to show that it's been fetched.
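As a rough illustration of the polling side, here is a plain-Ruby sketch of the kind of payload such an action might return. The `Page` struct, the `ready` and `html` keys, and the markup are all assumptions for the sake of the example, not the actual tldr.it response format.

```ruby
require "json"
require "erb"

# Hypothetical stand-in for the model record.
Page = Struct.new(:url, :summary)

# Builds the JSON body a polling action could return:
# "not ready" until the background job has stored a summary,
# then the escaped HTML fragment to drop into the page.
def poll_payload(record)
  return JSON.generate(ready: false) if record.summary.nil?

  html = "<div class=\"summary\">#{ERB::Util.html_escape(record.summary)}</div>"
  JSON.generate(ready: true, html: html)
end
```

The browser would keep polling until `ready` is true and then swap the returned HTML into the view.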
Places to improve
I want to replace libots with a library of my own creation. I wanted to do this during the Rumble (or at least enhance ots's output with it), but I didn't have time at all. I'm still not totally sure which algorithm I'm going to use, but word frequency isn't the best fit for every situation. I also need to refine the content extraction algorithm, working on more special-case parsers (currently there's only one for NYTimes and Blogspot blogs). I see why many of the URLs people try aren't working, but I didn't have a chance to add a second-pass algorithm for when we miss the content on the first run. I also want to make the extraction content-aware, since right now it just does some analysis on page structure and loose content detection.
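For anyone wondering what "word frequency" summarization means in practice, here is a deliberately naive sketch. This is not libots' actual algorithm, and the stopword list, sentence splitting, and scoring are all simplifications I'm assuming for illustration: count how often each word appears, score each sentence by the counts of its words, and keep the top-scoring sentences in their original order.

```ruby
# Naive word-frequency summarizer; NOT how libots works internally.
# Tiny, assumed stopword list so common words don't dominate the scores.
STOPWORDS = %w[the a an and or of to in is it that this for on with as are was be].freeze

def frequency_summary(text, ratio)
  sentences = text.split(/(?<=[.!?])\s+/)

  # Count every non-stopword in the whole text.
  freq = Hash.new(0)
  text.downcase.scan(/[a-z']+/).each { |w| freq[w] += 1 unless STOPWORDS.include?(w) }

  # Keep roughly `ratio` percent of the sentences, but at least one.
  keep = [(sentences.size * ratio / 100.0).ceil, 1].max

  # Score each sentence by summing the frequencies of its words.
  scored = sentences.each_with_index.map do |sentence, index|
    score = sentence.downcase.scan(/[a-z']+/).sum { |w| freq.fetch(w, 0) }
    [score, index, sentence]
  end

  # Take the best-scoring sentences (earlier ones win ties), in document order.
  top = scored.max_by(keep) { |score, index, _| [score, -index] }
  top.sort_by { |_, index, _| index }.map { |_, _, sentence| sentence }.join(" ")
end
```

Even this toy version shows one weakness: longer sentences pile up higher scores simply because they contain more words, which is part of why pure word frequency doesn't suit every kind of content.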
Anyhow, that's the technical background. Feel free to ask any questions; I'll answer to the best of my ability!