Although I didn't sign up for a Rails Rumble team this year, I followed the results closely. One project that caught my interest early was Jeremy McAnally's tldr.it. I've always been fascinated by machine parsing of human language, so the technology that powers it, Open Text Summarizer, was a real draw for me. Reading through the source code, I realized that this would be a perfect opportunity to combine my resurrected C knowledge with Ruby.
There's a lot going on under the hood, but a quick peek shows that the library first loads a stemming dictionary for your language of choice. Parsing a document against the loaded stemming rules creates an OtsArticle, a predefined struct that keeps track of the document's statistics, such as term frequency and word scores. The parsed result is then fed into a highlighter, which returns only a portion of the text based on a passed-in ratio, an integer between 0 and 100.
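To make that pipeline a little more concrete, here is a rough pure-Ruby sketch of the same idea: stem the words, count term frequencies, score each sentence, and keep the top slice given by a ratio. This is not Open Text Summarizer's actual algorithm, and every name in it (STOP_WORDS, toy_stem, toy_summarize) is made up for illustration. The suffix-stripping "stemmer" in particular is only a crude stand-in for a real stemming dictionary.

```ruby
# Toy extractive summarizer. NOT the Open Text Summarizer algorithm;
# all names here are hypothetical and exist only for illustration.
STOP_WORDS = %w[a an and are as at be by for from in is it of on or that the this to was with].freeze

# Crude stand-in for a stemming dictionary: strips a few common suffixes
# from longer words and leaves short words alone.
def toy_stem(word)
  word.length > 4 ? word.sub(/(ing|ed|es|s)\z/, "") : word
end

# Lowercased, stemmed, non-stop-word terms of a piece of text.
def toy_terms(text)
  text.scan(/[[:alpha:]]+/)
      .map(&:downcase)
      .reject { |w| STOP_WORDS.include?(w) }
      .map { |w| toy_stem(w) }
end

# Returns roughly `ratio` percent of the sentences (at least one),
# picking the highest-scoring ones and keeping their original order.
def toy_summarize(text, ratio: 25)
  sentences = text.split(/(?<=[.!?])\s+/)
  return "" if sentences.empty?

  # Term frequency across the whole document.
  freq = Hash.new(0)
  sentences.each { |s| toy_terms(s).each { |t| freq[t] += 1 } }

  # A sentence's score is the summed frequency of its terms.
  scored = sentences.each_with_index.map do |s, i|
    [toy_terms(s).sum { |t| freq[t] }, i]
  end

  keep = (sentences.size * ratio / 100.0).ceil.clamp(1, sentences.size)
  kept = scored.sort_by { |score, i| [-score, i] }.first(keep).map(&:last).sort
  kept.map { |i| sentences[i] }.join(" ")
end

puts toy_summarize("Ruby is fun. Ruby wraps C libraries. Cats sleep.", ratio: 34)
```

Summing raw frequencies favours longer sentences, which is one of many reasons to treat this as a sketch rather than a replacement for the real library.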
The source is on GitHub, and installation is a breeze provided you are on a POSIX-compliant system with glib-2.0 and libxml-2.0 installed and properly configured.
gem install summarize
For the sake of convenience, I've made the summarize method available as a public instance method on both String and File.
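One way to expose a single method on both classes is a shared mixin. The sketch below is only a guess at the general shape, not the gem's code: the gem calls into the C library, whereas this stand-in just keeps the leading sentences. The names ToySummarizable, toy_summary, and summarizable_text are hypothetical, chosen so they don't collide with the gem's own summarize.

```ruby
# Hypothetical mixin showing how one method can serve both String and File.
# The real gem binds to the C library; this stand-in keeps leading sentences.
module ToySummarizable
  # Keeps the leading sentences that make up roughly `ratio` percent of the text.
  def toy_summary(ratio: 25)
    sentences = summarizable_text.split(/(?<=[.!?])\s+/)
    keep = (sentences.size * ratio / 100.0).ceil
    sentences.first(keep).join(" ")
  end
end

class String
  include ToySummarizable

  private

  # A string is its own text.
  def summarizable_text
    self
  end
end

class File
  include ToySummarizable

  private

  # Read the whole file from the beginning (the file must be open for reading).
  def summarizable_text
    rewind
    read
  end
end

puts "One. Two. Three. Four.".toy_summary(ratio: 50)
```

Each class only has to say where its text comes from; the shared method does the rest.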