If you're a geek like me, one of the best things about the summer is new episodes of Nova Science Now. My favourite segments are the ones where they profile a scientist, and recently they profiled Luis von Ahn, Assistant Professor of Computer Science at Carnegie Mellon University.
Luis is the inventor of the Captcha, the squiggly words we type when registering on many websites to prove we're human and to keep spammers out. After initially being very proud of his invention, he soon realized that if every person took 10 seconds to type a Captcha, about 500 thousand hours of humanity's time were being wasted every day.
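For a sense of scale, here is a quick back-of-the-envelope check of what those two figures imply. It is only a sketch: it assumes exactly 10 seconds per Captcha and exactly 500 thousand hours per day, and the function name is my own.

```python
SECONDS_PER_CAPTCHA = 10        # figure quoted above
WASTED_HOURS_PER_DAY = 500_000  # figure quoted above


def captchas_per_day(wasted_hours, seconds_each):
    """Number of Captchas typed per day implied by the total time spent."""
    return wasted_hours * 3600 // seconds_each


print(captchas_per_day(WASTED_HOURS_PER_DAY, SECONDS_PER_CAPTCHA))  # 180000000
```

In other words, the two numbers together imply roughly 180 million Captchas typed every day.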
In an effort to turn those 10 seconds from a waste into something useful, he invented the ReCaptcha, turning us all into unknowing volunteers in a massive public works project.
There are many projects digitizing books - Google and the Internet Archive projects for example - using optical character recognition (OCR). The challenge for these projects has been working with older books, where the print is often faded or blurred and the OCR software can't recognize the words. In fact, about 30% of the words are deciphered incorrectly. Most often, though, the words are still decipherable by a human.
Luis's ReCaptcha takes those words and uses them as Captchas. In a two-word ReCaptcha, one of the words you type is the computer-generated puzzle that proves you're a human. The other word comes out of the archive. It's assumed that if you can decipher the first word correctly, you can decipher the second word too. If 10 people decipher the word the same way, it is officially digitized.
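The two-word idea can be sketched in a few lines of Python. This is my own toy illustration, not ReCaptcha's actual code: the class and method names are made up, the case-insensitive comparison is an assumption, and I'm reading "10 people decipher the word the same way" as ten identical answers.

```python
from collections import Counter, defaultdict

AGREEMENT_THRESHOLD = 10  # "10 people decipher the word the same way"


class ReCaptchaSketch:
    """Toy model of the two-word scheme (hypothetical, for illustration)."""

    def __init__(self, threshold=AGREEMENT_THRESHOLD):
        self.threshold = threshold
        self.votes = defaultdict(Counter)  # unknown word id -> tally of guesses
        self.digitized = {}                # unknown word id -> accepted text

    def submit(self, control_answer, control_typed, unknown_id, unknown_typed):
        """Return True if the user passed the human check on the known word."""
        # Assumption: answers are compared case-insensitively.
        if control_typed.strip().lower() != control_answer.strip().lower():
            return False  # failed the known word, so the guess is discarded

        if unknown_id not in self.digitized:
            guess = unknown_typed.strip().lower()
            self.votes[unknown_id][guess] += 1
            # Accept the word once enough people agree on the same reading.
            if self.votes[unknown_id][guess] >= self.threshold:
                self.digitized[unknown_id] = guess
        return True
```

Each call to `submit` is one person solving one ReCaptcha. Once ten of them type the same thing for the archive word, it lands in `digitized`; anyone who gets the known word wrong is ignored.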
How fast can the process of digitizing books and periodicals one word at a time go? Incredibly fast, especially when Facebook, Twitter and TicketMaster, among others, are using ReCaptcha. Luis says that since its launch last year, over 400 million people, or 6% of the world's population, have helped digitize words. The 35 million ReCaptcha words typed per day result in 125 to 150 digitized books. About 20 years of the New York Times archive was completed in a few months. The entire 30-year archive should be done by the end of the year.
Here's a video of Luis speaking to the Library of Congress.