Every time you type a two-word Captcha, you're helping to digitize the world's printed archives.
If you're a geek like me, one of the best things about the summer is new episodes of Nova Science Now. My favourite segments are the ones where they profile a scientist, and recently they profiled Luis von Ahn, Assistant Professor of Computer Science at Carnegie Mellon University.
Luis is the inventor of the Captcha, the squiggly words we type when registering on many websites to prove we're human and to keep spammers out. After initially being very proud of his invention, he soon realized that if every person takes 10 seconds to type a Captcha, about 500 thousand hours of humanity's time were being wasted every day.
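For the curious, here's a quick back-of-the-envelope check of what those two numbers imply. This is only a sketch: the 10 seconds and the 500 thousand hours come from the segment, while the number of Captchas typed per day is something I'm deriving from them, not a figure Luis quoted.

```python
# Figures quoted in the post.
SECONDS_PER_CAPTCHA = 10
WASTED_HOURS_PER_DAY = 500_000

# Derived (an assumption of this sketch, not a quoted figure):
# how many Captchas per day would add up to that much time?
wasted_seconds_per_day = WASTED_HOURS_PER_DAY * 60 * 60
captchas_per_day = wasted_seconds_per_day // SECONDS_PER_CAPTCHA

print(f"{captchas_per_day:,} Captchas per day")  # 180,000,000 Captchas per day
```

In other words, the two figures only fit together if something like 180 million Captchas are being typed every day.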
In an effort to turn those 10 seconds from a waste into something useful, he invented the ReCaptcha, turning us all into unknowing volunteers in a massive public works project.
There are many projects digitizing books - Google and the Internet Archive projects for example - using optical character recognition. The challenge for these projects has been working with older books where the print is often faded or blurred. The OCR software can't recognize the words. In fact, about 30% of the words are deciphered incorrectly. Most often though, the words are still decipherable by a human.
Luis's ReCaptcha takes those words and uses them as Captchas. In a two-word ReCaptcha, one of the words you type is the computer-generated puzzle that proves you're a human. The other word comes out of the archive. It's assumed that if you can decipher the first word correctly, you can decipher the second word too. If 10 people decipher the archive word the same way, it is officially digitized.
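If you like to see this kind of thing in code, here's a minimal sketch of the idea as described above. It is not ReCaptcha's actual implementation: the `WordTracker` class, its method names and the case-insensitive matching are all my own assumptions for illustration. Only the two-word setup and the 10-person agreement rule come from the segment.

```python
from collections import Counter

AGREEMENT_THRESHOLD = 10  # "if 10 people decipher the word the same way"


class WordTracker:
    """Hypothetical bookkeeping for archive words awaiting agreement."""

    def __init__(self, threshold=AGREEMENT_THRESHOLD):
        self.threshold = threshold
        self.votes = {}      # archive word id -> Counter of typed answers
        self.digitized = {}  # archive word id -> agreed transcription

    def submit(self, control_answer, control_typed, word_id, word_typed):
        """Record one two-word attempt; return True if the user passed."""
        # The computer-generated word has a known answer. Failing it means
        # the attempt is rejected and the archive guess is thrown away.
        if control_typed.strip().lower() != control_answer.strip().lower():
            return False

        # The user passed, so their reading of the archive word is trusted.
        if word_id not in self.digitized:
            guess = word_typed.strip().lower()
            counts = self.votes.setdefault(word_id, Counter())
            counts[guess] += 1
            if counts[guess] >= self.threshold:
                self.digitized[word_id] = guess
        return True


tracker = WordTracker()
for _ in range(10):
    tracker.submit("overlooks", "overlooks", "scan-001", "morning")
print(tracker.digitized)  # {'scan-001': 'morning'}
```

The key design point is that the archive word never affects whether you pass. Only the puzzle word does, which is why the scheme can lean on people who have no idea they're helping.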
How fast can the process of digitizing books and periodicals one word at a time go? Incredibly fast, especially when Facebook, Twitter and TicketMaster, among others, are using ReCaptcha. Luis says that since its launch last year, over 400 million people, or 6% of the world's population, have helped digitize words. Thirty-five million ReCaptcha words per day result in 125 to 150 digitized books. About 20 years of the New York Times archive was completed in a few months. The entire 30-year archive should be done by the end of the year.
Here's a video of Luis speaking to the Library of Congress.