Thursday, May 31, 2007
I’ve been tired of spam on my websites. The few hundred messages spammers leave every day are a bit of a nuisance. Now, though, I’ve decided to make them work harder to get their messages ignored.
Last week, reCAPTCHA came online. It’s an effort inspired by none other than Luis von Ahn, so you know it’s good.
If you don’t know him, he’s the mastermind behind similar projects that centre around a simple premise: make humans do the work that computers can’t do.
One of his ongoing projects is the ESP Game, where two online players try to come up with a common description of a random picture. It’s apparently an addictive game, and it helps solve a problem that computers are terrible at today: accurately describing the content of photographs.
Google is using his research to make image search more useful by returning content more relevant to your queries.
So, what’s the relevance to CAPTCHAs?
CAPTCHAs were invented to solve a simple problem: stopping computers from automatically filling in web forms to create accounts on popular free services, accounts that would then be used to send spam.
They are a visual version of the Turing Test, devised by Alan Turing, the genius WWII cryptanalyst, as a way to test how far machines can behave like humans: if a person who does not know who she is interacting with cannot tell the difference between a human and a machine, then the machine passes the test. It’s a measure of the success (or lack thereof) of artificial intelligence, and the idea spawned many others, including CAPTCHAs.
CAPTCHAs simply require that a small problem be solved before a web form can be submitted. Typical problems include blurry and distorted images of text or numbers that would be very hard for computers alone to decipher, but that our brains have no problem solving.
An estimated 60 million CAPTCHAs are solved by human beings every single day. That’s a huge amount of lost brain power, as nothing really useful comes out of it (apart from preventing spam, of course).
reCAPTCHA’s genius idea is to use that brain power to solve a problem that we would actually like computers to solve: digitizing books.
There are millions of books that were printed in the days before computers became ubiquitous, and no electronic version of them exists except scanned images of their pages.
Optical Character Recognition software is getting very good, but when the scan is of poor quality or the book is old, many words cannot be automatically recognised.
Humans, on the other hand, are quite good at reading words, even if they are badly distorted and barely recognisable.
Instead of making up a distorted image that you would have to recognise, reCAPTCHA simply presents you with two words, one it knows and one it doesn’t, and asks you to guess both. If you get the known word right, your guess for the unknown one is assumed to be a sensible one too.
Every unknown word is checked multiple times by different people, and you thus end up with a very accurate interpretation of the word that can be fed back into the electronic version of the book being scanned.
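To make the two-word trick more concrete, here is a minimal sketch in Python of how such a check could work. This is only my guess at the general idea, not reCAPTCHA’s actual code: the class name, the word identifiers and the two thresholds (MIN_VOTES and MIN_AGREEMENT) are all made up for illustration.

```python
from collections import Counter, defaultdict

MIN_VOTES = 3         # assumed: guesses needed before a word can be trusted
MIN_AGREEMENT = 0.75  # assumed: share of the votes the top guess must reach


def normalise(word):
    """Ignore case and surrounding spaces when comparing answers."""
    return word.strip().lower()


class TwoWordCaptcha:
    """Hypothetical two-word check: one known word, one unknown word."""

    def __init__(self):
        # unknown word id -> tally of the guesses people typed for it
        self.votes = defaultdict(Counter)

    def submit(self, known_truth, known_guess, unknown_id, unknown_guess):
        """Return True if the visitor passes the test.

        The guess for the unknown word only counts as a vote when the
        known word was typed correctly.
        """
        if normalise(known_guess) != normalise(known_truth):
            return False
        self.votes[unknown_id][normalise(unknown_guess)] += 1
        return True

    def resolved(self, unknown_id):
        """Return the agreed reading of an unknown word, or None if undecided."""
        counts = self.votes.get(unknown_id, Counter())
        total = sum(counts.values())
        if total < MIN_VOTES:
            return None
        word, top = counts.most_common(1)[0]
        return word if top / total >= MIN_AGREEMENT else None
```

A form handler would call `submit()` for each visitor, and a separate job could poll `resolved()` to collect the words enough people agree on and feed them back into the scanned book.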
CAPTCHAs do not entirely solve the problem of spamming, but they are a financial issue for spammers: automated systems cannot solve good CAPTCHAs, so some spammers rely on low-paid humans to do the dirty work for them.
It’s fine by me: poor people are getting paid to do something useful (help digitise books) and spammers are wasting their money in the process. In my case, they lose even more, because I use moderation to read comments before they are visible and Akismet to detect spam, which means that however hard they try, their spam never gets anywhere anyway.
In the fight against spammers, it makes me happy to know I’m costing them something for a change…
- Breaking visual CAPTCHA
- Vulnerabilities of some CAPTCHA implementations (reCAPTCHA isn’t one of them)
- Luis von Ahn’s website
- Lecture on Human computation by Luis von Ahn (technical, but very inspiring)
- The Internet Archive, which benefits from solved reCAPTCHAs and is also an immeasurable source of free books