Have you ever wondered how spam ends up in your email inbox? Most people probably don’t know this, but the email client you use most likely has a sending limit of X emails per day. This means that after X emails are sent (usually around 500), your emails will fail to send until the end of a 24-hour cycle. If you didn't know this, that's perfectly fine; you are part of the majority. Most people don't send anywhere near 500 emails a day, so why would you? Unless you work in a marketing department or regularly send company-wide emails to over 500 employees, it's extremely unlikely you would send that many emails. So how do we get spam emails? Spamming is defined as the use of messaging systems to send an unsolicited message to large numbers of recipients. If every email account has a sending limit, how do spammers bypass it? The answer: bots.
Bots, the parasites of the software world, are like barnacles on a boat; they hitch a free ride and burn up more of your ship's resources. They are software programs created to perform the same interaction with a website repeatedly. Without restraint, this can put a tremendous strain on the servers running your website. This can create poor performance for the entire user base, or increase storage costs. The issue plagues large and small companies alike. For companies with larger user bases, most bots are usually created for a specific purpose: marketing.
This isn’t anything new. At one point all of our inboxes were littered with tales of Nigerian princes, certain body enhancement pill recommendations, and brand new girls in your area with less than wholesome intentions. The sender’s goal is usually to get you to send money, buy a product, or sign up for their new website, almost always for the sender's profit. These illegitimate business practices are a problem because most of the emails delivered this way are scams that have cost internet users hundreds of millions of dollars. The bots that spread these scams never sleep and have penetrated a variety of communities.
The scams work like this. When a scammer sends a message to a user, the goal is to get the recipient to click a link. The share of users who clicked the link, out of all the users who received the message, is called the conversion rate. The users who completed the goal the sender intended are called converts. The more users who convert, the more profit the scammer makes. So the scammer has two options: make more convincing scams, or send more scams to more people. Spoiler alert: they do both. Over time the scams evolved from fake princes and sugar pills to impersonating brands you know like Geico, Bank of America, and Dunkin Donuts. Increasing the believability of a scam increases the number of people who open the email and click the link, thus improving the conversion rate. Increasing the number of emails sent to potential marks increases the number of overall converts.
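To make the two levers concrete, here is a minimal sketch of the arithmetic. The function names and all of the numbers are made up for illustration; they are not figures from any real spam campaign.

```python
def conversion_rate(clicks: int, messages_sent: int) -> float:
    """Share of recipients who clicked the link (a value between 0 and 1)."""
    if messages_sent <= 0:
        raise ValueError("messages_sent must be positive")
    return clicks / messages_sent


def expected_converts(messages_sent: int, rate: float) -> float:
    """Expected number of converts for a given volume and conversion rate."""
    return messages_sent * rate


# Hypothetical baseline: 10,000 emails, 10 of which get a click.
baseline_rate = conversion_rate(clicks=10, messages_sent=10_000)
baseline = expected_converts(10_000, baseline_rate)

# Lever 1: a more convincing scam (double the rate, same volume).
more_convincing = expected_converts(10_000, baseline_rate * 2)

# Lever 2: more emails (same rate, double the volume).
more_volume = expected_converts(20_000, baseline_rate)

print(baseline, more_convincing, more_volume)
```

Under these assumed numbers, either lever doubles the number of converts, and pulling both at once quadruples it. That is why a scammer has every reason to do both.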
Increasing the number of overall converts is why these scammers need to create bots. Bots are programs that pose as humans to create accounts and send messages. If you have ever had a Facebook account, you have probably come in contact with a bot like this before. The bot will usually create an account and post an image of an attractive woman as its profile picture. Then it will add as many people as possible as friends. What follows is what you would expect. Guys want to be friends with attractive-looking women, so they accept. Women will eventually see enough mutual friends to believe they might have met the person at some point. Once the bot has enough people added, it posts a link on everyone's Facebook wall or tags all of its "friends" in a post on its own wall, usually containing a link to the website associated with the scam.
As you can see, this is poisonous to any social community. Having your grandmother see an adult entertainment ad plastered on your Facebook wall does not make Thanksgiving dinner any easier. Differentiating between computers and humans online is hard because humans and applications access the internet in the same way. It is so hard, in fact, that the chief scientist at Yahoo, during the height of the company's dominance, included it in a list of 10 problems Yahoo couldn’t solve in a talk he gave at Carnegie Mellon University. Luis von Ahn, an enterprising Ph.D. candidate at the time, was sitting in that crowd. Inspired by the challenge, he came up with a solution together with his Ph.D. advisor: CAPTCHA.
CAPTCHA stands for Completely Automated Public Turing test to Tell Computers and Humans Apart. The goal was to construct a small test that humans could do and computers would have a hard time doing. If you have created an account on a popular website after 2003 you most likely have run into a CAPTCHA. The original CAPTCHA was a white box with black lettering. It contained characters drawn by a computer in a distorted fashion. Usually, the individual letters would be rotated in alternating directions, varying in size and position, or warped inwards or outwards depending on what was generated. At the time, computers had a harder time deciphering the text in images.
Optical Character Recognition (OCR) is the electronic conversion of images into text. If a bot wants to read text in an image, it needs this technology to do so. Luis knew that at the time the output from OCR was only about 70% - 90% accurate in near-perfect conditions. So if he simply warped the image a small amount, it would be unreadable to bots but perfectly fine for humans to decipher. The solution worked, and the problem was so dire that Yahoo implemented it within a week of receiving it. Over the next few years, the CAPTCHA was adopted across the web for blocking these programs.
Now this solution was fantastic for Yahoo because they didn’t have to support millions of fake users on their system. There was less spam in online communities overall and the solution was scalable. However, there was one drawback to this solution: time. It costs the users of these websites time to fill out the test for every account they make. Most people found the tests annoying, and some images generated by the website were so distorted that humans couldn’t make them out. As Luis went about his days he would hear this mentioned repeatedly, until it began to plague him. He figured that every completed CAPTCHA wastes about ten seconds of time. Doing some calculations, he estimated that a CAPTCHA is completed 200,000,000 times a day. He multiplied the two numbers and realized that about 500,000 hours of humanity’s time is wasted every day completing these tests. So he thought to himself, "Is there something we can get humans to do while typing a CAPTCHA that is also useful?" His solution: digitizing books.
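The back-of-the-envelope math is easy to check. Here is a short sketch using the two estimates above (ten seconds per CAPTCHA, 200,000,000 CAPTCHAs per day); the constant and function names are just labels chosen for this example.

```python
SECONDS_PER_CAPTCHA = 10          # estimate quoted above
CAPTCHAS_PER_DAY = 200_000_000    # estimate quoted above
SECONDS_PER_HOUR = 3600


def hours_spent_per_day(captchas_per_day: int, seconds_each: int) -> float:
    """Total human hours spent solving CAPTCHAs in one day."""
    return captchas_per_day * seconds_each / SECONDS_PER_HOUR


hours = hours_spent_per_day(CAPTCHAS_PER_DAY, SECONDS_PER_CAPTCHA)
print(f"{hours:,.0f} hours per day")  # prints "555,556 hours per day"
```

The exact product is a bit over 555,000 hours, which is where the rounded "about 500,000 hours" figure comes from.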
As mentioned earlier, OCR technology at the time was only 70% - 90% accurate, even in near-perfect conditions. This is obviously an issue when scanning a printed book and converting it into text on the computer. Having to re-read the pages and compare every word for accuracy can take longer than simply typing the book straight through. It just so happens that Luis was the perfect person to solve this problem as well. His Ph.D. research was essentially based on the concept that later became known as crowdsourcing. He describes it as “finding things that were hard for computers to do and getting humans to do them”. If a large number of people each do a small task, together they can complete a larger goal. He decided to call this new company reCAPTCHA.
reCAPTCHA works like this. When a computer guesses what letter is in an image, it gives the letter something called a confidence score. A confidence score is a decimal between 0 and 1 that the algorithm produces to indicate how confident it is in a given output. For example, if the computer is sure that it read the letter “M”, it may give it a .96 confidence score. However, if the algorithm is unsure of the letter “Y”, it may give it a .27 confidence score. Over time, books suffer natural wear and tear, and scanning adds its own inconsistencies. Pages can be scanned on a slant, and the software can have trouble interpreting certain letters if there are many varying fonts on the page. reCAPTCHA would take images of words from the book and feed them through the CAPTCHA system along with a word the system had generated. If you typed the generated word correctly, the system assumed you were human. Then whatever you typed for the word taken from the book was given its own confidence score. If enough people typed the same word, it was considered correct.
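The voting idea can be sketched in a few lines. This is only an illustration of the scheme described above, not reCAPTCHA's actual implementation: the function names, the tuple layout, and the thresholds (`min_votes`, `min_agreement`) are all assumptions made for this example.

```python
from collections import Counter
from typing import Optional


def is_human(control_answer: str, control_truth: str) -> bool:
    """The user passes if they typed the system-generated word correctly."""
    return control_answer.strip().lower() == control_truth.strip().lower()


def resolve_unknown_word(
    submissions: list[tuple[str, str, str]],
    min_votes: int = 3,           # hypothetical threshold
    min_agreement: float = 0.75,  # hypothetical threshold
) -> Optional[tuple[str, float]]:
    """Each submission is (control_answer, control_truth, unknown_answer).

    Only answers from users who passed the control word are counted.
    Returns (word, confidence) once enough of them agree, else None.
    """
    votes = Counter(
        unknown.strip().lower()
        for control_answer, control_truth, unknown in submissions
        if is_human(control_answer, control_truth)
    )
    total = sum(votes.values())
    if total < min_votes:
        return None  # not enough trusted answers yet
    word, count = votes.most_common(1)[0]
    confidence = count / total
    return (word, confidence) if confidence >= min_agreement else None


submissions = [
    ("overlook", "overlook", "morning"),
    ("overlook", "overlook", "morning"),
    ("overlook", "overlook", "moming"),   # an honest misreading
    ("overlook", "overlook", "morning"),
    ("ovrlok", "overlook", "xxxxx"),      # failed the control word, ignored
]
print(resolve_unknown_word(submissions))  # prints ('morning', 0.75)
```

Here the confidence score for the book word is simply the share of trusted users who typed the same thing. A real system would likely weigh the OCR's own guess as well, but the core trade is the same: the known word proves you are human, and the unknown word gets your vote.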
Using this method they can get millions of people to help digitize any book they want. To build a bigger base of websites running the software, reCAPTCHA was given away for free to any site that wanted to use it. Facebook, in need of a CAPTCHA system, adopted reCAPTCHA in 2006 as part of their registration flow to prevent bots from creating accounts on their website. At reCAPTCHA’s height, it was serving 100,000,000 CAPTCHAs a day across a variety of websites.
So the burning question: how did reCAPTCHA make money? The CTO of the New York Times approached Luis after he finished giving a TED Talk in Dallas, TX. The New York Times had about 130 years of newspaper content it was unable to digitize because computers were unable to read it. They ended up striking a deal: for every year of content digitized for the New York Times, reCAPTCHA would charge $42,000, making it a $5,460,000 contract. Thanks to the scale of the reCAPTCHA network of websites, they could digitize a year of content in about a week.
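For the curious, here is the contract math as a quick sketch. The inputs are the figures quoted above; the timeline at the end is an extrapolation that assumes the one-week-per-year pace held for the whole archive, which the story does not confirm.

```python
YEARS_OF_CONTENT = 130          # figure quoted above
PRICE_PER_YEAR_USD = 42_000     # figure quoted above
WEEKS_PER_YEAR_OF_CONTENT = 1   # "about a week" per year of content
WEEKS_IN_A_YEAR = 52

contract_value = YEARS_OF_CONTENT * PRICE_PER_YEAR_USD
weeks_needed = YEARS_OF_CONTENT * WEEKS_PER_YEAR_OF_CONTENT
calendar_years_needed = weeks_needed / WEEKS_IN_A_YEAR  # assumes a constant pace

print(f"Contract value: ${contract_value:,}")  # prints "Contract value: $5,460,000"
print(f"Time to finish: {calendar_years_needed} years")  # prints "Time to finish: 2.5 years"
```

At that pace, the whole archive would take roughly two and a half years, with each week of work worth $42,000.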
Crowdsourcing work is probably one of the most underutilized ways to generate profit. Aligning incentives across businesses is one of the more distinctive applications of the concept. When designing your new website, if you are unsure of the business model, consider the army of users at your fingertips willing to trade time for services.
I only knew a little bit about this, prior. Very informative, thanks, Jonathan!
This was a great read! It's amazing because I remember all those distorted CAPTCHAS' I had to navigate through when they came out, but I didn't know its significance until after this newsletter! Very informative, keep up the good work!