How I used Twitter to harvest 8000 email addresses in twelve hours
April 7, 2010
After arriving home from work this evening I turned to the Twittersphere to see what was happening in the blogosphere. I came across an article detailing how some guy got sued by Facebook for scraping their site. It is an interesting read, and eye-opening about how easy it is to do such things on social networking sites. Facebook has to be one of the worst in terms of security, or at least apparent security; on Twitter, at least, you know that everyone can access what you post unless your profile is private.
Back to the topic: the idea I had was to investigate how (ridiculously) easy it would be to scrape email addresses from Twitter. Having already created the Rainbow Bot, I knew that searching for particular terms on Twitter using a certain gem was a piece of cake. So I started searching for ‘@gmail’.
The results were pretty much as expected: rather a lot of users had Tweeted their email address. Most of them used some sort of simple obfuscation (a space between the user and domain; non-word characters before the domain), and it was rather easy to tell when an address was invalid (three consecutive dots is a sure-fire no-no). Also interesting was the number of users who replaced the user part of the address with ‘username’ or ‘twitter’, for example ‘Send CVs to my twitter @gmail.com’.
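For illustration, here is a minimal Python sketch of that kind of clean-up. It is not the code used for the experiment (that relied on a Ruby gem). The regular expression, the `PLACEHOLDERS` set and the `extract_addresses` name are all assumptions made for this example, and it only handles ‘@gmail.com’.

```python
import re

# Hypothetical pattern (an assumption, not the original code):
#   1. a user part that starts and ends with a letter or digit,
#   2. any run of whitespace or other non-word characters (the "simple obfuscation"),
#   3. the literal "@gmail.com".
ADDRESS = re.compile(
    r"([A-Za-z0-9](?:[A-Za-z0-9._+-]*[A-Za-z0-9])?)[^\w@]*@(gmail\.com)\b",
    re.IGNORECASE,
)

# Stand-in words people typed instead of a real user part.
PLACEHOLDERS = {"username", "twitter"}


def extract_addresses(text):
    """Return the plausible Gmail addresses found in one Tweet."""
    found = []
    for user, domain in ADDRESS.findall(text):
        user = user.lower()
        # Three consecutive dots means the address cannot be valid.
        if "..." in user:
            continue
        # 'my twitter @gmail.com' style placeholders are not addresses.
        if user in PLACEHOLDERS:
            continue
        found.append(f"{user}@{domain.lower()}")
    return found


if __name__ == "__main__":
    for tweet in [
        "mail me: jane.doe @gmail.com",
        "Send CVs to my twitter @gmail.com",
    ]:
        print(tweet, "->", extract_addresses(tweet))
```

The first sample Tweet yields `jane.doe@gmail.com`; the second yields nothing, because ‘twitter’ is treated as a placeholder rather than a user name.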
I then decided to expand this to a couple of other free email services (Hotmail and Yahoo) and, as expected, the number of addresses went up. A simple search gave me a list of free email domains (NB: I have doubts that it is a list of ‘every possible email domains’ :P), which would allow me to expand this further. I didn’t even bother investigating how to handle pagination, and without special permissions the Twitter search API only returns a limited subset of the total number of Tweets.
In total I managed to scrape 8000 addresses over a period of approximately twelve hours. So, in conclusion, social networks appear to be a spammer’s paradise. 🙂