AOL Releases Searches From 650,000 Users

Remember the big hubbub last year over the government trying to get search data from Google and Microsoft? Well, apparently no one at AOL does. They just released search data from 650,000 users. They removed the AOL username but replaced it with a random ID number, so all the data is still grouped by user, and apparently it includes lots of stuff that people would be embarrassed by, or jailed over. According to this blog, it includes user searches for terms like “how to kill your wife”, “how to kill a wife”, “wife killer”, “pictures of dead people”, “decapitated photos” and many more. I wonder what that guy is up to? Hopefully someone can check him out.

So, what does this mean? It won’t be good for those users, I’m sure. AOL pulled the research page and the data, but thanks to this wonderful thing called the internet, the data is still available, and I’m downloading it now. This will certainly help some affiliate marketers out with search term data they can use, but who else could benefit? Combine vanity searches, where people search for their own name to see what is out there, with Social Security info and you have identity theft. Combine them with some porn searches and you could end up with big embarrassments for some people. Combine them with drug or other types of searches and people could end up in jail. Maybe, I don’t know, but I can’t believe AOL would release data like this without talking it over.

Michael Arrington from TechCrunch says,

The utter stupidity of this is staggering. AOL has released very private data about its users without their permission. While the AOL username has been changed to a random ID number, the ability to analyze all searches by a single user will often lead people to easily determine who the user is, and what they are up to. The data includes personal names, addresses, social security numbers and everything else someone might type into a search box.
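To see why swapping the username for a number doesn’t protect anyone, here is a rough Python sketch. The rows, the ID numbers and the `queries_by_user` helper are all made up for illustration; nothing here comes from the actual AOL file. The point is only that once every query carries the same ID, one pass over the data rebuilds each person’s full search history.

```python
from collections import defaultdict

# Hypothetical (user_id, query) rows. The IDs and queries are invented
# for illustration and do not come from the released data.
rows = [
    (1001, "jane doe springfield"),
    (2002, "cheap flights"),
    (1001, "dentist near elm street springfield"),
    (1001, "jane doe high school reunion"),
]

def queries_by_user(rows):
    """Group queries under the ID they were logged with, keeping their order."""
    profiles = defaultdict(list)
    for user_id, query in rows:
        profiles[user_id].append(query)
    return dict(profiles)

profiles = queries_by_user(rows)

# Everything "anonymous" user 1001 searched for, side by side.
for query in profiles[1001]:
    print(query)
```

A vanity search plus a couple of local searches, all under one number, is often enough to guess who is typing.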

The web page from AOL Research, http://research.aol.com/pmwiki/pmwiki.php?n=Research.500kUserQueriesSampledOver3Months, that so many blogs are referencing is gone, but below is the text of the page from the Google cache.

500k User Queries Sampled Over 3 Months

This collection consists of ~20M web queries collected from ~500k users over three months. Where the data is sorted by an anonymized user id:

The data set includes {UserID, Query, QueryTime, ClickedRank, DestinationDomainUrl}.

The goal of this collection is to provide a real query log based on users. It could be used for personalization, query reformulation or other type of search research.

The graph below shows that not all users are equal in terms of usage.

Basic Collection Statistics
Dates:
01 March, 2006 – 31 May, 2006

Normalized queries:
19,076,613 queries total
10,865,119 unique (normalized) queries
658,086 unique user ID’s

Data View
Below we rank domains by the probability of click-through and ratio of unique queries. Pick a domain and see some of the top queries that users searched for to see that domain.

If you have other views or insights from the data add it to our U500k community.

We have split the data into 10 randomly assigned groups of users. This will facilitate experimentation on smaller sets of data, as well as consistent training/testing splits across experiments. For example, in our own experiments we have used 8 groups of users’ data for training and 1 group for testing. We repeat our experiments 10 times for cross-validation with a “leave one out” approach. I suggest that if people are not interested in cross validation, they should train on 6 groups and test on 3, again leaving one out (e.g. train on groups 1-6, test on groups 8-10). The assignment of groups is truly random so any similar arrangement is valid. However, if we all use the same splits we can all compare data easily.

Please reference the following publication when using this collection:

G. Pass, A. Chowdhury, C. Torgeson, “A Picture of Search”, The First International Conference on Scalable Information Systems, Hong Kong, June, 2006.

This collection is distributed for non-commercial research use only. Any application of this collection for commercial purposes is STRICTLY PROHIBITED.

CAVEAT EMPTOR — SEXUALLY EXPLICIT DATA! Please be aware that these queries are not filtered to remove any content. Pornography is prevalent on the Web and unfiltered search engine logs contain queries by users who are looking for pornographic material. There are queries in this collection that use SEXUALLY EXPLICIT LANGUAGE. This collection of data is intended for use by mature adults who are not easily offended by the use of pornographic search terms. If you are offended by sexually explicit language you should not read through this data. Also be aware that in some states it may be illegal to expose a minor to this data. Please understand that the data represents REAL WORLD USERS, un-edited and randomly sampled, and that AOL is not the author of this data.

500k User Test Collection (tar gzipped) (79 downloads)
Please comment on this collection, add references to works using it or suggest improvements that will help other researchers. Tell us about your experiences on this collection at U500k or post shorter comments here.
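The train/test arrangement described on the cached page can be sketched in a few lines of Python. This is only my reading of it: I’m assuming the groups are numbered 1 to 10, and the `folds` helper and the way it rotates the left-out group are my own invention, not anything AOL published.

```python
from typing import Iterator, List, Tuple

# The page says users were split into 10 groups; numbering them 1..10
# is an assumption made for this sketch.
GROUPS = list(range(1, 11))

def folds(n_train: int, n_test: int) -> Iterator[Tuple[int, List[int], List[int]]]:
    """Yield (left_out, train_groups, test_groups), leaving out each group once."""
    if n_train + n_test != len(GROUPS) - 1:
        raise ValueError("train + test must use exactly the 9 remaining groups")
    for left_out in GROUPS:
        remaining = [g for g in GROUPS if g != left_out]
        # First n_train remaining groups train, the rest test.
        yield left_out, remaining[:n_train], remaining[n_train:]

# 8 train / 1 test, repeated 10 times, as in the page's own experiments.
for left_out, train, test in folds(8, 1):
    print(left_out, train, test)
```

With 6 training groups and 3 test groups, the fold that leaves out group 7 gives the example from the page: train on groups 1-6, test on groups 8-10.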

What were they thinking? I bet the data is already in use in many places and people will be feeling the repercussions from this for a long time.

Zoli’s Blog is calling for a boycott of AOL.


