Jimmy Daniels


AOL Releases Searches From 650,000 Users

Remember the big hubbub last year when the government tried to get search data from Google and Microsoft? Well, apparently no one at AOL does. They just released search data from 650,000 users. They removed the AOL usernames but simply replaced each one with a random ID number, so all the data is still grouped by user, and apparently it includes lots of stuff that lots of people would be embarrassed by, or jailed over. According to this blog, it includes user searches for terms like “how to kill your wife”, “how to kill a wife”, “wife killer”, “pictures of dead people”, “decapitated photos” and many more. I wonder what that guy is up to? Hopefully, someone can check him out.

So, what does this mean? It won’t be good for those users, I’m sure. Even though AOL pulled the research page and the data, thanks to this wonderful thing called the internet the data is still available, and I’m downloading it now. This will certainly help some affiliate marketers out with search term data they can use, but who else could benefit? Combine vanity searches, where people search for their own name to see what is out there, with social security info and you can have identity theft. Combine them with some porn searches and you could end up with some big embarrassments for some people. Combine them with drugs or other types of searches and people could end up in jail. Maybe, I don’t know, but I can’t believe AOL would release this data like this without talking it over.

Michael Arrington from TechCrunch says:

The utter stupidity of this is staggering. AOL has released very private data about its users without their permission. While the AOL username has been changed to a random ID number, the ability to analyze all searches by a single user will often lead people to easily determine who the user is, and what they are up to. The data includes personal names, addresses, social security numbers and everything else someone might type into a search box.

The web page from AOL Research, http://research.aol.com/pmwiki/pmwiki.php?n=Research.500kUserQueriesSampledOver3Months, that so many blogs are referencing is gone, but below is the text of the page from the Google cache.

500k User Queries Sampled Over 3 Months

This collection consists of ~20M web queries collected from ~500k users over three months. Where the data is sorted by an anonimized user id:

The data set includes {UserID, Query, QueryTime, ClickedRank, DestinationDomainUrl}.

The goal of this collection is to provide a real query log based on users. It could be used for personalization, query reformulation or other type of search research.
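(An aside from me, not part of the AOL page: here is a minimal Python sketch of what a log in this shape looks like once you read it in and group the queries by the anonymized ID. The tab-separated layout, the `QueryRecord`, `parse_line` and `group_by_user` names, and the sample rows are all my own assumptions for illustration. The page only lists the five field names.)

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class QueryRecord(NamedTuple):
    """One log row, using the five fields named on the AOL page."""
    user_id: str
    query: str
    query_time: str
    clicked_rank: Optional[int]
    destination_domain_url: Optional[str]


def parse_line(line: str) -> QueryRecord:
    # Assumption: five tab-separated fields, with the last two left
    # empty when the user did not click on any result.
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (5 - len(fields))
    user_id, query, query_time, rank, url = fields[:5]
    return QueryRecord(
        user_id,
        query,
        query_time,
        int(rank) if rank else None,
        url or None,
    )


def group_by_user(lines):
    """Collect every query under the anonymized ID that issued it."""
    by_user = defaultdict(list)
    for line in lines:
        record = parse_line(line)
        by_user[record.user_id].append(record.query)
    return dict(by_user)


# Made-up rows for illustration only, not taken from the real data.
sample = [
    "101\tweather\t2006-03-01 10:00:00\t1\thttp://example.com\n",
    "101\tlocal news\t2006-03-01 10:05:00\t\t\n",
    "202\trecipes\t2006-03-02 09:00:00\t3\thttp://example.org\n",
]
grouped = group_by_user(sample)
```

The point of the grouping step is the same one Arrington makes above: swapping the username for a number does not break the link between one person's searches, so every query a user typed still ends up in one list.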

The graph below shows that not all users are equal in terms of usage.

Basic Collection Statistics
Dates:
01 March, 2006 - 31 May, 2006

Normalized queries:
19,076,613 queries total
10,865,119 unique (normalized) queries
658,086 unique user ID’s

Data View
Below we rank domains by the probability of click-through and ratio of unique queries. Pick a domain and see some of the top queries that users searched for to see that domain.

If you have other views or insights from the data add it to our U500k community.

We have slit the data into 10 randomly assigned groups of users. This will facilitate experimentation on smaller sets of data, as well as consistent training/testing splits across experiments. For example, in our own experiments we have used 8 groups of users’ data for training and 1 group for testing. We repeat our experiments 10 times for cross-validation with a “leave one out” approach. I suggest that if people are not interested in cross validation, they should train on 6 groups and test on 3, again leaving one out (e.g. train on groups 1-6, test on groups 8-10). The assignment of groups is truly random so any similar arrangement is valid. However, if we all use the same splits we can all compare data easily.
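(Another aside from me, not part of the AOL page: the split scheme described above can be sketched in a few lines of Python. This is my own reading of the paragraph, assuming the ten groups are numbered 1 to 10 and the folds are produced by rotating the group order. The `rotating_splits` name and its parameters are hypothetical, and the page does not say how AOL actually rotated its folds.)

```python
def rotating_splits(n_groups=10, n_train=8, n_test=1):
    """Return one (train, held_out, test) tuple of group numbers per fold.

    Each fold rotates the group order by one position, trains on the
    first n_train groups, tests on the last n_test groups, and leaves
    whatever sits in between unused.
    """
    groups = list(range(1, n_groups + 1))
    splits = []
    for shift in range(n_groups):
        rotated = groups[shift:] + groups[:shift]
        train = sorted(rotated[:n_train])
        held_out = sorted(rotated[n_train:n_groups - n_test])
        test = sorted(rotated[n_groups - n_test:])
        splits.append((train, held_out, test))
    return splits


# Ten folds of 8 training groups, 1 unused group and 1 test group.
cross_validation_folds = rotating_splits()

# The single 6/3 split from the page's example: train on 1-6, test on 8-10.
train, held_out, test = rotating_splits(n_train=6, n_test=3)[0]
```

With the defaults, the first fold trains on groups 1 to 8, leaves group 9 out and tests on group 10. With `n_train=6, n_test=3` the first fold reproduces the example in the text, with group 7 as the one left out.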

Please reference the following publication when using this collection:

G. Pass, A. Chowdhury, C. Torgeson, “A Picture of Search”, The First International Conference on Scalable Information Systems, Hong Kong, June, 2006.

This collection is distributed for non-commercial research use only. Any application of this collection for commercial purposes is STRICTLY PROHIBITED.

CAVEAT EMPTOR — SEXUALLY EXPLICIT DATA! Please be aware that these queries are not filtered to remove any content. Pornography is prevalent on the Web and unfiltered search engine logs contain queries by users who are looking for pornographic material. There are queries in this collection that use SEXUALLY EXPLICIT LANGUAGE. This collection of data is intended for use by mature adults who are not easily offended by the use of pornographic search terms. If you are offended by sexually explicit language you should not read through this data. Also be aware that in some states it may be illegal to expose a minor to this data. Please understand that the data represents REAL WORLD USERS, un-edited and randomly sampled, and that AOL is not the author of this data.

500k User Test Collection (tar gzipped) (79 downloads)
Please comment on this collection, add references to works using it or suggest improvements that will help other researchers. Tell us about your experiences on this collection at U500k or post shorter comments here.

What were they thinking? I bet the data is already in use in many places and people will be feeling the repercussions from this for a long time.

Zoli’s Blog is calling for a boycott of AOL.

Posted by Jimmy Daniels August 2006


9 Responses to “AOL Releases Searches From 650,000 Users”

Jimmy Daniels » AOL User No. 4417749 Found Easily Says: August 9th, 2006 at 10:55 am

[...] Just finished reading this article from the New York Times about how one reporter easily found search user No. 4417749, a user found because AOL Released the Searches of 650,000 users. [...]

[...] Jimmy took notice of my recent foray into Rosenbaum’s Dilemma but explored things in a different light- the plague of UA Pron and AOL’s releasing user’s search data- whoops! [...]

[...] Even though it’s been a terrible week for AOL, and 650,000 of it’s users, Google Inc CEO, Eric Schmidt, said the mistake by AOL will not change Googles practice of storing the user search data for use in it’s search engines. “We are reasonably satisfied … that this sort of thing would not happen at Google, although you can never say never,” Schmidt said during an appearance at a major search engine conference in San Jose. [...]

Jimmy Daniels » More on the AOL Data Release Says: August 11th, 2006 at 12:12 pm

[...] Apparently the AOL release of data now has the attention of Capitol Hill, not sure if this is a good thing or not, as the more the government gets involved with the internet, the worse it will be for it. AOL’s recent privacy gaffe that exposed user search histories may breathe new life into a proposal to slap strict rules on what data Internet companies may collect. [...]

vixenk.net Says: August 13th, 2006 at 2:35 am

[...] Since the recent AOL blunder, people have gotten ever more concerned about the privacy of their searches. Mostly I’ve been seeing suggestions like this that recommends browsing via a proxy. There’s still a very large loophole in this idea though - you. Remember, AOL didn’t reveal user identities, yet several people were tracked down via their released searches. I’m sorry, but no amount of proxies in the world are going to protect your identity from being revealed if you’re like most average users and occassionally search for things related to your identity and location. Besides, whether you’re behind a proxy or not, the search engines that insist on keeping your searches are still getting paid. [...]

[...] The fallout is starting at AOL for the release of search data of 650,000 users, with AOL releasing it’s CTO and two others. They’ve apologized and gotten the attention of capitol hill, now it’s all over but the firing. Maureen Govern, who joined AOL as CTO last September, will leave the company immediately along with two other employees who thought publishing the details of 19 million Web searches performed by 600,000 users to the Web was a grand idea. Source: SiliconValley.com [...]

[...] File this in the review first, test later category. Recently Infoworld profiled a new Web Browser Browzar, which sounds more like an enemy of Godzilla, saying that is leaves no footprints. In the story, the creator even brings up the recent gaff by AOL in which they released the searches of 650,000 users, as if the browser would help protect them against this. “Privacy is becoming a bigger issue,” Ahmed said, pointing to the recent leak of more than 20 million user search queries by AOL LLC. “The AOL story highlights the issue that some of the things people are searching for are very, very personal.” [...]

[...] Three AOL members have filed suit against AOL LLC, the internet division of Time Warner. The lawsuit, which was filed as C-06-5866, says AOL violated their privacy by posting their searches online. In a previous post I mentioned AOL Releasing the Searches From 650,000 Users; that’s 20 million online searches, and lots of privacy lost. Users were easily found, and I’m sure many are hiding and hoping no one figures out they were the ones searching on how to kill my wife, or the many searches by pedophiles. I hope they make them do something. Maybe it will get other search engines, like Google, thinking about the storing of user data, and maybe they will stop storing user searches as well. The complaint states that on July 31, AOL posted on its publicly accessible website a database containing roughly 20 million Internet search queries entered over a three-month period by approximately 658,000 different AOL members. Plaintiffs claim the database detailed the date and time the AOL member conducted each search, as well as any websites the member clicked on after AOL’s search engine returned its results. No AOL user names were attached to the database, but the complaint says search terms contain personal information, enough to identify the AOL member. The Complaint alleges that although AOL later pulled the database from its website, the database had already been downloaded, reposted, and made searchable on other websites. Source: Yahoo Finance [...]

Security Roundup Tips Dr.com Says: March 26th, 2007 at 7:51 pm

[...] Spamdexing “R” Us A researcher is curious as to how many times a user can get hit with a driveby download and malware infection just by clicking on a Google search result. He took the AOL search data that was released accidentally by AOL and tried to figure it out. [...]
