AOL’s blunder in publicly posting 20 million search records, helpfully sorted according to user, illuminates a long-standing discussion over the significance of logs held by search engines.
The data was published for research purposes, and is anonymous only in the sense that each user is identified by an ID number rather than a name. According to AOL, this counts as “no personally identifiable data”, yet the logs contain clues to the identity of individuals, including names, locations and even, on occasion, social security numbers.
The data has been withdrawn, but the internet never forgets. Mirror sites sprang up, and it did not take long for someone to put up a neat web interface enabling online searches of the supposedly withdrawn information.
The leaked logs are US-based and mostly concern home users, who account for most of AOL’s business. AOL’s search is powered by Google, so presumably these logs are in effect records of Google searches.
What the incident demonstrates is that the major search engines now hold a remarkable amount of data that can easily be traced back to its source. In some cases the logs may also include desktop searches. The significance and usefulness of this data are open to debate, though it has obvious commercial value for purposes such as targeted advertising.
The deal is that if you want to benefit from the excellent search engines offered by Google and its rivals, you must accept that your activities are tracked and have the potential to reveal information that you might prefer to keep secret. Is that a fair deal? Perhaps, if users know what they are signing up for. It is here that the search engines lack transparency.
Few users who happened to search the web through AOL earlier this year had any notion of the risk they were taking. Of course there are privacy policies describing what data is held. But market leader Google does not even bother to link to its policy on its home page, though Yahoo and MSN do provide links. In all cases these companies promise to protect our information, but AOL’s mishap shows that this can go wrong.
Several things should now happen. Ideally the search engines should stop recording data at this level of detail, or at least offer an opt-out. They should also put more effort into aggregating the data to make it harder to trace back to its source, and they should reduce the length of time it is stored to weeks or months rather than years.
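To make the aggregation and retention suggestions concrete, here is a minimal sketch in Python. It is an illustration only, not a description of how any search engine actually processes its logs. The record layout (user ID, query, timestamp), the function name `aggregate_logs`, the 90-day window and the three-user threshold are all assumptions chosen for the example, and the sample data is invented.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def aggregate_logs(records, now, max_age=timedelta(days=90), min_users=3):
    """Return {query: count} with no user IDs attached.

    records is an iterable of (user_id, query, timestamp) tuples.
    This layout is hypothetical, chosen only for illustration.
    """
    cutoff = now - max_age
    users_per_query = defaultdict(set)
    counts = defaultdict(int)
    for user_id, query, timestamp in records:
        if timestamp < cutoff:
            continue  # retention limit: records older than max_age are ignored
        key = query.strip().lower()  # treat trivially different spellings as one query
        users_per_query[key].add(user_id)
        counts[key] += 1
    # Suppress rare queries: anything typed by fewer than min_users distinct
    # users is the kind of entry most likely to point to one individual.
    return {q: n for q, n in counts.items() if len(users_per_query[q]) >= min_users}


# Invented sample data.
now = datetime(2006, 8, 15)
records = [
    (1, "cheap flights", datetime(2006, 8, 1)),
    (2, "Cheap Flights", datetime(2006, 8, 2)),
    (3, "cheap flights ", datetime(2006, 8, 3)),
    (1, "jane doe 123 main street", datetime(2006, 8, 4)),
    (4, "cheap flights", datetime(2005, 1, 1)),  # older than the retention window
]
summary = aggregate_logs(records, now)
print(summary)
```

The result keeps the popular query as a bare count and drops both the out-of-date record and the query that only one user typed. A threshold like this reduces the risk of tracing a query back to a person, but it does not guarantee anonymity.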
Finally, AOL’s misstep demonstrates the inadequacy of the internal security used to protect this data. “We are taking steps to ensure that this type of thing never happens again,” says AOL. It sounds like there is plenty to do.