Does entropic de-anonymization of sparse microdata set your pulse racing? If so, you’re gonna love this paper [PDF] by Arvind Narayanan and Vitaly Shmatikov of the University of Texas at Austin. Even if your stats math is as rusty as mine, however, the paper makes fascinating reading - and is surprisingly readable, if you skip over the algorithm-heavy bit in the middle.
For those of you who don’t have time to read an academic paper, here’s a summary. The paper presents a method for taking an ‘anonymized’ data set – for example, the Netflix Prize data – and locating the record for a user about whom you have a limited set of approximate data. If, for example, I know that you’re a fan of The Bourne Ultimatum, Minority Report and Delicatessen but that you absolutely hated Hitch, Music and Lyrics and Along Came Polly (can’t blame you for the last one, by the way), then there’s about an 80% chance I can find your entry in the Netflix Prize dataset (assuming it’s there – it’s only a 10% sample of Netflix’s total ratings data). And I can do this even if I don’t know anything else about you.
The reason this is possible is that the data is so-called ‘sparse’ data – each record (which represents a Netflix user) has many, many fields (each field represents a particular movie), of which only a tiny fraction are non-null (because even the most prolific Netflix user has only rated a tiny fraction of Netflix’s total library). So the chances of two or more users giving the same ratings to the same set of movies are actually quite small.
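To make the sparsity point concrete, here is a minimal Python sketch. The user ids, ratings and the `exact_matches` helper are all invented for illustration; this is not the paper's method and not real Netflix Prize data. It simply shows how quickly the set of candidate records shrinks as you add known ratings.

```python
# Toy data, invented for illustration: user id -> {movie title: star rating}.
# Each user has rated only a few of the many possible movies (sparse records).
ratings = {
    "u1": {"The Bourne Ultimatum": 5, "Minority Report": 4, "Hitch": 1},
    "u2": {"The Bourne Ultimatum": 5, "Hitch": 4, "Along Came Polly": 3},
    "u3": {"The Bourne Ultimatum": 5, "Minority Report": 4, "Delicatessen": 5},
    "u4": {"Minority Report": 2, "Music and Lyrics": 4},
}


def exact_matches(records, known):
    """Return sorted ids of users whose record contains every (movie, rating) in `known`."""
    return sorted(
        user
        for user, rated in records.items()
        if all(rated.get(movie) == stars for movie, stars in known.items())
    )


one_fact = {"The Bourne Ultimatum": 5}
two_facts = {"The Bourne Ultimatum": 5, "Minority Report": 4}
three_facts = {"The Bourne Ultimatum": 5, "Minority Report": 4, "Delicatessen": 5}

# Each extra known rating can only narrow the candidate set.
for known in (one_fact, two_facts, three_facts):
    print(len(known), "known rating(s) ->", exact_matches(ratings, known))
```

With four users the effect is trivial, of course, but the same narrowing applies when each record is a handful of ratings scattered across a library of thousands of titles.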
A lot of the detail in the paper relates to the fact that the information you start with doesn’t even have to be 100% accurate – for example, even though I know that you loved Minority Report, I may not know if you gave it 4 or 5 stars on Netflix. The algorithms are surprisingly robust in this environment. If you know just a little bit more (specifically, when the ratings were entered, to within some tolerance of accuracy), it becomes even easier to locate a record based upon some starting data, especially if the person is interested in less popular movies (the inclusion of Delicatessen in the list above would dramatically increase the chance of a match).
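Here is a rough Python sketch of what matching on approximate information might look like. To be clear, this is my own simplified illustration and not the algorithm from the paper: the toy data, the one-star `tolerance`, the helper names and the rarity weight of 1 / log(1 + number of raters) are all assumptions chosen to keep the example short. It ignores rating dates entirely.

```python
import math

# Toy data, invented for illustration: user id -> {movie title: star rating}.
ratings = {
    "u1": {"The Bourne Ultimatum": 5, "Minority Report": 4, "Hitch": 1},
    "u2": {"The Bourne Ultimatum": 4, "Minority Report": 5, "Along Came Polly": 2},
    "u3": {"The Bourne Ultimatum": 5, "Minority Report": 5, "Delicatessen": 5},
    "u4": {"The Bourne Ultimatum": 3, "Music and Lyrics": 4},
}


def support(records, movie):
    """Number of users who have rated `movie` at all."""
    return sum(1 for rated in records.values() if movie in rated)


def score(records, rated, known, tolerance=1):
    """Sum a rarity weight for each guessed rating that is within `tolerance` stars."""
    total = 0.0
    for movie, stars in known.items():
        if movie in rated and abs(rated[movie] - stars) <= tolerance:
            # Assumed weighting: movies rated by fewer users count for more.
            total += 1.0 / math.log(1 + support(records, movie))
    return total


def rank(records, known, tolerance=1):
    """Return (score, user) pairs, best score first, ties broken by user id."""
    scored = [(score(records, rated, known, tolerance), user)
              for user, rated in records.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Approximate guesses: I know you liked these, but not the exact star counts.
common_only = {"The Bourne Ultimatum": 4, "Minority Report": 4}
with_rare = {**common_only, "Delicatessen": 4}

print(rank(ratings, common_only))
print(rank(ratings, with_rare))
```

On this toy data the two popular movies leave u1, u2 and u3 tied at the top, while adding the rarely rated Delicatessen puts u3 clearly ahead, which is the intuition behind less popular movies being so revealing.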
Why is this interesting? Well, when Netflix released this data they confidently said that it had been shorn of all personally identifiable information – the implication being that you couldn’t link a specific record to an individual. But this paper gives the lie to that: it drives, if not a truck, then certainly a decent-sized minivan through Netflix’s claims.
As the AOL Search data debacle in 2006 showed, simply removing identifiers from this kind of data is not enough to render it properly anonymous. And if you’re thinking that Netflix preference data is hardly sensitive data, then remember that media consumption has a long and inglorious history of being the basis for discrimination and persecution in society – and there are certain idiot politicians who even today still seem to think this kind of stuff is ok.
[Update, 10/3/08: One of the authors of the paper, Arvind Narayanan, has very kindly commented on this post, and points me to a blog that he has started to discuss this topic and its impact, which you can find at http://33bits.org. The blog has already helped me to understand eccentricity better, so go take a look.]