Dogfood


October 29, 2008

Whence the universal tag?

With another E-metrics Summit over (sans me, sadly), it’s clear that interest in web analytics and online measurement remains high, even (or especially) in these troubled times. But as the technology sets for online advertising and web analytics continue to merge and overlap, one urgent question remains unanswered: what are we going to do about data collection?

You only have to talk to any medium-sized web agency, or marketing manager for an e-commerce site, to understand that online behavior data collection is deeply broken right now – ad servers and web analytics products still collect their data entirely separately, leading to misery for webmasters as they struggle to maintain two (or three, or four…) tracking tags on each page of a site, and misery for analysts as they struggle to reconcile differing numbers from different systems. If you throw ad tags (that is, the snippets of code that actually cause ads to be displayed on a page, such as the AdSense code) into the mix, things become even more complicated.

How we as an industry go about fixing this problem depends on who we care about more: webmasters (I use that term loosely to refer to the gaggle of unfortunates who are charged with maintaining and updating a website), or marketers; or whether we decide that we care about them both. Here are some ideas (none of them new) about how to approach the problem, together with “feel the love” rankings for marketers and webmasters. Feel free to add your own ideas in the comments.

 

Idea 1: Merge the back-end data

Marketers: ♥♥♥ (out of 5)
Webmasters: ♥ (out of 5)

It’s not uncommon for a site to be using multiple tags from the same vendor, such as Google (which has separate tags for AdWords, AdSense, GA and DFA/DFP) or our good selves (adCenter, adCenter Analytics, Atlas and others). If this is the case, then the vendor has the opportunity – some would say the responsibility – to join together the data it collects at the back-end to provide a more joined-up and consistent set of reports for marketers.

Google has just taken another decent step in this direction with its inclusion of AdSense clickthrough and CPC data in GA reports. I don’t actually have detail on exactly how they’re doing this, but my best guess is that they’re merging the click data from AdSense with the impression data from Analytics.

You can generalize this approach to a situation where two or more vendors might group together to pool the data they have to provide a consolidated set of reports. This is (sort of) the approach used by Omniture and DoubleClick, where you can use an Omniture tag in place of DoubleClick spotlight tags for conversion tracking.

The crucial pre-requisite is that the different sources of data need to be mergeable; and that means a couple of things. First, the visitor ID needs to be shared between the data sets. This is fairly easy for a single vendor to achieve, but trickier for vendors working together.

The other implication is that it needs to be possible to de-duplicate individual transactions. If you have two tags on your page, one for a web analytics product, and one for an ad server’s conversion tracking, it can actually be pretty challenging to ensure that when a user requests a page, you don’t count the page impression twice. Either you ignore one source of data completely (which is sort of what Google seems to do with AdSense/GA), or you have to employ various heuristics to decide when to throw something away – for example, if you register two identical page requests within a fraction of a second of one another, you can be confident (though not certain) that they are duplicates.
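To make that heuristic concrete, here’s a minimal sketch (in Python, purely for illustration – the record format and the 500ms window are my own assumptions, not any vendor’s actual logic):

```python
from datetime import datetime, timedelta

# Two requests for the same URL from the same visitor within this window
# are assumed to be one page view reported by two different tags.
DUPLICATE_WINDOW = timedelta(milliseconds=500)

def dedupe(requests):
    """Drop page requests that look like duplicate reports of a single
    page view: same visitor, same URL, within the duplicate window."""
    kept = []
    last_kept = {}  # (visitor_id, url) -> timestamp of last kept record
    for visitor_id, url, ts in sorted(requests, key=lambda r: r[2]):
        key = (visitor_id, url)
        prev = last_kept.get(key)
        if prev is not None and ts - prev <= DUPLICATE_WINDOW:
            continue  # probably the same page view, seen by a second tag
        last_kept[key] = ts
        kept.append((visitor_id, url, ts))
    return kept

reqs = [
    ("v1", "/home", datetime(2008, 10, 29, 12, 0, 0, 100_000)),
    ("v1", "/home", datetime(2008, 10, 29, 12, 0, 0, 400_000)),  # duplicate
]
print(len(dedupe(reqs)))  # 1
```

Note that this is confident guesswork, not certainty – two genuinely separate page views 300ms apart would also be collapsed, which is exactly the trade-off described above.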

As for the customers? The marketer gets a decent benefit from this approach; they’ll see merged data, though the quality of the data may still leave something to be desired (hidden ‘seams’ where the data has been stitched together can trip up the unwary analyst). The webmaster, on the other hand, sees little benefit – they still have to maintain both tags, especially if each tag has its own unique capability. So this solution is really more of a stepping-stone to a more complete approach than a destination in its own right.

 

Idea 2: A “tag management” system

Marketers: ♥♥
Webmasters: ♥♥♥♥

Even if a single vendor or pair of vendors can join forces to combine the data from a couple of tags, most sites are still going to be using multiple tags from multiple vendors, some of whom (by their very nature) are never likely to co-operate on data. Given this state of affairs, one obvious approach is to provide some more technology to the webmaster to help them manage the plethora of tags.

Such a system would be, essentially, a content management system for tagging, enabling the webmaster to define which tags from which vendors should appear in which places on their site. Such a system could come from a vendor, or a sufficiently motivated site owner could create it themselves.

A webmaster using such a system would see a dramatic reduction in the overhead associated with managing multiple tags (once they’d gone through the pain of implementing the tag management system’s tags, that is). Furthermore, a well-implemented tag management system would make it easier for the webmaster to introduce (and remove) tags, reducing some of the friction associated with moving from one analytics or ad serving vendor to another.

The big sticking point with a system like this, however, is custom tagging. If you actually speak to a site owner about the pain of tag management, having to insert a JS file into the page is only a small part of the task – and that step is made much easier by modern content management systems. No, it’s the definition of custom variables, and integrating them with the data coming from the site, that is the challenging and time-consuming step. Publishers (who are implementing ad server tag code to host ads on their site) also have the overhead of defining page groups for their content, which is a major task compared to the actual tagging itself.

So in order for such a system to be really useful, it would need to provide a standardized interface between the data coming from the site and the tags – essentially, its own custom variable schema with a defined set of mappings to Omniture, GA, Atlas AdManager, etc.
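To give a flavor of what that mapping layer might look like, here’s a sketch – every schema field and vendor parameter name below is invented for illustration, not any vendor’s real API:

```python
# Illustrative only: a canonical variable schema for the site, mapped
# onto each vendor's custom-variable conventions.
VENDOR_MAPPINGS = {
    "vendor_a": {   # e.g. a web analytics product
        "page_group": "prop1",
        "product_sku": "prop2",
        "order_value": "purchase_event",
    },
    "vendor_b": {   # e.g. an ad server's conversion tag
        "page_group": "u1",
        "product_sku": "u2",
        "order_value": "rev",
    },
}

def to_vendor_params(vendor, page_data):
    """Translate the site's canonical variables into one vendor's own
    parameter names, dropping anything that vendor has no slot for."""
    mapping = VENDOR_MAPPINGS[vendor]
    return {mapping[k]: v for k, v in page_data.items() if k in mapping}

print(to_vendor_params("vendor_a",
                       {"page_group": "electronics", "order_value": 99.99}))
# -> {'prop1': 'electronics', 'purchase_event': 99.99}
```

The point is that the site defines its data once, in its own terms, and the tag management system worries about each vendor’s dialect.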

A company called Positive Feedback (based in London, which means they must be geniuses) has taken a stab at providing a solution here with their TagMan offering. And Tealium is looking to address the custom variables problem with their solution, TrackEvent.

 

Idea 3: A universal tag

Marketers: ♥♥♥
Webmasters: ♥♥♥

Ah, the universal tag. The holy grail of web analytics (at least, according to some). The idea here is that a group of vendors (perhaps under the auspices of the Web Analytics Association) come together to create a universal piece of tag code that can capture data for any of their services. The upshot is that the webmaster only has to place this single tag on their site, and then configure the tag for whichever vendor solutions they’re using. A side benefit of the “universal tag” is that it can direct beacon requests to the customer’s own data collection systems as well as a third party’s – avoiding the problem of data ownership.
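To illustrate the idea, here’s a rough sketch (in Python, for brevity – the real thing would live in the page’s JS) of how one captured payload might fan out to several collection endpoints, the customer’s own included. The endpoints and parameter names are all made up:

```python
from urllib.parse import urlencode

# Hypothetical configuration: each enabled vendor, plus the site's own
# first-party collector, receives the same payload as a beacon request.
BEACON_ENDPOINTS = {
    "vendor_a": "https://collect.vendor-a.example/b",
    "vendor_b": "https://stats.vendor-b.example/pixel",
    "first_party": "https://metrics.mysite.example/collect",
}

def beacon_urls(enabled, payload):
    """One captured payload, one beacon per enabled collection endpoint."""
    query = urlencode(payload)
    return [f"{BEACON_ENDPOINTS[name]}?{query}" for name in enabled]

print(beacon_urls(["vendor_a", "first_party"],
                  {"vid": "abc123", "url": "/checkout", "event": "pageview"}))
```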

The key challenge with this approach is that, despite warm words on the topic from web analytics vendors, there’s little real incentive to put a bunch of effort into doing something like this. All the vendors get is a potentially more complicated implementation, and more client mobility. What we may find happening instead is vendors supporting other vendors’ custom variables and event calls – so vendor A could come in and say “simply swap out your JS file reference (or add ours), and we’ll start capturing the same data you’re already getting”. It would be interesting to see if any vendors complained that their IP was being infringed by this approach.

A variant of this idea is where a vendor creates a tag architecture and then works with partners to encourage them to abandon or supplement their own data collection with the vendor’s – thus making the vendor’s tag the universal tag. This is Omniture’s approach with Genesis. This approach strikes me as more likely to succeed, since the incentives work differently; it’s in Omniture’s interest to push continued Genesis tracking adoption.

The asymmetry of Omniture’s approach also makes a more general point about the universal tag idea – which is that the vendor who already has the most well-established tagging relationship with a client will likely be able to leverage that to get other systems’ data collection needs met within the framework of their tag. This is likely to be the web analytics vendor, so we should look to those organizations (rather than, say, ad serving companies) to lead on a solution like this.

 

Idea 4: A universal data collection service

Marketers: ♥♥♥♥
Webmasters: ♥♥♥

If you continue the thought process around universal tagging, and vendors looking to provide more and more help to customers with data collection, then you end up with the idea of a vendor providing a fully-fledged data collection service.

I’ve blogged about this idea before, as it happens. The core idea here is that some kindly organization (which has access to a large pool of cheap processing and data storage) takes it upon itself to offer a data collection service that is so flexible, reliable and cheap that many other vendors abandon their own data collection and use the common service.

Part of the service is a “universal tag” which can be configured to capture the data that each analytics/ad serving service needs. But the difference is that the universal tag doesn’t try to generate beacon calls in the correct formats for the individual services, or even send that data to those services’ data collection servers – it just gathers the data to a centralized repository and the other services access this data programmatically.
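Here’s a toy sketch of the pull model this implies – one canonical event log that consuming services query, rather than per-vendor beacons. Everything in it is invented for illustration:

```python
import time

EVENT_LOG = []  # stand-in for the shared, centralized repository

def collect(visitor_id, url, event, custom=None):
    """What the universal tag's beacon handler would do: append one
    canonical event, with no vendor-specific formatting at all."""
    EVENT_LOG.append({"ts": time.time(), "vid": visitor_id,
                      "url": url, "event": event, "custom": custom or {}})

def query(predicate):
    """How a consuming service pulls just the events it cares about."""
    return [e for e in EVENT_LOG if predicate(e)]

collect("abc123", "/checkout/thanks", "purchase", {"order_value": 99.99})
# A conversion-tracking service only wants purchase events:
print(query(lambda e: e["event"] == "purchase"))
```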

This approach combines some of the benefits of the two preceding ideas – for webmasters, the tag management process is radically simplified because one tag can do multiple things. Marketers like it because it would finally deliver numbers which match up. However, the approach wouldn’t work for certain things, such as ad serving tags – unless that system was merged together with the data collection service.

Of course, another obstacle to this kind of approach taking root is vendors’ reluctance to entrust their (or their customers’) data to a third party. This reluctance is liable to increase in proportion to the size of the vendor. So whilst Omniture would likely balk at using a data collection service from Google or Microsoft in place of its own, a small vendor (such as our plucky little friends at Woopra) may find such a service invaluable in allowing them to focus on analytics rather than data collection.

 

So those are my ideas – what are yours? And which one(s) of the above ideas do you think are most likely to gain traction?


October 13, 2008

Clouds, Impressions and Pork Bellies

With Microsoft’s (sort of) biennial PDC on the horizon, my mind (and the minds of many of my colleagues) turns to our cloud computing efforts, which will have their coming-out party in Los Angeles at the end of the month. As with anybody else here at Microsoft, there’s little specific that I can say about these efforts before the conference; all I can say is that we’re working on ways of making it much, much easier to develop, deploy and pay for apps in the cloud.

One of the things I’ve been thinking about in the context of cloud computing, however, is how it may (or will) change the way that IT infrastructure (by which I mean processing power, storage and bandwidth) is bought and paid for. Most cloud or utility-computing offerings in the marketplace today are priced on a consumption basis (that is, you pay for what you use, and no more) – indeed, some people hold that you’re not really doing cloud computing if you’re not charging for it on this basis.

This model for charging represents a significant transfer of risk from the customer to the vendor: whereas an enterprise might today purchase so many servers, and so many OS, database and other software licenses to support a particular service, knowing that some will not be used, now it is up to the cloud vendor to predict demand for their services and purchase the appropriate hardware and software.

But I think that cloud computing may yet enable its customers to further reduce the risk they face, by enabling the trading of futures positions in compute power and storage. And in this respect, cloud computing shares some very interesting characteristics with the online advertising business (Ah, you say, now I understand where he’s going with this). Please note, by the way, that nothing that follows is intended to indicate any specific Microsoft plan in this area. This is just me riffing.

Clouds as commodities

So, imagine you’re running a news and current affairs website. And further imagine that, oh, there’s an election coming up later in the year which you’re confident will generate a big spike in traffic. If you’re running your site on an on-demand cloud infrastructure, then you’ll be confident that your site will scale elegantly if you get traffic spikes – but at what cost? You may be able to buy compute capacity at (say) $0.10 per processor-hour (or whatever measure of compute capacity emerges) on a spot basis; but if you were to reserve this capacity on a forward basis (i.e. a few months in advance), you could pay only $0.05 per processor-hour.

But what if the spike never materializes? You could just release that capacity back to the cloud vendor and get some number of cents on the dollar for it. But an alternative is that you could in theory sell that pre-reserved capacity to someone whose need is greater than yours, potentially at some profit.
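To put numbers on it, a toy calculation using the made-up prices above:

```python
SPOT, FORWARD = 0.10, 0.05   # $/processor-hour, the made-up prices above

def cost(hours_needed, hours_reserved):
    """Total spend when capacity was reserved forward and any excess
    demand is bought at the spot price."""
    spot_hours = max(0, hours_needed - hours_reserved)
    return hours_reserved * FORWARD + spot_hours * SPOT

print(cost(100_000, 0))        # 10000.0 - buying everything at spot
print(cost(100_000, 100_000))  # 5000.0  - reserving ahead halves the bill
# And if the spike never comes, reselling the reserved capacity at, say,
# $0.07/hour when spot is $0.10 recovers more than the $0.05 paid.
```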

Now consider the same business from an advertising perspective. Anticipating the spike in traffic, you want to sell your anticipated inventory for the best price – which means striking a number of ‘guaranteed’ deals, where you commit to delivering the impressions during the time period (and, given the nature of advertising during an election campaign, you really don’t want to be delivering make-good ads after November 4). So to hedge the risk of not meeting your projected impression goals, you buy a block of inventory that you can use, if necessary, to fulfill your obligations.

As the election looms, however, you discover that your traffic is exceeding your expectations – so you don’t need the inventory hedge. You could choose to take a little revenue from this inventory by serving discretionary ads into it, or you could sell it on to someone else whose inventory prediction was not so on-the-money and who needs inventory to fulfill a guaranteed deal. You could potentially get a better rate doing this than by serving remnant ads into the inventory yourself.

What these two examples have in common is that the publisher is taking a forward position on a commodity in order to mitigate risk on the supply or demand side of their business. Of course, this kind of hedging is nothing new – the Chicago Mercantile Exchange has been enabling it for years, for commodities as diverse as pork bellies, oil and coffee. And since energy futures are such an important part of that market, it shouldn’t be a far-fetched idea that compute power (which many have described as moving to be a utility, like electricity) could move to being traded in the same way.

More options

The model even lends itself to the idea of options trading – in both the above examples, the publisher could pay for the option to purchase compute capacity or advertising inventory at a particular price, rather than reserving the capacity or inventory itself; and those options could then be sold on later (or exercised, or left to expire, of course).
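The payoff structure here is just that of a plain call option – a toy example, again with made-up prices:

```python
def call_payoff(spot, strike, premium):
    """Per-unit result of holding an option to buy at `strike` after
    paying `premium` up front: exercise only if spot exceeds strike."""
    return max(spot - strike, 0.0) - premium

print(call_payoff(0.10, 0.05, 0.01))  # 0.04 - the spike came: exercise
print(call_payoff(0.04, 0.05, 0.01))  # -0.01 - no spike: let it expire
```

The attraction is that the publisher caps their downside at the premium, rather than tying up capital in reserved capacity or inventory they may never use.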

The next logical step from there is that folks who have nothing to do with online advertising or cloud computing could start buying and selling these commodities and securities with a view to making a profit on price changes. The economy’s current woes notwithstanding, I can see this happening in the next 5 – 10 years.

To make either scenario a reality, however, there need to be functioning exchanges for the buying and selling of the commodities. This is close to becoming a reality for online ad inventory – the likes of Right Media Exchange, DoubleClick Exchange and our own AdECN are close to providing open trading platforms for advertisers, publishers and networks to buy and sell inventory – though there is no talk of futures trading in these environments right now.

It’s significantly further off for cloud computing capacity. For a start, the industry lacks standards for measurement and billing – will it be the processor-hour, or the Gbyte-day, or the Gbit-month, or some combination of the above? Secondly, unlike the online ad market, where a given ad will run on most publisher sites (with the exception of rich media ads), there is illiquidity between different technology platforms in cloud computing – so an app written for Amazon Web Services will not run unmodified on Salesforce.com’s cloud platform, or Google’s. This may never change, in which case any kind of market or exchange for compute capacity will be limited to a single vendor’s system, greatly limiting the effectiveness of such an approach. But interesting to think about, nonetheless.


October 09, 2008

Love numbers? Obsessed by the election? If so…

…you’ll love fivethirtyeight.com, which is one of the best blogs to emerge about the 2008 presidential election. The name comes from the number of electoral votes up for grabs in the election, and the site takes the daily feeds of national and statewide polls and synthesizes them to create a running set of predictions about the likely outcome of the election, which (as of today, 10/9/08) looks like this:

[Chart: FiveThirtyEight’s current election predictions]

The site’s founder, Nate Silver, has been careful to try to build a model which takes into account the historical accuracy of the various polls that he draws from, as well as a number of other factors such as state demographics, to provide a view which is as likely to be accurate as anything you’ll encounter from the likes of Gallup or CNN.

I love the site because it’s a great demonstration of what can be done with a computer, some publicly available data, and a commitment to citizen journalism. And it seems I’m not the only one – having only started seven months ago, the site pulled in nearly 700,000 visits yesterday, and has earned Nate a certain degree of fame, culminating with an appearance this week on the Colbert Report.

And FiveThirtyEight.com has even garnered the praise of the folks at JunkCharts.com for the clarity of its charts – an achievement perhaps even more impressive than appearing on Comedy Central.


October 08, 2008

Online Advertising Business 101, Part V – Rich Media

Welcome back to my Online Advertising Business 101 series. In my previous posts in this series, I’ve painted a somewhat simplified picture of the major players in the online advertising market, and how they interact; and we’ve looked at some of the technology that these folks use, specifically ad serving technology for advertisers and publishers.

This post looks at a significant, but hard-to-define, aspect of online advertising: “rich media”. I’ll attempt to explain how rich media fits into the overall picture, but also why it kind of doesn’t. You see, whilst some of the more mainstream parts of the online advertising world are starting to settle down a little, with familiar patterns of interaction emerging and some fairly well-established standards, out at the bleeding edge of rich media things are still evolving quickly. So the rich media ad industry operates as a kind of overlay on mainstream online advertising, with its own set of specialized vendors, and its own set of rules.

 

What is Rich Media?

“Rich Media” is an umbrella term for ads which are more interactive, or use sound and/or video to get their point across. The following are all examples of rich media ads:

  • An expandable banner with embedded video
  • An ‘advergame’ (a mini-game embedded inside an ad unit)
  • A video ad served before, after or during a video clip
  • A ‘page takeover’ ad (an ad which covers the page it’s on, or interacts with that page or the other ads on it in some way)

Another more demotic way of defining rich media is “fancy Flash and video ads”. That sums it up to a reasonable degree of accuracy – there are funky ads built in Flash, and then there are video ads, either delivered through their own dedicated Flash units (in which case, they look a lot like regular Flash ads to ad servers and publishers), or embedded into the stream of a video clip. And it’s also reasonably accurate to say that in-stream video is the more challenging to get done, for a variety of reasons (which I list below).

Although rich media currently makes up only a fraction of the ads delivered on the Internet, these ads are the best-paying for publishers, and their share of online display ad spend is set to grow to around 50% in 2009, according to Jupiter, with video being the biggest contributor to this growth. In fact, video is often broken out in such predictions as its own media category – so this post could have been entitled “Rich Media and Video”.

Rich media’s very attraction – the fact that it’s something other than bog-standard banners, buttons & links – is also its main challenge. Each fancy new rich media idea from an agency brings with it its own set of implementation challenges. So a whole industry has grown up around bridging the gap between publishers – who want a simple life, by and large – and advertisers & agencies who want to push back the boundaries of what is possible with online advertising.

 

Rich Media Vendors

Because of the specialized nature of rich media advertising, and the fact that there are few to no standards in the rich media world, a collection of specialist companies – known as Rich Media Vendors (RMVs) – has sprung up to help advertisers & agencies get these kinds of ads onto publisher sites.

An (incomplete) list of the more common rich media vendors would include the likes of Eyeblaster and VideoEgg, both of which crop up again below.

These vendors provide a combination of technology and services. Much of the technology side comprises proprietary ad templates and delivery and measurement technology, whilst the services focus on authoring ads and help with the ad trafficking.

Because of the vertically-integrated nature of the RMV’s services, they need to interface on the one hand with advertisers’ creative agencies, and on the other hand with the publishers (and, increasingly, networks) that will be running the ads. So a good RMV will be on the “approved” list of lots of agencies and publishers/networks.

 

What do they do?

The challenges of getting a rich-media campaign up and running are as follows:

  • Creating the actual ads themselves (the creative)
  • Trafficking the ads (getting them actually placed onto publisher sites)
  • Managing and serving the creatives themselves, and measuring delivery and response

Creating the ads

Creating a good rich media ad requires a lot of skill – even a simple expanding ad requires knowledge of Flash and JavaScript, together with decent graphic design & copywriting skills. Creating video ads is a whole different ball of wax, especially if you’re aiming for something more compelling than just repurposing your 30-second TV ad for the web. The small screen-sizes available and things like bandwidth considerations further complicate matters.

The RMV will work with the advertiser’s agency to code up the ads they need – the agency may provide a relatively complete ad that has already been coded using the RMV’s template, or the RMV will take some raw creative – say, a video – and code it in their template.

Trafficking the ads

This is the really fun part. The complexity of trafficking a rich media ad is dependent on the kind of ad it is. For ads which are more or less “fancy banners” (i.e. IAB standard-sized units with interactive capabilities or video), the ad can often be served through the advertiser’s and publisher’s respective servers just like a static ad, making trafficking relatively simple. As these kinds of ads grow more sophisticated, however – expanding over the page, for example, or needing to interact with the page (such as the recent Apple ads on NYTimes.com, which talk to one another) – special work needs to be done by the publisher to insert the ads onto the page.

This picture becomes even more complicated when you consider in-stream video ads. The burgeoning crop of online video sites (YouTube, MSN Video, Hulu, YuMe, Vimeo, Veoh, Joost, Google Video, DailyMotion, Blinkx and more than 200 others) all have slightly different players (though many are Flash-based) which use an array of different video sizes and encoding formats. So anyone looking to serve video ads has to be able to transcode their video into the various required formats and work with the video sites to insert these ads into the video stream.

Furthermore, many online video sites are now enabling new kinds of ads within their players, such as overlay ads; our own adCenter Labs is even looking at video hyperlinks – making parts of the video itself clickable. RMVs have to keep up with all these developments.

Unsurprisingly, no standards exist for the way in which these richer kinds of ads are implemented, so publishers that want to host rich media or video ads end up working directly with a handful of Rich Media vendors (here’s the list that AOL supports, for example) to define a set of formats that they will support. The RMVs end up acting as a kind of gateway to the publisher, ensuring that some kind of consistency and reliability is maintained.

Delivering & tracking the ads

Many rich media vendors (for example, Eyeblaster) will also take a hand in actually delivering the ad, providing their own campaign management and ad serving technology. In situations where this is the case, the advertiser can either use the RMV’s ad management system in parallel with any other ad server they already use (such as DFA, or Atlas Enterprise), or they can use their “primary” ad server to manage the campaign and then hand off the ad calls to the rich media ad server when necessary, as in the diagram below:

[Diagram: the advertiser’s primary ad server handing off ad calls to the rich media vendor’s ad server]

A third scenario is that the advertiser adopts the RMV’s ad server as their principal third-party ad serving solution – this, unsurprisingly, is the tack recommended by the RMVs themselves.
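In the hand-off scenario, the decision in the diagram above boils down to something quite simple at the primary ad server. Here’s a minimal sketch, with all field names invented for illustration:

```python
# Sketch only: the response logic at the primary ad server once it has
# already selected a winning creative for the placement.
def ad_response(creative):
    if creative["type"] == "rich_media":
        # Hand off: point the browser at the RMV's ad server, which
        # delivers the rich unit and records its own delivery metrics.
        return {"redirect": creative["rmv_tag_url"]}
    # A standard display ad is served directly.
    return {"markup": creative["html"]}

print(ad_response({"type": "rich_media",
                   "rmv_tag_url": "https://ads.rmv.example/tag?id=123"}))
```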

One other key reason that RMVs provide their own technology is that this makes it easier for them to offer detailed and relevant metrics about ad delivery and interaction. Because rich media is designed to be interacted with without leaving the ad (anything from a simple whack-a-mole-type interaction to something like spec’ing a new car), measuring rich media (and therefore charging for it) is a very different proposition to measuring static display or text ads, where the click (and, to a lesser extent, subsequent conversion behavior) is king.

You’ll be getting a little bored by now of hearing me say that there are no standards – strictly speaking, standards for rich media measurement do exist, but they are pretty thin on the ground. Impressions are a fairly useless metric if the unit in question is a ‘peel-back’ overlay – the kind where you have to mouse-over to see any of the ad copy at all – or where the ad is a 15-second pre-roll. Likewise, there’s a world of difference between an ad where a single click signs the user up for an e-mail newsletter and an advergame where the user may click hundreds of times on the ad in an effort to whack that pesky mole.

So another function that RMVs (and the creative agencies they work with) perform is to demonstrate and quantify the value of the work they have done to their clients – coming up with engagement metrics that can be defended in front of the CMO.
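As a sketch of what that reporting might boil down to – raw interaction events reduced to a handful of defensible numbers – here’s an illustration, with all event names invented:

```python
def engagement_metrics(events, impressions):
    """Boil raw interaction events down to reportable numbers.
    `events` holds dicts like {"type": "expand"}, {"type": "video_complete"}
    or {"type": "dwell", "seconds": 4.2} - all hypothetical names."""
    interactions = [e for e in events if e["type"] != "dwell"]
    dwell = [e["seconds"] for e in events if e["type"] == "dwell"]
    return {
        "interaction_rate": len(interactions) / impressions,
        "avg_dwell_seconds": sum(dwell) / len(dwell) if dwell else 0.0,
        "video_completion_rate":
            sum(e["type"] == "video_complete" for e in events) / impressions,
    }

print(engagement_metrics(
    [{"type": "expand"}, {"type": "dwell", "seconds": 4.2},
     {"type": "video_complete"}],
    impressions=100))
```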

 

Where next?

The vertically-integrated nature of the rich media ‘industry’ is sure to change over the next few years, as the industry looks to grow out past its home base of deep-pocketed advertisers and large, sophisticated publishers. Smaller advertisers will want to be able to create richer ads, and to be able to serve ads into richer environments. Similarly, smaller publishers would like to be able to benefit from the higher eCPMs of rich media ads. Larger advertisers, on the other hand, would like to be able to serve their fancy rich media ads across a broader range of sites.

In order to enable these scenarios, advertisers need access to systems that enable them to create or upload one set of creative and have free choice of a wide range of sites to advertise on – without having to implement the creative separately on each site. And publishers need to be able to create standardized rich media ad units which can form the supply side of this larger, more liquid market.

There are some moves afoot to bring about this kind of world. Google AdSense now supports video ads (enabling publishers to have video ads appear on their site), and AdWords text ads will appear as overlay ads on YouTube videos, enabling AdWords advertisers to appear in this context. We’re doing the same with our Silverlight Streaming for Windows Live service. And some RMVs are turning their publisher base into specialized ad networks – VideoEgg’s network being a good example – whilst traditional ad networks such as Advertising.com are branching out into video.

Industry growth won’t get past a certain point, however, without more agreement on standards. The IAB is pretty active in this space (though, bless it, it does move rather slowly), having agreed a set of rich media ad guidelines this year, and there is also a useful set of measurement guidelines available (with some nice diagrams on tracking & serving). But a set of widely-agreed measurement standards seems a little further off for the time being.

 

Online Advertising Business 101 - Index of all posts


October 01, 2008

‘Anonymous’ Netflix Prize data not so anonymous after all


Does entropic de-anonymization of sparse microdata set your pulse racing? If so, you’re gonna love this paper [PDF] by Arvind Narayanan and Vitaly Shmatikov of the University of Texas at Austin. Even if your stats math is as rusty as mine, however, the paper makes fascinating reading - and is surprisingly readable, if you skip over the algorithm-heavy bit in the middle.

For those of you who don’t have time to read an academic paper, here’s a summary. The paper presents a method for taking an ‘anonymized’ data set – for example, the Netflix Prize data – and locating the record for a user about whom you have a limited set of approximate data. If, for example, I know that you’re a fan of The Bourne Ultimatum, Minority Report and Delicatessen but that you absolutely hated Hitch, Music and Lyrics and Along Came Polly (can’t blame you for the last one, by the way), then there’s about an 80% chance I can find your entry in the Netflix Prize dataset (assuming it’s there – it’s only a 10% sample of Netflix’s total ratings data). And I can do this even if I don’t know anything else about you.

The reason this is possible is that the data is so-called ‘sparse’ data – each record (which represents a Netflix user) has many, many fields (each field represents a particular movie), of which only a tiny fraction are non-null (because even the most prolific Netflix user has rated only a tiny fraction of Netflix’s total library). So the chances of two or more users giving the same ratings to the same set of movies are actually quite small.

A lot of the detail in the paper relates to the fact that the information you start with doesn’t even have to be 100% accurate – for example, even though I know that you loved Minority Report, I may not know if you gave it 4 or 5 stars on Netflix. The algorithms are surprisingly robust in this environment. If you know just a little bit more (specifically, when the ratings were entered, to within some tolerance of accuracy), it becomes even easier to locate a record based upon some starting data. Especially if the person is interested in less popular movies (the inclusion of Delicatessen in the list above would dramatically increase the chance of a match).
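Here’s a much-simplified sketch of the general idea – emphatically not the authors’ actual algorithm, which also weights rare movies more heavily and requires the best match to stand out from the runner-up (the ‘eccentricity’ measure) before declaring success:

```python
def match_score(aux, record, rating_tolerance=1):
    """Fraction of the attacker's known (movie, rating) pairs that are
    consistent with a candidate record. Both arguments map movie title
    -> star rating; `record` is sparse (most movies absent)."""
    hits = sum(
        1 for movie, rating in aux.items()
        if movie in record and abs(record[movie] - rating) <= rating_tolerance
    )
    return hits / len(aux)

def best_match(aux, dataset):
    """The record in the released dataset that best explains the
    auxiliary information."""
    return max(dataset, key=lambda record: match_score(aux, record))

# e.g. aux = {"The Bourne Ultimatum": 5, "Delicatessen": 5, "Hitch": 1}
# best_match(aux, netflix_records) -> the most consistent user's record
```

Even in this crude form you can see why sparsity matters: a handful of (movie, approximate rating) pairs is enough to make one record score far higher than any other.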

Why is this interesting? Well, when Netflix released this data they confidently said that it had been shorn of all personally identifiable information – the implication being that you couldn’t link a specific record to an individual. But this paper gives the lie to that – it drives, if not a truck, then certainly a decent-sized minivan, through Netflix’s claims.

As the AOL Search data debacle in 2006 showed, simply removing identifiers from this kind of data is not enough to render it properly anonymous. And if you’re thinking that Netflix preference data is hardly sensitive data, then remember that media consumption has a long and inglorious history of being the basis for discrimination and persecution in society – and there are certain idiot politicians who even today still seem to think this kind of stuff is ok.

[Update, 10/3/08: One of the authors of the paper, Arvind Narayanan, has very kindly commented on this post, and points me to a blog that he has started to discuss this topic and its impact, which you can find at http://33bits.org. The blog has already helped me to understand eccentricity better, so go take a look.]

