It’s been a busy couple of years here at Microsoft. For the dwindling few of you who are keeping track, at the beginning of 2012 I took a new job, running our “Big Data” platform for Microsoft’s Online Services Division (OSD) – the division that owns the Bing search engine and MSN, as well as our global advertising business.
As you might expect, Bing and MSN throw off quite a lot of data – around 70 terabytes a day (that's over 25 petabytes a year, to save you the trouble of calculating it yourself). To process, store and analyze this data, we rely on a distributed data infrastructure spread across tens of thousands of servers. It's a pretty serious undertaking, but at its heart, the work we do is just a very large-scale version of what I've been doing for the past thirteen years: web analytics.
One of the things that makes my job so interesting, however, is that although many of the data problems we have to solve are familiar – defining events, providing a stable ID, sessionization, enabling analysis of non-additive measures, for example – the scale of our data (and the demands of our internal users) has meant that we have had to come up with some creative solutions, and essentially reinvent several parts of the web analytics stack.
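To make "non-additive measures" concrete: a metric like unique visitors can't simply be summed across days, because the same visitor may show up on several of them. Here's a tiny illustrative sketch in Python, with made-up visitor IDs (not real data, and not how we actually compute this at scale):

```python
# Illustration only: why unique visitors are a non-additive measure.
daily_visitors = {
    "day 1": ["alice", "bob", "carol"],
    "day 2": ["alice", "bob"],
    "day 3": ["bob", "dave"],
}

# Summing each day's unique-visitor count overcounts people who return...
sum_of_daily_uniques = sum(len(set(ids)) for ids in daily_visitors.values())  # 3 + 2 + 2 = 7

# ...whereas the true figure needs a distinct count over the raw data.
true_uniques = len(set.union(*(set(ids) for ids in daily_visitors.values())))  # 4

print(sum_of_daily_uniques, true_uniques)  # 7 4
```

Page views, by contrast, are happily additive, which is why they are so much easier to report on.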
What do you mean, the “web analytics stack”?
To users of a commercial web analytics solution, the individual technology components of that solution are not exposed very explicitly, and with good reason – most people simply don't need to know this information. It's a bit like demanding to know how the engine, transmission, brakes and suspension work if you're buying a car – the information is available, but the majority of people are more interested in how fast the car can accelerate, and whether it can stop safely.
However, as data volumes increase, and web analytics data needs to be woven ever more tightly into the other data that organizations generate and manage, more people are looking to customize their solutions, and so it's becoming more important to understand those components.
The diagram below provides a very crude illustration of the major components of a typical web analytics “stack”:
In most commercial solutions, these components are tightly woven together and often not visible (except indirectly through management tools), for a good reason: ease of implementation. At least for a “default” implementation, part of the value proposition of a commercial web analytics solution is “put our tag on your pages, and a few minutes/hours later, you’ll see numbers on the screen”.
A cunning schema
In order to achieve this promise, these tools have to make (and enforce) certain assumptions about the data, and these assumptions are embodied in the schema that they implement. Some examples of these default schema assumptions are listed below (with a small illustrative sketch after the list):
- The basic unit of interaction (transaction event) is the page view
- Page views come with certain metadata such as User Agent, Referrer, and IP address
- Page views are aggregated into sessions, and sessions into user profiles, based on some kind of identifier (usually a cookie)
- Sessions contain certain attributes such as session length, page view count and so on.
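To make those defaults concrete, here's a minimal Python sketch of the kind of schema they imply: a page view record carrying the usual metadata, grouped into sessions on a cookie ID using the conventional 30-minute inactivity timeout. The field names and the timeout value are illustrative assumptions, not a description of any particular vendor's implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List

@dataclass
class PageView:
    cookie_id: str            # the visitor identifier (typically a first-party cookie)
    timestamp: datetime
    url: str
    referrer: str = ""
    user_agent: str = ""
    ip_address: str = ""

@dataclass
class Session:
    cookie_id: str
    page_views: List[PageView] = field(default_factory=list)

    @property
    def page_view_count(self) -> int:
        return len(self.page_views)

    @property
    def length(self) -> timedelta:
        return self.page_views[-1].timestamp - self.page_views[0].timestamp

def sessionize(page_views: List[PageView],
               timeout: timedelta = timedelta(minutes=30)) -> List[Session]:
    """Group one visitor's page views into sessions, starting a new session
    whenever the gap between consecutive views exceeds the timeout."""
    sessions: List[Session] = []
    for pv in sorted(page_views, key=lambda p: p.timestamp):
        if (not sessions
                or pv.timestamp - sessions[-1].page_views[-1].timestamp > timeout):
            sessions.append(Session(cookie_id=pv.cookie_id))
        sessions[-1].page_views.append(pv)
    return sessions
```

In a real pipeline the sessionization runs per visitor (grouped on the cookie ID) over vastly more data, but the shape of the logic is the same.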
Now, none of these schema assumptions is universal, and many tools have the capability to modify and extend the schema (and associated processing rules) quite dramatically. Google Universal Analytics is a big step in this direction, for example. But the reason I'm banging on about the schema is that going significantly "off schema" (that is to say, building your own data model, where some or all of the assumptions above may not apply) is one of the main drivers for people looking to augment their web analytics solution.
Web Analytics Jenga
The other major reason to build a custom web analytics solution is to swap out one (or more) of the components of the “stack” that I described above to achieve improved performance, flexibility, or integration with another system. Some scenarios in which this might be done are as follows:
- You want to use your own instrumentation/data collection technologies, and then load the data into a web analytics tool for processing & analysis (see the sketch after this list)
- You want to expose data from your web analytics system in another analysis tool
- You want to include significant amounts of other data in the processing tier (most web analytics tools allow you to join in external data, but only in relatively simple scenarios)
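As a deliberately over-simplified sketch of the first scenario, here's what a do-it-yourself "beacon" collection endpoint might look like using nothing but Python's standard library: it accepts the kind of querystring a page tag would send, appends each hit to a newline-delimited log ready for downstream processing, and returns an empty response. The /i path, the example parameters and the hits.jsonl file are all assumptions invented for the illustration, not borrowed from any particular product.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

LOG_PATH = "hits.jsonl"   # hypothetical destination; in practice this would feed Flume, Kafka, etc.

class BeaconHandler(BaseHTTPRequestHandler):
    """Accepts GET requests from a page tag and appends one JSON event per hit."""

    def do_GET(self):
        parsed = urlparse(self.path)
        if parsed.path != "/i":   # assumed beacon path, e.g. /i?page=/home&cid=abc123
            self.send_response(404)
            self.end_headers()
            return

        # Flatten the querystring sent by the tag, plus the request metadata a
        # web analytics tool would normally capture, into a single event record.
        params = {k: v[0] for k, v in parse_qs(parsed.query).items()}
        event = {
            "ts": time.time(),
            "ip": self.client_address[0],
            "user_agent": self.headers.get("User-Agent", ""),
            "referrer": self.headers.get("Referer", ""),
            **params,
        }
        with open(LOG_PATH, "a") as f:
            f.write(json.dumps(event) + "\n")

        # Real collectors usually return a 1x1 GIF or a 204; a 204 keeps the sketch short.
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), BeaconHandler).serve_forever()
```

In production you would obviously put something like this behind a proper web server and stream the events into a collection pipeline rather than a local file, but the principle is the same: the raw hits land somewhere you control, in a format you chose.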
Like a game of Jenga, you can usually pull out one or two of the blocks from the stack of a commercial web analytics tool without too much difficulty. But if you want to pull out more – and especially if you want to create a significantly customized schema – the tower starts to wobble. And that's when you might find yourself asking the question, "should we think about building our own web analytics tool?"
“Build your own Web Analytics tool? Are you crazy?”
Back in the dim and distant past (over ten years ago), when I was pitching companies in the UK on the benefits of WebAbacus, occasionally a potential customer would say, "Well, we have been looking at building our own web analytics tool". At the time, this usually meant that they had someone on staff who could write Perl scripts to process log data. I would politely point out that this was a stupid idea, for all the reasons that you would expect: if you build something yourself, you have to maintain and enhance it yourself, and you don't get any of the benefits of a commercial product that is funded by licenses to many customers and will therefore continue to evolve and add features.
But nowadays the technology landscape for managing, processing and analyzing web behavioral data (and other transactional data) has changed out of all recognition. There is a huge ecosystem, mostly based around Hadoop and related technologies, that organizations can leverage to build their own big data infrastructures, or extend commercial web analytics products.
At the lower end of the Web Analytics stack, tools like Apache Flume can be deployed to handle log data collection and management, with other tools such as Sqoop and Oozie managing data flows; Pig can be used for ETL and enrichment in the data processing layer; or Storm can be used for streaming (realtime) data processing. Further up the stack, Hive and HBase can be used to provide data warehousing and querying capabilities, while there is an increasing range of options (Cloudera's Impala, Apache Drill, Facebook's Presto, and Hortonworks' Stinger) to provide the kind of "interactive analysis" capabilities (dynamic filtering across related datasets) which commercial Web Analytics tools are so good at. And finally, at the top of the stack, Tableau is an increasingly popular choice for reporting & data visualization, and of course there is the Microsoft Power BI toolset.
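To give a flavour of what the processing layer looks like in practice, here's a minimal batch job that turns a raw hit log (of the shape produced by the collection sketch earlier) into a daily aggregate table. I've used PySpark here purely as a stand-in for the Pig or Hive jobs mentioned above, and the paths and field names are assumptions carried over from that earlier sketch.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Minimal processing-tier job: raw hits in, daily aggregates out.
spark = SparkSession.builder.appName("daily_pageviews").getOrCreate()

# Assumed location of the raw, newline-delimited JSON beacon logs.
hits = spark.read.json("hdfs:///raw/hits/*.jsonl")

daily = (
    hits
    .withColumn("date", F.to_date(F.from_unixtime(F.col("ts").cast("long"))))
    .groupBy("date", "page")
    .agg(
        F.count("*").alias("page_views"),
        F.countDistinct("cid").alias("unique_visitors"),  # 'cid' = assumed cookie ID field
    )
)

# Write an aggregate table that Hive, Impala or a BI tool could then query.
daily.write.mode("overwrite").parquet("hdfs:///warehouse/daily_pageviews")
```

Note that the unique_visitors column is exactly the kind of non-additive measure I mentioned at the start: you can't roll it up to weekly or monthly numbers just by summing the daily rows.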
In fact, with the richness of the ecosystem, the biggest challenge for anyone looking to roll their own Web Analytics system is a surfeit of choice. In subsequent blog posts (assuming I am able to increase my rate of posting to more than once every 18 months) I will write more about some of the choices available at various points in the stack, and how we’ve made some of these choices at Microsoft. But after finally bestirring myself to write the above, I think I need a little lie down now.
Great info! Thanks for another informative post. I always learn something at Lies, Damned Lies…
Thanks Bob! Good to know someone is still reading…
Ian
Hey Ian,
How are you? Hope you are doing great.
Is there any case study where these big data tools were used along with predictive analytics (data mining) tools to get more insight?
Regards,
Mandar Morekar
Awesome post!
We had many of the same sentiments when we started building Snowplow (an open source web analytics stack built on big data technologies) a couple of years ago – specifically:
1. Big data technologies and cloud-based services make building and instrumenting data processing applications much easier
2. There is a lot of value in not making the kinds of assumptions that web analytics vendors make about your data, especially assumptions around data models
The other area where your post really resonated was the value of decoupling the different parts of the web analytics stack. At Snowplow, that has given us a lot of flexibility to swap different parts of our data pipeline in and out to suit different customer needs, and to develop specific components faster, in isolation.
Looking forward to your next post on the topic, and would love to hear about the similarities and differences between what you’ve built with what we’ve built.
Very enlightening post!
I'm especially interested in data collection. What are the alternatives for capturing the same data that is sent to web analytics tools, so that it can be managed in a big data solution in parallel?
Thanks for the interesting insights!