But logs aren’t beat yet. Many vendors – Google, Webtrends, Clicktracks, WebAbacus, Site Intelligence to name a few – still offer the option to use logs as the primary data source. How come? Let’s take a look at how this battle plays out.
Round 1: Convenience
Say what you like about accuracy (and you will, I’m sure), but you can’t beat server logs for convenience. If you have the logs to hand, once you’ve installed your web analytics product, you simply point it at the logs, press the button, and sit back and wait for your data. There are wrinkles to be dealt with, for sure – you might have non-standard logs; you might have multiple web servers; or it might be difficult to gain access to the logs on your network (the three letters that strike fear into my heart? FTP), but most decent analytics tools can take these things in their stride.
Tag-based systems, by contrast, won’t yield up a scrap of data until you’ve made code changes to your website and cut them live to your server. Then there’s the hassle of ensuring that all the pages are tagged, and that pages don’t become untagged at some later date when some developer looks at the code and thinks “what’s this muck?” and removes it.
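If you want some protection against pages quietly losing their tags, one option is to scan the published HTML for the tag snippet from time to time. Here is a minimal sketch in Python. It assumes the tag can be spotted by a marker string; `analytics.js` is a made-up placeholder rather than any vendor's real file name, and `find_untagged_pages` is a hypothetical helper.

```python
from pathlib import Path

# Hypothetical marker: use whatever string reliably identifies your vendor's tag.
TAG_MARKER = "analytics.js"

def find_untagged_pages(site_root, marker=TAG_MARKER):
    """Return the HTML files under site_root that do not contain the marker."""
    untagged = []
    for page in sorted(Path(site_root).rglob("*.html")):
        html = page.read_text(encoding="utf-8", errors="ignore")
        if marker not in html:
            untagged.append(page)
    return untagged
```

This only checks static files on disk. Pages assembled dynamically would need to be fetched and checked as rendered.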
Round 1 winner: Logs
Round 2: Historical data
Straight back out of its corner after the success of round 1, logs delivers a second blow to tags: historical data. If you’ve been keeping your raw log files, a logs-based web analytics tool will be able to process that set of historical data and give you an instant picture of weeks, months or even years of activity on your site.
Tags just can’t match this – the data only starts to be collected on the day you implement the tags, so you can’t get a historical picture, by definition. This also makes it more challenging to move from one web analytics tool to another, since in the new tool you can’t get a historical picture to ease the transition. It means that many companies leave their old tool in place for months whilst the new tool builds up a base of data – costly if you’re paying for one or both tools.
Round 2 winner: Logs
Round 3: Visit and visitor counts
After its easy victories in the first two rounds, logs comes out with a swagger to square up on visit and visitor counts. But this time, tags is more than a match. Pretty much every tag-based analytics system serves up a persistent cookie with the tag, and uses this cookie to sessionize the data (that is, build visits, by identifying page requests from the same user) and generate counts of unique users over longer periods of time. Once you’ve gone through the pain of instrumentation, this stuff comes pretty much for free, and is a great benefit.
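To make "sessionize" a bit more concrete, here is a rough sketch in Python of how hits carrying a cookie ID might be grouped into visits. The 30-minute inactivity timeout is an assumption (a common convention, not something any particular vendor is confirmed to use), and the function name and sample hits are invented for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumption: a visit ends after 30 minutes of inactivity.
SESSION_TIMEOUT = timedelta(minutes=30)

def sessionize(hits, timeout=SESSION_TIMEOUT):
    """Group (cookie_id, timestamp, url) hits into visits, one cookie at a time."""
    by_cookie = defaultdict(list)
    for cookie_id, timestamp, url in hits:
        by_cookie[cookie_id].append((timestamp, url))

    visits = []  # each visit is (cookie_id, [urls in time order])
    for cookie_id, cookie_hits in by_cookie.items():
        cookie_hits.sort()
        current, last_seen = [], None
        for timestamp, url in cookie_hits:
            # A long enough gap closes the current visit and starts a new one.
            if last_seen is not None and timestamp - last_seen > timeout:
                visits.append((cookie_id, current))
                current = []
            current.append(url)
            last_seen = timestamp
        visits.append((cookie_id, current))
    return visits

# Invented sample data: two cookies, one of which returns later in the day.
hits = [
    ("abc", datetime(2008, 1, 7, 9, 0), "/"),
    ("abc", datetime(2008, 1, 7, 9, 5), "/products"),
    ("xyz", datetime(2008, 1, 7, 9, 6), "/"),
    ("abc", datetime(2008, 1, 7, 14, 0), "/"),
]
visits = sessionize(hits)
unique_visitors = len({cookie_id for cookie_id, _ in visits})
```

With this sample data you get three visits from two unique visitors: cookie "abc" makes a two-page visit in the morning and comes back for a second visit in the afternoon.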
Round 3 winner: Tags
Round 4: Accuracy
Log-based systems, on the other hand, are at the mercy of intermediate caches on the Internet – if a particular page (say, the home page) is relatively static and popular, a big subset of users will never hit the actual site’s web server when they request that page – they’ll be served a cached copy from a proxy somewhere between them and the site’s server (probably at their ISP, or their corporate firewall). So a log-based system can under-report page impressions by as much as 80% (though 40-50% is a more common figure). Worse still, the pages in a web site are not evenly cached, so a home page will be served from cache much more often than a deep page or a checkout page. This means that the shape of funnels can look screwy, and it is very difficult to determine anything other than broad traffic patterns.
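A quick bit of arithmetic shows why uneven caching bends a funnel out of shape. The page counts and cache hit rates below are entirely made up for illustration; the only point is that a heavily cached home page and an uncached checkout page produce a logged funnel that looks very different from the real one.

```python
# Hypothetical numbers: true page requests, and the share of each page's
# requests answered by a proxy cache before reaching the site's web server.
true_requests = {"home": 10000, "product": 4000, "checkout": 1000}
cache_hit_rate = {"home": 0.60, "product": 0.20, "checkout": 0.0}

def logged_requests(true_counts, hit_rates):
    """Requests that actually reach the web server and so appear in its log."""
    return {page: round(count * (1 - hit_rates[page]))
            for page, count in true_counts.items()}

logged = logged_requests(true_requests, cache_hit_rate)

# Share of home page requests that end up at checkout, real versus logged.
true_conversion = true_requests["checkout"] / true_requests["home"]
logged_conversion = logged["checkout"] / logged["home"]
```

With these invented rates the log records 4,000 home page impressions instead of 10,000, and the home-to-checkout rate appears to be 25% when it is really 10%.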
Round 4 winner: Tags
Round 5: Non-HTML content
Not every web site is made up entirely of HTML. Come to that, not every transaction-based system that you might want to analyze the usage of is HTML based – for example, call center or IVR system usage. In these situations, log-based systems come into their own; many log-based analytics systems can turn their hand to a surprising number of analytics tasks, as long as the system they’re analyzing the usage of can generate a log of its usage.
It used to be the case that this was a sucker punch for logs for non-HTML content on web sites too – but recently tag-based systems have got more adept at finding ways to track the usage of PDF files and other non-HTML content. Both Google Analytics and Gatineau have this functionality, for example.
Round 5 winner: A draw
Round 6: Sub-page events
To pull this off with a log-based system, you’d have to modify your site code to generate a dummy log entry on your web server (perhaps by requesting a non-existent HTML file), and then, whilst processing the data, treat this HTML file and others like it as a special case, ensuring the analytics system doesn’t accidentally count it as a page impression. It’s doable, but gnarly, gnarly, gnarly. And I don’t know of any log-based analytics system which implements a sub-page event model (perhaps someone can enlighten me via the comments box).
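For the curious, the special-casing described above might look something like this in Python. It is only a sketch: it assumes Common Log Format style lines, and it assumes the dummy requests all live under a made-up `/_event/` prefix so they can be told apart from real pages. Neither detail comes from any particular analytics product.

```python
import re

# Assumption: sub-page events are logged by requesting dummy URLs under this
# hypothetical prefix, e.g. /_event/video-play.html
EVENT_PREFIX = "/_event/"

# Matches the request part of a Common Log Format line: "GET /path HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+"')

def split_events(log_lines, prefix=EVENT_PREFIX):
    """Separate real page requests from dummy sub-page event requests."""
    pages, events = [], []
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue  # skip lines we cannot parse
        path = match.group(1)
        if path.startswith(prefix):
            events.append(path[len(prefix):])  # keep just the event name
        else:
            pages.append(path)
    return pages, events

# Invented sample log: the dummy file does not exist, hence the 404.
log_lines = [
    '10.0.0.1 - - [07/Jan/2008:09:00:00 +0000] "GET /index.html HTTP/1.1" 200 5120',
    '10.0.0.1 - - [07/Jan/2008:09:00:12 +0000] "GET /_event/video-play.html HTTP/1.1" 404 209',
    '10.0.0.1 - - [07/Jan/2008:09:01:30 +0000] "GET /products.html HTTP/1.1" 200 7340',
]
pages, events = split_events(log_lines)
```

Here `pages` holds the two real page impressions and `events` holds the single `video-play.html` event, so the dummy request never inflates the page count.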
Round 6 winner: Tags
Round 7: Data integration
The team in the tags corner cries foul at this point, pointing out that data integration is more a function of whether you run your analytics system in-house or have it hosted as a third-party service; and that there are plenty of web analytics tools which can combine tag-based data collection with an in-house service. But there’s a strong correlation between logs/tags and in-house/hosted, so the referee allows the fight to continue.
In-house systems do make data integration easier. A log-based analytics system will capture all the user identifiers (in cookies, typically), including those used by the site’s own CMS, and a half-way decent web analytics tool will allow these identifiers to be extracted and then used as a key for the import of related data (for example, the purchase history of a known customer).
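As a rough illustration of using a cookie identifier as a join key, here is a small Python sketch. It assumes the web server has been configured to log the Cookie header, that the site's own identifier sits in a cookie called `customer_id`, and that purchase history is available as a simple lookup. All of those names and the sample data are invented for the example.

```python
import re

# Assumption: the site's own identifier lives in a cookie called "customer_id".
CUSTOMER_COOKIE_RE = re.compile(r'(?:^|;\s*)customer_id=([^;\s"]+)')

def extract_customer_id(cookie_header):
    """Pull the customer identifier out of a logged Cookie header, if present."""
    match = CUSTOMER_COOKIE_RE.search(cookie_header)
    return match.group(1) if match else None

def attach_purchase_history(hits, purchase_history):
    """Join (url, cookie_header) hits to purchase history keyed by customer ID."""
    enriched = []
    for url, cookie_header in hits:
        customer_id = extract_customer_id(cookie_header)
        enriched.append({
            "url": url,
            "customer_id": customer_id,
            # Anonymous or unknown visitors simply get an empty history.
            "orders": purchase_history.get(customer_id, []),
        })
    return enriched

# Invented sample data: one known customer and one anonymous visitor.
hits = [
    ("/offers", "session=9f2; customer_id=C1001"),
    ("/", "session=77a"),
]
purchase_history = {"C1001": ["order-17", "order-42"]}
enriched = attach_purchase_history(hits, purchase_history)
```

The first hit picks up the two orders for customer C1001, while the second, with no identifier cookie, stays anonymous.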
Because tag-based systems tend to send their tag request to a third-party server (the web analytics provider’s data collection server), these cookies are not automatically captured. You can modify or customize the tag script for some tools to capture identity cookie values as variables, but then you’re still left with the challenge of importing potentially sensitive customer data across the Internet. Data protection laws in the EU and US state that in order to use customer data for this “secondary use” and transfer it to a third party, you have to get the customer’s explicit permission – something that most site owners are reluctant to do, for obvious reasons.
Round 7 winner: Logs (kinda)
The final score
Finally the competitors stagger back to their corners, bloody but unbowed. After some debate, the judges declare the final score to be:
Tags: 3, Logs: 2½
So, a closer result than you might think. Tagging wins out (just) because of the better quality of the data it yields up; although it’s a pain to instrument a site, you immediately get access to pretty good-quality, well sessionized data that you can start to build reports around. Logs are much more of a struggle to get set up to deliver good quality data, but once you’re there you have as much flexibility as with a tag-based system, and more in some respects (for example, in the area of data integration).