Tags vs Logs: The big fight

There are many in the web analytics industry who could say (with some justification) that the tussle over whether to use client-side JavaScript tags or web server logs as your source of web analytics data has already been settled, with tags being declared the winner by a knockout. Certainly with Gatineau we’ve decided to place ourselves firmly in the tags corner (if you want to provide a hosted web analytics solution, and collect the data centrally, you really don’t have any other option).

But logs aren’t beat yet. Many vendors – Google, Webtrends, Clicktracks, WebAbacus, Site Intelligence to name a few – still offer the option to use logs as the primary data source. How come? Let’s take a look at how this battle plays out.

Round 1: Convenience

Say what you like about accuracy (and you will, I’m sure), but you can’t beat server logs for convenience. If you have the logs to hand, once you’ve installed your web analytics product, you simply point it at the logs, press the button, and sit back and wait for your data. There are wrinkles to be dealt with, for sure – you might have non-standard logs; you might have multiple web servers; or it might be difficult to gain access to the logs on your network (the three letters that strike fear into my heart? FTP), but most decent analytics tools can take these things in their stride.
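To give a feel for what "pointing it at the logs" involves under the hood, here is a minimal Python sketch that parses a single line in the widely used Combined Log Format. The regular expression, the field names and the sample line are illustrative assumptions of mine, not any vendor's parser; real tools cope with far more formats and edge cases.

```python
import re
from datetime import datetime

# Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # non-standard line: skip it rather than guess
    fields = match.groupdict()
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    return fields

# Made-up sample line for illustration only.
sample = ('203.0.113.7 - - [22/Jan/2008:12:35:01 +0000] '
          '"GET /index.html HTTP/1.1" 200 5120 '
          '"http://example.com/" "Mozilla/5.0"')
hit = parse_line(sample)
print(hit["path"], hit["status"], hit["time"].isoformat())
```

Lines that don't match come back as None, which is roughly where the "non-standard logs" wrinkle shows up in practice.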

Tag-based systems, by contrast, won’t yield up a scrap of data until you’ve made code changes to your website and cut them live to your server. Then there’s the hassle of ensuring that all the pages are tagged, and that pages don’t become untagged at some later date when some developer looks at the code and thinks “what’s this muck?” and removes it.

Round 1 winner: Logs

Round 2: Historical data

Straight back out of its corner after the success of round 1, logs delivers a second blow to tags: historical data. If you’ve been keeping your raw log files, a logs-based web analytics tool will be able to process that set of historical data and give you an instant picture of weeks, months or even years of activity on your site.
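As a rough illustration of that "instant picture", the Python sketch below counts requests per calendar month across a pile of archived log lines. The timestamp pattern assumes Common/Combined Log Format dates, and the three sample lines are invented for the example.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the date part of a timestamp such as [03/Nov/2007:09:15:00 +0000]
TIMESTAMP = re.compile(r'\[(\d{2}/\w{3}/\d{4}):')

def requests_per_month(lines):
    """Count log lines per calendar month, keyed as 'YYYY-MM'."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.search(line)
        if match:
            day = datetime.strptime(match.group(1), "%d/%b/%Y")
            counts[day.strftime("%Y-%m")] += 1
    return dict(sorted(counts.items()))

# Invented archive lines; in practice you would read these from old log files.
archived = [
    '198.51.100.4 - - [03/Nov/2007:09:15:00 +0000] "GET / HTTP/1.1" 200 512',
    '198.51.100.9 - - [17/Nov/2007:18:02:44 +0000] "GET /about.html HTTP/1.1" 200 734',
    '198.51.100.4 - - [05/Dec/2007:11:30:10 +0000] "GET / HTTP/1.1" 200 512',
]
print(requests_per_month(archived))
```

The same loop works whether the files cover a week or several years, which is the whole point of this round.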

Tags just can’t match this – the data only starts to be collected on the day you implement the tags, so you can’t get a historical picture, by definition. This also makes it more challenging to move from one web analytics tool to another, since in the new tool you can’t get a historical picture to ease the transition. It means that many companies leave their old tool in place for months whilst the new tool builds up a base of data – costly if you’re paying for one or both tools.

Round 2 winner: Logs

Round 3: Visit and visitor counts

After its easy victories in the first two rounds, logs comes out with a swagger to square up on visit and visitor counts. But this time, tags is more than a match. Pretty much every tag-based analytics system serves up a persistent cookie with the tag, and uses this cookie to sessionize the data (that is, build visits, by identifying page requests from the same user) and generate counts of unique users over longer periods of time. Once you’ve gone through the pain of instrumentation, this stuff comes pretty much for free, and is a great benefit.
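For the curious, sessionizing can be sketched in a few lines of Python. This is a simplified illustration rather than how any particular product does it: the 30-minute inactivity cut-off is an assumption of mine (a common convention, but not something this post specifies), and the cookie values and hits are made up.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed cut-off, not from the article

def sessionize(hits):
    """Group (visitor_id, timestamp, page) hits into a list of visits per visitor."""
    by_visitor = defaultdict(list)
    for visitor_id, when, page in sorted(hits, key=lambda h: (h[0], h[1])):
        visits = by_visitor[visitor_id]
        # Start a new visit on the visitor's first hit or after a long gap.
        if not visits or when - visits[-1][-1][0] > SESSION_TIMEOUT:
            visits.append([])
        visits[-1].append((when, page))
    return dict(by_visitor)

# Made-up hits: the first element stands in for the persistent cookie value.
hits = [
    ("abc", datetime(2008, 1, 22, 12, 0), "/"),
    ("abc", datetime(2008, 1, 22, 12, 10), "/products"),
    ("abc", datetime(2008, 1, 22, 15, 0), "/"),
    ("xyz", datetime(2008, 1, 22, 12, 5), "/"),
]
visits = sessionize(hits)
unique_visitors = len(visits)
total_visits = sum(len(v) for v in visits.values())
print(unique_visitors, total_visits)
```

Visitor "abc" comes back after a gap of nearly three hours, so that return counts as a second visit, while the unique visitor count stays keyed on the cookie value.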

It’s perfectly possible to use cookies as user identifiers in a logs-based system; but firstly the site has to issue a cookie, and secondly that cookie has to be persistent and pervasive (i.e. every page should issue it if it isn’t already present in the browser). This can be a royal pain to set up.

Round 3 winner: Tags

Round 4: Accuracy

With a win under their belt, the team in the tags corner is starting to feel a little more bullish. And, sure enough, when it comes to accuracy, tags give logs a run for its money. The main reason for this is that the actual tag request made by the JavaScript in a tag-based system cannot be cached; so every request made by a visitor ends up being recorded by the system that’s listening out for the tag requests, resulting in pretty good accuracy at the page impression level.

Log-based systems, on the other hand, are at the mercy of intermediate caches on the Internet – if a particular page (say, the home page) is relatively static and popular, a big subset of users will never hit the actual site’s web server when they request that page – they’ll be served a cached copy from a proxy somewhere between them and the site’s server (probably at their ISP, or their corporate firewall). So a log-based system can under-report page impressions by as much as 80% (though 40-50% is a more common figure). Worse still, the pages in a web site are not evenly cached, so a home page will be served from cache much more often than a deep page or a checkout page. This means that the shape of funnels can look screwy, and it is very difficult to determine anything other than broad traffic patterns.
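A quick bit of arithmetic shows how uneven caching bends a funnel. The page view counts and cache rates in this Python sketch are invented purely for illustration; only the general effect comes from the discussion above.

```python
def logged_hits(true_views, cache_rate):
    """Page views that reach the origin server when a share is served from cache."""
    return round(true_views * (1 - cache_rate))

# (page, true page views, assumed share served from intermediate caches)
funnel = [
    ("home", 10000, 0.50),
    ("product", 4000, 0.20),
    ("checkout", 1000, 0.00),
]

true_home = funnel[0][1]
logged_home = logged_hits(funnel[0][1], funnel[0][2])
for page, views, cache_rate in funnel:
    logged = logged_hits(views, cache_rate)
    # Express each step as a share of home page views, true versus logged.
    print(f"{page:9} true {views / true_home:5.0%}  logged {logged / logged_home:5.0%}")
```

With these made-up numbers, the logs suggest that 20% of home page views lead to a checkout page view, when the true figure is 10%. The checkout count itself is right; it is the heavily cached top of the funnel that is missing.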

Round 4 winner: Tags

Round 5: Non-HTML content

Not every web site is made up entirely of HTML. Come to that, not every transaction-based system that you might want to analyze the usage of is HTML based – for example, call center or IVR system usage. In these situations, log-based systems come into their own; many log-based analytics systems can turn their hand to a surprising number of analytics tasks, as long as the system they’re analyzing the usage of can generate a log of its usage.

It used to be the case that this was a sucker punch for logs for non-HTML content on web sites too – but recently tag-based systems have got more adept at finding ways to track the usage of PDF files and other non-HTML content. Both Google Analytics and Gatineau have this functionality, for example.

Round 5 winner: A draw

Round 6: Sub-page events

Another knock-down for tags in this round. Sites which refresh content (manually or automatically) without executing a full page refresh present a particular challenge for web analytics tools of all stripes; but tag-based systems rise to the challenge much better than log-based ones. Increasingly, tag-based analytics tools offer the ability to attach a JavaScript event call to sub-page events, and track them as a separate kind of interaction (i.e. not a full-fledged page impression, but something worth counting nonetheless).

To pull this off with a log-based system, you’d have to modify your site code to generate a dummy log entry on your web server (perhaps by requesting a non-existent HTML file), and then, whilst processing the data, treat this HTML file and others like it as a special case, ensuring the analytics system doesn’t accidentally count it as a page impression. It’s doable, but gnarly, gnarly, gnarly. And I don’t know of any log-based analytics system which implements a sub-page event model (perhaps someone can enlighten me via the comments box).
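The special-casing step might look something like the Python sketch below. The `/_events/` prefix is a hypothetical convention I have made up for the dummy requests; nothing here reflects a real product's event model.

```python
EVENT_PREFIX = "/_events/"  # hypothetical path reserved for dummy requests

def classify(path):
    """Label a logged request path as a sub-page 'event' or a 'page' impression."""
    return "event" if path.startswith(EVENT_PREFIX) else "page"

def tally(paths):
    """Count page impressions and sub-page events separately."""
    counts = {"page": 0, "event": 0}
    for path in paths:
        counts[classify(path)] += 1
    return counts

# Made-up request paths as they might appear in the web server log.
logged_paths = [
    "/index.html",
    "/_events/gallery-next.html",
    "/_events/gallery-next.html",
    "/checkout.html",
]
print(tally(logged_paths))
```

The filtering is the easy bit. The gnarly part is getting the site to fire those dummy requests reliably in the first place, and remembering to ignore the 404s they generate.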

Round 6 winner: Tags

Round 7: Data integration

The team in the tags corner cries foul at this point, pointing out that data integration is more a function of whether you run your analytics system in-house or have it hosted as a third-party service; and that there are plenty of web analytics tools which can combine tag-based data collection with an in-house service. But there’s a strong correlation between logs/tags and in-house/hosted, so the referee allows the fight to continue.

In-house systems do make data integration easier. A log-based analytics system will capture all the user identifiers (in cookies, typically), including those used by the site’s own CMS, and a half-way decent web analytics tool will allow these identifiers to be extracted and then used as a key for the import of related data (for example, the purchase history of a known customer).
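Here is a minimal Python sketch of that kind of key-based import, assuming the web server has been configured to write the Cookie header into its log. The `customer_id` cookie name, the order counts and the logged hits are all hypothetical.

```python
import re

# Hypothetical cookie name set by the site's own CMS.
CUSTOMER_COOKIE = re.compile(r'(?:^|;\s*)customer_id=([^;]+)')

def customer_id_from_cookie(cookie_header):
    """Pull the customer identifier out of a logged Cookie header, if present."""
    match = CUSTOMER_COOKIE.search(cookie_header)
    return match.group(1) if match else None

# Hypothetical in-house data keyed by the same identifier: orders placed so far.
purchase_history = {"C1001": 3, "C2002": 0}

# Made-up (path, logged Cookie header) pairs.
logged = [
    ("/index.html", "session=9f2; customer_id=C1001"),
    ("/offers.html", "session=77a"),
    ("/basket.html", "customer_id=C2002; lang=en"),
]

enriched = []
for path, cookie_header in logged:
    cid = customer_id_from_cookie(cookie_header)
    # Anonymous hits keep None for both the identifier and the order count.
    enriched.append((path, cid, purchase_history.get(cid)))

for row in enriched:
    print(row)
```

Because both the logs and the purchase data live in-house, the join happens without any customer data leaving the building.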

Because tag-based systems tend to send their tag request to a third-party server (the web analytics provider’s data collection server), these cookies are not automatically captured. You can modify or customize the tag script for some tools to capture identity cookie values as variables, but then you’re still left with the challenge of importing potentially sensitive customer data across the Internet. Data protection laws in the EU and US state that in order to use customer data for this “secondary use” and transfer it to a third party, you have to get the customer’s explicit permission – something that most site owners are reluctant to do, for obvious reasons.

Round 7 winner: Logs (kinda)

The final score

Finally the competitors stagger back to their corners, bloody but unbowed. After some debate, the judges declare the final score to be:

Tags: 3, Logs: 2½

So, a closer result than you might think. Tagging wins out (just) because of the better quality of the data it yields up; although it’s a pain to instrument a site, you immediately get access to pretty good-quality, well sessionized data that you can start to build reports around. Logs are much more of a struggle to get set up to deliver good quality data, but once you’re there you have as much flexibility as with a tag-based system, and more in some respects (for example, in the area of data integration).

8 thoughts on “Tags vs Logs: The big fight”

Stephen Turner

January 22, 2008 at 12:35 pm

Nicely done, Ian. That’s a well balanced summary in an amusing format. You forgot one for logfiles though: tracking robots. You don’t want to mix robots with human visitors, but as long as your program can filter them out into a separate category, SEOs love being able to see when the search engine bots last came by and which pages they viewed.
Richard

January 22, 2008 at 1:17 pm

It would be brash to declare a winner; each method has its own unmatched strength/specialization. I would even consider it a stretch to put both products in the same ring.
Unless a site is using both, they are missing out on critically important data.
Ian Thomas

January 22, 2008 at 11:53 pm

Stephen,
Doh! You’re right. I knew there was something I was missing. Log-based systems are, as you say, great for tracking robot/spider-based activity, essential data for any self-respecting SEO. This could have pushed things over the edge for logs, but then, as Richard wisely points out, it’s not really an either/or situation. But the blogosphere loves a good fight… 😉
Avinash Kaushik

January 23, 2008 at 10:01 am

A shout out to Richard’s thought that each solution might be the right fit for a company – weighing pros and cons was never more important.
But there are two dimensions that the Practitioner in me would like to share (after years of being a Practitioner sadly it is hard to keep “his” voice down!).
1) Complexity of being able to work with your own company IT team and resources to build and maintain a log based solution for web analytics. After years of working in companies large and small I would rather pull my nails out than go back and beg the IT Team to do one more thing for me to get data out of the logs (or even capture all the data I need and store it properly and make sure the jobs run to process the data and create reports etc).
Considering your IT resources available and their commitment to you is very important (especially as over the last few years Web Analytics moves from being an IT function to a Business function).
I love outsourcing my data capture, processing and availability problems to Omniture and demand service! 🙂
2) This applies usually to large companies. It is much easier for me to “integrate” my web analytics data with other web solutions like online surveys and multivariate testing tools etc. It is nice that the tag based solutions have worked with these other solutions to think through how to tie data together (and it helps that they are all tag based I suppose). Makes life easier. Not the hugest of deals but something to consider as user experience becomes more complex and web analytics moves from just clickstream analysis.
More food for thought. I am sure net net the scores might not be influenced that much.
-Avinash.
Juan Damia

January 23, 2008 at 10:48 am

Hehehe,LOL.
Congratulations! That’s the most friendly way to tell something not so friendly, and you did it in a very funny way.
Jeff

January 24, 2008 at 12:18 pm

This is so 4 years ago!
So, all things are weighted equally? Historical data, no matter how misleading, is just as important as accuracy? I’m shocked we’re still having this debate!
Yeah, tag-based systems are difficult to setup/maintain, but for the majority of companies who actually *need* analytics (including anyone spending/making money online), why shortchange your data and jeopardize your decisions?
It doesn’t matter how easy a solution is to maintain, if the data is misleading, then you risk making poor decisions and losing money.
3 to 2.5? Should be more like 1,000 to 1.
Ian Thomas

January 24, 2008 at 12:44 pm

Jeff,
You may say that this debate was over four years ago, but if you read some of the comments on this post there are still plenty of people out there who value log-based solutions.
As for the weighting, I do agree that some of these measures are more likely to appeal to a bigger group of people than some others, but every company’s analytics needs are different. If you HAVE to have historical data, no matter how flaky, then you don’t have much choice but to go with a logs solution. And bear in mind that there are various ways to mitigate the flakiness of logs-based data (for example, not looking too closely at individual page counts, and instead focusing on overall trends). Likewise, if you have no access to logs, you are only looking at a tag-based solution.
My actual recommendation when people ask me this question is that they should use both kinds of system if possible/needed. It’s horses for courses.
Jared Huber

February 18, 2008 at 7:40 pm

Great post. I, too, think the debate is far from over. Log analysis is limited for RIAs, and my tags have disappeared or broken often enough to give me gray hairs.
I’m curious to hear your thoughts on packet-sniffing solutions. Do they allow the best of both worlds, or do they suffer from another set of limitations?
-Jared
