Things That Throw Web Stats
Part 2: Inaccuracies in Web Analytics Software
In my last article, Things That Throw Web Stats – Part 1, I discussed how the nature of web technology itself makes absolute precision in web analytics impossible. We’re “guesstimating” visitors, we’re inevitably under-counting duration, and a visit is an arbitrary unit of time, not a genuine measure of someone’s activity.
We should remember that these stats are gathered, processed, and delivered by software. The web analytics software industry is very new, and far from mature. Web analytics software is not perfect, and it introduces inaccuracies of its own into the process.
This article discusses the inaccuracies introduced by the software we use.
Log Analysis Issues
Many people use log analysis to get their stats. Log analysis is much less accurate than page-based tracking. Here’s why:
Spiders & Robots
Search engine spiders read your site; so does performance monitoring software. Most log analysis software doesn’t distinguish between page requests made by humans and page requests made by software, which inflates the number of page views dramatically. Since search engines go through pages at a rate of about one per second, it can also reduce average visit duration and average page read time. This is OK if you know about it and adjust for it.
If you think all the reported activity is coming from people, but your system is not separating out spiders, then you will believe that people are making shorter visits than they really are. I believe this is why most designers think the average visit duration to a web site is 3–4 minutes and the average page duration is about 30 seconds. In reality it’s about twice that.
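Separating spiders from people usually comes down to inspecting the user-agent string in each log line. Here is a minimal sketch of that idea, assuming the common Apache “combined” log format and an illustrative (far from exhaustive) list of bot signatures:

```python
import re

# Illustrative bot user-agent fragments — a real list would be much longer
# and kept up to date.
BOT_SIGNATURES = ("googlebot", "bingbot", "slurp", "spider", "crawler", "monitor")

# Apache "combined" log format: ip ident user [time] "request" status size "referer" "user-agent"
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "([^"]*)" \d+ \S+ "[^"]*" "([^"]*)"$')

def is_bot(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(sig in ua for sig in BOT_SIGNATURES)

def human_requests(log_lines):
    """Yield (ip, request) pairs for lines whose user-agent looks human."""
    for line in log_lines:
        m = LINE_RE.match(line)
        if m and not is_bot(m.group(3)):
            yield m.group(1), m.group(2)
```

This only catches software that announces itself honestly; robots that fake a browser user-agent slip straight through, which is part of why log-based counts stay fuzzy.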
Flash Files
SWF files are Flash files, and Flash is a problem for log analysis. A Flash file can be a complete page, or it can be a simple animation inside a page. So when a log analysis product sees that an SWF file has been viewed, should it count a page view or not? Most count it as a page view. If you’ve got Flash animations inside your pages, look at your stats again. If you’ve got Flash as both full pages and as page elements, I doubt you’ll be able to get accurate stats from log analysis at all.
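The SWF ambiguity shows up as a policy switch in the counting code: the analyst has to decide, site-wide, whether `.swf` requests count as pages. A small sketch, with an illustrative (not standard) set of page extensions:

```python
import os
from urllib.parse import urlparse

# Illustrative extensions treated as pages; "" covers directory indexes like "/".
PAGE_EXTENSIONS = {".html", ".htm", ".php", ".asp", ""}

def is_page_view(request_path: str, count_swf: bool = False) -> bool:
    """Decide whether a logged request counts as a page view.

    Whether .swf counts is a policy choice: a full-page Flash movie
    argues yes, an in-page animation argues no. One flag cannot be
    right for a site that uses Flash both ways.
    """
    ext = os.path.splitext(urlparse(request_path).path)[1].lower()
    if ext == ".swf":
        return count_swf
    return ext in PAGE_EXTENSIONS
```

If a site mixes both uses of Flash, neither setting of `count_swf` gives an accurate total, which is exactly the problem described above.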
Caching & Cache busting
Most browsers keep a copy of each web page you read. If you hit the back button they serve you that page instead of bothering to ask the server for another copy. Log analysis misses this because the server never saw the second viewing. This accounts for about 30% of all page views. Saving pages like this is called “caching.” It’s a major problem in online advertising because if an advertisement is cached people may not get paid for delivering it. This is why there is so much talk about “cache busting” technology in ad delivery.
It isn’t just browsers that cache. Corporate gateways cache commonly requested pages to save time and bandwidth, and ISPs may cache for the same reasons. If you’re using log analysis for your stats, you’re missing about one-third of your activity. Page-based tracking works from the act of reading the page itself, so it is inherently “cache-busting.”
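The simplest cache-busting trick is to make every tracking request look unique, so no cache along the way can serve a stored copy. A minimal sketch, assuming a hypothetical tracking-pixel URL and an arbitrary parameter name (`cb`):

```python
import random
import time

def cache_busted(url: str) -> str:
    """Append a throwaway query parameter so browser, gateway, and ISP
    caches treat each request as unique. The parameter name 'cb' is an
    arbitrary choice, not a standard."""
    sep = "&" if "?" in url else "?"
    return f"{url}{sep}cb={int(time.time() * 1000)}{random.randint(0, 9999)}"
```

Page-based trackers typically generate something like this in the page itself, so every read — cached page or not — produces a fresh hit the tracker can see.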
Back Buttons & Wake Turbulence
While page-based tracking may avoid the caching problems of log analysis, it falls victim to another back-button issue which log analysis avoids. Many people exit a site by repeatedly clicking their back button. Log analysis doesn’t pick this up, but page-based tracking does. This means many visits end with a series of 1- or 2-second page views in reverse order from the first half of the visit. There’s no official term for this, but I call it “wake turbulence.” Most analysis tools don’t even recognize this problem, let alone deal with it. It inflates the average number of page views per visitor, and reduces the average page duration.
I don’t think you can blame software for this one. If you watch people fill in multi-page forms, you’ll often see them go back and forth within the sequence, so the system can’t automatically eliminate quick views of preceding pages. This would mean you would have to examine the click-streams yourself and decide which were valid reads and which were just wake turbulence. Obviously that’s an impossible job. You just have to accept a degree of fuzziness around your stats for visit duration, number of pages read, and average page read time.
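To make the pattern concrete, here is a crude heuristic that trims a trailing run of very short re-views from a visit’s click-stream. As the paragraph above says, no rule like this can reliably tell deliberate back-and-forth (multi-page forms) from turbulence — so treat the thresholds here as illustrative guesses, not a fix:

```python
def trim_wake_turbulence(visit, max_backout_secs=2):
    """Strip a trailing run of very short page views that revisit pages
    seen earlier — the "wake turbulence" of back-button exits.

    `visit` is a list of (page, seconds_on_page) tuples in order viewed.
    The 2-second cutoff is an illustrative guess, not an established value,
    and the heuristic will wrongly trim some genuine quick re-reads.
    """
    pages = [p for p, _ in visit]
    end = len(visit)
    while end >= 2:
        page, secs = visit[end - 1]
        # Candidate turbulence: a quick re-view of a page seen earlier.
        if secs <= max_backout_secs and page in pages[:end - 1]:
            end -= 1
        else:
            break
    return visit[:end]
```

Run over `[("a", 30), ("b", 45), ("c", 20), ("b", 1), ("a", 1)]`, this drops the two 1-second back-outs and keeps the three genuine reads.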
Daylight Saving
Very few stats packages handle the daylight saving changeover accurately. Think about what happens when daylight saving ends. Someone enters your site at 11:45pm. They stay for 30 minutes. Daylight saving ends at midnight and the clocks roll back one hour, so their visit finishes at 11:15pm. When daylight saving begins, the clocks go the other way: the same 30-minute visit starts at 11:45pm and finishes at 1:15am.
You’d be surprised how few web analytics packages handle this accurately. In fairness it is a tough one to code for. Some systems can cope with this because they work in GMT, then convert the visit times at the point of reporting. But most use local time. How do they handle this? A surprising number simply discard all the records during the cross-over period.
Cookie Blocking
Some of your visitors don’t trust you. Some major-name tracking systems are listed as spyware and blocked. Some people block cookies; some people clean out their cookies regularly. If you are tracking repeat visitor behavior with cookies, you have to accept some degree of inaccuracy as people block or remove them.
Blocking is more likely if you have another company gathering and analyzing your cookies, though it depends on how they do it. If your site sets the cookie, you have less chance of being blocked than if their system sets it. A cookie set by their system is called a “3rd-party cookie,” and 3rd-party cookies are much more likely to be blocked than your own. Under some circumstances, in some countries, 3rd-party cookies can even be illegal.
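For illustration, here is what a simple first-party visitor cookie might look like, built with Python’s standard library. The cookie name and one-year lifetime are arbitrary choices for the sketch:

```python
from http.cookies import SimpleCookie

def visitor_cookie_header(visitor_id: str) -> str:
    """Build a Set-Cookie header for a first-party visitor cookie.

    Served from your own domain, this is a first-party cookie. The
    identical cookie set from a tracking vendor's domain would be
    third-party — and far more likely to be blocked or cleaned out.
    """
    cookie = SimpleCookie()
    cookie["visitor_id"] = visitor_id
    cookie["visitor_id"]["max-age"] = 60 * 60 * 24 * 365  # one year (arbitrary)
    cookie["visitor_id"]["path"] = "/"
    return cookie.output()
```

Whatever the mechanics, the lesson stands: cookie-based repeat-visitor counts carry an unavoidable error margin, because you cannot see the visitors who block or delete them.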
Transversal
Transversal is what you do when you click a hyperlink – you transverse from one page to the next. Sometimes people click on a link but never arrive at the other end: browsers crash, people change their minds, and so on.
This is starting to become a source of contention in PPC advertising. Google usually charges me for more visits than I can see arriving in my clients’ sites — usually over by 25% or so. I’m not the only person with this problem. Google believe this is a minor and rare problem, but many users are not so sure. I have questioned Google on this, and they have informed me they use log analysis for calculating click-throughs. I have asked them if they filter out search engines and robots which we know read Google results and follow links.
They have replied:
“I would like to reassure you that our system will only count legitimate clicks to your client's ads and will filter out all other traffic... As stated in our Terms and Conditions, we require that all parties using Google AdWords services accept our metrics.”
This problem is not unique to Google. It occurs to a greater or lesser degree with all forms of inter-site link activity. This means that your ROI calculations for PPC advertising and affiliate marketing cannot be perfectly accurate, but need to permit a margin of error. Don’t go making decisions on 1 or 2 percentage points.
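When you compare billed clicks against the visits your own tracking records, the discrepancy is a simple percentage — and it is that percentage, not the raw click count, that should feed your ROI margin of error. A small sketch with illustrative numbers:

```python
def click_discrepancy(billed_clicks: int, observed_visits: int) -> float:
    """Percentage of billed clicks that never showed up as visits
    in your own tracking."""
    if billed_clicks == 0:
        return 0.0
    return 100.0 * (billed_clicks - observed_visits) / billed_clicks

# Example: billed for 1,250 clicks, but only 1,000 visits observed
# on-site — a 20% gap, in the region described above.
gap = click_discrepancy(1250, 1000)
```

If your observed gap sits at 20–25%, a campaign whose measured ROI advantage is only a point or two is inside the noise, not a real winner.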
We have to accept that web analytics software is in its infancy. Is anyone old enough to remember Pong? Compare that with the latest computer games. Look at your web analytics software and try to imagine the same level of improvement over the next 20 years. Now look again at the software you use today. Understand that this stuff has a long way to go.
We can do great things with web analytics software today compared with 5 years ago, but we have only just begun. Remember that this is not precision accounting but statistical analysis with flawed software. Work with big margins of error and focus on trends not detailed numbers. Understand that no matter what you do, a certain degree of guessing is inherent in the process.
Life’s full of uncertainties and web analytics is no different. Somehow we all manage to get by.
Talk to me if you want to discuss this, or any other issue.