New Methods of Analysis
Forget single numbers; you need ranges.
This article asks you to make a paradigm shift in how you think about identifying unique visitors. It describes new, cutting-edge methodologies for identifying people, taking us on a journey from first-generation web analytics to the second.
What is a paradigm shift?
First, a bit of background about the concept of a paradigm shift.
A paradigm shift is a sudden jump from one way of thinking to another. The concept comes from Thomas Kuhn’s 1962 book, “The Structure of Scientific Revolutions.” He wrote that science doesn’t evolve gradually, a little at a time, but advances as a series of peaceful plateaus punctuated by violent upheavals. During these upheavals the conceptual worldview (or paradigm) is replaced by a new one. Think Darwin, Einstein, Galileo. Kuhn showed that the intellectual violence of these upheavals was caused by people stubbornly trying to hang on to the old paradigm they were used to, even when it no longer worked. In science this generally means we move into a new paradigm when the scientists who grew up with the old one retire.
I hope to shift, fundamentally and gently, the way we think about metrics in this article.
The problem with users
We need to identify unique users on the web. It’s fundamental. We need to know how many people visit, what they read, how long they stay, how often they return, and at what intervals. These are the “atoms” of our metrics. Without this knowledge we really can’t do much.
If you look in detail at how metrics software works, you’ll see that it operates by negative logic. The raw data is page views and whom these were served to. The software is designed not to identify which set went to one person, but to identify sets which went to different people. It determines who the unique visitors were by looking at pages viewed at roughly the same time and deciding whether they went to the same person or not. In other words, to identify unique individuals, you have to identify different people.
We have two methods for identifying unique individuals. We can look at their IP address plus their User Agent on the basis that every unique combination constitutes a unique person. This is an audit-approved methodology. However, we know this is not reliable. We can’t guarantee a one-to-one relationship between IP and individual. ISPs often dynamically allocate a different IP address every time a user dials in. People will therefore have more than one IP address, and the same IP address can be applied to different people. Similarly, we know lots of people have the same user agent. So IP+User Agent is a rough estimate.
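As a rough illustration, the IP + User Agent heuristic amounts to grouping page views under a composite key. Here is a minimal sketch of that idea; the log records and field names are my own invention, not the format of any particular analytics package:

```python
# Sketch: counting "unique visitors" by IP + User Agent.
# The page-view records below are hypothetical; real server logs vary.
page_views = [
    {"ip": "203.0.113.7",  "ua": "Mozilla/5.0 (Windows)", "page": "/home"},
    {"ip": "203.0.113.7",  "ua": "Mozilla/5.0 (Windows)", "page": "/news"},
    {"ip": "198.51.100.2", "ua": "Mozilla/5.0 (Mac)",     "page": "/home"},
    {"ip": "203.0.113.7",  "ua": "Mozilla/5.0 (Mac)",     "page": "/home"},
]

# Every distinct (IP, User Agent) pair is treated as one person --
# the heuristic the article describes as only a rough estimate.
visitors = {(v["ip"], v["ua"]) for v in page_views}
print(len(visitors))  # 3 distinct IP+UA combinations
```

Note how the heuristic cuts both ways: two office workers behind the same proxy with identical browsers collapse into one "visitor," while one person on a dynamic IP splits into several.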
Cookies were supposed to increase certainty. We can plant a unique identifier on someone’s computer. If we see it there next time, we can be almost certain they are the same person. The problem has been that we’ve applied flawed logic to assume that the reverse holds true -- if we don’t see the same cookie we have assumed they are not the same person. In other words, we think these are two different people because they have different cookies.
News that users delete cookies blows this out of the water. How significant a problem this is depends on whose study you read. Estimates range from a high of 55 percent down to 30 percent.
In all cases the percentage is enough for it to matter.
This means that we have been over-estimating the number of unique visitors, under-estimating visit frequency, and have no idea about lifetime value (LTV) or any other metric that depends on understanding repeat-visit cycles. We’ve been off by at least 30 percent, maybe more.
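To make the inflation concrete, here's a back-of-the-envelope sketch under an assumed deletion rate. The numbers are illustrative, not taken from any of the studies mentioned above:

```python
# Illustration: how cookie deletion inflates unique-visitor counts.
# Assume 1,000 real visitors over a reporting period, and that 30%
# of them delete their cookie once during that period, so each of
# those people is issued a second cookie and counted twice.
real_visitors = 1000
deletion_rate = 0.30

counted_uniques = real_visitors + round(real_visitors * deletion_rate)
overcount = counted_uniques / real_visitors - 1

print(counted_uniques)      # 1300
print(f"{overcount:.0%}")   # 30% over-estimate of unique visitors

# Frequency (visits per "unique") is correspondingly under-estimated:
total_visits = 2600  # say each real person visited 2.6 times on average
print(total_visits / real_visitors)    # true frequency: 2.6
print(total_visits / counted_uniques)  # reported frequency: 2.0
```

If people delete cookies more than once per period, the overcount grows further, which is why the real-world estimates vary so widely.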
How do you feel about a 30 percent margin of error on your ROI?
It means all our methods for identifying users are unreliable.
Poland to the rescue
Soon after the news about cookies started to break, I was contacted by two researchers from Poland: Magdalena Urbanska and Thomas Urbanski. They believe they have a solution. For some time I was under a nondisclosure agreement because they hadn’t published their research, but all can now be revealed.
Their solution is to do away with a single method and use a hierarchy of steps to determine if we have a unique visitor.
Before I detail the steps, it’s time to take the paradigm shift. Here it is:
We have been assuming that we can use a single method to identify unique individuals. We have been looking for yes-no answers and absolute numbers. We have done all the analysis within the framework of a single software system. We can’t do this any more. No single test is perfectly reliable, so we have to apply multiple tests. Some of those tests yield yes-no answers, and some of them yield probabilities, so the count of unique visitors will be a probabilistic estimate. Some of the tests depend on knowledge of IP topology, so we can’t restrict our analysis to a confined block of data analyzed by an isolated system.
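One way to read that shift in code: instead of a single yes/no test, each pair of sessions is scored by several tests, each contributing evidence, and the answer is a probability rather than a verdict. This is my own sketch of the general idea; the test weights are invented for illustration and this is emphatically not Magdalena and Thomas's actual algorithm:

```python
# Sketch: combining several imperfect tests into a probabilistic
# estimate that two sessions belong to the same person.
# The weights below are invented for illustration only.

def same_person_probability(session_a, session_b):
    """Return an estimated probability that two sessions share a visitor."""
    if session_a["cookie"] and session_a["cookie"] == session_b["cookie"]:
        return 0.99  # same cookie: a near-certain yes
    score = 0.0
    if session_a["ip"] == session_b["ip"]:
        score += 0.5  # same IP: suggestive, but shared and dynamic IPs exist
    if session_a["ua"] == session_b["ua"]:
        score += 0.3  # same user agent: weak evidence on its own
    return min(score, 0.95)  # without a cookie match, never fully certain

a = {"cookie": None, "ip": "203.0.113.7", "ua": "Mozilla/5.0"}
b = {"cookie": None, "ip": "203.0.113.7", "ua": "Mozilla/5.0"}
print(same_person_probability(a, b))  # 0.8: probable, not certain
```

The point of the sketch is the shape of the answer: a cookie match short-circuits to near-certainty, while the weaker tests accumulate into a probability that can never reach yes or no on its own.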
In a nutshell: to determine a web metric we should apply multiple tests, not just count one thing.
The Magdalena and Thomas methodology
The tests are applied in order, and Magdalena and Thomas don’t apply the same weight to each one. They tell me their analysis of IP topology uses some smart technology they’d rather keep to themselves. Some tests produce overlapping probability distributions rather than discrete groups.
The problem with cookie deletion is not that it happens, but that we’ve been relying on a single method for identifying people.
We have to move to a world in which we identify unique visitors by a series of tests. These tests have to take into account the way the internet is built. The result will be a statistical estimate, not an absolute number. The degree of certainty we hold about unique visitors will vary -- some visitors will be identified with near certainty, some will be little better than guesses. This means analysis should separate unique visitors according to the certainty of their identification, rather than treating them as a homogeneous mass.
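Separating visitors by certainty leads naturally to reporting a range rather than one number. A minimal sketch, with made-up per-visitor probabilities standing in for the output of the tests above:

```python
# Sketch: reporting unique visitors as a range instead of one number.
# Each candidate visitor carries the probability that they are truly a
# distinct person; these values are invented for illustration.
candidates = [0.99, 0.99, 0.95, 0.8, 0.8, 0.5, 0.5, 0.3]

expected = sum(candidates)                          # probabilistic point estimate
near_certain = sum(p >= 0.95 for p in candidates)   # floor: high-certainty tier
possible = len(candidates)                          # ceiling: every candidate

print(f"unique visitors: {near_certain}-{possible} (expected ~{expected:.1f})")
```

Different analyses can then draw on different parts of the range: frequency and LTV calculations might use only the high-certainty tier, while reach estimates use the ceiling.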
As a general principle I think this is the way forward -- in the long run many key metrics will morph from single numbers into ranges. We’ll derive those ranges through multiple tests instead of just a basic count. We’ll use different portions of these ranges for different forms of analysis. Web analytics really is a branch of statistics, not just a fancy form of counting.
Talk to me if you want to discuss this, or any other issue.