
What's the solution to cookie deletion?

New Methods of Analysis

Forget single numbers, you need ranges.

This article asks you to make a paradigm shift in how you think about identifying unique visitors. It describes new, cutting-edge methodologies for identifying people, taking us on a journey from first-generation web analytics to the second.

What is a paradigm shift?

First, a bit of background about the concept of a paradigm shift.

A paradigm shift is a sudden jump from one way of thinking to another. The concept comes from Thomas Kuhn’s 1962 book, “The Structure of Scientific Revolutions.” He wrote that science doesn’t evolve gradually, a little at a time, but advances as a series of peaceful plateaus punctuated by violent upheavals. During these upheavals the prevailing conceptual world view (or paradigm) is replaced by a new one. Think Darwin, Einstein, Galileo. He showed that the intellectual violence of these upheavals was caused by people stubbornly trying to hang on to the old paradigm they were used to, even when it didn’t work any more. In science this generally means we move into a new paradigm only when the scientists who grew up with the old one retire.

In this article I hope to shift, fundamentally and gently, the way we think about metrics.

The problem with users

We need to identify unique users on the web. It’s fundamental. We need to know how many people visit, what they read, for how long, how often they return, and at what intervals. These are the “atoms” of our metrics. Without this knowledge we really can’t do much.

If you look in detail at how metrics software works, you’ll see that it operates by negative logic. The raw data is a stream of page views and who they were served to. The software is designed not to identify which set of pages went to one person, but to identify sets which went to different people. It determines who the unique visitors were by looking at pages viewed at roughly the same time and deciding whether they went to the same person or not. In other words, to count unique individuals, you first have to tell different people apart.

We have two methods for identifying unique individuals. We can look at their IP address plus their User Agent, on the basis that every unique combination constitutes a unique person. This is an audit-approved methodology, but we know it is not reliable: we can’t guarantee a one-to-one relationship between IP address and individual. ISPs often dynamically allocate a different IP address every time a user dials in, so one person can have many IP addresses over time, and the same IP address can be assigned to different people. Similarly, we know lots of people have the same User Agent. IP+User Agent is therefore only a rough estimate.
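To illustrate the idea (this is just a sketch of the general approach, not how any particular analytics package implements it; the function and sample values are invented), the IP+User Agent method amounts to treating each distinct combination of the two fields as one visitor:

```python
from hashlib import sha1

def visitor_key(ip: str, user_agent: str) -> str:
    """Treat each distinct IP + User Agent combination as one 'visitor'.
    This mis-counts for exactly the reasons given above."""
    return sha1(f"{ip}|{user_agent}".encode()).hexdigest()

# Two page views from the same IP and browser collapse into one visitor...
assert visitor_key("203.0.113.7", "Mozilla/4.0") == visitor_key("203.0.113.7", "Mozilla/4.0")
# ...but the same person on a freshly allocated dial-up IP looks like someone new,
# and two people behind one proxy with the same browser look like one person.
assert visitor_key("203.0.113.9", "Mozilla/4.0") != visitor_key("203.0.113.7", "Mozilla/4.0")
```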

Cookies were supposed to increase certainty. We can plant a unique identifier on someone’s computer, and if we see it there next time, we can be almost certain they are the same person. The problem has been that we’ve applied flawed logic and assumed the reverse holds true -- if we don’t see the same cookie, we have assumed they are not the same person. In other words, we think two visits are from two different people because they carry different cookies. But a visitor who deletes cookies between visits receives a fresh identifier and is silently counted as someone new.

News that users delete cookies blows this out of the water. How significant a problem this is depends on whose study you read. Estimates range from a high of 55 percent down to 30 percent.

In all cases the percentage is enough for it to matter.

This means that we have been over-estimating the number of unique visitors, under-estimating frequency, and have no idea about lifetime value (LTV) or any other metric based on understanding repeat-visit cycles. We’ve been off by at least 30 percent, maybe more.
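To make the arithmetic concrete, here is a minimal sketch with invented numbers, showing how a single mid-period cookie deletion per affected visitor inflates the unique-visitor count and deflates measured frequency:

```python
# Hypothetical figures, for illustration only.
real_visitors = 1000     # actual people visiting this month
visits_each = 4          # each person makes four visits
deletion_rate = 0.30     # share of people who clear cookies once mid-month

# A visitor who clears cookies gets a fresh ID and is counted again.
deleters = int(real_visitors * deletion_rate)                  # 300
counted_uniques = (real_visitors - deleters) + deleters * 2    # 700 + 600 = 1300

total_visits = real_visitors * visits_each                     # 4000
print(f"Counted uniques: {counted_uniques} vs real: {real_visitors}")  # 30% over-count
print(f"Measured frequency: {total_visits / counted_uniques:.2f} "
      f"vs real: {total_visits / real_visitors:.2f}")                  # 3.08 vs 4.00
```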

How do you feel about a 30 percent margin of error on your ROI?

It means all our methods for identifying users are unreliable.

Poland to the rescue

Soon after the news about cookies started to break, I was contacted by two researchers from Poland: Magdalena Urbanska and Thomas Urbanski. They believe they have a solution. For some time I was under a nondisclosure agreement because they hadn’t published their research, but all can now be revealed.

Their solution is to do away with a single method and use a hierarchy of steps to determine if we have a unique visitor.

Before I detail the steps, it’s time to take the paradigm shift. Here it is:

We have been assuming that we can use a single method to identify unique individuals. We have been looking for yes-no answers and absolute numbers. We have done all the analysis within the framework of a single software system. We can’t do this any more. No single test is perfectly reliable, so we have to apply multiple tests. Some of those tests yield yes-no answers, and some of them yield probabilities, so the count of unique visitors will be a probabilistic estimate. Some of the tests depend on knowledge of IP topology, so we can’t restrict our analysis to a confined block of data analyzed by an isolated system.

In a nutshell: to determine a web metric we should apply multiple tests, not just count one thing.

The Magdalena and Thomas methodology

Each of these steps is applied in order (a code sketch of the whole pipeline follows the list):

  • If the same cookie is present on multiple visits, it’s the same person.
  • We next sort our visits by cookie ID and look at the cookie life spans. Different cookies that overlap in time are different users. In other words, one person can’t have two cookies at the same time.
  • This leaves us with sets of cookie IDs that could belong to the same person because they occur at different times, so we now look at IP addresses.
  • We know some combinations of IP addresses cannot belong to one person: those that would require the person to move faster than is physically possible. If we see one IP address in New York, then one in Tokyo 60 minutes later, we know it can’t be the same person, because you can’t get from New York to Tokyo in one hour.
  • This leaves us with those IP addresses that can’t be eliminated on the basis of geography. Here we switch emphasis: instead of looking for proof of difference, we look for combinations which indicate it’s the same person. These are IP addresses we know to be owned by the same ISP or company.
  • We can refine this test by going back over the IP address/Cookie combination. We can look at all the IP addresses that a cookie had. Do we see one of those addresses used on a new cookie? Do both cookies have the same User Agent? If we get the same pool of IP addresses showing up on multiple cookies over time, with the same User Agent, this probably indicates the same person.
  • You can also throw Flash Shared Objects (FSOs) into the mix. FSOs can’t replace cookies, but if a visitor’s browser supports Flash you can use FSOs to record cookie IDs. This way Flash can report to the system all the cookies a machine has held. In addition to identifying users, you can use this information to understand the cookie behavior of your Flash users and extrapolate to the rest of your visitor population.
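Urbanska and Urbanski haven’t published their weighting or their IP-topology analysis, so the sketch below is only my illustration of the hierarchy’s general shape, not their implementation. The Visit record, the availability of a geo-IP lookup, the 900 km/h travel threshold, the evidence weights, and the merge threshold are all invented for the example:

```python
import math
from dataclasses import dataclass

@dataclass
class Visit:
    cookie_id: str
    ip: str
    user_agent: str
    start: float   # hours since the start of the analysis period
    end: float
    lat: float     # geolocated IP position -- assumes a geo-IP lookup exists
    lon: float

MAX_SPEED_KMH = 900.0  # roughly airliner speed; an invented, tunable threshold

def great_circle_km(a: Visit, b: Visit) -> float:
    """Haversine distance between the geolocated positions of two visits."""
    p1, p2 = math.radians(a.lat), math.radians(b.lat)
    dlat, dlon = p2 - p1, math.radians(b.lon - a.lon)
    h = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def cookies_overlap(va: list, vb: list) -> bool:
    """Step 2: two cookies alive at the same moment imply two different people."""
    a0, a1 = min(v.start for v in va), max(v.end for v in va)
    b0, b1 = min(v.start for v in vb), max(v.end for v in vb)
    return a0 <= b1 and b0 <= a1

def travel_impossible(va: list, vb: list) -> bool:
    """Step 4: rule out pairs that would require impossibly fast travel."""
    for x in va:
        for y in vb:
            km = great_circle_km(x, y)
            hours = abs(y.start - x.end)
            if km > 50 and km / max(hours, 0.01) > MAX_SPEED_KMH:
                return True
    return False

def same_person_score(va: list, vb: list) -> float:
    """Steps 5-6: a shared IP pool and a shared User Agent are evidence of one
    person. These weights are invented; the real weighting is unpublished."""
    score = 0.0
    if {v.ip for v in va} & {v.ip for v in vb}:
        score += 0.6   # same IP address seen under both cookies
    if {v.user_agent for v in va} & {v.user_agent for v in vb}:
        score += 0.3   # same User Agent on both cookies
    return score

def estimate_uniques(visits: list, threshold: float = 0.6) -> int:
    """Group visits by cookie (step 1), eliminate provably different pairs,
    then merge cookie pairs whose evidence of sameness clears the threshold."""
    by_cookie: dict = {}
    for v in visits:
        by_cookie.setdefault(v.cookie_id, []).append(v)
    cookies = list(by_cookie)
    parent = {c: c for c in cookies}   # union-find over cookie IDs
    def find(c):
        while parent[c] != c:
            c = parent[c]
        return c
    for i, a in enumerate(cookies):
        for b in cookies[i + 1:]:
            va, vb = by_cookie[a], by_cookie[b]
            if cookies_overlap(va, vb) or travel_impossible(va, vb):
                continue                     # provably different people
            if same_person_score(va, vb) >= threshold:
                parent[find(b)] = find(a)    # probably the same person
    return len({find(c) for c in cookies})
```

A real system would have to do more: resolve conflicting merges (two cookies that each match a third but overlap each other), exploit ISP ownership of address blocks, and weight the tests individually -- which is exactly where the next point comes in.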

Magdalena and Thomas don’t apply the same weight to each test, and they tell me their analysis of IP topology uses some smart technology they’d rather keep to themselves. Some tests produce overlapping probability distributions rather than discrete groups.

Conclusions

The problem with cookie deletion is not that it happens, but that we’ve been relying on a single method for identifying people.

We have to move to a world in which we identify unique visitors by a series of tests. These tests have to take into account the way the internet is built. The result will be a statistical estimate, not an absolute number. The degree of certainty we hold about unique visitors will vary -- some visitors will be identified with near certainty, some will be close to guesses. This means analysis should separate unique visitors according to the certainty we have about their identification, rather than treating them as a homogeneous mass.

As a general principle I think this is the way forward -- in the long run many key metrics will morph from single numbers into ranges. We’ll derive those ranges through multiple tests instead of just a basic count. We’ll use different portions of these ranges for different forms of analysis. Web analytics really is a branch of statistics, not just a fancy form of counting.
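As a small illustration of what a range-based report might look like (the tiers and counts below are invented), the unique-visitor figure becomes a bounded estimate:

```python
# Invented figures: visitors identified with near certainty versus
# ambiguous identities we could neither confidently merge nor separate.
certain_uniques = 8200
ambiguous_ids = 3100

# Lower bound: every ambiguous ID collapses into an already-counted visitor.
# Upper bound: every ambiguous ID is a distinct person.
low, high = certain_uniques, certain_uniques + ambiguous_ids
print(f"Unique visitors this month: {low:,} to {high:,}")   # 8,200 to 11,300
```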

Talk to me if you want to discuss this, or any other issue.
