Strategic Web Usability

Split-testing made simple (part 4)

Last time, I talked a bit about how to launch your split-test, including a few sampling issues to watch for. In this last entry in the series, I want to cover some basic statistics that will help you determine how long to run your test and whether the results are meaningful.

In experimental design, the key statistical problem boils down to this: if you manipulate two groups and observe a difference, is the difference because of what you did or the result of some other force or natural difference between the groups? Random sampling solves part of that problem, but much of the rest is a numbers game. You have to collect enough data to work out the noise in the system, so to speak.

Statistically speaking, we use two major measuring sticks: the mean (or average) and some measure of variability. In other words, what is the average behavior of the group you're measuring, and how much do the people in that group differ from one another? Let's look at a simple example:

Imagine two scenarios representing two different split tests. In Scenario #1, groups A and B have fairly different average scores, and also fairly low variability (indicated by the red and orange error bars). In other words, the people in those groups don't differ much from the average. In Scenario #2, the groups have more similar averages, and much higher variability. In #1, you can safely say that these two groups are different, but what about #2? Yes, the averages are different, but the two groups differ so much internally that calling A and B "different" is dicey at best.
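To put rough numbers on those two scenarios, here is a minimal Python sketch. The scores are invented purely for illustration, and `separation` is a hypothetical helper (difference in means divided by the average standard deviation), not a formal statistical test.

```python
from statistics import mean, stdev

def summarize(scores):
    """Return (mean, sample standard deviation) for a list of scores."""
    return mean(scores), stdev(scores)

def separation(a, b):
    """Gap between the two means, measured in units of the groups' average spread."""
    mean_a, sd_a = summarize(a)
    mean_b, sd_b = summarize(b)
    return abs(mean_a - mean_b) / ((sd_a + sd_b) / 2)

# Hypothetical scores, made up for this example only.
scenario_1_a = [48, 50, 52, 49, 51]   # tight cluster around 50
scenario_1_b = [68, 70, 72, 69, 71]   # tight cluster around 70
scenario_2_a = [20, 80, 45, 70, 35]   # average 50, widely scattered
scenario_2_b = [30, 90, 55, 75, 50]   # average 60, widely scattered

for name, a, b in [("Scenario #1", scenario_1_a, scenario_1_b),
                   ("Scenario #2", scenario_2_a, scenario_2_b)]:
    print(name, "separation:", round(separation(a, b), 2))
```

In the first scenario the gap between the averages is many times larger than the spread inside each group. In the second, the gap is less than half the spread, which is why calling the groups "different" is risky.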

So, how do you know the difference you measure in conversions is real? There's an entire area of mathematics and thousands of articles on this subject, but in practice it comes down to sample size. Since you're running your experiment on the internet, and don't have much control over your visitors, you'll need fairly large numbers. If you have a decent conversion rate (2% or greater), plan on A and B groups in the thousands of visitors each, something on the order of 10,000 total visitors (5,000/5,000).
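One common way to check a conversion difference is a two-proportion z-test. The sketch below is one standard approach rather than the only one; it assumes each visitor either converts or doesn't, that the groups are independent, and that the samples are large enough for the normal approximation. The function name and the visitor counts are hypothetical.

```python
from math import sqrt, erfc

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate: the best estimate if A and B truly convert at the same rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

# Hypothetical test: 5,000 visitors per group, 100 vs. 130 conversions.
z, p = two_proportion_z_test(100, 5000, 130, 5000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A small p-value (0.05 is a common, if arbitrary, cutoff) suggests the difference is unlikely to be noise alone. With the same conversion rates and only a few hundred visitors per group, the same calculation would not come close to that threshold.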

In the end, the smaller the difference between the two groups, the more visitors you'll need before you can trust it. Try to be practical about it, and keep your goals in mind. If a 0.5% clickthrough difference makes a serious financial impact on your ad spending, take it seriously and get the data. A fairly small difference in conversion rate can mean real dollars, and if the cost and risk of using A vs. B is marginal, listen to the numbers. If the cost of one alternative vs. the other is high or the decision is a risky one, make sure your numbers are reliable and stack up against what the marketplace tells you.
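To see how quickly the required traffic grows as the difference shrinks, here is a rough sketch using a common normal-approximation formula for comparing two proportions. The defaults (5% significance, 80% power) are conventional assumptions on my part, and `visitors_per_group` is a hypothetical helper name. It needs Python 3.8 or later for `statistics.NormalDist`.

```python
from math import ceil
from statistics import NormalDist

def visitors_per_group(p_a, p_b, alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH group to detect p_a vs. p_b."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)          # desired chance of detection
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return ceil((z_alpha + z_power) ** 2 * variance / (p_a - p_b) ** 2)

# Hypothetical conversion rates: a 2% baseline against two possible improvements.
for p_b in (0.03, 0.025):
    print(f"2.0% vs {p_b:.1%}:", visitors_per_group(0.02, p_b), "visitors per group")
```

Halving the difference you want to detect roughly quadruples the visitors you need, since the difference is squared in the denominator. Treat the output as a planning estimate, not a guarantee.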

©2012 User Effect, Inc.