Strategic Website Usability

Measuring Usability: Split (A/B) Testing

In part two of this series, I wrote about conversion rates, what they are and how to measure them. So, now that you've got something to measure, what's the next step? You could sit around, just measuring things, hoping for the best. More likely, though, you'll want to make changes to your website, presumably changes that will improve conversion and help motivated visitors become buyers.

Most of the time, people make these changes and never look back. They take their best guess and hope for positive results. Ideally, though, you'd like to know whether a change was for the better or worse, in a real, quantifiable sense. Enter the split test, sometimes called an "A/B" test. The name is pretty self-explanatory: you have two versions of something (copy, a graphic, a layout element, etc.), you split your visitors between them, and you find out whether A or B performs better.

Why Split Test?
So, why not just run version A for a while, then run version B for a while, and see what works better (sometimes called "sequential" testing)? To simplify a whole lot of statistical jargon, it comes down to this: you never know what might have changed over time. People may go on vacation, the market may slump, you (or your competitor) may launch a major marketing campaign, the Fed might raise rates, Starbucks might introduce a size bigger than Venti, etc. In the end, you want to have some confidence that any difference you measure between groups A and B is because A and B are actually different, not because of some external factors muddying up your results.
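To make "running A and B at the same time" concrete, here is a minimal sketch of one common way to split visitors. It assumes each visitor has some stable identifier (a cookie value, for example); the function name `assign_variant` and the experiment label are made up for illustration, not part of any particular tool.

```python
import hashlib


def assign_variant(visitor_id: str, experiment: str = "homepage-headline") -> str:
    """Return "A" or "B" for a visitor, stable across repeat visits."""
    # Hash the experiment name together with the visitor ID so that
    # different experiments split the same audience independently.
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Treat the parity of the hash as a coin flip: about half of all
    # visitors land in each group, and a given visitor always gets
    # the same answer.
    return "A" if int(digest, 16) % 2 == 0 else "B"


if __name__ == "__main__":
    counts = {"A": 0, "B": 0}
    for n in range(10_000):
        counts[assign_variant(f"visitor-{n}")] += 1
    print(counts)  # both groups should be close to 5,000
```

Because both groups are collecting visitors over the same days, a vacation week or a competitor's campaign hits A and B alike instead of landing on only one of them.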

What Should You Test?
Split-testing originated in the advertising world and is often used on the web to test subtle differences in advertisements and landing pages (the pages people arrive on when they click on an ad). You might test changes in copy, layout, colors, button shapes, or just about any page element that potentially has an impact on your visitors. As for whether you should test big changes or small ones, there seem to be two camps on this subject. One side, which I tend towards, believes in starting small. That way, if you do find that B is better than A (or vice versa), you'll have some idea of why. In the long term, that will allow you to make more educated guesses about future changes. The other side says that you should go for big changes, as you'll get more bang for your buck (and your time). I can't completely deny the logic of that; it often comes down to what you need to accomplish and how long you have to accomplish it. A stable site that does pretty well may want to stick to an evolutionary process of incremental changes. A site that's having trouble or needs a major overhaul may require more radical methods.

Significance and Reliability
Sorry, but you just can't talk about testing without bringing up a bit of statistics. A split-test is essentially an experiment; you present two options to two groups of people and measure an outcome. When you're done, how do you know if that outcome is reliable? Even before that, how do you know when you're done?

Let's say you observe a difference: Version A converts at 1.5% and version B converts at 2.5%. That sounds good, but are the two groups really different? Simply put, maybe not. Most often, the problem is that the groups aren't big enough for the difference to be reliable. In statistics, we call this kind of reliability "significance". A significant difference is one that, to the best of our ability to measure it, represents a real difference and not just noise or poor measurement.
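One standard way to check a difference like this is a two-proportion z-test. The sketch below is a minimal version that assumes simple visitor and conversion counts for each group; the function name and the visitor numbers are illustrative, and this is only one of several reasonable tests.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pool both groups to estimate the rate we'd expect if A and B
    # were really the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("need at least one conversion and one non-conversion")
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Same 1.5% vs. 2.5% rates, two different (made-up) sample sizes.
    print(two_proportion_z_test(6, 400, 10, 400))
    print(two_proportion_z_test(60, 4000, 100, 4000))
```

With 400 visitors per group, the 1.5% vs. 2.5% gap does not clear the usual 0.05 threshold; with 4,000 per group, the very same rates do. The rates didn't change, only the amount of data behind them.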

Understanding all of the mathematics of significance is well beyond the scope of a blog entry, but your best defense is to collect plenty of data. Make sure that (1) your groups are split roughly 50/50 (in a straight A/B test), and (2) you measure a decent chunk of conversions. Depending on the size of the difference, you may need hundreds of conversions for each group to get a reliable result. Unfortunately, sometimes that's just not practical, but do the best you can.
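For a rough sense of how much data "plenty" is, the textbook sample-size formula for comparing two proportions can be sketched as follows. The 5% significance level and 80% power are conventional defaults that I'm assuming here, not numbers from this post, and the function name is made up.

```python
import math
from statistics import NormalDist


def visitors_per_group(rate_a, rate_b, alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH group to detect rate_a vs. rate_b."""
    if rate_a == rate_b:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    pooled = (rate_a + rate_b) / 2
    numerator = (
        z_alpha * math.sqrt(2 * pooled * (1 - pooled))
        + z_power * math.sqrt(rate_a * (1 - rate_a) + rate_b * (1 - rate_b))
    ) ** 2
    return math.ceil(numerator / (rate_a - rate_b) ** 2)


if __name__ == "__main__":
    print(visitors_per_group(0.015, 0.025))  # a full point of difference
    print(visitors_per_group(0.015, 0.020))  # half a point needs far more
```

Under these assumptions, telling 1.5% from 2.5% takes on the order of a few thousand visitors per group, and the requirement climbs quickly as the difference you're hunting for gets smaller.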

Some Final Advice
From my days as an experimental psychologist, I know how tempting it can be to peek at the data every chance you get. Don't. Bias is a powerful force, and it's best to let the test run its course and not interpret the data until you're done. Too often, people start to see the results trending the way they'd like, decide they've got enough information, and cut the experiment off early. Also, try not to make changes during the test, no matter how small they seem. If you make a mistake in the testing procedure that would bias the results, start over, as painful as that may be.
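If you'd like to see why peeking is risky, a small simulation makes the point. The sketch below runs made-up "A/A" tests, where both versions convert at exactly the same rate, and compares how often you'd declare a winner if you checked at ten points along the way versus only once at the end. All of the numbers (10% conversion rate, 1,000 visitors per group, ten looks) are assumptions chosen for illustration.

```python
import math
import random


def is_significant(conv_a, conv_b, n, z_crit=1.96):
    """Two-proportion z-test for two equal-sized groups of n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return False  # no variation yet, nothing to test
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return abs(conv_a - conv_b) / n / std_err > z_crit


def simulate(trials=400, visitors=1000, looks=10, rate=0.10, seed=1):
    """Return (false-positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    step = visitors // looks
    peek_hits = final_hits = 0
    for _ in range(trials):
        conv_a = conv_b = 0
        flagged_at_some_look = False
        for n in range(1, visitors + 1):
            # Both groups convert at the SAME rate, so any "winner" is noise.
            conv_a += int(rng.random() < rate)
            conv_b += int(rng.random() < rate)
            if n % step == 0 and is_significant(conv_a, conv_b, n):
                flagged_at_some_look = True
        peek_hits += int(flagged_at_some_look)
        final_hits += int(is_significant(conv_a, conv_b, visitors))
    return peek_hits / trials, final_hits / trials


if __name__ == "__main__":
    peeking, final_only = simulate()
    print(f"declared a winner at some peek: {peeking:.1%}")
    print(f"declared a winner at the end:   {final_only:.1%}")
```

Since the final look is also one of the ten peeks, the peeking rate can never come out lower than the single-look rate. In practice it tends to be noticeably higher, which is exactly the trap of stopping as soon as the results trend your way.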

Joshua Ledwell

 · Sunday, September 23
Nice post again Dr. Pete. This is a more accessible summary than past explanations I've seen. A/B testing is a great tool for testing when your infrastructure can support it, and when you have the volume or patience to gather good data.

Are there any online calculators that can quickly measure statistical significance? We used to use one at my previous job, but I can't find it now.

Dr. Pete

 · Sunday, September 23
Thanks, Joshua. Funny you should mention the calculator, as I'm working on one now. I've seen a couple that are decent, but their explanations aren't geared towards usability, and I'm trying to add a feature that will help you predict how many more visitors you need (if a test hasn't reached significance). If all goes well, I'll beta the A/B version this week (possibly doing more elaborate scenarios later, if people find it useful).
©2008 User Effect, Inc.