The Big Weakness of Split Testing: Finding New Mountains

22 April 2014

At our startup, we run a lot of split tests (aka A/B tests) when working on new features. When I was first introduced to the idea, it struck me as intuitively appealing, an approach akin to the scientific method. With the power of numbers, we wouldn't have to rely on vague, qualitative user feedback to make decisions, right? Over a year into my job and dozens of split tests later, I have a deeper appreciation for the power of split tests done correctly, but also for their weaknesses. I especially appreciate what I now view as the biggest one: they aren't good for helping you find new mountains to chase.

In essence, a split test is a controlled science experiment: you expose a random group of users to a change and compare their engagement to that of users in the control condition. Afterwards, you can measure how statistically strong the results are - a simple t-test is often enough to tell whether the difference is meaningful.
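
To make that concrete, here's a minimal sketch in Python of evaluating a split test with a t-test. All of the numbers are made up, and the 0.05 threshold is just the conventional choice:

```python
# Minimal sketch of evaluating a split test with a t-test (hypothetical data).
# Each list holds a per-user engagement measure, e.g. pages viewed per visit.
from scipy import stats

control = [3, 5, 2, 4, 6, 3, 2, 5, 4, 3]   # users who saw the old design
variant = [4, 6, 3, 5, 7, 4, 3, 6, 5, 4]   # users who saw the new design

# Welch's t-test: is the variant's mean engagement different from control's?
t_stat, p_value = stats.ttest_ind(variant, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# A p-value below the chosen significance level (commonly 0.05) suggests
# the difference is unlikely to be pure chance.
if p_value < 0.05:
    print("Difference looks statistically meaningful.")
else:
    print("Not enough evidence that the change did anything.")
```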

Split tests have a lot of advantages, including:

  * Decisions rest on quantitative evidence instead of vague or subjective impressions.
  * Randomly splitting users isolates the effect of the change from everything else going on at the same time.
  * Standard statistics give you a concrete measure of how much to trust the result.

Not everything is perfect, however. Some weaknesses of split testing include:

  * Checking many metrics, or checking them repeatedly, inflates the rate of false positives (see the note at the end of this post).
  * A test tells you which variant performed better, but not why, and not what you should have built instead.

On top of those, there is one major weakness of split tests that I'm still learning how to get around: testing major changes to a product. This situation usually stems from a desire to completely redo a significant aspect of the UX. It might be a redesign of a critical page, or an attempt to apply a new "philosophy" to the user's experience. What makes these situations more frustrating is that they are usually accompanied by extreme optimism. Of course this two-year-old sign-up flow isn't optimal for our users, and of course we can do a better job now, so our brand-spanking-new design must perform better, right?

Wrong. A split test of your wonderful new idea has actually, inexplicably, and quite surprisingly hurt the very metric you were trying to improve.

But not all hope is lost. A key point to remember is that split testing is not good at evaluating combinations of changes, and a revamp is exactly that: many changes bundled together. In fact, the big initiative is probably trying to find a higher global maximum in a metric or experience, rather than the local maximum closest to your current design. Split tests are not good for this; they might help you find the optimal color for a button, but they won't answer the question of whether the page with the button should even exist. The toy sketch below illustrates the difference.
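
Here's a toy illustration of the local-maximum trap: treat incremental split testing as hill climbing on a one-dimensional "engagement" landscape. Everything here - the landscape, the step size, the numbers - is invented purely for illustration:

```python
# Toy model: incremental split testing as hill climbing on a made-up
# "engagement" landscape with a small hill near x=1 and a taller one near x=6.
import math

def engagement(x):
    return math.exp(-(x - 1) ** 2) + 2 * math.exp(-((x - 6) ** 2) / 2)

def hill_climb(x, step=0.1, iters=200):
    # Each iteration is one "split test": try a small tweak in each
    # direction and keep whichever version wins.
    for _ in range(iters):
        best = max((x - step, x, x + step), key=engagement)
        if best == x:
            break  # no nearby tweak wins: we've hit a local maximum
        x = best
    return x

x = hill_climb(0.0)  # start from today's design
print(f"ended at x = {x:.1f} with engagement = {engagement(x):.2f}")
# Ends on the small hill near x=1 (engagement ~1.0) and never discovers
# the taller hill near x=6 (~2.0): small steps can't cross the valley.
```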

I haven't discovered a foolproof guide for proceeding in this situation, but I suspect that two kinds of information will help:

  1. Directed, qualitative user feedback - specific feedback about what's lacking in the current experience, how users use the product, any pain points they have with it, why they don't currently use your product in the way you would hope, etc.
  2. Guiding principles for product development - users determine the success of your product, but perhaps you know something that users don't. You may want to develop a set of longer-term product design principles, ones that predict how your industry will be different or how your product will change in the future (plans for monetization and user acquisition can inform these principles).

Gathering useful qualitative feedback and formulating a strong set of principles are not easy. But with that kind of support, you can be more comfortable with bad split test results for the first version of your large initiative, and press onwards with your eyes on a bigger prize. You will be able to iterate on the idea to reach, ultimately, a much higher maximum than you could've achieved with smaller, incremental tests - you might find a much bigger mountain to climb.


Notes:

* The repeated-checking phenomenon has been explained well elsewhere. You can get a lot of false positives when testing many metrics because, even at a 5% significance level, 1 in 20 checks will come up positive by pure chance. If each split test considers 10 metrics, you should expect roughly one false positive for every 2 tests you run.
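
As a rough illustration, here's a small simulation (hypothetical parameters throughout) of running null A/B tests - ones where the variant truly changes nothing - while checking 10 metrics per test:

```python
# Simulate the repeated-checking problem: null A/B tests (no real effect)
# with 10 metrics each, counting how often at least one metric looks
# "significant" at the 5% level purely by chance.
import random
from scipy import stats

random.seed(42)
ALPHA, METRICS, TESTS, USERS = 0.05, 10, 100, 500

false_positive_tests = 0
for _ in range(TESTS):
    significant = False
    for _ in range(METRICS):
        # Both groups come from the same distribution: any "win" is noise.
        a = [random.gauss(0, 1) for _ in range(USERS)]
        b = [random.gauss(0, 1) for _ in range(USERS)]
        if stats.ttest_ind(a, b).pvalue < ALPHA:
            significant = True
            break
    false_positive_tests += significant

# Expect roughly 1 - 0.95**10, i.e. about 40% of tests, to report a
# spurious "winner" on at least one metric.
print(f"{false_positive_tests} of {TESTS} null tests had a false positive")
```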

