The Big Weakness of Split Testing: Finding New Mountains
At our startup, we run a lot of split tests (aka A/B tests) when working on new features. When I was first introduced to the idea, it was intuitively appealing: an approach in the spirit of the scientific method. With the power of numbers, we wouldn't have to rely on vague, qualitative user feedback to make decisions, right? Over a year into my job and after running dozens of split tests, I now have a deeper appreciation for the power of split tests done correctly, but also for their weaknesses. I especially appreciate what I now view as the biggest weakness: they aren't good for helping you find new mountains to chase.
In essence, a split test is like a controlled science experiment: you expose a random group of users to a change and see how their engagement compares to that of users in the control condition. Afterwards, you can measure how statistically strong the results are; a simple t-test can be enough to indicate whether the difference is meaningful.
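To make that concrete, here is a minimal sketch of the analysis step, assuming you've collected a per-user engagement number for each group. The figures are made up for illustration; real tests would use your own metrics and sample sizes.

```python
# Minimal sketch of analyzing a split test with a two-sample t-test.
# The engagement numbers are simulated purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated per-user engagement (e.g. sessions per week) for each group;
# the variant has a small real lift of +0.2 baked in.
control = rng.normal(loc=10.0, scale=3.0, size=5000)
variant = rng.normal(loc=10.2, scale=3.0, size=5000)

t_stat, p_value = stats.ttest_ind(variant, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Difference is statistically significant at the 5% level.")
else:
    print("No significant difference detected.")
```

For a click-through or conversion metric (a yes/no outcome per user) you'd typically reach for a two-proportion test instead, but the overall shape of the analysis is the same.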
Split tests have a lot of advantages, including:
- Objectively testing a difference on a large group of users to see how it impacts metrics
- Isolating differences in features or designs that will cause a change in user behavior
- Getting a better sense of the quantitative magnitude of a change
- In many cases, obtaining results fairly quickly
Not everything is perfect, however. Some weaknesses of split testing include:
- Not being able to tell why something happened. A button colored blue performed worse than the same button colored red - but why did blue vs. red make a difference?
- Becoming confusing as soon as you test more than one change at a time. Not only did you change the color of the button, but also the text on the button - if metrics go down, was it because of the color or the text? And despite knowing this, you will be tempted to run compound split tests to try to move product development faster.
- Only showing you the impacts to metrics that you analyze. So, changing the color of the button led to more users clicking on it, which is great! ...But, conversions of those users into purchasers decreased, which you weren't focused on for the test. If you never think to ask about that second step, you might be temporarily blind to that effect.
- Obscuring long-term effects of a change. That red button causes more users to click, but also causes them to irrationally hate your product and never come back. Split tests usually run over the scale of days, but an effect of the test might only manifest over the scale of months!
- Entangling you in statistical traps, such as providing falsely optimistic outcomes if you check results repeatedly while the data is being collected, generating false positives when testing too many metrics simultaneously, etc.*
On top of those, there is one major weakness of split tests that I'm still trying to learn how to get around: testing major changes to a product. This situation usually stems from the desire to completely re-do a significant aspect of the UX. It might be a redesign of a critical page, or an attempt to apply a new "philosophy" to the user's experience. What's more frustrating is that these situations are usually accompanied by extreme optimism. Of course this two-year-old sign-up flow isn't optimal for our users, and of course we can do a better job now, so our brand-spanking new design must perform better, right?
Wrong. A split test of your wonderful, new idea has actually, inexplicably, and quite surprisingly hurt the one major metric you were trying to improve.
But not all hope is lost. A key point to remember is that split testing is not good for combinations of changes, and a revamp is exactly that. The big initiative is probably trying to find a potentially higher global maximum in a metric or experience, rather than the local maximum closest to your current situation. Split tests are not good for this; they might help you find the optimal color for a button, but they won't answer the question of whether the page with the button should even exist.
I haven't discovered a foolproof guide for proceeding in this situation, but I suspect that two kinds of information will help:
- Directed, qualitative user feedback - specific feedback about what's lacking in the current experience, how users use the product, any pain points they have with it, why they don't currently use your product in the way you would hope, etc.
- Guiding principles for product development - users determine the success of your product, but perhaps you know something that users don't. You may want to develop a set of longer-term product design principles, ones that will predict how your industry will be different or how your product will change in the future (plans involving monetization and user acquisition can make an impact here).
Gathering useful qualitative feedback and formulating a strong set of principles is not easy. But if you have that kind of support, you should be more comfortable with bad split test results for the first version of your large initiative, and press onwards with your eyes on a bigger prize. You will be able to iterate on the idea to reach, ultimately, a much higher maximum than you could've achieved with smaller, incremental tests - you might find a much bigger mountain to climb.
* The repeated-checking phenomenon is explained well here and here. You can get a lot of false positives when testing many metrics because, even at a 5% significance level, 1 in 20 comparisons will produce a false positive by chance alone. If each split test considers 10 metrics, you'd expect about one false positive in every set of 2 tests.
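The footnote's arithmetic is easy to check with a quick simulation. Assuming 10 independent metrics, each tested at a 5% significance level, and no real effect on any of them:

```python
# Simulate split tests where the change has NO real effect on any metric,
# and count how often at least one of the 10 metrics still "wins".
import random

random.seed(42)

N_TESTS = 10_000
N_METRICS = 10
ALPHA = 0.05

false_positive_tests = 0
total_false_positives = 0
for _ in range(N_TESTS):
    # Under the null hypothesis, each metric's p-value is uniform on [0, 1],
    # so it falls below ALPHA with probability ALPHA.
    hits = sum(1 for _ in range(N_METRICS) if random.random() < ALPHA)
    total_false_positives += hits
    if hits > 0:
        false_positive_tests += 1

print(f"Tests with >= 1 false positive: {false_positive_tests / N_TESTS:.1%}")
print(f"Avg false positives per test:   {total_false_positives / N_TESTS:.2f}")
```

Roughly 40% of such tests flag something spurious (1 - 0.95^10 ≈ 0.40), and the average is about 0.5 false positives per test, i.e. about one per two tests, matching the footnote.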