In our previous article, we wrote about the value of A/B testing (also referred to as “split testing”) and how eCommerce brands can set up and scale an effective A/B testing strategy. By getting the fundamentals right, brands can use A/B testing to generate unbiased, data-informed market insights. Large brands often run dozens of experiments at any given time and use the results to directly inform their product strategy.
However, we also touched briefly on the hidden complexity behind an A/B testing operation: in fact, one of the curses of A/B testing is its apparent simplicity. It’s easy to get lured into signing up for an A/B testing tool without a clear understanding of how it works, only to find yourself weeks later with tests that are inconclusive or just plain incorrect.
This article will dig deeper into this complexity, highlighting some of the most common pitfalls brands face when running experiments. We will also discuss the scenarios in which A/B testing is not the way to go and the alternative practices brands should adopt instead.
A/B testing is a form of statistical hypothesis testing. As such, it is prone to all the typical pitfalls of statistical analysis: most notably, if the data you’re using to run the analysis is incorrect, biased, or skewed in any way, the results of your analysis are very likely to display the same problems. Garbage in, garbage out.
Even with modern A/B testing platforms, it is still unbelievably easy to skew the results of an A/B test. Your tools can help you write less code and run the math for you, but they won’t be able to make strategic decisions for you–such as which metric to pick–or to shield your test from external factors–such as a concurrent A/B test skewing your test population.
With this in mind, let’s see some of the most common mistakes and oversights a brand might make when designing, executing, and analyzing an A/B test.
All A/B tests are subject to two different types of errors:
- Type I errors (false positives): the test reports a meaningful difference between the control and the experiment when, in reality, there is none.
- Type II errors (false negatives): the test fails to detect a difference that actually exists.
We can guard our tests against Type I and Type II errors by picking an appropriate significance level and statistical power: the lower our significance level, the more resilient our test will be against Type I errors; the higher our power, the more resilient it will be against Type II errors. Combined, these two numbers represent our desired confidence in the test result.
As an example, if we pick a significance level of 5% and a power of 80%, it means that:
- there is at most a 5% chance that the test reports a difference between variants when none actually exists (a false positive);
- there is an 80% chance that the test detects an effect of the minimum size we care about when it truly exists, which leaves a 20% chance of missing it (a false negative).
Higher confidence levels mean we can take more drastic actions as a result of our test but also require larger test populations, which may be challenging to reach for smaller brands.
A 5% significance level and 80% power are a good starting point for most A/B tests, but you should use your judgment: in some tests, the cost of making a mistake in either direction might be higher or lower than usual; in others, you might only need to worry about false positives or false negatives.
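As a rough illustration, here is how you could estimate the required sample size for a conversion-rate test using Python and the statsmodels library; the baseline and target rates below are purely illustrative, so substitute your own numbers.

```python
# A minimal sketch: per-variant sample size for a two-proportion test,
# assuming a 5% significance level and 80% power (illustrative rates below).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.03   # current conversion rate (illustrative)
target_rate = 0.036    # smallest lift worth detecting (illustrative)

effect_size = proportion_effectsize(baseline_rate, target_rate)
sample_size = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,              # significance level: at most 5% false positives
    power=0.80,              # 80% chance of detecting a real effect of this size
    ratio=1.0,               # equal traffic split between control and experiment
    alternative="two-sided",
)
print(f"Visitors needed per variant: {round(sample_size)}")
```

The result is per variant, so an even 50/50 split needs roughly double that number of total visitors before the test can be evaluated.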
Two interesting dynamics can threaten the validity of A/B tests, and they both stem from how your users react to change. These are more relevant for products with a high percentage of returning users, so you should be extra careful if your brand offers subscriptions, for instance, as they could prevent you from getting reliable results.
The Novelty Effect occurs when users engage with a product more than usual because they’re fascinated by a new feature or a change in an existing feature. For instance, users might purchase more if you introduce a brand-new loyalty program to drive incremental recurring revenue. However, as they adapt to the change, most people will gradually regress to the average user behavior patterns.
Change Aversion is the exact opposite and often happens with significant redesigns of existing functionality: you might find that loyal, long-time users engage less with your experiment, as they were used to the previous way of doing things and don’t like having their workflows disrupted. Over time, they will likely get used to the new design and re-engage with your brand as usual.
For both problems, the solution is to segment your A/B test results by new and existing users. By analyzing the performance of these two buckets separately, you can more easily isolate the effect of your experiment from the effect induced by novelty/change aversion. Of course, you’ll have to figure out what counts as a “new” or “existing” user, and it might differ depending on the test you’re running.
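As a minimal sketch of what that segmentation could look like, assuming you can export one row per user into a pandas DataFrame (the column names and the 30-day cutoff are illustrative):

```python
# Split test results into "new" vs. "existing" users before comparing variants,
# so novelty or change aversion among long-time users doesn't hide the real effect.
import pandas as pd

results = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "variant": ["control", "experiment", "control", "experiment", "control", "experiment"],
    "first_seen_days_ago": [2, 400, 35, 1, 90, 5],
    "converted": [0, 1, 1, 1, 0, 0],
})

# "New" means first seen within the last 30 days; pick a cutoff that fits your test.
results["cohort"] = results["first_seen_days_ago"].apply(
    lambda days: "new" if days <= 30 else "existing"
)

# Conversion rate per cohort and variant.
print(results.groupby(["cohort", "variant"])["converted"].mean())
```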
The Network Effect occurs when users of an A/B test influence each other, effectively “leaking” outside the bucket they belong to (i.e., the control group interacts with the experiment, or the experiment group reverts to control). This most often occurs with “social” functionality, where users interact with each other directly.
Consider the case of a second-hand marketplace that wants to offer users the ability to exchange goods directly without a cash transaction. By definition, such a feature will require two users to participate: the sender and the receiver. One quick solution would be to assign users randomly to the experiment or control group and only allow users in the experiment group to use the new functionality. However, this might lead to a sub-par UX: a user in the experiment group could try to start an exchange with someone in the control group who can’t even see the feature, causing frustration and skewing the test results.
Instead, you should cluster users into groups, with all users in the same group being more likely to interact with each other. For instance, you might cluster users by state, assuming that most of your marketplace’s transactions happen within the same state. You would then assign each state to the control or experiment group. Users might still interact across different states, influencing each other, but clustering helps minimize the likelihood of spillover.
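A minimal sketch of what that cluster-level assignment could look like, assuming the user’s state is the clustering key; the hashing scheme and salt are illustrative, not a prescription:

```python
# Deterministically assign an entire state to control or experiment, so users
# who are likely to trade with each other end up seeing the same experience.
import hashlib

def bucket_for_state(state: str, salt: str = "barter-feature-test") -> str:
    digest = hashlib.sha256(f"{salt}:{state}".encode()).hexdigest()
    return "experiment" if int(digest, 16) % 2 == 0 else "control"

def bucket_for_user(user_state: str) -> str:
    # Users inherit their state's bucket instead of being randomized individually.
    return bucket_for_state(user_state)

print(bucket_for_user("Ohio"), bucket_for_user("Vermont"))
```

Keep in mind that randomizing ~50 clusters instead of thousands of individual users shrinks your effective sample size, so you will need to account for that when sizing the test.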
The Network Effect is relatively uncommon in eCommerce, but it’s still worth understanding if your brand allows users to interact with each other (as in the case of a marketplace). It can lead to unexpected results and can be incredibly sneaky to detect. As a rule of thumb, you should always consider whether any tests you’re designing have the potential to introduce a network effect and plan accordingly.
The History Effect occurs when an event in the outside world affects the results of your A/B test. This can be anything from a marketing campaign to a shopping holiday. These events are likely to change the behavioral patterns of your visitors compared to the average, which will cause you to extrapolate incorrect insights.
For example, let’s assume you’re testing whether offering a discount in exchange for newsletter signups significantly increases subscribers. If you decide to run the test during a site-wide sale (e.g., because of Black Friday/Cyber Monday), your test might be inconclusive: the sale itself, rather than the signup incentive, could be what’s driving the change in behavior.
To mitigate the History Effect, make sure you have solid processes in place so that you’re not running A/B tests concurrently with major media coverage or PR events. It also helps to run your A/B tests for at least two complete business cycles, which typically translates into 2-4 weeks for most brands.
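As a small, illustrative helper (not part of any A/B testing platform), you could round the sample-size-driven duration up to whole business cycles so every day of the week is represented equally:

```python
import math

def test_duration_days(required_visitors: int,
                       avg_daily_visitors: int,
                       cycle_days: int = 7,
                       min_cycles: int = 2) -> int:
    """Round the duration implied by the sample size up to full cycles,
    and never run for fewer than `min_cycles` complete cycles."""
    raw_days = math.ceil(required_visitors / avg_daily_visitors)
    cycles = max(min_cycles, math.ceil(raw_days / cycle_days))
    return cycles * cycle_days

# e.g., 52,000 visitors needed in total at ~3,000 visitors/day -> 21 days
print(test_duration_days(52_000, 3_000))
```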
Simpson’s Paradox is a particularly sneaky type of error that occurs whenever you inadvertently introduce weighted averages into an A/B test, which can happen in a few cases:
- you change the traffic allocation between variants while the test is running (for example, ramping the experiment up from 10% to 50% of visitors);
- you aggregate results across segments of very different sizes, such as devices, countries, or traffic sources, that aren’t evenly represented in each variant;
- you combine the results of several test periods or related tests that used different splits.
In these scenarios, you might find that any correlation identified in individual test segments disappears or is inverted in your aggregate results. This is because the aggregate result calculation effectively becomes a weighted average, and your larger segments will “overwhelm” the smaller ones.
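Here is a small, entirely hypothetical illustration of how that plays out: the experiment wins inside each segment, yet loses in the aggregate because its traffic is concentrated in the low-converting mobile segment.

```python
# Hypothetical numbers chosen to show the inversion; (users, conversions) per bucket.
segments = {
    "mobile":  {"control": (100, 10),  "experiment": (900, 108)},  # 10.0% vs 12.0%
    "desktop": {"control": (900, 270), "experiment": (100, 32)},   # 30.0% vs 32.0%
}

totals = {"control": [0, 0], "experiment": [0, 0]}
for segment, variants in segments.items():
    for variant, (users, conversions) in variants.items():
        print(f"{segment:8s} {variant:10s} {conversions / users:.1%}")
        totals[variant][0] += users
        totals[variant][1] += conversions

for variant, (users, conversions) in totals.items():
    print(f"overall  {variant:10s} {conversions / users:.1%}")
# The experiment leads in both segments, yet the aggregate shows
# control at 28.0% and experiment at 14.0%.
```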
You can find a more academic explanation of Simpson’s paradox here, but there are a few things you can do to prevent Simpson’s Paradox from finding its way into your test results:
- keep the traffic allocation between variants constant for the entire duration of the test;
- make sure your randomization keeps key segments (device, country, traffic source) balanced across variants;
- sanity-check segment-level results against the aggregate before drawing any conclusions.
P-hacking is an extremely common A/B testing pitfall–so much so that it used to be encouraged by A/B testing tools such as Optimizely. Simply put, P-hacking is the practice of changing a test’s original parameters to reach a pre-determined conclusion. This can come in different forms:
- stopping a test as soon as it reaches statistical significance, rather than at the planned sample size;
- extending a test that hasn’t produced the desired result, hoping that it eventually will;
- switching the primary metric once the results are in;
- slicing the results into ever-smaller segments until one of them shows a significant effect.
P-hacking is often caused by pressure from your leadership or digital marketing team to get a specific result from an A/B test or to maximize the impact of a successful experiment as quickly as possible. Unfortunately, this is not how traditional A/B testing works: once you have established your sample size, you simply need to let your test run its course and evaluate the results only at the end.
Because this methodology isn’t particularly well-suited to the speed at which startups typically move, many A/B testing platforms ended up implementing alternative algorithms. Optimizely, for instance, introduced Stats Engine in 2015, which allows A/B testers to peek at test results without the risk of taking action prematurely–you can learn more about how it works in this introductory article by the Optimizely team.
While features such as Optimizely’s Stats Engine or VWO’s Bayesian engine help A/B testing teams avoid pitfalls such as P-hacking, they don’t eliminate the need for proper test planning.
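To see why naively peeking at a fixed-horizon test is so dangerous, here is a small simulation of ours with made-up numbers: both variants convert at exactly the same rate, yet stopping the moment a peek shows p < 0.05 declares a winner far more often than the 5% the significance level promises.

```python
# Simulate many A/A tests and "peek" after every batch of traffic; count how
# often the naive approach would have stopped early and declared a false winner.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)
n_simulations, n_batches, batch_size, true_rate = 1_000, 20, 500, 0.03

early_stops = 0
for _ in range(n_simulations):
    conv_a = conv_b = visits_a = visits_b = 0
    for _ in range(n_batches):
        conv_a += rng.binomial(batch_size, true_rate)
        conv_b += rng.binomial(batch_size, true_rate)
        visits_a += batch_size
        visits_b += batch_size
        _, p_value = proportions_ztest([conv_a, conv_b], [visits_a, visits_b])
        if p_value < 0.05:   # stop as soon as the peek "looks significant"
            early_stops += 1
            break

print(f"False positive rate with peeking: {early_stops / n_simulations:.1%}")
# Substantially higher than the 5% you planned for.
```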
While most pitfalls outlined in this article are statistical, the Instrumentation Effect is much simpler: it occurs when your analytics, A/B testing infrastructure, or test implementation don’t work correctly, skewing test results. Here are a few examples:
- the experiment variant renders incorrectly, or not at all, on certain browsers or devices;
- conversion events fire twice, or never, for one of the variants;
- users are assigned to one variant but shown the other, or get re-assigned on every visit;
- the observed traffic split drifts away from the configured allocation (a sample ratio mismatch).
Because these bugs happen at the very source of your data, no statistical method can solve these problems for you. Instead, you need to regularly and rigorously test every part of your A/B testing infrastructure:
- QA every variant on the browsers and devices your customers actually use before launching the test;
- verify that tracking events fire once, and only once, for each variant;
- run periodic A/A tests: if your platform reports a significant difference between two identical experiences, something is broken;
- monitor the traffic split for sample ratio mismatches throughout the test, as in the sketch below.
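For example, a basic sample ratio mismatch check is just a chi-square test comparing the observed split against the configured one; the visitor counts below are made up, and in practice you would run this on your own assignment logs:

```python
# Flag a possible sample ratio mismatch on a test configured for a 50/50 split.
from scipy.stats import chisquare

observed = [50_512, 48_988]                # recorded visitors per variant (illustrative)
expected = [sum(observed) / 2] * 2         # what a true 50/50 split would produce

_, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:
    print("Possible sample ratio mismatch: audit assignment and tracking code.")
else:
    print("Traffic split is consistent with the configured allocation.")
```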
Reading this article, you might think that we want to discourage eCommerce brands from A/B testing, and you’d be–at least partially–correct. It’s not that A/B testing doesn’t have its place in an eCommerce brand’s strategy. But the effort involved in planning, executing, and analyzing an A/B test, assuming you want statistically significant results, is often more than most early-stage brands can sustain.
Anyone not adequately trained in the statistical techniques behind A/B testing will have a tough time guarding their tests against the pitfalls we’ve outlined–which, by the way, are only a subset of all the statistical and practical errors A/B testers can run into. Tools can help mitigate some of these errors to an extent, but they can’t turn an inexperienced team into data analysis experts overnight.
Furthermore, A/B testing is often not the best research methodology for the question at hand. There are many scenarios in which it falls short: when traffic is too low to reach significance in a reasonable timeframe, or when the question is qualitative (why users behave a certain way) rather than quantitative (which variant performs better).
So, are we suggesting that retailers shouldn’t A/B test? Not at all: A/B testing has its place, and when employed correctly, it can be instrumental in improving a business’s KPIs and bottom line. To dismiss A/B testing as too complicated to be worth the effort would be incredibly short-sighted and detrimental.
However, many eCommerce businesses dive head-first into A/B testing without proper product management fundamentals. We’re talking about generative and evaluative research methodologies such as heatmaps, user interviews, user testing, on-site surveys, session replays, historical analytics, feature flags, and many other techniques.
These practices have a broader set of potential use cases and are also a prerequisite for being able to A/B test intentionally. Plus, they’re almost always simpler to implement and leverage than an effective A/B testing infrastructure!
The next time you–or someone else on your team–want to run an A/B test, stop for a moment and ask yourself: is there a simpler, more efficient way to answer this question or validate this hypothesis? You might be surprised by how many alternatives you have at your disposal.
In the next few weeks, we’ll be exploring precisely these product fundamentals, explaining when and how they are best used, and how you can orchestrate them together to form the basis of a strong product management practice for your eCommerce brand.