Books

Trustworthy Online Controlled Experiments

By Ron Kohavi, Diane Tang, and Ya Xu
Revisited December 5, 2025 at 4:56 AM

Highlights

  • Thomke wrote that organizations will recognize maximal benefits from experimentation when it is used in conjunction with an “innovation system” (Thomke 2003). Agile software development is such an innovation system.
  • Many organizations will not spend the resources required to define and measure progress. It is often easier to generate a plan, execute against it, and declare success, with the key metric being: “percent of plan delivered,” ignoring whether the feature has any positive impact on key metrics.
  • The hard part is finding metrics measurable in a short period, sensitive enough to show differences, and that are predictive of long-term goals.
  • Most who have run controlled experiments in customer-facing websites and applications have experienced this humbling reality: we are poor at assessing the value of ideas.
  • Interesting experiments are ones where the absolute difference between the expected outcome and the actual result is large.
  • Experiments can help continuously iterate to better site redesigns, rather than having teams work on complete site redesigns that subject users to primacy effects (users are primed in the old feature, i.e., used to the way it works) and commonly fail not only to achieve their goals, but even fail to achieve parity with the old site on key metrics (Goward 2015, slides 22−24, Rawat 2018, Wolf 2018, Laja 2019).
  • Our experience is that most big jumps fail (e.g., big site redesigns), yet there is a risk/reward tradeoff: the rare successes may lead to large rewards that compensate for many failures.
  • We recommend that key metrics be normalized by the actual sample sizes, making revenue-per-user a good OEC (see the revenue-per-user sketch after these highlights).
  • The latter, however, may not be as effective after the first couple of weeks, as unique user growth is sub-linear due to repeat users, while some metrics themselves have a “growing” variance over time.
  • Statistical power is the probability of detecting a meaningful difference between the variants when there really is one (statistically, reject the null when there is a difference). See the sample-size sketch after these highlights.
  • The user accumulation rate over time is also likely to be sub-linear given that the same user may return: if you have N users on day one, you will have fewer than 2N users after two days since some users visit on both days (see the accumulation sketch below).
  • It is important to ensure that your experiment captures the weekly cycle. We recommend running experiments for a minimum of one week.
  • For example, selling gift cards may work well during the Christmas season but not as well during other times of the year. This is called external validity: the extent to which the results can be generalized, in this case to other periods of time.
  • In general, overpowering an experiment is fine and even recommended, as sometimes we need to examine segments (e.g., geographic region or platform) and to ensure that the experiment has sufficient power to detect changes on several key metrics.
  • The key thing to remember is that there will be times you might have to decide even though there may not be a clear answer from the results. In those situations, you need to be explicit about what factors you are considering, especially how they would translate into practical and statistical significance boundaries. This will serve as the basis for future decisions versus simply a local decision.
  • Twyman’s Law: “Any statistic that appears interesting is almost certainly a mistake”
  • Experience tells us that many extreme results are more likely to be the result of an error in instrumentation (e.g., logging), loss of data (or duplication of data), or a computational error.
  • The p-value is the probability of obtaining a result equal to or more extreme than what was observed, assuming that the Null hypothesis is true. The conditioning on the Null hypothesis is critical (see the permutation sketch below).
  • Generalizations across populations are usually questionable; features that work on one site may not work on another, but the solution is usually easy: rerun the experiment.
  • When you see anomalous data, think of Twyman’s law and investigate the issue.
  • There is a fundamental shift that happens when teams change from shipping a feature when it does not hurt key metrics, to NOT SHIPPING a feature unless it improves key metrics.
  • For experimentation to succeed and scale, there must also be a culture around intellectual integrity—the learning matters most, not the results or whether we ship the change.
  • In discussing organizational metrics, the taxonomy commonly used is goals, drivers, and guardrails.
  • Goal metrics, also called success metrics or true north metrics, reflect what the organization ultimately cares about.
  • Driver metrics, also called signpost metrics, surrogate metrics, indirect or predictive metrics, tend to be shorter-term, faster-moving, and more-sensitive metrics than goal metrics.
  • Guardrail metrics guard against violated assumptions and come in two types: metrics that protect the business and metrics that assess the trustworthiness and internal validity of experiment results.
  • Taking the time and effort to investigate metrics and modify existing metrics has a high expected value of information (EVI). It is not enough to be agile and to measure; you must make sure your metrics guide you in the right direction. Certain metrics may evolve more quickly than others. For example, driver, guardrail, and data quality metrics may evolve more quickly than goal metrics, often because those are driven by methodology improvements rather than fundamental business or environmental evolutions.
  • You can use the movement of metrics in experiments to identify how they relate to each other (see the correlation sketch below).
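
Code sketches

A few minimal sketches in Python of ideas from the highlights above. They are illustrative only: every sample value, metric name, and parameter below is an assumption made for the sketch, not material from the book.

For the revenue-per-user highlight: a sketch that normalizes total revenue by the actual number of users in each variant and compares the hypothetical per-user values with a Welch's t-test.

```python
# Revenue-per-user (RPU) as an OEC: normalize by the actual users per variant.
# All revenue values here are made-up placeholders.
import numpy as np
from scipy import stats

control_revenue = np.array([0.0, 0.0, 12.5, 0.0, 3.2, 0.0, 7.9])    # per-user revenue, control
treatment_revenue = np.array([0.0, 4.1, 0.0, 15.0, 0.0, 2.7, 6.3])  # per-user revenue, treatment

# Dividing by the actual sample size keeps the metric comparable even when
# the variants end up with different numbers of users.
rpu_control = control_revenue.sum() / len(control_revenue)
rpu_treatment = treatment_revenue.sum() / len(treatment_revenue)

# Welch's t-test on the per-user values (variances not assumed equal).
t_stat, p_value = stats.ttest_ind(treatment_revenue, control_revenue, equal_var=False)

print(f"RPU control:   {rpu_control:.2f}")
print(f"RPU treatment: {rpu_treatment:.2f}")
print(f"delta: {rpu_treatment - rpu_control:+.2f} (p = {p_value:.3f})")
```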
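
For the statistical power and overpowering highlights: a sketch of the standard normal-approximation sample-size formula, n per variant ≈ 2·(z(1−α/2) + z(1−β))²·σ²/δ², which at α = 0.05 and 80% power is roughly 16σ²/δ². The sigma, delta, and segment-share values are made up.

```python
# Normal-approximation sample size per variant:
#   n ~= 2 * (z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / delta^2
# which at alpha = 0.05 and 80% power is roughly 16 * sigma^2 / delta^2.
from scipy.stats import norm

def users_per_variant(sigma: float, delta: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users per variant to detect an absolute difference `delta`
    in a metric with standard deviation `sigma`."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return round(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

# Illustrative numbers: a metric with sigma ~ 0.30 and a hoped-for 0.005 absolute lift.
n = users_per_variant(sigma=0.30, delta=0.005)
print(f"users per variant: {n}")

# Overpowering for segments: if the smallest segment of interest is ~20% of traffic,
# roughly n / 0.20 users per variant are needed to keep the same power inside it.
print(f"users per variant with a 20% segment in mind: {round(n / 0.20)}")
```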
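
For the sub-linear user accumulation highlights: a tiny simulation, assuming a fixed hypothetical user population and a fixed daily visit probability, showing that cumulative unique users after two days come in well under 2N because some users visit on both days.

```python
# Sub-linear unique-user accumulation: repeat visitors make cumulative unique
# users grow slower than days * daily_visits. Population size and visit
# probability are made-up assumptions.
import numpy as np

rng = np.random.default_rng(42)
population = 100_000      # pool of potential users
daily_visit_prob = 0.10   # chance a given user shows up on a given day

seen = np.zeros(population, dtype=bool)
for day in range(1, 8):
    visited_today = rng.random(population) < daily_visit_prob
    seen |= visited_today
    print(f"day {day}: ~{int(visited_today.sum()):>6} visits, {int(seen.sum()):>6} cumulative unique users")
```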
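
For the p-value highlight: a permutation sketch that makes the definition concrete by forcing the null hypothesis to hold (shuffling variant labels) and counting how often the result is at least as extreme as the observed difference. Both samples are synthetic.

```python
# p-value by permutation: under the null there is no difference between variants,
# so the labels are exchangeable; shuffle them and count outcomes at least as
# extreme as the observed delta. Both samples are synthetic.
import numpy as np

rng = np.random.default_rng(0)
control = rng.normal(10.0, 3.0, size=500)
treatment = rng.normal(10.4, 3.0, size=500)

observed = treatment.mean() - control.mean()
pooled = np.concatenate([control, treatment])

n_perm = 10_000
extreme = 0
for _ in range(n_perm):
    rng.shuffle(pooled)                               # enforce the null by relabeling
    diff = pooled[500:].mean() - pooled[:500].mean()
    if abs(diff) >= abs(observed):                    # "equal to or more extreme", two-sided
        extreme += 1

print(f"observed delta: {observed:.3f}")
print(f"permutation p-value: {extreme / n_perm:.4f}")
```

A permutation test is used here instead of a closed-form test only because it makes the "assuming that the Null hypothesis is true" part of the definition explicit; a t-test would give a similar answer on these samples.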
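
For the metric-movement highlight: a sketch that correlates per-experiment deltas (treatment minus control) of a hypothetical driver metric with a hypothetical goal metric across fabricated past experiments.

```python
# Relating metrics through their movements: correlate per-experiment deltas of a
# candidate driver metric with a goal metric. The deltas are fabricated, not
# real experiment results.
import numpy as np

rng = np.random.default_rng(7)
driver_deltas = rng.normal(0.0, 1.0, size=50)                       # e.g., sessions-per-user deltas
goal_deltas = 0.6 * driver_deltas + rng.normal(0.0, 0.8, size=50)   # e.g., revenue-proxy deltas

corr = np.corrcoef(driver_deltas, goal_deltas)[0, 1]
print(f"correlation of metric movements across {len(driver_deltas)} experiments: {corr:.2f}")
```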