What Spotify Learned by Measuring Learning Itself
The streaming giant's new framework reveals that 64% of experiments deliver valuable insights—while only 12% produce traditional 'wins.' That gap tells us something important about how mature organizations think about risk.
The instinct is understandable: run an experiment, find a winner, ship it. Repeat until your metrics curve upward into the kind of growth trajectory that satisfies stakeholders and validates roadmaps. But Spotify's experimentation team noticed something when they looked at their data across hundreds of concurrent tests. The win rate—experiments that improved a key metric—hovered around 12%. Meanwhile, 64% of experiments were teaching them something valuable.
That gap between winning and learning prompted them to formalize what they call Experiments with Learning, or EwL. According to Spotify's engineering blog, it's a framework that "identifies and celebrates experiments that provide the most meaningful insights," rather than just those that move metrics in a positive direction. The distinction matters more than it might appear.
The Problem with Winning
For years, A/B testing frameworks have emphasized win rates as the primary measure of experimentation health. Industry benchmarks suggest win rates typically range from 10% to 20%, according to various optimization platforms—Spotify's 12% sits squarely in that range. But this metric carries an implicit assumption: that the purpose of experimentation is to find improvements.
At a mature product with hundreds of millions of users, that assumption breaks down. Most changes don't make things better. They can't—the product has been optimized for years, users have developed expectations, and the distribution of possible effects skews negative. As Spotify's team notes on their engineering blog, "To make things better, you need a great idea and flawless execution. To make things worse, all it takes is a bug."
Many experiments at Spotify aren't hunting for wins at all. They're testing infrastructure changes, backend refactors, or system updates—changes that carry only downside risk. The goal isn't improvement; it's confirming you haven't broken anything. A "neutral" result on such a test is exactly what you want. It means you can proceed safely.
What Counts as Learning
The EwL framework sets a high bar. An experiment delivers learning only if it's both valid and decision-ready.
Valid means the technical setup worked as intended. Traffic splits functioned correctly, metrics were captured, health checks passed, and there were no sample ratio mismatches or pre-exposure biases that would compromise the comparison. Invalid experiments, no matter their results, can't inform decisions because you can't trust what they're telling you.
Decision-ready means the results clearly support one of three actions: ship the change because a metric improved without regressions; abort because you detected a regression; or proceed with a neutral result because the test was sufficiently powered that you'd have detected an effect if one existed.
That third category—neutral but powered—is where the framework diverges most sharply from traditional thinking. According to InfoQ's coverage of the framework, these experiments are classified as learning because "the effect is neutral, but the experiment was sufficiently strong to detect it if it existed." You learned that your change doesn't move the needle. That's information. It might mean you abandon the feature, or it might mean you ship it anyway because the business case doesn't depend on user metrics. Either way, you can decide with confidence.
What Doesn't Count
Experiments without learning fall into three categories. Invalid experiments failed health checks or had setup errors—they can't teach you anything reliable. Unpowered experiments showed neutral results but lacked sufficient sample size or traffic to detect an effect if one existed—you're left uncertain whether there's truly no effect or whether you simply couldn't measure it. Aborted experiments were stopped mid-run, often for good reasons, but didn't run long enough to reach a conclusion.
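The classification logic described in the last two sections can be sketched as a simple decision rule. This is an illustrative reconstruction, not Spotify's actual implementation in Confidence (which isn't public); the field names and labels are assumptions.

```python
from dataclasses import dataclass

@dataclass
class ExperimentResult:
    valid: bool      # setup worked: correct traffic splits, no sample ratio mismatch
    completed: bool  # ran to its planned conclusion (not aborted mid-run)
    effect: str      # "positive", "negative", or "neutral" on the key metric
    powered: bool    # sample size was large enough to detect the minimum effect

def classify(r: ExperimentResult) -> str:
    """Map an experiment outcome onto the EwL categories described above."""
    if not r.valid:
        return "no learning: invalid"
    if not r.completed:
        return "no learning: aborted"
    if r.effect == "positive":
        return "learning: ship"
    if r.effect == "negative":
        return "learning: abort the change"
    # Neutral result: informative only if the test could have detected an effect.
    if r.powered:
        return "learning: proceed (powered neutral)"
    return "no learning: unpowered"
```

Note that the order of checks matters: validity gates everything, because an invalid experiment's effect estimate can't be trusted regardless of what it shows.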
Spotify's platform, called Confidence, helps prevent these outcomes through sample size calculators, early detection of invalid setups, and documentation improvements. When learning rates drop, the platform team investigates: Are tests underpowered? Are integrations weak? Is there configuration friction? The metric becomes a diagnostic tool for the experimentation infrastructure itself.
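To give a sense of what a sample size calculator does, here is a back-of-the-envelope version using the standard normal-approximation formula for a two-sample test. This is not Confidence's calculator; the parameter values are illustrative.

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(mde: float, sd: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed per arm so a neutral result counts as a 'powered neutral':
    large enough to detect an effect of size `mde` if one existed."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    return ceil(2 * ((z_alpha + z_beta) * sd / mde) ** 2)

# Detecting a 1-percentage-point shift in a ~50% conversion metric (sd ≈ 0.5)
# requires roughly 39,000 users per arm at 80% power:
print(n_per_arm(mde=0.01, sd=0.5))
```

The formula also shows why lowering the minimum detectable effect is expensive: halving the MDE quadruples the required sample, which is why underpowered "neutral" results are so common without tooling that surfaces this upfront.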
The Cultural Shift
Getting from 40 experimenting teams in 2018 to nearly 300 by 2021 required more than technology. Spotify's engagement team—what they describe as "a center of excellence for internal customer success"—built training materials, ran internal sessions, and established best practices. They had to help teams understand not just how to run experiments, but why to run them, and what constitutes a well-designed test.
The mobile home screen alone, according to the engineering blog, hosted 520 experiments across 58 teams in a single year. That density of testing makes experiment bandwidth a strategic resource. You can't test everything. The EwL metric helps allocate that finite capacity: channel bandwidth toward product areas generating actionable learning, reduce low-yield experimentation elsewhere.
It also reveals strategic signals. A stable learning rate with declining win rate suggests strong experiment quality but diminishing product returns—maybe it's time for bolder innovation bets rather than incremental optimization. A high learning rate paired with low business impact might indicate misallocated test capacity, prompting reprioritization.
The Guardrails
Spotify monitors three constraints to prevent gaming the metric. Win rate still matters—teams need to achieve positive results, not just avoid bad ones. Experiment volume must stay high to maintain learning velocity. Precision must remain reliable—lowering minimum detectable effect sizes might artificially raise EwL by classifying more tests as "powered neutrals," but would undermine the statistical reliability that makes the insights trustworthy.
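A minimal sketch of how these guardrails might be tracked alongside the EwL rate, from hypothetical per-experiment records (the record format is an assumption, and the precision guardrail would additionally require monitoring configured MDEs over time):

```python
def guardrail_report(experiments: list[dict]) -> dict:
    """Compute EwL rate plus the guardrail metrics that keep it honest."""
    total = len(experiments)
    wins = sum(e["result"] == "win" for e in experiments)
    learned = sum(e["learned"] for e in experiments)
    return {
        "volume": total,                  # must stay high to maintain velocity
        "win_rate": wins / total,         # teams still need positive results
        "learning_rate": learned / total, # the EwL metric itself
    }

sample = [
    {"result": "win", "learned": True},
    {"result": "neutral", "learned": True},   # powered neutral
    {"result": "neutral", "learned": False},  # unpowered
    {"result": "win", "learned": True},
]
print(guardrail_report(sample))
```

The point of reporting the three numbers together is that any one of them can be gamed in isolation; read jointly, a rising learning rate with collapsing volume or win rate signals a problem rather than progress.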
Some amount of "no learning" remains healthy. It indicates teams are moving fast enough to sustain innovation. The key is balance: fast iteration, rigorous design, and extracting insight from every outcome.
What This Means for Your Team
The EwL framework offers something rare: a replicable model for how mature organizations quantify the value of experimentation beyond simple wins. It acknowledges that much of experimentation's value lies in risk mitigation—avoiding bad decisions, detecting regressions early, building confidence in neutral changes.
For product engineers and tech leads, this suggests a different conversation with stakeholders about experiment success. A test that conclusively shows no effect isn't a failure. An experiment that catches a performance regression before it ships isn't less valuable than one that boosts engagement. The framework makes these outcomes legible as wins.
It also implies that if your team is celebrating only positive results, you're likely leaving value on the table—or worse, not testing risky enough ideas. The ratio matters. At Spotify, five times more experiments deliver learning than deliver traditional wins. That's not a bug in their process. It's how experimentation works when you're using it to make better decisions rather than just to find better features.
The full framework details are available on Spotify's engineering blog, including diagnostic approaches and platform improvements that raised their learning rates. What they've published isn't just a metric. It's a philosophy about what experimentation is for.