25 June 2026

You must be this tall to ride

Contents
  1. What decisioning is, and why the maths runs backwards
  2. The maths doesn't care which statistics you use
  3. "But ours is Bayesian"
  4. The strongest version of the pitch
  5. The bandit is built to not find out
  6. What "this tall" actually means

There's a measuring post at the entrance to the Cannonball Express, the old rollercoaster at Pleasurewood Hills outside Lowestoft. Clear the line painted on it and you ride; fall short and you don't. Nobody argues with the post. The physics of the ride set the number, the number isn't negotiable, and being turned away isn't an insult. It's just maths about restraint systems.

[Image: the Cannonball Express. "Pleasurewood Hills" by Jeremy Thompson, CC BY 2.0]

Every lifecycle decisioning product has a post like that too. A minimum size below which the thing cannot do what it says on the box, set not by the vendor's generosity but by the statistics of telling signal from noise. The decisioning vendors aren't especially secretive about how they work, which is the surprising part. The serious ones lean on holdout measurement against a control, and one or two publish randomised trials with confidence intervals, more candour than the rest of the category manages. What none of them posts at the entrance is the number that actually decides whether the thing works for you: the audience you'd need before that holdout could pick the effect they're selling out of the noise. The absence isn't damning on its own. Vendors publish what buyers ask about, and a buyer who hasn't worked out they might be too small for it never asks. So the sign stays down by mutual neglect: neither side raises the one measurement that would tell some buyers not to bother. Gartner's 2025 Marketing Technology Survey found marketers use under half the martech they own, and only 15% of organisations can show a given tool meets its goals and pays for itself. The same buyer rarely asks what audience the tool needs to work.

What decisioning is, and why the maths runs backwards

Most of the "AI" in a lifecycle stack is decision support: send-time optimisation, churn and propensity scores, generative subject lines, the features that rank and suggest at points inside a journey the marketer still builds. Useful or not, they leave the human holding the decision. Decisioning is the narrower, newer thing that takes the decision away: a system that chooses, per person, what to send, when, on which channel, which offer, and whether to send at all, learning from holdouts inside the guardrails you set. That is the category that borrows the platforms' machinery, reinforcement learning and contextual bandits, and the one where scale decides whether it works. Aampe, just acquired by MoEngage, runs a bandit-driven agent per user; Hightouch runs reinforcement learning on top of your warehouse; OfferFit, now inside Braze, pitches learning agents as the replacement for your A/B tests; Adobe, Salesforce and Pega run native decisioning above. What every one of them quietly needs, and none advertises a floor for, is the scale to tell its own choices apart from noise.

None of it frees you from that. Nothing does, because the constraint doesn't live in the algorithm. It lives in the data.

The amount of evidence you need to tell two options apart is governed by how different they are. Specifically, by the inverse square of the gap between them. Halve the effect you're trying to detect and you don't need twice the traffic, you need four times. That relationship sits in the likelihood, which is the part neither a frequentist nor a Bayesian nor a bandit gets to edit. So when a vendor says their method needs less data for small-traffic, small-effect situations, they are describing the one regime where the maths is most punishing, and promising it's the easiest. Smaller is harder.

The maths doesn't care which statistics you use

The plainest case: one control, one variant, a conversion rate you care about, and a wish to know whether the variant is genuinely better. The sample size to settle that, at the usual 95% confidence and 80% power, comes out of a formula Ron Kohavi, who ran thousands of these at Microsoft and Amazon, likes to write as n = 16σ²/d² per arm: variance on top, the square of the effect you want to catch on the bottom.[1] Plug in real numbers and the bottom of that fraction eats you alive.

Those are per test, for the simplest test there is. Read the 5% baseline column downward and watch what happens as the effect you're chasing shrinks: a 20% lift is cheap, a 5% lift costs a quarter of a million, a 1% lift costs six million. The requirement climbs through the roof as the effect shrinks. The arithmetic here, and the bandit simulation, both live in a companion notebook you can open in Colab and push your own numbers through.

The 5% baseline is generous. Omnisend's 2025 data, across more than 20 billion sends from its own customers, puts broadcast email conversion at 0.08% and automated flows at about 1.5%. At a 1.5% baseline the same 5% lift needs about 845,000 users rather than 244,000, and at broadcast rates it can't be detected at any scale a mid-market brand will reach. The lower rows of the table are the realistic ones.
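The arithmetic is short enough to sketch in a few lines of Python. This is an illustration, not the companion notebook: it assumes the figures quoted here are totals across both arms from the standard two-proportion formula at two-sided 95% confidence and 80% power, which is the reading that lands on roughly 244,000 and 845,000. The function names are made up for the example, and the 16σ²/d² shortcut sits alongside for comparison.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided, two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96
    z_beta = NormalDist().inv_cdf(power)            # about 0.84
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


def rule_of_thumb_per_arm(baseline, relative_lift):
    """The n = 16 * sigma^2 / d^2 shortcut, taking sigma^2 = p * (1 - p)."""
    d = baseline * relative_lift                    # absolute effect
    return ceil(16 * baseline * (1 - baseline) / d ** 2)


if __name__ == "__main__":
    for baseline in (0.05, 0.015):
        for lift in (0.20, 0.05, 0.01):
            total = 2 * users_per_arm(baseline, lift)   # both arms together
            print(f"baseline {baseline:.1%}, lift {lift:.0%}: ~{total:,} users")
```

Halve `relative_lift` and the output roughly quadruples, which is the inverse square at work. The shortcut and the fuller formula agree to within about one percent at these rates.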

And the effects are small. This is the part the optimisation industry would rather you didn't dwell on. Kohavi's blunt version, posted to LinkedIn, is that the minimum effect he'd love to detect is 0.1%, but unless you run one of the biggest sites in the world you simply don't have the users for that, and that on a mature property a 30% improvement to a key metric is a fairy tale. Hold that against the marketing. Send-time optimisation, the flagship lifecycle AI feature, is routinely sold as a consistent 20 to 30% lift in open rates across industries. The vendors and their agencies are quoting, as a baseline expectation, the exact figure the field's most experienced experimenter calls a fantasy. The honest literature agrees: more than half of all ideas tested fail to move the metric they were meant to move at all.[1] The wins, when they come, are won inch by inch, a portfolio of fractional-percent improvements that add up over years. That's the real distribution of effect sizes a decisioning engine is fishing in. Mostly nothing, occasionally a fraction of a percent, almost never the double-digit lift that would fit comfortably inside the traffic you actually have.

It gets worse the moment you do the thing these tools are bought to do, which is denominate in money rather than clicks. Revenue per recipient is a vicious quantity to measure. Most people spend nothing, a few spend a lot, and that lumpiness inflates the variance enormously relative to a clean yes-or-no conversion. The σ² on top of the formula balloons. Microsoft's own experimenters have noted that user-to-user variation routinely swamps the small movement in the metric you're hunting, and that past a couple of weeks the variance stops falling, so running the test longer stops helping.[1] If you're measuring incremental revenue, which is the only denomination that honestly matters, the bar is higher still than the conversion table above, not lower.

This is the same problem I hit writing about the measurement gap. Sub-scale brands can't reliably see sub-percent effects because the platforms have degraded the feedback in the pipe to the point where the signal isn't clean enough to resolve them. The decisioning problem is the twin of that one. Even granting yourself perfect measurement, which you don't have, you still need the raw volume for the effect to clear the noise. Two different failures, the same brands caught by both.

"But ours is Bayesian"

Someone in the room now says the table is a frequentist artefact, their product is Bayesian, and the rules are different. It's the most common rebuttal and it's mostly a vibe.

Switching to a posterior does not conjure evidence that the data didn't contain. It changes the rule you apply to the same likelihood. The two methods are computing different things, granted: a Bayesian setup gives you P(B > A), the probability that B is actually better, which is a more useful sentence than a p-value. But the data requirement to make that probability concentrate near certainty is the same data requirement, because it's reading the same signal. With the weak default priors these tools ship with, a Beta(1, 1) on a conversion rate, the prior contributes the information of about two observations. Two. Against a requirement that runs to hundreds of thousands, the prior's contribution is a rounding error. You would need a prior worth a large slice of those hundreds of thousands of observations, and worth it correctly, to shift the requirement, and nobody has a strong correct prior about a brand-new content block. So at the volumes that matter, the Bayesian posterior probability tracks one minus the one-sided p-value almost exactly. Same requirement, repainted in nicer colours.
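One way to see the repainting is to run both calculations on the same counts. The sketch below is an illustration under stated assumptions: independent Beta(1, 1) priors on each arm, a Monte Carlo estimate of P(B > A), and a one-sided pooled z-test for the p-value. The function names, and the counts in the note that follows, are invented for the example.

```python
import random
from math import sqrt
from statistics import NormalDist


def posterior_prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(B > A) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm is Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


def one_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test of the null that B is no better than A."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 1 - NormalDist().cdf(z)
```

Feed both functions the same made-up result, say 5,000 conversions from 100,000 users on A and 5,150 from 100,000 on B, and the posterior probability and one minus the p-value should agree to within Monte Carlo noise. The prior's two pseudo-observations vanish against 200,000 real ones.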

The thing Bayesian decisioning is actually sold on isn't the prior, it's permission to stop whenever you like. Peek at the dashboard, ship the moment P(B > A) crosses 95%, don't fuss about fixing a sample size in advance. And the half of that which is true is doing a lot of dishonest work. The posterior is genuinely interpretable at any sample size, that part's real. But "interpretable at any time" is not "valid to stop on at any time." Bolt a fixed 95% threshold onto a process you peek at repeatedly and you have reintroduced, through the back door, exactly the runaway error rate that fixed sample sizes exist to control. Evan Miller pointed this out back in 2010. Johari and colleagues, working with Optimizely's own experiment data, put numbers on it: a fixed threshold you stop on the moment it's crossed can inflate the false positive rate five- to tenfold even at 10,000 samples, and without bound the more often you look.[2] The 5% on the box is a guarantee about a procedure almost nobody runs. Bayesian methods don't control the false positive rate because they never promised to; they optimise expected loss instead, which in practice means that when you peek and stop on good news, you haven't solved the peeking problem, you've just stopped keeping score of it.
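The peeking effect is easy to reproduce on an A/A test, where both arms share one true rate and every declared winner is false by construction. What follows is a rough sketch, not Johari and colleagues' procedure: it approximates P(B > A) with a normal z-score (reasonable under flat priors at large n, shaky at the earliest peeks), and the function name and default parameters are assumptions chosen to run quickly.

```python
import random
from math import sqrt
from statistics import NormalDist


def peeking_false_positive_rate(runs=300, users_per_arm=3_000, peek_every=100,
                                rate=0.05, threshold=0.95, seed=0):
    """Simulate A/A tests and return two false positive rates:
    (stop at the first peek that crosses the threshold, look once at the end).

    Assumes users_per_arm is a multiple of peek_every, so the final
    look is itself one of the peeks.
    """
    rng = random.Random(seed)
    z_cut = NormalDist().inv_cdf(threshold)   # P(B > A) > 95% is roughly z > 1.645
    stopped_early = crossed_at_end = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        ever_crossed = False
        z = 0.0
        for n in range(1, users_per_arm + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n % peek_every == 0:
                pooled = (conv_a + conv_b) / (2 * n)
                se = sqrt(2 * pooled * (1 - pooled) / n)
                z = (conv_b - conv_a) / n / se if se > 0 else 0.0
                ever_crossed = ever_crossed or z > z_cut
        stopped_early += ever_crossed     # would have shipped at some peek
        crossed_at_end += z > z_cut       # would have shipped at the final look
    return stopped_early / runs, crossed_at_end / runs
```

The first number can never be lower than the second, because the final look is one of the peeks. Shrinking `peek_every` adds looks and should push the first number further above the nominal 5%.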

So the Bayesian label buys you a better-phrased answer and a licence to deceive yourself faster. It does not lower the requirement.

The strongest version of the pitch

A good vendor won't argue the arithmetic. They'll say you're measuring the wrong thing. You're asking whether the system can name a winner, and naming winners was never the promise. The promise is expected value under uncertainty: put the next send where the expected return looks highest on current evidence, and don't worry about whether B can be proved to beat A. You don't have to certify which option is best to make money leaning toward the one that looks best, provided you do it often enough. A contextual bandit can exploit an edge it can't yet prove, chase a lift that only shows up in one slice of the list, follow a reward that keeps moving, and pocket the rare creative that lands at 30% while starving the duds. None of that needs a clean verdict. It needs a decent bet, repeated.

It's a good case. It also doesn't escape the arithmetic, because every one of those moves cashes out in effect size, and the effect sizes are small. The money a bandit moves around is the gap between the arms. When that gap is a fraction of a percent, exploiting it perfectly banks a fraction of a percent: route 70% of traffic to an arm that's 5% better on a 5% conversion rate and you've captured most of a quarter-point edge, with nothing to show next quarter for having done it.

Heterogeneity is supposed to be the way out, and it's the part vendors lean on hardest. The average effect is tiny, they'll grant, but the system isn't chasing the average; it's finding the slice where the lift is large, iOS users who open at 8pm, lapsed buyers on their third email, and acting there. True, and it makes things worse. Finding which slice holds the effect spends statistical power instead of saving it, because you're now running many comparisons rather than one and most of them are noise, so some slice always looks like a winner by luck. Whichever slice you pick is smaller than the whole, so the evidence under it is thinner than the global evidence you already couldn't resolve. Personalisation doesn't ease the requirement. It multiplies it, once per segment, then asks you to tell the real winners from the lucky ones.
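Both halves of that penalty can be put in numbers. This is a back-of-envelope sketch that assumes the segments are independent, equal in size, and tested at the same threshold, none of which holds exactly for a real list; the function names are illustrative.

```python
from math import sqrt


def chance_of_a_lucky_winner(segments, alpha=0.05):
    """Probability that at least one segment with no true effect
    clears the significance bar purely by chance."""
    return 1 - (1 - alpha) ** segments


def detectable_lift_multiplier(segments):
    """Factor by which the smallest detectable lift grows inside each
    segment when a fixed audience is split evenly: the per-segment n
    falls by k, and since n scales with 1/d^2, d grows by sqrt(k)."""
    return sqrt(segments)


for k in (1, 4, 10, 20):
    print(f"{k:>2} segments: {chance_of_a_lucky_winner(k):.0%} chance of a "
          f"spurious winner, detectable lift x{detectable_lift_multiplier(k):.1f}")
```

With ten such segments the chance that at least one looks like a winner by luck is about 40%, while the smallest lift each segment can resolve has roughly tripled.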

The one move that does escape the maths is the fat-tailed winner: the 30% lift sitting in a bad default nobody has tested, more common in a chaotic mid-market programme than in a checkout Amazon has spent 15 years filing down. Those are real. They're also cheap to find. A 30% lift clears the sample-size table in a few thousand users, and the marketer who reads the report finds it without an arbitration layer choosing per person. The engine isn't sold for the obvious big win. It's sold for the continuous sub-percent tuning underneath, and that tuning is the regime that needs the scale you haven't got. Either the win is big enough to bank without the engine, or it's small enough that the engine can't see it.

The bandit is built to not find out

A multi-armed bandit is designed to minimise regret. It shifts traffic toward whatever's winning so far, so you waste fewer impressions on the loser while the test runs. That's the whole pitch, and on its own terms it works. But minimising regret and identifying the best option are not the same goal, and pursuing the first actively sabotages the second. This is a theorem, not an opinion: Bubeck, Munos and Stoltz proved that the smaller you drive cumulative regret, the larger your simple regret, the error in the arm you finally name as best.[3] The mechanism is exactly the behaviour the vendors advertise. To cut regret, the bandit stops sampling the apparently-worse arm. But to know whether that arm is actually worse, or just unlucky so far, you have to keep sampling it. The feature sold as the benefit is the thing that starves you of the evidence you'd need to trust the verdict.

And the verdict is expensive even when you do it right. Identifying the best of a set of arms costs, in the cleanest analysis, on the order of the sum of the inverse-squared gaps between them.[4] There's that inverse square again, the same one from the sample-size table, wearing a bandit costume. Two arms a fraction of a percent apart need an enormous number of pulls before you can name a winner with confidence, and Thompson sampling, the engine inside most of these products, explores each arm in proportion to its current probability of being best. Two arms that are genuinely close to tied therefore keep splitting the traffic roughly evenly, and the volume it would take to prise them apart is one you won't reach before the window closes.

A simulation makes the scale concrete. Two arms, a 5% baseline conversion rate, Thompson sampling with flat priors, 50,000 pulls, which is already a generous budget for a single test. Vary how much better the winning arm truly is and count how often the bandit actually finds it. The full code and assumptions are in the companion notebook, and the results are these:

As a reward engine the bandit is doing its job: 70% of its traffic on the better arm is 70% of the available lift over control, captured. The trouble is what it doesn't know while it earns that. At a 5% edge its posterior probability that B is the better arm clears 95% in only about a third of runs, and in roughly one run in seven it finishes pointing at the wrong arm outright. It banks most of the gain and still can't tell you which arm produced it, which is precisely what the next campaign needs it to know. Volume barely rescues this: quadruple the budget to 200,000 pulls and the 5% case clears 95% only about two-thirds of the time. It doesn't behave until the gap reaches 20%, the fairy-tale end of the range. At the effect sizes that occur in the wild, the bandit's "decision" is a lightly weighted coin.
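For anyone who wants to rerun the experiment without the notebook, a two-arm Thompson sampler fits in a page of standard-library Python. This is a minimal sketch of the setup described above, not the companion notebook's code: the function names, the Monte Carlo estimate of the final P(B > A), and the choice to count a run as "wrong" when that probability ends below 50% are all assumptions made here.

```python
import random


def thompson_run(rate_a, rate_b, pulls, rng, posterior_draws=2_000):
    """One two-arm Thompson sampling run with Beta(1, 1) priors.

    Returns (share of pulls sent to arm B, final posterior P(B > A)).
    """
    wins = [0, 0]      # conversions seen on arms A and B
    losses = [0, 0]    # non-conversions seen on arms A and B
    rates = (rate_a, rate_b)
    for _ in range(pulls):
        # Draw one plausible rate per arm from its posterior, play the larger.
        samples = [rng.betavariate(1 + wins[i], 1 + losses[i]) for i in (0, 1)]
        arm = 1 if samples[1] > samples[0] else 0
        if rng.random() < rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    share_b = (wins[1] + losses[1]) / pulls
    b_better = sum(
        rng.betavariate(1 + wins[1], 1 + losses[1])
        > rng.betavariate(1 + wins[0], 1 + losses[0])
        for _ in range(posterior_draws)
    )
    return share_b, b_better / posterior_draws


def summarise(relative_lift, baseline=0.05, pulls=50_000, runs=200, seed=0):
    """Repeat the run; arm B is the truly better arm by `relative_lift`."""
    rng = random.Random(seed)
    results = [thompson_run(baseline, baseline * (1 + relative_lift), pulls, rng)
               for _ in range(runs)]
    return {
        "mean_share_on_better_arm": sum(s for s, _ in results) / runs,
        "runs_with_p_above_95": sum(p > 0.95 for _, p in results) / runs,
        "runs_favouring_wrong_arm": sum(p < 0.5 for _, p in results) / runs,
    }
```

Calling `summarise(0.05)` corresponds to the 5% case. Pure Python is slow at these defaults, so cut `runs` for a first look or vectorise the loop, and expect the exact fractions to move with the seed and the number of runs.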

Then there's the assumption underneath all of it that nobody states out loud: that the world holds still while the algorithm learns. Lifecycle marketing is the opposite of still. Send fatigue, seasonality, promotional calendars, list churn, and the platform editor in the pipe quietly changing what counts as a delivered, opened, or engaged event, all mean the reward distribution drifts under your feet. Those pulls are also calendar time: at a few thousand engaged contacts a day, the 200,000 it took to get the 5% case merely two-thirds confident is the better part of two months. A bandit converging over that window is solving for a world that stopped existing in week two. By the time it's sure, the answer has moved. You've bought a machine that arrives at the right answer to last month's question.

The honest form of this objection is that convergence was never the goal, that the right tool forgets on purpose: a restless or sliding-window bandit that tracks the drift instead of settling. But forgetting has a cost. A bandit that only trusts the last fortnight of data is a bandit working from a fortnight of data, and a fortnight at a few thousand sends a day is nowhere near the volume any of these effects needed to clear the noise. The faster you make it adapt, the less evidence sits behind each decision. Drift doesn't rescue the small sender; it shortens the window they had to work in.

What "this tall" actually means

The threshold is no longer a vague worry; it's a number you can almost read off.

In decision terms rather than statistical ones: the value of a decisioning system is the value of the choices it makes minus the value of the choice you'd have made anyway. When you have enough data that the posteriors on your options barely overlap, that difference is real and worth paying for. When you don't, the posteriors sit on top of each other, the expected loss of picking either option is nearly identical, and the system's recommendation carries almost no value over a sensible default or a coin flip. It's still computing, still rendering confident dashboards, still charging you. It's just an expensive random number generator with good production values. And notice this verdict is derived entirely inside the Bayesian frame the vendor chose. You don't get to wave it away by saying I used the wrong school of statistics.

Where's the threshold, roughly, for a healthy 5% baseline? Below something like 50,000 users per variant per test window, you can only resolve effects of 20% or more, and effects that large barely exist in lifecycle, so you're trying to detect an effect smaller than your data can resolve. In the low hundreds of thousands per variant, you can start to resolve 5 to 10% lifts, and a decisioning layer begins to measure rather than guess. To chase the sub-percent effects the vendor's model is actually fitting, you need millions per variant per window. Not millions on your list. Millions reachable, engaged, and run through a single decision before the season turns. Segment the audience for personalisation and every split divides that evidence again.

The uncomfortable thing about those marks is who clears them. The brands big enough to honestly justify a decisioning layer are, by and large, the brands already big enough to have built one. The mid-market the products are sold into mostly isn't big enough, and never finds out, because the same scale that denies them the test also denies them the clean feedback that would tell them the test failed. This is the owned-versus-rented divide showing up one level deeper. The platforms rent you your channel and edit what comes through it; worse, the optimisation you bolt on top, the thing that was meant to be your edge, needs a scale most renters will never reach to produce a single trustworthy decision.

A decisioning layer costs money, and the cost sets a floor on the lift it has to produce. Say it runs £100k a year. On a programme turning over £2m, it has to add £100k of incremental revenue just to cover itself, a 5% lift. Detecting a 5% lift on a 5% conversion rate takes about 244,000 users in a clean test, the figure from the table. So confirming the layer merely covers its cost, before any profit, takes a quarter of a million engaged users per test. The cheaper the layer, the harder this gets: at £50k the break-even lift is 2.5%, which needs closer to a million users to detect. And conversion is the cleanest signal you have. If the layer's value is meant to arrive as lower churn or less discounting instead, each still needs a holdout large enough to detect it, and both move on a slower, noisier signal than conversion, so they need more users to confirm, not fewer. Below that audience, you can't tell whether the layer pays for itself.
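The break-even sum is worth running against your own numbers. The sketch below follows the paragraph's simplification that a revenue lift maps one-to-one onto a conversion lift, and sizes the test with the n = 16σ²/d² per-arm shortcut over two arms, so it lands within about one percent of the quoted figures rather than exactly on them. The function names are hypothetical.

```python
def break_even_lift(annual_cost, programme_revenue):
    """Relative lift the layer must add just to pay for itself."""
    return annual_cost / programme_revenue


def users_to_confirm(relative_lift, baseline=0.05, arms=2):
    """Total users across all arms, via n = 16 * sigma^2 / d^2 per arm."""
    d = baseline * relative_lift                          # absolute effect
    per_arm = 16 * baseline * (1 - baseline) / d ** 2     # sigma^2 = p * (1 - p)
    return round(arms * per_arm)


for cost in (100_000, 50_000):
    lift = break_even_lift(cost, 2_000_000)
    print(f"GBP {cost:,}: break-even lift {lift:.1%}, "
          f"~{users_to_confirm(lift):,} users to confirm")
```

Halving the fee halves the lift the layer has to earn and quadruples the audience needed to see it.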

The post is real. It's higher than you've been told. And the teenager in the hi-vis whose job was to turn you away has been replaced by a sales deck that says everyone's tall enough, please form an orderly queue, and whatever you do, don't stand against the post on the way in.

You must be this tall to ride. Most of you aren't. They're counting on you not checking.

  • 1

    Ron Kohavi, Diane Tang and Ya Xu, "Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing," Cambridge University Press, 2020, https://experimentguide.com/. The n = 16σ²/d² rule of thumb for 80% power follows from the standard two-sample formula, since 2 × (z₀.₉₇₅ + z₀.₈₀)² ≈ 15.7. The same book is the source for the empirical claims here: that most ideas fail to produce a meaningful improvement and wins accrue in fractional-percent increments, that low statistical power yields untrustworthy and exaggerated effects, and that user-to-user variance limits the resolution of small effects regardless of run length. See also Kohavi's public writing on running the power formula in reverse to back out the minimum detectable effect your traffic supports: https://www.linkedin.com/posts/ronnyk_using-the-statistical-power-formula-in-reverse-activity-7027144092459438080-2FTy

  • 2

    Ramesh Johari, Pete Koomen, Leonid Pekelis and David Walsh, "Peeking at A/B Tests: Why It Matters, and What To Do About It," Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17), 2017, pages 1517–1525. On Optimizely's experiment data they show that stopping the first time a fixed threshold is crossed inflates the false positive rate well beyond the nominal level, by five to ten times even at 10,000 samples, with the worst case unbounded under continuous monitoring. https://doi.org/10.1145/3097983.3097992

  • 3

    Sébastien Bubeck, Rémi Munos and Gilles Stoltz, "Pure exploration in finitely-armed and continuous-armed bandits," Theoretical Computer Science 412(19):1832–1852, 2011 (conference version, Algorithmic Learning Theory, 2009), https://doi.org/10.1016/j.tcs.2010.12.059. They prove a general lower bound on simple regret in terms of cumulative regret: driving cumulative regret down forces simple regret up. The origin of the regret framework is T. L. Lai and H. Robbins, "Asymptotically efficient adaptive allocation rules," Advances in Applied Mathematics 6(1):4–22, 1985.

  • 4

    Emilie Kaufmann, Olivier Cappé and Aurélien Garivier, "On the Complexity of Best Arm Identification in Multi-Armed Bandit Models," Journal of Machine Learning Research 17(1):1–42, 2016, https://arxiv.org/abs/1407.4443. The sample complexity of identifying the best arm scales with the inverse-squared gaps between arms; for two arms the cost is on the order of Δ⁻² up to a log-log factor.