11 June 2026

Counting in the dark: measuring marketing channels with platforms wedged in the middle

Contents
  1. The three layers
  2. Most senders shouldn't try
  3. Why the gap won't close
  4. The intermediation timeline
  5. Shared probes
  6. The operational kit for push
  7. The operational kit for email
  8. Lifecycle platforms and ESPs
  9. Attribution lives in overlapping layers
  10. The older intermediation layer
  11. The conversion surface
  12. The agentic shock
  13. The harder half
  14. What I might be wrong about
  15. Building the in-house framework
  16. Closing

Your push CTR is down 15% quarter on quarter on iPhones. Your email open rate has been corrupted by Apple Mail Privacy Protection for four years, your click rate is drifting downward as image prefetching spreads to more clients, and you can't tell whether the Gemini summary that Gmail now generates is preserving your subject line or rewriting it into something that doesn't drive a click. The trade press tells you Apple Intelligence is eating your push notifications. Your CSM at Braze sends you a deck about it. A deliverability vendor sends you a different deck about Gmail's AI Inbox view. Your VP of Growth wants to know what you're doing about both.

The honest answer is that you can't tell whether any of these platforms are doing anything to you in particular.

You don't know whether your push notifications appeared on lock screens or got bundled into a summary. You don't know whether your emails landed in the Primary tab, the Promotions tab, the Focused inbox, or the "Other" inbox. You don't know whether the user opened your message or whether their AI assistant did, and you don't know whether what either of them read was your copy or a model's reread of your copy. There is no API to find out, and the trajectory says there won't be.

That's the deeper problem. The platforms have positioned themselves between you and your audience, and they don't have to tell you what they're doing. They don't. Sometimes for genuine privacy reasons. Sometimes because it makes the product better for the user. Sometimes because exposing the action of the intermediation undermines the framing that the user is being protected from senders. The reasons differ across channels and across vendors. The result is the same: a measurement gap you cannot close from the sender side.

I've spent three pieces getting here. The first was about what the four mailbox providers do to your email: the inbox stopped being a transport layer years ago and became an active editor between you and the reader. The second was about how Apple and Google did the same to push, the on-device model now the editor in the pipe. The third was about the other end of the same wire: the messaging systems those platforms build for their own users, the reinforcement learning and bandits and send-or-stay-silent models all tuned to long-term value, set against what a normal brand's lifecycle team can actually do with the machine learning it has bought, which is mostly nothing, because the binding constraint turns out to be the unglamorous state of its own data.

All three ended in roughly the same place. The platform holds the better hand at both ends of the wire, the editor on the receiving end and the frontier on the sending end, it won't show you what it does, and your real weight belongs on the surfaces it can't reach. This piece takes the question all three raise and none of them answered: if you can't see what the editor does to your message, how do you measure it?

The methods apply across push and email, because the underlying problem is the same: an adaptive intermediary acting on your sends and not telling you what it did. The probes available to a marketer in 2026 are channel-agnostic in shape and channel-specific in detail.

The three layers

[Figure: Three layers of intermediation]

Across two channels and three platform layers, there are three things you do not see.

Was the message displayed? APNs accepted your push. Gmail accepted your email. The OS may or may not have rendered the notification. The inbox provider may or may not have routed your message to the Primary tab. Focus may have suppressed the notification. The Promotions tab may have hidden the email under three other promos. The user may have set your channel to "Silent" two years ago, or filed every email from your domain into a folder they never check.

Was it prominent? A push can show on the lock screen, in a banner, in a summary bundle, or only in Notification Center. An email can show in the Primary inbox, in the Promotions tab, in Outlook's Focused Inbox or its "Other" inbox, in Gmail's AI Inbox priority view, or in the spam folder. The discoverability gap between "user opened the inbox and saw your message at the top" and "user has to scroll past forty promos to find it" is enormous. None of it shows up in your dashboards.

Was the original wording shown? This is the new layer in both channels. Apple Intelligence summarises pushes on every iPhone from the 15 Pro forward. Recent flagship Pixels do the same, with the Pixel 9a explicitly excluded on RAM grounds.[1] The Galaxy S26 family does it with Notification Highlights in One UI 8.5. Gmail's AI Overviews and AI Inbox, rolled out in January 2026 on top of the Gemini summarisation features that have been live since mid-2024, summarise email threads on the user's behalf, by default on mobile, opt-out in most regions. Microsoft Copilot's Summarize button has been available in Outlook through 2024 and 2025, with the summary chat now available even to users without a paid Copilot license. The summary you don't see may preserve your fact. It may not. It probably elides the call-to-action and keeps only what the model thinks the user cares about.

Each layer requires different inferential work. Confusing them is most of what's wrong with the current discourse. "Apple Intelligence is eating my opens" or "Gmail's Promotions tab is killing my engagement" treats a three-layer problem as one layer. Each claim is partially true and partially a scapegoat. The remedy depends on which layer is doing the work.

Of the three, the wording layer is structurally the hardest to read in isolation. The display and prominence layers move around platform release dates and respond to sender-side ablation. The wording layer, the one driving most of the AI-summarisation panic, is the one no sender-side probe can cleanly separate from the other two.

Most of the kit is shared between email and push. The probes that work in one work in the other with adjustments. The vendors and instrumentation differ. The principles do not.

I'm writing this because the trade press treats the measurement problem as a temporary embarrassment that the platforms will fix with an API one day. That is not what is going to happen. The measurement gap is structural. The measurement gap is going to widen. Get used to counting in the dark.

Most senders shouldn't try

Reading platform-intermediation effects on your own sends takes hundreds of thousands of sends per cell of any design to detect even a 5% relative shift in CTR, and the platform-intermediation effects you're actually trying to read are typically smaller than that. Millions per cell to detect shifts in the low single percent. Tens of millions for the sub-percent shifts platforms can read in their own systems.[11]

Anchoring on a 2% campaign CTR baseline (roughly the B2C industry average per Klaviyo and Mailchimp's 2026 benchmarks[12]), at 95% confidence and 80% power:

[Figure: A/B testing volume threshold]

These are floor numbers for the easiest design, a simple two-proportion A/B test. The difference-in-differences (DiD) design that does most of the work in the kit differences four noisy quantities across cohorts and periods, which raises required n above what a two-proportion test demands, and the intent-to-treat (ITT) dilution (eligibility isn't treatment, treated population unknown) raises it again. Treat the table as the optimistic edge of the volume requirement, not the answer.
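As a rough illustration of where thresholds of this size come from, the sketch below applies the textbook two-proportion sample-size formula at a 2% baseline, 95% confidence and 80% power. The function name `sends_per_cell` and the chosen shifts are illustrative assumptions, not the calculation behind the table, and the formula covers only the simple A/B case, not the DiD.

```python
from math import ceil
from statistics import NormalDist


def sends_per_cell(baseline, relative_shift, alpha=0.05, power=0.80):
    """Approximate sends needed in EACH arm of a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_shift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 at 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 at 80% power
    # Sum of the two binomial variances, one per arm.
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    for shift in (0.20, 0.05, 0.01, 0.005):
        n = sends_per_cell(0.02, shift)
        print(f"{shift:.1%} relative shift at a 2% CTR: {n:,} sends per cell")
```

Under these assumptions a 5% relative shift needs a few hundred thousand sends per cell, a 1% shift needs several million, and a 0.5% shift needs tens of millions. Halving the baseline to 1% roughly doubles each figure.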

Cells multiply when you split by eligibility cohort, segment, channel, or any other dimension you care about, so the requirement compounds. Push baselines tend to be lower (around 1-2%, with substantial variance by industry), so the same relative shift is a smaller absolute difference and the required cell sizes grow rather than shrink, roughly doubling at a 1% baseline. The shape of the relative-detection curve is the same.

If your list is in the tens of thousands or low six figures, every probe here is theatre on your data. The kit produces wide confidence intervals around zero whatever you actually do, and the result is a measurement programme that looks elaborate, runs on cadence, and tells you nothing.

The answer for that audience is the one the AI gap piece lands on. Compete on the things the intermediation layer can't touch. Brand. Product. Relationship. The human craft of knowing what to say to whom, and saying it so the substance survives a rewrite, because the substance is the thing that earns the click, not the wrapper.

For senders with the volume to read sub-percent shifts, the kit is operational. Below the threshold, take the diagnostic and leave the methodology.

Why the gap won't close

The optimistic story is that the platforms will eventually expose intermediation signals to senders the way Gmail eventually exposed deliverability signals via Postmaster Tools. Apple will publish a "your notification was summarised" callback. Google will publish a "your email was summarised by Gemini" hook. Microsoft will tell you when Copilot rewrote your subject line in someone's inbox. Samsung will follow, because Samsung always follows.

There is one concrete motion in this direction. In March 2026 a draft landed at the IETF, draft-brotman-aggregate-performance-reporting, authored by engineers at Google, Comcast and Iterable, proposing a standard format for mailbox providers to send senders daily aggregate JSON reports covering classification buckets (inbox, unwanted, promotional, forwarded) and engagement buckets (positive, neutral, negative), keyed off DKIM domain.[2] It is the first serious motion in this direction in a decade. It is also limited in ways that matter: aggregate not per-message, email-only, no summarisation hook, voluntary for the mailbox provider that decides whether to participate, and at the lowest IETF state with no working-group adoption yet. If it ships, it improves the deliverability picture incrementally on the Gmail and Microsoft side. It does not begin to address the summarisation, push, or AI-rewrite layers where the gap actually lives.

For the layer that actually matters here, the summarisation and AI rewrites and the push-side editor, this isn't going to happen on the trajectory the platforms are on. The last fifteen years all point away from the optimistic story.

Apple App Tracking Transparency shipped in April 2021. It did not reverse. It did not get a sender-side opt-out. The IDFA stays gone for the overwhelming majority of users who see the prompt and the industry adapted. Apple Mail Privacy Protection shipped in September 2021. It did not reverse. The Mozilla/5.0 proxy did not get a "real open" callback. The industry adapted. Apple Intelligence Notification Summaries shipped in iOS 18.1 in October 2024. It is not going to reverse. The platform is not going to tell you what it summarised.

Gmail tabs shipped in 2013. The Promotions tab did not reverse. Gmail did not publish a "we put this in Promotions" callback. Gmail published the slightly more useful Postmaster Tools dashboard, which tells you aggregate deliverability and spam rate by IP and domain, but does not tell you whether any specific message landed in Primary or Promotions. The Postmaster Tools dashboard is the most senders have ever been given, and it is deliberately thin. The gap has remained.

The reasoning is straightforward. Each of these is a user-facing protection or attention-quality feature, positioned to the user as protection from a hostile sender. Exposing the action of the protection feature to the sender undermines the frame. You can't simultaneously tell the user "we're protecting you from this app's notifications" or "we're sorting these promos out of your way" and tell the app "here's exactly what we did to your message so you can route around it." Pick one. Every platform so far has picked the user-facing frame, predictably.

There's also the on-device model to think about. Apple Intelligence summaries are generated on the device using a small foundation model. Pixel notification summaries run on Gemini Nano locally. One UI 8.5 summaries run on Samsung's Galaxy AI stack locally. There's no obvious mechanism by which any of these platforms could even tell you which notifications got summarised without re-introducing a telemetry channel they explicitly disclaim. The architecture leaves no obvious sender-side trail. Gmail and Microsoft summaries run in the cloud, but exposing summarisation events to senders runs into the same product-positioning problem as the on-device case.

The commercial pressure isn't there either. Senders complain. Senders complaining is not the same thing as senders moving budget. The marginal sender who would actually pull spend from Braze or Klaviyo over the lack of a summarisation-aware feature is theoretical. The platforms are not Braze's competitors and they are not Klaviyo's competitors.

The regulatory pressure isn't there either, though the case is more developed than it looks. The self-preferencing argument under the DMA, that on-device summarisation and inbox categorisation favour the platform's own surfaces over a brand's message, is real, and I've gone through it in the AI gap piece. The short version: the frame exists and the Commission's enforcement energy exists, but the surface where most of the editing happens was left outside the regime, because when the EU designated its first gatekeepers it declined to designate Gmail and Outlook as important enough to qualify.[3] The obvious door is bolted. The one piece of recent regulation that did land, the Yahoo and Gmail bulk sender requirements of February 2024,[4] put obligations on senders, not transparency on platforms. The regulatory direction is one way.

And the thing you're measuring isn't a static filter. It's an adaptive machine-learning system, retrained constantly, with more data and better models than anything in your stack. You are not characterising a fixed rule once. You are chasing a moving target that optimises against the same user you do. Any number you produce is a snapshot of a system that has already moved.

The opacity isn't only that the platform isn't telling you what it did. It's that the platform sees what it did, in full, and you don't. The AI gap piece traced this from the sending side: the platforms run experiments against the user that you cannot run, with telemetry you cannot see, on a population larger than your list. The same asymmetry shows up here from the receiving side. The intermediary on the wire is not just opaque, it is accumulating a measurement advantage inside itself, send by send, that compounds over years. By the time a sender can characterise a single intervention, the platform has run a hundred more and learnt from each. Measurement asymmetry is the structural advantage, not opacity for its own sake.

Anyone selling you a measurement strategy that depends on platform cooperation is selling aspirational software. The gap is not a bug. It is the architecture. Plan accordingly.

The intermediation timeline

The full narrative of this lives in the email piece and the push piece. The intermediation events themselves are each discrete and timestamped, which is what makes them natural experiments you can measure against. The AI summarisation panic of 2024-2026 is the latest chapter, not the whole book.

September 2021. Apple Mail Privacy Protection ships in iOS 15.[5] The image proxy means open pixels fire on the Apple proxy server, not the recipient device. Open rates inflate by 30 to 50% on Apple-heavy lists. The Mozilla/5.0 user agent appears on every proxy fetch and becomes the canonical identifier for MPP-proxied opens. The email industry's central engagement metric stops meaning what it had meant for two decades.

February 2024. Gmail and Yahoo bulk sender requirements take effect. SPF, DKIM, DMARC alignment, one-click unsubscribe (RFC 8058), spam rate under 0.3%. Microsoft followed with similar rules in May 2025; Google escalated enforcement in November 2025.

October 2024. iOS 18.1 ships. Apple Intelligence Notification Summaries enabled on iPhone 15 Pro, 15 Pro Max, and all iPhone 16 models. The sparkle icon appears.

January 2025. iOS 18.3 ships, having disabled Notification Summaries for News and Entertainment apps after the BBC complained that Apple Intelligence had summarised a story to claim the man charged with murdering the UnitedHealthcare CEO had shot himself.[6] The first regression in the AI summarisation rollout, and the cleanest natural experiment available for measuring the summarisation effect.

November 2025. Pixel notification summaries roll out to Pixel 9 and Pixel 10 in the November Feature Drop. The December follow-up adds the Notification Organizer, which auto-categorises and silences low-priority alerts under "News" and "Promotions" without the user doing anything.

January 2026. Gmail launches AI Overviews and the AI Inbox view, both powered by Gemini 3, with email summarisation default on mobile in most regions. The most aggressive intermediation of email since the Promotions tab thirteen years earlier. Galaxy S26 ships shortly after with One UI 8.5 Notification Highlights.

February 2026. Microsoft pulls the Copilot Priority View from Outlook on iOS and Android, citing cost and user feedback. The related Prioritize My Inbox feature continues across all clients including mobile; what was pulled is a separate mobile-only priority-sorted view of the inbox. The second meaningful reversal in the timeline, and a quieter one than the BBC episode: not protection abandoned, just an AI feature that didn't justify its compute bill on mobile.

The pattern is one direction with rare exceptions. Most steps add intermediation. The reversals that do happen are partial and cost-driven: iOS 18.3 disabling news summaries after the BBC complaint, Microsoft pulling the Copilot Priority View from Outlook mobile in February 2026 on cost and user-feedback grounds. Neither is a sender-side opt-out reappearing; both are the platform deciding the feature didn't pay its way. Every step that survives is positioned as user protection. None of the survivors offers a sender-side signal of any consequence.

What email and push have shared, throughout all of this, is the MPP playbook. The thing the industry actually did in the eighteen months after September 2021 is the prototype for what to do now, in both channels.

First, identify the proxy. Mozilla/5.0 was the canonical signature on the Apple Mail proxy's image fetches. Senders who looked at user-agent strings could segment opens into "Apple-proxied" (open inflation, not a real user signal) and "everything else" (still mostly real). The same logic applies to identifying Apple Intelligence-eligible devices by model, Gemini-summarisation-eligible mailboxes by client header, current Pixel and Samsung flagship cohorts by user-agent. The proxy or the eligibility marker is the segmentation key.
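A minimal sketch of that segmentation step, assuming open events arrive as dicts with a `user_agent` field (a hypothetical shape) and that the proxy's fetches carry the bare string `Mozilla/5.0` while real clients send a longer user agent that merely begins with it. Both assumptions should be checked against your own logs before relying on the split.

```python
from collections import Counter

# Assumption: proxy image fetches carry exactly this string, whereas real mail
# clients and browsers send a longer user agent that only starts with it.
APPLE_PROXY_UA = "Mozilla/5.0"


def segment_open(user_agent):
    """Label one open event as 'apple_proxied' or 'other' from its user agent."""
    return "apple_proxied" if user_agent.strip() == APPLE_PROXY_UA else "other"


def segment_opens(events):
    """Count opens per segment; `events` is an iterable of dicts with 'user_agent'."""
    return Counter(segment_open(event["user_agent"]) for event in events)


if __name__ == "__main__":
    sample = [
        {"user_agent": "Mozilla/5.0"},
        {"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"},
    ]
    print(segment_opens(sample))
```

The same pattern extends to the other eligibility markers: swap the string match for a device-model or client-header lookup and keep the two-bucket output as the segmentation key.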

Second, model the inflation or the shock. With the proxy identified, you can back out what your "real" engagement rate had been against a pre-intervention baseline on the affected cohort. This was useful for trend continuity, and not much else. It told you nothing about whether the email had actually been read by a human, or whether the push was actually shown.

Third, retire the metric. Eventually everyone stopped trying to model out the inflation and just admitted that opens were dead as a primary metric for B2C email. Click-through, downstream conversion, list churn, and reply rate took over. The shift wasn't voluntary. It was forced by the fact that the open signal had ceased to mean anything specific.

The transferable lesson is that there's a six-to-eighteen-month window after a major platform intervention in which you can usefully measure the shock. After that window, the question changes from "how much did the platform shift my metrics" to "how do I operate without that signal at all." The mistake is treating the modelling work as a permanent solution. It isn't. The companies that did well post-MPP were the ones who used the eighteen-month window to build out their click-and-downstream measurement infrastructure and then quietly stopped reporting opens. The companies that did badly are still reporting "Apple-MPP-adjusted open rate" four years later, which is approximately as useful as a "weather-adjusted batting average."

For push, the iOS 18.1 shock is around twenty months old by the time this publishes, which is well past the window. For email it depends which shock you mean: the bulk sender requirements have been in force for over two years, so that modelling-out is long done, while the Gmail AI Inbox is only a few months old, recent enough that measuring the shock is still worth the effort. The operational kit assumes the signal is mostly gone. Build for that.

Shared probes

Difference-in-differences around platform release dates

The closest thing to a clean experiment you can run without platform cooperation is a difference-in-differences analysis around a known release date.[7]

The setup. Pick a cohort that was affected by the intervention and a cohort that wasn't. Compute the change in your metric for each cohort across the date. The DiD estimator is the difference between the two differences. If the affected cohort moved and the unaffected cohort didn't, you have a measurement of the intervention effect, net of whatever else was happening in your sends that quarter.
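The arithmetic of that setup can be sketched in a few lines. The function name `did_estimate`, the `(clicks, sends)` input shape and the example counts are all hypothetical, and the interval treats the four cells as independent binomial samples, which is a simplification a real analysis would want to relax.

```python
from math import sqrt


def did_estimate(treated_pre, treated_post, control_pre, control_post, z=1.96):
    """Difference-in-differences on click rates.

    Each argument is a (clicks, sends) tuple for one cohort in one period.
    Returns the point estimate and an approximate 95% interval.
    """
    cells = [treated_pre, treated_post, control_pre, control_post]
    rates = [clicks / sends for clicks, sends in cells]
    t_pre, t_post, c_pre, c_post = rates
    estimate = (t_post - t_pre) - (c_post - c_pre)
    # Four noisy rates feed the estimate, so four variances add up.
    variance = sum(p * (1 - p) / sends for p, (_, sends) in zip(rates, cells))
    half_width = z * sqrt(variance)
    return estimate, (estimate - half_width, estimate + half_width)


if __name__ == "__main__":
    # Hypothetical counts: the affected cohort drops, the control holds.
    effect, interval = did_estimate(
        treated_pre=(2000, 100_000),
        treated_post=(1800, 100_000),
        control_pre=(2000, 100_000),
        control_post=(2000, 100_000),
    )
    print(f"DiD effect: {effect:+.4f}, interval: {interval[0]:+.4f} to {interval[1]:+.4f}")
```

Because the variance sums over all four cells, the interval is wider than a plain A/B comparison on the same volume would give.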

Push examples. iOS 18.1 (28 October 2024) initial notification summaries. iOS 18.3 (27 January 2025) news disablement. iOS 18.4 (March 2025) Priority Notifications. The Pixel November 2025 Feature Drop. Galaxy S26 with One UI 8.5 in early 2026.

Email examples. MPP (September 2021). Bulk sender enforcement (February 2024). Gmail Gemini summary rollout (June 2024). Microsoft Copilot Summary (August 2025). Gmail AI Inbox (January 2026).

Primary methodology, with caveats. The first and most binding: DiD needs a lot of data. Detecting sub-percent effects (which is what platform interventions usually produce against a typical sender's lift) wants millions to tens of millions of sends per cell of the design, before you've even split by eligibility cohort. Most senders don't have that. Below the volume threshold, the technique gives you a wide confidence interval around zero and not much else. Above it, three methodological pitfalls.

Caveat one: parallel trends. The DiD assumes that without the intervention, the two cohorts would have moved together. That's an assumption, not a fact. Apple-heavy cohorts and Android-heavy cohorts have different demographics. Gmail-heavy lists and Outlook-heavy lists are not similar populations. If your iPhone audience is in the US and your Android audience is in Brazil, you have a parallel-trends problem on holidays alone. Test it visually on the pre-period before you trust anything after the date. If the pre-period lines are wandering all over each other, your DiD is going to be noise.

Caveat two: contemporaneous changes by you. Each major platform release is the loudest channel news of its quarter. Every marketer reads the same trade press and adjusts their copy and structure in the same direction. Front-loaded fact, "Important:" prefixes for push, shorter subject lines and clearer first-line previews for email. Your own sender behaviour shifted around the same date as the platform intervention, and the DiD can't separate one from the other. The careful version of the analysis acknowledges this and tries to estimate the sender-side effect by looking at copy and structural changes you made in the same window.

Caveat three, and the deepest: you can't observe treatment at the unit level. You proxy "treated" by device eligibility, but eligibility isn't treatment. Apple Intelligence is opt-in, summarisation fires on a subset of notifications and emails, bundling depends on volume and timing. So your treated cohort is intent-to-treat against a treated population of unknown size, with a compliance rate that varies by user, time, and message. The effect size the DiD reads is a lower bound diluted by an unmeasured fraction. The true per-affected-message effect is larger, by how much you can't tell. Mitigation is partial: compare a stabilised post-period (months after release, once adoption has plateaued) against an equivalent pre-period, and report the DiD effect as "intent-to-treat against the eligible cohort," not as the effect of summarisation itself.
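The dilution can be made concrete with a sensitivity sketch. It assumes the effect on eligible-but-untreated sends is zero, so that the intent-to-treat effect equals the compliance rate times the per-treated effect. The compliance rates below are placeholders, since the real rate is exactly what you cannot observe, and the function names are hypothetical.

```python
def per_treated_effect(itt_effect, compliance_rate):
    """Scale an intent-to-treat effect up to the implied effect on treated sends.

    Assumes untreated-but-eligible sends see no effect, so that
    itt_effect = compliance_rate * per_treated_effect.
    """
    if not 0 < compliance_rate <= 1:
        raise ValueError("compliance_rate must be in (0, 1]")
    return itt_effect / compliance_rate


def sensitivity_table(itt_effect, assumed_rates=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Implied per-treated effect under each assumed (unobservable) compliance rate."""
    return {rate: per_treated_effect(itt_effect, rate) for rate in assumed_rates}


if __name__ == "__main__":
    # Hypothetical ITT reading: a 0.2 percentage point drop on the eligible cohort.
    for rate, effect in sensitivity_table(-0.002).items():
        print(f"if {rate:.0%} of eligible sends were treated: {effect:+.4f} per treated send")
```

Reporting the whole range, rather than picking one compliance rate, keeps the output honest about what the DiD can and cannot say.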

The iOS 18.3 news shock is the cleanest of these for push. Apple disabled summarisation specifically for News and Entertainment categories on 27 January 2025. If you're a news publisher, your CTR on Apple Intelligence-eligible devices should have moved on that exact date relative to a non-news control. The Gmail Promotions tab launch is the analogous email shock, except it's far enough in the past that few senders kept the cohort data needed to analyse it cleanly. The recent one to focus on for email is the Gmail AI Overviews rollout in January 2026, where summarisation became default on mobile in most regions. The MPP launch in September 2021 is too noisy to be a clean DiD because the metric itself changed character.

Sender-side ablation

Hold your audience cohort constant. Vary the things you control. Measure differential effects across known eligibility cohorts. This is the cleanest internal experiment available in either channel: cohort held constant, content varied, platform-eligibility split a stable observable, deltas with an obvious interpretation. Failure modes are statistical power (in practice, tens to hundreds of thousands of sends per cell to detect anything below a 20% relative shift) and cell contamination (your eligible cohort isn't perfectly identifiable from the sender side, so you're approximating with device-model headers or mailbox-provider domain mappings).

Push-side variables to ablate:

  • Interruption level: active vs time-sensitive.
  • Copy structure: front-loaded fact vs buried lede.
  • Bundle composition: a single send vs several within the bundling window.

Email-side variables to ablate:

  • Subject-line structure: front-loaded fact vs brand-first.
  • Preheader content: summary-friendly vs decorative.
  • HTML weight: rich vs near-plain-text.
  • Image-to-text ratio, which is a Promotions-tab classifier signal.
  • From-address consistency: same domain vs different sender names.
  • One-click unsubscribe header content: matches the body link vs differs from it.

In both channels the principle is the same. The delta between two variants should be larger on the eligible cohort than on the non-eligible cohort, if the intermediation layer you suspect is doing the work. If the delta is the same on both cohorts, the layer isn't doing much to you and the win or loss is just normal copy performance.
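That comparison reduces to an interaction term: the variant delta on the eligible cohort minus the variant delta on the non-eligible cohort. The sketch below uses hypothetical cell labels and counts, and leaves significance testing aside.

```python
def variant_delta(cells, cohort):
    """CTR of variant B minus CTR of variant A within one cohort.

    `cells` maps (cohort, variant) to a (clicks, sends) tuple.
    """
    clicks_a, sends_a = cells[(cohort, "A")]
    clicks_b, sends_b = cells[(cohort, "B")]
    return clicks_b / sends_b - clicks_a / sends_a


def eligibility_interaction(cells):
    """How much larger the B-minus-A delta is on the eligible cohort."""
    return variant_delta(cells, "eligible") - variant_delta(cells, "non_eligible")


if __name__ == "__main__":
    # Hypothetical results: A is a buried lede, B is a front-loaded fact.
    results = {
        ("eligible", "A"): (1800, 100_000),
        ("eligible", "B"): (2100, 100_000),
        ("non_eligible", "A"): (2000, 100_000),
        ("non_eligible", "B"): (2100, 100_000),
    }
    print(f"delta on eligible:     {variant_delta(results, 'eligible'):+.4f}")
    print(f"delta on non-eligible: {variant_delta(results, 'non_eligible'):+.4f}")
    print(f"interaction:           {eligibility_interaction(results):+.4f}")
```

An interaction near zero reads as ordinary copy performance. A clearly positive one is the pattern you would expect if the suspected layer is doing the work.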

Cohort comparison by platform eligibility

The probe everyone reaches for first, and the weakest of the four on its own.

Push: your audience on Apple Intelligence-eligible devices (recent iPhones from the 15 Pro forward, M-series iPads and Macs, the iPad mini A17 Pro) versus identical-OS-but-older-hardware devices. Same kind of split for recent flagship Pixels versus older ones, and the Galaxy S26 family versus older Samsungs.

Email: your audience on Gmail (with default Promotions tab and AI Overviews) versus Outlook (with Focused Inbox and Copilot summaries) versus Yahoo versus Apple Mail with MPP versus corporate Office 365 without consumer Copilot.

The problem is identical in both channels: the cohorts are not random samples of each other. iPhone 16 Pro buyers in 2024 are systematically wealthier, more engaged, earlier adopters. Gmail users are different from Outlook users in age, occupation, income, brand affinity. The cohort difference is going to swamp the intermediation signal in any naive comparison.

You can try to control for this with matching or propensity scoring. The literature on selection bias in observational studies has a lot to say about how to do this, and most of it doesn't help much when the treated and untreated cohorts are this systematically different. Use the comparison as a sanity check, not a primary estimate. The within-cohort change around an intervention date (your DiD) is more informative than the cross-cohort comparison at any single moment.

Transactional sends as a within-user control

The tempting version of this is to treat transactional sends as a negative control: transactional pushes (order shipped, fraud alert, ride arriving) aren't bundled the way promotional pushes are; transactional emails (order confirmations, password resets) aren't routed to the Promotions tab the way promotional emails are. So compare the promotional-to-transactional engagement ratio across the eligible and non-eligible cohorts and read the difference as the intermediation effect.

It doesn't work. Transactional and promotional messages aren't comparable in the first place. A fraud alert and a flash-sale promo differ on intent, urgency, content, recipient state and baseline engagement, none of which is the intermediation you're trying to measure. Their ratio isn't stable for reasons that have nothing to do with the platform, so a difference in that ratio across cohorts tells you about the cohorts, not about summarisation. Apples to oranges.

The version that survives drops the cross-category comparison and uses transactional sends only as a within-user control on general engagement, in a DiD around an intervention date. Take users who receive both your transactional and your promotional messages. Around a release date (iOS 18.1, the Gmail AI Inbox launch), look at the change in promotional engagement and the change in transactional engagement for those same users. If promotional engagement drops while transactional engagement holds, the user's general willingness to engage didn't crater for some unrelated reason (seasonality, an app update, a deliverability problem), which makes the promotional drop more plausibly about the platform's treatment of promotional content specifically.

This is a robustness check on the cohort DiD, not a method of its own, and it carries its own assumption: that promotional and transactional engagement would have moved in parallel for these users absent the intervention. That's weaker than it sounds, since the two categories respond differently to plenty of things. The absolute gap between them is meaningless; never interpret it. All you're reading is whether they moved together or apart across the date, for the same people. Use it to break a tie when the cohort DiD is ambiguous, not as evidence on its own.
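A sketch of the bookkeeping this check involves, assuming a hypothetical record shape of one dict per user, category and period. It restricts to users who receive both categories and returns the two changes side by side, deliberately without computing any ratio or gap between them.

```python
def engagement_changes(records):
    """Pre-to-post change in click rate, per category, for users who get both.

    `records` is a list of dicts with keys: user_id, category
    ('promotional' or 'transactional'), period ('pre' or 'post'), sends, clicks.
    """
    categories = ("promotional", "transactional")
    seen = {}
    for r in records:
        seen.setdefault(r["user_id"], set()).add(r["category"])
    # Keep only users who receive both kinds of message.
    both = {user for user, cats in seen.items() if set(categories) <= cats}

    totals = {(c, p): [0, 0] for c in categories for p in ("pre", "post")}
    for r in records:
        if r["user_id"] in both and r["category"] in categories:
            cell = totals[(r["category"], r["period"])]
            cell[0] += r["clicks"]
            cell[1] += r["sends"]

    def rate(category, period):
        clicks, sends = totals[(category, period)]
        return clicks / sends if sends else float("nan")

    # Two changes, reported separately; the level gap is never interpreted.
    return {c: rate(c, "post") - rate(c, "pre") for c in categories}
```

If the promotional change is clearly negative while the transactional change sits near zero, that supports the cohort DiD. It does not replace it.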

The DiD around platform releases reads display and prominence shifts, since release dates are mostly about which messages are shown and how prominently. Sender-side ablation targets wording sensitivity most cleanly, since you're controlling the input the editor rewrites. Cohort comparison covers all three at once with selection-bias caveats.

None of the probes reads the wording layer in isolation. The layer the discourse is loudest about, Apple Intelligence eating your copy, Gemini rewriting your subject lines, Galaxy AI choosing words for you, is the one no sender-side probe can separate from the other two. You can read display shifts around iOS 18.1 with a DiD. You can read prominence shifts on the Gmail AI Inbox launch. You cannot read what the summary said versus what your copy said, on your own sends, with anything in this kit. The wording layer is opaque all the way down.

The operational kit for push

On top of the shared probes, these are push-specific.

Interruption level ablation

Time-sensitive notifications bypass Focus and bypass summarisation. Active notifications are eligible for both. Running an A/B test of active vs time-sensitive on the same audience cohort, with the same content, gives you a delta. The delta should be larger on Apple Intelligence-eligible devices than on non-eligible ones, because on non-eligible devices the time-sensitive flag doesn't change the summarisation pathway (there isn't one). If you don't see a larger delta on the eligible cohort, summarisation isn't doing much to you. If you do, you have a rough estimate of its size.
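A rough sketch of the comparison, assuming you can export clicks and sends for each of the four cells. The cell keys and the counts are invented for illustration, and nothing here tests significance.

```python
def ctr(clicks: int, sends: int) -> float:
    """Click-through rate, guarding against an empty cell."""
    return clicks / sends if sends else 0.0


# (clicks, sends) per cell; the numbers are made up.
cells = {
    ("eligible", "active"): (300, 10_000),
    ("eligible", "time_sensitive"): (420, 10_000),
    ("non_eligible", "active"): (310, 10_000),
    ("non_eligible", "time_sensitive"): (350, 10_000),
}


def level_delta(cohort: str) -> float:
    """CTR lift of time-sensitive over active within one cohort."""
    return ctr(*cells[(cohort, "time_sensitive")]) - ctr(*cells[(cohort, "active")])


# The part of the lift that only shows up where a summarisation pathway exists.
excess = level_delta("eligible") - level_delta("non_eligible")
```

An `excess` near zero is the "summarisation isn't doing much to you" reading. A clearly positive one is the rough size estimate, subject to the usual cohort selection caveats.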

Bundle composition

Send a single notification in variant A. Send four notifications within the bundling window in variant B. The CTR-per-notification on B should be lower on the summarisation-eligible cohort if bundling is collapsing your four sends into a single summary line. If it isn't lower, your sends aren't bundling, or the bundling isn't costing you anything.

Confirmed delivery instrumentation

Several lifecycle platforms expose a "confirmed delivery" or equivalent metric that fires from the host app's NotificationServiceExtension when the system asks the extension to process an incoming notification. Airship, OneSignal, and a couple of others have variants of this.8 

This is a thinner signal than the marketing copy suggests. The NSE runs before any user-visible rendering decision. It fires when APNs hands the payload to the device. It does not fire when the user sees the notification. It does not tell you whether the notification was bundled into a summary, displayed prominently, demoted to Notification Center, or silenced by Focus. It tells you that APNs reached the device and the OS asked your extension to do its job.

The delta between "accepted by APNs" and "confirmed delivered" is mostly Focus suppression and silent platform kills. It is not summarisation. Treating the confirmed-delivery metric as a summarisation signal is a category error. The metric is still useful as a Focus / silent-kill diagnostic. It just isn't doing the work you might think.

Time-to-session distribution

The shape of the time-to-session distribution after a send is mildly diagnostic. A bimodal distribution (immediate clicks plus a delayed cluster) is consistent with summary bundling, Focus-end delivery or proactive Notification Center pull, where users see the notification later than send time. Noise floor is high and modes are hard to identify without volume, so this is more useful as longitudinal change-detection ("did my delayed cluster get bigger after iOS 18.1") than as a single-campaign read.
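One way to turn the shape into a number you can track over time is the share of post-send sessions that land in the delayed cluster. The sketch below assumes you hold (send time, session start) pairs; the five-minute and 24-hour cut-offs are arbitrary assumptions to tune against your own data, and `delayed_share` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def delayed_share(pairs, immediate=timedelta(minutes=5), horizon=timedelta(hours=24)):
    """Fraction of sessions within `horizon` of the send that start after `immediate`."""
    gaps = [session_at - sent_at for sent_at, session_at in pairs]
    in_window = [g for g in gaps if timedelta(0) <= g <= horizon]
    if not in_window:
        return 0.0
    return sum(g > immediate for g in in_window) / len(in_window)


t0 = datetime(2026, 1, 1, 9, 0)
pairs = [
    (t0, t0 + timedelta(minutes=1)),   # immediate
    (t0, t0 + timedelta(minutes=2)),   # immediate
    (t0, t0 + timedelta(hours=3)),     # delayed cluster
    (t0, t0 + timedelta(hours=30)),    # outside the horizon, ignored
]
share = delayed_share(pairs)
```

Compute it per week and per eligibility cohort, then watch for a step change around a release date. A single campaign's value is mostly noise.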

MMP-side link conversion decomposition

Your MMP attributes re-engagement to push via embedded tracking links. The ratio of click-attributed to organic-near-send conversions across cohorts is the signal. If click-attributed conversions drop more than organic-near-send on the eligible cohort, the summary is likely stripping or de-emphasising the deep link. If both drop together, the user is simply seeing fewer notifications.
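A minimal sketch of that read, assuming you can pull click-attributed and organic-near-send conversion counts for the eligible cohort before and after a release. The function names and the 5-point tolerance are assumptions for illustration; the tolerance in particular should come from your own week-to-week variance.

```python
def relative_change(pre: int, post: int) -> float:
    """Fractional change from the pre window to the post window."""
    return (post - pre) / pre


def read_decomposition(click_pre, click_post, organic_pre, organic_post, tolerance=0.05):
    """Classify the movement of the two conversion types on one cohort."""
    click_drop = -relative_change(click_pre, click_post)
    organic_drop = -relative_change(organic_pre, organic_post)
    if click_drop - organic_drop > tolerance:
        return "click-attributed fell faster"
    if click_drop > tolerance and organic_drop > tolerance:
        return "both fell together"
    return "no clear shift"


# Made-up counts for the eligible cohort.
link_case = read_decomposition(1000, 700, 500, 480)
volume_case = read_decomposition(1000, 700, 500, 360)
```

The labels are prompts for reasoning, not verdicts. Run the same function on the non-eligible cohort before reading anything into the eligible one.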

The operational kit for email

The email-side probes, beyond the shared kit above.

Postmaster Tools and SNDS

Gmail Postmaster Tools and Microsoft SNDS (Smart Network Data Services) are the most generous platform-provided signals in any of the channels under discussion. Gmail Postmaster Tools gives you, per IP and per domain, spam rate as reported by users, IP reputation, domain reputation, authentication pass rates, delivery error breakdown, and now (since late 2025) a compliance dashboard against the bulk sender requirements. Microsoft SNDS gives you complaint rate, trap hit rate, IP reputation tier. Yahoo's Sender Hub gives you a more limited but similar set.

These are not summarisation signals. They are deliverability signals. They tell you whether your messages are reaching inboxes in aggregate, not whether any given message landed in Primary or Promotions, and definitely not whether Gemini or Copilot rewrote your subject line. They do, however, tell you when your overall deliverability is shifting in a way that matters, and they're the only platform-cooperative signal anyone has ever exposed in either of these channels.

The right use: set up Postmaster Tools and SNDS for every sending IP and domain. Track them weekly. Set alerts on spam complaint rate (the 0.3% Gmail threshold is the hard limit; the 0.1% recommendation is the safe one) and IP reputation tier changes. Use them as deliverability monitoring, not as engagement measurement.
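The alerting rule is simple enough to sketch. This assumes you already have the user-reported spam rate as a fraction, however you export it from the dashboard; it does not call any Postmaster Tools API, and `spam_rate_status` is a hypothetical helper.

```python
WARN_RATE = 0.001   # the 0.1% recommendation
LIMIT_RATE = 0.003  # the 0.3% Gmail threshold


def spam_rate_status(rate: float) -> str:
    """Map a user-reported spam rate (as a fraction) to an alert level."""
    if rate >= LIMIT_RATE:
        return "critical"
    if rate >= WARN_RATE:
        return "warn"
    return "ok"


# Made-up weekly readings per sending domain.
weekly = {"news.example.com": 0.0004, "offers.example.com": 0.0021}
alerts = {domain: spam_rate_status(rate) for domain, rate in weekly.items()}
```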

Mozilla/5.0 segmentation

The MPP proxy fetches images via a Mozilla/5.0-tagged user agent on an Apple-controlled IP range. This was the foundational segmentation insight of the post-MPP era and remains useful. Senders who segment their open data into "Apple-proxied" and "non-proxied" cohorts can use the non-proxied cohort as a remaining, partial, signal of real opens for the population that uses non-iOS clients.
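A sketch of the segmentation, assuming your open-tracking logs keep the raw user-agent string. Treating an exact, bare `Mozilla/5.0` as the proxy marker is a heuristic assumption here; check it against your own logs and the source IP range before relying on it.

```python
from collections import Counter


def is_probably_mpp_open(user_agent: str) -> bool:
    """Heuristic: a bare 'Mozilla/5.0' with no platform or browser tokens."""
    return user_agent.strip() == "Mozilla/5.0"


def segment_opens(user_agents):
    """Count opens by cohort label."""
    return Counter(
        "apple_proxied" if is_probably_mpp_open(ua) else "non_proxied"
        for ua in user_agents
    )


sample = [
    "Mozilla/5.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    " Mozilla/5.0 ",
]
counts = segment_opens(sample)
```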

The fraction of opens that come through MPP has been climbing every year as more users adopt iOS Mail. Litmus's email client market share data puts Apple Mail at around 50-60% of all opens globally; typical B2C lists run higher.13  The remaining cohort is small enough that opens are retired as a primary metric, even with the segmentation. The segmentation is useful for finance-side trend continuity and not much else.

The structural risk: the segmentation key itself is a shrinking signal. Apple has spent five years narrowing the fingerprinting surface area available to senders, and there's no architectural reason a future iOS couldn't collapse Mozilla/5.0-style identifiability on the push side the way MPP collapsed open-pixel identifiability on the email side. The same logic applies to device-model headers and any other "eligibility marker" the kit leans on. The proxy is a useful key today. It is a key the platform can close.

Inbox provider cohort comparison

Segment your list by mailbox provider domain: Gmail, Outlook/Hotmail/Live, Yahoo/AOL, Apple iCloud, corporate Office 365 (a separate population), corporate Google Workspace (another separate population), and a long tail. Compare CTR, conversion, and downstream metrics across providers for the same campaign.
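A first pass at the segmentation key can come straight from the address domain. The mapping below is a deliberately incomplete assumption covering only the big consumer domains. Corporate Office 365 and Google Workspace tenants send from their own domains, so separating them needs an MX lookup that this sketch does not attempt.

```python
# Illustrative and incomplete; extend with regional variants as your list requires.
CONSUMER_DOMAINS = {
    "gmail.com": "gmail",
    "googlemail.com": "gmail",
    "outlook.com": "outlook",
    "hotmail.com": "outlook",
    "live.com": "outlook",
    "yahoo.com": "yahoo",
    "aol.com": "yahoo",
    "icloud.com": "icloud",
    "me.com": "icloud",
    "mac.com": "icloud",
}


def provider_for(address: str) -> str:
    """Classify an email address by its domain; unknown domains fall into one bucket."""
    domain = address.rsplit("@", 1)[-1].strip().lower()
    return CONSUMER_DOMAINS.get(domain, "other_or_corporate")


cohorts = [provider_for(a) for a in ["Ann@GMAIL.com", "bob@hotmail.com", "cy@acme.example"]]
```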

This is the email equivalent of the device-eligibility cohort comparison and has the same selection-bias problems. Gmail users are not Outlook users. Corporate Office 365 users are even more distinct (they're at work, on laptops, during business hours, and have IT-managed spam filtering). Use as a sanity check, not a primary estimate. The within-cohort change around an intervention is more informative than the cross-cohort comparison. Cross-cohort comparisons also need more volume than within-cohort tests, because cohort-level variance compounds the sampling noise on top of everything else.

Tab placement and seed testing

Litmus, Email on Acid, GlockApps, Mailgun and a few others operate seed-list panels that report where your campaign landed across the major providers. Seed addresses have no engagement history with your sender, so the classifier behaviour they see is not the behaviour your real subscribers see. Treat seed-list placement as directional, useful for catching sudden Promotions-tab drift after a content or domain change, not for absolute placement claims.

One-click unsubscribe as feedback signal

Since the bulk sender requirements took effect in February 2024, all bulk promotional senders have to support one-click unsubscribe via RFC 8058 (List-Unsubscribe-Post header); the unsubscribe POST hits a sender-controlled URL the moment it happens.
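For reference, the header pair RFC 8058 describes looks like this when set with Python's standard `email` library. The URL and token are placeholders, and `add_one_click_unsubscribe` is a hypothetical helper; most ESPs set these headers for you.

```python
from email.message import EmailMessage


def add_one_click_unsubscribe(msg: EmailMessage, unsubscribe_url: str) -> None:
    """Attach the RFC 8058 header pair; the URL must be HTTPS and accept a POST."""
    msg["List-Unsubscribe"] = f"<{unsubscribe_url}>"
    msg["List-Unsubscribe-Post"] = "List-Unsubscribe=One-Click"


msg = EmailMessage()
msg["Subject"] = "June offers"
add_one_click_unsubscribe(msg, "https://example.com/unsub?token=abc123")
```

The endpoint behind that URL is where the measurement value sits: log the send_id carried in the token and the arrival timestamp of every POST.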

The new signal is timing. When in the relationship do users unsubscribe? Within an hour of receipt? At 6am the next day? After a specific campaign type, or three sends in a week? That's granular real-time data you couldn't collect before the requirements changed, and unusual patterns are diagnostic. The absolute unsubscribe rate, paired with the Postmaster Tools spam complaint rate, is also the cleanest engagement-decay measure available for an email list. Complaints, not opens, are what Google watches under complaint-based deliverability.

DMARC aggregate reports

DMARC aggregate reports (the "rua" tag) get sent to a mailbox you specify. They tell you which IPs are sending mail under your domain, whether the mail is passing SPF and DKIM alignment, and what the mailbox providers are doing about it. The reports are dense and not human-readable, but every DMARC aggregator (Dmarcian, DMARC Digests, Valimail and others) parses them into a dashboard. This is upstream of summarisation but downstream of nothing, and it's the closest you get to platform-cooperative reporting in email. Set it up if you haven't. While you're at it, BIMI sits on top of a fully aligned DMARC and gets your verified logo into supporting clients; not a measurement instrument, but deliverability table stakes worth claiming.
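If you want to look at the raw reports before buying an aggregator, the XML is simple to walk. The sketch below parses an inline sample using element names from the aggregate report format; real reports arrive as compressed attachments and some reporters add an XML namespace, neither of which this handles.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<feedback>
  <record>
    <row>
      <source_ip>192.0.2.10</source_ip>
      <count>120</count>
      <policy_evaluated>
        <disposition>none</disposition>
        <dkim>pass</dkim>
        <spf>pass</spf>
      </policy_evaluated>
    </row>
  </record>
  <record>
    <row>
      <source_ip>198.51.100.7</source_ip>
      <count>5</count>
      <policy_evaluated>
        <disposition>quarantine</disposition>
        <dkim>fail</dkim>
        <spf>fail</spf>
      </policy_evaluated>
    </row>
  </record>
</feedback>"""


def summarise(report_xml: str):
    """Flatten each <row> into a dict of source IP, volume and evaluated results."""
    root = ET.fromstring(report_xml)
    rows = []
    for row in root.iter("row"):
        rows.append({
            "source_ip": row.findtext("source_ip"),
            "count": int(row.findtext("count", "0")),
            "disposition": row.findtext("policy_evaluated/disposition"),
            "dkim": row.findtext("policy_evaluated/dkim"),
            "spf": row.findtext("policy_evaluated/spf"),
        })
    return rows


rows = summarise(SAMPLE)
# Volume that failed both aligned checks, i.e. mail you probably did not send.
unaligned = sum(r["count"] for r in rows if r["dkim"] == "fail" and r["spf"] == "fail")
```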

Reply rate as engagement

Email has a reply surface push doesn't. Most marketing senders use a no-reply address and squander it. The senders that actively encourage replies, asking questions, soliciting feedback, sending from a real human-named address, see reply rate as one of the strongest positive signals an inbox can read: a reply tells Gmail the user wants this sender's mail, which strongly favours Primary tab placement for future sends from the same address. Even a 1-2% reply rate on a campaign tends to produce broad inbox-placement improvements no other lever delivers as cleanly.

Make replies possible. Make replies easy. Read them and respond. Reply rate at typical levels (1-2%) needs at least a few thousand sends to read meaningful patterns, and ten thousand-plus to compare patterns across segments.
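The volume guidance is easier to feel with an interval around the rate. The sketch below uses a Wilson score interval, which is my choice of method for a small proportion and not something the platforms report; the reply counts are invented.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# The same 1.5% reply rate observed at two list sizes.
small = wilson_interval(15, 1_000)
large = wilson_interval(150, 10_000)
small_width = small[1] - small[0]
large_width = large[1] - large[0]
```

At a thousand sends the interval is wide enough that a 1% and a 2% segment are hard to tell apart. At ten thousand it narrows enough to start comparing segments.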

None of this gives you "did Gemini summarise this email" or "did Copilot prioritise it down to the Other inbox." Those questions are not answerable. Build the kit anyway.

Lifecycle platforms and ESP's

Most lifecycle platforms and ESP's market features that sound summarisation-aware or inbox-placement-aware. None of them actually expose per-send signals about what the editor did to your message; the AI-branded features the category ships are send-time and content optimisers, not intermediation-detectors.

A deliberate spread across tiers makes the point. Enterprise suites (Salesforce Marketing Cloud, Adobe Experience Cloud, Emarsys, Braze) have the deepest segmentation, orchestration and predictive-scoring stacks, wrapped in Einstein, Sensei and similar AI branding, with opens-and-clicks reporting underneath that comes out in the same shape it always has. Mid-market cross-channel platforms (Iterable, Klaviyo, HubSpot, Customer.io, MoEngage) differ in strengths (email-and-push parity, ecommerce focus, developer ergonomics, regional reach) and share the intermediation blind spot. Push and mobile specialists (Airship, OneSignal, Pushwoosh) all expose confirmed-delivery metrics with the same limitation: APNs reached the device, not user saw the message. SMB tools (Mailchimp, ActiveCampaign) serve the long tail with open-heavy, MPP-corrupted engagement reporting, and leave the shift to click-and-downstream to the customer.

The through-line is that across the entire category, regardless of tier, the available instrumentation for measuring intermediation is roughly the same. None of them tell you what the summary said. None of them tell you whether your notification was bundled or your email was demoted. None of them tell you whether Focus was active or whether Gemini chose to surface your message in the AI Inbox view.

Pick the platform that works best for your other requirements (audience model, channel orchestration, deliverability hygiene, native CDP, developer experience). Don't pick on the basis of "confirmed delivery" or "MPP-aware opens" or "inbox placement scoring" as though one vendor's version of these answered a different question than another's. They all answer the same questions, with roughly the same limitations.

The useful inferential probe a lifecycle platform or ESP can give you isn't any of those features. It's experimentation infrastructure: holdouts, A/B testing, time-to-conversion measurement, audience segmentation that lets you cut by device model, OS version, mailbox provider, and engagement history. That's where the kit actually lives. The summarisation-aware feature is marketing copy. The audience model and experiment runner is the thing you should care about.

Attribution lives in overlapping layers

It's tempting to split this cleanly: push attribution in the MMP layer (AppsFlyer, Adjust, Branch, Singular, Kochava), email attribution in the CDP or marketing analytics layer (Segment, mParticle, Snowflake-based stacks, Adobe Experience Platform, Tealium). It isn't clean, and pretending it is will mislead you. The layers overlap heavily. The MMP's moved past mobile-install attribution years ago into web, email and SMS link attribution and cross-channel people-based models; Branch is as much a cross-channel deep-linking layer as an install tracker. The CDP's ingest events from every channel, push and email and web alike. Your lifecycle platform (Braze, Iterable, Klaviyo) attributes its own push and email sends natively. Your web analytics captures the landing. A single email click can be claimed by the ESP's link wrapper, an MMP's cross-channel link, the CDP's event stream, and your web analytics all at once. The useful question is not which tool owns which channel. It's which one you treat as source of truth and how you reconcile the rest, because the more interesting cut runs by destination, not by channel: clicks that land in an app, and clicks that land on the web.

Clicks that land in an app. Post-ATT, the MMP's have all adapted their re-engagement attribution differently, and the mechanism is the same whether the click came from a push or an email. AppsFlyer's OneLink, Adjust's deep links, Branch's universal links, Singular's tracking links, Kochava's SmartLinks: all embed a tracking parameter in the deep link that fires on click. If the click happens, the MMP knows. If it doesn't, the MMP infers from probabilistic signals on iOS (much weaker since ATT) and deterministic signals on Android.

The probe that uses this layer is the link-conversion decomposition. The ratio of click-attributed to organic-near-send conversions across cohorts is the signal.

Clicks that land on the web. This attribution is older and more fragmented, and it serves email mostly but also any push or SMS click that opens a browser. The historical stack: UTM parameters on every link, cookie-based session tracking on the destination site, deterministic identity stitching where the user is signed in. All three have degraded.

UTM parameters still work and remain the foundation. Every campaign link should have utm_source, utm_medium, utm_campaign, utm_content tagged. Every analytics tool reads them.
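A small helper keeps the tagging consistent across campaigns. This is a sketch using only the standard library; `add_utm` is a hypothetical name, and it overwrites any UTM values already on the link.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_utm(url: str, source: str, medium: str, campaign: str, content: str) -> str:
    """Return `url` with the four UTM parameters set, keeping other query parameters."""
    parts = urlsplit(url)
    # Note: dict() keeps only the last value if a query key is repeated.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query.update({
        "utm_source": source,
        "utm_medium": medium,
        "utm_campaign": campaign,
        "utm_content": content,
    })
    return urlunsplit(parts._replace(query=urlencode(query)))


tagged = add_utm("https://example.com/sale?ref=1", "newsletter", "email", "june_sale", "hero_cta")
```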

Cookie-based session tracking has degraded for the obvious reasons. Third-party cookies are gone. First-party cookies are time-limited on Safari (7 days for client-side JavaScript cookies under ITP, longer for server-set cookies that aren't flagged by CNAME-cloaking detection), more limited on iOS by Apple's privacy controls, increasingly limited on Chrome. The user-as-cookie identity model is unreliable.

Identity stitching where the user is signed in is the reliable layer that remains. If the user clicks your email link and lands on your site signed in (because they clicked through to a logged-in surface), the click-to-conversion path is clean. The fraction of your audience that this works for is the fraction that's already authenticated. For most B2C senders that's a meaningful but minority slice.

The CDP layer is where you reconcile all of this. Segment or mParticle ingests the email click event from your ESP, the on-site session event from your site analytics, and the conversion event from your commerce or back-end systems. With a stable user_id (from your authenticated session), you can stitch the path. Without one, you're back to UTM-based attribution and probabilistic matching.
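The stitching step, reduced to its core, is a join on user_id with a time window. The sketch below assumes click and conversion events are plain dicts with the field names shown, and uses a last-click rule with a seven-day window; both the schema and the rule are assumptions, not what any particular CDP does.

```python
from datetime import datetime, timedelta


def attribute_conversions(clicks, conversions, window=timedelta(days=7)):
    """Pair each conversion with the same user's most recent click inside the window."""
    clicks_by_user = {}
    for click in clicks:
        if click["user_id"] is not None:
            clicks_by_user.setdefault(click["user_id"], []).append(click)

    attributed = []
    for conv in conversions:
        candidates = [
            c for c in clicks_by_user.get(conv["user_id"], [])
            if timedelta(0) <= conv["at"] - c["at"] <= window
        ]
        if candidates:
            last = max(candidates, key=lambda c: c["at"])
            attributed.append((conv["order_id"], last["send_id"]))
    return attributed


day = datetime(2026, 6, 1, 12, 0)
clicks = [
    {"user_id": "u1", "send_id": "s-101", "at": day},
    {"user_id": "u1", "send_id": "s-102", "at": day + timedelta(days=2)},
    {"user_id": None, "send_id": "s-103", "at": day},  # anonymous click, cannot be stitched
]
conversions = [
    {"user_id": "u1", "order_id": "o-1", "at": day + timedelta(days=3)},
    {"user_id": "u2", "order_id": "o-2", "at": day + timedelta(days=1)},
    {"user_id": None, "order_id": "o-3", "at": day + timedelta(days=1)},
]
paths = attribute_conversions(clicks, conversions)
```

Only the authenticated conversion with a prior click gets a send_id. Everything else falls back to UTM-based attribution and probabilistic matching, as described above.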

The implication for intermediation measurement: the email cohort comparisons and DiD's need to be run on authenticated users, where attribution is deterministic. The probabilistic side of the audience is more noise than signal. Build your measurement framework around the authenticated cohort. Accept that you have less coverage but more reliable inference.

The older intermediation layer

The summarisation panic is a 2024-2026 story. The intermediation story is much older.

Gmail tabs since 2013. Outlook Focused Inbox since 2016. Spam filtering since the early 2000s, now ML-based and opaque. Apple Mail Privacy Protection since 2021. Notification channels on Android since Android Oreo (2017). Notification interruption levels on iOS since iOS 15 (2021). Focus modes on iOS, Do Not Disturb on Android, "schedule send" features that hide messages until a designated time, the user-configured silencing of channels that they never wanted in the first place.

The user opting your Promotions channel to "Silent" in 2019 has been getting your pushes as silent Notification Center entries for seven years and you've never measured that effect. The user whose Gmail filter routes all your email to a "Marketing" folder they check monthly has been a non-engager for a decade. The user whose Outlook moves you to the Other inbox every time, and who only sometimes scrolls into it, has been doing this for ten years.

The summarisation story is a thin layer on top of a much larger intermediation story. The cohort metadata that lets you do good DiD's on Apple Intelligence is the same metadata that lets you do good analysis of channel-importance shifts and Focus-prone behaviour. The mailbox-provider segmentation that lets you do good DiD's on Gemini summaries is the same segmentation that's been letting senders run good tab-placement analysis for over a decade.

The other thing this view gives you is a sense of proportion. If you're worried about Apple Intelligence summarisation eating your CTR, you're worried about a layer that affects a substantial but not majority share of your iOS audience, with the affected slice on notification summaries specifically a subset of that.9  Meanwhile the much larger story of user-configured silencing has been going on for years and you've never measured it. Where you spend your measurement effort should reflect that.

The conversion surface

The platform surface is the discovery channel. The destination is the conversion surface.

For push that's the in-app experience, the deep-link landing screen, the in-app message you triggered on session start, the home tab you personalised. For email it's your website or app, the landing page and conversion funnel you built. In both cases the OS or mailbox provider gates the discovery half; the conversion half is fully under your control, and it's where the money actually moves. Optimise your measurement entirely around channel-surface CTR and you're optimising for the discovery half of a two-step funnel.

The right frame isn't "what was my channel CTR" but "what was my channel-to-destination conversion, and what was my destination-to-target-event conversion." The first is partly intermediated. The second isn't. The DiD's on platform releases are useful for the first and tell you nothing about the second, which is the part of your funnel that actually generates revenue. If channel-to-destination drops on a platform release but in-app or on-site conversion holds steady, the platform took some of the top-of-funnel and you adapted your destination to convert the remainder better; net effect, small. If both drop, you have a bigger problem.
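The two-step split is plain arithmetic, sketched here with invented counts for the windows before and after a platform release. `funnel` is a hypothetical helper and the inputs are whatever your analytics counts as sends, destination arrivals and target events.

```python
def funnel(sends: int, arrivals: int, target_events: int) -> dict:
    """Split the funnel into the intermediated step and the step you control."""
    return {
        "channel_to_destination": arrivals / sends,
        "destination_to_target": target_events / arrivals,
    }


before = funnel(sends=100_000, arrivals=3_000, target_events=300)
after = funnel(sends=100_000, arrivals=2_400, target_events=264)

# Relative change in each step across the release date.
shift = {step: after[step] / before[step] - 1 for step in before}
```

In this made-up case the first step falls by a fifth while the second rises by a tenth, which is the "platform took some of the top-of-funnel and the destination converted the remainder better" pattern.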

The strategic response to intermediation, in both channels, isn't to fight the channel surface. It's to shift the load to the surfaces the editor can't reach. Both prior pieces ended here, and the conclusion is the same for measurement as it was for strategy: the surfaces that arrive unedited are the ones with no model in the pipe at all. The logged-in screens inside your own product, your in-app inbox, SMS, physical mail, the loyalty surface where you are the platform. Nothing summarises, ranks, bundles or silences them, and you can measure them end to end. The marketers who do this well end up with those surfaces driving meaningfully higher conversion than the intermediated channel itself. The marketers who don't end up complaining about Apple Intelligence and Gmail Promotions at industry conferences. Decide which group you want to be in.

SMS is not actually outside the intermediation story; US carriers run their own spam filtering, RCS is rolling out on Google Messages with auto-categorisation that looks a lot like the Gmail Promotions tab, and per-message regulatory and compliance costs are climbing. It is freer than email and push today, not forever. And the in-app inbox, in-app message, and loyalty surface are post-discovery: they only work after the user has opened the app, which is the very behaviour the push surface was supposed to drive. Shifting load to them is the right move; it doesn't solve the top-of-funnel problem that push and email historically covered, which is to get the user back into the app at all. That problem stays with the intermediated channels.

The agentic shock

The forward-looking thread is the one that makes the entire measurement gap moot in a different way.

Apple, Google, Microsoft and Anthropic have been pushing on agentic capabilities since 2024: an agent that handles small tasks on the user's behalf, replying to messages, scheduling, summarising, eventually transacting. The state of the art is partial, the trajectory clear. Not the decisioning agents a sender runs to decide what to send, the Aampe-and-Hightouch category the AI gap piece covered, but the agent sitting on the recipient's device, deciding what the user ever sees and increasingly acting in their place.

When an LLM acts on a message without showing it to the user, the entire engagement-metric concept dissolves. A simple push example: you send "Your order is delayed. Tap here to reschedule delivery." The agent intercepts, recognises the user's standing preference ("always reschedule to the next morning if possible"), calls the rescheduling endpoint, updates the calendar. The user never saw the push. No click. There's a conversion: the rescheduling event fired, attributable to "in-app organic" or "uncategorised" in your analytics.

The email case is already further along. Gmail's AI Inbox view, rolled out in January 2026, reshapes the inbox around summaries and to-dos. Microsoft Copilot's Prioritize My Inbox actively ranks incoming messages and surfaces the high-priority ones across Outlook clients, demoting the rest. A related mobile-only Copilot view, Priority View, was pulled from iOS and Android in February 2026, but Prioritize My Inbox itself continues everywhere. Apple Mail summaries on iPhone do the same on a smaller scale. For users on these features, the inbox is not the inbox anymore. It's a summary, a ranked feed, a to-do list. The marketer's email is one input to whatever the agent decides to show.

Search has gone furthest of all. Google's AI Mode already fans a single query out into dozens of hidden subqueries, assembles a per-user answer from passages it selects by reasoning, and cites whom it chooses, so the brand is one input to a synthesis the user reads instead of visiting the site; Google's own guidance to publishers has started pointing them away from counting clicks.10  That is the receiver-side agent in production rather than in prospect, one channel over from the two here.

What does CTR mean in that world? Approximately nothing.

The right measurement frame is conversion attribution, with the agent as a new actor in the funnel. The sender's job stops being "write the most clickable subject line" and starts being "make the message structured enough that the agent can act on it reliably." This is much closer to a B2B API integration problem than to a marketing-copy problem. Three things follow.

First, the agent is a more aggressive intermediation layer than summarisation. Summarisation rewrites your copy; the agent skips it entirely. The policy that decides what the agent will do on the user's behalf, what payloads it treats as actionable, what it ignores as advertising, sits with the platform (Apple, Google, Microsoft, Anthropic), and the sender will see only what the agent did, never the policy that produced it.

Second, the sender-side observable becomes the agent's success rate at executing on your sends, not the user's click-through rate on your copy. The metric stops being "did the user engage" and becomes "did the agent route the user through my flow successfully." Closer to a deliverability KPI than an engagement one, and the lifecycle team's skillset starts to look more like the API integrations team's.

Third, the agent could change the commercial dynamic of consumer messaging. If most engaged users have their messages handled by agents and most non-engaged users have theirs summarised away, the marginal human reader who actually decides whether to click is going to be smaller and more atypical, and the audience for traditional lifecycle marketing collapses to the people whose agents haven't yet captured the relevant behaviours, a shrinking set.

The implication for the measurement framework: the conversion-based, destination-shifted frame is also the right one for the agentic future. The companies that have already moved their measurement weight off channel CTR and onto downstream conversion will adapt to agents reasonably well. The companies still measuring everything on push CTR or email opens are going to spend the next four years staring at metrics that mean nothing, and the cohort drifting into agent-mediated behaviour first is the one with the highest spending power.

Most of the measurement kit is for the current intermediation, the editor on the wire that summarises, ranks and bundles for a human reader. The agentic state is a different problem and the playbook shifts with it: the work moves to schema and structured payloads the agent can parse, the success metric becomes the agent's execution rate on your intent, and within-user controls like the transactional probe weaken further because agents handle transactional and promotional sends with very different policies. The destination-conversion frame is the through-line that survives the transition; most of the rest is for the present and won't all carry forward.

Cleaner inputs make the platform's agent work better, which creates a direct incentive for platforms to publish a structured-payload schema. That cuts against the gap-won't-close thesis at first reading: here is a form of cooperation that would arrive, and arrive soon. But it is cooperation on the platform's terms, in the direction that helps the platform's agent, not the sender's measurement. Two different schemas, two different beneficiaries. Senders get one of them. Agentic-era standardisation is the platforms confirming the asymmetry, not relenting on it, and reading it as the latter is reading the wrong direction of the cooperation.

The kit, then, is for the two-to-four-year transition before agents capture most consumer messaging interactions among the cohorts that spend the most. During that window, the kit reads the world. After it, the destination-conversion frame is what survives and most of the methodology goes obsolete. You don't get to skip the transition by predicting its end. Build for the years the kit covers, on the understanding that you are building something perishable.

The harder half

Volume was the first hurdle. Comprehension is the next, and the bigger one for the senders who clear scale.

Most marketers don't really understand data. I've argued that before, and the measurement kit here demands exactly the literacy the discipline has mostly never built. Every probe here hands you a distribution, not a verdict. A DiD gives you an effect size with a confidence interval, conditional on a parallel-trends assumption you can't fully verify. A cohort comparison gives you a number you have to discount for selection bias you can't fully remove. A time-to-session distribution gives you a shape to read, not a fact to report. A link-conversion decomposition gives you a ratio whose movement you have to reason about rather than announce.

This is the same shift the move from decision support to decisioning forces on the rest of lifecycle marketing. An A/B test hands you a winner and a loser. The intermediation kit hands you a shifting set of bets and asks you to trust the holdout rather than the dashboard, to act on a distribution rather than a result. Operating it takes a comfort with probability, with uncertainty, with not being able to point at the single thing that worked, that the discipline has mostly never had to develop.

That's the binding constraint, more than any tool gap. The platform won't tell you what it did. The inference that replaces seeing is probabilistic, and probabilistic inference is the thing marketing has always flinched from. You can buy the kit. You can't buy the comfort with uncertainty that makes the kit mean anything. That part has to be built, and building it is a hiring and a culture problem, not a software one.

What I might be wrong about

If platforms optimise for engagement, why would they suppress good messages? Because the platform's objective function is not your CTR. It's session time, retention, and complaint rate at the user level, over a long horizon. A push that gets a click but earns a "low priority" mark or a future swipe-away can be net-negative for the platform on a longer time scale than your campaign report runs. Good messages on your terms can be costly messages on the platform's, and the platform's terms are the ones the editor optimises against.

Maybe the gap is smaller than marketers think. Possibly. If summarisation has high overlap with the original wording, Promotions-tab placement closely matches "this is promotional content," and Focus-mode suppression matches "the user doesn't want this right now," then the platform is mostly doing what the user wants and the marketer's frustration is misdirected at the editor when it belongs with the strategy or the relevance of the message. The measurement kit here doesn't separate "the platform suppressed a good message" from "the user told the platform to suppress this kind of message"; it tells you the marketer-visible metric moved. If user-perceived value moved with it, the right response is different and outside the scope here.

Maybe intermediation improves overall user engagement, just not yours. Plausible, and partly the platform's pitch. Summarisation makes notifications less annoying; users may engage more with the device because of it; the marketer's CTR drop coexists with the platform's session-time rise. Both are real and the platform's metric is the one that pays for the feature.

Maybe the inductive argument is wrong. The thesis bets that fifteen years of trajectory will continue, that the IETF brotman draft is a one-off rather than the leading edge of a slow opening, and that no major platform will reverse on summarisation transparency over the next decade. The counter-reading: regulatory pressure under DMA-style frameworks intensifies, the brotman draft sees working-group adoption and ships, placement reporting follows, mailbox providers compete on transparency in a way they currently don't, and senders end up with the kind of per-message intermediation API the kit treats as impossible. I'd put the probability low. It is not zero. A meaningful chunk of the kit's perishability has the trajectory continuing as a load-bearing assumption, and the trajectory bending is the obvious objection to the central claim.

Granting all of this, the measurement gap remains. Even on the most generous reading of platform motivation, the sender's ability to optimise their own work depends on seeing what was done to it, and that is the visibility the platform isn't going to offer. The prescriptions hold whether the editor is benevolent, indifferent, or self-interested.

Building the in-house framework

Capture the right metadata at send time. Store all of it against the send_id. The marginal cost is small and the analytic value compounds.

  • For push: device model, OS version, app version, notification channel (Android) and importance level, interruption level (iOS), app foreground or background state on receipt (from your NSE), and the user's local time.
  • For email: mailbox-provider domain mapping, MX-record-derived provider classification, the user-agent of the last engagement (for MPP segmentation), Postmaster Tools and SNDS reputation tier for the sending IP and domain, BIMI status, and DMARC alignment status.
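
A minimal sketch of what "store all of it against the send_id" might look like, assuming Python dataclasses as the record shape. The class names, field names and the to_row helper are mine for illustration, not any vendor's schema; map them onto whatever your warehouse already calls these things.

```python
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional, Union


@dataclass(frozen=True)
class PushSendMetadata:
    """Hypothetical per-send record for push; field names are illustrative."""
    send_id: str
    device_model: str
    os_version: str
    app_version: str
    user_local_time: datetime
    android_channel: Optional[str] = None         # Android only
    android_importance: Optional[str] = None      # Android only
    ios_interruption_level: Optional[str] = None  # iOS only
    app_state_on_receipt: Optional[str] = None    # "foreground" or "background", as reported by the NSE


@dataclass(frozen=True)
class EmailSendMetadata:
    """Hypothetical per-send record for email; field names are illustrative."""
    send_id: str
    recipient_domain: str
    provider_class: str                           # derived from MX records
    bimi_present: bool
    dmarc_aligned: bool
    last_engagement_user_agent: Optional[str] = None  # used for MPP segmentation
    ip_reputation_tier: Optional[str] = None          # Postmaster Tools / SNDS
    domain_reputation_tier: Optional[str] = None


def to_row(record: Union[PushSendMetadata, EmailSendMetadata]) -> dict:
    """Flatten a record into a dict ready for a warehouse insert, tagged by channel."""
    row = asdict(record)
    row["channel"] = "push" if isinstance(record, PushSendMetadata) else "email"
    return row
```

Keeping the platform-specific fields optional means one table per channel can hold both iOS and Android sends without sentinel values.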

Maintain clean eligibility cohort flags. This list will grow on both sides; keep it current.

  • For push: Apple Intelligence-eligible devices are the iPhone 15 Pro and 15 Pro Max and every iPhone 16 or later, plus M-series iPads and Macs and the iPad mini A17 Pro. Pixel-summary-eligible devices are recent flagship Pixels (9 and later, excluding the 9a). One UI 8.5 Notification Highlights covers the Galaxy S26 family and later.
  • For email: Gmail consumer, Gmail Workspace (a different cohort, less aggressive AI), Outlook consumer, Outlook 365 with Copilot, Yahoo, Apple Mail with MPP active, and iCloud Mail.
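
For the email side, the cohort flag can be sketched as a lookup on the recipient domain with an MX fallback for hosted domains. This is a rough sketch under stated assumptions: the function name, cohort labels and domain table are hypothetical and deliberately incomplete, the MX hosts are assumed to have been resolved elsewhere, and the MX suffixes are the commonly seen ones for Google Workspace and Microsoft 365. MX records cannot tell you whether Copilot is switched on for a tenant, and the MPP cohort is a client-level flag that has to come from engagement user agents, not from the domain.

```python
from typing import Iterable

# Illustrative, incomplete table of consumer mailbox domains.
CONSUMER_DOMAINS = {
    "gmail.com": "gmail_consumer",
    "googlemail.com": "gmail_consumer",
    "outlook.com": "outlook_consumer",
    "hotmail.com": "outlook_consumer",
    "live.com": "outlook_consumer",
    "yahoo.com": "yahoo",
    "icloud.com": "icloud",
    "me.com": "icloud",
    "mac.com": "icloud",
}

# MX host suffixes used to spot hosted (business) mailboxes.
MX_SUFFIXES = (
    ("google.com", "gmail_workspace"),
    ("googlemail.com", "gmail_workspace"),
    ("protection.outlook.com", "outlook_365"),
)


def classify_mailbox_cohort(address: str, mx_hosts: Iterable[str]) -> str:
    """Return a cohort label for one recipient address.

    mx_hosts is the list of MX hostnames already resolved for the
    recipient's domain; this function does no network lookups.
    """
    domain = address.rsplit("@", 1)[-1].strip().lower()
    if domain in CONSUMER_DOMAINS:
        return CONSUMER_DOMAINS[domain]
    for host in mx_hosts:
        host = host.rstrip(".").lower()
        for suffix, cohort in MX_SUFFIXES:
            if host == suffix or host.endswith("." + suffix):
                return cohort
    return "other"
```

Checking the consumer table before the MX records keeps the common case cheap and avoids a DNS dependency for the bulk of a B2C list.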

Run quarterly experiments. The cadence matters more than the specific cell design.

  • At least one DiD around each major platform release, on both channels.
  • At least one sender-side ablation per quarter: interruption level and copy structure for push; subject line and preheader structure for email.
  • At least one bundle-composition test for push.
  • At least one cohort-specific creative test for email: a Promotions-tab-optimised variant against a Primary-tab-optimised variant on a Gmail-only cohort.

The transactional within-user control belongs in the kit but not on the quarterly cadence; reach for it only when the cohort DiD comes back ambiguous on a specific release. The cross-cohort comparison is a sanity check on the DiD, not a primary read.
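
The cohort DiD around a release is, at its core, four rates and a subtraction. A minimal sketch, assuming the treated cohort is the eligible devices or mailboxes, the control cohort is the ineligible ones, and you have clicks and deliveries for each in a window before and after the release. The function names and the numbers are invented for illustration. This gives a point estimate only: it carries no standard error, and it is only as good as the parallel-trends assumption behind it.

```python
def ctr(clicks: int, delivered: int) -> float:
    """Click-through rate for one cohort in one window."""
    if delivered <= 0:
        raise ValueError("delivered must be positive")
    return clicks / delivered


def did_estimate(treated_pre: float, treated_post: float,
                 control_pre: float, control_post: float) -> float:
    """Classic 2x2 difference-in-differences point estimate.

    (change in the treated cohort) minus (change in the control cohort).
    """
    return (treated_post - treated_pre) - (control_post - control_pre)


# Illustrative numbers only, not measured data.
effect = did_estimate(
    treated_pre=ctr(400, 10_000),   # eligible cohort, before the release
    treated_post=ctr(340, 10_000),  # eligible cohort, after the release
    control_pre=ctr(380, 10_000),   # ineligible cohort, before
    control_post=ctr(370, 10_000),  # ineligible cohort, after
)
# effect is roughly -0.005: about half a percentage point of CTR
# attributable to the release, if parallel trends holds.
```

In practice you would fit this as a regression with cohort and period terms so that you get an interval as well as a point estimate; the subtraction above is just the quantity that regression estimates.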

Shift measurement weight off channel-surface CTR. Optimise for channel-to-destination conversion (push-to-in-app-session, email-click-to-on-site-session) and for destination-to-target-event conversion (in-app-session-to-purchase, on-site-session-to-purchase). Both are under your control. Neither is intermediated by the OS or the mailbox provider. The CTR and open metrics are going to become less reliable, not more, over the next several years. Build for that.
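
The two conversion rates can be sketched from three counts per campaign. The names below are hypothetical, and the choice of denominator (sends rather than confirmed deliveries) and of attribution window is an assumption you would need to fix for your own stack.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FunnelCounts:
    """Counts for one campaign within one attribution window (illustrative)."""
    sends: int                 # pushes or emails sent
    destination_sessions: int  # in-app or on-site sessions attributed to the send
    target_events: int         # purchases (or other target events) in those sessions


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def channel_to_destination(c: FunnelCounts) -> Optional[float]:
    """Share of sends that produced a session at the destination."""
    return _ratio(c.destination_sessions, c.sends)


def destination_to_target(c: FunnelCounts) -> Optional[float]:
    """Share of destination sessions that produced the target event."""
    return _ratio(c.target_events, c.destination_sessions)
```

Returning None for an empty denominator, instead of zero, keeps campaigns with no sessions from being read as campaigns with a 0% conversion rate.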

Build the email signal kit around what actually works.

  • Reply rate.
  • Click-and-downstream conversion.
  • One-click unsubscribe timing.
  • Postmaster Tools complaint rate.
  • DMARC alignment.
  • The cohort that authenticates on click-through, where attribution is deterministic.

Track platform announcements. Each is a natural-experiment opportunity; set a calendar reminder against each major one and have the DiD design ready before the release ships.

Hire for the analyst, not the tool. The leverage is in the experimentation discipline and the inferential statistics. The lifecycle platform and ESP are the substrate. A good analyst with Braze, Iterable, Klaviyo, or any of the other major platforms can produce better intermediation measurement than a bad analyst with whatever the most expensive tool happens to be. The market for lifecycle tools is mostly competing on feature inventory and dashboards. The market for lifecycle measurement is competing on analytical depth. Hire and budget accordingly.

Closing

The mistake is treating any of this as a problem to be solved. It isn't. It's a constraint to be designed around. The measurement framework is the design.

Build the kit. Run the experiments. Capture the metadata. Stop arguing with the trade press about Apple Intelligence and the Promotions tab and the AI Inbox. Each summarisation panic is a thin layer on a deep intermediation story, and the deep intermediation story is only going to get deeper.

The measurement gap won't close. Get good at counting in the dark.

  • 1

    Google's official reason for excluding the Pixel 9a from notification summaries is that 8GB of RAM isn't enough to run the on-device Gemini Nano model. Whether you find that credible probably depends on how generously you view Google's product segmentation.

  • 11

    Christian Kroer et al., "Fair Notification Optimization: An Auction Approach," 2023, on Meta's Instagram notification system: a 0.42% lift in click-through measured across 77 million users per arm. https://arxiv.org/abs/2302.04835

  • 12

    Klaviyo, 2026 Omnichannel Benchmark Report, places B2C campaign click rate at 1.69% average (3.38% top decile); Mailchimp's 2025 industry benchmarks land between 1.9% and 3.4%. 2% is a defensible round number for the typical B2C campaign send. https://www.klaviyo.com/uk/blog/email-marketing-benchmarks-open-click-and-conversion-rates

  • 2

    Alex Brotman (Comcast), Tom Corbett (Iterable), Emil Gustafsson (Google), "Aggregate Performance Reporting," Internet-Draft draft-brotman-aggregate-performance-reporting-00, IETF, 17 March 2026. Defines a JSON reporting format for mailbox providers to send daily aggregate classification and engagement data to senders by DKIM domain. Intended Standards Track; currently at I-D Exists, no working-group adoption. https://datatracker.ietf.org/doc/draft-brotman-aggregate-performance-reporting/

  • 3

    When the European Commission designated its first gatekeepers under the Digital Markets Act on 6 September 2023, Gmail and Outlook.com met the quantitative thresholds but were not designated, the Commission accepting Alphabet's and Microsoft's arguments that the two email services did not serve as important gateways. https://ec.europa.eu/commission/presscorner/detail/en/ip_23_4328

  • 4

    Google, "Email sender guidelines," effective February 2024: bulk senders must authenticate with SPF, DKIM and DMARC, support one-click unsubscribe per RFC 8058, and keep reported spam rates below 0.3%. Yahoo announced parallel requirements; Microsoft followed with similar rules from May 2025. https://support.google.com/a/answer/81126

  • 5

    Apple introduced Mail Privacy Protection in iOS 15 in September 2021; it proxies image loads through Apple-controlled infrastructure, and the Mozilla/5.0 user agent on those fetches is the canonical signature senders use to identify proxied opens. https://support.apple.com/en-us/102320

  • 6

    Specifically, that Luigi Mangione, the man charged with murdering UnitedHealthcare CEO Brian Thompson, had shot himself. He had not. Reporters Without Borders called for the feature to be dropped entirely. Apple's response was to label it more clearly as a beta, italicise the summary text, and quietly disable it for news and entertainment apps in iOS 18.3. It was re-enabled, now opt-in, in iOS 26 in September 2025.

  • 7

    For the modern treatment, including the parallel-trends assumption and its common failure modes, Jonathan Roth, Pedro H. C. Sant'Anna, Alyssa Bilinski and John Poe, "What's Trending in Difference-in-Differences? A Synthesis of the Recent Econometrics Literature," Journal of Econometrics, 2023. https://arxiv.org/abs/2201.01194

  • 8

    See Airship's Confirmed Delivery documentation (https://docs.airship.com/guides/messaging/user-guide/messages/reports/) and OneSignal's confirmed-delivery reporting; both fire from the iOS Notification Service Extension when the OS hands the payload to the app's extension, before any rendering decision.

  • 13

    Litmus, "Email Client Market Share." Apple Mail's share of all opens has run in the 50-60% band across 2024 and 2025, with the MPP-impacted subset accounting for most of that. B2C-skewed lists with heavy iPhone audiences typically land higher than the global average. https://www.litmus.com/email-client-market-share

  • 9

    Apple does not publish Apple Intelligence opt-in or daily active use figures. The third-party estimate I'm working from puts Apple Intelligence-enabled devices at roughly 940 million in Q1 2026 with around 410 million daily active users across all surfaces, against an iPhone active base of about 1.56 billion. That's roughly a quarter of the iPhone base by DAU, with wide error bars and no first-party confirmation. https://presenc.ai/research/apple-intelligence-usage-statistics-2026

  • 10

    For the mechanics, Mike King, "How AI Mode Works and How SEO Can Prepare for the Future of Search," iPullRank, 2025, which sets out the query fan-out, per-user embeddings and passage-level synthesis described in Google's own AI Mode patents and explainer. A practitioner teardown rather than a first-party source, and one selling a remediation ("Relevance Engineering"), so read it as the engineer's case, not a neutral one. https://ipullrank.com/how-ai-mode-works