Postmortem: 7 customers received ~570 SMS each in 4 hours

Petr Guskov, April 24, 2026

Tags: engineering, postmortem, reliability

On April 24, 2026, between 10:40 UTC and ~14:40 UTC, our SMS pipeline re-sent the same 7 notifications repeatedly. 3,987 SMS were delivered across 7 recipients before we caught it. Each person received ~570 messages to their phone over 4 hours.

3,987 SMS delivered

7 customers affected

Who was affected

7 customers. Each received ~570 SMS to their phone over the 4-hour window. That is not a small thing: regardless of any other metric, ~570 messages to one phone in 4 hours is a terrible experience. We owe each of them an apology, and we sent one to each personally, with an explanation of the bug.

Root cause

Three independent layers had to fail for this to happen, and they did:

Layer 1: Incomplete code branch

The tiered branch of our pricing code was never implemented, so for tiered Stripe pricing it returned nil instead of computing a value.

Layer 2: Wrong ordering

The pricing call ran after Twilio confirmed delivery but before we wrote sent_at to our database. When pricing crashed, the row never moved out of pending.

Layer 3: Aggressive retries

Our scheduler retried pending notifications every minute for 4 hours, and each of those runs was multiplied by Oban's per-job retry of 3.

The DB-level uniqueness key prevented duplicate notification rows from ever being created. But it didn't prevent the same row from being sent many times.
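
To make the interaction between the three layers concrete, here is a toy model of the failure. It is an illustrative sketch in Python, not our production code: every name in it (tiered_price, buggy_deliver, scheduler_tick, the dedupe key) is hypothetical, and it assumes the crash came from using the nil price in a later calculation.

```python
def tiered_price(quantity):
    """Stands in for the unimplemented tiered branch: returns None (nil)."""
    return None


def buggy_deliver(row, provider_calls):
    """Layer 2: the fallible pricing step sits between the send and the DB write."""
    provider_calls.append(row["key"])   # 1. provider confirms delivery
    cost = tiered_price(1) * 1          # 2. pricing crashes on None (TypeError)
    row["status"] = "sent"              # 3. never reached, so the row stays pending


def scheduler_tick(rows, provider_calls):
    """Layer 3: every tick re-runs whatever is still pending."""
    for row in rows.values():
        if row["status"] == "pending":
            try:
                buggy_deliver(row, provider_calls)
            except TypeError:
                pass  # the job fails; nothing has moved the row out of pending


# The uniqueness key does its job: inserting twice still yields one row.
rows = {}
rows.setdefault("order-42:shipped", {"key": "order-42:shipped", "status": "pending"})
rows.setdefault("order-42:shipped", {"key": "order-42:shipped", "status": "pending"})

provider_calls = []
for _ in range(5):  # five scheduler ticks
    scheduler_tick(rows, provider_calls)
```

After five ticks there is still exactly one row, it is still pending, and the provider has been called five times. That is the gap the uniqueness key could not close.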

Detection

We caught it from a Twilio billing alert, not from internal monitoring. That's a separate problem we're fixing.

What we shipped immediately

A new :processing state and an atomic claim (UPDATE ... WHERE status = 'pending'), so two workers can't process the same row.
"Commit-first" ordering: we now mark the row :sent with the provider's message ID in its own atomic SQL statement, before anything fallible runs. Pricing, analytics, etc. happen separately and can fail without triggering a retry.
A reaper that resets any notification stuck in :processing for more than 5 minutes to :failed, so even a crashed worker can't leave a row in limbo.
Sentry alerting on the critical paths.
Removed all try/rescue blocks in favor of explicit pattern matching, so unexpected return values crash loudly instead of being silently absorbed.
The underlying tiered-pricing bug, properly fixed.
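
The atomic claim and the commit-first ordering can be sketched roughly as follows. This is a minimal Python and SQLite illustration under assumed names (the notifications table, its columns, claim, mark_sent, deliver, and the fake provider and pricing functions are all hypothetical), not the code we shipped.

```python
import sqlite3


def claim(conn, notification_id):
    """Atomically move pending -> processing. Only one caller can win."""
    cur = conn.execute(
        "UPDATE notifications SET status = 'processing' "
        "WHERE id = ? AND status = 'pending'",
        (notification_id,),
    )
    conn.commit()
    return cur.rowcount == 1


def mark_sent(conn, notification_id, provider_message_id):
    """Commit-first: record the send in its own statement before anything fallible."""
    cur = conn.execute(
        "UPDATE notifications SET status = 'sent', provider_message_id = ? "
        "WHERE id = ? AND status = 'processing'",
        (provider_message_id, notification_id),
    )
    conn.commit()
    return cur.rowcount == 1


def deliver(conn, notification_id, send_sms, record_price):
    if not claim(conn, notification_id):
        return False                       # already claimed or already terminal
    message_id = send_sms(notification_id)  # if this raises, the row stays 'processing'
    mark_sent(conn, notification_id, message_id)
    record_price(notification_id)          # may raise; the row is already 'sent'
    return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notifications ("
    "id INTEGER PRIMARY KEY, "
    "status TEXT NOT NULL DEFAULT 'pending', "
    "provider_message_id TEXT)"
)
conn.execute("INSERT INTO notifications (id) VALUES (1)")
conn.commit()

sent = []


def fake_send(notification_id):
    sent.append(notification_id)
    return f"SM{notification_id}"


def broken_pricing(notification_id):
    raise ValueError("tiered pricing not implemented")


# Simulate a job runner retrying the same notification three times.
for _ in range(3):
    try:
        deliver(conn, 1, fake_send, broken_pricing)
    except ValueError:
        pass  # pricing failed, but only after the row was committed as sent

final_status = conn.execute(
    "SELECT status FROM notifications WHERE id = 1"
).fetchone()[0]
```

In this sketch the first attempt sends and commits, pricing then fails, and the two retries lose the claim because the row is no longer pending. The fake provider is called once.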

The full state machine: pending → processing → sent | failed. Every transition is a single atomic SQL UPDATE. There is no state where a provider call can succeed without the row reaching a terminal state.
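
The reaper transition (processing → failed after 5 minutes) fits the same pattern of one atomic UPDATE per transition. A minimal sketch, assuming a hypothetical claimed_at column that stores the claim time as epoch seconds; the table layout and the function name are illustrative only.

```python
import sqlite3
import time

STUCK_AFTER_SECONDS = 5 * 60  # the 5-minute threshold described above


def reap_stuck(conn, now=None):
    """Move rows stuck in 'processing' for more than 5 minutes to 'failed'."""
    now = time.time() if now is None else now
    cur = conn.execute(
        "UPDATE notifications SET status = 'failed' "
        "WHERE status = 'processing' AND claimed_at < ?",
        (now - STUCK_AFTER_SECONDS,),
    )
    conn.commit()
    return cur.rowcount  # number of rows reaped


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notifications ("
    "id INTEGER PRIMARY KEY, status TEXT NOT NULL, claimed_at REAL)"
)
conn.executemany(
    "INSERT INTO notifications VALUES (?, ?, ?)",
    [
        (1, "processing", 1000.0),  # claimed 301 s before 'now': stuck
        (2, "processing", 1290.0),  # claimed 11 s before 'now': still in flight
        (3, "pending", None),       # never claimed: untouched
        (4, "sent", 0.0),           # terminal: untouched
    ],
)
conn.commit()

reaped = reap_stuck(conn, now=1301.0)
statuses = dict(conn.execute("SELECT id, status FROM notifications ORDER BY id"))
```

Because the WHERE clause names the source state, the reaper can only ever touch rows in processing. Pending and terminal rows are never affected.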

What we're investing in to prevent the next class of bug

Property-based tests that generate random worker interleavings and assert: "Twilio is called at most once per notification, no matter what fails when".
TLA+ specification of the send pipeline. We'll model-check the state machine and verify the invariants ("at most one send", "no stuck state", "exclusive claim") at the design level. Amazon has publicly described using TLA+ on systems including DynamoDB and S3.
Per-recipient rate limit as a circuit breaker. Even a future bug that makes it past everything else can't send 570 messages to one phone.
Spend-rate alerts so we hear about anomalous Twilio cost in minutes, not hours.
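
As an illustration of the per-recipient circuit breaker, here is a small sliding-window limiter in Python. It is a sketch only: the class name, the in-memory storage, and the limit of 5 messages per hour are assumptions for the example, and a real deployment would need state shared across workers.

```python
from collections import defaultdict, deque


class RecipientRateLimiter:
    """Allow at most max_messages per recipient within any window_seconds span."""

    def __init__(self, max_messages, window_seconds):
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        self._sent = defaultdict(deque)  # recipient -> timestamps of allowed sends

    def allow(self, recipient, now):
        timestamps = self._sent[recipient]
        # Drop sends that have aged out of the window.
        while timestamps and timestamps[0] <= now - self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_messages:
            return False  # breaker open: refuse to call the provider
        timestamps.append(now)
        return True


# Replay the incident shape: one attempt per minute to one phone for 4 hours.
limiter = RecipientRateLimiter(max_messages=5, window_seconds=3600)
results = [limiter.allow("+15550100", minute * 60) for minute in range(240)]
allowed = sum(results)
```

With these illustrative limits, 20 of the 240 attempts get through (5 per hour) instead of all of them. The check belongs immediately before the provider call, so it holds even if everything upstream is wrong.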

What we're NOT doing and why

We considered formal proof in Lean/Coq. We're not doing it. The cost is months of work to maintain the algorithm in two languages, the benefit beyond TLA+ is marginal, and the honest risk is that we'd ship the verifier without it ever finding a bug TLA+ wouldn't have.

We considered switching from Twilio to a provider with native idempotency keys so we could let them dedupe instead of doing it ourselves. Neither Twilio nor AWS SES exposes this. The right place for the guarantee is in our code, where we now have it.