Architecture

Taming 50 Million Callbacks with Event-Driven Architecture

A legacy .NET HttpHandler buried inside the customer portal was processing webhook callbacks synchronously — and at 20M+ messages a month, vendor retry storms inflated that to 75 million callbacks with 90-second processing latency. We replaced it with an Azure Function that acknowledges in milliseconds and routes to channel-isolated processors via Service Bus, dropping latency to sub-second and eliminating the retry cascade entirely.


Key Tradeoffs

  • Speed over completeness. Acknowledge first, process later.
  • Isolation over simplicity. Channel failures stay contained.
  • Serverless cost for predictable scale. Pay per execution to absorb spikes.
  • Standardization over flexibility. One generic envelope for every vendor.
  • Independent deploys, coordinated schemas. Processors ship separately; the envelope versions together.

What happened

The platform was a multi-channel communications SaaS — customers composed HTML messages and broadcast them as email, SMS, voice, or any combination across their contact base.

  • 20M+ SMS / month
  • 5M+ emails / month
  • Thousands of voice minutes / month

Customers reported delivery results were missing or delayed by hours. Outbound delivery was confirmed working — the bottleneck was the callback webhook handler ingesting delivery receipts, status updates, and opt-outs from vendors.

Root Cause

The callback processor was a .NET Framework HttpHandler baked into the main portal's IIS app pool. Application Insights showed ~75M callbacks/month — far exceeding actual message volume — with 90–120 second processing latency at peak. The excess was vendor retries: every unacknowledged callback triggered exponential backoff re-sends, creating a feedback loop that worsened under load.
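The inflation follows from simple retry arithmetic: if a fraction p of callbacks time out before acknowledgment and each timed-out send is re-sent, total volume forms a geometric series summing to N / (1 − p). A back-of-envelope sketch (the specific fractions are illustrative, inferred from the ~75M observed volume versus actual message counts):

```python
def retry_inflated_volume(actual: float, timeout_fraction: float) -> float:
    """Total callback volume when each timed-out delivery is retried.

    Every retry has the same chance of timing out, so volume is a
    geometric series: actual * (1 + p + p^2 + ...) = actual / (1 - p).
    """
    assert 0 <= timeout_fraction < 1
    return actual / (1 - timeout_fraction)

# ~25M real callbacks with roughly two-thirds timing out under 90s
# latency inflates to ~75M -- matching the observed monthly volume.
print(round(retry_inflated_volume(25e6, 2 / 3) / 1e6))  # prints 75
```

The feedback loop is visible in the formula: as latency pushes the timeout fraction toward 1, volume diverges, which is why the problem worsened under load.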

How it was addressed

This wasn't a code problem — it was a missing architectural boundary. A synchronous handler embedded in a portal cannot absorb millions of async vendor callbacks. We defined three constraints for the replacement:

  • Decouple ingestion from processing. Vendors only need an HTTP 200 — acknowledge receipt instantly, process later.

  • Eliminate the retry cascade. Fast acknowledgment kills retries at the source. Handle the rest with idempotency.

  • Isolate failure domains. Email processor failures shouldn't impact SMS status tracking. Each channel needs independent scaling and failure boundaries.
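The first constraint can be sketched as an ack-first handler: validate just enough to enqueue, return 200, and defer everything else. A minimal sketch, with `enqueue` standing in for a Service Bus topic sender and all names hypothetical:

```python
import json
from typing import Callable

def handle_callback(body: bytes, enqueue: Callable[[dict], None]) -> int:
    """Ack-first webhook handler: the only synchronous work is a parse
    check and an enqueue. All vendor semantics, business logic, and
    storage writes happen downstream, after the vendor has its 200."""
    try:
        payload = json.loads(body)
    except ValueError:
        return 400  # malformed payloads are the only synchronous rejection
    enqueue(payload)  # milliseconds: no vendor-specific parsing,
    return 200        # no database writes, no business logic

# Usage: a plain list stands in for the Service Bus topic.
queue: list[dict] = []
status = handle_callback(b'{"id": "abc", "status": "delivered"}', queue.append)
```

Because the handler does nothing the vendor can observe except acknowledge, its latency is bounded by parsing and a broker send, independent of downstream load.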

The solution

The redesign replaced the monolithic handler with a three-layer event-driven pipeline on Azure serverless infrastructure.

Ingestion Layer — Entry point, Service Bus topic, subscription routing
  • Ingestion: A lightweight Azure Function accepts webhooks, normalizes payloads into a generic envelope, and pushes to a Service Bus topic. Response time: milliseconds. The URL convention /incoming/{vendor}/{servicetype}/{*requesttype} made vendor onboarding a config change, not an infrastructure change.
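The normalization step can be pictured as a pure function over that URL convention. A sketch, assuming hypothetical envelope field names (the production schema isn't shown here):

```python
def to_envelope(path: str, payload: dict) -> dict:
    """Normalize a vendor callback into the generic envelope, keyed off
    the /incoming/{vendor}/{servicetype}/{*requesttype} route.
    Field names here are assumptions, not the production schema."""
    parts = path.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "incoming":
        raise ValueError(f"unexpected callback path: {path}")
    vendor, servicetype, *rest = parts[1:]
    return {
        "vendor": vendor,               # e.g. "acme" (illustrative)
        "channel": servicetype,         # routing key for subscriptions
        "requesttype": "/".join(rest),  # catch-all {*requesttype} segment
        "payload": payload,             # raw vendor body, untouched
    }

env = to_envelope("/incoming/acme/sms/status/delivered", {"msg_id": "42"})
```

Onboarding a new vendor is then a matter of pointing it at a new `{vendor}` segment; nothing in the function changes.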

  • Routing: The Service Bus topic fans out via subscriptions — SMS to SMS processors, email to email, voice to voice — each with independent retry policies and dead-letter queues.
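The fan-out behaves like SQL-style subscription filters on the topic (Service Bus filters such as `channel = 'sms'`). The snippet below simulates that routing in plain Python; the subscription names and the `channel` property are assumptions:

```python
# Each subscription keeps only envelopes whose channel matches,
# mirroring Service Bus SQL filters like "channel = 'sms'".
SUBSCRIPTION_FILTERS = {
    "sms-processors":   lambda m: m["channel"] == "sms",
    "email-processors": lambda m: m["channel"] == "email",
    "voice-processors": lambda m: m["channel"] == "voice",
}

def fan_out(message: dict) -> list[str]:
    """Return the subscriptions that would receive this envelope."""
    return [name for name, accept in SUBSCRIPTION_FILTERS.items()
            if accept(message)]

assert fan_out({"channel": "sms"}) == ["sms-processors"]
```

Because each subscription has its own retry policy and dead-letter queue, a poison email message piles up in the email DLQ without touching SMS or voice throughput.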

Processing Layer — Processor microservices, actions, storage backends
  • Processing: Vendor-scoped processor functions own the full event lifecycle — routing, business logic, and storage writes — with Azure SQL for transactional data, Cosmos DB for audit trails, and Azure Storage for overflow.
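The idempotency mentioned in the constraints lives at this layer: a retried callback that has already been applied is acknowledged but not reprocessed. A minimal in-memory sketch (production would keep the seen-set in durable storage such as Azure SQL, and the key fields are assumptions):

```python
from typing import Callable

def make_processor(apply_update: Callable[[dict], None]):
    """Idempotent event processor: a callback whose (vendor, message id,
    event type) key has been seen is dropped instead of reapplied, so
    vendor retries that slip through cannot double-count deliveries."""
    seen: set[tuple] = set()

    def process(envelope: dict) -> bool:
        key = (envelope["vendor"],
               envelope["payload"]["msg_id"],
               envelope["requesttype"])
        if key in seen:
            return False  # duplicate from a vendor retry: ack and drop
        seen.add(key)
        apply_update(envelope)  # transactional write (Azure SQL in prod)
        return True

    return process

# Usage: the same envelope delivered twice results in one write.
writes: list[dict] = []
process = make_processor(writes.append)
```

The dedupe key is deliberately built from vendor-supplied identifiers, so it survives the envelope's normalization and works across redeliveries from any layer.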

Result

Callback latency dropped from 90–120 seconds to sub-second. Inbound volume fell from ~75M/month back to roughly the actual message volume, confirming that the majority of the prior load was vendor retries. Customer-reported delays resolved completely.

1. Architectural debt compounds under load. The handler worked fine at lower volumes; recognizing this as a boundary problem rather than a code problem determined the right investment.

2. Fast acknowledgment is a force multiplier. Decoupling receipt from processing eliminated the retry cascade and cut total system load by ~65%.

3. Isolation enables independent scaling. Per-channel processors turned capacity planning from an all-or-nothing upgrade into a targeted decision.

Tradeoffs

  • Speed over completeness. Acknowledge first, process later — cutting response time from minutes to milliseconds eliminated the retry storm entirely.

  • Isolation over simplicity. More moving parts, but a failure in one channel no longer drags down every other channel with it.

  • Serverless cost for predictable scale. Trading free IIS capacity for consumption-based Azure infrastructure, but gaining the ability to absorb traffic spikes without manual intervention.

  • Standardization over flexibility. A generic message envelope simplifies routing at the cost of flattening vendor-specific nuance at the front door.

  • Independent deploys, coordinated schemas. Each processor ships on its own schedule, but envelope changes still require a synchronized rollout.