Architecture

Taming 50 Million Callbacks with Event-Driven Architecture

A legacy .NET HttpHandler buried inside the customer portal was processing webhook callbacks synchronously — and at 20M+ messages a month, vendor retry storms inflated that to 75 million callbacks with 90-second processing latency. We replaced it with an Azure Function that acknowledges in milliseconds and routes to channel-isolated processors via Service Bus, dropping latency to sub-second and eliminating the retry cascade entirely.


Key Tradeoffs

  • Speed over completeness. Acknowledge first, process later.
  • Isolation over simplicity. Channel failures stay contained.
  • Serverless cost for predictable scale. Pay per execution to absorb spikes.
  • Standardization over flexibility. One generic envelope for every vendor.
  • Independent deploys, coordinated schemas. Processors ship separately; the envelope versions together.

What happened

The platform was a multi-channel communications SaaS — customers composed HTML messages and broadcast them as email, SMS, voice, or any combination across their contact base.

  • 20M+ SMS / month
  • 5M+ emails / month
  • Thousands of voice minutes / month

Customers reported delivery results were missing or delayed by hours. Outbound delivery was confirmed working — the bottleneck was the callback webhook handler ingesting delivery receipts, status updates, and opt-outs from vendors.

Root Cause

The callback processor was a .NET Framework HttpHandler baked into the main portal's IIS app pool. Application Insights showed ~75M callbacks/month — far exceeding actual message volume — with 90–120 second processing latency at peak. The excess was vendor retries: every unacknowledged callback triggered exponential backoff re-sends, creating a feedback loop that worsened under load.
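The inflation follows from simple retry arithmetic: if a fraction p of callbacks time out before acknowledgment and each timed-out send is re-sent, total volume forms a geometric series summing to N / (1 − p). A back-of-envelope sketch (the specific fractions are illustrative, inferred from the ~75M observed volume versus actual message counts):

```python
def retry_inflated_volume(actual: float, timeout_fraction: float) -> float:
    """Total callback volume when each timed-out delivery is retried.

    Every retry has the same chance of timing out, so volume is a
    geometric series: actual * (1 + p + p^2 + ...) = actual / (1 - p).
    """
    assert 0 <= timeout_fraction < 1
    return actual / (1 - timeout_fraction)

# ~25M real callbacks with roughly two-thirds timing out under 90s
# latency inflates to ~75M -- matching the observed monthly volume.
print(round(retry_inflated_volume(25e6, 2 / 3) / 1e6))  # prints 75
```

The feedback loop is visible in the formula: as latency pushes the timeout fraction toward 1, volume diverges, which is why the problem worsened under load.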

How it was addressed

This wasn't a code problem — it was a missing architectural boundary. A synchronous handler embedded in a portal cannot absorb millions of async vendor callbacks. We defined three constraints for the replacement:

  • Decouple ingestion from processing. Vendors only need an HTTP 200 — acknowledge receipt instantly, process later.

  • Eliminate the retry cascade. Fast acknowledgment kills retries at the source. Handle the rest with idempotency.

  • Isolate failure domains. Email processor failures shouldn't impact SMS status tracking. Each channel needs independent scaling and failure boundaries.
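The first constraint can be sketched as an ack-first handler: validate just enough to enqueue, return 200, and defer everything else. A minimal sketch, with `enqueue` standing in for a Service Bus topic sender and all names hypothetical:

```python
import json
from typing import Callable

def handle_callback(body: bytes, enqueue: Callable[[dict], None]) -> int:
    """Ack-first webhook handler: the only synchronous work is a parse
    check and an enqueue. All vendor semantics, business logic, and
    storage writes happen downstream, after the vendor has its 200."""
    try:
        payload = json.loads(body)
    except ValueError:
        return 400  # malformed payloads are the only synchronous rejection
    enqueue(payload)  # milliseconds: no vendor-specific parsing,
    return 200        # no database writes, no business logic

# Usage: a plain list stands in for the Service Bus topic.
queue: list[dict] = []
status = handle_callback(b'{"id": "abc", "status": "delivered"}', queue.append)
```

Because the handler does nothing the vendor can observe except acknowledge, its latency is bounded by parsing and a broker send, independent of downstream load.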

The solution

The redesign replaced the monolithic handler with a three-layer event-driven pipeline on Azure serverless infrastructure.

Ingestion Layer — Entry point, Service Bus topic, subscription routing
  • Ingestion: A lightweight Azure Function accepts webhooks, normalizes payloads into a generic envelope, and pushes to a Service Bus topic. Response time: milliseconds. The URL convention /incoming/{vendor}/{servicetype}/{*requesttype} made vendor onboarding a config change, not an infrastructure change.
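The normalization step can be pictured as a pure function over that URL convention. A sketch, assuming hypothetical envelope field names (the production schema isn't shown here):

```python
def to_envelope(path: str, payload: dict) -> dict:
    """Normalize a vendor callback into the generic envelope, keyed off
    the /incoming/{vendor}/{servicetype}/{*requesttype} route.
    Field names here are assumptions, not the production schema."""
    parts = path.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "incoming":
        raise ValueError(f"unexpected callback path: {path}")
    vendor, servicetype, *rest = parts[1:]
    return {
        "vendor": vendor,               # e.g. "acme" (illustrative)
        "channel": servicetype,         # routing key for subscriptions
        "requesttype": "/".join(rest),  # catch-all {*requesttype} segment
        "payload": payload,             # raw vendor body, untouched
    }

env = to_envelope("/incoming/acme/sms/status/delivered", {"msg_id": "42"})
```

Onboarding a new vendor is then a matter of pointing it at a new `{vendor}` segment; nothing in the function changes.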

  • Routing: The Service Bus topic fans out via subscriptions — SMS to SMS processors, email to email, voice to voice — each with independent retry policies and dead-letter queues.
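The fan-out behaves like SQL-style subscription filters on the topic (Service Bus filters such as `channel = 'sms'`). The snippet below simulates that routing in plain Python; the subscription names and the `channel` property are assumptions:

```python
# Each subscription keeps only envelopes whose channel matches,
# mirroring Service Bus SQL filters like "channel = 'sms'".
SUBSCRIPTION_FILTERS = {
    "sms-processors":   lambda m: m["channel"] == "sms",
    "email-processors": lambda m: m["channel"] == "email",
    "voice-processors": lambda m: m["channel"] == "voice",
}

def fan_out(message: dict) -> list[str]:
    """Return the subscriptions that would receive this envelope."""
    return [name for name, accept in SUBSCRIPTION_FILTERS.items()
            if accept(message)]

assert fan_out({"channel": "sms"}) == ["sms-processors"]
```

Because each subscription has its own retry policy and dead-letter queue, a poison email message piles up in the email DLQ without touching SMS or voice throughput.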

Processing Layer — Processor microservices, actions, storage backends
  • Processing: Vendor-scoped processor functions own the full event lifecycle — routing, business logic, and storage writes — with Azure SQL for transactional data, Cosmos DB for audit trails, and Azure Storage for overflow.
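The idempotency mentioned in the constraints lives at this layer: a retried callback that has already been applied is acknowledged but not reprocessed. A minimal in-memory sketch (production would keep the seen-set in durable storage such as Azure SQL, and the key fields are assumptions):

```python
from typing import Callable

def make_processor(apply_update: Callable[[dict], None]):
    """Idempotent event processor: a callback whose (vendor, message id,
    event type) key has been seen is dropped instead of reapplied, so
    vendor retries that slip through cannot double-count deliveries."""
    seen: set[tuple] = set()

    def process(envelope: dict) -> bool:
        key = (envelope["vendor"],
               envelope["payload"]["msg_id"],
               envelope["requesttype"])
        if key in seen:
            return False  # duplicate from a vendor retry: ack and drop
        seen.add(key)
        apply_update(envelope)  # transactional write (Azure SQL in prod)
        return True

    return process

# Usage: the same envelope delivered twice results in one write.
writes: list[dict] = []
process = make_processor(writes.append)
```

The dedupe key is deliberately built from vendor-supplied identifiers, so it survives the envelope's normalization and works across redeliveries from any layer.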

Result

Callback latency dropped from 90–120 seconds to sub-second. Inbound volume fell from ~75M/month back to roughly the actual message volume, confirming that the majority of the prior load was vendor retries. Customer-reported delays resolved completely.

1. Architectural debt compounds under load. The handler worked fine at lower volumes; recognizing this as a boundary problem rather than a code problem determined the right investment.

2. Fast acknowledgment is a force multiplier. Decoupling receipt from processing eliminated the retry cascade and cut total system load by ~65%.

3. Isolation enables independent scaling. Per-channel processors turned capacity planning from an all-or-nothing upgrade into a targeted decision.

Tradeoffs

  • Speed over completeness. Acknowledge first, process later — cutting response time from minutes to milliseconds eliminated the retry storm entirely.

  • Isolation over simplicity. More moving parts, but a failure in one channel no longer drags down every other channel with it.

  • Serverless cost for predictable scale. Trading free IIS capacity for consumption-based Azure infrastructure, but gaining the ability to absorb traffic spikes without manual intervention.

  • Standardization over flexibility. A generic message envelope simplifies routing at the cost of flattening vendor-specific nuance at the front door.

  • Independent deploys, coordinated schemas. Each processor ships on its own schedule, but envelope changes still require a synchronized rollout.