
Architecture Labs

Choose a lab and navigate the decisions that shaped the outcome.

These are real decisions I've navigated — not textbook scenarios. Walk through the trade-offs, make the calls, and see how the outcomes unfold.

Architecture · Incident Response · Featured

Production Pods Were Restarting Randomly

A production incident involving intermittent connection failures and pod restarts under normal traffic patterns.

Choose your path

Architecture · Performance · Featured

GraphQL Performance Was Deteriorating

API response times climbing steadily under normal load. The database is getting blamed. Infrastructure spend is on the table.

Choose your path

Architecture · architecture · microservices · strangler-fig · scaling

Scaling Crisis: Your Monolithic Worker Has Hit the Wall

Six critical workloads share one process, one deployment, and one busy flag — and customers are feeling the pain.

Choose your path

Architecture · observability · alerting · incident-response

We Had 400 Alerts and Missed the One That Mattered

A memory leak alert fired 90 minutes before the database OOM'd, but was buried in 400 weekly alerts.

Choose your path

Architecture · performance · caching · debugging · profiling

We Built a Cache That Made the System Slower

A team added Redis caching to speed up a slow API endpoint, but response times got worse.

Choose your path

Architecture · database · migration · downtime · architecture

A Database Migration Took Down the Entire Platform

A routine schema migration brought down a multi-tenant SaaS platform for 47 minutes during business hours.

Choose your path

Architecture · incident-response · dependency-management · financial-services

A Minor Dependency Update Broke Production for 12 Hours

A semver-compliant patch update silently corrupted financial reports through changed locale handling.

Choose your path

Architecture · incident-response · feature-flags · data-corruption

A Feature Flag We Forgot About Caused a Production Incident

A stale flag's default value routes financial transactions through deprecated code, corrupting data for 6 hours.

Choose your path