Labs
Choose a lab and navigate the decisions that shaped the outcome.
These are real decisions I've navigated — not textbook scenarios. Walk through the trade-offs, make the calls, and see how the outcomes unfold.
Production Pods Were Restarting Randomly
A production incident involving intermittent connection failures and pod restarts under normal traffic patterns.
GraphQL Performance Was Deteriorating
API response times climbing steadily under normal load. The database is getting blamed. Infrastructure spend is on the table.
Security Vulnerabilities Were Accumulating in Our GraphQL Stack
Active CVEs in a production compliance platform. Audit scheduled. Limited team capacity. Every new feature built on a deprecated foundation.
The Cloud Migration That Almost Broke Our Export Service
Three weeks to migrate file storage from AWS S3 to Azure. The codebase has a storage abstraction layer. It looks clean. It isn't.
Scaling Crisis: Your Monolithic Worker Has Hit the Wall
Six critical workloads share one process, one deployment, and one busy flag — and customers are feeling the pain.
How do you handle a lead developer's resistance to migrating their legacy system?
The technical migration plan is solid, but the person who built what you're replacing isn't on board.
We Had 400 Alerts and Missed the One That Mattered
A memory leak alert fired 90 minutes before the database OOM'd, but it was buried among 400 weekly alerts.
We Built a Cache That Made the System Slower
A team added Redis caching to speed up a slow API endpoint, but response times got worse.
A Database Migration Took Down the Entire Platform
A routine schema migration brought down a multi-tenant SaaS platform for 47 minutes during business hours.
A Minor Dependency Update Broke Production for 12 Hours
A semver-compliant patch update silently corrupted financial reports through changed locale handling.
A Feature Flag We Forgot About Caused a Production Incident
A stale flag's default value routes financial transactions through deprecated code, corrupting data for 6 hours.