A Feature Flag We Forgot About Caused a Production Incident
A feature flag created 18 months ago was still in the codebase. When the flag provider timed out, the flag evaluated to its default value, which no longer matched the state of the system. The result was a data corruption bug that took three days to fully remediate.
- Stale feature flag controlling a write path to a financial data pipeline
- Flag provider outage caused the flag to evaluate to a default nobody remembered setting
- Data inconsistency propagated for 6 hours before detection
Feature flags are treated as temporary by design but permanent by neglect. This flag was supposed to be removed after a two-week rollout 18 months ago. Instead, it became invisible load-bearing infrastructure — and when it failed, it failed silently into a state that corrupted downstream data.
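The failure mode is easy to sketch. A minimal example, with all names hypothetical (this is not the actual flag SDK involved): a flag client that asks a remote provider, and on timeout silently falls back to a default that was set when the flag was created.

```python
class FlagProviderTimeout(Exception):
    """Raised when the remote flag provider does not respond in time."""

class FlagClient:
    """Minimal flag client: ask the provider, fall back to a default on timeout."""

    def __init__(self, provider, defaults):
        self.provider = provider    # remote evaluation service
        self.defaults = defaults    # per-flag fallback values, set at flag creation

    def is_enabled(self, flag_name):
        try:
            return self.provider.evaluate(flag_name, timeout=0.2)
        except FlagProviderTimeout:
            # Silent fallback: no alert fires, and the default may be
            # 18 months out of date.
            return self.defaults[flag_name]

class DownProvider:
    """Simulates a provider outage: every evaluation times out."""
    def evaluate(self, flag_name, timeout):
        raise FlagProviderTimeout()

# The default was valid during the original rollout, not today.
client = FlagClient(DownProvider(), {"use_new_transaction_writer": False})
assert client.is_enabled("use_new_transaction_writer") is False
```

Nothing in this path distinguishes "the provider said False" from "the provider is down and we guessed False", which is exactly why the failure was silent.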
The Scenario
You're the tech lead at a fintech company. An alert fires: financial transaction records don't match between two systems. Investigation reveals a feature flag that's been in the codebase for 18 months is suddenly evaluating to its default value because the flag provider is timing out. The default routes transactions through a deprecated code path that writes data in an old format. It's been running this way for 6 hours. What do you do?
No hints. Just judgment.
Hardcoding the flag and deploying immediately is the right instinct: stop the bleeding first. But deploying before establishing the corruption boundary leaves the remediation team with no clean scope for recovery. They end up diffing entire datasets instead of querying a precise time window. Two minutes of scoping before the fix saves days of recovery work after.
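Establishing the boundary can be as simple as finding the first and last record written in the old format. A sketch, assuming records carry a write timestamp and a format-version field (both names hypothetical):

```python
from datetime import datetime

def corruption_window(records):
    """Return (start, end) timestamps of records written in the old format.

    Scoping the window first gives the remediation team a precise query
    range instead of a full-dataset diff.
    """
    bad = [r["written_at"] for r in records if r["format_version"] == "v1"]
    if not bad:
        return None
    return min(bad), max(bad)

records = [
    {"written_at": datetime(2024, 5, 1, 9, 0),  "format_version": "v2"},
    {"written_at": datetime(2024, 5, 1, 10, 0), "format_version": "v1"},
    {"written_at": datetime(2024, 5, 1, 14, 30), "format_version": "v1"},
]
start, end = corruption_window(records)
```

In practice this would be a query against the pipeline's datastore rather than an in-memory scan, but the principle is the same: bound the damage before you change the system.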
- Feature flags are temporary by design but permanent by neglect — enforce expiration
- A flag's default value is a silent contract with the future — it must remain valid or the flag must be removed
- Establish the corruption boundary before deploying the fix — two minutes of scoping saves days of recovery
- Flag provider failures should fail safe, not fail to a default that may no longer be correct
- Stale flags are invisible technical debt until they become visible incidents
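Enforcing expiration does not require anything elaborate. One approach, sketched here with a hypothetical registry format: every flag declares a review date at creation, and a CI job fails the build when that date has passed.

```python
from datetime import date

# Hypothetical flag registry: every flag declares a review date at creation.
FLAG_REGISTRY = {
    "use_new_transaction_writer": date(2023, 6, 1),   # long past review
    "enable_dark_mode": date(2026, 1, 15),
}

def expired_flags(registry, today):
    """Flags whose review date has passed; a CI job can fail the build on these."""
    return sorted(name for name, review in registry.items() if review < today)
```

A failing build forces the conversation the two-week rollout never had: remove the flag, or consciously extend its review date.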
- Data remediation completed in 8 hours instead of an estimated 3 days
- 23 stale feature flags identified and removed across the codebase
- Flag expiration policy implemented — all new flags require a review date
- Flag provider SDK configured with fallback cache to prevent timeout-driven defaults
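The fallback-cache idea can be sketched as a thin wrapper: on timeout, serve the last value the provider actually returned rather than a hardcoded default. The names are hypothetical, not a real SDK's API.

```python
class FlagProviderTimeout(Exception):
    """Raised when the remote flag provider does not respond in time."""

class CachedFlagClient:
    """On provider timeout, serve the last successfully fetched value
    instead of a possibly-stale hardcoded default."""

    def __init__(self, provider):
        self.provider = provider
        self.cache = {}    # flag_name -> last known value

    def is_enabled(self, flag_name, default=False):
        try:
            value = self.provider.evaluate(flag_name, timeout=0.2)
            self.cache[flag_name] = value   # refresh on every success
            return value
        except FlagProviderTimeout:
            # Prefer the last value actually served in production; the
            # hardcoded default is only the last resort (e.g. cold start).
            return self.cache.get(flag_name, default)
```

This keeps an outage from flipping a flag's effective value, at the cost of needing a persistence story for the cache across restarts.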
Related
A Minor Dependency Update Broke Production for 12 Hours
A routine patch update to a date-formatting library changed its locale handling. The change was semver-compliant. Tests passed. The bug shipped to production and silently corrupted date-sensitive financial reports for 12 hours.
Production Pods Were Restarting Randomly
A production incident involving intermittent connection failures and unstable recovery behavior. The fix was surgical. The real lesson was about what not to do first.