Impact
Reduces Azure outages attributable to deployment regressions by 30%, halts regressing deployments with 95% P95 accuracy, and improves Azure OS update stability by 18%. Monitors services across public and government clouds.
The Problem
Azure deploys thousands of changes every day, spanning infrastructure, platform, and software layers, in public and government clouds worldwide. Each deployment is a potential regression. A bad OS update silently degrading service reliability. A platform change causing a spike in crashes. A software rollout introducing latency that compounds over millions of requests.
Traditional monitoring catches these after the fact: an on-call engineer gets paged, investigates, and traces the issue back to a deployment that happened hours ago. By then, the blast radius has grown.
The question we set out to answer: can we detect a regression from a deployment's crash signature before it becomes an outage?
What I Built
The Core Insight
Crashes are signals. Not just that something broke — but what broke, when, and whether it correlates with a deployment event. The key insight behind Harmony was treating crash telemetry as a time series tied to deployment events, and building a system that could distinguish a genuine regression from normal noise at Azure's scale.
System Architecture
Harmony operates in three stages:
1. Signal Ingestion & Normalization
Crash data flows in from across Azure services — infrastructure, platform, and software — at high volume. The first challenge is normalization: crashes from different service layers look different, have different severities, and have different baseline rates. The system builds per-service, per-layer baselines so that a spike in one service isn't masked by quiet periods in another.
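To make the per-service, per-layer baseline idea concrete, here is a minimal sketch. It is not Harmony's actual implementation: the function names, the (service, layer, crashes_per_hour) sample format, and the choice of a median/MAD robust z-score are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def build_baselines(history):
    """Build a (median, MAD) baseline per (service, layer) pair.

    `history` is an iterable of (service, layer, crashes_per_hour) samples.
    This sample format is hypothetical.
    """
    grouped = defaultdict(list)
    for service, layer, rate in history:
        grouped[(service, layer)].append(rate)

    baselines = {}
    for key, rates in grouped.items():
        center = median(rates)
        mad = median(abs(r - center) for r in rates)  # median absolute deviation
        baselines[key] = (center, mad)
    return baselines


def normalized_score(baselines, service, layer, rate, floor=1.0):
    """Express a crash rate as a robust z-score against its own baseline."""
    center, mad = baselines[(service, layer)]
    # 1.4826 * MAD approximates a standard deviation for normally distributed data.
    # The floor (an assumed tuning knob) avoids dividing by zero on very quiet services.
    scale = max(1.4826 * mad, floor)
    return (rate - center) / scale


history = (
    [("storage", "platform", r) for r in [10, 12, 11, 9, 10]]
    + [("compute", "infrastructure", r) for r in [200, 210, 190, 205, 195]]
)
baselines = build_baselines(history)

# The same absolute jump of +15 crashes/hour scores very differently:
storage_score = normalized_score(baselines, "storage", "platform", 25)         # roughly 10.1
compute_score = normalized_score(baselines, "compute", "infrastructure", 215)  # roughly 2.0
```

Scoring each service against its own baseline is what keeps a spike in a quiet service from being drowned out by a noisy one: the raw increase is identical in both cases, but only the first is far outside its normal range.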
2. Regression Detection
The ML model watches for anomalous crash patterns that correlate temporally with deployment events. This is harder than it sounds — deployments and crashes both happen continuously, so the model has to distinguish causation from coincidence. We used a combination of time series anomaly detection and deployment-aware windowing to achieve P95 accuracy of 95%.
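One simple way to picture deployment-aware windowing is to compare crash counts in equal-length windows before and after a deployment. The sketch below is a deliberately simplified stand-in for the production model: the function names, the one-hour window, and the Poisson-based score are illustrative assumptions, not the method Harmony uses.

```python
import math
from bisect import bisect_left


def count_between(sorted_times, start, end):
    """Number of crash timestamps t with start <= t < end."""
    return bisect_left(sorted_times, end) - bisect_left(sorted_times, start)


def deployment_window_score(crash_times, deploy_time, window=3600.0):
    """Score the change in crash count around a deployment.

    Assumes crashes follow a Poisson process with the same rate in both
    windows when nothing has regressed. Under that assumption the difference
    of the two counts has variance approximately before + after.
    """
    times = sorted(crash_times)
    before = count_between(times, deploy_time - window, deploy_time)
    after = count_between(times, deploy_time, deploy_time + window)
    return (after - before) / math.sqrt(max(before + after, 1))


deploy_time = 10_000.0

# Hypothetical data: 4 crashes in the hour before the deployment, 36 in the hour after.
regressing = [6400 + 900 * i for i in range(4)] + [10_000 + 100 * i for i in range(36)]
# Hypothetical data: 5 crashes before, 6 after (ordinary noise).
quiet = [6500 + 600 * i for i in range(5)] + [10_000 + 500 * i for i in range(6)]

regressing_score = deployment_window_score(regressing, deploy_time)  # about 5.06
quiet_score = deployment_window_score(quiet, deploy_time)            # about 0.30
```

A pre/post comparison alone cannot separate causation from coincidence, because an unrelated spike that happens to follow a deployment scores just as high. A fuller version would also need a control signal, such as crash rates on machines that have not yet received the change.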
3. Alerting & Automated Halting
When a regression is detected with sufficient confidence, Harmony alerts the service owner and — critically — can halt the regressing deployment automatically. This closes the loop from detection to action without requiring human intervention in the critical path.
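The alert-or-halt decision can be sketched as a small gate on the detector's confidence. The two separate thresholds, their values, and the `auto_halt_enabled` switch are hypothetical; the text above only says that action is taken when confidence is sufficient.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    ALERT = "alert"
    ALERT_AND_HALT = "alert_and_halt"


def decide(confidence, alert_threshold=0.80, halt_threshold=0.95, auto_halt_enabled=True):
    """Map a regression confidence in [0, 1] to an action.

    Thresholds are illustrative placeholders, not production values.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence < alert_threshold:
        return Action.NONE
    if auto_halt_enabled and confidence >= halt_threshold:
        # High-confidence regressions stop the rollout without waiting for a human.
        return Action.ALERT_AND_HALT
    return Action.ALERT
```

In this sketch the halt threshold sits above the alert threshold, so the automated path is reserved for the clearest signals while weaker ones still reach the service owner.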
LLMs for Continuous Improvement
A second layer uses LLMs to process feedback from resolved incidents: extracting structured signal from unstructured post-mortems and on-call notes, and feeding that back into the training dataset. This creates a flywheel — every incident the system misses or catches late becomes training signal for the next version.
Challenges
Scale without false positives. At Azure's deployment frequency, a 1% false positive rate would generate an unacceptable alert volume. Getting precision high enough to be trusted required careful per-service calibration and a confidence threshold system that errs toward alerting only when the signal is clear.
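A quick back-of-the-envelope calculation shows why a 1% false positive rate is too high. Every input below (5,000 deployments per day, a 0.5% true regression rate, 90% recall) is an assumed figure chosen for illustration, not a measured Azure number.

```python
def alert_load(deployments_per_day, regression_rate, false_positive_rate, recall):
    """Return (expected false alerts per day, precision) for a detector."""
    bad = deployments_per_day * regression_rate
    good = deployments_per_day - bad
    true_alerts = bad * recall
    false_alerts = good * false_positive_rate
    total = true_alerts + false_alerts
    precision = true_alerts / total if total else 0.0
    return false_alerts, precision


# Assumed inputs: 5,000 deployments/day, 0.5% of them regress, 90% recall.
false_at_1pct, precision_at_1pct = alert_load(5000, 0.005, 0.01, 0.9)
false_at_01pct, precision_at_01pct = alert_load(5000, 0.005, 0.001, 0.9)

# At a 1% false positive rate: about 50 false alerts per day, precision about 0.31.
# At a 0.1% false positive rate: about 5 false alerts per day, precision about 0.82.
```

Because real regressions are rare compared with healthy deployments, even a small false positive rate means most alerts are wrong. Under these assumed numbers, roughly two out of three alerts at a 1% rate would be false alarms.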
Bootstrapping without labeled data. There's no ground truth dataset of "deployments that caused regressions." We had to build the labeling pipeline alongside the model — using historical incident data, on-call notes, and manual review to create the initial training set.
Latency vs. accuracy tradeoff. Earlier detection means acting on less data. We tuned the system to optimize for catching regressions within the first deployment wave, accepting slightly lower precision at very early detection in exchange for speed.
Impact
- 30% reduction in Azure outages attributable to deployment regressions
- 95% P95 accuracy in regression detection — trusted enough to halt deployments automatically
- 18% improvement in Azure OS update stability
- Operates continuously across Azure public and government clouds at >99.99% availability
- Featured at Microsoft Ignite 2021, presented by Azure CTO Mark Russinovich