ML Infrastructure

Time Series Forecasting & Anomaly Detection at Azure Scale

Built two production ML pipelines: one saving $100M+ annually in capacity planning, the other surfacing data center outages, with likely root cause, to on-call engineers within 15 minutes.

June 2021

Impact

Forecasting saves $100M+ annually at <5% error. Anomaly detection surfaces outages with root cause in under 15 minutes at >80% precision. The anomaly detection work was presented by Azure CTO Mark Russinovich at Microsoft Ignite 2021.

Tech Stack

Python, Spark, Azure Time Series Insights, Azure Digital Twins, PyTorch

The Problem

Two separate but related problems, both rooted in the same challenge: making sense of time series data at Azure's scale.

Capacity Planning: Azure's infrastructure is massive and growing. Over-provisioning is expensive — at the scale of Azure's global footprint, even modest inefficiencies translate to hundreds of millions of dollars annually. Capacity planning teams needed a reliable way to forecast resource demand across multiple variables, far enough in advance to act on it, with error rates low enough to trust.

Data Center Anomaly Detection: When something goes wrong in a data center — a cooling failure, a network degradation, a power anomaly — the signal is often subtle and distributed across many sensors and telemetry streams. On-call engineers were spending significant time just diagnosing what had happened before they could begin fixing it. Root cause was arriving too late.


What I Built

Project Tesseract — Multivariate Time Series Forecasting

Tesseract is a forecasting pipeline for Microsoft's capacity planning function. The core challenge is multivariate: resource demand doesn't move in isolation. Compute, memory, storage, and network all interact, and forecasting one without accounting for the others produces errors that compound over long time horizons.

The pipeline:

  • Ingests telemetry from Azure's infrastructure layer across regions and service types
  • Builds multivariate forecasting models that capture cross-variable dependencies
  • Produces rolling forecasts with confidence intervals that capacity planners can act on
  • Achieves less than 5% error across forecast horizons, enabling reliable long-range planning

The sub-5% error threshold was the key design target — below that, the forecasts are accurate enough to drive real procurement and provisioning decisions rather than being treated as rough estimates.
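As an illustration of how a sub-5% target could be checked per forecast horizon, here is a minimal sketch. It assumes the error metric is mean absolute percentage error (MAPE); the write-up does not name the metric Tesseract used, and the function names and horizon labels are hypothetical.

```python
def mape(actual, forecast):
    """Mean absolute percentage error as a fraction (0.05 means 5%)."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    # Skip points where the actual value is zero; MAPE is undefined there.
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    if not pairs:
        raise ValueError("need at least one non-zero actual value")
    return sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


def meets_target(actuals_by_horizon, forecasts_by_horizon, target=0.05):
    """Return {horizon: (error, passed)} for each forecast horizon."""
    report = {}
    for horizon, actual in actuals_by_horizon.items():
        err = mape(actual, forecasts_by_horizon[horizon])
        report[horizon] = (err, err < target)
    return report


if __name__ == "__main__":
    # Made-up demand numbers, purely for illustration.
    actuals = {"30d": [100, 200], "180d": [100, 200]}
    forecasts = {"30d": [98, 204], "180d": [90, 230]}
    for horizon, (err, ok) in meets_target(actuals, forecasts).items():
        print(f"{horizon}: error={err:.1%} passed={ok}")
```

Checking each horizon separately matters because an average across horizons can hide a long-range forecast that misses the target.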

Azure IoT — Anomaly Detection with Root Cause

The IoT anomaly detection pipeline tackles a different time series problem: not forecasting future values, but detecting anomalous present values and tracing them to a cause.

Architecture:

  • Built on Azure Time Series Insights (TSI) for ingestion and storage of sensor and telemetry streams from data center infrastructure
  • Azure Digital Twins (ADT) provides the graph layer — a live model of physical relationships between devices, racks, cooling units, power systems, and network equipment
  • Spark powers the correlation engine, joining anomalous signals across the ADT graph to identify which physical component is the most likely root cause
  • When an anomaly is detected, the system surfaces a structured alert to the on-call engineer: what's anomalous, where it is, and what it's likely caused by

The graph layer is the key differentiator. Without it, anomaly detection produces alerts that still require manual investigation. With it, the system can say: "Temperature anomaly in rack 14B, likely caused by cooling unit C3, affecting these downstream services."
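The idea of tracing anomalies through a physical-relationship graph can be sketched in a few lines. This is only an illustration: a plain dictionary stands in for the Azure Digital Twins graph, the component names and the scoring rule are hypothetical, and the production system ran its correlation in Spark rather than in-process Python.

```python
from collections import deque

# Hypothetical topology: each component maps to the components it feeds.
FEEDS = {
    "cooling-C3": ["rack-14B", "rack-14C"],
    "rack-14B": ["service-a", "service-b"],
    "rack-14C": ["service-c"],
    "pdu-P1": ["rack-14C"],
}


def downstream(node, feeds):
    """All components reachable from node by following 'feeds' edges (BFS)."""
    seen, queue = set(), deque([node])
    while queue:
        for child in feeds.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def rank_root_causes(anomalous, feeds):
    """Rank candidates by how many anomalous components they could explain.

    Ties are broken by the share of a candidate's reach that is anomalous,
    then by name so the ordering is deterministic.
    """
    anomalous = set(anomalous)
    candidates = set(feeds) | anomalous
    scored = []
    for node in candidates:
        reach = downstream(node, feeds) | {node}
        explained = reach & anomalous
        if explained:
            scored.append((node, len(explained), len(explained) / len(reach)))
    return sorted(scored, key=lambda s: (-s[1], -s[2], s[0]))


if __name__ == "__main__":
    ranking = rank_root_causes({"rack-14B", "rack-14C"}, FEEDS)
    print("likely root cause:", ranking[0][0])
    print("affected downstream of rack-14B:", sorted(downstream("rack-14B", FEEDS)))
```

A count-based score like this favours components high in the topology, so a real system would likely weight candidates by their own telemetry as well.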


Challenges

Multivariate forecasting at scale. The number of possible interactions between variables grows combinatorially with the number of variables. The model architecture had to be expressive enough to capture real dependencies without overfitting to historical patterns that don't generalize.

Graph traversal under latency pressure. The 15-minute root cause target meant the ADT graph traversal and Spark correlation had to complete well within that window, even for large, complex data center topologies. Query optimization and pre-computed subgraph caches were critical.
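One way to picture a pre-computed subgraph cache: resolve every component's upstream set once, when the topology changes, so that alert-time lookups become set intersections instead of graph traversals. The sketch below assumes an acyclic topology held in a plain dictionary; the names are hypothetical and this is not the Azure Digital Twins query API.

```python
def build_upstream_cache(feeds):
    """Precompute, for every component, the set of components upstream of it.

    Assumes the topology is acyclic; a cycle would recurse without end.
    """
    # Invert the 'feeds' edges into a parent lookup.
    parents = {}
    for src, targets in feeds.items():
        parents.setdefault(src, set())
        for dst in targets:
            parents.setdefault(dst, set()).add(src)

    cache = {}

    def ancestors(node):
        # Memoized: each node's upstream set is computed only once.
        if node not in cache:
            result = set()
            for parent in parents[node]:
                result.add(parent)
                result |= ancestors(parent)
            cache[node] = result
        return cache[node]

    for node in parents:
        ancestors(node)
    return cache


def shared_upstream(anomalous, cache):
    """Components that sit at or above every anomalous component."""
    sets = [cache[n] | {n} for n in anomalous]
    return set.intersection(*sets) if sets else set()


if __name__ == "__main__":
    feeds = {
        "cooling-C3": ["rack-14B", "rack-14C"],
        "rack-14B": ["service-a"],
        "pdu-P1": ["rack-14C"],
    }
    cache = build_upstream_cache(feeds)  # done ahead of time
    print(shared_upstream(["rack-14B", "rack-14C"], cache))
```

The trade-off is memory and cache invalidation: the cache has to be rebuilt, or patched, whenever the physical topology changes.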

Alert fatigue. Anomaly detection systems that cry wolf get ignored. Tuning precision to above 80% while maintaining recall required careful threshold calibration per sensor type and data center topology.
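Per-sensor-type calibration can be sketched as a search for the lowest alert threshold that still meets the precision target, since lowering the threshold is what preserves recall. This assumes labeled historical anomaly scores are available per sensor type; the data, function names, and sensor types below are hypothetical.

```python
def calibrate_threshold(scored, min_precision=0.8):
    """Pick the lowest threshold whose alerts meet the precision target.

    scored: list of (anomaly_score, is_real_anomaly) pairs.
    Returns None if no threshold reaches the target.
    """
    best = None
    for threshold in sorted({score for score, _ in scored}, reverse=True):
        flagged = [label for score, label in scored if score >= threshold]
        precision = sum(flagged) / len(flagged)
        if precision >= min_precision:
            # Lower thresholds flag more events, so keep the lowest that passes.
            best = threshold
    return best


def calibrate_per_type(history_by_type, min_precision=0.8):
    """Calibrate one threshold per sensor type."""
    return {
        sensor_type: calibrate_threshold(scored, min_precision)
        for sensor_type, scored in history_by_type.items()
    }


if __name__ == "__main__":
    history = {
        "temperature": [
            (0.9, True), (0.8, True), (0.7, True), (0.6, False),
            (0.5, True), (0.4, False), (0.3, False),
        ],
        "power": [(0.9, False)],
    }
    print(calibrate_per_type(history))
```

A sensor type that returns None has no threshold meeting the target on its history, which is a signal to improve the detector rather than to alert anyway.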


Impact

  • $100M+ saved annually through accurate capacity forecasting with Tesseract
  • Under 15 minutes from anomaly to root cause surfaced to on-call engineers
  • >80% precision on data center anomaly detection
  • Azure IoT project presented by Azure CTO Mark Russinovich at Microsoft Ignite 2021