I turn complex problems into AI systems that hold up in production.

AI & ML systems engineer at Microsoft, specializing in production reliability, AI infrastructure, and anomaly detection at global scale. I also work with early-stage teams on building AI systems that last.

About

I use AI and ML to solve real business problems — not just to build interesting models. At Microsoft, I work on the health and quality of Azure, helping the platform stay reliable for its customers at global scale.

What drives me is understanding a business's most important challenges, then figuring out how AI and ML can help — quickly, cleanly, and with impact. I work iteratively: scoping tightly, shipping early, and improving based on feedback from the system, the data, and the people using it.

Translate Problems

I break down complex problems into smaller, solvable chunks.

Scalable and Secure by Design

A good system is designed deliberately to be scalable and secure. I build systems that are easy to reason about, debug, and extend.

Keep it Simple

I believe complexity should emerge from the interaction of simpler components.

Areas of Focus

AI & ML Observability

Production Reliability for AI Systems

LLMs in Production

Fast AI Iteration for Resource-Constrained Teams

Building Trust in Model Outputs at Scale

AI Infrastructure for Early-Stage Teams

Selected Projects

AI Motion Insights: Computer Vision for Elite Athletic Performance

Computer Vision

Built a computer vision system with the US Olympic Committee and USA Surfing Team to deliver real-time motion analysis and performance feedback for elite athletes.

Python, PyTorch, OpenCV, Computer Vision

Deployed with the USA Surfing Team ahead of the Olympics. Featured in press coverage in January 2023.

Time Series Forecasting & Anomaly Detection at Azure Scale

ML Infrastructure

Built two production ML pipelines: one saves more than $100M annually in capacity planning, and the other surfaces data center outages to on-call engineers in under 15 minutes.

Python, Spark, Azure Time Series Insights, Azure Digital Twins, +1 more

Forecasting saves $100M+ annually at <5% error. Anomaly detection surfaces outages with root cause in under 15 minutes at >80% precision. Presented by Azure CTO at Microsoft Ignite 2021.

Harmony: Regression Detection System for Azure Services

ML Infrastructure

Architected an AI/ML system that detects regressions in Azure infrastructure before they cause outages — reducing incidents by 30% at global scale.

Python, Azure, LLMs, PyTorch

Reduces Azure outages by 30%, halts regressing deployments with 95% P95 accuracy, and improves Azure OS update stability by 18%. Monitors services across public and government clouds.

Recent Writing

When "Done" Means Four Different Things: The Hidden Dysfunction in AI Teams

Every AI team is operating with four competing definitions of done simultaneously. The dysfunction isn't wrong expertise — it's that nobody has decided which one governs right now.

May 2026, 8 min

Let's Connect

I'm always interested in discussing ML systems, sharing ideas, or exploring opportunities to build something meaningful together. I'm also open to advising early-stage teams working on hard ML problems.

If you're working on something where AI feels both essential and uncertain — that's usually where the interesting work is.