When "Done" Means Four Different Things: The Hidden Dysfunction in AI Teams
I work on Watson, a Microsoft product for engineering reliability. One of its features uses AI to help engineers debug system crash dumps - a full call stack and system memory state captured at the moment of failure.
When we decided to build this capability, we weren't doing it in a comfortable environment. We were doing it under real competitive pressure, in the middle of one of the most watched product cycles in Microsoft's recent history. The engineers who needed this feature needed it yesterday.
A crash dump is not a tidy artifact. It contains everything the system knew at the moment it fell over - memory addresses, stack frames, loaded modules, heap state. It is as raw and sensitive as software data gets. And we wanted to put an AI system in the room with it, because AI could do something no previous tool could: hold the entire dump in a single analytical frame and reason across it. Find the error. Trace the call stack. Surface the root cause. What took an experienced engineer hours of WinDbg work could, in theory, take minutes.
But these weren't just internal dumps. They could be customer dumps. Third-party dumps. And putting an AI system in the room with data that sensitive meant we had to be deliberate about how it operated - not just what it could do, but what it should do.
So before we had written a line of feature code, we had a conflict. Not a theoretical one. A genuine tension between two things that both mattered: the engineers who needed this shipped, and the customers whose data was at stake. That conflict is what this post is about - and more specifically, about the moment when you have to decide which concern wins.
TLDR: AI teams fail not because they lack expertise but because they're running four competing definitions of done simultaneously - reliability, evaluation, security, and velocity - without deciding which one governs. This post names the four, explains why AI makes the conflict harder to resolve than traditional software, and gives you four questions to run with your team before your next release.
Four instincts. One question.
Every team building AI-powered products is running with at least four simultaneous and competing instincts about what "done" means. They rarely get named explicitly. They should.
Reliability means done is observable and recoverable. If this breaks, we'll know immediately. We can reconstruct what happened. We can roll back. The question it keeps asking: what are the failure modes, and have we instrumented them?
Evaluation means done is regression-tested and version-controlled. A change to the model, the prompt, or the context pipeline doesn't ship without evidence that nothing got worse. The question it keeps asking: how do we know it's still working the same way it was last week?
Security means done is adversarially hardened. The attack surface of an AI system is different from traditional software - it isn't just network ports and authentication flows. It's every input the system receives. A user can type "ignore your previous instructions and do X instead," and if the system isn't built to resist that, it will comply. OWASP lists prompt injection - instruction override through natural language input - as the top security risk for AI applications [1]. It's not exotic. It's the first thing a curious user tries. The question it keeps asking: what does this system look like to someone trying to break it?
Velocity means done is in front of real users. Shipping and observing is the only way to learn what the other three instincts actually need to watch for. The question it keeps asking: what are we waiting for?
None of these is wrong. All four are responding to real risks. The problem is that they pull in different directions simultaneously - and in most AI teams, nobody has explicitly decided which one governs.
Why this is harder with AI than it was before
This tension isn't new. Product managers and SREs have been arguing about velocity versus reliability since long before AI existed. What AI systems do is remove three of the mechanisms teams had developed to resolve it.
Ground truth. Traditional software has a clear definition of working: given input X, produce output Y. A test passes or it doesn't. An AI system doesn't have that contract. The model is non-deterministic. "Done" loses its shared meaning - each instinct is using a different proxy for correctness, and the proxies don't measure the same thing.
A contained attack surface. Traditional systems get exploited at the infrastructure layer. You patch a vulnerability, rotate a credential, close a port. AI systems introduce a qualitatively different attack surface: natural language. You cannot patch this the way you patch a buffer overflow. Air Canada's chatbot was found liable by a British Columbia tribunal in 2024 after it gave a customer incorrect information about bereavement fares - information the customer relied on, that differed from Air Canada's actual policy, and that the tribunal ruled constituted negligent misrepresentation [2]. The chatbot wasn't hacked. It just answered questions incorrectly, confidently, in a context where customers had reasonable grounds to trust it. Traditional QA isn't designed to catch that failure mode.
Predictable cost scaling. When you ship a traditional feature, usage and infrastructure costs scale in a relationship that's well-understood and plannable. AI systems break this in two ways. First, legitimate adoption can outpace forecasts by orders of magnitude: Uber's CTO confirmed, as reported by The Information in April 2026, that the company had exhausted its entire planned AI budget for the year within four months, driven entirely by legitimate engineering adoption - individual engineers were spending between $500 and $2,000 per month on AI coding tools [3]. Second, adversarial usage is indistinguishable from legitimate usage at the token level. Every prompt-injected interaction costs you the same as a genuine query. At scale, the compounding effect is not a rounding error.
These three changes mean that the four instincts, which previously had established mechanisms for reaching consensus, now have no shared resolution mechanism. The conflict persists because nothing resolves it automatically anymore.
How the Watson decision actually got made
Back to Watson.
The security instinct won. But it didn't win in a meeting where someone presented a framework and the team voted. It started with an instinct - a feeling, early in the design process, that putting an AI system in the room with customer crash dumps without tight controls was wrong - and that instinct triggered a larger conversation about what "right" actually required.
That conversation wasn't comfortable. There was real pressure to ship. The answer the security instinct produced - a compliant sandbox VM, no copy-paste in or out, no internet connection outside the environment, every AI action audited and reconstructable - wasn't the fastest path to a demo. It added engineering surface area and compliance review. It slowed things down.
But it was the right answer for this system, at this stage, for these users. Customers trust Microsoft with sensitive data. That trust exists independently of any contract. A crash dump from a customer's machine contains data they didn't choose to share with an AI system. Treating that responsibly meant designing the architecture around what we believed was right, not around what we could technically justify.
The architecture that resulted keeps the AI doing what it's uniquely good at - holding a complex, unstructured artifact in a single analytical frame and reasoning across it - while constraining what it can affect rather than how it reasons. The model operates inside the sandbox. What it can see, what actions it can take, and what can leave the environment are all controlled by systems that don't depend on the model's judgment. That's not a limitation on the feature's usefulness. It's the condition under which the feature's usefulness can be trusted.
The Watson decision slowed us down. It also meant that when the feature shipped, we knew exactly what the AI could touch. (GitHub Copilot's arc from 2021 to 2026 is the cleaner counter-example for when velocity should lead: ship narrow, instrument everything, harden progressively as failure modes emerge [4].)
The context determines which hat should lead. Neither decision was wrong. They were different calls for different systems at different stages with different users.
How to run the conversation on your team
The question isn't which instinct is right in the abstract. It's which one should govern right now, for this product, at this stage, for these users.
Most teams don't make this decision explicitly. The velocity instinct tends to dominate by default because it's the most visible - users, stakeholders, and leadership can see whether something shipped. The other three instincts are invisible until something goes wrong.
Here's how to surface the decision before an incident makes it for you. Before your next AI feature ships, get written answers from the team to four questions:
- If this system produces a wrong output confidently - not a crash, not an error, just a fluent wrong answer - how will we know? Who catches it, and how quickly?
- If we change the model version, the system prompt, or the retrieval pipeline next month, what's our evidence that quality didn't regress? Do we have evals, or are we eyeballing it?
- What happens if a user sends an input designed to override the system's intended behavior? Have we tested this? What does the output look like?
- What data does this system touch, and are we treating it the way the people it belongs to would expect us to - regardless of what we could technically justify?
The team member who finds these questions hardest to answer is wearing the hat that's been underweighted. That's where the conversation needs to happen.
The reason this conversation often doesn't happen isn't that nobody is empowered to start it - anyone on the team who feels strongly about a reservation should be able to bring it up, and in a healthy team they can. The real problem is that the environment makes it feel costly to do so. When velocity is the dominant culture, raising a security or reliability concern feels like volunteering to be the person who slows things down. The fix isn't to designate a single owner for the uncomfortable question. It's to make clear, before the pressure to ship arrives, that raising any of these four concerns is expected and welcome - not an obstacle to the work, but part of it.
The tiebreaker heuristic: whichever instinct is protecting against the less recoverable failure should get more weight, not less, when the pressure to ship is highest. A delayed feature is recoverable. A violated customer trust obligation is not.
A healthy AI team needs all four instincts active simultaneously. The dysfunction comes from whoever owns the definition of done for the current stage setting the culture - with the other three underweighted or absent.
The instinct most consistently underweighted in the teams I've seen isn't reliability or evaluation. It's security - specifically, the recognition that the attack surface of an AI system is qualitatively different from traditional software, and that the cost of adversarial usage is real and compounds with adoption. This isn't an argument for paranoia. It's an argument for asking the third question - what does this look like to someone trying to break it - before the answer arrives as a customer complaint.
The Watson decision slowed us down. It also meant that when the feature shipped, we knew exactly what the AI could see, what it could do, and what could leave the environment. That's not a constraint on the AI's usefulness. It's the condition under which the AI's usefulness can be trusted.
References
[1] OWASP Gen AI Security Project: LLM01:2025 Prompt Injection
[2] Moffatt v. Air Canada, 2024 BCCRT 149 - American Bar Association coverage
[3] Uber CTO on exhausting the company's full-year 2026 AI budget by April - Yahoo Finance
[4] GitHub Copilot: The Agent Awakens - GitHub Blog, February 2025
The observations here are my own and don't represent Microsoft's views or positions. This post was written with the assistance of AI. The observations, experiences, and judgments are my own. The cover image is AI generated.