The Machine That Needs a Witness: Why AI Can't Handle Incidents Alone
SRE teams are quietly rejecting the promise of autonomous AI incident response. The shift reveals something deeper about trust, judgment, and what we're willing to delegate when systems fail.
The alert comes at 3 AM. A production system is failing. In one version of the future, an AI agent investigates, diagnoses, and deploys a fix while you sleep. In the version we're actually building, the AI does the tedious work—sifting logs, correlating metrics, drafting hypotheses—but waits for you to say yes.
This isn't a failure of imagination. It's a deliberate choice.
According to InfoQ's recent coverage of OpsWorker's multi-agent incident response work, the industry is moving decisively toward what they call "human-centered AI" for site reliability engineering. Rather than handing the pager to a machine, teams are designing systems where specialized AI agents work alongside on-call engineers, narrowing the search space and automating the tedious steps while leaving judgment calls to humans.
The Experiments That Changed Minds
Ar Hakboian, co-founder of OpsWorker, ran a series of experiments that tell you everything you need to know about where we are. He built a three-agent system: one investigates the alert, one reviews the findings, one makes the final call. They communicate through sequential files, each agent reading the previous one's analysis. They have access to real Kubernetes clusters and can even create pull requests.
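The hand-off pattern described above can be sketched in a few lines. This is a hypothetical illustration, not OpsWorker's actual implementation: the agent names, file names, and stubbed "analysis" payloads are all assumptions, and the real agents would be LLM-backed rather than hardcoded functions. What the sketch shows is the shape of the pipeline: each stage reads the previous stage's file and writes its own, and the final stage produces a proposal rather than an action.

```python
import json
from pathlib import Path

# Illustrative three-stage pipeline with file-based hand-offs.
# All names and payloads are assumptions for the sketch.
WORKDIR = Path("incident-001")

def investigator(alert: dict) -> Path:
    """First agent: investigate the alert and write findings to a file."""
    findings = {"alert": alert, "hypothesis": "init container exits non-zero"}
    out = WORKDIR / "01-investigation.json"
    out.write_text(json.dumps(findings))
    return out

def reviewer(findings_file: Path) -> Path:
    """Second agent: read the investigator's file and write a review."""
    findings = json.loads(findings_file.read_text())
    review = {"findings": findings, "verdict": "hypothesis plausible"}
    out = WORKDIR / "02-review.json"
    out.write_text(json.dumps(review))
    return out

def decider(review_file: Path) -> dict:
    """Third agent: read the review and propose (not apply) a remediation."""
    review = json.loads(review_file.read_text())
    return {
        "proposal": "delete pod",
        "basis": review["verdict"],
        "requires_human_approval": True,  # the pipeline ends in a proposal
    }

WORKDIR.mkdir(exist_ok=True)
decision = decider(reviewer(investigator(
    {"pod": "failing-init-demo", "restarts": 4512})))
print(decision["proposal"])
```

The sequential files double as an audit trail: after the incident, each agent's reasoning is on disk in the order it happened.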
In seven documented experiments, the agents successfully identified root causes. They caught an init container hardcoded to fail. They spotted an AWS load balancer configured with HTTPS on port 444 instead of 443. They proposed fixes.
But here's what stayed with me: in one experiment, the agents investigated a pod called failing-init-demo that had been crashing for days. They analyzed 4,512 restarts over 16 days. They proposed deletion as the fix. What they didn't realize was that this was the test case—the very scenario designed to evaluate them. As Hakboian notes in his deep-dive blog post, "Agents lack meta-awareness. They didn't question why a deployment named 'failing-init-demo' existed."
That gap—the inability to step back and ask why—is not a bug to be fixed in the next model release. It's the space where human judgment lives.
What the Research Actually Shows
The academic work supports this intuition. Zefang Liu's recent arXiv paper used the Backdoors and Breaches tabletop framework to study how teams of LLM agents coordinate during simulated cyber incidents. Liu compared centralized, decentralized, and hybrid team structures.
The findings were specific: homogeneous teams operating under centralized and hybrid structures achieved the highest success rates. Decentralized teams of domain specialists struggled to reach consensus without a leader. Even more interesting, mixed teams of specialists sometimes performed worse than homogeneous teams of generalists—the specialists disagreed on priorities and couldn't converge.
OpsWorker's approach addresses this by emphasizing what Hakboian calls "explicit role design and structured hand-offs." Each agent has a clear set of tools and responsibilities. But even with perfect orchestration, Hakboian's conclusion is measured: "The agents are excellent technical investigators but lack the safety controls, reliability engineering, and operational maturity required for production incident response."
The Gap Between Promise and Performance
EverOps, a cloud consultancy, recently surveyed SRE professionals and found that only 4% believe AI will replace their jobs within two years. The majority—53%—see it as a tool to make work easier. This isn't denial or fear talking. It's pattern recognition.
EverOps also ran their own test, using ClickHouse to pit advanced language models against real root-cause analysis scenarios. The result, according to their analysis: "autonomous root cause analysis by LLMs fell short of human-guided investigation." The models struggled to break out of narrow lines of reasoning. They missed subtle signals that experienced SREs would catch immediately.
In live incident response, where systems are failing and every minute matters, those gaps become critical.
Where AI Actually Helps
The irony is that AI is already transforming SRE work—just not in the dramatic, job-replacing way the headlines promise. The practical use cases are quieter: log ingestion and anomaly detection at scale, triage automation based on service topology, alert clustering to reduce noise, retrieval-based access to internal documentation.
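Alert clustering, one of the quieter use cases above, is worth a concrete sketch. The version below groups alerts by a normalized message signature; a production system would cluster on service topology or embeddings instead, and the sample alerts and masking rules here are assumptions for illustration only.

```python
import re
from collections import defaultdict

# Minimal alert-clustering sketch: mask the variable parts of a message
# (pod hash suffixes, numbers) so repeated alerts collapse into one group.
def signature(message: str) -> str:
    masked = re.sub(r"-[a-z0-9]{5,}\b", "-*", message)  # pod suffixes -> *
    return re.sub(r"\d+", "N", masked)                  # numbers -> N

def cluster(alerts: list[str]) -> dict[str, list[str]]:
    groups = defaultdict(list)
    for alert in alerts:
        groups[signature(alert)].append(alert)
    return groups

alerts = [
    "pod checkout-7f9c4 restarted 3 times",
    "pod checkout-b21aa restarted 7 times",
    "disk usage 91% on node-2",
]
groups = cluster(alerts)
print(len(groups))  # three raw alerts collapse into two clusters
```

The point of the sketch is the payoff, not the regexes: an on-call engineer pages through two clusters instead of a stream of near-duplicate alerts.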
Amazon Web Services published a detailed example of a multi-agent SRE assistant built on their Bedrock platform. The architecture uses a supervisor coordinating four specialized agents—one for metrics, one for logs, one for topology, one for runbooks—all wired into a Kubernetes backend. It's sophisticated work. But notice what it optimizes for: comprehensive analysis, source attribution, step-by-step procedures. The human still approves the action.
This is the pattern emerging across the industry. As Hakboian puts it, AI agents should "propose hypotheses, queries and remediation options while humans stay in the loop for judgment and approval."
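That propose-then-approve pattern has a simple structural core, sketched below. The class and method names are hypothetical, and the remediation is a stub string rather than a real command; what the sketch shows is the boundary: agents can only append to a pending queue, and nothing executes until a human calls the approval method.

```python
from dataclasses import dataclass
from typing import Callable

# Hedged sketch of a human approval gate. Names and the proposal
# structure are assumptions; only the control-flow boundary matters.
@dataclass
class Proposal:
    summary: str
    action: Callable[[], str]  # the remediation, held but not run
    approved: bool = False

class ApprovalGate:
    def __init__(self) -> None:
        self.pending: list[Proposal] = []

    def propose(self, proposal: Proposal) -> None:
        """Agents may call this; it only queues the proposal."""
        self.pending.append(proposal)

    def approve_and_run(self, index: int) -> str:
        """Only a human invocation of this method executes anything."""
        proposal = self.pending.pop(index)
        proposal.approved = True
        return proposal.action()

gate = ApprovalGate()
gate.propose(Proposal(
    summary="restart checkout deployment",
    action=lambda: "kubectl rollout restart deploy/checkout",
))
print(len(gate.pending))       # one proposal waiting; nothing has run
result = gate.approve_and_run(0)  # the human says yes
print(result)
```

The design choice worth noticing: the agent never holds a reference to anything that mutates production, only to a proposal object that a human must unlock.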
What This Means for Your Career
If you're building skills in DevOps or SRE, this trend shapes what matters. The value isn't in doing what AI can do—parsing logs, drafting status updates, clustering alerts. The value is in the judgment layer: knowing when an anomaly matters, understanding system context that isn't in the documentation, making the call when the runbook doesn't cover your specific failure mode.
Hakboian offers specific recommendations for teams integrating AI agents: start with read-only access, test with realistic incidents, grant minimum necessary privileges, roll out gradually. But the underlying principle is simpler—treat AI as an investigator that needs supervision, not a colleague you can fully trust.
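The "start with read-only access" recommendation translates naturally into a tool policy. This is a minimal sketch under stated assumptions: the tool names are invented for illustration, and a real deployment would enforce this at the Kubernetes RBAC layer rather than in application code. The default posture denies every mutating tool until someone deliberately widens it.

```python
# Hypothetical tool-access policy illustrating least-privilege rollout.
# Tool names are assumptions; real enforcement belongs in RBAC/IAM.
READ_ONLY_TOOLS = {"get_pods", "read_logs", "query_metrics"}
MUTATING_TOOLS = {"delete_pod", "apply_manifest"}

class ToolPolicy:
    def __init__(self, allow_mutations: bool = False) -> None:
        # Mutations are opt-in, never the default.
        self.allow_mutations = allow_mutations

    def check(self, tool: str) -> bool:
        if tool in READ_ONLY_TOOLS:
            return True
        if tool in MUTATING_TOOLS and self.allow_mutations:
            return True
        return False  # unknown tools are denied outright

policy = ToolPolicy()                # default: read-only
print(policy.check("read_logs"))     # allowed
print(policy.check("delete_pod"))    # blocked until explicitly enabled
```

Gradual rollout then becomes a one-line change—constructing the policy with `allow_mutations=True` for a single, well-tested agent—rather than a rearchitecture.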
The engineers who thrive won't be the ones who resist AI or the ones who blindly delegate to it. They'll be the ones who understand where the boundary is, who can evaluate an AI's analysis and know what it's missing, who recognize that some decisions require a kind of contextual awareness we haven't figured out how to automate.
The Deeper Pattern
There's something revealing in this industry course-correction. We built systems that could automate incident response. The technology works, within limits. But we're choosing not to deploy it autonomously in production.
Maybe it's about risk—the knowledge that when things break at scale, the consequences are real. Maybe it's about trust, which is harder to code than capability. Maybe it's just that we've been here before, with automation that worked perfectly until it didn't, and we remember what that felt like.
The future of AI in incident response isn't about removing humans from the loop. It's about building loops where humans and AI have clearly defined roles, where the machine does the grinding work and the human makes the calls that require stepping back to see the whole picture.
That's not the science fiction version. But it might be the version that actually works when your production system fails at 3 AM and someone needs to decide what to do next.