AI Agents Are Rewriting SRE Playbooks: From Hours to Seconds
Recent presentations demonstrate AI agents automating incident response workflows, with some implementations reducing mean time to resolution from hours to seconds. The shift signals a fundamental change in how operations teams work.
Site reliability engineers are watching their workflows transform as AI agents move from experimental tools to production systems. Recent case studies and research presentations reveal organizations automating incident response, optimizing performance diagnostics, and reimagining how operations teams handle everything from SLO breaches to routine health checks.
The evidence suggests this isn't incremental improvement—it's a structural shift in how reliability work gets done.
Automating What Used to Take Hours
Bruno Borges, Principal PM Manager at Microsoft, presented research at InfoQ Dev Summit Boston showing how SRE AI agents can reduce mean time to resolution from hours to seconds. According to the presentation, the approach combines established methodologies like USE (Utilization, Saturation, Errors) and jPDM (Java Performance Diagnostics Methodology) with large language models to automate performance diagnostics.
The system uses Model Context Protocol tools to perform real-time diagnostics and memory dump analysis. Borges emphasized a core SRE principle: "If it can be automated, it should be automated." The presentation focused on identifying and addressing bottlenecks—what he described as "rate-limiting activities" in system performance—rather than general performance tuning.
The shift moves performance management from manual troubleshooting to automated agent-driven responses. According to Borges, SREs prioritize efficiency and reliability over raw speed: "If it's fast but breaks often, or cheap but fails SLOs, it's not performing." The AI agents aim to balance speed, cost, and customer expectations simultaneously.
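Borges's tooling itself isn't public, but the USE method is mechanical enough to sketch. The following Python shows an illustrative triage pass over resource samples; the resource names, thresholds, and ranking heuristic are assumptions for the example, not details from the presentation:

```python
from dataclasses import dataclass

@dataclass
class ResourceSample:
    """A point-in-time reading for one resource (CPU, disk, NIC, ...)."""
    name: str
    utilization: float  # fraction of capacity in use, 0.0-1.0
    saturation: float   # queued work waiting on the resource, 0.0-1.0
    errors: int         # error events since the last sample

def use_triage(samples, util_limit=0.8, sat_limit=0.1):
    """Rank resources by USE-method signals, worst suspects first."""
    suspects = []
    for s in samples:
        signals = []
        if s.errors > 0:
            signals.append(f"errors={s.errors}")
        if s.saturation > sat_limit:
            signals.append(f"saturation={s.saturation:.2f}")
        if s.utilization > util_limit:
            signals.append(f"utilization={s.utilization:.2f}")
        if signals:
            suspects.append((s.name, signals))
    # More simultaneous signals -> more likely to be the rate-limiting resource.
    return sorted(suspects, key=lambda x: -len(x[1]))

samples = [
    ResourceSample("cpu", 0.95, 0.30, 0),
    ResourceSample("disk", 0.40, 0.02, 0),
    ResourceSample("nic", 0.60, 0.05, 12),
]
for name, signals in use_triage(samples):
    print(name, signals)
```

An agent would run checks like this continuously against live telemetry, then hand the ranked suspects to an LLM for deeper analysis, rather than waiting for a human to work through the checklist.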
Generalist Agents Enter Virtual Environments
Google DeepMind's SIMA 2 (Scalable Instructable Multiworld Agent) demonstrates how AI agents are developing capabilities that extend beyond narrow task execution. Built on the Gemini foundation model, SIMA 2 can understand and act across multiple 3D virtual game environments without requiring step-by-step direction.
According to the DeepMind research published in December, the agent marks a departure from its predecessor by "reasoning about high-level goals, conversing with the user, and handling complex instructions given through language and images." The researchers report SIMA 2 "substantially closes the gap with human performance" across their test portfolio while demonstrating "robust generalization to previously unseen environments."
The system employs a self-improvement cycle where Gemini provides tasks with estimated rewards, which SIMA 2 uses to build a bank of self-generated experience for subsequent training. The researchers state this allows the agent to "improve on previously failed tasks entirely independently of human-generated demonstrations and intervention."
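The cycle DeepMind describes can be caricatured in a few lines. This toy loop is not SIMA 2's training mechanics; the `skill` variable and reward scheme are invented stand-ins that only show the shape of the cycle: a generator proposes tasks, successful attempts are banked, and the bank drives further improvement:

```python
import random

random.seed(0)

experience_bank = []   # trajectories the agent generated for itself
skill = 0.1            # toy stand-in for policy quality, improved by "training"

for episode in range(500):
    task, est_reward = random.random(), 1.0    # generator proposes a task with an estimated reward
    success = random.random() < skill          # agent attempts the task
    if success:
        experience_bank.append((task, est_reward))   # bank the self-generated experience
        skill = min(0.95, skill + 0.01)              # "fine-tune" on the new experience

print(len(experience_bank), "self-generated training examples; skill:", round(skill, 2))
```

The key property, mirrored here, is that no human demonstrations enter the loop: the agent's own successes become the training data for its next iteration.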
While SIMA 2 operates in gaming environments, the implications for SRE and DevOps are direct. The agent's ability to navigate unfamiliar environments, use tools, and execute collaborative tasks represents capabilities that translate to infrastructure management and incident response scenarios.
Pragmatic Implementation Without the Vision Statement
Michelin's China operations group provides a counterpoint to grand AI transformation narratives. Matthew Liu, an architect in Michelin's China IT operations team, documented an AIOps implementation that began with personal conviction rather than executive mandate.
According to Liu's account published on InfoQ, the motivation was straightforward: monitoring, telemetry, and incident management were already mature, but "the volume of incidents and manual checks continued to rise despite process optimisation efforts." Liu built working demonstrations using Dify, a low-code platform for AI applications, before seeking formal approval.
One prototype helped database administrators with health checks and slow query analysis. Another assisted Kubernetes administrators with routine tasks. These agents used Model Context Protocol servers to query ServiceNow tickets directly, allowing Liu to "wire ServiceNow into Dify and create working AIOps prototypes in just a few hours," according to the InfoQ report.
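Liu's prototypes aren't published, but a ticket query of the kind described can be sketched against ServiceNow's public Table API, which an MCP server would wrap as a tool. The instance name, assignment group, and field list below are hypothetical:

```python
from urllib.parse import urlencode

def build_incident_query(instance, assignment_group, limit=10):
    """Build a ServiceNow Table API URL for open incidents.

    The /api/now/table/<table> endpoint shape follows ServiceNow's public
    REST documentation; `instance` and `assignment_group` are placeholders.
    """
    params = urlencode({
        "sysparm_query": f"active=true^assignment_group.name={assignment_group}",
        "sysparm_limit": limit,
        "sysparm_fields": "number,short_description,priority",
    })
    return f"https://{instance}.service-now.com/api/now/table/incident?{params}"

url = build_incident_query("example", "database-ops", limit=5)
print(url)
```

Wrapping a call like this in an MCP tool is what lets a Dify workflow ask in natural language for "open database incidents" and get structured ticket data back.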
The implementation encountered organizational resistance. Teams wouldn't provide metrics for expected improvements because they feared demonstrating efficiency gains would lead to headcount reductions or more aggressive MTTR targets. Liu repositioned the platform as a low-code exploration tool where operations teams could build workflows themselves, framing it as building "AIOps literacy whilst making knowledge explicit and reusable."
The platform deployed within Michelin's validated AliCloud landing zone using a modular architecture that separated three replaceable layers: the Dify app builder, the LLM reasoning layer, and MCP-based tools connecting to ServiceNow, GitHub, and AliCloud resources. According to Liu's retrospective, the platform has passed security and governance hurdles and moved into concrete work with operations teams on flagship use cases.
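A minimal sketch of that layer separation, assuming nothing about Michelin's actual code: each layer sits behind a swappable interface, with stand-ins where the real LLM and MCP tools would plug in (`EchoLLM`, `StubTools`, and the tool name are invented for illustration):

```python
from typing import Protocol

class ReasoningLayer(Protocol):
    def complete(self, prompt: str) -> str: ...

class ToolLayer(Protocol):
    def call(self, tool: str, args: dict) -> dict: ...

class EchoLLM:
    """Stand-in reasoning layer; swap in any hosted or local model."""
    def complete(self, prompt: str) -> str:
        return f"plan: {prompt}"

class StubTools:
    """Stand-in tool layer; swap in MCP servers (ServiceNow, GitHub, ...)."""
    def call(self, tool: str, args: dict) -> dict:
        return {"tool": tool, "args": args, "status": "ok"}

def run_workflow(llm: ReasoningLayer, tools: ToolLayer, ticket: str) -> dict:
    """The app-builder layer: orchestrates reasoning and tool calls."""
    plan = llm.complete(f"triage {ticket}")
    return tools.call("servicenow.get_incident", {"number": ticket, "plan": plan})

print(run_workflow(EchoLLM(), StubTools(), "INC0012345"))
```

Because the orchestration code depends only on the two protocols, replacing the model vendor or the tool backend doesn't touch the workflow logic, which is the point of keeping the layers replaceable.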
Liu's conclusion emphasized learning over transformation: "We wanted to test, safely and cheaply, whether AIOps can reduce pain in one or two concrete areas. Success is: we learned what works and what doesn't, and we have patterns we can reuse."
The Infrastructure Layer Enabling Agent Work
The Model Context Protocol appears consistently across these implementations. MCP provides a standardized way for AI applications to connect to data sources and tools. Publicly reported figures put MCP server downloads at approximately 100,000 in November 2024 and over 8 million by April 2025, with more than 5,800 MCP servers now available.
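On the wire, MCP is JSON-RPC 2.0: a client invokes a server-side tool with a `tools/call` request carrying the tool name and an arguments object. A minimal construction, with a hypothetical tool name and arguments:

```python
import json

def make_tools_call(request_id, tool_name, arguments):
    """Construct an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })

# Hypothetical tool exposed by a ticketing MCP server.
msg = make_tools_call(1, "query_tickets", {"state": "open", "limit": 5})
print(msg)
```

This uniformity is what makes the ecosystem composable: any client that can emit this message shape can use any of the thousands of available servers, which is also why the security concerns below apply so broadly.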
Major deployments at Block, Bloomberg, and Amazon demonstrate enterprise adoption. However, security researchers have identified open issues, including prompt injection vulnerabilities, combinations of tool permissions that enable file exfiltration, and lookalike tools that can silently replace trusted ones.
Randy Bias of Mirantis argues that MCP needs to become "safe, governable and observable at enterprise scale" before agents can access sensitive data sources. Security and compliance teams remain cautious about allowing arbitrary agents to access critical systems like electronic healthcare records, financial data, and customer PII.
What This Means for Operations Teams
The pattern across these implementations is consistent: AI agents are moving from proof-of-concept to production systems handling real operational work. The capabilities being demonstrated—autonomous incident response, self-improvement through experience, and generalization across unfamiliar environments—represent fundamental changes to how reliability work happens.
According to an Enterprise Management Associates survey cited in the Michelin case study, 80 percent of companies are seeking new AIOps platforms, and half plan to switch within the coming year. The market activity suggests organizations recognize current tools have limitations but see value in the category.
For SRE and DevOps engineers, the implications are immediate. Automation of incident response, SLO monitoring, and operational decision-making is becoming a standard capability rather than a competitive advantage. The Borges presentation's demonstration of reducing MTTR from hours to seconds through agent automation represents a multiple-order-of-magnitude improvement that changes what counts as acceptable operational practice.
The Michelin case provides a template for pragmatic adoption: start with prototypes that solve concrete problems, build organizational literacy through hands-on exploration, and maintain alignment with governance requirements. Liu's approach of building working demonstrations before seeking formal approval, then repositioning based on organizational feedback, offers a pattern others can follow.
The Skills Shift Underway
The technology is moving faster than organizational adaptation. Gagan Singh at Elastic points out that AIOps implementations may require specialized skills in machine learning and data analysis that aren't readily available. The Michelin case demonstrates one response: using low-code platforms that allow operations teams to build workflows themselves, encoding their domain expertise into prompts and agent configurations.
The alternative is waiting for AI systems to become capable enough that specialized skills aren't required. The SIMA 2 research suggests that trajectory is progressing quickly—agents that can reason about high-level goals, learn from experience, and generalize to new environments are already demonstrating capabilities that approach human performance in constrained domains.
For engineers currently working in SRE and DevOps roles, the practical question isn't whether AI will transform their work but how quickly and in what form. The evidence from these implementations suggests the transformation is already underway in production systems, not just research labs. Understanding how to work with AI agents, configure them for specific operational contexts, and integrate them into existing workflows is becoming core competency rather than emerging skill.