← Back to News

Production-Ready Autonomous Incident Resolution with AWS DevOps Agent (now GA) and Datadog MCP Server

The partnership between AWS and Datadog has matured into something genuinely useful: a system that can detect, diagnose, and fix infrastructure problems with minimal human intervention. AWS DevOps Agent, now generally available, works alongside Datadog’s Model Context Protocol (MCP) Server to turn monitoring alerts into actionable resolutions. Instead of waiting for on-call engineers to wake up, correlate logs, check configurations, and apply fixes, this integration handles the routine work automatically—and does it in minutes instead of hours.

Here’s how it works technically. The AWS DevOps Agent acts as an orchestrator that understands your AWS infrastructure deeply. When Datadog detects an anomaly—say, elevated CPU on an EC2 instance or a failing health check—the agent queries the Datadog MCP Server to gather context. The MCP Server translates monitoring signals into structured data about what’s happening. The agent then correlates this with your actual AWS configuration: which security group rules are active, what autoscaling policies exist, whether this instance is part of a load balancer, and so on. Armed with this complete picture, the agent can execute remediation actions like restarting services, scaling up capacity, or rolling back recent deployments. Importantly, it operates within guardrails you define—certain actions require approval, and everything is logged for compliance and learning.

The practical impact matters most. Consider a common scenario: a microservice running on ECS starts returning 5xx errors because memory usage exceeded its allocated threshold. Traditionally, this triggers a page, someone runs commands to check logs, queries the dashboard, identifies the memory leak in a recent deployment, and rolls back. That’s 20-30 minutes of human effort and customer impact. With this setup, the system detects the issue, correlates the memory spike with the deployment timestamp in your CI/CD pipeline, identifies it as anomalous, and automatically rolls back the problematic version—all while alerting the team for visibility. Teams report resolving 60-70% of incidents this way without waking anyone up.

What makes this production-ready now is maturity in three areas. First, AWS DevOps Agent is GA, meaning it’s fully supported and the API surface is stable. Second, integration with Datadog’s MCP Server is well-tested across real customer workloads, not just proof-of-concepts. Third, the safety mechanisms—approval workflows, action limits, audit trails—are baked in, not added as afterthoughts. For teams already using both AWS and Datadog, this is worth evaluating seriously. Start small: enable autonomous resolution for well-understood, low-risk actions like service restarts or adding instances. As you gain confidence, expand the scope. The goal isn’t to eliminate your operations team—it’s to free them from repetitive diagnostics so they can focus on improving systems and building resilience.

Source
↗ AWS DevOps & Developer Productivity Blog