Accelerate Incident Resolution with PagerDuty and AWS DevOps Agent

Every ops engineer knows the scenario: your phone buzzes at 2 a.m. with a critical alert. Your heart sinks. The notification tells you that something is broken, but not why. You’re now scrambling through CloudWatch logs, SSH-ing into instances, and running diagnostics while your application hemorrhages traffic and your customers watch their requests time out. This context gap between detection and understanding is where SRE teams waste the most time during incidents. AWS and PagerDuty have partnered to close that gap with the AWS DevOps Agent, a tool designed to automatically gather diagnostic data and surface it directly in PagerDuty incidents, with the goal of cutting mean time to resolution (MTTR).

The technical architecture is straightforward but powerful. The AWS DevOps Agent runs on your EC2 instances and integrates with PagerDuty’s incident response platform. When an alert fires, instead of just sending a notification, the agent automatically collects system metrics, logs, and operational context (CPU usage, memory consumption, disk I/O, application logs, and even running processes) and attaches them to the incident in PagerDuty. Your on-call engineer doesn’t need to jump into the AWS console or SSH into a box; a diagnostic snapshot is already waiting in the incident details. Imagine a database connection pool exhaustion at 3 a.m.: normally you’d check application logs, query CloudWatch, verify database metrics, and check process listings. With this integration, all that context arrives pre-packaged in the incident itself. The integration is also bidirectional, so responders can run diagnostic commands directly from PagerDuty without leaving the incident interface.
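To make the idea of an "enriched" incident concrete, here is a minimal sketch of what a diagnostic snapshot could look like when attached to a PagerDuty event. It uses the general shape of a PagerDuty Events API v2 trigger event (`routing_key`, `event_action`, `payload`, `custom_details`); it is not how the AWS DevOps Agent itself is implemented. The function name `build_enriched_event`, the snapshot keys, and all sample values are hypothetical, and nothing is sent over the network.

```python
import json
from datetime import datetime, timezone

# Severity values accepted by the PagerDuty Events API v2.
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_enriched_event(routing_key, summary, source, snapshot, severity="critical"):
    """Build a trigger event that carries a diagnostic snapshot.

    Hypothetical helper for illustration only: it assembles the event body
    as a dict and does not contact PagerDuty.
    """
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity!r}")
    return {
        "routing_key": routing_key,        # the PagerDuty integration key
        "event_action": "trigger",
        "payload": {
            "summary": summary[:1024],     # keep the summary short
            "source": source,              # e.g. the affected instance
            "severity": severity,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            # Free-form context; the keys below are assumptions.
            "custom_details": snapshot,
        },
    }


# Sample values are made up to mirror the connection-pool scenario above.
snapshot = {
    "cpu_percent": 41.5,
    "memory_percent": 72.0,
    "disk_io_wait_percent": 3.2,
    "db_pool_in_use": 50,
    "db_pool_max": 50,
    "recent_log_lines": ["ERROR: timed out waiting for a database connection"],
    "top_processes": ["java", "nginx"],
}

event = build_enriched_event(
    routing_key="YOUR_INTEGRATION_KEY",
    summary="Database connection pool exhausted on checkout-api",
    source="i-0123456789abcdef0",
    snapshot=snapshot,
)

print(json.dumps(event, indent=2))
```

The point of the sketch is the `custom_details` field: the responder opening the incident sees the pool at 50 of 50 connections alongside the alert summary, rather than having to go find that number.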

This matters in practice because SRE teams operate under brutal time pressure. Every minute an incident remains unresolved costs real money: in downtime, in customer churn, and in team burnout from context-switching. By eliminating the “go find the logs” phase of incident response, teams can spend their time on actual remediation. Consider a typical scenario: a web service starts returning 5xx errors. With the AWS DevOps Agent, your responder sees immediately that application memory usage is at 98%, which points them toward a memory leak or an unoptimized query, instead of spending five minutes digging through CloudWatch dashboards. For teams managing hundreds of microservices across multiple AWS regions, the savings compound: faster diagnosis across more incidents means shorter middle-of-the-night pages and more sleep for your team.

Setting this up requires AWS IAM roles that grant the agent permission to collect metrics and logs, a PagerDuty integration key, and minimal configuration on your EC2 instances. If you’re already using PagerDuty and have a standard Linux/Windows deployment on EC2, the barrier to entry is low. The value, though, is immediate: reduced incident response time, less cognitive load on your on-call engineers, and the data you need exactly when you need it. In cloud infrastructure, speed of diagnosis often determines the outcome. This integration gives you that advantage.
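As a rough illustration of the IAM side of that setup, the sketch below builds a read-only policy document covering CloudWatch metrics and CloudWatch Logs. The article does not list the permissions the agent actually needs, so the action list here is an assumption chosen to match "collect metrics and logs"; check the official setup guide for the real requirements. The helper name `build_read_only_policy`, the account ID, and the log group name are placeholders.

```python
import json


def build_read_only_policy(log_group_arn):
    """Return an IAM policy document granting read-only metric and log access.

    Hypothetical helper: the actions below are an assumed minimal set,
    not the agent's documented permission list.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadMetrics",
                "Effect": "Allow",
                "Action": [
                    "cloudwatch:GetMetricData",
                    "cloudwatch:ListMetrics",
                ],
                # These CloudWatch actions are not scoped to a single resource.
                "Resource": "*",
            },
            {
                "Sid": "ReadLogs",
                "Effect": "Allow",
                "Action": [
                    "logs:GetLogEvents",
                    "logs:FilterLogEvents",
                ],
                # Limit log access to one log group and its streams.
                "Resource": log_group_arn,
            },
        ],
    }


# Placeholder region, account ID, and log group name.
policy = build_read_only_policy(
    "arn:aws:logs:us-east-1:123456789012:log-group:/app/checkout-api:*"
)

print(json.dumps(policy, indent=2))
```

Scoping the log statement to a specific log group keeps the role narrow: the collector can read what it needs for diagnosis without gaining access to every log in the account.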

Source
↗ AWS DevOps & Developer Productivity Blog