Build SRE agents that understand your infrastructure
Just describe a workflow for your agent, then connect it to Unpage’s ecosystem of plugins and tools. In as little as 5 minutes, your agent will have the context it needs to save your team hours of toil every week.
Create
$ unpage agent create host-not-reporting
Define
description: Use this agent to handle "Host Not Reporting" incidents from PagerDuty.
prompt: >
  - Search the logs to see if 'Power key pressed short.' was logged recently.
    - If so, post a status update that the host was cleanly shut down, and resolve it.
  - Search the logs for any connection errors that indicate a New Relic outage, like a ClientTimeout exception.
    - If there are several of these errors in the past hour, post a status update that we were unable to connect to New Relic and resolve it.
  - If neither of the above conditions are met, fetch the instance data from AWS to determine the instance's state.
    - If it is not in 'running' state, post a status update with the instance's current state and resolve it.
tools:
  - pagerduty_*
  - aws_get_instance_status
  - solarwinds_search_logs
Run
$ echo '{"alert": "EC2 instance i-0717343f2401bf83a has not reported to new Relic in 5 minutes"}' | unpage agent run host-not-reporting✓ Agent investigation complete✓ No action required, alert may be resolved.→ Details: The host not reporting alert fordatabase-soothing-wheat.internal was caused by New Relic network connectivityissues, not an actual host problem.Root Cause Analysis:-Alert triggered at 2025-08-26T17:05:02Z for hostdatabase-soothing-wheat.internal (AWS instance i-0717343f2401bf83a)-Investigation found New Relic connection errors at 2025-08-26T17:24:41Zshowing "metric sender can't process" and "Client.Timeout exceeded whileawaiting headers"-These errors indicate New Relic infrastructure API timeouts preventing thehost from successfully reporting metrics-No evidence found of host shutdown or actual system issuesCreate
Creating infra agents couldn't be easier
Create: Choose an agent from the Unpage library or start your own from scratch.
Define: Detail (or refine) the steps your agent will follow, then list the tools it can access.
Run: Test, refine, deploy.
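If you prefer the terminal, the same three steps can be sketched end to end. The agent name, file path, and alert text below are placeholders; only the `unpage agent create` and `unpage agent run` commands come from the demo above.

# 1. Create: scaffold a new agent from the CLI
unpage agent create disk-usage-high

# 2. Define: edit the generated definition (description, prompt, tools);
#    the path is a placeholder — use whatever location the create command reports
$EDITOR agents/disk-usage-high.yaml

# 3. Run: feed the agent a sample alert payload on stdin and review its output
echo '{"alert": "Disk usage above 90% on example-host"}' | unpage agent run disk-usage-high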
Automate repetitive tasks and save time on investigation
Reduce alert noise
Unpage agents use context from your infrastructure graph to triage alerts, eliminate false positives, and surface real risks.
# cpu-alert-agent.yaml
# Used by the router to determine which agent to use for an alert
description: >
  Use this agent to analyze alerts that meet the following criteria:
  - The alert is related to CPU usage exceeding thresholds
  - The alert comes from AWS CloudWatch or Datadog
  - The affected resource is a compute instance (EC2, container, etc.)

# Instructions for the agent
prompt: >
  You are an agent specialized in analyzing high CPU usage alerts.

  When investigating a CPU alert, follow these steps:
  1. Check the current CPU metrics to verify the alert is still active
  2. Look at CPU metrics for the past hour to see if this is a spike or sustained usage
  3. Check logs from around the time the alert started for any errors or unusual activity
  4. Look for any recent deployments or changes that might explain the high usage
  5. Check if similar resources are experiencing the same issue

  Based on your findings, update the incident with:
  - Current status of the issue
  - Likely cause based on available evidence
  - Recommended next steps
  - Whether this appears to be a critical issue requiring immediate human attention

  Be concise but thorough. Include specific metrics, timestamps, and log entries that support your analysis.
  NEVER make up information or assume values you haven't verified.

# Tools the agent can use
tools:
  - core_current_datetime
  - core_convert_to_timezone
  - metrics_get_metrics_for_node
  - metrics_list_available_metrics_for_node
  - graph_get_resource_details
  - graph_get_neighboring_resources
  - graph_get_resource_topology
  - papertrail_search_logs
  - pagerduty_post_status_update
  - pagerduty_get_incident_details
  - aws_describe_ec2_instance

Investigate incidents
Create and run agents that investigate common issues that SRE teams regularly respond to, like SSL connection failures or high disk usage alerts.
description: >
  Investigate SSL/TLS connection failures

# Instructions for the agent
prompt: >
  - Extract the domain/hostname from the PagerDuty alert about connection failures.
  - Use shell command `shell_check_cert_expiration_date` to check the certificate expiration dates
  - Parse the certificate dates to determine if the cert is expired or expiring soon
  - If certificate is expired or expiring within 24 hours:
    - Post high-priority status update to PagerDuty explaining the root cause
    - Include the exact expiration date and affected resources

tools:
  - shell_check_cert_expiration_date
  - pagerduty_post_status_update
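The `shell_check_cert_expiration_date` tool above is a custom shell command supplied by your team. One plausible implementation, shown only as a hedged sketch (this script is an assumption, not something Unpage ships), is a thin wrapper around openssl:

#!/usr/bin/env bash
# Hypothetical helper behind the shell_check_cert_expiration_date tool.
# Prints the notAfter date of the TLS certificate served by a host.
# Usage: check_cert_expiration_date.sh <hostname> [port]
set -euo pipefail
host="$1"
port="${2:-443}"

# Fetch the leaf certificate and print its expiration date
echo | openssl s_client -servername "$host" -connect "${host}:${port}" 2>/dev/null \
  | openssl x509 -noout -enddate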
Install Unpage and run your first agent in < 5 minutes.
Get Started
Join the Community
Connect directly with Unpage engineers and other Unpage users to share what you've built, ask questions, request new features, and provide feedback for further improvement.