Build SRE agents that understand your infrastructure
Just describe a workflow for your agent, then connect it to Unpage’s ecosystem of plugins and tools. In as little as 5 minutes, your agent will have the context it needs to save your team hours of toil every week.
Create
$ unpage agent create host-not-reporting
Define
description: Use this agent to handle "Host Not Reporting" incidents from PagerDuty.
prompt: >
  - Search the logs to see if 'Power key pressed short.' was logged recently.
    - If so, post a status update that the host was cleanly shut down, and resolve it.
  - Search the logs for any connection errors that indicate a New Relic outage, like a ClientTimeout exception.
    - If there are several of these errors in the past hour, post a status update that we were unable to connect to New Relic and resolve it.
  - If neither of the above conditions are met, fetch the instance data from AWS to determine the instance's state.
    - If it is not in 'running' state, post a status update with the instance's current state and resolve it.
tools:
  - pagerduty_*
  - aws_get_instance_status
  - solarwinds_search_logs
Run
$ echo '{"alert": "EC2 instance i-0717343f2401bf83a has not reported to new Relic in 5 minutes"}' | unpage agent run host-not-reporting✓ Agent investigation complete✓ No action required, alert may be resolved.→ Details: The host not reporting alert fordatabase-soothing-wheat.internal was caused by New Relic network connectivityissues, not an actual host problem.Root Cause Analysis:-Alert triggered at 2025-08-26T17:05:02Z for hostdatabase-soothing-wheat.internal (AWS instance i-0717343f2401bf83a)-Investigation found New Relic connection errors at 2025-08-26T17:24:41Zshowing "metric sender can't process" and "Client.Timeout exceeded whileawaiting headers"-These errors indicate New Relic infrastructure API timeouts preventing thehost from successfully reporting metrics-No evidence found of host shutdown or actual system issuesCreate
Creating infra agents couldn't be easier
Create: Choose an agent from the Unpage library or start your own from scratch.
Define: Detail (or refine) the steps your agent will follow, then list the tools it can access.
Run: Test, refine, deploy.
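If you prefer the terminal, the same three steps can be sketched end to end. The agent name, file path, and alert text below are placeholders; only the `unpage agent create` and `unpage agent run` commands come from the demo above.

# 1. Create: scaffold a new agent from the CLI
unpage agent create disk-usage-high

# 2. Define: edit the generated definition (description, prompt, tools);
#    the path is a placeholder — use whatever location the create command reports
$EDITOR agents/disk-usage-high.yaml

# 3. Run: feed the agent a sample alert payload on stdin and review its output
echo '{"alert": "Disk usage above 90% on example-host"}' | unpage agent run disk-usage-high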
Automate repetitive tasks and save time on investigation
Reduce alert noise
Unpage agents use context from your infrastructure graph to triage alerts, eliminate false positives, and surface real risks.
# cpu-alert-agent.yaml
# Used by the router to determine which agent to use for an alert
description: >
  Use this agent to analyze alerts that meet the following criteria:
  - The alert is related to CPU usage exceeding thresholds
  - The alert comes from AWS CloudWatch or Datadog
  - The affected resource is a compute instance (EC2, container, etc.)

# Instructions for the agent
prompt: >
  You are an agent specialized in analyzing high CPU usage alerts.

  When investigating a CPU alert, follow these steps:
  1. Check the current CPU metrics to verify the alert is still active
  2. Look at CPU metrics for the past hour to see if this is a spike or sustained usage
  3. Check logs from around the time the alert started for any errors or unusual activity
  4. Look for any recent deployments or changes that might explain the high usage
  5. Check if similar resources are experiencing the same issue

  Based on your findings, update the incident with:
  - Current status of the issue
  - Likely cause based on available evidence
  - Recommended next steps
  - Whether this appears to be a critical issue requiring immediate human attention

  Be concise but thorough. Include specific metrics, timestamps, and log entries that support your analysis.
  NEVER make up information or assume values you haven't verified.

# Tools the agent can use
tools:
  - core_current_datetime
  - core_convert_to_timezone
  - metrics_get_metrics_for_node
  - metrics_list_available_metrics_for_node
  - graph_get_resource_details
  - graph_get_neighboring_resources
  - graph_get_resource_topology
  - papertrail_search_logs
  - pagerduty_post_status_update
  - pagerduty_get_incident_details
  - aws_describe_ec2_instance

Investigate incidents
Create and run agents that investigate common issues that SRE teams regularly respond to, like SSL connection failures or high disk usage alerts.
description: >
  Investigate SSL/TLS connection failures

# Instructions for the agent
prompt: >
  - Extract the domain/hostname from the PagerDuty alert about connection failures.
  - Use shell command `shell_check_cert_expiration_date` to check the certificate expiration dates
  - Parse the certificate dates to determine if the cert is expired or expiring soon
  - If certificate is expired or expiring within 24 hours:
    - Post high-priority status update to PagerDuty explaining the root cause
    - Include the exact expiration date and affected resources

tools:
  - shell_check_cert_expiration_date
  - pagerduty_post_status_update
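The `shell_check_cert_expiration_date` tool above is a custom shell command supplied by your team. One plausible implementation, shown only as a hedged sketch (this script is an assumption, not something Unpage ships), is a thin wrapper around openssl:

#!/usr/bin/env bash
# Hypothetical helper behind the shell_check_cert_expiration_date tool.
# Prints the notAfter date of the TLS certificate served by a host.
# Usage: check_cert_expiration_date.sh <hostname> [port]
set -euo pipefail
host="$1"
port="${2:-443}"

# Fetch the leaf certificate and print its expiration date
echo | openssl s_client -servername "$host" -connect "${host}:${port}" 2>/dev/null \
  | openssl x509 -noout -enddate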
Install Unpage and run your first agent in < 5 minutes.
Get Started
Join the Community
Connect directly with Unpage engineers and other Unpage users to share what you've built, ask questions, request new features, and provide feedback for further improvement.