AI-Powered Contextual Runbooks

What are AI Runbooks in Temperstack?

AI-Powered Runbooks in Temperstack are dynamically generated guides that help engineers resolve system alerts. When an alert is raised, Temperstack generates an AI runbook with tailored, step-by-step instructions to address and mitigate the issue. The guidance is based on the alert's context, which helps engineers diagnose and resolve problems quickly.

AI runbooks are generated only if you use Temperstack's notification system as your primary notification mechanism.

These runbooks use AI to analyze the specific conditions that led to the alert, such as the affected resources and the metrics that were breached. The resulting recommendations are therefore tailored to the circumstances of each alert rather than generic.

Note: This runbook is AI-generated. User discretion is advised.


How to Access AI-Powered Runbooks in Temperstack?

To access AI-Powered Runbooks in Temperstack, follow these steps:

  1. Navigate to the Alert List: - Go to the Dashboard of your Temperstack account. - Locate and select the Alert List.

  2. Select an Alert: - From the list of alerts, click the specific alert you are interested in. - On the alert details page, the AI-Generated Runbook is displayed on the left side of the screen.

For example:

If you navigate to the alert named “TestAlertCPUUtilization” in the Temperstack dashboard, you will find a detailed AI-generated runbook displayed on the left side of the alert's details page. This runbook provides a comprehensive set of instructions designed to guide you through resolving the high CPU utilization issue identified by the alert.

  • The instructions cover various diagnostic and recovery steps. First, you are advised to SSH into the virtual machine that raised the alert. Once connected, you can use commands like `top` to monitor overall CPU usage and `ps aux` to list running processes and identify those consuming the most CPU. The runbook also suggests using Azure CLI commands to verify that the VM's size is appropriate for the workload. If the virtual machine is undersized, the instructions include steps to resize it.

  • The runbook also provides guidance on optimizing application code if it is contributing to high CPU usage. For ongoing management, it recommends setting up autoscaling rules that adjust the VM's resources based on CPU utilization. Lastly, it suggests considering Azure Virtual Machine Scale Sets for better scalability and load balancing across multiple VMs.

  • This context-specific guidance ensures that you have all the necessary steps to diagnose and resolve the issue efficiently, minimizing downtime and maintaining system performance.
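As a rough illustration of the process-inspection step described above, the Python sketch below parses `ps aux`-style output and ranks processes by CPU share. The function name `top_cpu_processes`, the `ProcessUsage` type, and the sample output are assumptions made for this example; they are not part of Temperstack or of any generated runbook.

```python
from typing import NamedTuple


class ProcessUsage(NamedTuple):
    """One row of `ps aux` output, reduced to the fields used here."""
    pid: int
    user: str
    cpu_percent: float
    command: str


def top_cpu_processes(ps_output: str, limit: int = 3) -> list[ProcessUsage]:
    """Parse `ps aux`-style text and return the heaviest CPU consumers."""
    lines = ps_output.strip().splitlines()
    processes = []
    for line in lines[1:]:  # skip the header row
        # `ps aux` prints 11 columns; the last one (COMMAND) may contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue  # ignore malformed or truncated rows
        user, pid, cpu = fields[0], fields[1], fields[2]
        processes.append(ProcessUsage(int(pid), user, float(cpu), fields[10]))
    # Highest CPU share first.
    processes.sort(key=lambda p: p.cpu_percent, reverse=True)
    return processes[:limit]


# Made-up sample output, for illustration only.
SAMPLE_PS_OUTPUT = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1 167000 11000 ?        Ss   09:00   0:01 /sbin/init
app       2210 87.5  4.2 901000 340000 ?       Rl   09:05  42:10 python worker.py --queue jobs
app       2304 12.3  1.0 450000  81000 ?       Sl   09:06   3:12 nginx: worker process
"""

if __name__ == "__main__":
    for proc in top_cpu_processes(SAMPLE_PS_OUTPUT, limit=2):
        print(f"{proc.pid:>6} {proc.cpu_percent:>5.1f}% {proc.command}")
```

On a live VM, the same function could be fed the output of `subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout` instead of the sample text.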

This feature is built into the alert management system, so actionable instructions are available as soon as an alert is triggered.


How Do AI-Powered Runbooks Work?

When an alert is triggered in Temperstack, the AI-powered system performs a detailed analysis of the alert, including:

  1. Identification of the Alert: - The system identifies the nature of the alert and the specific condition that caused it to trigger.

  2. Resource and Metric Association: - It determines which resources are affected by the alert and the specific metrics that have been breached.

  3. Contextual Instruction Generation: - Based on the analysis of the alert and its associated data, the AI generates a contextual runbook. - This runbook includes a series of steps that should be followed to resolve the issue. These steps are designed to be clear, precise, and directly applicable to the situation at hand.

For instance, if a server is experiencing high CPU usage, the runbook might include steps to identify the processes consuming the most CPU, potential remediation actions like restarting services, and suggestions for optimizing server performance.
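The three stages above can be pictured with a small Python sketch. It is purely illustrative and does not reflect Temperstack's actual implementation: the `Alert` fields, the `analyze_alert` and `generate_runbook` functions, the resource name, and the fixed `SUGGESTED_STEPS` table (which stands in for the AI model) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert shape; not Temperstack's actual schema."""
    name: str
    resource: str
    metric: str
    threshold: float
    observed: float


# Hypothetical lookup table standing in for the AI model's output.
SUGGESTED_STEPS = {
    "cpu_utilization": [
        "Identify the processes consuming the most CPU.",
        "Restart services that are stuck or misbehaving.",
        "Review server sizing and tune performance.",
    ],
}


def analyze_alert(alert: Alert) -> dict:
    """Stages 1 and 2: describe the condition, the resource, and the metric."""
    return {
        "condition": f"{alert.metric} is {alert.observed} (threshold {alert.threshold})",
        "resource": alert.resource,
        "metric": alert.metric,
        "breached": alert.observed > alert.threshold,
    }


def generate_runbook(alert: Alert) -> list[str]:
    """Stage 3: turn the analysis into ordered, alert-specific steps."""
    context = analyze_alert(alert)
    if not context["breached"]:
        return []  # nothing to remediate
    fallback = ["Investigate the affected resource manually."]
    steps = SUGGESTED_STEPS.get(context["metric"], fallback)
    header = f"Alert '{alert.name}' on {context['resource']}: {context['condition']}"
    return [header] + [f"{i}. {step}" for i, step in enumerate(steps, start=1)]


if __name__ == "__main__":
    # "vm-example-01" is a placeholder resource name.
    example = Alert("TestAlertCPUUtilization", "vm-example-01", "cpu_utilization", 80.0, 93.5)
    print("\n".join(generate_runbook(example)))
```

In the real product the step list is produced by an AI model from the alert's context; the lookup table here only shows how alert data could flow from identification through to a list of instructions.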
