Temperstack

Alerting Templates - metrics & customisation

Last updated 4 months ago

Alerting Templates

Our platform provides alerting templates for every important resource type, each with metrics tailored to that resource. The alerts in each template were developed by our team of experts and are listed in the Thresholds section.

In the Resources section, each resource is listed under a unique name, known as its cloud ID.

The alerts that should be set up for each cloud ID are specified in the thresholds. These thresholds provide guidance on which alerts are critical for monitoring the resource effectively.

Navigate to Thresholds → AWS Thresholds (for example). Here, you'll find two types of thresholds: Static and Anomaly Detection.


Static Alert Thresholds

Static thresholds are predetermined values set for specific metrics, ensuring proactive monitoring and immediate alerting when performance metrics deviate from acceptable ranges.

Key Features:

  • Critical Role in Cloud Infrastructure: Static thresholds are essential for ensuring the smooth operation of applications relying on AWS components like RDS, EC2, ELB, and SQS.

  • Comprehensive Monitoring with CloudWatch: Amazon CloudWatch provides a wide range of metrics, offering valuable insights into instance performance and health in the dynamic cloud computing landscape.

  • Predefined Performance Ranges: Static thresholds involve setting predetermined values for specific metrics, representing acceptable performance ranges.

Benefits:

  • Proactive Monitoring: By establishing static thresholds, administrators can proactively monitor metrics and detect deviations from acceptable performance ranges.

  • Immediate Alerting: Breaching static thresholds triggers alerts, prompting immediate attention and action from administrators or automated systems, ensuring timely resolution of issues.

Example Scenario:

Let's consider an example scenario where we're monitoring the Free Storage Space of an Amazon RDS instance.

We set the measurement period to 60 seconds. This means that one data point for the free storage space available on the RDS instance is evaluated every 60 seconds.

Static Alert Threshold:

  • Metric Name: FreeStorageSpace

  • Comparison: LessThanOrEqualToThreshold

  • Value: 2,000,000,000 Bytes

  • Unit: Bytes

  • Period: 60 seconds

  • Evaluation Period: 10 intervals

Explanation: In this scenario, we have set up a static alert threshold for the Free Storage Space metric of the RDS instance. If the amount of free storage space is at or below 2,000,000,000 Bytes (about 2 GB) in each of 10 consecutive 60-second periods, the alert will be triggered.

Scenario Interpretation: Suppose our RDS instance is experiencing increased usage, causing the free storage space to decrease rapidly. If free storage stays at or below the threshold for ten consecutive intervals of 60 seconds each (10 minutes in total), the alert will be raised. This indicates a potential issue with storage capacity, allowing us to take proactive measures such as scaling up storage or optimising database usage to prevent service disruptions.

By configuring such static alert thresholds, we ensure timely detection of critical conditions and enable prompt action to maintain the health and performance of our AWS resources.
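The evaluation logic of a static threshold can be sketched in a few lines of Python. This is only an illustration, not Temperstack's or CloudWatch's implementation: the `StaticThreshold` class and the `should_alert` function are hypothetical names, the sample values are made up, and the sketch assumes that an alert fires only when every one of the most recent evaluation periods breaches the threshold.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class StaticThreshold:
    """Hypothetical container mirroring the fields of the example above."""
    metric_name: str
    value: float             # threshold, in the metric's unit (bytes here)
    period_seconds: int      # length of time covered by one data point
    evaluation_periods: int  # consecutive data points that must breach


def is_breaching(datapoint: float, threshold: StaticThreshold) -> bool:
    # LessThanOrEqualToThreshold: breach when the value is at or below the threshold.
    return datapoint <= threshold.value


def should_alert(datapoints: Sequence[float], threshold: StaticThreshold) -> bool:
    # Assumption: alert only when the most recent N data points all breach.
    n = threshold.evaluation_periods
    if len(datapoints) < n:
        return False
    return all(is_breaching(dp, threshold) for dp in datapoints[-n:])


free_storage = StaticThreshold("FreeStorageSpace", 2_000_000_000, 60, 10)

# One data point per 60-second period, newest last (illustrative values, in bytes).
healthy = [5_000_000_000] * 10
low = [1_500_000_000] * 10
recovering = [1_500_000_000] * 9 + [2_500_000_000]

print(should_alert(healthy, free_storage))     # False: never at or below 2 GB
print(should_alert(low, free_storage))         # True: 10 consecutive breaches
print(should_alert(recovering, free_storage))  # False: latest period recovered
```

Requiring several consecutive breaching periods is what keeps a single noisy data point from raising an alert.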


Anomaly Detection Alert Thresholds

Anomaly detection is a powerful technique that identifies data points, events, or observations that deviate from the typical pattern or expected behaviour. Unlike static thresholds, which rely on fixed values, Anomaly Detection thresholds dynamically adapt to your system's behaviour, providing more accurate and efficient alerting.

Key Features:

  1. Dynamic Thresholds: Anomaly Detection alarms produce a dynamic threshold that represents the normal range of values for the metric. This threshold continuously adapts to your system's behaviour, reducing false positives and eliminating the need for manual adjustments.

  2. Alarm Setup: You can configure alarms based on various conditions such as "Outside the threshold," "Greater than the threshold," or "Lower than the threshold," allowing for flexible alerting based on your specific monitoring requirements.

  3. Standard Deviation: Anomaly alerts are based on the standard deviation of the metric. The deviation count indicates the number of standard deviations by which a data point deviates from the expected pattern. Typically, a deviation count of two standard deviations is used to trigger an alarm.

Understanding Standard Deviation:

A standard deviation is a statistical measure that quantifies the amount of variation or dispersion in a set of data points. In the context of Anomaly Detection, standard deviation is used to calculate the normal range of values for a metric. By analysing historical data, the system determines the average value and variability of the metric, allowing it to identify deviations from this expected pattern.
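As a rough illustration of how a normal range can be derived from historical data, the sketch below computes a band of the mean plus or minus a number of standard deviations. It is a deliberate simplification: real anomaly detection models typically also account for trends and seasonality, and the function name `expected_band` and the sample history are assumptions made for this example.

```python
import statistics
from typing import Sequence, Tuple


def expected_band(history: Sequence[float], deviation_count: float) -> Tuple[float, float]:
    """Return (lower, upper) bounds: mean +/- deviation_count standard deviations."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)  # population standard deviation of the history
    return mean - deviation_count * stdev, mean + deviation_count * stdev


# Illustrative historical values of a metric (for example, CPU utilisation in percent).
history = [40.0, 42.0, 38.0, 41.0, 39.0]

lower, upper = expected_band(history, deviation_count=2)
print(f"normal range: {lower:.2f} to {upper:.2f}")
```

A larger deviation count widens the band, so fewer data points are treated as anomalies; a smaller one makes the alert more sensitive.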

Benefits:

  1. Adaptive Alerting: Anomaly Detection thresholds adapt to changes in your system's behaviour, ensuring accurate and timely alerting without manual intervention.

  2. Reduced False Positives: By dynamically adjusting to your system's performance, Anomaly Detection helps minimise false positive alerts, allowing you to focus on genuine issues requiring attention.

  3. Efficient Resource Management: With Anomaly Detection, you can efficiently monitor a wide range of metrics without the need for manual threshold adjustments, saving time and resources.

Example Scenario:

Let's consider a scenario where we're monitoring the CPU utilisation of an Amazon RDS instance using Anomaly Detection Thresholds.

  • Metric: CPUUtilization

  • Comparison: LessThanLowerOrGreaterThanUpperThreshold

  • Deviation Count: 10

  • Unit: Percent

  • Period: 60 seconds

  • Evaluation Period: 10 intervals

Explanation:

In this scenario, the anomaly detection threshold is set up to monitor the CPUUtilization metric of the RDS instance. The threshold is configured to trigger an alert if the CPU utilisation deviates significantly from the expected pattern.

Interpretation:

Suppose the CPUUtilization metric suddenly spikes or drops, moving outside the expected range, which here spans 10 standard deviations above and below the expected value. If the metric stays outside that range for the full evaluation period of 10 intervals of 60 seconds each (10 minutes in total), an alert will be raised.

Scenario Outcome:

For example, if the CPU utilisation of the RDS instance suddenly spikes due to increased workload or a performance issue, the anomaly detection threshold will detect this deviation from the normal pattern. Consequently, an alert will be triggered, notifying administrators or automated systems to investigate and address the issue promptly.

By utilising anomaly detection thresholds, organisations can effectively monitor their AWS resources and detect abnormal behaviour that may indicate potential issues or performance anomalies.
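The anomaly check in this scenario can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm used by Temperstack or CloudWatch: the expected range is taken as the mean of a fixed baseline plus or minus the deviation count times its standard deviation, an alert requires every one of the most recent evaluation periods to fall outside that range, and the names `outside_band` and `should_alert` as well as the sample data are hypothetical.

```python
import statistics
from typing import Sequence


def outside_band(value: float, baseline: Sequence[float], deviation_count: float) -> bool:
    # LessThanLowerOrGreaterThanUpperThreshold: anomalous on either side of the band.
    mean = statistics.fmean(baseline)
    stdev = statistics.pstdev(baseline)
    lower = mean - deviation_count * stdev
    upper = mean + deviation_count * stdev
    return value < lower or value > upper


def should_alert(
    recent: Sequence[float],
    baseline: Sequence[float],
    deviation_count: float,
    evaluation_periods: int,
) -> bool:
    # Assumption: alert only when the latest N data points are all outside the band.
    if len(recent) < evaluation_periods:
        return False
    window = recent[-evaluation_periods:]
    return all(outside_band(v, baseline, deviation_count) for v in window)


# Illustrative baseline of CPU utilisation (percent), one value per 60-second period.
baseline = [30.0, 32.0, 28.0, 31.0, 29.0] * 4

sustained_spike = [95.0] * 10
brief_spike = [95.0] * 3 + [31.0] * 7
sustained_drop = [2.0] * 10

print(should_alert(sustained_spike, baseline, 10, 10))  # True
print(should_alert(brief_spike, baseline, 10, 10))      # False
print(should_alert(sustained_drop, baseline, 10, 10))   # True
```

Because the comparison is two-sided, both an unusually high and an unusually low CPU utilisation raise an alert, while a spike that ends before the evaluation window is complete does not.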


Customisation

While you have the flexibility to customise alerts according to your needs, we generally advise against it. This is because our default alerts are already equipped with crucial metrics essential for resource monitoring. However, if necessary, you can add additional alerts to the list to meet specific incident management requirements. Should you wish to exclude any alerts, simply disable them using the toggle button located on the right side of the thresholds table.

For instance, if you're using Kubernetes ELK, our software comes with default alerts tailored for this setup. We developed these alerts to cover the metrics that matter most, so that monitoring is comprehensive out of the box.

Figure: Thresholds List (example: AWS Thresholds)