Alert thresholds - default metrics and customisation

Default Metrics

Our platform provides templates for all important alerts, each with specific metrics tailored to your instance's requirements. Our team of experts developed these templates after thorough study, and the alerts they define are listed in the Thresholds section.

In the Resources section, you'll find a unique name for each resource, known as its cloud ID.

The alerts that should be set up for each cloud ID are specified in the thresholds. These thresholds provide guidance on which alerts are critical for monitoring the resource effectively.

For example, navigate to Thresholds → AWS Thresholds. Here, you'll find two types of thresholds: Static and Anomaly Detection.


Static Alert Thresholds

Static thresholds are predetermined values set for specific metrics, ensuring proactive monitoring and immediate alerting when performance metrics deviate from acceptable ranges.

Key Features:

  • Critical Role in Cloud Infrastructure: Static thresholds are essential for ensuring the smooth operation of applications relying on AWS components like RDS, EC2, ELB, and SQS.

  • Comprehensive Monitoring with CloudWatch: Amazon CloudWatch provides a wide range of metrics that give insight into instance performance and health.

  • Predefined Performance Ranges: Static thresholds involve setting predetermined values for specific metrics, representing acceptable performance ranges.

Benefits:

  • Proactive Monitoring: By establishing static thresholds, administrators can proactively monitor metrics and detect deviations from acceptable performance ranges.

  • Immediate Alerting: Breaching static thresholds triggers alerts, prompting immediate attention and action from administrators or automated systems, ensuring timely resolution of issues.

Example Scenario:

Let's consider an example scenario where we're monitoring the Free Storage Space of an Amazon RDS instance.

We set the measurement interval (the Period) to 60 seconds. This means that every 60 seconds, we check the amount of free storage space available on the RDS instance.

Static Alert Threshold:

  • Metric Name: FreeStorageSpace

  • Comparison: LessThanOrEqualToThreshold

  • Value: 2,000,000,000 Bytes

  • Unit: Bytes

  • Period: 60 seconds

  • Evaluation Period: 10 intervals

Explanation: In this scenario, we have set up a static alert threshold for the Free Storage Space metric of the RDS instance. If the amount of free storage space is at or below 2,000,000,000 Bytes (about 2 GB) in each of 10 consecutive 60-second intervals, the alert will be triggered.

Scenario Interpretation: Suppose our RDS instance is experiencing increased usage, causing the free storage space to decrease rapidly. If free storage stays at or below the threshold for ten consecutive intervals of 60 seconds each (totalling 600 seconds, or 10 minutes), the alert will be raised. This indicates a potential issue with storage capacity, allowing us to take proactive measures such as scaling up storage or optimising database usage to prevent service disruptions.

By configuring such static alert thresholds, we ensure timely detection of critical conditions and enable prompt action to maintain the health and performance of our AWS resources.
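
The evaluation rule in the example above can be sketched in a few lines of Python. This is only an illustration, not the platform's implementation: the StaticThreshold class and the is_alarming function are hypothetical names, and the sketch assumes CloudWatch-style semantics, where the Period is the length of one interval and the Evaluation Period is the number of consecutive intervals that must breach.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class StaticThreshold:
    """Hypothetical container mirroring the fields listed in the example."""
    metric_name: str
    comparison: str          # only LessThanOrEqualToThreshold is handled here
    value: float             # threshold, expressed in `unit`
    unit: str
    period_seconds: int      # length of one evaluation interval
    evaluation_periods: int  # consecutive intervals that must breach


def is_alarming(threshold: StaticThreshold, datapoints: Sequence[float]) -> bool:
    """Return True when the most recent `evaluation_periods` datapoints all breach.

    `datapoints` holds one aggregated value per period, oldest first.
    """
    if threshold.comparison != "LessThanOrEqualToThreshold":
        raise ValueError(f"unsupported comparison: {threshold.comparison}")
    if len(datapoints) < threshold.evaluation_periods:
        return False  # not enough data to evaluate yet
    recent = datapoints[-threshold.evaluation_periods:]
    return all(point <= threshold.value for point in recent)


free_storage = StaticThreshold(
    metric_name="FreeStorageSpace",
    comparison="LessThanOrEqualToThreshold",
    value=2_000_000_000,
    unit="Bytes",
    period_seconds=60,
    evaluation_periods=10,
)

# Total window the alert looks at: 10 periods x 60 seconds = 600 seconds.
window_seconds = free_storage.period_seconds * free_storage.evaluation_periods
```

Under these assumptions, ten consecutive readings at or below 2,000,000,000 Bytes raise the alert, while a single reading above the threshold inside the window keeps it quiet.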


Anomaly Detection Alert Thresholds

Anomaly detection is a powerful technique that identifies data points, events, or observations that deviate from the typical pattern or expected behaviour. Unlike static thresholds, which rely on fixed values, Anomaly Detection thresholds dynamically adapt to your system's behaviour, providing more accurate and efficient alerting.

Key Features:

  1. Dynamic Thresholds: Anomaly Detection alarms produce a dynamic threshold that represents the normal range of values for the metric. This threshold continuously adapts to your system's behaviour, reducing false positives and eliminating the need for manual adjustments.

  2. Alarm Setup: You can configure alarms based on various conditions such as "Outside the threshold," "Greater than the threshold," or "Lower than the threshold," allowing for flexible alerting based on your specific monitoring requirements.

  3. Standard Deviation: Anomaly alerts are based on the standard deviation of the metric. The deviation count indicates the number of standard deviations by which a data point deviates from the expected pattern. Typically, a deviation count of two standard deviations is used to trigger an alarm.

Understanding Standard Deviation:

A standard deviation is a statistical measure that quantifies the amount of variation or dispersion in a set of data points. In the context of Anomaly Detection, standard deviation is used to calculate the normal range of values for a metric. By analysing historical data, the system determines the average value and variability of the metric, allowing it to identify deviations from this expected pattern.
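
As a rough illustration of this idea, the Python sketch below derives a normal range from historical values as the mean plus or minus a chosen number of standard deviations. It is a deliberate simplification: production anomaly detection models, such as CloudWatch's, also account for trends and seasonality. The normal_band function and the sample values are made up for this example.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def normal_band(history: Sequence[float], deviation_count: float) -> Tuple[float, float]:
    """Return (lower, upper) as mean -/+ deviation_count * standard deviation."""
    center = mean(history)
    spread = pstdev(history)  # population standard deviation of the history
    return center - deviation_count * spread, center + deviation_count * spread


# Made-up CPU utilisation samples (percent), for illustration only.
cpu_history = [40.0, 42.0, 38.0, 41.0, 39.0]
lower, upper = normal_band(cpu_history, deviation_count=2)
```

A larger deviation count widens the band, so fewer data points fall outside it and fewer alerts are raised.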

Benefits:

  1. Adaptive Alerting: Anomaly Detection thresholds adapt to changes in your system's behaviour, ensuring accurate and timely alerting without manual intervention.

  2. Reduced False Positives: By dynamically adjusting to your system's performance, Anomaly Detection helps minimise false positive alerts, allowing you to focus on genuine issues requiring attention.

  3. Efficient Resource Management: With Anomaly Detection, you can efficiently monitor a wide range of metrics without the need for manual threshold adjustments, saving time and resources.

Example Scenario:

Let's consider a scenario where we're monitoring the CPU utilisation of an Amazon RDS instance using Anomaly Detection Thresholds.

  • Metric: CPUUtilization

  • Comparison: LessThanLowerOrGreaterThanUpperThreshold

  • Deviation Count: 10

  • Unit: Percent

  • Period: 60 seconds

  • Evaluation Period: 10 intervals

Explanation:

In this scenario, the anomaly detection threshold is set up to monitor the CPUUtilization metric of the RDS instance. The threshold is configured to trigger an alert if the CPU utilisation deviates significantly from the expected pattern.

Interpretation:

Suppose the CPUUtilization metric suddenly spikes or drops, departing sharply from its recent trend. If the metric stays outside the expected band, which here extends 10 standard deviations on either side of the expected value, for 10 consecutive 60-second intervals (10 minutes in total), an alert will be raised.

Scenario Outcome:

For example, if the CPU utilisation of the RDS instance suddenly spikes due to increased workload or a performance issue, the anomaly detection threshold will detect this deviation from the normal pattern. Consequently, an alert will be triggered, notifying administrators or automated systems to investigate and address the issue promptly.

By utilising anomaly detection thresholds, organisations can effectively monitor their AWS resources and detect abnormal behaviour that may indicate potential issues or performance anomalies.
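
The band comparison used in this example can be sketched as follows. This is a hedged illustration rather than the platform's logic: breaches and is_anomalous are hypothetical helpers, the band values are invented, and the two one-sided comparison names are assumed to follow CloudWatch's naming, since only LessThanLowerOrGreaterThanUpperThreshold appears in the example.

```python
from typing import Sequence, Tuple

Band = Tuple[float, float]  # (lower, upper) bounds of the expected range


def breaches(value: float, band: Band, comparison: str) -> bool:
    """Check one datapoint against its band for the given comparison mode."""
    lower, upper = band
    if comparison == "LessThanLowerOrGreaterThanUpperThreshold":
        return value < lower or value > upper
    if comparison == "GreaterThanUpperThreshold":
        return value > upper
    if comparison == "LessThanLowerThreshold":
        return value < lower
    raise ValueError(f"unsupported comparison: {comparison}")


def is_anomalous(values: Sequence[float], bands: Sequence[Band],
                 comparison: str, evaluation_periods: int) -> bool:
    """True when the latest `evaluation_periods` datapoints all breach their band."""
    if len(values) != len(bands):
        raise ValueError("each datapoint needs a matching band")
    if len(values) < evaluation_periods:
        return False  # not enough data to evaluate yet
    recent = list(zip(values, bands))[-evaluation_periods:]
    return all(breaches(value, band, comparison) for value, band in recent)


band = (20.0, 60.0)  # hypothetical expected CPU range, in percent
cpu = [35.0] * 5 + [95.0] * 10  # normal load followed by a sustained spike
spike_alert = is_anomalous(
    cpu, [band] * len(cpu), "LessThanLowerOrGreaterThanUpperThreshold", 10
)
```

In a real system the band moves with each period as the model adapts, which is why the sketch takes one band per datapoint instead of a single fixed pair.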


Customisation

While you have the flexibility to customise alerts according to your needs, we generally advise against it, because our default alerts already cover the metrics essential for resource monitoring. However, if necessary, you can add alerts to the list to meet specific incident management requirements. Should you wish to exclude any alerts, simply disable them using the toggle button located on the right side of the thresholds table.

For instance, if you're using Kubernetes ELK, our software comes with default alerts tailored for this setup. Our team developed these alerts to cover the metrics that matter most for comprehensive monitoring.
