ALCOM and identifying missing alerts

ALCOM Score

The ALCOM Score is a metric that measures how comprehensive performance monitoring is for the teams, services, and unmapped resources in an organization's production environment. It is the ratio of alerts already in place to alerts that should be in place, which gives a clear indication of how effective the organization's monitoring setup is.

A low ALCOM Score indicates poor monitoring coverage: potential downtime is typically discovered by customers rather than by internal systems. Conversely, a high ALCOM Score indicates that the organization can detect problems before they cause downtime and affect customers, which enables proactive issue resolution and minimizes the impact on end users.

The ALCOM Score is calculated using Temperstack's proprietary scoring mechanism, which takes into account factors such as the total number of machines and APIs, the number of alerts triggered by each, and the number of alerts successfully set up for monitoring. The mechanism assigns a weight to each alert based on its criticality, its metric, the type of alert (static vs. anomaly detection vs. ping monitoring), the resource type on which the alert is set up (API vs. infrastructure, compute vs. database, etc.), and the type of threshold and evaluation period used.
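The actual scoring mechanism is proprietary and is not described here. As a rough illustration only, the idea of a weighted "alerts in place vs. alerts that should be in place" ratio can be sketched as follows. The `Alert` class, the weight values, and the `coverage_score` function are all hypothetical and are not part of Temperstack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    weight: float   # hypothetical weight (criticality, alert type, etc.)
    deployed: bool  # True if the alert is already set up


def coverage_score(alerts: list[Alert]) -> float:
    """Return the weighted share of expected alerts that are deployed (0.0 to 1.0)."""
    total = sum(a.weight for a in alerts)
    if total == 0:
        return 0.0  # nothing is expected, so there is nothing to cover
    deployed = sum(a.weight for a in alerts if a.deployed)
    return deployed / total


# Illustrative data only: three expected alerts, one of them missing.
alerts = [
    Alert("queue_depth_static", weight=3.0, deployed=True),
    Alert("message_age_anomaly", weight=2.0, deployed=False),
    Alert("endpoint_ping", weight=1.0, deployed=True),
]
print(f"{coverage_score(alerts):.2f}")  # prints 0.67
```

In this sketch, a missing high-weight alert lowers the score more than a missing low-weight one, which is the general effect that weighting by criticality is meant to have.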


Identifying Missing Alerts

Overview of Alert Health:

In the table of Resources Information, you can find the Alert Health section. This section provides an overview of missing alerts that need to be set up, along with how many alerts are currently set up. Alerts that are set up will appear in green, while missing alerts will appear in red.

Viewing Alert Details:

For more detailed information about alerts, you can click on the “Resource Name”. Scroll down to view “Static alert” and “Dynamic alerts” associated with the resource. By clicking on each alert, you can determine whether it has been deployed or not.

Disabling Alerts:

You can disable alerts that are not needed. Toggle the button associated with the alert to disable it, so that you only receive alerts that are relevant to your monitoring needs.

Rescan Resources:

Rescanning resources is a crucial step in setting up alerts effectively. It allows you to identify the services in your account and determine which alerts need to be set up for those specific services.

How Does it Work?

  1. Account Scanning: The software scans your account to identify all the resources you have, including the different resource types, infrastructure, and databases within AWS.

  2. Service Detection: After identifying the services, the software then scans each service to determine which alerts should have been set up. This ensures comprehensive coverage of your monitoring requirements.
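The two steps above can be pictured as comparing what should exist against what does exist. The sketch below is a simplified illustration, not the product's implementation. The `EXPECTED_ALERTS` catalogue and the `find_missing_alerts` function are hypothetical. The metric names are real CloudWatch metrics, but which of them Temperstack expects for each resource type is an assumption.

```python
# Hypothetical catalogue: resource type -> metrics that should have an alert.
EXPECTED_ALERTS = {
    "AWS/SQS": {"ApproximateNumberOfMessagesVisible", "ApproximateAgeOfOldestMessage"},
    "AWS/RDS": {"CPUUtilization", "FreeStorageSpace"},
}


def find_missing_alerts(resources, existing):
    """Compare expected alerts with configured ones for each scanned resource.

    resources: list of dicts with "name" and "type" keys (the scan result).
    existing:  dict mapping resource name -> set of metrics that have an alert.
    """
    report = {}
    for res in resources:
        expected = EXPECTED_ALERTS.get(res["type"], set())
        configured = existing.get(res["name"], set())
        report[res["name"]] = {
            "configured": sorted(expected & configured),  # shown in green
            "missing": sorted(expected - configured),     # shown in red
        }
    return report


scanned = [{"name": "fetch_gcp_resources_dlq", "type": "AWS/SQS"}]
configured = {"fetch_gcp_resources_dlq": {"ApproximateNumberOfMessagesVisible"}}
print(find_missing_alerts(scanned, configured))
```

A rescan corresponds to rebuilding the `scanned` list, so newly created resources show up with their expected alerts listed as missing until those alerts are deployed.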

How to Rescan Resources:

To rescan resources, simply navigate to the "Bulk Actions" option on the AWS Resources page and click on "Rescan Resources."

Example Scenario

Let's say you're responsible for monitoring resources in your AWS environment, specifically focusing on a resource named "fetch_gcp_resources_dlq." Here's how you can use the provided information to understand its details:

  1. Resource Name: "fetch_gcp_resources_dlq" - This is the name of the resource you're monitoring.

  2. Resource Type: AWS/SQS (Simple Queue Service) - Indicates that the resource belongs to the SQS service within AWS.

  3. Environment: PRODUCTION - Specifies that the resource operates in the production environment, indicating its criticality.

  4. Cloud ID: https://sqs.ap-south-1.amazonaws.com/016340224242/fetch_gcp_resources_dlq - The resource's unique identifier in the cloud provider (here, the SQS queue URL), which lets you locate the resource directly within AWS.

  5. Service: NewTestService - Indicates the application or service associated with the resource, providing context for its usage.
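The Cloud ID in this example is an SQS queue URL, which encodes the region, the AWS account ID, and the queue name. A minimal sketch for pulling those parts out is shown below. It assumes the URL follows the `https://sqs.<region>.amazonaws.com/<account-id>/<queue-name>` pattern, and `parse_sqs_queue_url` is a hypothetical helper, not a Temperstack or AWS function.

```python
from urllib.parse import urlparse


def parse_sqs_queue_url(url: str) -> dict:
    """Split an SQS queue URL into region, account ID, and queue name."""
    parsed = urlparse(url)
    host_parts = (parsed.hostname or "").split(".")  # e.g. ["sqs", "ap-south-1", "amazonaws", "com"]
    path_parts = parsed.path.strip("/").split("/")   # e.g. ["016340224242", "fetch_gcp_resources_dlq"]
    if len(host_parts) < 4 or host_parts[0] != "sqs" or len(path_parts) != 2:
        raise ValueError(f"not a recognised SQS queue URL: {url}")
    return {
        "region": host_parts[1],
        "account_id": path_parts[0],
        "queue_name": path_parts[1],
    }


cloud_id = "https://sqs.ap-south-1.amazonaws.com/016340224242/fetch_gcp_resources_dlq"
print(parse_sqs_queue_url(cloud_id))
```

For the example resource, this yields the region `ap-south-1`, the account ID `016340224242`, and the queue name `fetch_gcp_resources_dlq`, which matches the Resource Name shown in the table.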

By understanding the resource's service and environment, you can determine which alerts should be set up for it. These alerts are typically defined in thresholds based on the resource's type and criticality.

*Note: To learn more, see “Alert Thresholds.”*
