ALCOM and identifying monitoring gaps

ALCOM Score

The ALCOM Score is a metric that evaluates how effective performance monitoring is in an organization's production environment. It quantifies the ratio of implemented alerts to necessary alerts, giving insight into monitoring coverage for teams, services, and unmapped resources. A low score indicates that customers are likely to notice issues before internal monitoring does, while a high score suggests that potential downtime is detected and resolved proactively.

Temperstack's proprietary scoring mechanism calculates the ALCOM Score by considering factors such as total machines and APIs, alert triggers, and successful monitoring setups. The algorithm weighs each alert based on criticality, metric type, resource type, threshold, and evaluation period, offering a nuanced assessment of an organization's monitoring capabilities.
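The actual scoring algorithm is proprietary and is not described here. As a rough illustration of the underlying idea only (a weighted ratio of implemented alerts to necessary alerts), here is a minimal Python sketch. The `Alert` class, the weights, and the alert names are hypothetical and are not taken from Temperstack.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    weight: float      # hypothetical weight standing in for criticality, metric type, etc.
    implemented: bool  # True if the alert is already set up


def coverage_score(alerts: list[Alert]) -> float:
    """Weighted share of necessary alerts that are implemented, as a percentage."""
    total = sum(a.weight for a in alerts)
    if total == 0:
        # No necessary alerts defined; returning 0.0 is an arbitrary choice for this sketch.
        return 0.0
    implemented = sum(a.weight for a in alerts if a.implemented)
    return 100.0 * implemented / total


# Illustrative data only.
alerts = [
    Alert("cpu_high", 3.0, True),
    Alert("disk_full", 2.0, False),
    Alert("latency_p99", 5.0, True),
]
score = coverage_score(alerts)
print(f"Coverage: {score:.1f}%")
```

In this sketch a missing high-weight alert lowers the score more than a missing low-weight one, which mirrors the idea of weighing alerts by criticality.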

Identifying Missing Alerts in your monitoring

Overview of Alert Health:

The Resources Information table includes an Alert Health section. It shows how many alerts are currently set up and how many are still missing and need to be set up. Alerts that are set up appear in green, while missing alerts appear in red.

Viewing Alert Details:

For more detailed information about alerts, click the “Resource Name”. Scroll down to view the “Static alerts” and “Dynamic alerts” associated with the resource. Click an individual alert to see whether it has been deployed.

Disabling Alerts:

You can disable alerts that you do not need. Toggle the button associated with an alert to disable it, so that you only receive alerts that are relevant to your monitoring needs.

Rescan Resources:

Rescanning resources is a crucial step in setting up alerts effectively. It allows you to identify the services in your account and determine which alerts need to be set up for those specific services.

How Does it Work?

Account Scanning: The software scans your account to identify everything it contains, including the different types of resources, infrastructure, and databases within AWS.
Service Detection: After identifying the services, the software scans each service to determine which alerts should be set up for it. This ensures comprehensive coverage of your monitoring requirements.
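The two steps above can be pictured as a comparison between the alerts expected for each resource type and the alerts already configured. The following Python sketch shows that comparison under stated assumptions: the `EXPECTED_ALERTS` catalogue, the `find_missing_alerts` function, and the `orders-db` resource are hypothetical, and the metric names are standard CloudWatch metrics used only as examples. It is not how Temperstack implements its scan.

```python
# Hypothetical catalogue: which alert metrics are expected per resource type.
EXPECTED_ALERTS = {
    "AWS/SQS": {"ApproximateNumberOfMessagesVisible", "ApproximateAgeOfOldestMessage"},
    "AWS/RDS": {"CPUUtilization", "FreeStorageSpace"},
}


def find_missing_alerts(resources, configured):
    """Compare expected alerts with configured ones.

    resources:  list of (resource_name, resource_type) tuples found by the scan
    configured: dict mapping resource_name to the set of alerts already set up
    """
    report = {}
    for name, resource_type in resources:
        expected = EXPECTED_ALERTS.get(resource_type, set())
        present = configured.get(name, set())
        report[name] = {
            "set_up": sorted(expected & present),   # would appear in green
            "missing": sorted(expected - present),  # would appear in red
        }
    return report


# Illustrative scan result.
resources = [("fetch_gcp_resources_dlq", "AWS/SQS"), ("orders-db", "AWS/RDS")]
configured = {"fetch_gcp_resources_dlq": {"ApproximateNumberOfMessagesVisible"}}

report = find_missing_alerts(resources, configured)
for name, status in report.items():
    print(name, status)
```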

How to Rescan Resources:

To rescan resources, simply navigate to the "Bulk Actions" option on the AWS Resources page and click on "Rescan Resources."

Example Scenario

Let's say you're responsible for monitoring resources in your AWS environment, specifically focusing on a resource named "fetch_gcp_resources_dlq." Here's how you can use the provided information to understand its details:

Resource Name: "fetch_gcp_resources_dlq" - This is the name of the resource you're monitoring.
Resource Type: AWS/SQS (Simple Queue Service) - Indicates that the resource belongs to the SQS service within AWS.
Environment: PRODUCTION - Specifies that the resource operates in the production environment, indicating its criticality.
Cloud ID: [https://sqs.ap-south-1.amazonaws.com/016340224242/fetch_gcp_resources_dlq] - The unique identifier of the resource in the cloud. For an SQS queue this is the queue URL, which you can use to locate the resource directly within AWS.
Service: NewTestService - Indicates the application or service associated with the resource, providing context for its usage.
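For an SQS resource like this one, the Cloud ID itself carries useful details. The sketch below assumes the Cloud ID follows the same layout as the example, `https://sqs.<region>.amazonaws.com/<account-id>/<queue-name>`, and splits it into its parts. The `parse_sqs_url` helper is hypothetical and is not part of any Temperstack or AWS API.

```python
from urllib.parse import urlparse


def parse_sqs_url(url: str) -> dict:
    """Split an SQS queue URL into region, account ID, and queue name.

    Assumes the layout https://sqs.<region>.amazonaws.com/<account-id>/<queue-name>.
    """
    parsed = urlparse(url)
    host_parts = (parsed.hostname or "").split(".")  # e.g. ["sqs", "ap-south-1", "amazonaws", "com"]
    path_parts = parsed.path.strip("/").split("/")   # e.g. ["016340224242", "fetch_gcp_resources_dlq"]
    if len(host_parts) < 4 or host_parts[0] != "sqs" or len(path_parts) != 2:
        raise ValueError(f"Unexpected SQS URL layout: {url}")
    return {
        "region": host_parts[1],
        "account_id": path_parts[0],
        "queue_name": path_parts[1],
    }


details = parse_sqs_url(
    "https://sqs.ap-south-1.amazonaws.com/016340224242/fetch_gcp_resources_dlq"
)
print(details)
```

Keeping the account ID as a string preserves its leading zero.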

Once you know the resource's service and environment, you can determine which alerts should be set up for it. Each alert is typically defined by a threshold that depends on the resource's type and criticality.
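To make the relationship between resource type, environment, and thresholds concrete, here is a small Python sketch of a lookup table. Every value in it is an assumption made for illustration: the `THRESHOLDS` table, the `STAGING` environment name, and the numbers are hypothetical and do not reflect Temperstack's defaults.

```python
# Hypothetical thresholds keyed by (resource type, environment).
# A dead-letter queue in production is given a stricter limit than one in staging.
THRESHOLDS = {
    ("AWS/SQS", "PRODUCTION"): {"ApproximateNumberOfMessagesVisible": 1},
    ("AWS/SQS", "STAGING"): {"ApproximateNumberOfMessagesVisible": 100},
}


def thresholds_for(resource_type: str, environment: str) -> dict:
    """Return the alert thresholds for a resource, or an empty dict if none are defined."""
    return THRESHOLDS.get((resource_type, environment.upper()), {})


prod = thresholds_for("AWS/SQS", "PRODUCTION")
print(prod)
```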

*Note: To learn more, see “Alert Thresholds.”*
