Concepts & Terminology

This page lays out the essential concepts and terminology used in Temperstack.

Temperstack is an incident management tool with additional features for alert audit, alert setup automation, and AI-powered real-time runbooks. As an incident management system, it routes alert notifications (by email, Slack, voice, and SMS) to the right teams using on-call and escalation policies. In addition, it provides everything you need to audit and pinpoint missing alerts, automate the creation of new alerts, and optimize existing alerts to prevent alert fatigue.

Monitoring

Monitoring is the systematic observation and analysis of a system's state and performance. It encompasses tracking changes in system conditions and data flows to detect anomalies and facilitate problem resolution. A monitoring system comprises software components that collect, process, and display relevant data.

Monitoring strategies fall into two categories:

Proactive Monitoring: Involves the continuous examination of visual indicators, such as time series data and dashboards, to preemptively identify potential issues.
Reactive Monitoring: Utilizes automated alerting mechanisms to notify operators of significant system state changes, prompting immediate attention and action.

Together, these approaches enable comprehensive system oversight and timely issue resolution.

Alerting, a key feature of monitoring systems, detects and communicates critical events to operators. Alerts are concise messages delivered via various channels, including email, SMS, instant messaging platforms, or phone calls. These notifications are directed to the appropriate personnel responsible for addressing the identified issues.

Modern monitoring relies on automated alerts. This approach allows development teams to oversee complex systems efficiently, without dedicated monitoring staff.

Alerts and Metric monitors

Alerts notify you of potential issues through email, SMS, phone calls, or tickets. They are triggered when a monitor detects that a system has crossed a predefined threshold. For example, an alert may notify you when CPU usage exceeds 80% for 10 consecutive minutes.

Metric monitors evaluate time-series data against thresholds, each defined by a limit and a duration. When data points breach a threshold, the monitor enters an alert state. It returns to a clear state when the data falls back within limits. Monitor states inform alarm evaluations.
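
The threshold-and-duration behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not Temperstack's implementation: the `MetricMonitor` class, its field names, and the sample data are all assumptions made for the example, and a 3-point duration stands in for the 10 minutes mentioned earlier to keep the data short.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class MetricMonitor:
    """Toy threshold monitor: alerts after `duration` consecutive breaching points."""
    threshold: float  # limit, e.g. 80.0 for 80% CPU
    duration: int     # consecutive breaching data points required to alert

    def evaluate(self, points: Iterable[float]) -> List[str]:
        """Return the monitor state ("clear" or "alert") after each data point."""
        states = []
        breaching = 0
        for value in points:
            # Count consecutive breaches; any in-limit point resets the streak.
            breaching = breaching + 1 if value > self.threshold else 0
            states.append("alert" if breaching >= self.duration else "clear")
        return states


# One data point per minute (hypothetical CPU readings, in percent).
monitor = MetricMonitor(threshold=80.0, duration=3)
cpu = [70, 85, 90, 95, 92, 60, 88]
states = monitor.evaluate(cpu)
print(states)
```

The monitor only enters the alert state on the third consecutive breach and clears as soon as one reading falls back within the limit, which is why short spikes do not page anyone.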

Alert categories typically include:

Infrastructure: e.g., virtual machine CPU usage exceeding 99%
Application: e.g., endpoints returning 5XX status codes
Key transactions: e.g., decreased sign-ins or purchases

Alert coverage

Alert coverage measures how thoroughly a system's components are monitored through metrics and alerts. This includes infrastructure, applications, and key transactions.

Ideally, alert coverage should adapt to system growth and changes. However, alerts are often set up initially and rarely updated. This can lead to outdated monitoring that fails to detect real issues, sometimes until end users are affected.

Maintaining comprehensive alert coverage is crucial for timely problem detection, yet it's frequently overlooked until problems arise.

Are you aware of your alert coverage effectiveness? Inadequate coverage often leads to two extreme outcomes:

Frequent, preventable downtime incidents
Excessive spending on over-provisioned resources to avoid outages

Temperstack notifications

Temperstack offers out-of-the-box capabilities to connect your existing alerts to Temperstack notifications over Slack, email, WhatsApp, and phone call.

You can integrate your existing observability tool directly with Temperstack for each service.

Step 1: Select the service and the integration type from the drop-down menu.

Step 2: Name the integration and generate an integration key.

Step 3: Copy the integration URL and use it in your alerting systems to post alerts to the URL.

If you do not find your observability tool in the drop-down menu, or you need a custom event to be captured or alerted on, use the Generic Webhook as the integration type.
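
As a rough sketch of Step 3 for a custom event, the Python snippet below builds a JSON POST request aimed at an integration URL. Everything specific here is an assumption: the URL is a placeholder for the one you copy from Temperstack, and the payload fields (`title`, `severity`) are hypothetical, not a documented schema. Check the integration's own documentation for the fields it expects.

```python
import json
import urllib.request

# Placeholder: paste the integration URL copied in Step 3.
INTEGRATION_URL = "https://example.invalid/your-integration-url"


def build_alert_request(url: str, title: str, severity: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for an alert event."""
    # Hypothetical payload fields; the real schema may differ.
    payload = {"title": title, "severity": severity}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_alert_request(INTEGRATION_URL, "CPU above 80% for 10 minutes", "critical")
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request, timeout=10)
```

Building the request separately from sending it makes it easy to inspect the body before pointing your alerting system at the real URL.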

Migrating PagerDuty & Opsgenie - One-click, fully automated

Temperstack offers a one-click, fully automated migration from PagerDuty and Opsgenie to Temperstack notifications.

If you are already a user of PagerDuty or Opsgenie, we will migrate and set up an exact replica of your PagerDuty/Opsgenie setup and alerts through an out-of-the-box integration.

Step 1: Go to Admin -> Integrations.

Step 2: Click on PagerDuty or Opsgenie, as the case may be.

Step 3: Enter your PagerDuty/Opsgenie credentials.

Step 4: Click Save, then sit back and relax for 15 minutes.

An exact replica of your PagerDuty/Opsgenie service-to-alert mappings, teams, and on-call and escalation rosters will be created.

Your migration is done!

Please note that at this point, Temperstack has created an exact replica of your alerts and teams. Notifications will be triggered on both PagerDuty/Opsgenie and Temperstack simultaneously.

Many users prefer to run both notification systems in parallel for some time before fully transitioning from PagerDuty/Opsgenie to Temperstack.

ALCOM (Alert Completeness) score

The ALCOM (Alert Completeness) score is a proprietary metric developed by Temperstack to provide a numerical value that measures the alert coverage of infrastructure, applications, and key transactions. It is expressed as a percentage, with 100% being the maximum score.

How is ALCOM computed?

When a customer onboards to Temperstack and connects their infrastructure (e.g., AWS) and APM (e.g., New Relic), Temperstack maps out all active components of the infrastructure and application service resources. It then scans individual infrastructure and application service components to check for alerts that have been set up and compares them with a standard list of critical alerts to provide the ALCOM score. Temperstack repeats this process for every infrastructure and application service to give an aggregated ALCOM score.

Example: For an EC2 instance, there are 7 critical metrics that need to be monitored. If a particular EC2 instance has alerts covering only 2 of the 7, its ALCOM score is below 30%.

How can I find my ALCOM score?
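
The arithmetic behind the example can be sketched as follows. This assumes the score is simply the share of required alerts that are configured, and that aggregation pools covered and required alerts across resources; the page does not state either formula, so treat both as assumptions. The alert names are real CloudWatch EC2 metric names picked for illustration, not Temperstack's standard list.

```python
from typing import Dict, Set

# Illustrative stand-in for a standard list of 7 critical EC2 alerts.
# Chosen for the example only; this is not Temperstack's actual list.
REQUIRED_EC2_ALERTS: Set[str] = {
    "CPUUtilization", "StatusCheckFailed", "NetworkIn", "NetworkOut",
    "DiskReadOps", "DiskWriteOps", "CPUCreditBalance",
}


def coverage_score(required: Set[str], configured: Set[str]) -> float:
    """Percentage of required alerts that are configured (assumed formula)."""
    if not required:
        return 100.0
    return 100.0 * len(required & configured) / len(required)


def aggregate_score(required: Set[str], resources: Dict[str, Set[str]]) -> float:
    """Assumed aggregation: covered alerts over required alerts, pooled across resources."""
    if not required or not resources:
        return 100.0
    covered = sum(len(required & configured) for configured in resources.values())
    return 100.0 * covered / (len(required) * len(resources))


# Hypothetical instances and the alerts each one currently has.
instances = {
    "i-0001": {"CPUUtilization", "StatusCheckFailed"},  # 2 of 7 covered
    "i-0002": set(REQUIRED_EC2_ALERTS),                 # fully covered
}
single = coverage_score(REQUIRED_EC2_ALERTS, instances["i-0001"])
overall = aggregate_score(REQUIRED_EC2_ALERTS, instances)
print(f"i-0001: {single:.1f}%  overall: {overall:.1f}%")
```

Under these assumptions the instance with 2 of 7 alerts scores roughly 29%, consistent with the "below 30%" figure in the example.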

GET YOUR ALCOM SCORE FOR FREE (you will need to provide IAM credentials)

Setting up AWS Integration

Setting up NewRelic Integration

How often is ALCOM computed?

Each time there is a system expansion or structural change, Temperstack discovers the new changes and re-computes the ALCOM score.

How is the ALCOM score actionable?

The ALCOM score is actionable in two ways:

It provides the score at the granularity of a service group/team, so it can be used to track improvement and enforce alert coverage.
It pinpoints the missing alerts on each resource that need to be fixed to improve the ALCOM score.

Alert Audit & Setup Automation

Alert setup automation in the Temperstack context refers to three capabilities that are unique to Temperstack:

Temperstack scans your infrastructure and microservices to create an infrastructure and service catalog.
It audits each instance of infrastructure resources and microservices against a defined alerting policy to identify absent alerts.
It enables one-click deployment of all missing alerts in your monitoring systems and ties them to the appropriate team in Temperstack notifications.
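
The audit step (comparing each resource against an alerting policy to find absent alerts) can be sketched as a set difference. The catalog, policy, resource IDs, and alert names below are all hypothetical, and the sketch stops at reporting gaps; it does not deploy anything.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical alerting policy: resource type -> alerts every instance should have.
POLICY: Dict[str, Set[str]] = {
    "ec2": {"cpu_high", "status_check_failed"},
    "rds": {"cpu_high", "free_storage_low"},
}

# Hypothetical catalog: resource id -> (resource type, alerts currently configured).
CATALOG: Dict[str, Tuple[str, Set[str]]] = {
    "i-web-1": ("ec2", {"cpu_high"}),
    "i-web-2": ("ec2", {"cpu_high", "status_check_failed"}),
    "db-main": ("rds", set()),
}


def audit(catalog: Dict[str, Tuple[str, Set[str]]],
          policy: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Return the missing alerts for each resource that has gaps."""
    gaps: Dict[str, List[str]] = {}
    for resource_id, (resource_type, configured) in catalog.items():
        # Alerts the policy requires for this type that are not configured.
        missing = policy.get(resource_type, set()) - configured
        if missing:
            gaps[resource_id] = sorted(missing)
    return gaps


gaps = audit(CATALOG, POLICY)
for resource_id, missing in gaps.items():
    print(resource_id, "is missing:", ", ".join(missing))
```

Fully covered resources (here `i-web-2`) are left out of the result, so the output is exactly the worklist of alerts to create.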

Temperstack alert audit and setup automation is currently available for AWS CloudWatch, New Relic, and Cloud Monitoring in GCP. Please keep checking our change log for additional capabilities. Datadog and Azure are already in the pipeline for this capability.

Service as defined in Temperstack

In Temperstack's context, a service is a collection or group of linked resources, micro-services, and team members.

Think of a service as a unique key that connects a micro-service (one or multiple APIs/apps) to its resources (EC2, databases, load balancers), cache, and the team that manages it.

Example:

In the above figure, you can see multiple services listed; for example, Payments is a service.

Each service has the following attributes visible in the service listing.

Service Name: The name that allows the team to easily identify the microservice group and its corresponding team.
Group Email: The email address where notifications related to this service should be sent.
Slack Channel: The channel where notifications related to the service will be posted.
On-Call Policy: This refers to the rotation and escalation roster of team members used for making phone calls and handling escalations when an incident occurs.
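
One way to picture these attributes is as a single record that every alert for the service refers to. The sketch below is only a mental model: the `Service` class, its field names, and the sample values are assumptions for illustration, not Temperstack's data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Service:
    """Illustrative record tying a micro-service group to its team and resources."""
    name: str
    group_email: str
    slack_channel: str
    on_call_policy: str  # name of the single on-call policy for this service
    resources: List[str] = field(default_factory=list)


def notification_targets(service: Service) -> Dict[str, str]:
    """Targets that every alert tied to the service would inherit."""
    return {
        "email": service.group_email,
        "slack": service.slack_channel,
        "on_call": service.on_call_policy,
    }


# Hypothetical values for the Payments service.
payments = Service(
    name="Payments",
    group_email="payments-team@example.com",
    slack_channel="#payments-alerts",
    on_call_policy="payments-on-call",
    resources=["ec2:payments-api", "rds:payments-db", "elb:payments-lb"],
)

# Changing the channel once on the service changes it for all of its alerts.
payments.slack_channel = "#payments-incidents"
targets = notification_targets(payments)
print(targets)
```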

You can learn more about creating and editing a service here: Setting up Services

The service listing lets you change the Slack notification channel, the group email ID, and the rotation and escalation policy all in one place, without editing individual alerts.

On-call Policy as defined in Temperstack

In the Temperstack context, an on-call policy is linked to a specific team/group. It contains the roster of first-level responders on call and the escalation policy, which defines when and to whom a notification should escalate.

Example

Consider a team consisting of Hari, Mohan, Haarvish, and ERA, who rotate weekly for on-call duties. Imagine an incident occurs on November 25 at 11:00 AM:

Level 1:
- Hari is the primary on-call engineer and receives the alert first.
- If Hari cannot acknowledge or resolve the issue within the designated time, it escalates to Level 2.
Level 2:
- Haarvish is on duty at Level 2 during this time (from 10:00 AM, November 25, to 10:00 AM, November 26).
- Haarvish takes over the incident if Hari does not respond.
Level 3:
- If neither Hari nor Haarvish resolves the incident, it escalates to Level 3, where ERA takes responsibility.
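
The escalation chain in this example can be sketched as a simple time-based lookup. The 15-minute escalation interval is an assumption, since the example does not state the designated time, and the `responder_at` function is hypothetical rather than part of Temperstack.

```python
from typing import Optional

# Responders at each level for the incident on November 25 at 11:00 AM.
LEVELS = ["Hari", "Haarvish", "ERA"]
ESCALATE_AFTER_MINUTES = 15  # assumed; the example does not state the designated time


def responder_at(minutes_since_alert: int, acknowledged_at: Optional[int] = None) -> str:
    """Return who holds the incident a given number of minutes after it fired."""
    # Once acknowledged, escalation stops at the level active at that moment.
    if acknowledged_at is not None:
        minutes_since_alert = min(minutes_since_alert, acknowledged_at)
    # Each unacknowledged interval moves one level up, capped at the last level.
    level = min(minutes_since_alert // ESCALATE_AFTER_MINUTES, len(LEVELS) - 1)
    return LEVELS[level]


print(responder_at(0))                       # Level 1
print(responder_at(20))                      # Level 2, Hari did not acknowledge
print(responder_at(40))                      # Level 3
print(responder_at(40, acknowledged_at=20))  # stays with whoever acknowledged
```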

Typically, one team has one on-call policy, which can be mapped to multiple services if those services are responded to and escalated within the same team.

However, each service can have only one on-call policy.

If two services share the same first responders but escalate to different people, you need to define two different on-call policies and map each one to its respective service.

Learn more about Temperstack On-call and Scheduling Policy here.
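
The mapping rules can be sketched with two dictionaries. The policy names and the Refunds and Checkout services are hypothetical, used only to show one policy shared by two services and a second policy that differs only in its final escalation.

```python
from typing import Dict, List

# Hypothetical policies: each is an ordered list of escalation levels.
POLICIES: Dict[str, List[str]] = {
    "payments-on-call": ["Hari", "Haarvish", "ERA"],
    # Same first responders, different final escalation, so a second policy is needed.
    "checkout-on-call": ["Hari", "Haarvish", "Mohan"],
}

# A dictionary key holds exactly one value, mirroring "one on-call policy per service".
SERVICE_POLICY: Dict[str, str] = {
    "Payments": "payments-on-call",
    "Refunds": "payments-on-call",  # one policy shared by two services
    "Checkout": "checkout-on-call",
}


def escalation_chain(service: str) -> List[str]:
    """Look up the escalation levels for a service through its single policy."""
    return POLICIES[SERVICE_POLICY[service]]


print(escalation_chain("Refunds"))
print(escalation_chain("Checkout"))
```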
