DeployPartners White Paper
Actionable Alarm Management
Alarm Management’s goal is to provide a nervous system for your network and IT infrastructure: when something goes wrong it is detected, and an alarm is raised for operators to action. The intention is to be notified immediately when (or even before) something goes wrong, to minimise restoration time. These systems were deployed to collect everything you would otherwise have someone sit and watch for. When networks were small, simple, less robust and less redundant, and headcount was plentiful, this was workable; today operations teams have fewer people and much bigger networks. No one wants to manage alarms any more, and quite rightly. An alarm is just a signal that can trigger a resolution before a customer notices or generates a complaint. Alarm Management is a system that, with enough effort, gives an organisation a nervous system that lets it rectify service disruptions as soon as they appear, and perhaps before.
Alarms are signals which you can choose to act upon or not. What we propose is a system that scans historical data for trends and takes operations input to surface the most actionable alarms, so that business-impacting abnormalities are not lost in the noise.
State(ful) of the Art
Our view of the alarm world.
In the Real World
Real-time is too fast.
Tomorrow is likely to be like yesterday.
Predictable alarms and why you should ignore them, just for a little while. Trust us.
Unknown unknowns: how to detect them and what to do next.
Lost Cause Analysis
Why Root Cause Analysis can become a lost cause, or just a weak excuse for bad Alarm Management
What you need to check for before you trigger a truck roll.
Alarms to Action
Alarms to Filters not Filters to Alarms.
Alarms are meant to initiate resolution action, not to be hoarded and ignored.
Alarm Management Database
Next steps to improve alarm management
It’s our observation that Alarm Management in IT and Telecommunications is in a state of decay. The number and variety of alarms has increased exponentially as new systems are integrated, and it is becoming impossible for operators to distinguish critical events from nuisance alarms. We believe that Alarm Management systems were initially functional and provided fantastic value, but as new alarm streams were added over time, no rationalisation occurred. This happened because the cost of adding new alarms is low, while the cost of reducing them is high.
We see evidence of companies willing to throw away their investment because they no longer believe their Alarm Management system is adding value, or because they want to move alarms into a Ticketing system, since that process is what will finally force them to rationalise alarms. While we push for automation wherever possible, we also think Alarm Management systems have a big role in helping operations trigger action (manual or automated), quickly detect the real root cause, and escalate.
In the Real World
Real-time alarming is a major selling point for most Alarm Management systems. In fact it is an expected, and not very exciting, capability these days, but we have come to question the need for it from a real-world perspective. Our analysis of critical alarms frequently found that the vast majority (around 80%) would resolve themselves within two or three minutes, which tells us that the same problems just keep happening. We can find these alarms by looking at the Mean Time to Resolution (MTTR).
We can use the MTTR metric to suppress an alarm from presentation or further action until the service has had a chance to rectify itself. If the MTTR is three minutes and the alarm has not cleared after that period, then make the alarm actionable.
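This suppression rule can be sketched in a few lines. As a minimal sketch, assuming a per-alarm-type MTTR table learned from history; the alarm types and values below are hypothetical illustrations, not figures from our analysis:

```python
from datetime import datetime, timedelta

# Hypothetical per-alarm-type MTTR values, learned from historical data.
MTTR_BY_TYPE = {
    "LINK_DOWN": timedelta(minutes=3),
    "BGP_PEER_LOSS": timedelta(minutes=5),
}

# Fallback hold time when an alarm type has no history yet.
DEFAULT_HOLD = timedelta(minutes=3)

def past_mttr(alarm_type: str, first_occurrence: datetime,
              now: datetime) -> bool:
    """Suppress the alarm until its historical MTTR has elapsed.

    Returns True (actionable) only once the alarm has been active
    longer than the service's usual window to rectify itself.
    """
    hold = MTTR_BY_TYPE.get(alarm_type, DEFAULT_HOLD)
    return now - first_occurrence > hold
```

An alarm presentation pipeline would call this on each active alarm and hide any for which it returns False.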
So why was real-time at one stage such a selling point? The idea was that by showing faults to operators as fast as possible you could decrease the MTTR and improve service levels. But operators are usually juggling multiple faults at once, and real-time notifications are lost in the noise. We propose that real-time alarm management is not required unless something extraordinary occurs (a Black Swan).
Many critical alarms are so common they happen practically every day. We see these alarms all the time: perhaps the Network Element is on a low-grade network, or it suffers from a known problem that operations cannot fix. The trouble is that the alarm just keeps popping up, clogging your event list and ticketing systems and, most of all, consuming precious attention. Sometimes these alarms represent outages of mere seconds that quickly rectify themselves. We call these White Swans: predictable problems that recur frequently but aren’t really actionable.
A White Swan is the predictable, day-to-day alarm, and we think it should be ignored, at least for a little while. We suggest operations look at the mean time to resolve (MTTR) for each alarm type and whether it is capable of self-resolution, and give it time before triggering a response.
With alarms you never know what will happen next. Networks live in the real world; like life and stock markets, it’s next to impossible to predict what will happen, yet tomorrow is likely to be largely like yesterday. Most alarm patterns fall into a normal distribution without a huge amount of variety or volatility.
However, the biggest outage is likely to be a surprise: something that has never happened before, or occurs very rarely. Unfortunately, with current systems it’s impossible to distinguish between the rudimentary and the extraordinary.
There are many techniques to determine the variety (colour) of the swan from your alarm patterns, but the simplest is often the best. Based on our experience we have found the following rule matrix useful:
| Swan | Indicator | Implication |
| --- | --- | --- |
| White | Low Mean Time to Resolve (MTTR) | Likely to resolve itself |
| Grey | High Mean Time to Resolve | Unlikely to resolve itself |
| White | Low Mean Time Between Failures (MTBF) | Likely to resolve itself |
| Grey | High Mean Time Between Failures | Unlikely to resolve itself; likely to have a bigger business impact |
| Black | No MTBF (unknown) | Unlikely to resolve itself; likely to have a business impact |
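The rule matrix above can be sketched as a simple classifier. The thresholds below are illustrative assumptions, not values from our data; tune them from your own alarm history:

```python
def classify_swan(mttr_minutes, mtbf_hours):
    """Classify an alarm pattern per the swan rule matrix.

    mtbf_hours of None means the scenario has never occurred before.
    Thresholds are illustrative; derive real ones from history.
    """
    if mtbf_hours is None:
        return "black"          # never seen before: unlikely to self-resolve
    LOW_MTTR = 5                # minutes; below this, self-resolution likely
    HIGH_MTBF = 24 * 7          # hours; above this, rare and higher impact
    if mttr_minutes <= LOW_MTTR and mtbf_hours < HIGH_MTBF:
        return "white"          # frequent and quick to clear
    return "grey"               # slow to clear or rarely seen
```

A White result feeds the suppression window described earlier; Grey and Black results should be presented to operators immediately.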
Lost Cause Analysis
Alarm management systems are often accused of overwhelming their users, resulting in missed outages or elevated MTTR. A frequently cited solution to this problem is Root Cause Analysis (RCA). Unfortunately, identification of the true root cause is nearly impossible at current monitoring levels, since an alarm is almost always a consequence or symptom of something else that is not monitored or measurable.
We propose a test: if the alarm can trigger an automated resolution action, it has truly found the root cause. End-to-end root cause analysis is a worthy goal, and possible with significant investment, but for those without the scale to invest we propose a simpler approach: give human operators the power to tune granular alarm life cycles. This approach also reduces MTTR and improves business focus by identifying actionable abnormalities using an Alarm Management Database (AMDB).
Right now, if there is a common goal in Alarm Management, it is auto-ticketing. It’s as if the RCA challenge has been overcome, or people have simply given up trying to work with alarms. If your network has alarms that trigger a response, then they should be ticketed, but be careful not to just move your “too many alarms” problem to another interface: tickets.
So before triggering a truck roll, we suggest the following conditions should be met:
1. The alarm indicates a service disruption (defined transparently by business attributes)
2. No other alarm indicating the same service disruption has been detected, using:
   a. Root Cause analysis (or alarm clustering by site or hostname)
   b. Duplicate detection
3. The alarm’s age (time since First Occurrence) is greater than the average MTTR for that specific alarm, or the average for that alarm type
   a. Unless the MTBF is unknown or very high (see the Swan rule matrix above)
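The gate above can be sketched as one function. The alarm fields (`service_impacting`, `service_id`, `mtbf_known`, and so on) are hypothetical attribute names standing in for whatever your Alarm Management system carries:

```python
def should_ticket(alarm, open_alarms, mttr_lookup):
    """Gate a truck roll on the three conditions above.

    `alarm` is a dict with hypothetical keys: service_impacting,
    service_id, alarm_type, age_minutes, mtbf_known.
    `mttr_lookup` maps alarm type to its average MTTR in minutes.
    """
    # 1. The alarm must indicate a service disruption.
    if not alarm["service_impacting"]:
        return False
    # 2. No other open alarm may already cover the same service
    #    (stand-in for root-cause clustering / duplicate detection).
    for other in open_alarms:
        if other is not alarm and other["service_id"] == alarm["service_id"]:
            return False
    # 3a. If MTBF is unknown, this may be a Black Swan: act immediately.
    if not alarm["mtbf_known"]:
        return True
    # 3. Otherwise the alarm must have outlived its average MTTR.
    return alarm["age_minutes"] > mttr_lookup.get(alarm["alarm_type"], 3)
```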
Alarms to Action
Most alarm management systems give the NOC or the operator the ability to filter alarms. The process is generally ad hoc and results in a fairly large SQL filter. We regard this as filters matching alarms:
Filter > Alarms > Operator
We think instead that alarms should be assigned to filters.
Alarm > Filter > Operator
While this doesn’t sound like much of a difference, there is one. In the old way it is very difficult to measure how many alarms a filter is catching, and without measurement it is difficult to improve, or to understand the daily fluctuations an operator experiences. This makes it hard to know whether the NOC could handle more or less.
Therefore we recommend alarm filter attribute tagging. Giving the NOC the ability to tag alarm types (e.g., Interface Down) with one or more filters, and assigning each filter to its tag, means you can:
- Measure volume
- Model future volume with new alarm sources
- Provide easier filtering options
- Find easy targets for automation
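As a sketch of the tagging idea, assuming a hypothetical tag table mapping alarm types to the filters they are assigned to, measuring filter volume becomes a simple count:

```python
from collections import Counter

# Hypothetical tag table: alarm type -> filters assigned to it.
FILTER_TAGS = {
    "Interface Down": ["transmission", "field-ops"],
    "High CPU": ["capacity"],
}

def filter_volumes(alarm_stream):
    """Count how many alarms each filter catches, so the NOC can
    measure load and model future volume. Alarm types with no tag
    are counted under "untagged" so they can be triaged and tagged.
    """
    volumes = Counter()
    for alarm_type in alarm_stream:
        for f in FILTER_TAGS.get(alarm_type, ["untagged"]):
            volumes[f] += 1
    return volumes
```

Running this over a day of alarms gives the per-filter volumes directly; replaying it with a prospective tag table models the effect of a new alarm source before it is integrated.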
Root Cause Analysis looks for the probable source of the fault. Most Actionable is the process of determining which faults are the most actionable, according to:
- Service Impact
- Historic performance
We define an actionable alarm as an outage that impacts a service and/or requires manual intervention to resolve. Fully automated responses are not actionable unless the automation completes without providing the resolution outcome. Not all alarms require operator intervention: our analyses have found that the majority of alarms resolve themselves in a short enough time that manual intervention would not even be possible. To give operations focus, we suggest the following criteria, which every alarm must pass before being presented or ticketed. The answer must be yes to each of the following questions to warrant the raising of a ticket:
- Does the alarm indicate an outage that could impact the business operation?
- Can the alarm/outage resolve itself?
- Is the alarm out of the ordinary (i.e., has it never happened or does it rarely happen)?
- Has the alarm gone past the point of resolving itself?
Most Actionable is compatible with RCA: ideally the output from RCA is the Most Actionable alarm, but Most Actionable can still be used without it. For example, if someone makes a change on a device we receive a Configuration Change alarm, and the EMS then detects this was an unauthorised change and generates an “Unauthorised Change” alarm. The root cause is the change event, but arguably the Change alarm is only the closest thing to the root cause of the unauthorised change. In this example the NOC would correctly identify the “Unauthorised Change” alarm as actionable and present it to operations.
Introducing an Alarm Management Database
An Alarm Management Database (AMDB) is a system that gives operations the ability to define alarm life cycles. An AMDB provides visibility into what is being managed and enables users to quickly adapt alarm behaviour to changing network or customer needs. We propose that an AMDB has the following key capabilities:
- Provide MTTR and MTBF metrics
- Use these values to suppress events according to the probability of self-resolution
- Quickly surface Alarm Scenarios that have never occurred previously
- Provide an interface to assign alarm lifecycle attributes
- Auto-ticketing
- Filter tags
- Provide an interface to assign business attributes
- Configurable granularity by Alarm Type, Node, Location and Unique Alarm Identifier
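One possible shape for a single AMDB record, with hypothetical field names covering the capabilities above; this is a sketch of the data model, not a prescribed schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AlarmLifecycle:
    """One AMDB row: lifecycle attributes for an alarm scenario.

    Granularity is configurable: any of node, location or
    unique_alarm_id may be None, meaning "match all".
    """
    alarm_type: str
    node: Optional[str] = None
    location: Optional[str] = None
    unique_alarm_id: Optional[str] = None
    # Metrics learned from alarm history.
    mttr_minutes: Optional[float] = None
    mtbf_hours: Optional[float] = None       # None = never seen before
    # Operator-assigned lifecycle and business attributes.
    auto_ticket: bool = False
    filter_tags: list = field(default_factory=list)
    business_impact: Optional[str] = None

    def self_resolution_likely(self) -> bool:
        """Suppress presentation while self-resolution is probable.
        The 5-minute threshold is an illustrative assumption."""
        return self.mttr_minutes is not None and self.mttr_minutes <= 5
```

A presentation or auto-ticketing engine would look up the most specific matching record for each incoming alarm and apply its suppression window, tags and ticketing flag.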
DeployPartners and FirstOccurrence were founded and are run by people who develop, deploy and operate network management systems; we understand this space.