Alarm Data Integrity, or Garbage In – ROI Out

Data integrity, or a lack of it, is a very real part of business, and we, as systems integrators, are often faced with a disconnect between what the marketing promises and the reality of a situation where the customer’s data is sub-optimal for the product. In layman’s terms – it’s garbage in and therefore low return on investment out.

We are often faced with this situation. Customer A buys a product to solve a series of problems within their infrastructure. Generally, the products are fit for purpose and capable of performing the required function…if the correct conditions are met. In this case, those conditions are that the customer’s data is clean, consistent and well structured. More often than not, this is missed.

Recently an enterprise customer of ours was faced with an overwhelming data situation. They were flooded with events from multiple systems, operations staff were unable to keep up, and the situation was rapidly deteriorating.

This is a common scenario that businesses with growing customer bases face – pressure to expand the infrastructure but a requirement to contain operational costs. A difficult goal without both automated reduction of the event load and better determination of probable cause to reduce Mean Time To Repair (MTTR).

As Systems Integrators we, at DeployPartners, get the fun job of sorting this out. Sometimes we get in up-front and the issues can be reduced, but most often we are asked to fix things up after many attempts, either internally or with other integrators.

Back to our client…

The situation was about 100,000 events per day from a large number of EMSs, with regular event storms (we don’t know why yet) – about right for a customer of this type and size. Every system has a common set of core event attributes like “Event Time”, “Source Node” and “Event Summary”, but the real value is in the extended attributes, which vary by source and provide the clues. Sometimes attribute fields are overtyped (field reuse) and sometimes they are mapped into additional fields (you end up with a lot of fields that are only populated for a specific class of alarms).
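As an illustration, here is a minimal sketch of how two raw events from different sources might arrive. The EMS names, field names and values are invented for the example, not taken from the customer’s systems:

```python
# Two hypothetical raw events: a core set of attributes every source provides,
# plus source-specific extended attributes that carry the real diagnostic clues.
event_from_network_ems = {
    "EventTime": "2019-03-12T04:17:55Z",
    "SourceNode": "bras01.syd",                # short hostname
    "EventSummary": "Interface ge-0/0/1 down",
    # extended attributes only this EMS populates
    "Interface": "ge-0/0/1",
    "Severity": 5,
}

event_from_server_ems = {
    "EventTime": "12/03/2019 04:17:58",        # different timestamp format
    "SourceNode": "bras01.syd.example.net",    # FQDN rather than the short name
    "EventSummary": "CPU threshold exceeded",
    # field reuse: "Location" has been overtyped with a serial number
    "Location": "SN-48291734",
}
```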

A typical event has many attributes, and every EMS has its own view of what these should be and how they work internally. Put them all together without strict, enforced data conformance and you end up with a system that is very difficult to manage on an ongoing basis.

Our solution – data integrity.

Cleaning up the data for a positive ROI

There are several fundamental and intrinsic rules that must be followed for data integrity in Alarm Management, some examples are:

  • Data Cleanliness – If an attribute is supposed to have a node name in it, it should have just that, not the FQDN, not the serial number, not a CSV list of nodes to choose from (see the sketch after this list);
  • More is Less – Discarding noisy event streams early, or not ingesting them at all, can remove significant clues as to the underlying causes of faults;
  • Noise – Event data should be consistent and free of noise such as control characters, hex OIDs and external references;
  • Logical Enrichment – The summary may have a reference to a problem which needs to be summarised to an additional attribute for grouping; and
  • Non-Logical Enrichment – GeoSpatial mapping, external circuit references, site, location, node type, node usage etc; Information referenced from external “source of truth” data sources.
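To make the first and last of these rules concrete, below is a small sketch of what node-name cleaning and source-of-truth enrichment might look like. The field names, the inventory table and the normalisation convention are assumptions for illustration, not the customer’s actual rules:

```python
import re

# Hypothetical "source of truth" inventory, keyed by short node name.
NODE_INVENTORY = {
    "bras01.syd": {"site": "Sydney DC1", "node_type": "BRAS", "region": "NSW"},
}

def clean_node_name(raw: str) -> str:
    """Reduce whatever arrived in the node field to a bare short node name."""
    # If a CSV list of candidates arrived, take the first entry.
    candidate = raw.split(",")[0].strip()
    # Strip an FQDN down to host.site (the two leftmost labels) - an assumed convention.
    labels = candidate.lower().split(".")
    return ".".join(labels[:2])

def enrich(event: dict) -> dict:
    """Normalise the node field and attach inventory attributes if known."""
    node = clean_node_name(event.get("SourceNode", ""))
    event["SourceNode"] = node
    event.update(NODE_INVENTORY.get(node, {}))
    # Strip control characters from the summary so downstream tools group cleanly.
    event["EventSummary"] = re.sub(r"[\x00-\x1f]", " ", event.get("EventSummary", "")).strip()
    return event
```

The particular rules matter less than the fact that they are explicit, applied consistently at ingest, and fed from a single source of truth.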

Now, with better data integrity, a Manager of Managers has a chance. Some time in the not too distant future Machine Learning/Artificial Intelligence may be able to interpret some or all of this, but we are not there yet – either in available toolsets or in having enough data scientists working in DevOps roles to model it.

Now that we have a clean data set, we have a gold mine of value. Our favourite pattern for getting quick value for operations is clustering: cluster analysis groups entities together on the basis of their similarities and differences. With clean data, we were able to provide actionable insights almost immediately.
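For example, one simple way to cluster events by their summary text is a bag-of-words vectorisation followed by density-based clustering. This is only an illustrative sketch using scikit-learn with placeholder data, not the specific tooling or parameters used on this engagement:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

# Cleaned event summaries - placeholder data for the example.
summaries = [
    "Interface ge-0/0/1 down on bras01.syd",
    "Interface ge-0/0/2 down on bras01.syd",
    "CPU threshold exceeded on appsrv03.mel",
    "Interface xe-1/0/0 down on bras02.syd",
]

# Convert summaries to TF-IDF vectors so similar wording lands close together.
vectors = TfidfVectorizer().fit_transform(summaries)

# Group events whose summaries are similar; label -1 marks unclustered noise.
labels = DBSCAN(eps=0.7, min_samples=2, metric="cosine").fit_predict(vectors)

for summary, label in zip(summaries, labels):
    print(label, summary)
```

With clean, consistent summaries the clusters line up with real operational problems; with dirty data the same technique mostly groups formatting quirks.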

Our methodology is logical: a quick proof of concept was done and the results were visible within days. More importantly, the customer now understands why their expectations were not met, and that by improving their data integrity there is a clear path to get there.

“The nice thing about standards is that you have so many to choose from.”

Andrew S. Tanenbaum

If you are struggling with data integrity and virtualisation in a hybrid or mixed vendor environment get in touch – we can help.
