DeployPartners

An Incremental Path to Closed-Loop Assurance

At DeployPartners, we started hearing about “Closed-Loop Assurance” a few years ago, but it is only now really starting to take off. It initially stemmed from combining service orchestration and assurance as virtualised components were introduced. Virtualisation demands much stronger collaboration between the Assurance and Fulfilment stacks because of the speed and volume at which virtual components are created, not to mention scaled and decommissioned.

Right now, Assurance systems can generally be considered open-loop systems, i.e. the output of the system does not change its operation. In a closed-loop system, by contrast, the outputs are fed back into an automation platform that brings the component back into acceptable performance. Traditionally, assurance systems have relied on human operators to close the loop, making only limited use of Fulfilment or Test & Diagnosis (T&D) systems to close it automatically.

Today we see our customers’ assurance systems discarding, ignoring, filtering, deduplicating and averaging data down into small actionable pieces (alarms or incidents) for a human operator to resolve. As we see it, operations teams are missing out on a large amount of data that would help them move towards closing the loop on abnormalities automatically.

This data includes alarms, log files, performance metrics, configuration changes, flow data and session records: everything. The practice of summarising data because of expensive infrastructure and poor scaling must stop. New approaches that leverage modern technologies and cheap infrastructure can capture all of this data and unify it, unleashing analytics and machine learning and making it possible to start automating appropriate responses.
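To make "capture and unify" concrete, here is a minimal sketch of an Assurance Lake record shape, assuming a single common schema that both raw device data and normalised legacy data are mapped onto. The `LakeRecord` class, the adapter functions and the legacy alarm field names (`node`, `raised`, `severity`, `text`) are all hypothetical illustrations, not a DeployPartners or vendor schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class LakeRecord:
    """One row in a hypothetical Assurance Lake; every source is mapped onto this shape."""
    kind: str             # e.g. "alarm", "log", "metric", "config_change"
    device: str
    timestamp: datetime   # always timezone-aware, in UTC
    origin: str           # "raw" (from the device) or "normalised" (from a legacy tool)
    attributes: dict[str, Any] = field(default_factory=dict)


def from_device_metric(device: str, epoch_s: float, name: str, value: float) -> LakeRecord:
    # Raw samples are stored at full resolution: no averaging or filtering on the way in.
    ts = datetime.fromtimestamp(epoch_s, tz=timezone.utc)
    return LakeRecord("metric", device, ts, "raw", {"name": name, "value": value})


def from_legacy_alarm(alarm: dict[str, Any]) -> LakeRecord:
    # Hypothetical legacy alarm fields; a real adapter would follow the legacy tool's export format.
    ts = datetime.fromisoformat(alarm["raised"]).astimezone(timezone.utc)
    return LakeRecord("alarm", alarm["node"], ts, "normalised",
                      {"severity": alarm["severity"], "text": alarm["text"]})


lake: list[LakeRecord] = [
    from_device_metric("edge-router-1", 1_700_000_000, "cpu_util", 0.93),
    from_legacy_alarm({"node": "edge-router-1", "raised": "2023-11-14T22:13:20+00:00",
                       "severity": "major", "text": "High CPU"}),
]

# Because every record shares one shape, raw and legacy data for a device can be correlated directly.
by_device = [r for r in lake if r.device == "edge-router-1"]
```

Here the raw CPU sample and the legacy "High CPU" alarm land on the same timestamp and device, so they can be joined without any tool-specific logic.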

Applying machine learning tools to the data, we can start identifying triggers for each abnormality. This means we need to revisit what these events and/or scenarios mean, see how we can enrich them with other data sets, and trigger an action. Ultimately, we want to identify actionable abnormalities: those that either require manual operator intervention or can be self-corrected through automation.
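The trigger idea can be sketched as below. As a deliberately simple stand-in for a machine learning model, it uses a z-score test against a metric's recent history, then enriches the finding with one other data set (whether the device had a recent configuration change) to decide between automation and manual action. The threshold, the `AUTOMATABLE` scenario table and the action names are hypothetical.

```python
from statistics import mean, stdev
from typing import Optional

# Hypothetical trigger scenarios: metric name -> automated action.
AUTOMATABLE = {"cpu_util": "restart_process", "link_errors": "bounce_interface"}


def is_anomalous(history: list[float], value: float, threshold: float = 3.0) -> bool:
    """Z-score test (a stand-in for an ML model): is `value` far outside its recent history?"""
    mu, sigma = mean(history), stdev(history)  # stdev needs at least two samples
    return sigma > 0 and abs(value - mu) / sigma > threshold


def classify(metric: str, history: list[float], value: float,
             recent_config_change: bool) -> Optional[str]:
    """Return None (no abnormality), "manual", or the name of an automated action."""
    if not is_anomalous(history, value):
        return None
    if recent_config_change:
        # Enrichment: a recent change suggests the abnormality needs human review.
        return "manual"
    # Known scenarios are automated; anything else goes to an operator.
    return AUTOMATABLE.get(metric, "manual")


history = [0.40, 0.42, 0.38, 0.41, 0.39, 0.40]
decision = classify("cpu_util", history, 0.93, recent_config_change=False)
```

With this history, a CPU reading of 0.93 is flagged and mapped to `restart_process`, while the same reading after a configuration change is routed for manual action. In practice the z-score would be replaced by whichever clustering or anomaly model the data supports.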

Operationally, the impact of Closed-Loop Assurance is huge. The system must be dynamic, adjusting to its environment and allowing greater tolerances in some areas and much tighter ones in others. The highly dynamic nature of Closed-Loop Assurance systems flows through to the operational teams and requires a different mindset to match the new capabilities and demands. We hear from Operations Managers that they never want to deal with the same incident twice. One approach is to require that automation be evaluated before an incident can be closed.

In practical terms, we propose the following steps:

  1.   Create an Assurance Lake – raw (from the device) and normalised (from legacy toolsets)

  2.   Leverage machine learning tools – i.e., find clustered data structures, anomalies and trends

  3.   Identify Actionable Trigger Scenarios – for manual action and for automation

  4.   Develop Automation (leverage Fulfilment or T&D systems if possible)

  5.   Test & Diagnosis inputs – i.e. connect to the device to double-check the condition and attach some quick summary data to the task

  6.   Self Correction – attempt to resolve common patterns automatically and Close the Loop
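Steps 4 to 6 can be pictured as a single loop around each triggered task. The minimal sketch below assumes the T&D check, the automation and the verification are supplied as plain functions; `Task`, `close_the_loop` and the simulated device are hypothetical, not an existing Fulfilment or T&D API. Anything the automation cannot fix, or has no automation for yet, is escalated to a human operator, who still closes the loop in that case.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Task:
    device: str
    scenario: str
    notes: list[str] = field(default_factory=list)
    status: str = "open"


def close_the_loop(task: Task,
                   diagnose: Callable[[str], dict[str, Any]],
                   remediations: dict[str, Callable[[str], None]],
                   verify: Callable[[str], bool]) -> Task:
    # Step 5: T&D input, double-check on the device and attach a quick summary to the task.
    task.notes.append(f"T&D summary: {diagnose(task.device)}")
    fix = remediations.get(task.scenario)
    if fix is None:
        # No automation developed for this scenario yet: a human operator closes the loop.
        task.status = "escalated"
        task.notes.append("No automation for scenario; routed to operator")
        return task
    # Step 6: attempt self-correction, then re-check before closing.
    fix(task.device)
    if verify(task.device):
        task.status = "auto-resolved"
    else:
        task.status = "escalated"
        task.notes.append("Automation did not restore service; routed to operator")
    return task


# Usage with a simulated device (hypothetical values).
state = {"edge-router-1": {"cpu_util": 0.93}}

def diagnose(device: str) -> dict[str, Any]:
    return dict(state[device])          # snapshot taken before any fix

def restart_process(device: str) -> None:
    state[device]["cpu_util"] = 0.35    # pretend the restart relieves the load

def verify(device: str) -> bool:
    return state[device]["cpu_util"] < 0.8

resolved = close_the_loop(Task("edge-router-1", "high_cpu"),
                          diagnose, {"high_cpu": restart_process}, verify)
unknown = close_the_loop(Task("edge-router-1", "fan_failure"),
                         diagnose, {"high_cpu": restart_process}, verify)
```

Passing the steps in as functions keeps the loop independent of any particular Fulfilment or T&D system, so legacy tooling can be plugged in behind the same interface.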

We see Closed-Loop Assurance as part of quality management and continuous improvement, which is the goal of every Operations team. In our analysis, we have found that legacy systems can be uplifted and extended with modern systems to provide these capabilities. What Closed-Loop Assurance offers is a framework for system and process design that can dramatically improve operational efficiency.