What You Need to Know About MLOps
Posted by Tirthankar RayChaudhuri on Jan 15,2024
Most heavy-duty industrial operations today have monitoring systems in place which gather data records (logs) of the daily operational cycle. Such data is stored, archived for a while and then destroyed. In case of an operational incident the team engaged to resolve the issue will need to refer to these monitoring logs in their efforts to identify the root cause of the incident.
More recently the popular field of MLOps has emerged where AI/ML agents are tasked with analyzing without human intervention the operational monitoring logs of malfunctioning systems to identify the root cause of an incident. As per the IBM definition “MLOps enables IT operations teams to respond more quickly, even proactively to slowdowns and outages, with end-to-end visibility and context”.
Within this discussion on MLOps we would like to present to the reader the AI/ML mechanism to proactively identify and even predict a system slowdown or an outage known as “Anomaly Detection”. This is described as follows
AI/ML systems are first trained on time-series data from numerous monitored parameters to recognize the ‘normal’ operational levels of these parameters. Having learned these normal boundary levels of operation such a system is instructed to identify significant variations from such boundaries when provided with a continuous feed of incoming monitoring data. Automated identification of such variations is known as anomaly detection in MLOps. When an MLOps monitoring system detects such an anomaly the incident management team are alerted and investigations commence to discover the root cause of the anomaly so as to proactively prevent an outage or a slowdown. In addition MLOps systems can also be trained on past data from anomalies and outages whose root causes have previously been detected and resolved. Such trained MLOps agents can thereafter be employed to conduct automated root cause analysis of new incidents and thereby accelerate the time of incident resolution.
Anomaly Detection within MLOps is thus a highly beneficial technique for an Operations Manager to both proactively enhance the uptime and reliability of business systems as well as to have available a means of intelligent and efficient incident data analysis for quickly identifying root causes of system failures. MLOps also includes mechanisms for assessing and scoring risk levels of upcoming planned operational changes to systems based on training a machine learning agent on past data from earlier change records. Such a machine learning agent is thereafter employed to evaluate the risk score of an upcoming planned change to operational parameters.
MLOps systems can also be configured to generate daily reports of operational performance of multiple systems being monitored including health check details, anomalies, incidents, risks of changes, etc. Such reports can be made available by an Operations Manager to executive leaders as needed for detailed tracking and review of the overall performance of various business systems in continuous operation.