When the sensor calls
It is like a dream come true.
Running factories with hundreds of machines for forging, cutting, melting and extruding, all monitored continuously, 24/7, by many more sensors. Gigabytes of data streamed directly to the cloud. Every alteration in production is detected: vibration, temperature and pressure are measured and processed by AI. Eventually, a signal triggers the red alarm: failure in 42 days, maintenance required. No downtime, no unforeseen costs, a boost in production. This is the world of Predictive Maintenance, where no production asset, from a chemical plant to your own washing machine, will ever break unexpectedly again.
Or, at least, this is the dream.
The aim of Predictive Maintenance (PrM) is to identify anomalies in good time and to indicate a safe time interval in which to perform the necessary repair. Figure 1 shows a typical component lifecycle, where an indicator of the component’s health is plotted on the y-axis against time. This scheme helps to compare Predictive Maintenance with Preventive Maintenance (PvM) and Repair to Failure (R2F). Figure 1 shows three numbered intervals, each representing one component cycle from initial deployment (healthy status) until failure or repair.
Figure 1. An inspection report. Dashed lines identify maintenance events; intervals are numbered from 1 to 3. The component degradation is random. From this data, one would set up a Preventive Maintenance (PvM) scheme based on the shortest life-cycle, interval 3. This would certainly leave resources unexploited. A Repair to Failure (R2F) scheme would, conversely, exploit all the available resources, at the price of unplanned downtime. A Predictive Maintenance (PrM) scheme would suggest an optimal repair time window that minimizes the unexploited time and resources while avoiding unexpected downtime.
In the R2F strategy, the whole component life is used, i.e. a repair is performed only at failure, causing downtime and unpredictability. A preventive strategy would be implemented with calendar-based inspections at a frequency set by the shortest component life-cycle experienced so far (in Figure 1, the time interval identified with the number 3), resulting in sub-optimal exploitation of the machinery lifetime.
A predictive strategy outperforms the previous ones in terms of higher resource utilization and lower repair frequency, for example in intervals 1 and 2. An interesting case arises in interval 3, where the component degrades quickly and the time window between the identification of the fault and the predicted failure is very short (a few hours). In this case, extra costs for unplanned maintenance are likely to occur, whatever the strategy. Nevertheless, PrM would result in shorter or no downtime, and would also leave the necessary time to take actions that prevent secondary damage (such as changing the production speed).
The superiority of PrM over other strategies makes it very attractive. The truth is that the path to adopting a Predictive Maintenance strategy is complex, and many factors affect the success of its design and deployment, such as:
- Uncertainties in production
- Data quality/preprocessing
- Availability of failure data and of past maintenance scenarios
- Complex feature engineering, domain-specific knowledge
Let’s start by analysing the biggest obstacle: uncertainties are unavoidable. Lack of knowledge (the so-called epistemic uncertainty) and aleatory variability (such as ambient noise, human error and black-swan events) make every model questionable.
For example, machinery usually works under noisy conditions that influence the data obtained from sensors. Of course, noise can be filtered out by applying signal processing methods. An important distinction, however, is that additive noise, which is simply superimposed on the sensor data, is comparatively easy to detect and filter out. Conversely, some noise is tightly coupled to the data we want to process, e.g. multiplicative noise, and is more difficult to identify and remove.
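To make the distinction concrete, here is a minimal sketch in Python. It assumes a strictly positive sensor signal and uses a plain moving average as the filter; the function names, the noise levels and the constant signal are all hypothetical choices made for illustration. Additive noise is reduced by smoothing directly, whereas multiplicative noise only becomes additive (and therefore easy to smooth) after a logarithmic transform.

```python
import math
import random


def moving_average(values, window):
    """Trailing moving average; the first entries use a shorter window."""
    out = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def denoise_multiplicative(values, window):
    """Smooth y = s * n in the log domain, where log y = log s + log n.

    Assumption: all values are strictly positive, otherwise log() fails.
    """
    logs = [math.log(v) for v in values]
    return [math.exp(v) for v in moving_average(logs, window)]


rng = random.Random(0)
signal = [10.0] * 200  # hypothetical constant sensor level

# Additive noise: superimposed on the signal.
additive = [s + rng.gauss(0.0, 1.0) for s in signal]
# Multiplicative noise: scales the signal (log-normal factor, assumed here).
multiplicative = [s * math.exp(rng.gauss(0.0, 0.3)) for s in signal]

smoothed_add = moving_average(additive, 20)
smoothed_mul = denoise_multiplicative(multiplicative, 20)
```

In practice the noise structure is rarely known in advance, which is exactly why multiplicative noise is harder to identify than to remove once identified.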
Other sources of uncertainty might come from sensor deterioration, human error, variability in the mechanical properties, and so on.
The model we use to identify the component’s health state, and consequently its degradation, is therefore almost never certain. Quoting George Box, the famous statistician:
Essentially all models are wrong, but some are useful.
In this context, as uncertainty is unavoidable, there is no other rational option than having models which embrace it. These are useful, and let me add, reliable and robust models.
The payoff is not philosophical. It is a business driver: predictive maintenance strategies that account for the uncertainty of their predictions are robust. This, in turn, makes it possible to deliver robust, business-specific indicators to decision makers.
But how do we design such robust strategies? What are the key factors for delivering reliable results? How do we calculate the costs and the risks? And the benefits?
The approach we sketch in this article is based on three keystones:
- The analysis of the data
- The cost-benefit oriented design
- The iterative deployment
In this article we skip the many technicalities in order to focus on the conceptual framework for a robust maintenance design.
The origin of everything: The data
Predictive maintenance strategies are based on algorithmic intelligence which monitors the health of an asset, be it a small component or a whole system of many components.
At the origin of the maintenance design there is a classical data flow process: data collection, pre-processing, feature extraction, information extrapolation and prognostics. But which data? Just sensor output? No, not necessarily.
The algorithms learn to distinguish a healthy state from an altered one caused by some potential anomaly. In general, all available information, structured or unstructured, can be used to train the algorithm.
Four data scenarios are encountered in practice.
- Streams of data are available.
Classification algorithms can be used to find patterns that indicate component or system degradation. As the number of input sources grows, dimensionality reduction techniques are required to find the most significant features. Time series forecasting plays a fundamental role in this case, as do signal preprocessing, resampling and feature extraction.
Classifiers based on classical Machine Learning techniques are easier to interpret than neural approaches such as ANNs, GANs, etc., owing to the black-box nature of the latter. It must be noted, however, that research on the interpretability of such AI results is progressing.
- Experimental data is available.
It is very desirable, yet not strictly necessary, to have experimental test data for healthy and faulty conditions of single components. In this case, the data quality is very high at component level. This means that data delivers information ready to be processed for identification of the health status. The sensors’ signals might require preprocessing for the extraction of features that are then used to train a Machine Learning algorithm.
- Simulation data is available.
System modelling software, finite element analysis tools and computational fluid dynamics software are, among others, useful tools for simulating the different failure modes. This is a physics-based approach, which requires domain knowledge to identify and simulate possible faults.
- Past inspections are available.
Logs, past inspections, maintenance reports and expert assessments can be used to infer features on the equipment’s health status.
From the above considerations, we sum up an easy takeaway: all the available data can and shall be used.
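As an illustration of the first scenario (streams of data), the sketch below turns a raw vibration stream into per-window features that a classifier could be trained on. The choice of features (RMS, peak-to-peak amplitude, kurtosis) and the non-overlapping windows are assumptions made for this example, not a prescription; the function names are hypothetical.

```python
import math


def window_features(samples):
    """Compute a few common condition-monitoring features for one window."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    rms = math.sqrt(sum(x * x for x in samples) / n)
    # Kurtosis as the fourth standardized moment (about 3.0 for Gaussian data);
    # defined as 0.0 here when the window is constant.
    if var > 0:
        kurtosis = sum((x - mean) ** 4 for x in samples) / n / var ** 2
    else:
        kurtosis = 0.0
    return {
        "rms": rms,
        "peak_to_peak": max(samples) - min(samples),
        "kurtosis": kurtosis,
    }


def extract_features(stream, window):
    """Split the stream into non-overlapping windows and featurize each one.

    A trailing partial window is dropped.
    """
    return [
        window_features(stream[i:i + window])
        for i in range(0, len(stream) - window + 1, window)
    ]


# Usage: ten samples with a window of four yield two feature rows.
features = extract_features([1, -1, 1, -1, 2, -2, 2, -2, 0, 0], window=4)
```

The same feature table can be enriched with columns derived from the other scenarios, such as simulated fault signatures or labels taken from inspection logs.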
The decision process: Predictive Maintenance Cost-Benefit Oriented Design
How does a decision maker evaluate the advantage of investing in a Predictive Maintenance strategy?
To answer this question, we first need to get slightly more specific about how we can estimate performance. One way is to focus on a measure of predictive ability that can be easily connected to costs. We consider the Receiver Operating Characteristic (ROC) curve (a term coming from radar theory in the fifties) depicted in Figure 2. Each point of the curve has the probability of detection (also known as recall) on the y-axis and the probability of false alarm (also known as fall-out) on the x-axis. Each point of the ROC curve is the result of tuning many parameters, from the preprocessing of the data to the hyper-parameters of the ML algorithm.
Figure 2. The Receiver Operating Characteristic (ROC) curve expresses the ability of a binary classifier as its discrimination threshold is varied. Each point is calculated from the confusion matrix (bottom right) for one value of the threshold, where TP (True Positive) and TN (True Negative) are correct classifications and FN (False Negative) and FP (False Positive) are misclassifications. Every outcome above the diagonal line is better than tossing a fair coin. The point (0,1) is the point of perfect classification.
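The construction of the ROC curve can be sketched in a few lines of Python. The example assumes that the classifier outputs an anomaly score per observation and that an alarm is raised when the score is at or above the threshold; the scores and labels are made up for illustration.

```python
def roc_points(scores, labels):
    """Return (threshold, false_alarm_rate, detection_rate) per threshold.

    labels: 1 = real fault, 0 = healthy. An alarm is raised when
    score >= threshold (a convention assumed for this sketch).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


# Hypothetical anomaly scores and ground-truth labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

for threshold, fall_out, recall in roc_points(scores, labels):
    print(f"threshold={threshold:.1f}  fall-out={fall_out:.2f}  recall={recall:.2f}")
```

Each printed row is one point of the curve: the x-coordinate is the fall-out and the y-coordinate is the recall.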
Now, to obtain the optimal parameters of the predictive strategy we follow two steps: first we associate a cost to every possible classification (i.e. to every point of the ROC curve) and then we minimise the total cost, that is the maintenance cost.
We start from a clear cost analysis (for a component, a machine or an ensemble of components/machines). More specifically, the analysis is conducted by calculating the misclassification costs, and two main cases are relevant:
- The first case is when an alarm is triggered but no replacement is needed. These events are called false alarms or false positives. Costs falling into this category are for example:
- Inspection downtime cost (stop the production to inspect without repair)
- Maintenance cost (cost of the technicians)
- The second case is when a failure happens without being detected. This might cause:
- Production downtime cost (usually higher than the inspection downtime costs)
- Maintenance cost (cost of the technicians)
- Replacement / repair cost
It is worth noting that false alarms and missed detections are in conflict with respect to the detection threshold: moving the threshold to reduce one increases the other. Their total cost (the cost functional) therefore has a minimum.
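A minimal sketch of this trade-off, assuming only two cost figures (one per false alarm, one per missed failure) and an alarm rule of the form score >= threshold. The scores, labels and cost values are hypothetical and serve only to show how the optimal threshold moves with the cost ratio.

```python
def total_cost(scores, labels, threshold, cost_false_alarm, cost_missed_failure):
    """Cost functional: false alarms plus missed failures, weighted by cost."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp * cost_false_alarm + fn * cost_missed_failure


def best_threshold(scores, labels, cost_false_alarm, cost_missed_failure):
    """Pick the candidate threshold with the lowest total cost.

    float("inf") is included as the "never raise an alarm" option.
    """
    candidates = sorted(set(scores)) + [float("inf")]
    return min(
        candidates,
        key=lambda t: total_cost(
            scores, labels, t, cost_false_alarm, cost_missed_failure
        ),
    )


# Hypothetical anomaly scores and ground truth (1 = real fault).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

# Missed failures ten times more expensive than false alarms.
cautious = best_threshold(scores, labels, cost_false_alarm=1, cost_missed_failure=10)
# False alarms ten times more expensive than missed failures.
relaxed = best_threshold(scores, labels, cost_false_alarm=10, cost_missed_failure=1)
```

When missed failures dominate the cost, the optimum shifts towards a lower threshold (more alarms); when false alarms dominate, it shifts towards a higher one.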
Lastly, we need to account for the number of failure occurrences in a specific time window (let’s say, per year). To this aim, a probabilistic model that represents the variation of the component health in time (degradation model) can be built. For example, one degradation model would give an indicator such as the so-called Remaining Useful Lifetime, RUL, see Figure 3. At the end of this analysis we get a probabilistic representation of the yearly fluctuation of the component health and, consequently, we can infer the number of misclassifications and their total cost per year.
Figure 3. The Remaining Useful Lifetime (RUL) is the estimated time left before failure, and therefore the window available to perform a repair. Dashed lines in the figure indicate confidence intervals.
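As a rough illustration of how an RUL estimate with an uncertainty band might be obtained, the sketch below fits a straight line to a health indicator and extrapolates it to a failure level, then uses a residual bootstrap to get an interval. The linear degradation trend, the bootstrap and all names are assumptions made for this example; real degradation models are usually more elaborate.

```python
import random


def fit_line(times, values):
    """Ordinary least-squares fit, returning (slope, intercept)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    slope = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values)) / sum(
        (t - mean_t) ** 2 for t in times
    )
    return slope, mean_v - slope * mean_t


def rul_point(times, values, failure_level):
    """Time from the last observation until the fitted line hits failure_level."""
    slope, intercept = fit_line(times, values)
    if slope >= 0:
        return float("inf")  # no downward trend: no failure predicted
    time_of_failure = (failure_level - intercept) / slope
    return max(0.0, time_of_failure - times[-1])


def rul_interval(times, values, failure_level, n_boot=500, seed=0):
    """Approximate 90% interval for the RUL via a residual bootstrap."""
    rng = random.Random(seed)
    slope, intercept = fit_line(times, values)
    residuals = [v - (slope * t + intercept) for t, v in zip(times, values)]
    estimates = sorted(
        rul_point(
            times,
            [slope * t + intercept + rng.choice(residuals) for t in times],
            failure_level,
        )
        for _ in range(n_boot)
    )
    return estimates[n_boot * 5 // 100], estimates[n_boot * 95 // 100 - 1]


# Hypothetical health indicator losing 2 units per day, failure at level 20.
days = list(range(11))
health = [100 - 2 * d for d in days]
estimate = rul_point(days, health, failure_level=20)
low, high = rul_interval(days, health, failure_level=20)
```

With noise-free data the interval collapses onto the point estimate; with real sensor data the band widens, which is the information a decision maker needs to plan the repair window.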
The benefits of Predictive Maintenance can also be calculated. To this aim, one commonly used strategy is a cost comparison of PrM against R2F and PvM. The terms of the comparison are still the previously defined costs, calculated for each strategy, plus the implementation costs (e.g. sensors, thermographic cameras and the monitoring infrastructure). Monte Carlo simulation comes in very handy for this comparison. Once the cost-benefit analysis is done, other statistics that support the decision process can be evaluated, such as the probability of extreme events and the risks of failure scenarios.
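A Monte Carlo comparison of this kind can be sketched as follows. Everything numeric here is invented for illustration: Weibull-distributed component lifetimes, the planned and unplanned repair costs, the PvM interval, the PrM detection probability and the daily monitoring cost. The outcome depends entirely on these inputs, so the sketch shows the mechanics rather than a general result.

```python
import random


def simulate_cost(strategy, horizon_days, rng, planned=1000.0, unplanned=10000.0,
                  pvm_interval=60.0, detect_prob=0.9, monitoring_per_day=2.0):
    """Simulate the maintenance cost of one strategy over one time horizon."""
    elapsed, cost = 0.0, 0.0
    while elapsed < horizon_days:
        # Hypothetical lifetime model: Weibull, scale 120 days, shape 2.
        life = rng.weibullvariate(120.0, 2.0)
        if strategy == "R2F":
            # Run to failure: every cycle ends with an unplanned repair.
            elapsed += life
            cost += unplanned
        elif strategy == "PvM":
            # Replace at a fixed interval unless the part fails earlier.
            if life > pvm_interval:
                elapsed += pvm_interval
                cost += planned
            else:
                elapsed += life
                cost += unplanned
        elif strategy == "PrM":
            # A detected degradation allows a planned repair shortly before
            # failure; an undetected one ends in an unplanned repair.
            if rng.random() < detect_prob:
                elapsed += 0.95 * life
                cost += planned
            else:
                elapsed += life
                cost += unplanned
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    if strategy == "PrM":
        cost += monitoring_per_day * horizon_days  # implementation cost
    return cost


def mean_cost(strategy, n_runs=2000, horizon_days=365.0, seed=0):
    """Average the simulated yearly cost over many random runs."""
    rng = random.Random(seed)
    return sum(simulate_cost(strategy, horizon_days, rng) for _ in range(n_runs)) / n_runs


for name in ("R2F", "PvM", "PrM"):
    print(name, round(mean_cost(name)))
```

Besides the mean, the individual runs can be kept to estimate the probability of extreme yearly costs, which is the kind of risk statistic mentioned above.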
Learning from the past (and the errors)
The previous section sketches an approach for evaluating the costs and benefits of a PrM strategy, mainly based on the misclassification events.
The whole procedure can now be implemented: we start by processing the data, we feed an ML/AI algorithm whose parameters are chosen by optimising a cost functional, and we end up with a decision, i.e. the RUL. Once the signal “maintenance in 42 days” is triggered, the inspection takes place. As a last step, we can feed the result of the inspection back into the algorithm, using the statistical technique called Bayesian updating.
The powerful idea behind it is to adjust our belief in the model’s ability to predict through experience. The details are beyond the scope of this introductory blog, but the concept is as follows:
- We start from a model that is based on the available data; this is our prior belief in the model’s ability to predict failures.
- Then we apply the model and we get new data. We quantify the likelihood that the newly obtained data match the model well.
- Lastly, we evaluate the posterior belief on the model ability.
The three steps happen iteratively, that is, the more data on the model’s performance becomes available, the better its predictions will be.
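One of the simplest ways to illustrate these three steps is a Beta-Binomial model, used here as a stand-in since the article does not commit to a specific formulation. The quantity being learned is assumed to be the probability that an alarm is confirmed by the subsequent inspection; the prior parameters and the inspection counts are hypothetical.

```python
def update_beta(alpha, beta, confirmed, false_alarms):
    """Posterior of a Beta(alpha, beta) prior after binomial evidence.

    confirmed: alarms that the inspection confirmed as real faults.
    false_alarms: alarms for which the inspection found nothing.
    """
    return alpha + confirmed, beta + false_alarms


def posterior_mean(alpha, beta):
    """Expected probability that an alarm is confirmed by inspection."""
    return alpha / (alpha + beta)


# Step 1: prior belief, here a weakly informative Beta(2, 2) centred on 0.5.
alpha, beta = 2.0, 2.0

# Steps 2 and 3, repeated: each batch of inspections updates the belief.
inspection_batches = [(3, 1), (4, 0), (5, 1)]  # (confirmed, false alarms)
history = []
for confirmed, false_alarms in inspection_batches:
    alpha, beta = update_beta(alpha, beta, confirmed, false_alarms)
    history.append(posterior_mean(alpha, beta))
```

Each posterior becomes the prior for the next batch, so the estimate sharpens as inspection results accumulate.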
Calendar-based maintenance is widely implemented in production sectors and causes unnecessary inspections and replacements as well as downtime costs. As an alternative, Predictive Maintenance is based on the concept of monitoring the production assets through sensors and signalling the right time to perform the required repair.
This blog sketches the key concepts to plan and implement a robust Predictive Maintenance strategy that really represents a business driver. Summing up:
- PrM optimizes the production process, increasing reliability and resiliency of the production.
- PrM can be smoothly implemented alongside other traditional maintenance schemes, by letting the PrM algorithm learn from the inspection results when they happen. This avoids the need for a priori information about the many possible failure scenarios.
- Generally, PrM can be implemented with clear investments, following a cost/benefit analysis.
- If correctly planned, it is possible to minimize the risk of detection failures (for example from sensor breakage, outlier events or false alarms), increasing the overall reliability of the PrM.
- The optimization of PrM does not necessarily imply the need to change the production process to generate benefits.
- Bayesian updating, that is, continuous learning, makes the PrM strategy iteratively better.
We will continue to report on implementation examples, hands-on and use cases in this blog, so come back soon for more details on Predictive Maintenance!