Evaluating machine learning models: Establishing quality gates


The quality or usefulness of machine learning models can be evaluated using test data and metrics. But to what extent? Manually or automated, once or regularly? The first models resulting from a proof of concept can certainly still be evaluated and compared manually in a manageable way. Once the number of models grows into the dozens or even hundreds, however, depending on the use case and on whether they have to be retrained constantly, the manual procedure no longer scales.
The preparation of the data, the training of the models and their evaluation can be automated in the form of machine learning pipelines. However, a qualified evaluation of the data or of the predictive quality of a model has its pitfalls, especially in automated form.

Example of a simple machine learning pipeline

In classic software development, unit tests, integration tests, end-to-end tests etc. have become established within CI/CD pipelines for quality assurance. Creating machine learning models likewise calls for measures that ensure the quality of a model. These should accompany the process from the preparation of the raw data to the delivery of a model, and they can ideally be integrated into the pipeline as well.

Quality gate

The term ‘quality gate’ originates from project management and describes the introduction of ‘checkpoints’ into the project process. The aim is to divide the project duration into shorter, manageable sections in order to track progress. Verifiable success criteria must be defined in advance for each gate.
For example, a gate contains a list of goals or quality requirements for the artifacts resulting at the end of a project phase. These must be met before the next stage in the process flow is started. However, checking the status can also lead to the project being aborted, because it becomes transparent that essential characteristics have not been achieved or very likely will not be achieved at all. Stopping a project at an early stage can therefore also save time, resources and money.

Quality gates in the project process

In the machine learning context, quality gates can be integrated as dedicated checks between the individual steps of an automated pipeline. Thus, they can track and monitor the quality of the artifacts in the process. Typical artifacts are the raw and processed data as well as the trained models.
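Such a chain of checks can be sketched in a few lines. The following snippet is a minimal, hypothetical illustration (all step and gate names are made up): each pipeline step produces an artifact, a dedicated gate checks it, and the first failing gate aborts the run.

```python
# Hypothetical sketch of a pipeline with quality gates between steps.
# Step names, gate names and thresholds are made up for illustration.

def run_pipeline(steps):
    """Run (name, step_fn, gate_fn) triples; abort at the first failing gate."""
    artifact = None
    for name, step_fn, gate_fn in steps:
        artifact = step_fn(artifact)           # produce the next artifact
        if not gate_fn(artifact):              # dedicated check between steps
            return f"aborted at gate '{name}'"
    return "all gates passed"

# Toy steps: 'load' produces raw rows, 'prepare' filters out incomplete ones.
load = lambda _: [{"x": 1.0, "y": 0}, {"x": 2.0, "y": 1}, {"x": None, "y": 0}]
prepare = lambda rows: [r for r in rows if r["x"] is not None]

steps = [
    ("raw data gate", load, lambda rows: len(rows) >= 3),
    ("prepared data gate", prepare, lambda rows: len(rows) >= 2),
]
print(run_pipeline(steps))  # all gates passed
```

An early abort simply means the remaining, usually more expensive steps (such as training) are never started.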

Data

The quality of the data is the basis for successfully training machine learning models at all. Therefore, the first quality gates are dedicated to the data and ensure that the models are trained on a meaningful data basis. First, the raw data can be checked: they must match prior assumptions about their statistics and characteristics; otherwise, corrections to the model setup and training may become necessary.
In general, the preparation of the raw data is highly individual and may be costly. This means that further checkpoints, which are not shown here, may have to be inserted between the preparation steps.
In addition, the distribution of the data in the resulting data sets can be tested. A fair and representative distribution of the data as well as a sufficient number of examples in each set should be guaranteed. Otherwise, evaluations on possible validation and test sets are only of limited value [1].
An insufficient data basis rarely leads to the high-quality models we aim for. If these gates are not passed, an early termination at the beginning of the pipeline can save time, resources and money, especially if these gates can be introduced with manageable effort.
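As a minimal sketch of such a data gate, one could check the number of examples and the label distribution before training starts. The size and imbalance thresholds below are assumptions and would be project-specific.

```python
# Illustrative data gate: require a minimum number of examples and forbid a
# single class from dominating the label distribution. Thresholds are made up.
from collections import Counter

def data_gate(labels, min_size=100, max_imbalance=0.8):
    """Pass only if the set is large enough and no class dominates too much."""
    if len(labels) < min_size:
        return False
    counts = Counter(labels)
    dominant_share = max(counts.values()) / len(labels)
    return dominant_share <= max_imbalance

balanced = [0] * 60 + [1] * 60
skewed = [0] * 95 + [1] * 5
print(data_gate(balanced))  # True
print(data_gate(skewed))    # False: class 0 holds 95 % of the labels
```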

Examples of data quality gates within a machine learning pipeline.

Metrics

In order to evaluate the quality of a model’s predictions in a pipeline, a set of automatically evaluable criteria is mandatory. Ideally, numerical metrics are used for this purpose. These can be evaluated on the model’s predictions on the held-out test set.
Of course, the possible metrics and threshold values depend on the particular application. For example, in the case of a classification it is necessary to find a good tradeoff between coverage (recall) and precision, and it has to be defined from which prediction certainty a class is considered recognized [2]. But once adequate metrics have been selected and the results or threshold values to be achieved have been defined, these criteria can be checked automatically in an initial quality gate for the model.
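Such a metric gate can be sketched as follows. The certainty, precision and recall thresholds are purely illustrative and would have to be defined per application.

```python
# Sketch of a metric-based model gate: predictions below a certainty threshold
# count as "no class recognized"; precision and recall must then clear
# predefined minimums. All thresholds here are illustrative assumptions.

def metric_gate(y_true, probs, certainty=0.7, min_precision=0.8, min_recall=0.5):
    preds = [1 if p >= certainty else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision >= min_precision and recall >= min_recall

y_true = [1, 1, 1, 0, 0, 0]
probs  = [0.9, 0.8, 0.4, 0.3, 0.2, 0.75]
# preds become [1, 1, 0, 0, 0, 1]: tp=2, fp=1, fn=1, precision 0.67, recall 0.67
print(metric_gate(y_true, probs))  # False: precision stays below 0.8
```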
However, the question of course arises: is all the effort of introducing complicated machine learning models worth it? Or are there simpler alternatives that meet the requirements?

Baselines

Rather rudimentary solutions can usually be implemented cheaply and quickly using simpler models. These can be built from heuristics, simple statistics or even the simple generation of random values, and they have to be beaten in a test. Comparing against such baseline models gives an idea of what level of performance is easy to achieve. To do this, predictions of the baseline and of a candidate model can be generated on the same test set and compared using the selected metrics. This makes visible how large the gap to a simpler solution really is.
There should be a certain performance difference in favor of the more complex model to justify its use. If it is clear at which margin the ML approach provides real added value, this margin can be used as a criterion, and another gate can be integrated into the process. If the baseline is not beaten, further optimization of the data or the training is necessary, or the chosen approach even has to be reconsidered.
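A minimal sketch of such a baseline gate, assuming accuracy as the selected metric, a majority-class baseline and a made-up margin of 0.05:

```python
# Hypothetical baseline gate: the candidate model must beat a trivial
# majority-class baseline by a minimum accuracy margin on the same test set.
from collections import Counter

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def baseline_gate(y_true, candidate_preds, min_margin=0.05):
    """Pass only if the candidate clearly beats always-predict-the-majority."""
    majority = Counter(y_true).most_common(1)[0][0]
    baseline_preds = [majority] * len(y_true)
    return accuracy(y_true, candidate_preds) >= accuracy(y_true, baseline_preds) + min_margin

y_true = [0, 0, 0, 1, 1, 0, 1, 0]   # majority class 0, baseline accuracy 0.625
good   = [0, 0, 0, 1, 1, 0, 1, 1]   # candidate accuracy 0.875, clears the margin
print(baseline_gate(y_true, good))  # True
```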

Examples of model quality gates within a machine learning pipeline

A/B tests

However, the actual performance of the model will probably only become apparent in a real-life situation, for example with the help of A/B tests. The A/B test (also called split test) is a method for evaluating two variants of a system, in which the original version is tested against a slightly modified version [3]. If the infrastructure supports A/B tests, these can be run and provide helpful insights.
For example, correlations between the previously tested offline metrics and the results of the online metrics of the A/B tests might exist. This makes it possible to assess to what extent the offline metrics can predict the online performance of the models at all, or which of them should preferably be used.
An A/B test between a first trained model and the baseline can already be carried out and assessed manually in advance.
In addition, if further model approaches are to be tested, the last best model can of course also be compared with a new, potentially better model. A/B testing can thus also be integrated as a quality gate before deployment at the end of the pipeline. However, A/B tests can be time-consuming and are therefore not always practical in automated form.
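A simple statistical check behind such an A/B gate could look like the following sketch. It assumes conversion counts as the online metric (an illustrative choice) and uses a two-proportion z-test; the 1.96 cutoff corresponds to the usual two-sided 5 % significance level.

```python
# Illustrative A/B gate: promote variant B only if its conversion rate is
# higher than A's and the difference is statistically significant.
import math

def ab_gate(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)              # pooled conversion rate
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se                                  # two-proportion z-score
    return p_b > p_a and z > z_crit

print(ab_gate(100, 1000, 150, 1000))  # True: B converts clearly better
print(ab_gate(100, 1000, 105, 1000))  # False: difference is not significant
```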

Production

One difference compared to the project management approach, however, is that the typical ML process runs in cycles. Ultimately, measuring the quality of a model in the various gates of a machine learning pipeline is just a bet on the future.
Once the model is in production, it is subject to a degeneration that can vary in degree and speed depending on the domain. This means that models become obsolete over time and may need to be adapted to new circumstances or data. This can be achieved, for example, by renewing or further training the models, which starts the process all over again.
Quality gates should therefore also be used during the operation of the models in order to determine whether the model still delivers, and can continue to deliver, adequate results in a constantly changing environment.
In the case of a data drift, the data itself has changed, for example in the distribution or value ranges of the data or features. Alternatively, the patterns that the model has learned are no longer valid and are subject to a concept drift [4]. For example, seasonal effects or, as just seen in times of a pandemic, unexpected effects can influence the prediction quality of models [5].
Therefore, even when the model is running, a further gate can regularly check new data and verify that it still matches the assumptions.
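Such a drift check can be sketched, for example, with a Kolmogorov-Smirnov statistic that compares new feature values against a reference sample; the 0.2 distance threshold below is an assumption and would need tuning per feature.

```python
# Sketch of a data-drift gate: the Kolmogorov-Smirnov statistic is the maximum
# distance between the two empirical distribution functions. Small-sample,
# O(n^2) implementation for illustration only.

def ks_statistic(sample_a, sample_b):
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in sorted(set(a + b)):
        cdf_a = sum(1 for x in a if x <= v) / len(a)
        cdf_b = sum(1 for x in b if x <= v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

def drift_gate(reference, current, max_distance=0.2):
    """Pass while the new data still matches the reference distribution."""
    return ks_statistic(reference, current) <= max_distance

reference = [x / 10 for x in range(100)]          # values 0.0 .. 9.9
similar   = [x / 10 + 0.05 for x in range(100)]   # slight shift, no drift
shifted   = [x / 10 + 5.0 for x in range(100)]    # strong shift, drift
print(drift_gate(reference, similar))  # True
print(drift_gate(reference, shifted))  # False
```

In practice, a library routine such as a two-sample KS test would replace this hand-rolled version.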

Machine learning quality gates chain including monitoring of current model quality

How often a model has to be retrained, and which changes have to be made, depends on the application and on the characteristics of the new conditions. Whether retraining is worthwhile at all must also be estimated, since the cost of training new models can be considerable.
This trade-off between cost and benefit, or even the criticality of the model, is also a gate. That gate, however, is only opened for the next step, a new training run, if the model is ‘bad’ enough.
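A minimal sketch of such a retraining trigger, with made-up numbers for the monitored metric and the tolerated degradation:

```python
# Illustrative retraining trigger: open the gate to a new training run only
# when the monitored online metric has degraded beyond a tolerated budget.
# The budget of 0.05 is an assumed, project-specific trade-off value.

def retrain_gate(baseline_metric, current_metric, max_drop=0.05):
    """Return True when the model is 'bad' enough to justify retraining."""
    return (baseline_metric - current_metric) > max_drop

print(retrain_gate(0.90, 0.88))  # False: still within tolerance
print(retrain_gate(0.90, 0.80))  # True: degradation exceeds the budget
```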

Conclusion

As exemplified here, quality gates can be integrated at various points in a pipeline to ensure the quality of the respective artifacts of a sub-process. For this purpose, success criteria that must be met in order to pass a gate must be defined in advance and can be checked automatically.
Stopping a machine learning pipeline early can save time, resources and money. After all, if a gate fails, it is foreseeable that sufficient prediction quality of the final model can no longer be guaranteed.
Finally, models must also be monitored in production to detect potential model degeneration at an early stage and respond accordingly.
In addition, a number of other gates and tests can be considered to ensure that the ecosystem around the machine learning approach works in practice. This is beyond the scope of this article, but it should not go unmentioned [6].

References

[1] codecentric Blog – Evaluating machine learning models: The issue with test data sets

[2] codecentric Blog – Evaluating machine learning models: How to tackle metrics

[3] Wikipedia – A/B testing

[4] Wikipedia – Concept drift

[5] Fortune – A.I. algorithms had to change when COVID-19 changed consumer behavior

[6] Google, Inc. – The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction

