Evaluating machine learning models: Establishing quality gates

7.12.2021 | 8 minutes of reading time

The quality or usefulness of machine learning models can be evaluated using test data and metrics. But to what extent, and how: manually or automatically, once or regularly? The first models that come out of a proof of concept can certainly still be evaluated and compared by hand. Once the number of models grows into the dozens or even hundreds, depending on the use case, and the models also have to be retrained constantly, the manual procedure no longer scales.
The preparation of the data, the training of the models and their evaluation can be automated in the form of machine learning pipelines. However, a qualified evaluation of the data or an evaluation of the predictive quality of a model has its pitfalls – especially in automated form.

Example of a simple machine learning pipeline

In classic software development, unit tests, integration tests, end-to-end tests and the like have become established within CI/CD pipelines for quality assurance. Creating machine learning models likewise requires measures that ensure the quality of a model. These should accompany the process from the preparation of the raw data to the delivery of a model, and ideally they are integrated into the pipeline as well.

Quality gate

The term ‘quality gate’ originates from project management and describes the introduction of ‘checkpoints’ into the project process. The aim is to divide the project duration into shorter, manageable sections in order to be able to track progress. In this context, verifiable success criteria are to be defined in advance for each gate.
For example, a gate contains a list of goals or quality requirements for the resulting artifacts at the end of a project phase. These must be met before the next stage in the process flow is started. However, checking the status can also lead to the project being aborted – because it becomes transparent that essential characteristics have not been achieved or will very likely not be achieved at all. Stopping a project at an early stage may therefore also save time, resources and money.

Quality gates in the project process

In the machine learning context, quality gates can be integrated as dedicated checks between the individual steps of an automated pipeline. Thus, they can track and monitor the quality of the artifacts in the process. Typical artifacts are the raw and processed data as well as the trained models.
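The idea of a check between two pipeline steps can be sketched in a few lines of Python. This is a minimal illustration and not a specific framework: the names `Step`, `QualityGateError` and `run_pipeline` are hypothetical, and the toy "model" is simply the mean of a list of numbers.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


class QualityGateError(Exception):
    """Raised when an artifact does not meet a gate's success criteria."""


@dataclass
class Step:
    name: str
    run: Callable[[Any], Any]  # produces the next artifact
    gate: Optional[Callable[[Any], List[str]]] = None  # returns violations


def run_pipeline(steps: List[Step], artifact: Any = None) -> Any:
    """Run the steps in order and stop at the first gate that reports violations."""
    for step in steps:
        artifact = step.run(artifact)
        violations = step.gate(artifact) if step.gate else []
        if violations:
            raise QualityGateError(f"gate after '{step.name}' failed: {violations}")
    return artifact


# Toy example (assumption): the "data" is a list of numbers, the "model" its mean.
def load_raw(_: Any) -> List[float]:
    return [1.0, 2.0, 3.0, 4.0]


def raw_data_gate(rows: List[float]) -> List[str]:
    return [] if len(rows) >= 3 else ["too few rows"]


def train(rows: List[float]) -> float:
    return sum(rows) / len(rows)


pipeline = [Step("load", load_raw, raw_data_gate), Step("train", train)]
model = run_pipeline(pipeline)
```

Raising an exception is only one possible design. A gate could just as well log a warning or mark the run as failed in whatever orchestration tool is in use.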

Data

The quality of the data is the basis for training machine learning models successfully at all. The first quality gates are therefore dedicated to the data and ensure that the models are trained on a meaningful data basis. First, the raw data can be checked: it must match the assumptions, statistics and characteristics established beforehand, otherwise corrections to the setup of the model and the training may become necessary.
In general, the preparation of the raw data is highly individual and may be costly. This means that further test points, which are not shown here, may have to be used between the preparation steps.
In addition, the distribution of the data in the resulting data sets can be tested. Each set should have a fair and representative distribution and contain a sufficient number of records. Otherwise, evaluations on the validation and test sets are only of limited value [1].
An insufficient data basis tends not to produce the high-quality models we aim for. If these gates are not passed, terminating early at the very beginning of the pipeline can save time, resources and money, especially if the gates can be introduced with manageable effort.
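Two such data gates could look roughly like the following sketch. The function names, the column names and all thresholds (`max_missing_share`, `min_test_size`, `max_share_diff`) are assumptions chosen for illustration; suitable values depend on the use case.

```python
from collections import Counter
from typing import List, Sequence


def check_raw_data(rows: Sequence[dict], required: Sequence[str],
                   max_missing_share: float = 0.05) -> List[str]:
    """Compare raw records with prior assumptions; return the violations found."""
    if not rows:
        return ["no rows"]
    violations = []
    for column in required:
        missing = sum(1 for row in rows if row.get(column) is None)
        share = missing / len(rows)
        if share > max_missing_share:
            violations.append(f"{column}: {share:.0%} missing values")
    return violations


def check_split(train_labels: Sequence[str], test_labels: Sequence[str],
                min_test_size: int = 4, max_share_diff: float = 0.10) -> List[str]:
    """Check that the test set is large enough and mirrors the class distribution."""
    if not train_labels or len(test_labels) < min_test_size:
        return [f"test set has only {len(test_labels)} records or training set is empty"]
    violations = []
    train_counts, test_counts = Counter(train_labels), Counter(test_labels)
    for label in set(train_counts) | set(test_counts):
        train_share = train_counts[label] / len(train_labels)
        test_share = test_counts[label] / len(test_labels)
        if abs(train_share - test_share) > max_share_diff:
            violations.append(
                f"class '{label}': {train_share:.0%} in train vs {test_share:.0%} in test"
            )
    return violations


rows = [
    {"age": 31, "label": "a"},
    {"age": None, "label": "b"},
    {"age": 45, "label": "a"},
    {"age": 28, "label": "b"},
]
raw_violations = check_raw_data(rows, ["age", "label"])
split_violations = check_split(["a"] * 6 + ["b"] * 4, ["a", "a", "a", "b", "b"])
```

Here the raw data check reports the `age` column, because one of four records has no value, while the split check passes because both sets contain 60% of class `a`.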

Examples of data quality gates

Metrics

To evaluate the quality of a model’s predictions in a pipeline, a set of criteria that can be evaluated automatically is mandatory. Ideally, numerical metrics are used for this purpose. They can be computed from the model’s predictions on the held-out test set.
Of course, the possible metrics and threshold values depend on the particular application. In a classification task, for example, a good trade-off between recall (coverage) and precision has to be found, and the prediction certainty above which a class counts as recognized has to be defined [2]. But once adequate metrics have been selected and the results or threshold values to be achieved have been defined, these criteria can be checked automatically in an initial quality gate for the model.
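As a rough sketch of such a gate, the following Python code computes precision and recall for a binary classifier at a given certainty threshold and compares them with required minimums. The labels, scores and minimum values are made up for illustration.

```python
from typing import Dict, List, Sequence


def precision_recall(y_true: Sequence[int], scores: Sequence[float],
                     certainty_threshold: float) -> Dict[str, float]:
    """Binary precision and recall when a score >= threshold counts as recognized."""
    predicted = [score >= certainty_threshold for score in scores]
    tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p)
    fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p)
    fn = sum(1 for t, p in zip(y_true, predicted) if t == 1 and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}


def metric_gate(metrics: Dict[str, float], minimums: Dict[str, float]) -> List[str]:
    """Return one message per metric that falls below its required minimum."""
    return [
        f"{name}: {metrics[name]:.2f} < {minimum:.2f}"
        for name, minimum in minimums.items()
        if metrics[name] < minimum
    ]


# Hypothetical test set: true labels and the model's certainty for class 1.
y_true = [1, 1, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.6]

metrics = precision_recall(y_true, scores, certainty_threshold=0.5)
failures = metric_gate(metrics, {"precision": 0.7, "recall": 0.8})
```

With these numbers both precision and recall are 0.75, so the gate fails on recall only. Lowering the certainty threshold would raise recall at the expense of precision, which is exactly the trade-off that has to be decided per application.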
However, the question arises of course: Is it worth all the effort to introduce complicated machine learning models? Or aren’t there simpler alternatives to meet the requirements?

Baselines

Rather rudimentary solutions can usually be implemented cheaply and quickly using simpler models. These baseline models can be built from heuristics, simple statistics or even randomly generated values, and they have to be beaten in a test. Comparing against baseline models gives an idea of what performance is easy to achieve. To do this, predictions of the baseline and of a candidate model are generated on the same test set and compared using the selected metrics. This shows how large the gap to a simpler solution really is.
There should be a certain performance difference in favor of the more complex model to justify its use. If it is clear at which difference the ML approach adds real value, this difference can be used as a criterion, and thus another gate can be integrated into the process. If the baseline is not beaten, the data or the training has to be optimized further, or the chosen approach itself has to be reconsidered.
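A baseline gate of this kind might be sketched as follows. The majority-class baseline, the accuracy metric and the required margin of 0.1 are assumptions picked for the example; any other heuristic, metric or margin could take their place.

```python
from collections import Counter
from typing import List, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(train_labels: Sequence[str], n: int) -> List[str]:
    """Heuristic baseline: always predict the most frequent training label."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return [most_common_label] * n


def beats_baseline(candidate_score: float, baseline_score: float,
                   min_margin: float) -> bool:
    """The gate passes only if the candidate is better by at least min_margin."""
    return candidate_score - baseline_score >= min_margin


# Hypothetical labels and predictions, evaluated on the same test set.
train_labels = ["spam", "ham", "ham", "ham", "spam", "ham"]
y_test = ["ham", "spam", "ham", "ham", "spam"]
candidate_pred = ["ham", "spam", "ham", "spam", "spam"]

baseline_pred = majority_baseline(train_labels, len(y_test))
baseline_acc = accuracy(y_test, baseline_pred)
candidate_acc = accuracy(y_test, candidate_pred)
gate_passed = beats_baseline(candidate_acc, baseline_acc, min_margin=0.1)
```

In this toy case the baseline reaches an accuracy of 0.6 and the candidate 0.8, so the gate passes with the assumed margin.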

Examples of Model Quality Gates

A/B tests

However, the actual performance of the model will probably only become apparent in a real-life situation, for example with the help of A/B tests. The A/B test (also called split test) is a test method for evaluating two variants of a system, in which the original version is tested against a slightly modified version [3]. If the infrastructure supports A/B tests, they can be run and provide helpful insights.
For example, correlations between the previously tested offline metrics and the results of the online metrics of the A/B tests might exist. This makes it possible to assess to what extent the offline metrics can predict the performance of the models at all, or which ones should most likely be considered.
What can be done and assessed manually in advance is an A/B test between a first trained model and the baseline.
In addition, if further model approaches are to be tested, the last best model can of course also be compared with a new potentially better model. Thus A/B testing can also be integrated as a quality gate before deployment at the end of the pipeline. However, A/B tests might be time-consuming and therefore not always practical in an automated way.
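Whether an offline metric says anything about online performance can be estimated once several model variants have gone through an A/B test. The sketch below computes the Pearson correlation coefficient between an offline metric and an online metric. The metric names and all numbers are hypothetical, and with so few data points the result is only a rough indication.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values: one entry per model variant that went through an A/B test.
offline_f1 = [0.61, 0.66, 0.70, 0.74, 0.79]
online_click_rate = [0.031, 0.034, 0.033, 0.038, 0.041]

r = pearson(offline_f1, online_click_rate)
print(f"correlation between offline and online metric: {r:.2f}")
```

A value close to 1 suggests that the offline metric is a useful predictor of online results. A value near 0 suggests that a different offline metric should be considered.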

Production

One difference from the project management approach, however, is that the typical ML process runs in cycles. Ultimately, measuring the quality of a model in the various gates within a machine learning pipeline is just a bet on the future.
Once the model is in production, it is subject to some degeneration, which can vary in degree and speed depending on the domain. This means that models become obsolete over time and may need to be adapted to new circumstances or data. This can be achieved, for example, by retraining the models from scratch or by continuing their training, which starts the process all over again.
Quality gates are therefore also to be used during the operation of the models in order to determine whether the model delivers or can continue delivering adequate results in a constantly changing environment.
In the case of a ‘data drift’, the data itself has changed, for example the distribution or the value ranges of the features. Alternatively, the patterns that the model has learned are no longer valid, which is known as concept drift [4]. For example, seasonal effects or, as recently seen during the pandemic, unexpected effects can influence the prediction quality of models [5].
Therefore, even when the model is running, a further gate can regularly check new data and verify that it still matches the assumptions.
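A very simple version of such a monitoring gate for a single numerical feature is sketched below. It is a heuristic and not a formal statistical test: it checks how far the mean has moved and how many new values fall outside the value range seen at training time. The thresholds and the example values are assumptions, and the reference data must contain at least two different values so that its standard deviation is not zero.

```python
from statistics import mean, stdev
from typing import List, Sequence


def drift_violations(reference: Sequence[float], current: Sequence[float],
                     max_mean_shift: float = 0.5,
                     max_out_of_range_share: float = 0.05) -> List[str]:
    """Compare new values of one feature with the values seen at training time."""
    violations = []
    # Shift of the mean, measured in standard deviations of the reference data.
    shift = abs(mean(current) - mean(reference)) / stdev(reference)
    if shift > max_mean_shift:
        violations.append(f"mean shifted by {shift:.2f} reference standard deviations")
    # Share of new values outside the value range seen at training time.
    low, high = min(reference), max(reference)
    outside = sum(1 for v in current if v < low or v > high) / len(current)
    if outside > max_out_of_range_share:
        violations.append(f"{outside:.0%} of values outside [{low}, {high}]")
    return violations


# Hypothetical feature "age": training data and two batches from production.
training_ages = [23, 31, 35, 38, 42, 47, 51, 56]
stable_batch = [25, 33, 36, 40, 44, 50]
drifted_batch = [52, 58, 61, 64, 67, 70]

stable_result = drift_violations(training_ages, stable_batch)
drifted_result = drift_violations(training_ages, drifted_batch)
```

The stable batch passes, while the drifted batch violates both criteria. Such a check only covers data drift. Concept drift usually shows up only when the predictions are compared with the actual outcomes.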

Machine Learning quality gates chain including monitoring of current model quality

How often a model has to be retrained and which changes have to be made depends on the application and the characteristics of the new conditions. Whether retraining is worthwhile at all must also be estimated, since the cost of training new models can be considerable.
This trade-off between cost and benefit or even criticality of the model is also a gate. But that gate is only opened for the next step, the new training, if the model is ‘bad’ enough.
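Expressed as code, this last gate could be as small as the following sketch. All parameters are hypothetical, and in practice the expected benefit of retraining is itself an estimate that is hard to quantify.

```python
def retraining_gate(live_score: float, min_acceptable_score: float,
                    expected_benefit: float, retraining_cost: float) -> bool:
    """Open the gate to a new training run only if the model is 'bad' enough
    and the estimated benefit outweighs the cost (both in the same unit)."""
    model_is_bad_enough = live_score < min_acceptable_score
    worth_the_cost = expected_benefit > retraining_cost
    return model_is_bad_enough and worth_the_cost


# Hypothetical values: score measured in production, cost and benefit in euros.
retrain_now = retraining_gate(live_score=0.71, min_acceptable_score=0.75,
                              expected_benefit=5000.0, retraining_cost=1200.0)
```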

Conclusion

As exemplified here, quality gates can be integrated at various points in a pipeline to ensure the quality of the respective artifacts of a sub-process. For this purpose, success criteria that must be met in order to pass a gate must be defined in advance and can be checked automatically.
Stopping a machine learning pipeline early can save time, resources and money. After all, if a gate fails, it is foreseeable that in the end a sufficient prediction quality of the model can no longer be guaranteed.
Finally, models must also be monitored in production to detect potential model degeneration at an early stage and respond accordingly.
In addition, a number of other gates and tests can be considered to ensure that the ecosystem around the machine learning approach works in practice. These are beyond the scope of this article but should not go unmentioned [6].

References

[1] codecentric Blog – Evaluating machine learning models: The issue with test data sets

[2] codecentric Blog – Evaluating machine learning models: How to tackle metrics

[3] Wikipedia – AB-Test

[4] Wikipedia – Concept drift

[5] Fortune – A.I. algorithms had to change when COVID-19 changed consumer behavior

[6] Google, Inc. – The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction

Blog author

Berthold Schulte

Consultant Data & AI