Evaluating machine learning models: How to tackle metrics

1.7.2019 | 13 minutes of reading time

Once a model has been trained, it can be evaluated in different ways, using procedures and metrics of varying complexity and significance. However, the sheer number of possible evaluation criteria can be quite confusing to someone who is just starting out in machine learning.

For example, the choice depends on whether the learning is unsupervised or supervised. In the case of supervised learning, it also depends on whether we are dealing with regression or classification, on the underlying use case, and so on, to name just a few criteria. I would like to start with supervised learning and classification. In this article I will introduce seven common metrics and methods for evaluating machine learning models, using one example throughout. A detailed discussion would be beyond the scope of this introduction, so I will touch only briefly on the mathematics underlying the metrics.

Use case

A manufacturer of drinking glasses wants to identify and sort out defective glasses in production. An image classification model is to be trained and used for this purpose. The data set consists of images of intact and defective glasses. In this binary classification, intact glasses are represented by 0 and defective glasses by 1.

In the following test data (y_true), for example, there are eight defective and two intact glasses, and the model (y_pred) correctly classifies seven of the eight defective glasses as defective.


# Test data  
y_true: [0, 1, 1, 1, 0, 1, 1, 1, 1, 1]

# Forecast/prediction of the model
y_pred: [0, 0, 1, 1, 0, 1, 1, 1, 1, 1]

Accuracy

Accuracy is probably the most easily understood metric and compares the number of correct classifications to all the classifications to be made.

In the following example, only one value was wrongly predicted and therefore an accuracy of 90% was achieved.


y_true:   [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  
y_pred:   [0, 1, 0, 0, 0, 1, 1, 1, 1, 1]  
Accuracy: 90%.
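
As a minimal sketch of the calculation, accuracy can be computed by counting matching labels. The function name `accuracy` is chosen here for illustration and is not taken from any particular library.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
    return correct / len(y_true)


# Example from the text: exactly one of ten values is predicted wrongly.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 0, 0, 1, 1, 1, 1, 1]
print(accuracy(y_true, y_pred))  # 0.9
```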

If the classes in the data are very unevenly distributed, looking at accuracy alone can lead to wrong conclusions. For example, if one class occurs nine times as often in the data as the other, simply always predicting the more frequent class is enough to achieve 90% accuracy, although the model may not be able to predict the other class at all. With this metric and such a class distribution, it is impossible to assess whether the model can distinguish defective from intact glasses, since it always seems to predict a defective glass, as in the following example.


y_true:   [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]  
y_pred:   [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]  
Accuracy: 90%.

Apart from such pitfalls, which metrics are suitable for evaluating machine learning models naturally also depends on the use case.

Binary classification

First, a binary classification [1] can distinguish four cases:

The class was predicted – which is correct (hit, true positive)
The class was not predicted – which is correct (correct reject, true negative)
The class was predicted – which is wrong (false alarm, false positive)
The class was not predicted – which is wrong (miss, false negative)

For these four cases, however, different designations are common [2]. Since hit, correct reject, false alarm and miss seem to be most descriptive to me, I will use them here.
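
The four cases can be counted directly from the labels. The following is a small sketch; the helper `count_cases` and its `positive` parameter are illustrative names, with 1 (defective) assumed to be the class of interest.

```python
def count_cases(y_true, y_pred, positive=1):
    """Count hits, correct rejects, false alarms and misses."""
    counts = {"hit": 0, "correct reject": 0, "false alarm": 0, "miss": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["hit" if actual == positive else "false alarm"] += 1
        else:
            counts["miss" if actual == positive else "correct reject"] += 1
    return counts


# Test data and predictions from the use case above.
y_true = [0, 1, 1, 1, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 0, 1, 1, 1, 1, 1]
print(count_cases(y_true, y_pred))
# {'hit': 7, 'correct reject': 2, 'false alarm': 0, 'miss': 1}
```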

Trade-Off between coverage and precision

If, for example, it is important to recognize a class in the data as comprehensively as possible, the coverage (recall, hit rate, true positive rate) can be used as a quality criterion. Recall is the number of hits in relation to the sum of hits and misses, i.e. all defective glasses to be detected:

$$\text{Recall} = \frac{\text{hit}}{\text{hit} + \text{miss}}$$

If the model, as above, simply always predicts a defective glass, i.e. only one class, every defective glass is found: hit = 9 and miss = 0, and thus Recall = 9 / (9 + 0) = 1.

In contrast, precision is the ratio of the number of hits to the sum of hits and false alarms, i.e. all glasses supposedly detected as defective:

$$\text{Precision} = \frac{\text{hit}}{\text{hit} + \text{false alarm}}$$

Here, predicting only one class of course leads to one false alarm and thus to lower precision: Precision = 9 / (9 + 1) = 0.9.
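
The two ratios can be sketched in a few lines. The function names are illustrative, and returning NaN when a denominator is zero is an assumption made for this sketch, not a convention taken from the article.

```python
def recall(hit, miss):
    """Hits in relation to all glasses that are actually defective."""
    total = hit + miss
    # Assumption: without any defective glasses, recall is reported as NaN.
    return hit / total if total else float("nan")


def precision(hit, false_alarm):
    """Hits in relation to all glasses that were predicted as defective."""
    total = hit + false_alarm
    # Assumption: without any positive predictions, precision is reported as NaN.
    return hit / total if total else float("nan")


# One intact and nine defective glasses; the model always predicts "defective".
y_true = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
y_pred = [1] * 10

pairs = list(zip(y_true, y_pred))
hit = sum(1 for t, p in pairs if t == 1 and p == 1)
miss = sum(1 for t, p in pairs if t == 1 and p == 0)
false_alarm = sum(1 for t, p in pairs if t == 0 and p == 1)

print(recall(hit, miss))            # 1.0
print(precision(hit, false_alarm))  # 0.9
```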

In short, recall describes how many of the glasses that are actually defective were found, and precision describes how many of the supposed hits were correct. Achieving both complete coverage and high precision is desirable. In practice, however, this is not always possible, and the focus is not always on both. So that the two metrics do not have to be checked and weighed against each other separately, they can be combined in the F-measure.

F-measure / F1 score

The F-measure is the harmonic mean of recall and precision and combines the two metrics into one value [3]:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Using the harmonic mean makes the metric more sensitive to the smaller of the two values than the arithmetic mean would be.
For example, a recall of 1.0 and a precision of 0.1 result in an arithmetic mean of 0.55, which intuitively does not do justice to the small precision value. The harmonic mean, however, comes to 0.18 and points to a worse model.
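
The difference between the two means is easy to check numerically. This is only a sketch; treating the harmonic mean as 0.0 when both inputs are 0 is an assumption made here to avoid a division by zero.

```python
def arithmetic_mean(a, b):
    return (a + b) / 2


def harmonic_mean(a, b):
    # Assumption: defined as 0.0 when both values are 0.
    return 2 * a * b / (a + b) if (a + b) else 0.0


recall, precision = 1.0, 0.1
print(f"{arithmetic_mean(recall, precision):.2f}")  # 0.55
print(f"{harmonic_mean(recall, precision):.2f}")    # 0.18
```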

The F1 score is a special case of a more general variant, the F-beta score:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$

With the F1 score, beta is set to 1, so recall and precision are weighted equally. If the application requires it, recall and precision can also be weighted differently: a beta value between 0 and 1 weights precision more strongly, a value greater than 1 weights recall more strongly.
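
To illustrate the effect of beta, here is a minimal sketch of the F-beta score using the precision (0.9) and recall (1.0) values from the single-class example above. The function `f_beta` is an illustrative helper, and returning 0.0 for a zero denominator is an assumption.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    b2 = beta ** 2
    denominator = b2 * precision + recall
    # Assumption: the score is reported as 0.0 if precision and recall are both 0.
    return (1 + b2) * precision * recall / denominator if denominator else 0.0


precision, recall = 0.9, 1.0
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(precision, recall, beta), 4))
# 0.5 0.9184
# 1.0 0.9474
# 2.0 0.9783
```

Because recall is the larger of the two values here, the score rises as beta shifts the weight towards recall.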

However, even with this metric an unequal distribution of the classes in the data can distort the result. In addition, the correct reject case is not considered in the F-measure. What would be desirable is a method that considers all the above-mentioned cases of a binary classification, is robust against unbalanced class distributions, and is easy to represent.

Confusion Matrix

The comparison of the model’s predictions and the correct values in the form of a contingency table provides a more detailed insight into the model’s performance. For each case, the number of times it occurred in the result is counted, and the number is entered into a contingency table. The rows describe the predictions and the columns show the correct values. In the binary case this table is also called Confusion Matrix [2].
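The four cases can be found in this matrix as follows:

                         Actually defective (1)   Actually intact (0)
Predicted defective (1)  hit                      false alarm
Predicted intact (0)     miss                     correct reject

In the following example, the classes are not distributed equally but are in the ratio 7:3.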


y_true: [0, 0, 0, 1, 1, 1, 1, 1, 1, 1] 
y_pred: [0, 1, 1, 0, 1, 1, 1, 1, 1, 1]

hit = 6, false alarm = 2, miss = 1, correct reject = 1

The model correctly detected six defective glasses (hit) and correctly classified one intact glass as intact (correct reject). However, two intact glasses were flagged as defective (false alarm) and one defective glass was not detected (miss). The aim is to achieve the highest possible values on the diagonal of the matrix from top left to bottom right, i.e. high values only for hit and correct reject.

The matrix covers all cases, but it is not always easy to keep track of them, and it does not offer an intuitive, at-a-glance statement about the performance of the model. Nevertheless, one gets an impression of how the results relate to each other. For example, the values that influence recall and precision can be found directly in the matrix.
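
The matrix for the example can be built with a short sketch. It follows the orientation described in the text (rows are predictions, columns are true values) and assumes the positive class comes first; the function name is illustrative.

```python
def confusion_matrix(y_true, y_pred, positive=1):
    """Return [[hit, false alarm], [miss, correct reject]].

    Rows are predictions, columns are true values, positive class first.
    """
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        row = 0 if predicted == positive else 1
        col = 0 if actual == positive else 1
        matrix[row][col] += 1
    return matrix


y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 0, 1, 1, 1, 1, 1, 1]
matrix = confusion_matrix(y_true, y_pred)
(hit, false_alarm), (miss, correct_reject) = matrix

print(matrix)                               # [[6, 2], [1, 1]]
print(round(hit / (hit + miss), 2))         # recall: 0.86
print(round(hit / (hit + false_alarm), 2))  # precision: 0.75
```

Libraries may orient the matrix differently. scikit-learn's `confusion_matrix`, for instance, puts the true labels in the rows and the predictions in the columns, so it is worth checking the layout before reading values off it.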

Matthews Correlation Coefficient (MCC)

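A metric that summarizes all cases and is considered suitable for data sets with unbalanced class distributions is the Matthews Correlation Coefficient [4,5,6]. It can be read and calculated from the Confusion Matrix as follows:

$$\text{MCC} = \frac{\text{hit} \cdot \text{correct reject} - \text{false alarm} \cdot \text{miss}}{\sqrt{(\text{hit} + \text{false alarm})(\text{hit} + \text{miss})(\text{correct reject} + \text{false alarm})(\text{correct reject} + \text{miss})}}$$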

The value range of the metric is between -1 and 1. The value 1 is desirable, 0 is random, and negative values indicate a contradictory assessment of the model. However, the MCC is not defined if one of the four sums in the denominator is 0.
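
A minimal sketch of the calculation from the four counts follows. The `undefined` parameter is an assumption of this sketch: it is the value returned when the denominator is 0, and its default of 0.0 matches the convention scikit-learn uses.

```python
from math import sqrt


def mcc(hit, false_alarm, miss, correct_reject, undefined=0.0):
    """Matthews Correlation Coefficient from the four confusion matrix counts."""
    numerator = hit * correct_reject - false_alarm * miss
    denominator = sqrt(
        (hit + false_alarm)
        * (hit + miss)
        * (correct_reject + false_alarm)
        * (correct_reject + miss)
    )
    if denominator == 0:
        # The MCC is not defined here; return the agreed placeholder instead.
        return undefined
    return numerator / denominator


print(round(mcc(6, 2, 1, 1), 4))  # 0.2182 (Confusion Matrix example above)
print(mcc(7, 0, 0, 3))            # 1.0 (perfect classification)
print(mcc(9, 1, 0, 0))            # 0.0 (undefined: nothing predicted as intact)
```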

The crux

The values for Accuracy, Precision and Recall in the above example of the Confusion Matrix are:


Accuracy:  0.70
Precision: 0.75
Recall:    0.86

If we compared this model with another one on the basis of accuracy alone, it could already be excluded at this point.
A model with good accuracy but poor recall and precision might then be preferred, although it is not necessarily the best model for this application. In this case, recall and precision, for example, would be the more interesting metrics, since they reflect the requirements more precisely and thus allow the model to be assessed and compared more reliably.
Relying on accuracy alone can therefore again lead to unfavorable conclusions about the performance of a model, depending on the application. This phenomenon is commonly known as the Accuracy Paradox [7].

In addition, depending on the distribution of classes in the data, some metrics may not represent changes in the predictions or may tend in different directions. The following examples give another small impression of how different the ratings can be.


y_true:   [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]   
y_pred:   [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]   
Accuracy: 0.90
F1 score: 0.95 (0.9474)
Matthews CC: 0.00

With values of at least 0.90, F1 and accuracy already suggest quite a good model. The Matthews Correlation Coefficient, however, clouds the picture and points to predictions that are correct only by chance: intact glasses are extremely underrepresented, and the single intact glass was even misclassified.

In the following case, intact glasses occur more often in the data and thus allow for a better evaluation.


y_true:   [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
y_pred:   [1, 0, 1, 1, 1, 1, 1, 1, 1, 1]
Accuracy: 0.90
F1 score: 0.94 (0.9412)
Matthews CC: 0.67 (0.6667)

The Matthews Correlation Coefficient now points to a better model with a value of 0.67, since at least one of the two intact glasses was recognized correctly. Accuracy and F1 react not at all or only slightly to the change and come out at the same or quite similar values as in the previous example. But are the decisions still being made by chance? After all, of the two intact glasses, one is classified correctly and the other incorrectly.

In the following example, two of three intact glasses are predicted correctly and one wrongly, so that the performance of the model can be evaluated even better. The Matthews Correlation Coefficient acknowledges this with an increase from 0.67 to 0.76, while F1 drops further from 0.94 to 0.93.


y_true:   [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred:   [1, 0, 0, 1, 1, 1, 1, 1, 1, 1]
Accuracy: 0.90
F1 score: 0.93 (0.9333)
Matthews CC: 0.76 (0.7638)

It is therefore advisable to look at several metrics when comparing models, especially with unbalanced class distributions, and to determine in advance which metrics are relevant for the use case.
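
The three examples above can be reproduced side by side with a short sketch. The helper `scores` is illustrative; reporting F1 and MCC as 0.0 when their denominators are 0 is an assumption that matches the values shown in the examples.

```python
from math import sqrt


def scores(y_true, y_pred):
    """Return (accuracy, F1, MCC) for binary labels with 1 as positive class."""
    pairs = list(zip(y_true, y_pred))
    hit = sum(1 for t, p in pairs if t == 1 and p == 1)
    miss = sum(1 for t, p in pairs if t == 1 and p == 0)
    false_alarm = sum(1 for t, p in pairs if t == 0 and p == 1)
    correct_reject = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (hit + correct_reject) / len(pairs)
    # F1 written in terms of counts; equivalent to the harmonic mean form.
    f1_denominator = 2 * hit + false_alarm + miss
    f1 = 2 * hit / f1_denominator if f1_denominator else 0.0
    mcc_denominator = sqrt(
        (hit + false_alarm) * (hit + miss)
        * (correct_reject + false_alarm) * (correct_reject + miss)
    )
    mcc_numerator = hit * correct_reject - false_alarm * miss
    mcc = mcc_numerator / mcc_denominator if mcc_denominator else 0.0
    return accuracy, f1, mcc


# One, two and three intact glasses; the first intact glass is always missed.
examples = [
    ([0] + [1] * 9, [1] + [1] * 9),
    ([0, 0] + [1] * 8, [1, 0] + [1] * 8),
    ([0, 0, 0] + [1] * 7, [1, 0, 0] + [1] * 7),
]
for y_true, y_pred in examples:
    accuracy, f1, mcc = scores(y_true, y_pred)
    print(round(accuracy, 2), round(f1, 4), round(mcc, 4))
# 0.9 0.9474 0.0
# 0.9 0.9412 0.6667
# 0.9 0.9333 0.7638
```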

However, what is the minimum score we can be satisfied with for this model? This, of course, depends on the use case and its requirements and is often difficult to assess if there is no benchmark to compare against, such as the reliability of a human decision maker.
A graphical presentation of the model's performance can be instructive here and help achieve a good trade-off.

Trade-Off between hit and false alarm rate

While the glass manufacturer's sales department strives to achieve the highest possible throughput in production, the quality and legal department will be more eager to minimize the number of defective glasses going on sale. This means that the sales department wants to avoid waste consisting of glasses unnecessarily classified as defective (false alarm). The quality assurance department, on the other hand, insists on recognizing all defective glasses (hits), even if this means that a few intact glasses cannot be sold.

Receiver Operating Characteristic

The Receiver Operating Characteristic curve (ROC curve) [8] plots the proportion of objects correctly classified as positive, i.e. the hit rate, against the proportion of objects falsely classified as positive, i.e. the false alarm rate. The false alarm rate is the ratio of the number of glasses falsely detected as defective to the number of all glasses that are actually intact (false alarm + correct reject).

This means that the proportion of glasses correctly identified as defective is compared with the proportion of glasses falsely identified as defective.

In addition, it is a matter of optimizing a threshold value. Depending on the method used, a model does not simply output the numbers 0 and 1 as its classification. Rather, the values lie between 0 and 1 and describe a probability of belonging to a class, which then has to be interpreted: above which value do you want to accept that a class has been recognized?

For each threshold value, the hit rate is plotted on the vertical axis and the false alarm rate on the horizontal axis. The diagonal from bottom left to top right marks the level of random guessing. Curves that run far above this line and into the upper left corner therefore indicate a good model: the hit rate there is as high as possible while the false alarm rate is very low. The result is the typical curve (black in the diagram), on which you can now decide which trade-off to accept, or, if several models are plotted, which model to use with which threshold value, or whether to continue training.

With the model in the following example (orange curve in the diagram), a hit rate of 0.86 could be achieved if a false alarm rate of 0.33 is accepted and a threshold value of 0.3 is applied. The thresholds themselves are not shown in the curve, but are listed in the following table.


y_true: [0    0    0    1    1    1    1    1    1    1   ] 
y_pred: [0.22 0.24 0.40 0.23 0.30 0.42 0.70 0.50 0.60 0.80]

Hit rate:         [0.00 0.14 0.71 0.71 0.86 0.86 1.00 1.00]
False alarm rate: [0.00 0.00 0.00 0.33 0.33 0.67 0.67 1.00]
Thresholds:       [1.80 0.80 0.42 0.40 0.30 0.24 0.23 0.22]

If you insist on a false alarm rate of 0, a hit rate of 0.71 is possible. For this purpose, the threshold value must be set to 0.42; a higher threshold would keep the false alarm rate at 0 but lower the hit rate. In practice, of course, there are more data records available and correspondingly much finer gradations from which a threshold value can be selected. In addition, the area under the curve (AUC) can be interpreted; it takes a value between 0 and 1. Here, 1 once again represents the best value and 0.5 represents chance, which in this case is also the worst value [9]. The AUC values for the curves are listed in the key of the diagram.
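
The table can be reproduced by sweeping over the listed thresholds. This is a sketch under two assumptions that are not stated in the text: a glass counts as defective when its score is greater than or equal to the threshold, and the AUC is approximated with the trapezoidal rule. The function names are illustrative.

```python
def roc_points(y_true, y_score, thresholds):
    """Hit rate and false alarm rate for each threshold."""
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    hit_rates, false_alarm_rates = [], []
    for threshold in thresholds:
        # Assumption: a score at or above the threshold means "defective".
        y_pred = [1 if score >= threshold else 0 for score in y_score]
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        false_alarms = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        hit_rates.append(hits / positives)
        false_alarm_rates.append(false_alarms / negatives)
    return hit_rates, false_alarm_rates


def area_under_curve(false_alarm_rates, hit_rates):
    """Trapezoidal area under the curve through the given points."""
    points = sorted(zip(false_alarm_rates, hit_rates))
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_score = [0.22, 0.24, 0.40, 0.23, 0.30, 0.42, 0.70, 0.50, 0.60, 0.80]
thresholds = [1.80, 0.80, 0.42, 0.40, 0.30, 0.24, 0.23, 0.22]

hit_rates, false_alarm_rates = roc_points(y_true, y_score, thresholds)
print([round(v, 2) for v in hit_rates])
# [0.0, 0.14, 0.71, 0.71, 0.86, 0.86, 1.0, 1.0]
print([round(v, 2) for v in false_alarm_rates])
# [0.0, 0.0, 0.0, 0.33, 0.33, 0.67, 0.67, 1.0]
print(round(area_under_curve(false_alarm_rates, hit_rates), 3))  # 0.857
```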

Evaluating machine learning models – Conclusion

In general, you can try to train a model that achieves the best possible values for all metrics. In practice, however, the actual benefit often does not justify the effort required to achieve the last per mille of improvement. A pragmatic, application-oriented approach is more likely to lead to success and provides the necessary leeway to achieve a meaningful trade-off. After all, the test results and the resulting metric values only allow models to be compared with each other. How good a model really is will only be shown in practice. Moreover, the metrics do not explain how the model arrived at its decision, although in some areas such a justification may well be necessary [10].

But how much test data is actually necessary for a meaningful evaluation?
Read in "Evaluating machine learning models: The issue with test data sets" what influence the size of the test data set can have on the comparability of models.

References

[1] Wikipedia, Evaluation of a binary classifier
[2] Wikipedia, Confusion Matrix
[3] Yutaka Sasaki, 2007, The truth of the F-measure
[4] Boughorbel S, Jarray F, El-Anbari M (2017) Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLoS ONE 12(6): e0177678. doi.org/10.1371/journal.pone.0177678
[5] Davide Chicco, 2017, Ten quick tips for machine learning in computational biology. doi:10.1186/s13040-017-0155-3
[6] Wikipedia, Matthews correlation coefficient
[7] Wikipedia, Accuracy paradox
[8] Tom Fawcett, 2005, An introduction to ROC analysis, doi.org/10.1016/j.patrec.2005.10.010
[9] Wikipedia, Receiver Operating Characteristic
[10] Shirin Elsinghorst, Explanability of Machine Learning Methods, The Softwerker Vol. 13

Blog author

Berthold Schulte

Consultant Data & AI
