# Evaluating machine learning models: How to tackle metrics

09/06/19

Once a model has been trained, it can be evaluated in different ways and with more or less complex and meaningful procedures and metrics. However, the sheer number of possible criteria for evaluating machine learning models can initially be quite confusing to someone who is just starting out in the field of machine learning.

For example, the choice of criteria depends on whether the learning is unsupervised or supervised. In the case of supervised learning it also depends on whether we are dealing with regression or classification, on the underlying use case, and so on, to name just a few criteria. I would like to start with supervised learning and classification. In this article I will introduce seven common metrics and methods for evaluating machine learning models, using one example throughout. A detailed discussion would be beyond the scope of this introduction, so I will touch only briefly on the mathematics underlying the metrics.

## Use case

A manufacturer of drinking glasses wants to identify and sort out defective glasses in his production. A model for image classification is to be trained and used for this purpose. The database consists of images of intact and defective glasses. Intact glasses are represented by 0 in this binary classification; accordingly, defective glasses are represented by 1.

In the following test data (y_true), for example, eight defective and two intact glasses are present, and the model (y_pred) correctly classifies seven of the eight defective glasses as defective.

    # Test data
    y_true: [0, 1, 1, 1, 0, 1, 1, 1, 1, 1]
    # Forecast/prediction of the model
    y_pred: [0, 0, 1, 1, 0, 1, 1, 1, 1, 1]

## Accuracy

Accuracy is probably the most easily understood metric and compares the number of correct classifications to all the classifications to be made.

$$Accuracy=\frac{\#\ correctly\ classified}{\#\ all\ classifications}$$

In the following example, only one value was wrongly predicted and therefore an accuracy of 90% was achieved.

    y_true: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    y_pred: [0, 1, 0, 0, 0, 1, 1, 1, 1, 1]
    Accuracy: 90%
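As a minimal sketch, the calculation can be reproduced in a few lines of Python. The helper name `accuracy` is my own choice for illustration and not taken from a particular library.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 0, 0, 1, 1, 1, 1, 1]

acc = accuracy(y_true, y_pred)
print(f"Accuracy: {acc:.0%}")  # Accuracy: 90%
```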

If the classes in the data are extremely unevenly distributed, looking at the accuracy alone can unfortunately lead to wrong conclusions. For example, if one class occurs nine times more often in the data than the other, simply predicting the more frequent class is enough to achieve 90% accuracy, although the model may not be able to predict the other class at all. Whether the model can distinguish defective glasses from intact glasses cannot be assessed with this metric and such a class distribution, since the model seems to predict a defective glass every time, as seen in the following example.

    y_true: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    y_pred: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Accuracy: 90%
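One way to spot this situation is to compare a model's accuracy with what a trivial "always predict the most frequent class" baseline would achieve. The following sketch assumes a hypothetical helper `majority_baseline_accuracy`; it only looks at the true labels.

```python
from collections import Counter


def majority_baseline_accuracy(y_true):
    """Accuracy of a 'model' that always predicts the most frequent class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


y_true = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

baseline = majority_baseline_accuracy(y_true)
print(baseline)  # 0.9
```

An accuracy of 90% on this data is therefore no better than the baseline and says nothing about whether intact glasses are recognized.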

Apart from such pitfalls, which metrics are suitable for evaluating a machine learning model naturally also depends on the use case.

## Binary classification

First, a binary classification [1] can distinguish four cases:

- The class was predicted – which is correct (hit, true positive)
- The class was not predicted – which is correct (correct reject, true negative)
- The class was predicted – which is wrong (false alarm, false positive)
- The class was not predicted – which is wrong (miss, false negative)

For these four cases, however, different designations are common [2]. Since *hit*, *correct reject*, *false alarm* and *miss* seem to be most descriptive to me, I will use them here.
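To make the four cases concrete, here is a small sketch that counts them for the test data from the use case above. The function name `count_cases` and the `positive` parameter are illustrative assumptions; class 1 (defective) is treated as "the class".

```python
def count_cases(y_true, y_pred, positive=1):
    """Count hit, correct reject, false alarm and miss for a binary problem."""
    counts = {"hit": 0, "correct_reject": 0, "false_alarm": 0, "miss": 0}
    for truth, prediction in zip(y_true, y_pred):
        if prediction == positive:
            # The class was predicted: either a hit or a false alarm.
            counts["hit" if truth == positive else "false_alarm"] += 1
        else:
            # The class was not predicted: either a miss or a correct reject.
            counts["miss" if truth == positive else "correct_reject"] += 1
    return counts


y_true = [0, 1, 1, 1, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 0, 1, 1, 1, 1, 1]

cases = count_cases(y_true, y_pred)
print(cases)  # {'hit': 7, 'correct_reject': 2, 'false_alarm': 0, 'miss': 1}
```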

## Trade-Off between coverage and precision

If, for example, it is important to recognize a class in the data as comprehensively as possible, the coverage (recall, hit rate, true positive rate) can be used as a quality criterion. Recall is the number of *hits* in relation to the sum of *hits* and *misses*, i.e. all defective glasses that should be detected:

$$Recall=\frac{hit}{hit+miss}$$

If the model simply predicts a defective glass every time, as above, i.e. only one class, the coverage is of course still perfect: with \(hit=9\) and \(miss=0\) we get \(Recall=\frac{9}{9+0}=1\).

In contrast to this, precision describes the ratio of the number of *hits* to the sum of *hits* and *false alarms*, i.e. all glasses that the model flagged as defective:

$$Precision=\frac{hit}{hit+false\ alarm}$$

Here, predicting only one class of course raises the number of *false alarms* to 1 and thus results in a lower precision: \(Precision=\frac{9}{9+1}=0.9\).
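The two formulas can be sketched directly in Python for the "always defective" model. The function names are my own, and both functions would raise a `ZeroDivisionError` if their denominator were zero, which this sketch deliberately does not handle.

```python
def recall(hit, miss):
    """Hits relative to all glasses that are actually defective."""
    return hit / (hit + miss)


def precision(hit, false_alarm):
    """Hits relative to all glasses predicted as defective."""
    return hit / (hit + false_alarm)


y_true = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
y_pred = [1] * 10  # the model always predicts "defective"

pairs = list(zip(y_true, y_pred))
hit = sum(t == 1 and p == 1 for t, p in pairs)
miss = sum(t == 1 and p == 0 for t, p in pairs)
false_alarm = sum(t == 0 and p == 1 for t, p in pairs)

rec = recall(hit, miss)
prec = precision(hit, false_alarm)
print(rec)   # 1.0
print(prec)  # 0.9
```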

In short, the coverage describes how many of the actually defective glasses were found, and the precision describes how many of the glasses flagged as defective really are defective. Achieving both complete coverage and high precision is desirable. In practice, however, this will not always be possible, and the focus will not always be on both. The F-measure combines the two metrics so that they do not have to be checked and weighed against each other separately.

## F-measure / F1-Score

The F-measure is the harmonic mean of recall and precision and combines the two metrics into one value [3].

$$F1=2*\frac{Recall*Precision}{Recall+Precision}$$

Using the harmonic mean makes the metric more ‘sensitive’ to the smaller of the two values than the arithmetic mean would be.

For example, a recall value of \(1.0\) and a precision value of \(0.1\) result in an arithmetic mean of \(\frac{1.0+0.1}{2}=0.55\), which intuitively does not do justice to the small value of precision. The harmonic mean, however, evaluates to \(\frac{2*1.0*0.1}{1.0+0.1}=0.18\) and points to a worse model.
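The comparison of the two means can be checked with Python's standard `statistics` module. This is only a sketch for the two values used in the example; the variable names are my own.

```python
from statistics import harmonic_mean, mean

recall, precision = 1.0, 0.1

arithmetic = mean([recall, precision])
harmonic = harmonic_mean([recall, precision])
# The F1 formula is the harmonic mean written out for two values.
f1 = 2 * recall * precision / (recall + precision)

print(round(arithmetic, 2))  # 0.55
print(round(harmonic, 2))    # 0.18
print(round(f1, 2))          # 0.18
```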

The F1 score is derived from a more general variant, the F-beta score.

$$F_{\beta}=(1+\beta^2)*\frac{Precision * Recall}{(\beta^2*Precision)+Recall}$$

With the F1 score, \(\beta\) is set to 1 and thus recall and precision are weighted equally. If the application requires it, Recall and Precision can also be weighted differently. A \(\beta\) value between 0 and 1 weights the precision more strongly, a value greater than 1 weights the recall more strongly.
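The effect of \(\beta\) can be illustrated with a short sketch of the formula above. The function name `f_beta` is an assumption for this example, and the case where precision and recall are both 0 (division by zero) is not handled.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta < 1 favours precision, beta > 1 favours recall."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    beta_squared = beta ** 2
    return (1 + beta_squared) * precision * recall / (beta_squared * precision + recall)


precision, recall = 0.1, 1.0

f_half = f_beta(precision, recall, beta=0.5)
f_one = f_beta(precision, recall, beta=1.0)
f_two = f_beta(precision, recall, beta=2.0)

print(round(f_half, 3))  # 0.122
print(round(f_one, 3))   # 0.182
print(round(f_two, 3))   # 0.357
```

Because recall is the larger of the two values here, the score rises as \(\beta\) grows and recall gains weight.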

However, even with this metric an unequal distribution of the classes in the data can distort the result. In addition, the case *correct reject* is not considered in the F-measure. A method would be desirable that considers all of the above-mentioned cases of a binary classification, is robust against unbalanced class distributions, and is easy to represent.

## Confusion Matrix

The comparison of the model’s predictions and the correct values in the form of a contingency table provides a more detailed insight into the model’s performance. For each case, the number of times it occurred in the result is counted, and the number is entered into a contingency table. The rows describe the predictions and the columns show the correct values. In the binary case this table is also called Confusion Matrix [2].

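The four cases can be found in the cells of this matrix: *hit* and *correct reject* lie on the diagonal from top left to bottom right, *false alarm* and *miss* in the two remaining cells.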

In the following example, the classes are not distributed equally but are in the ratio \(7:3\).

    y_true: [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
    y_pred: [0, 1, 1, 0, 1, 1, 1, 1, 1, 1]
    hit = 6, false alarm = 2, miss = 1, correct reject = 1

The model has correctly detected six defective glasses (hit) and correctly classified one intact glass as intact (correct reject). However, two intact glasses were detected as defective (false alarm) and one defective glass was not detected (miss). The aim is to achieve the highest possible values on the diagonal of the matrix from top left to bottom right, i.e. high values only for *hit* and *correct reject*.

The matrix considers all cases, but it is not always easy to keep track of them, and the matrix alone does not give an at-a-glance statement about the performance of the model. Nevertheless, one gets an impression of how the results relate to each other. For example, the values that influence Recall and Precision can be found directly in the matrix.
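The table for the example above can be built with a few lines of Python. This sketch follows the layout described in the text (rows are predictions, columns are correct values); putting class 1 first is my assumption, and the helper name `confusion_table` is hypothetical. Note that libraries may use a different orientation: scikit-learn's `confusion_matrix`, for instance, puts the true labels in the rows.

```python
def confusion_table(y_true, y_pred):
    """2x2 table: rows = predicted class, columns = true class, class 1 first."""
    labels = (1, 0)  # assumption: the positive class (defective) comes first
    return [
        [sum(1 for t, p in zip(y_true, y_pred) if p == row and t == col) for col in labels]
        for row in labels
    ]


y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 0, 1, 1, 1, 1, 1, 1]

matrix = confusion_table(y_true, y_pred)
(hit, false_alarm), (miss, correct_reject) = matrix
print(matrix)  # [[6, 2], [1, 1]]

precision = hit / (hit + false_alarm)  # first row of the table
recall = hit / (hit + miss)            # first column of the table
print(round(precision, 2), round(recall, 2))  # 0.75 0.86
```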

## Matthews Correlation Coefficient (MCC)

A metric that summarizes all cases and is considered suitable for data sets with unbalanced class distributions is the Matthews Correlation Coefficient [4,5,6]. It can be read and calculated from the Confusion Matrix as follows:

$$MCC = \frac{hit*correct\ reject-false\ alarm*miss}{\sqrt{(hit+f.alarm)(hit+miss)(c.rej.+f.alarm)(c.rej.+miss)}}$$

The value range of the metric is between -1 and 1. The value 1 is desirable, 0 corresponds to random predictions, and negative values indicate that the model's assessment contradicts the correct values. However, the MCC is not defined if one of the four sums in the denominator is 0; some implementations report 0 in that case by convention.
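As a minimal sketch, the coefficient can be computed from the four counts using the standard definition, with hit times correct reject minus false alarm times miss in the numerator. The function name is my own, and returning 0.0 for the undefined case is an assumption that follows a common convention rather than the definition itself. The counts are those of the Confusion Matrix example above.

```python
import math


def mcc_from_counts(hit, miss, false_alarm, correct_reject):
    """Matthews Correlation Coefficient from the four cases of a binary classification."""
    numerator = hit * correct_reject - false_alarm * miss
    denominator = math.sqrt(
        (hit + false_alarm)
        * (hit + miss)
        * (correct_reject + false_alarm)
        * (correct_reject + miss)
    )
    if denominator == 0:
        # Assumption: report 0.0 where the MCC is mathematically undefined.
        return 0.0
    return numerator / denominator


mcc = mcc_from_counts(hit=6, miss=1, false_alarm=2, correct_reject=1)
print(round(mcc, 4))  # 0.2182
```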

## The crux

The values for Accuracy, Precision and Recall in the above example of the Confusion Matrix are:

    Accuracy:  0.70
    Precision: 0.75
    Recall:    0.86

If we compared this model to another one on the basis of accuracy alone, it might already be ruled out at this point.

A model with better accuracy but worse values for recall and precision might be preferred instead, although it is not necessarily the best model for this application. Here, metrics such as Recall and Precision are more interesting, since they address the requirements more precisely and thus allow the model to be assessed and compared better.

This means that relying on accuracy alone can again lead to unfavorable conclusions about the performance of a model, depending on the application. This phenomenon is commonly known as the Accuracy Paradox [7].

In addition, depending on the distribution of classes in the data, some metrics may not represent changes in the predictions or may tend in different directions. The following examples give another small impression of how different the ratings can be.

    y_true: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    y_pred: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Accuracy:    0.90
    F1 score:    0.95 (0.9474)
    Matthews CC: 0.00

With values >= 0.90, F1 and Accuracy already suggest quite a good model. However, the Matthews Correlation Coefficient clouds this picture and points to predictions that are only correct by chance, since the intact glasses are extremely underrepresented and the only one in the data was even classified wrongly.

In the following case, undamaged glasses are more often present in the data and thus allow for a better evaluation.

    y_true: [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    y_pred: [1, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    Accuracy:    0.90
    F1 score:    0.94 (0.9412)
    Matthews CC: 0.67 (0.6667)

The Matthews Correlation Coefficient now points to a better model with a value of 0.67, since one of the two intact glasses was recognized correctly. Accuracy and F1 do not react at all or only slightly to the change and come out at the same or quite similar values as in the previous example. But are the decisions still being made by chance? After all, one of the two intact glasses is classified correctly and the other one wrongly.

In the following example, two of the three intact glasses have been predicted correctly and one wrongly, so that the performance of the model can be evaluated even better. The Matthews Correlation Coefficient acknowledges this with an increase from 0.67 to 0.76, while F1 drops further from 0.94 to 0.93.

    y_true: [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
    y_pred: [1, 0, 0, 1, 1, 1, 1, 1, 1, 1]
    Accuracy:    0.90
    F1 score:    0.93 (0.9333)
    Matthews CC: 0.76 (0.7638)
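The three examples can be recomputed side by side with a short sketch. The helper `evaluate` is hypothetical, it treats class 1 as the positive class, and it reports an MCC of 0.0 where the coefficient is undefined, which is an assumed convention.

```python
import math


def evaluate(y_true, y_pred):
    """Return (accuracy, F1, MCC) rounded to four decimals."""
    pairs = list(zip(y_true, y_pred))
    hit = sum(t == 1 and p == 1 for t, p in pairs)
    miss = sum(t == 1 and p == 0 for t, p in pairs)
    false_alarm = sum(t == 0 and p == 1 for t, p in pairs)
    correct_reject = sum(t == 0 and p == 0 for t, p in pairs)

    accuracy = (hit + correct_reject) / len(pairs)
    # Equivalent to the harmonic mean of recall and precision.
    f1 = 2 * hit / (2 * hit + false_alarm + miss)
    denominator = math.sqrt(
        (hit + false_alarm) * (hit + miss)
        * (correct_reject + false_alarm) * (correct_reject + miss)
    )
    # Assumption: report 0.0 where the MCC is undefined (denominator of 0).
    mcc = (hit * correct_reject - false_alarm * miss) / denominator if denominator else 0.0
    return round(accuracy, 4), round(f1, 4), round(mcc, 4)


first = evaluate([0, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
second = evaluate([0, 0, 1, 1, 1, 1, 1, 1, 1, 1], [1, 0, 1, 1, 1, 1, 1, 1, 1, 1])
third = evaluate([0, 0, 0, 1, 1, 1, 1, 1, 1, 1], [1, 0, 0, 1, 1, 1, 1, 1, 1, 1])

print(first)   # (0.9, 0.9474, 0.0)
print(second)  # (0.9, 0.9412, 0.6667)
print(third)   # (0.9, 0.9333, 0.7638)
```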

This means that it is advisable to consult several metrics when comparing models, especially with unbalanced class distributions, and to determine in advance which metrics are relevant for the application.

However, what is the minimum score that we can be satisfied with for this model? This, of course, depends on the use case or the requirements and is often difficult to assess if there are no reference values for comparison, such as the reliability of a human decision maker.

Further graphical presentation of the performance of a model can be instructive and help achieve a good trade-off.

## Trade-Off between hit and false alarm rate

While the glass manufacturer’s sales department strives to achieve the highest possible throughput in production, the quality and legal department will be more eager to minimize the number of defective glasses going on sale. This means that the sales department wants to avoid waste consisting of glasses unnecessarily classified as defective (false alarm). The quality assurance department, on the other hand, insists on recognising all defective glasses (hits) – even if this means that a few intact glasses cannot be sold.

## Receiver Operating Characteristic

The Receiver Operating Characteristic curve (ROC curve) [8] compares the proportion of objects correctly classified as positive, i.e. the hit rate, with the proportion of objects falsely classified as positive, i.e. the false alarm rate, in a diagram. The false alarm rate describes the ratio of the number of glasses falsely detected as defective to the total number of glasses that are actually intact.

$$False\ Alarm\ Rate = \frac{false\ alarm}{false\ alarm + correct\ reject}$$

This means that the proportion of glasses correctly identified as defective is compared with the proportion of glasses falsely identified as defective.
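Both rates can be sketched as small Python functions and applied to the counts from the Confusion Matrix example (hit = 6, false alarm = 2, miss = 1, correct reject = 1). The function names are illustrative assumptions.

```python
def hit_rate(hit, miss):
    """Share of actually defective glasses that were detected (recall)."""
    return hit / (hit + miss)


def false_alarm_rate(false_alarm, correct_reject):
    """Share of actually intact glasses that were flagged as defective."""
    return false_alarm / (false_alarm + correct_reject)


hr = hit_rate(hit=6, miss=1)
far = false_alarm_rate(false_alarm=2, correct_reject=1)

print(round(hr, 2))   # 0.86
print(round(far, 2))  # 0.67
```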

In addition, a threshold value has to be optimized. Depending on the method used, a model does not simply output the values 0 and 1 as its classification. Rather, its outputs lie between 0 and 1 and describe a probability of belonging to a class, which then has to be interpreted: from which value onwards do you want to accept that a class has been recognized?

The hit rate is plotted on the vertical axis and the false alarm rate on the horizontal axis, with one point per threshold value. The diagonal from bottom left to top right marks the chance level. Curves that rise far above this diagonal and into the upper left corner therefore represent a good evaluation: there, the hit rate is as high as possible while the false alarm rate is very low. The result is the typical curve (black in the diagram), on which you can decide which trade-off to accept, or, if several models are plotted, which model to use with which threshold value, or whether to continue training.

With the model in the following example (orange curve in the diagram), a hit rate of 0.86 could be achieved if a false alarm rate of 0.33 is accepted and a threshold value of 0.3 is applied. The thresholds themselves are not shown in the curve, but are listed in the following table.

    y_true:           [0    0    0    1    1    1    1    1    1    1   ]
    y_pred:           [0.22 0.24 0.40 0.23 0.30 0.42 0.70 0.50 0.60 0.80]

    Hit rate:         [0.00 0.14 0.71 0.71 0.86 0.86 1.00 1.00]
    False alarm rate: [0.00 0.00 0.00 0.33 0.33 0.67 0.67 1.00]
    Thresholds:       [1.80 0.80 0.42 0.40 0.30 0.24 0.23 0.22]

If you insist on a false alarm rate of 0, a hit rate of 0.71 is possible. For this purpose, the threshold value must be set to 0.42; a higher threshold keeps the false alarm rate at 0 but lowers the hit rate. In practice, of course, there are more data records available and correspondingly much finer gradations from which a threshold value can be selected. In addition, the area under the curve (*Area Under Curve*, AUC) can be interpreted as a value between 0 and 1. Here, 1 once again represents the best value and 0.5 corresponds to chance, which in this case is also the worst value [9]. The AUC values for the curves are listed in the key of the diagram.
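The threshold sweep behind such a curve can be sketched in plain Python. This is a simplified illustration and not the routine that produced the table above: the helper names `roc_points` and `auc` are my own, a glass counts as defective when its score is greater than or equal to the threshold, every distinct score is used as a threshold, and the area is approximated with the trapezoidal rule.

```python
def roc_points(y_true, scores):
    """Return (threshold, false_alarm_rate, hit_rate) for each distinct score."""
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        hit = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        false_alarm = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((threshold, false_alarm / negatives, hit / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule, starting at (0, 0)."""
    area, prev_x, prev_y = 0.0, 0.0, 0.0
    for _, x, y in points:
        area += (x - prev_x) * (y + prev_y) / 2
        prev_x, prev_y = x, y
    return area


y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
scores = [0.22, 0.24, 0.40, 0.23, 0.30, 0.42, 0.70, 0.50, 0.60, 0.80]

points = roc_points(y_true, scores)
for threshold, far, hr in points:
    print(f"threshold {threshold:.2f}: false alarm rate {far:.2f}, hit rate {hr:.2f}")

area = auc(points)
print(round(area, 3))  # 0.857
```

Because every distinct score is used here, the sketch yields a few more points than the table, which omits thresholds that do not change the shape of the curve.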

## Evaluating machine learning models – Conclusion

In general, you can try to train a model that achieves the best possible values for all metrics. In practice, however, the benefit of the last per mille of improvement often does not justify the effort required to achieve it. A pragmatic, application-oriented approach is more likely to lead to success and provides the necessary leeway to achieve a meaningful trade-off. After all, the test results and the resulting metric values only allow the models to be compared with each other. How good a model really is will only be shown in practice. Moreover, none of these metrics explains how the model came to its decision, although in some areas such a justification may well be necessary [10].

But how much test data is actually necessary for a meaningful evaluation?

Read in *Evaluating machine learning models: The issue with test data sets* what influence the size of the test data set can have on the comparability of models.

## References

[1] Wikipedia, Evaluation of a binary classifier

[2] Wikipedia, Confusion Matrix

[3] Yutaka Sasaki, 2007, The truth of the F-measure https://www.toyota-ti.ac.jp/Lab/Denshi/COIN/people/yutaka.sasaki/F-measure-YS-26Oct07.pdf

[4] Boughorbel S, Jarray F, El-Anbari M (2017) Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLoS ONE 12(6): e0177678. doi.org/10.1371/journal.pone.0177678

[5] Davide Chicco, 2017, Ten quick tips for machine learning in computational biology. doi:10.1186/s13040-017-0155-3

[6] Wikipedia, Matthews correlation coefficient

[7] Wikipedia, Accuracy paradox

[8] Tom Fawcett, 2005, An introduction to ROC analysis, doi.org/10.1016/j.patrec.2005.10.010

[9] Wikipedia, Receiver Operating Characteristic

[10] Shirin Elsinghorst, Explanability of Machine Learning Methods, The Softwerker Vol. 13