# Evaluating machine learning models: The issue with test data sets

04/21/20

Machine learning technologies can be used successfully and practically in a corporate environment. A concrete, manageable use case and thus a focused application of machine learning models can generate real added value. This added value naturally depends on the use case and the performance of the trained models. Therefore, it is important to clarify how well a model can actually support the respective challenge. In this article I would like to explain how the measured performance can be interpreted, especially depending on how much test data is, or should be, available.

## Test data scope

With a held-out, representative test set and suitable metrics, scores can be calculated and models can be evaluated and compared. The test set is not used in training. Validation sets for optimizing the model are drawn only from the remaining data.

Once meaningful metrics have been found for the respective use case and target values to be achieved have been defined, the question arises as to what extent the values achieved can actually be trusted. After all, these values can only be based on a reduced amount of example data. How much test data is necessary for a meaningful evaluation depends on the score to be achieved and the desired confidence in the evaluation.

However, collecting and, in the case of supervised learning, labeling the data often requires manual steps and may be a cost factor that should not be underestimated. A good trade-off must be found between confidence in the assessment and the expected costs of collecting and preparing the test data.

## Use case

For further explanation, I will use the example from the article "Evaluating machine learning models: How to tackle metrics":

“*A manufacturer of drinking glasses wants to identify and sort out defective glasses in his production. A model for image classification is to be trained and used for this purpose. The database consists of images of intact and defective glasses.*” [1]

The number of pictures of defective glasses is very limited here, so that, with some effort, pictures of about 500 intact and 500 defective glasses will be available for training and testing of the models only after some time – especially because defective glasses occur rather rarely in production. From these 1000 pictures a representative test set is then retained before training.

How much test data is required for a meaningful evaluation of a machine learning model? Or what does “meaningful” mean in this context? Is 10% to 20% of the available data, in this case 100 to 200 images, sufficient?

In this example, the metric *accuracy* is selected, and after a little training and optimization the model achieves 80% correct classifications on a basis of 100 test images.

To estimate how trustworthy this result actually is, you can use standard tools of statistics.

## Confidence interval

Whether an image has been classified correctly by the model can be seen as an experiment with the two possible results *success* or *failure*. Testing a model is thus a series of similar, independent experiments, so that the binomial distribution or its approximation by the normal distribution [2] can be used to estimate the result.

The extent to which one can now “trust” the determined value can be shown with the help of a confidence interval.

The benefit of a confidence interval is that it quantifies the uncertainty of an estimate obtained from a sample, for example a test run on 100 images. The score is an estimate because the test data represent only a small part of the possible data (the population): the model has been tested on a small sample and not on all data that may ever occur.

“*The confidence interval indicates the range which, with infinite repetition of a random experiment, includes the true location of the parameter with a certain probability (the confidence level).*” [3]

The interval is given by a lower and an upper limit. Its interpretation assumes that the test run is repeated many times on different, independent test data sets of the same size. At the 95% confidence level, for example, on average 95% of these imaginary test runs yield limits that include the true score of the model.
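The idea of repeated test runs can be made tangible with a small simulation. The following is a minimal sketch, assuming a hypothetical model whose true accuracy is known to be 80% and using the normal-approximation interval that is introduced in the next section. The function names `wald_interval` and `coverage` are chosen here for illustration only.

```python
import math
import random


def wald_interval(successes, n, z=1.96):
    """Normal-approximation interval for a proportion (z=1.96 for 95%)."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


def coverage(true_accuracy, n, runs, seed=0):
    """Fraction of simulated test runs whose interval contains the true accuracy."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        # One test run: n independent classifications, each correct
        # with probability true_accuracy (an assumption of this sketch).
        correct = sum(rng.random() < true_accuracy for _ in range(n))
        low, high = wald_interval(correct, n)
        hits += low <= true_accuracy <= high
    return hits / runs


print(f"coverage: {coverage(0.80, n=100, runs=2000):.3f}")
```

The printed fraction should land near the nominal 0.95. With small test sets it can fall somewhat short of that, because the normal distribution only approximates the binomial distribution.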

## Calculating intervals

The limit values can be calculated as follows, for example [4]:

$$i = p \pm z \cdot \sqrt{\frac{p \cdot (1-p)}{n}}$$

Here \(p\) is the measured score expressed as a proportion (for example 0.8 for 80%), \(n\) is the number of test samples, and \(z\) is a constant that can be read from the standard normal distribution table for the desired “confidence” (confidence level). Common values are, for example:

| level | 80% | 90% | 95% | 98% | 99% |
|---|---|---|---|---|---|
| \(z\) | 1.28 | 1.64 | 1.96 | 2.33 | 2.58 |
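Instead of reading \(z\) from a table, it can also be computed. A minimal sketch using `statistics.NormalDist` from the Python standard library follows. The helper name `z_value` is an assumption of this example, and the interval is taken to be two-sided, so the remaining probability is split evenly between both tails.

```python
from statistics import NormalDist


def z_value(level):
    """Two-sided z value for a confidence level given as a fraction, e.g. 0.95."""
    # A two-sided interval leaves (1 - level) / 2 in each tail.
    return NormalDist().inv_cdf((1 + level) / 2)


for level in (0.80, 0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%}: z = {z_value(level):.2f}")
```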

At the 95% confidence level, with 100 test images and a measured score of 80%, the interval is 72% to 88%. This range seems to be quite large and is probably not accurate enough for some applications.
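The calculation for this example can be reproduced in a few lines. This is a sketch of the normal-approximation formula only. The function name `accuracy_interval` and the clipping to the range 0 to 1 are choices made for this illustration.

```python
import math


def accuracy_interval(score, n, z=1.96):
    """Normal-approximation confidence interval for a score measured on n samples."""
    half_width = z * math.sqrt(score * (1 - score) / n)
    # Clip to [0, 1], since a proportion cannot leave that range.
    return max(0.0, score - half_width), min(1.0, score + half_width)


low, high = accuracy_interval(0.80, 100)
print(f"{low:.1%} to {high:.1%}")  # 72.2% to 87.8%
```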

## The crux

But even when doubling the test data to 200 images, the resulting interval of 74% to 86% is not much smaller. The following diagram shows a few more examples for accuracy scores of 80%, 90%, 95% and 99% at the 95% confidence level and for test data sizes of 100, 200, 1000 and 10000. On a basis of 10000 test samples, the range is about ±1% and could be acceptable for a score of 80%.
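The same combinations of score and test set size can also be tabulated. The sketch below prints the half-width of the 95% interval for each combination. The helper name `half_width` and the output layout are assumptions of this example.

```python
import math


def half_width(score, n, z=1.96):
    """Half-width of the normal-approximation interval."""
    return z * math.sqrt(score * (1 - score) / n)


scores = (0.80, 0.90, 0.95, 0.99)
sizes = (100, 200, 1000, 10000)

labels = [f"n={n}" for n in sizes]
print("score" + "".join(f"{label:>10}" for label in labels))
for score in scores:
    row = "".join(f"{half_width(score, n):>10.1%}" for n in sizes)
    print(f"{score:>5.0%}{row}")
```

For very high scores on small test sets, such as 99% on 100 images, the normal approximation becomes unreliable, so those cells should be read as rough indications only.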

However, for an accuracy of 85% determined on the basis of 100 test images, the interval would be 78% to 92%. It therefore also covers a score of 80%. This suggests that it may be possible to train with less data and assign more data to the test set instead. After all, if a poorer score is obtained, for example by training on less data, its confidence interval may still include the originally better score.

Furthermore, chasing the last per mil of improvement, measured on a small test set, may not be a worthwhile goal. It may even happen that, after the test data is increased, a model that did not look as good before performs better than the model that was originally preferred because of an insignificantly higher score.

This means that a statement about the performance of the model and differentiation from other models on the basis of a manageable number of test data is only possible to a limited extent.
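How the test set size affects the ability to tell two models apart can be sketched as follows. The two scores of 85% and 80% are the ones used above. The models themselves and the function names are hypothetical, and checking whether intervals overlap is only a rough heuristic, not a formal significance test.

```python
import math


def accuracy_interval(score, n, z=1.96):
    """Normal-approximation confidence interval for a score measured on n samples."""
    half_width = z * math.sqrt(score * (1 - score) / n)
    return score - half_width, score + half_width


def intervals_overlap(a, b):
    """True if the two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


for n in (100, 1000, 10000):
    model_a = accuracy_interval(0.85, n)  # hypothetical model A
    model_b = accuracy_interval(0.80, n)  # hypothetical model B
    print(f"n={n}: intervals overlap = {intervals_overlap(model_a, model_b)}")
```

With 100 test images the two intervals overlap, while with 1000 or more they separate.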

In general, the larger the sample from which the estimate was drawn, the more precise the estimate and the narrower the confidence interval.
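The interval formula can also be turned around to estimate how many test samples are needed for a desired precision. Solving the half-width \(z \cdot \sqrt{p \cdot (1-p)/n}\) for \(n\) with a target margin \(e\) gives \(n = z^2 \cdot p \cdot (1-p) / e^2\). The sketch below assumes an expected score of 80% and the 95% confidence level. The name `required_test_size` and the chosen margins are illustrative.

```python
import math


def required_test_size(expected_score, margin, z=1.96):
    """Smallest n for which the interval half-width is at most `margin`."""
    n = z ** 2 * expected_score * (1 - expected_score) / margin ** 2
    return math.ceil(n)


for margin in (0.08, 0.05, 0.02, 0.01):
    n = required_test_size(0.80, margin)
    print(f"margin of {margin:.0%}: at least {n} test images")
```

Because the margin enters squared, halving the interval width requires roughly four times as much test data.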

## Conclusion

Ultimately, the evaluation of a model should be done with a sense of proportion and the size of a test set should be included in the evaluation. Particularly with results that do not differ significantly, the selection of a model based solely on these evaluations may not always be promising. A field test in practice, for example by A/B testing, with several models that cannot be clearly distinguished, can support a decision.

## References:

[1] codecentric blog, Evaluating machine learning models: How to tackle metrics

[2] Wikipedia, De Moivre–Laplace theorem

[3] Wikipedia, Confidence interval (German version)

[4] Wikipedia, Binomial proportion confidence interval