The Machinery behind Machine Learning – A Benchmark for Linear Regression

14.1.2016 | 10 minutes of reading time

This is the third post in the “Machinery behind Machine Learning” series and after all the “academic” discussions it is time to show meaningful results for some of the most prominent Machine Learning algorithms – there have been quite a few requests from interested readers in this direction. Today we are going to present results for Linear Regression as prototype for a Regression method, follow-up posts will cover Logistic Regression as prototype for a Classification method and a Collaborative Filtering / Matrix Factorization algorithm as prototype for a Recommender System. It is worth noting that these prototypes for Regression, Classification and Recommendation are relying on the same underlying Optimization framework – despite their rather different interpretation or usage in the field of Machine Learning.

To be a bit more precise: In all three applications the training part is driven by an algorithm for Unconstrained Optimization, in our case Gradient Descent. As we want to emphasize the tuning options and measure the impact of the stepsize in particular we provide a comparison of the standard Armijo rule, the Armijo rule with widening and the exact stepsize. In later articles – after we learned something about Conjugate Gradient and maybe some other advanced methods – we are going to repeat this benchmark, but for now let’s focus on the performance of Gradient Descent.

Linear Regression

Linear Regression is a conceptually simple approach for modeling a relationship between a set of numerical features – represented by the independent variables \(x_1,…,x_n\) – and a given numerical variable \(y\), the dependent variable. When we assume that we have \(m\) different data points or vectors \(x^{(j)}\) and values or real numbers \(y_j\), the model takes the following form:

\(y_j \approx c_0 + c_1 x^{(j)}_1 + … + c_n x^{(j)}_n\)

with \(c_i,\ i=0,..,n\), being some real-valued parameters. Using the matrix-vector notation from Linear Algebra we derive a more compact formulation. We put the parameter values \(c_i\) in the parameter vector \(c\) of length \(n+1\), collect the row vectors \(x^{(j)}:=[1,x^{(j)}_1,…,x^{(j)}_n]\) into a \(m\times (n+1)\)-matrix \(X\) – the leading \(1\) in each vector belongs to the coefficient \(c_0\) – and end up with something close to a linear system of equations:

\(Xc\approx y\).

The interpretation is that we want to find a parameter vector c that satisfies the linear system of equation as good as possible, thus we are looking for the best approximate solution because an exact solution does not exist in general if \(m>n+1\), which is assumed to be the case here.

The objective function for Linear Regression

One mathematical translation of as good as possible is to minimize the residual or error measured by the (squared) Euclidean norm:

\(\min_c f(c):=\|Xc – y\|^2\).

The Euclidean norm is the usual notion of distance so nothing spectacular here. We square the expression to get rid of the square root that hides within the norm, it’s simpler and better from a computational point of view. Of course it does influence the concrete value of the error but does not change the solution or optimal parameter vector \(c^*\) that we are looking for.
Now we have arrived at an unconstrained minimization problem, the process of minimizing the error theoretically involves all possible values for the parameters \(c_i\), there are no restrictions. It’s time to show what we have learned so far. First, let’s unfold the compact expression to see what exactly is measured by the objective function \(f\) defined above:

\(f(c)=\|Xc – y\|^2 = \sum_j \left( c^Tx^{(j)}-y_j \right)^2\).

In the implementation we have included an optional scaling factor of \(1/(2m)\), that normalizes the value of the objective function with respect to the number of data points. It’s presence or absence does not change the concrete solution, it’s an implementational detail that we have omitted here. The important ingredients are the summands \(\left( c^Tx^{(j)}-y_j \right)^2\) that quantify the pointwise deviation of the model from the input data.

Visualizing Linear Regression

Effectively we sum up the squared prediction errors for every data point, as is illustrated in the following plot. This example has been generated using the mtcars dataset that comes with R. The point-wise (squared) error is the (squared) length of the vertical line between the true data point (black) and the predicted point (blue) on the regression line.

The variables for the optimization algorithm are the coefficients \(c_i,\ i=0,…,n\), and the objective function \(f(c)\) is nothing but a polynomial of degree \(2\) in these variables. In fact there is a structural similarity to some of the simple test functions from part 2 of this blog series, but now let us look at the concrete test case. As usual you can find all information and the R code in the github repository , the code is self-contained.

The algorithmic setup

As we want to check the scaling behaviour in higher dimensions I have decided to create artificial data for this test. The setup is as follows:

The first column of the \(m\times (n+1)\)-matrix \(X\) contains only \(1\)s, the remaining \(n\) columns contain random numbers uniformly distributed in the unit interval \([0,1]\).
We define an auxiliary vector \(\hat{c}\) of length \(n+1\) by\(\hat{c} := [1,2,3,…,n+1]\).
We define a random vector \(z\) of length \(m\) – the number of data points – containing random numbers uniformly distributed in the unit interval, scaled to unit length. \(z\) is used as noise generator.
We define a weight \(\varepsilon\) – a real number that allows us to scale the noise vector \(z\).
Finally, we define the vector \(y\) by\(y:= X\hat{c} + \varepsilon z\).

The vector \(\hat{c}\) is by construction an approximate solution of \(y\approx Xc\) as long as the noise \(\varepsilon z\) is small. This setup might look somewhat complicated but it gives us a nice parametrization of the interesting things. The main parameters are

The number of data points: \(m\).
The number of parameters of the linear model: \(n+1\).
The amount of noise, i.e. a trivial upper bound of the expected value of the objective function: \(\varepsilon\).

The last parameter allows for a quick sanity check, as \(\varepsilon^2\) always is a trivial upper bound because we already know the possible solution \(\hat{c}\) with the property
\(f(\hat{c})=\varepsilon^2\).

Furthermore it makes the scenario somewhat realistic. Typically you want to reconstruct the unknown solution – modeled by \(\hat{c}\) – but what you get from any algorithm almost always is a perturbed solution \(c^*\) that still contains some of the noise. Using this parametrization you can get a feeling for how much of the added noise actually is present in the solution, which might lead to a deeper understanding of Linear Regression.

Benchmarking Linear Regression

We compare the performance of the standard Armijo rule, the Armijo rule with widening, the exact stepsize for several choices of \(m,n\) and \(\varepsilon\). We also include a single comparison for some choices of a fixed stepsize at the end, that indicate what you can expect from this choice and where it can fail. In all cases, the algorithm terminates if the norm of the gradient is below \(1e-6\) or if the number of iterations exceeds \(100,000\).

Impact of data complexity

The first test is about the simplest possible model, one-dimensional Linear Regression, i.e. the number of model parameters is \(2\). Don’t be confused by this, there is always one parameter for the “offset” \(c_0\), such that the \(n\)-dimensional model depends on \(n+1\) parameters. The optimization algorithm is initialized with the parameter vector \(c=(c_0,c_1)=(0,0)\). We vary the number of data points from \(5\) to \(10,000\) in order to experience the scaling with respect to the amount of data.

That’s more or less the expected behaviour. The number \(m\) of data points can grow but the underlying optimization problem still has the same dimension \(n+1=2\), thus the number of iterations remains roughly constant – given that there are sufficiently many data points – and the runtime increases linearly with \(m\). It is interesting that the exact stepsize needs so few iterations making it almost as fast as the standard Armijo rule, this indicates the simple structure of the problem. Nevertheless, the Armijo rule with widening is the best choice here.

Impact of model complexity

The second test varies the number of model parameters from \(2\) to \(16\), the number of data points is fixed at \(1,000\). Here, we expect the scaling behaviour to be somewhat different.

Still, the exact stepsize produces the smallest number of iterations but the difference is much smaller for more complex models which indicates that inexact stepsizes are a good choice here. On the other hand, the price for this slight advantage is prohibitively high. We can also see that not only the runtime per iteration, but also the number of iterations is increasing with the model complexity, this is something to be considered when choosing a model for a real-world problem.

Fixed stepsize

As promised we provide some results for fixed stepsizes as well. Conceptually, it does not make sense to use a fixed stepsize at all unless you do know in advance a good value that definitely works – which you typically don’t. In reality, you have to test several values in order to find something that allows the algorithm to converge in a reasonable time – which is nothing but a stepsize rule that requires a full algorithmic run in order to perform an update. But let’s look at the numbers.

Only for the one-dimensional – and almost trivial – Linear Regression problem there is a value of the stepsize for which the performance is competitive, but this choice is way too risky for non-trivial settings. Even for slightly more complex Linear regression problems and a fixed stepsize of \(1e-1\) the performance is worse by a factor of \(10\). And for more complicated objective functions, e.g. the Rosenbrock function that has been introduced in the last blog post, the value would have to be smaller than \(1e-6\) implying that the number of iterations and the runtime would explode.

Summary

The results for Linear Regression already indicate that it is indeed useful to keep an eye on the underlying machinery of Optimization methods. The Armijo rule with widening shows the best performance, but the exact stepsize leads to the smallest number of iterations. These two findings imply that the direction of steepest descent is a good or at least reasonable choice for Linear Regression. The reasoning behind this is twofold. Remember that one argument for inexact stepsizes was that they can help avoiding the risk of overfitting that the exact stepsize cannot avoid in case the search direction is bad. Overfitting – which can be interpreted as significant deviation between the “local” search direction and the “global” optimal direction pointing directly to the solution – should lead to a higher number of iterations or at least not to less iterations. So in reverse, if the number of iterations is smaller, there can be no overfitting and the search direction should be a reasonable approximation of the unknown optimal direction.
The second argument – which directly applies to widening but also to the exact stepsize – considers the step length. Widening leads to larger steps which again only make sense if the local search direction is a reasonably good approximation of the global optimal direction in a larger environment of the current iterate. As a general rule: Larger steps imply that the model is a “good fit” in a larger environment of the current point. The exact stepsize also did produce larger steps which we did not mention here but you can check it on your own.
Feel free to take a look at the code in the github repository and give it a try yourself, you can even apply it to your own data. For other choices of the parameters the numbers can look differnt, sometimes the widening idea has no effect and sometimes it clearly outperforms the exact stepsize in terms of number of iterations, but my goal was to show you the “average” case as this is what mostly matters in practice.
The results for Logistic Regression are coming soon and here the impact on the runtime and the overall performance will be even more visible. Thanks for reading!

The Machinery behind Machine Learning – Part 2
The Machinery behind Machine Learning – Part 1

Was this post helpful?

Likes

Blog author

Stefan Kühn

Do you still have questions? Just send me a message.

fromStefan Kühn

data2day 2015 – Ein Konferenzbericht

Auch dieses Jahr war wieder ein Team der codecentric AG auf der data2day. Die Konferenz hat sich sehr positiv weiterentwickelt und hat sich in unseren Augen als fester Bestandteil der deutschen Data-, Big-Data- und Data-Science-Konferenzlandschaft etabliert...

14.10.2015 | 3 Minuten Lesezeit

Stefan Kühn

The Machinery behind Machine Learning – Part 2

Machine Learning is one of the hottest topic on the web. But how do machines learn? And do they learn at all? The answer in most cases is: Machines do not learn, they optimize. This is part 2 of the series, and after all the preparation it definitely...

Machine Learning

20.4.2015 | 17 Minuten Lesezeit

Stefan Kühn

The Machinery behind Machine Learning – Part 1

Machine Learning is one of the hottest topics on the web. But how do machines learn? And do they learn at all? The answer in most cases is: Machines do not learn, they optimize. This is the first blog post in a small series about the machinery that ...

Machine Learning

19.3.2015 | 15 Minuten Lesezeit

Stefan Kühn

Your job at codecentric?

Jobs

Agile Developer und Consultant (w/d/m)

Alle Standorte

Eine Einführung in Federated Learning im industriellen Kontext: Fortgeschritten

Im Bereich des maschinellen Lernens wurde eine lange Zeit angenommen, dass die Eingabedaten von Modellen und Gewichten sicher sei und nicht extrahiert werden könnten. In den letzten Jahren veröffentlichte Forschung hat diese Annahme in Frage gestellt...

Machine Learning
Big Data
Data Science
Data

18.9.2023 | 8 Minuten Lesezeit

Ihsan Kisi

Eine Einführung in Federated Learning im industriellen Kontext: Grundlagen

Mithilfe von Daten können Unternehmen fundiertere Entscheidungen treffen, ihre Arbeitsabläufe optimieren und mit der Kraft des maschinellen Lernens (ML) einen Vorteil in der wettbewerbsintensiven Geschäftswelt erlangen. Allerdings ist der Umgang mit ...

Machine Learning
Data Science
Data
Big Data

25.8.2023 | 7 Minuten Lesezeit

Ihsan Kisi

Große Sprachmodelle: Was ist ein LLM?

Große Sprachmodelle (Large Language Models oder LLM) haben in den letzten Jahren enorme Fortschritte gemacht und spielen eine entscheidende Rolle in verschiedenen Anwendungen. Aber was ist ein LLM? Es ist sinnvoll zu erklären, was ein „einfaches“ Sprachmodell...

Machine Learning

20.6.2023 | 4 Minuten Lesezeit

Elvira Siegel

Smart DistancR – Perspektivisch korrekte Distanzmessung zwischen Personen

Die Corona-Krise ist weiterhin in aller Munde und wird uns mit hoher Wahrscheinlichkeit noch etwas länger begleiten. Wie man aus unterschiedlichen Statistiken erfährt, schwanken die Fallzahlen weiter und sorgen für zusätzliche Restriktionen. Diese werden...

Computer Vision
Künstliche Intelligenz
IoT
Machine Learning

13.12.2021 | 7 Minuten Lesezeit

Michel Ehmen

Machine-Learning-Modelle bewerten – Quality Gates etablieren

Die Qualität bzw. Nützlichkeit von Machine-Learning-Modellen lässt sich mit Hilfe von Testdaten und Metriken bewerten. Allerdings in welchem Umfang? Manuell, automatisiert, einmalig, regelmäßig? Manuell lassen sich die ersten Modelle als Ergebnis eines...

Data
Machine Learning
Softwareentwicklung
CI/CD

7.12.2021 | 7 Minuten Lesezeit

Berthold Schulte

Kürzere Time-to-Market für ML-Modelle durch Googles BigQuery ML

Machine Learning (ML) erzeugt erst dann realen Mehrwert, wenn es in Produktion benutzt wird. Allerdings kann die Zeitspanne zwischen der Entwicklung eines belastbaren Modells und dessen Einsatz frustrierend lange sein. Insbesondere in schnelllebigen ...

Agile Methoden
Cloud
Machine Learning

26.7.2021 | 5 Minuten Lesezeit

Timo Böhm

Niklas Haas

Schnelles Training eines Recommendation-Modells durch BigQuery ML

Machine Learning (ML) kann nur durch Modelle in der Produktion Business Value erzeugen. Allerdings kann die Zeitspanne zwischen der Entwicklung der nächsten Iteration eines Modells und dessen Einsatz in einer Produktionsumgebung massiv sein. Dies gilt...

Accelerate
Cloud
Data
Google Cloud
Machine Learning

26.7.2021 | 11 Minuten Lesezeit

Niklas Haas

Timo Böhm

KI, Daten und Infrastruktur – ML-Systeme schnell Ende-zu-Ende verproben...

Heutzutage steht fast alles, was mit den Labels „künstliche Intelligenz (KI)“ oder „Machine Learning (ML)“ versehen ist, für Fortschritt. Seltsamerweise schließt diese Assoziation jedoch häufig die Themen Daten und Dateninfrastruktur nicht ausreichend...

Kultur
Data
Machine Learning

21.6.2021 | 12 Minuten Lesezeit

Marcel Mikl

Schnelles KI-Prototyping mit Google Cloud AutoML Vision

Bei klassischen Machine-Learning-(ML-)Projekten beschäftigen sich Data Scientists häufig längere Zeit (mehrere Monate) mit der Entwicklung eines ML-Modells. Dabei werden hohe Kosten verursacht und die Zeit, bis ein erstes Modell zur Verfügung steht, ...

Cloud
Computer Vision
Data
Künstliche Intelligenz
Google Cloud
Machine Learning

17.5.2021 | 5 Minuten Lesezeit

Nils Bauroth

Sven Rediske

KI in der Praxis: Fehlerhafte Bauteile mit Rekognition auf AWS identifizieren

Noch vor kurzer Zeit mussten für den Einsatz von künstlicher Intelligenz (KI) unter großem Aufwand eigene KI-Modelle erstellt werden. Heute ist für viele Anwendungsfälle die Einstiegshürde in die Welt der KI durch Cloud-Computing-Dienste stark gesunken...

Cloud
Computer Vision
Data
Künstliche Intelligenz
Machine Learning
Python

29.7.2020 | 11 Minuten Lesezeit

Marcel Mikl

Nico Axtmann

KI in der Praxis: Fehlerhafte Bauteile mit AutoML in der Google Cloud ...

Noch vor kurzer Zeit war der Einsatz von künstlicher Intelligenz (KI) nur mit großem Aufwand und Konstruktion eigener neuronaler Netze möglich. Heute ist die Einstiegshürde in die Welt der KI durch Cloud-Computing-Dienste stark gesunken. So kann man ...

Cloud
Computer Vision
Data
Python
Machine Learning
Google Cloud
Künstliche Intelligenz

8.7.2020 | 11 Minuten Lesezeit

Nico Axtmann

Marcel Mikl

KI für KMU: (Teil-)Automatisierung der Qualitätskontrolle von Bauteilen

Noch vor kurzer Zeit war der Einsatz von künstlicher Intelligenz (KI) nur mit großem Aufwand und ausreichend Spezialwissen möglich. Hauptsächlich große Internet-Konzerne wie Google, Apple und Facebook hatten das Geld, die Daten und die Expertise, um ...

Data
Machine Learning
Künstliche Intelligenz

6.7.2020 | 7 Minuten Lesezeit

Marcel Mikl

Nico Axtmann

BIE Spotty – unsere Lösung beim BIE City Hackathon

Typischerweise sind bei Hackathons viele Soft- und Hardware-Entwickler zu finden, die innerhalb eines begrenzten Zeitraums versuchen, kreative und ungewöhnliche Lösungen in Form von Code und ersten Prototypen für vorher definierte Challenges zu erarbeiten...

IoT
Computer Vision
IT-Security
Machine Learning

2.7.2020 | 5 Minuten Lesezeit

Meike Wocken

Machine Learning in der Praxis. Eine Mate mit … Matthias Niehoff #EineMateMit

Machine Learning und künstliche Intelligenz sind aktuell in aller Munde und versprechen vielfältige Einsatzmöglichkeiten im Unternehmen. Trotzdem tun sich viele Unternehmen aktuell noch schwer, das Potential der Technologie zu nutzen. „Der Fokus liegt...

Künstliche Intelligenz
Data
Community
Machine Learning

27.5.2020 | 1 Minuten Lesezeit

Matthias Niehoff

Wie man Data-Science-Projekte nicht in die PoC-Sackgasse manövriert

Warum gelingt es Data-Science-Initiativen häufig nicht, einen echten Mehrwert zu schaffen? Wir haben einige Ursachen dafür ausgemacht. In diesem Blogpost stellen wir vier typische Fallen für Data-Science-Projekte vor und geben Tipps, wie Du sie umschiffen...

Machine Learning
Data
Künstliche Intelligenz
Softwareentwicklung

27.3.2020 | 11 Minuten Lesezeit

Marcel Mikl

Machine-Learning-Modelle bewerten – die Crux mit den Testdaten

Machine-Learning-Technologien lassen sich erfolgreich und praxisnah im Unternehmensumfeld einsetzen. Ein konkreter, überschaubarer Anwendungsfall und somit fokussierter Einsatz von Machine-Learning-Modellen kann dabei echten Mehrwert generieren. Dieser...

Data
Machine Learning
Data Science

25.3.2020 | 5 Minuten Lesezeit

Berthold Schulte

Deployment von Machine-Learning-Modellen mit Seldon Core

In diesem Artikel sehen wir uns an, wie wir Machine-Learning- und Deep-Learning-Modelle mit Seldon Core deployen können. Seldon Core ist eine Open-Source-Plattform, um Modelle auf einem Kubernetes-Cluster in Betrieb zu nehmen. Bevor wir uns Seldon Core...

Softwarearchitektur
Data
Künstliche Intelligenz
Machine Learning

9.9.2019 | 7 Minuten Lesezeit

Nico Axtmann

Data Science in der Praxis: Häufige Fehler und Vorgehen

In diesem Artikel gehen wir auf die Besonderheiten von Data Science in der Praxis ein. Wir konzentrieren uns auf die technischen Unterschiede, häufige Fehler und Herausforderungen. Dabei lassen wird die sozialen und kommunikativen Aspekte außen vor. ...

Agilität
Machine Learning
Data

28.8.2019 | 11 Minuten Lesezeit

Nico Axtmann

Inbetriebnahme eines scikit-learn-Modells mit ONNX und FastAPI

Dieser Artikel befasst sich mit dem Deployment eines Machine-Learning-Modells, das den Wert eines Hauses in Boston anhand gewisser Merkmale wie der Kriminalitätsrate des Bezirks und der Anzahl der Räume in einer Wohnung bestimmen kann. Im ersten Schritt...

Data
Python
Künstliche Intelligenz
Machine Learning

6.8.2019 | 3 Minuten Lesezeit

Nico Axtmann

Machine-Learning-Modelle bewerten – die Crux mit der Metrik

Ist ein Modell erst einmal trainiert, kann es auf verschiedene Art und Weise und mit mehr oder weniger komplexen und aussagekräftigen Verfahren und Metriken bewertet werden. Die Anzahl und möglichen Kriterien, ein Modell zu bewerten, sind allerdings....

Data
Machine Learning
Softwareentwicklung

1.7.2019 | 13 Minuten Lesezeit

Berthold Schulte

Gemeinsam bessere Projekte umsetzen.

Wir helfen deinem Unternehmen.

Du stehst vor einer großen IT-Herausforderung? Wir sorgen für eine maßgeschneiderte Unterstützung. Informiere dich jetzt.

Hilf uns, noch besser zu werden.

Wir sind immer auf der Suche nach neuen Talenten. Auch für dich ist die passende Stelle dabei.

Contact

Send

The Machinery behind Machine Learning – A Benchmark for Linear Regression

Linear Regression

The objective function for Linear Regression

Visualizing Linear Regression

The algorithmic setup

Benchmarking Linear Regression

Impact of data complexity

Impact of model complexity

Fixed stepsize

Summary

Was this post helpful?

Ja

Blog author

Get in contact

Get in contact

More articles

data2day 2015 – Ein Konferenzbericht

The Machinery behind Machine Learning – Part 2

The Machinery behind Machine Learning – Part 1

Your job at codecentric?

Agile Developer und Consultant (w/d/m)

View Job

More articles in this subject area

Eine Einführung in Federated Learning im industriellen Kontext: Fortgeschritten

Eine Einführung in Federated Learning im industriellen Kontext: Grundlagen

Große Sprachmodelle: Was ist ein LLM?

Smart DistancR – Perspektivisch korrekte Distanzmessung zwischen Personen

Machine-Learning-Modelle bewerten – Quality Gates etablieren

Kürzere Time-to-Market für ML-Modelle durch Googles BigQuery ML

Schnelles Training eines Recommendation-Modells durch BigQuery ML

KI, Daten und Infrastruktur – ML-Systeme schnell Ende-zu-Ende verproben...

Schnelles KI-Prototyping mit Google Cloud AutoML Vision

KI in der Praxis: Fehlerhafte Bauteile mit Rekognition auf AWS identifizieren

KI in der Praxis: Fehlerhafte Bauteile mit AutoML in der Google Cloud ...

KI für KMU: (Teil-)Automatisierung der Qualitätskontrolle von Bauteilen

BIE Spotty – unsere Lösung beim BIE City Hackathon

Machine Learning in der Praxis. Eine Mate mit … Matthias Niehoff #EineMateMit

Wie man Data-Science-Projekte nicht in die PoC-Sackgasse manövriert

Machine-Learning-Modelle bewerten – die Crux mit den Testdaten

Deployment von Machine-Learning-Modellen mit Seldon Core

Data Science in der Praxis: Häufige Fehler und Vorgehen

Inbetriebnahme eines scikit-learn-Modells mit ONNX und FastAPI

Machine-Learning-Modelle bewerten – die Crux mit der Metrik

Gemeinsam bessere Projekte umsetzen.

Wir helfen deinem Unternehmen.

Unsere Leistungen

Hilf uns, noch besser zu werden.

Zu den Jobangeboten