Overview

# The Machinery behind Machine Learning – A Benchmark for Linear Regression

This is the third post in the “Machinery behind Machine Learning” series and after all the “academic” discussions it is time to show meaningful results for some of the most prominent Machine Learning algorithms – there have been quite a few requests from interested readers in this direction. Today we are going to present results for Linear Regression as prototype for a Regression method, follow-up posts will cover Logistic Regression as prototype for a Classification method and a Collaborative Filtering / Matrix Factorization algorithm as prototype for a Recommender System. It is worth noting that these prototypes for Regression, Classification and Recommendation are relying on the same underlying Optimization framework – despite their rather different interpretation or usage in the field of Machine Learning.

To be a bit more precise: In all three applications the training part is driven by an algorithm for Unconstrained Optimization, in our case Gradient Descent. As we want to emphasize the tuning options and measure the impact of the stepsize in particular we provide a comparison of the standard Armijo rule, the Armijo rule with widening and the exact stepsize. In later articles – after we learned something about Conjugate Gradient and maybe some other advanced methods – we are going to repeat this benchmark, but for now let’s focus on the performance of Gradient Descent.

## Linear Regression

Linear Regression is a conceptually simple approach for modeling a relationship between a set of numerical features – represented by the independent variables $latex x_1,…,x_n$ – and a given numerical variable $latex y$, the dependent variable. When we assume that we have $latex m$ different data points or vectors $latex x^{(j)}$ and values or real numbers $latex y_j$, the model takes the following form:

$latex y_j \approx c_0 + c_1 x^{(j)}_1 + … + c_n x^{(j)}_n$

with $latex c_i,\ i=0,..,n$, being some real-valued parameters. Using the matrix-vector notation from Linear Algebra we derive a more compact formulation. We put the parameter values $latex c_i$ in the parameter vector $latex c$ of length $latex n+1$, collect the row vectors $latex x^{(j)}:=[1,x^{(j)}_1,…,x^{(j)}_n]$ into a $latex m\times (n+1)$-matrix $latex X$ – the leading $latex 1$ in each vector belongs to the coefficient $latex c_0$ – and end up with something close to a linear system of equations:

$latex Xc\approx y$.

The interpretation is that we want to find a parameter vector c that satisfies the linear system of equation as good as possible, thus we are looking for the best approximate solution because an exact solution does not exist in general if $latex m>n+1$, which is assumed to be the case here.

### The objective function for Linear Regression

One mathematical translation of as good as possible is to minimize the residual or error measured by the (squared) Euclidean norm:

$latex \min_c f(c):=\|Xc – y\|^2$.

The Euclidean norm is the usual notion of distance so nothing spectacular here. We square the expression to get rid of the square root that hides within the norm, it’s simpler and better from a computational point of view. Of course it does influence the concrete value of the error but does not change the solution or optimal parameter vector $latex c^*$ that we are looking for.
Now we have arrived at an unconstrained minimization problem, the process of minimizing the error theoretically involves all possible values for the parameters $latex c_i$, there are no restrictions. It’s time to show what we have learned so far. First, let’s unfold the compact expression to see what exactly is measured by the objective function $latex f$ defined above:

$latex f(c)=\|Xc – y\|^2 = \sum_j \left( c^Tx^{(j)}-y_j \right)^2$.

## Stefan Kühn

Dr. Stefan Kühn develops and applies fast, robust and intelligent algorithms for analyzing large amounts of data.