# The Machinery behind Machine Learning – A Benchmark for Linear Regression

01/14/16

This is the third post in the “Machinery behind Machine Learning” series, and after all the “academic” discussions it is time to show meaningful results for some of the most prominent Machine Learning algorithms – there have been quite a few requests from interested readers in this direction. Today we present results for Linear Regression as a prototype for a *Regression* method; follow-up posts will cover Logistic Regression as a prototype for a *Classification* method and a Collaborative Filtering / Matrix Factorization algorithm as a prototype for a *Recommender System*. It is worth noting that these prototypes for Regression, Classification and Recommendation rely on the same underlying Optimization framework – despite their rather different interpretation and usage in the field of Machine Learning.

To be a bit more precise: In all three applications the training part is driven by an algorithm for Unconstrained Optimization, in our case Gradient Descent. As we want to emphasize the tuning options and measure the impact of the stepsize in particular, we provide a comparison of the standard Armijo rule, the Armijo rule with widening and the exact stepsize. In later articles – after we have learned something about Conjugate Gradient and maybe some other advanced methods – we are going to repeat this benchmark, but for now let’s focus on the performance of Gradient Descent.

## Linear Regression

Linear Regression is a conceptually simple approach for modeling a relationship between a set of numerical features – represented by the independent variables \(x_1,…,x_n\) – and a given numerical variable \(y\), the dependent variable. When we assume that we have \(m\) different data points or vectors \(x^{(j)}\) and values or real numbers \(y_j\), the model takes the following form:

\[ y_j \approx c_0 + c_1\,x^{(j)}_1 + \dots + c_n\,x^{(j)}_n, \qquad j=1,\dots,m, \]

with \(c_i,\ i=0,…,n\), being some real-valued parameters. Using the matrix-vector notation from Linear Algebra we derive a more compact formulation. We put the parameter values \(c_i\) in the parameter vector \(c\) of length \(n+1\), collect the row vectors \(x^{(j)}:=[1,x^{(j)}_1,…,x^{(j)}_n]\) into a \(m\times (n+1)\)-matrix \(X\) – the leading \(1\) in each vector belongs to the coefficient \(c_0\) – and end up with something close to a linear system of equations:

\[ y \approx Xc. \]

The interpretation is that we want to find a parameter vector \(c\) that satisfies the linear system of equations *as well as possible*; thus we are looking for the best approximate solution, because an exact solution does not exist in general if \(m>n+1\), which is assumed to be the case here.
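To make the notation concrete, here is a tiny sketch of how such a design matrix can be assembled (in Python/NumPy rather than the R code from the repository; all names and values are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 6, 2                         # m data points, n features, m > n + 1
features = rng.uniform(size=(m, n))

# Prepend the column of ones that belongs to the offset c_0,
# giving the m x (n+1) design matrix X.
X = np.hstack([np.ones((m, 1)), features])

print(X.shape)  # (6, 3)
```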

### The objective function for Linear Regression

One mathematical translation of *as well as possible* is to minimize the residual or error measured by the (squared) Euclidean norm:

\[ f(c) := \|Xc - y\|^2. \]

The Euclidean norm is the usual notion of distance, so nothing spectacular here. We square the expression to get rid of the square root that hides within the norm; this is simpler and better from a computational point of view. Of course squaring influences the concrete value of the error, but it does not change the solution or optimal parameter vector \(c^*\) that we are looking for.

Now we have arrived at an unconstrained minimization problem: the process of minimizing the error theoretically involves all possible values of the parameters \(c_i\); there are no restrictions. It’s time to show what we have learned so far. First, let’s unfold the compact expression to see what exactly is measured by the objective function \(f\) defined above:

\[ f(c) = \|Xc - y\|^2 = \sum_{j=1}^{m} \left( c^Tx^{(j)} - y_j \right)^2. \]

In the implementation we have included an optional scaling factor of \(1/(2m)\) that normalizes the value of the objective function with respect to the number of data points. Its presence or absence does not change the concrete solution; it is an implementation detail that we have omitted here. The important ingredients are the summands \(\left( c^Tx^{(j)}-y_j \right)^2\) that quantify the pointwise deviation of the model from the input data.
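As a small illustration, the scaled objective and its gradient can be sketched as follows (Python/NumPy instead of the repository’s R code; the function names are ours):

```python
import numpy as np

def objective(c, X, y):
    """f(c) = 1/(2m) * sum_j (c^T x_j - y_j)^2, the scaled squared residual."""
    m = X.shape[0]
    residual = X @ c - y
    return residual @ residual / (2 * m)

def gradient(c, X, y):
    """Gradient of the scaled objective: grad f(c) = 1/m * X^T (Xc - y)."""
    m = X.shape[0]
    return X.T @ (X @ c - y) / m
```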

### Visualizing Linear Regression

Effectively we sum up the squared prediction errors for every data point, as is illustrated in the following plot. This example has been generated using the mtcars dataset that comes with R. The point-wise (squared) error is the (squared) length of the vertical line between the true data point (black) and the predicted point (blue) on the regression line.

The variables for the optimization algorithm are the coefficients \(c_i,\ i=0,…,n\), and the objective function \(f(c)\) is nothing but a polynomial of degree \(2\) in these variables. In fact there is a structural similarity to some of the simple test functions from part 2 of this blog series, but now let us look at the concrete test case. As usual you can find all information and the R code in the github repository, the code is self-contained.

## The algorithmic setup

As we want to check the scaling behaviour in higher dimensions I have decided to create artificial data for this test. The setup is as follows:

- The first column of the \(m\times (n+1)\)-matrix \(X\) contains only \(1\)s, the remaining \(n\) columns contain random numbers uniformly distributed in the unit interval \([0,1]\).
- We define an auxiliary vector \(\hat{c}\) of length \(n+1\) by \(\hat{c} := [1,2,3,…,n+1]\).
- We define a random vector \(z\) of length \(m\) – the number of data points – containing random numbers uniformly distributed in the unit interval, scaled to unit length. \(z\) is used as noise generator.
- We define a weight \(\varepsilon\) – a real number that allows us to scale the noise vector \(z\).
- Finally, we define the vector \(y\) by \(y := X\hat{c} + \varepsilon z\).
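The construction above can be sketched in a few lines (again Python/NumPy as a stand-in for the repository’s R code; names and parameter values are ours):

```python
import numpy as np

rng = np.random.default_rng(42)

m, n = 1000, 3          # number of data points and features
eps = 0.1               # noise weight epsilon

# First column of ones, remaining n columns uniform in [0, 1].
X = np.hstack([np.ones((m, 1)), rng.uniform(size=(m, n))])

# Auxiliary "true" parameter vector c_hat = [1, 2, ..., n+1].
c_hat = np.arange(1, n + 2, dtype=float)

# Noise vector: uniform in [0, 1], scaled to unit length.
z = rng.uniform(size=m)
z /= np.linalg.norm(z)

y = X @ c_hat + eps * z

# Sanity check: the known candidate c_hat realizes the trivial bound
# ||X c_hat - y||^2 = ||eps z||^2 = eps^2.
print(np.isclose(np.linalg.norm(X @ c_hat - y) ** 2, eps ** 2))  # True
```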

The vector \(\hat{c}\) is by construction an approximate solution of \(y\approx Xc\) as long as the noise \(\varepsilon z\) is small. This setup might look somewhat complicated, but it gives us a nice parametrization of the quantities of interest. The main parameters are

- The number of data points: \(m\).
- The number of parameters of the linear model: \(n+1\).
- The amount of noise, whose square is a trivial upper bound on the optimal value of the (unscaled) objective function: \(\varepsilon\).

The last parameter allows for a quick sanity check, as \(\varepsilon^2\) always is a trivial upper bound on the optimal value, because we already know the candidate solution \(\hat{c}\) with the property

\[ \|X\hat{c} - y\|^2 = \|\varepsilon z\|^2 = \varepsilon^2, \]

since \(z\) has unit length.

Furthermore it makes the scenario somewhat realistic. Typically you want to reconstruct the unknown solution – modeled by \(\hat{c}\) – but what you get from any algorithm is almost always a perturbed solution \(c^*\) that still contains some of the noise. Using this parametrization you can get a feeling for how much of the added noise is actually present in the solution, which might lead to a deeper understanding of Linear Regression.

## Benchmarking Linear Regression

We compare the performance of the standard Armijo rule, the Armijo rule with widening and the exact stepsize for several choices of \(m\), \(n\) and \(\varepsilon\). We also include a single comparison for some choices of a fixed stepsize at the end, which indicates what you can expect from this choice and where it can fail. In all cases the algorithm terminates if the norm of the gradient is below \(1e-6\) or if the number of iterations exceeds \(100,000\).
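For readers who want to follow along, here is a minimal sketch of Gradient Descent with a standard Armijo backtracking rule and the termination criteria above (Python/NumPy; the parameters `sigma` and `beta` and all names are our assumptions, not the exact settings from the repository):

```python
import numpy as np

def gradient_descent_armijo(X, y, c0, tol=1e-6, max_iter=100_000,
                            sigma=1e-4, beta=0.5):
    """Minimize f(c) = 1/(2m) ||Xc - y||^2 with Armijo backtracking."""
    m = X.shape[0]
    f = lambda c: np.sum((X @ c - y) ** 2) / (2 * m)
    grad = lambda c: X.T @ (X @ c - y) / m

    c = c0.astype(float)
    for it in range(max_iter):
        g = grad(c)
        if np.linalg.norm(g) < tol:      # gradient-norm termination
            return c, it
        d = -g                           # steepest descent direction
        t = 1.0
        # Backtrack until the Armijo condition holds:
        # f(c + t d) <= f(c) + sigma * t * g^T d
        while f(c + t * d) > f(c) + sigma * t * (g @ d):
            t *= beta
        c = c + t * d
    return c, max_iter
```

Roughly speaking, the widening variant additionally tries larger steps as long as the Armijo condition still holds, instead of always starting the backtracking from \(t=1\).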

### Impact of data complexity

The first test is about the simplest possible model, one-dimensional Linear Regression, i.e. the number of model parameters is \(2\). Don’t be confused by this: there is always one parameter for the “offset” \(c_0\), so the \(n\)-dimensional model depends on \(n+1\) parameters. The optimization algorithm is initialized with the parameter vector \(c=(c_0,c_1)=(0,0)\). We vary the number of data points from \(5\) to \(10,000\) in order to study the scaling with respect to the amount of data.

That’s more or less the expected behaviour. The number \(m\) of data points can grow, but the underlying optimization problem still has the same dimension \(n+1=2\); thus the number of iterations remains roughly constant – given that there are sufficiently many data points – and the runtime increases linearly with \(m\). It is interesting that the exact stepsize needs so few iterations, making it almost as fast as the standard Armijo rule; this indicates the simple structure of the problem. Nevertheless, the Armijo rule with widening is the best choice here.
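One reason the exact stepsize is cheap for this problem: for a quadratic objective it has a closed form. Minimizing \(\varphi(t) = f(c - t\,g)\) over \(t\), with gradient \(g\) and \(A = X^TX/m\), yields \(t^* = \frac{g^Tg}{g^TAg}\). A sketch (Python/NumPy; names are ours):

```python
import numpy as np

def exact_stepsize(X, g):
    """Exact stepsize t* = (g^T g) / (g^T A g) along d = -g
    for f(c) = 1/(2m) ||Xc - y||^2, where A = X^T X / m."""
    m = X.shape[0]
    Ag = X.T @ (X @ g) / m      # A g without forming A explicitly
    return (g @ g) / (g @ Ag)
```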

### Impact of model complexity

The second test varies the number of model parameters from \(2\) to \(16\), the number of data points is fixed at \(1,000\). Here, we expect the scaling behaviour to be somewhat different.

Still, the exact stepsize produces the smallest number of iterations, but the difference is much smaller for more complex models, which indicates that inexact stepsizes are a good choice here. On the other hand, the price for this slight advantage is prohibitively high. We can also see that not only the runtime per iteration but also the number of iterations increases with the model complexity; this is something to consider when choosing a model for a real-world problem.

### Fixed stepsize

As promised, we provide some results for fixed stepsizes as well. Conceptually, it does not make sense to use a fixed stepsize at all unless you know in advance a good value that definitely works – which you typically don’t. In practice you have to test several values in order to find something that allows the algorithm to converge in reasonable time – which is nothing but a stepsize rule that requires a full algorithmic run for each update. But let’s look at the numbers.

Only for the one-dimensional – and almost trivial – Linear Regression problem is there a value of the stepsize for which the performance is competitive, but this choice is way too risky for non-trivial settings. Even for slightly more complex Linear Regression problems and a fixed stepsize of \(1e-1\), the performance is worse by a factor of \(10\). And for more complicated objective functions, e.g. the Rosenbrock function introduced in the last blog post, the value would have to be smaller than \(1e-6\), implying that the number of iterations and the runtime would explode.
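The failure mode of a fixed stepsize is easy to reproduce: for our quadratic objective, Gradient Descent with fixed step \(t\) converges only if \(t < 2/L\), where \(L\) is the largest eigenvalue of \(A = X^TX/m\). A sketch (Python/NumPy; names and values are ours):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 100, 2
X = np.hstack([np.ones((m, 1)), rng.uniform(size=(m, n))])
y = rng.uniform(size=m)

A = X.T @ X / m
L = np.linalg.eigvalsh(A).max()      # Lipschitz constant of the gradient

f = lambda c: np.sum((X @ c - y) ** 2) / (2 * m)
grad = lambda c: X.T @ (X @ c - y) / m

def run_fixed(t, iters=50):
    """Run gradient descent with fixed stepsize t and return the final f."""
    c = np.zeros(n + 1)
    for _ in range(iters):
        c = c - t * grad(c)
    return f(c)

# A fixed stepsize is only safe below 2/L; above it the iteration blows up.
print(run_fixed(0.5 / L) < f(np.zeros(n + 1)))   # True  (converging)
print(run_fixed(2.5 / L) > f(np.zeros(n + 1)))   # True  (diverging)
```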

## Summary

The results for Linear Regression already indicate that it is indeed useful to keep an eye on the underlying machinery of Optimization methods. The Armijo rule with widening shows the best performance, but the exact stepsize leads to the smallest number of iterations. These two findings imply that the direction of steepest descent is a good, or at least reasonable, choice for Linear Regression. The reasoning behind this is twofold. Remember that one argument for inexact stepsizes was that they can help to avoid the risk of overfitting, a risk the exact stepsize cannot avoid in case the search direction is bad. Overfitting – which can be interpreted as a significant deviation between the “local” search direction and the “global” optimal direction pointing directly to the solution – should lead to a higher number of iterations, or at least not to fewer iterations. So, in reverse, if the number of iterations is smaller, there can be no overfitting and the search direction should be a reasonable approximation of the unknown optimal direction.

The second argument – which directly applies to widening but also to the exact stepsize – considers the step length. Widening leads to larger steps, which again only make sense if the local search direction is a reasonably good approximation of the global optimal direction in a larger environment of the current iterate. As a general rule: larger steps imply that the model is a “good fit” in a larger environment of the current point. The exact stepsize also produced larger steps, which we did not show here, but you can check it on your own.

Feel free to take a look at the code in the github repository and give it a try yourself; you can even apply it to your own data. For other choices of the parameters the numbers can look different – sometimes the widening idea has no effect and sometimes it clearly outperforms the exact stepsize in terms of the number of iterations – but my goal was to show you the “average” case, as this is what matters most in practice.

The results for Logistic Regression are coming soon and here the impact on the runtime and the overall performance will be even more visible. Thanks for reading!

The Machinery behind Machine Learning – Part 2

The Machinery behind Machine Learning – Part 1