# The Machinery behind Machine Learning – Part 2

04/21/15

Machine Learning is one of the hottest topic on the web. But how do machines learn? And do they learn at all? The answer in most cases is: Machines do not learn, they optimize.

This is part 2 of the series, and after all the preparation it definitely is time to do the first numerical experiments. You can find the respective R code here.

As promised in part 1 we now start with the obvious choice of the descent direction – the negative gradient. From this choice the name Gradient Descent is somewhat obvious but the method has another name: Steepest Descent. Let’s figure out why this name is adequate and what Gradient Descent can do for us.

## Gradient Descent

The idea of Gradient Descent is following the path that ensures the quickest win near a given point. The direction specifying this path is given by the negative gradient at the current point:

The negative gradient corresponds to an angle of \(\alpha=180^{\circ}=\pi\), thus is exactly at the center of all possible descent directions given by \(90^{\circ} < \alpha < 270^{\circ}[/latex]. It's an almost trivial, natural choice since it obviously satisfies [latex]s^T\nabla f(p)<0[/latex], the defining property of a decent direction. Due to this property the negative gradient stands out, but there is more. In Optimization Theory we love to formulate anything as minimization problem and in fact the (scaled) negative gradient always is a solution to the following auxiliary *constrained* minimization problem

We have already seen that any \(s\) with \(s^T\nabla f(p)<0[/latex] is a descent direction at the point [latex]p[/latex]. Furthermore, if we fix the length of [latex]s[/latex] as done here by [latex]\|s \|=1[/latex], the value of [latex]s^T\nabla f(p)[/latex] is a direct estimate for the progress that can be made near [latex]p[/latex] in terms of the achievable reduction of the function value when actually performing a step in direction [latex]s[/latex]. This follows from a Taylor series expansion of [latex]f\) around \(p\) in direction \(s\) which is beyond the scope of this post. But we keep in mind that the solution of this auxiliary problem – the negative gradient – guarantees “optimal” progress near the given point. This is the reason why it is called the direction of steepest descent, the “best” descent direction in a *local* sense because it allows for the maximum immediate progress.

## Greedy methods

Algorithms like Gradient Descent that are based on locally optimal steps are often called *greedy* because they take the quickest win offered at the current point. It would not be easy to convince us that we might consider something else than this greedy choice if there is an urgent need for progress – e.g. if we just are thirsty enough, to refer to our find-water-in-the-desert scenario from part 1. But – as it is the case with greedy choices quite often – taking the quick win might not be the best in the long run. The most prominent example for this phenomenon probably is the Travelling Salesman Problem. Here, the greedy approach can even lead to the worst possible solution. The situation with Gradient Descent is somewhat comparable. Fortunately, it always works in the sense that finally you end up near a critical point. But depending on the curvature of the function the path to the minimum can be arbitrarily long and the number of steps can be very high even if you start “close” to the optimum.

## Test functions

We now demonstrate the typical behaviour of Gradient Descent on four test functions, starting with a trivial case and ending with a worst case scenario for Gradient Descent. For each of the functions there will be a contour plot for the \(2-D\) case with some additional information. A blue arrow is indicating the optimal search direction pointing directly towards the optimum – the red plus – and a grey arrow indicates the negative gradient.

The first test function is our simple example function, half of the squared norm, that we now call

It is a so-called radial function, i.e. the function value only depends on the norm of \(p\) and not on the concrete coordinate entries, thus minimizing it essentially boils down to a one-dimensional problem. We expect our algorithms to notice that and to benefit from it.

The contour lines are circles, and here the negative gradient always coincides with the optimal decent direction, it always points directly to the optimum. This example is the best case scenario, we will see that the method converges immediately to the unique global minimum, there is no dependence on the dimensionality of the problem nor the initial value. The function is invariant under all rotations that leave the origin fixed, in particular symmetric, i.e. you can interchange \(p_i\) and \(p_j\) and the function value remains the same.

The second function \(f_2\) is an anisotropic variant of \(f_1\), given by

i.e. the function value changes depending on the direction. There is no rotation invariance and no symmetry, when interchanging variables. This should slow down the convergence of our algorithms a bit but in the end the function still is, well, mostly harmless.

The contour lines are ellipses, and depending on the scaling of the different axes – in our case the scaling is represented by the factors \(i\) – the negative gradient can be almost orthogonal to the optimal direction. Here, dimensionality and initial conditions start to matter significantly. In fact Gradient descent in general is highly sensitive to the starting conditions. Concerning \(f_2\) this dependency can roughly be characterized as follows: If the starting point is located on one of the coordinate axes, then Gradient descent is optimal, but if the starting point is far away from the axes then the negative direction is clearly suboptimal as already mentioned.

The third function is

again a simple symmetric function, but not radial or rotational invariant and thus slightly less trivial than \(f_1\).

The contour lines look like squares with rounded corners, and again the negative gradient can differ significantly from the optimal direction depending on the current point.

The last and most interesting function \(f_4\) is a variant of the Rosenbrock function, a well-known test function for optimization routines, see e.g. the Wikipedia entry for the Rosenbrock function:

the standard parameter choice being \(a=1\) and \(b=100\). The Rosenbrock function is highly anisotropic as we will see from the contour plot. Furthermore, there are “mixed” terms, i.e. nonlinear dependencies between the different variables \(p_1,…,p_n\), take e.g. the term \((p_2-p_1^2)^2 = p_2^2 -2*p_2p_1^2+ p_1^4\). The Rosenbrock function is also known as Rosenbrock’s banana function, a name that indicates the somewhat special shape which exhibits a noticeable curvature.

You might have noticed that the contour plot has a different parameter setting than the standard choice as mentioned in the definition of the function \(f_4\). The plot shows the contour lines for \(a=1,b=5\) and not \(a=1,b=100\). It’s somewhat funny that I actually had to change the parameter \(b\) to a much less critical value in order to make the extreme curvature visible. With \(b=100\) the curvature is so extreme that a standard contour plot cannot resolve it accurately, everything seems to be symmetric, take a look yourself.

From this plot it is rather obvious that the negative gradient – they grey arrow – can be a very poor search direction despite its property of being locally optimal. Furthermore, due to the enormous length of the gradient the stepsize will be tiny resulting in a very high number of function evaluations as we will see later.

## Stepsize methods

In order to get a feeling for the importance and the impact of stepsize methods we are going to compare three different variants, the standard Armijo rule, the Armijo rule with the widening approach I have already mentioned in part 1) of this series, and an exact stepsize method based on a simple but rather robust bisection method, that basically finds the root of a univariate function in an appropriate interval. We apply this bisection method to our line search problem, i.e. finding the root of the directional derivative with the direction being the negative gradient, which is a one-dimensional problem. The exact method is in general not competitive in terms of runtime or numerical costs because it needs a huge amount of gradient evaluations but here we are mostly interested in how the number of iterations is affected by this choice, a qualitative comparison. Intuitively one would guess that solving the line search problem exact would result in faster convergence, i.e. less iterations. Let’s figure out how good our intuition is.

## The first numerical tests

For a sound comparison of the performance of the different methods that we intend to introduce in this series one cannot look at the runtime only. Since all our methods are iterative methods we have to consider at least the number of steps and the most important computations that are involved. To get a basic overview we use four different values: runtime, number of iterations, number of function evaluations and number of gradient evaluations. At first we try to get a feeling for how the dimension of the problem and the choice of the initial values affects the algorithmic behaviour. The optimization procedure stops if the norm of the gradient at the current iterate is smaller than \(tol=1e^{-6}\).

### Initial values for \(f_1,f_2,f_3\)

For these functions that are rather similar anyway and share the same minimum at \((0,…,0)\) we use three initial values with the meaningful names good_point / average_point / bad_point that are chosen to cause an algorithmic behaviour that hopefully justifies these names. The concrete values are

\begin{matrix}

good\_point & = & \dfrac{1}{10}*(1,…,1)\\

& & \\

average\_point & = & -\dfrac{10}{n} * (1,2,…,n)\\

& & \\

bad\_point & = & \dfrac{100}{n} * (1,2,…,n)

\end{matrix}

$

### Initial values for Rosenbrock function \(f_4\)

The Rosenbrock function \(f_4\) is rather different from the others, so I have prepared some special initial values as well. The minimum is at \((1,…,1)\) and due to the curvature of the function I have chosen the following set of points

\begin{matrix}

good\_point & = & \dfrac{9}{10}*(1,…,1)\\

& & \\

average\_point & = &-\dfrac{100}{n} * (1,2,…,n) + (2,…,2)\\

& & \\

bad\_point & = & \dfrac{100}{n} * (1,2,…,n)

\end{matrix}

$

We do not use all points in all tests, but you can easily do experiments on your own.

### Test 1 – Impact of problem dimension

This is just to get a feeling how things scale in higher dimensions. We have used the bad starting point for \(f_1,f_2,f_3\), the good starting point for \(f_4\) and always the standard Armijo rule.

The trivial problem \(\min f_1\) – essentially one-dimensional – is solved in a constant number of iterations regardless of the dimension which is the perfect scaling. Due to our parameter choice it takes exactly one step since a full step in direction of the negative gradient always ends in the global minimum.

The second problem \(\min f_2\) – with anisotropies getting more serious with increasing dimension – shows a quadratic complexity as the number of iterations, function and gradient evaluations increases by a factor of roughly \(4=2^2\) when the problem dimensions doubles. Such a scaling is problematic as it clearly limits the applicability to small or medium dimensions. Typically one wants to achieve linear or log-linear scaling, i.e. linear up to logarithmic factors.

The third problem \(\min f_3\) that is supposed to be more complex as it has a higher degree seems to be nicer. The number of iterations etc. increases only by a factor of roughly \(1.26\) when the problem dimensions doubles. This is a sublinear growth, which implies good scalability.

The fourth problem \(\min f_4\) appears to be surprisingly harmless, in fact it also shows perfect scaling. Of course the runtimes always increase at a higher rate because the function/gradient evaluations etc. induce a higher number of arithmetical operations in higher dimensions but this is not our primary focus here.

As a general note: We should be careful when drawing conclusion from our data. All we have seen yet is a single data point, so we should not overemphasize these results or generalize from these values too much. Nevertheless, there is something we can take with us: The dimensionality of the minimization problem and the complexity of the objective function are not necessarily related. This we can safely derive as we need only one counterexample – in this case \(f_4\) – when disproving hypotheses. It’s much harder to prove them.

Furthermore, the results depend on factors that we tend to overlook easily. For this first test the termination criterion plays a crucial rule. In higher dimensions it is much harder to reduce the norm below an absolute bound like \(tol=1e^{-6}\), and if we would e.g. change it such that we apply the criterion *to every component*, i.e. replacing the Euclidean norm by the maximum norm, we would see a different behaviour. Especially in really high dimensions one has to carefully think about the implications of such things as e.g. criteria using the Euclidean norm can imply accuracy requirements in each coordinate that are way beyond machine precision. Quite often it makes sense to use relative criteria or a mixture of both absolute and relative criteria, or to choose problem-specific norms.

### Test 2 – Impact of stepsize – Part 1

Now I would like to draw your attention to the stepsize rules, that are the second-most important ingredients in our algorithmic setup. First we try to evaluate how good our intuition was concerning the difference between exact and inexact stepsizes.

Actually – and if one wants to judge things well-balanced this is quite often the case – we cannot say much about it in general. There are cases where the exact stepsize is dramatically better, but the opposite can happen as well, see Table 4. We take with us, that it’s always a good idea to test things yourself that are not clear a priori. Each stepsize method has it’s applications and you will achieve the best results when finding the one that fits to your specific problem best.

### Test 3 – Impact of stepsize – Part 2

In the last test there has been one case – \(\min f_3\) – where the Armijo rule turned out to be a surprisingly bad choice. But what exactly is causing this? When you take a closer look you will notice that the algorithm is almost always (except 4 times) accepting the full stepsize, i.e. effectively the Armijo rule is not doing anything. The reason for this is that the length of the negative gradient is rather small, while the search direction is quite good for this problem. And here comes the widening idea into play. If the full negative gradient step leads to sufficient decrease, why not trying even larger steps? Here are the results.

As you can see here, a minimal modification led to a massive improvement, and the widening approach even outperforms the exact stepsize in terms of iterations. It turns out that not the general idea of the Armijo rule – iteratively adapting the steplength – has been “wrong”, it’s just the artificial restriction of not trying larger stepsizes that resulted in a severe degradation of performance. In some sense the standard Armijo rule is greedy as well, it always accepts the first admissible stepsize. Again, this turns out to not be the best choice in general.

### Test 4 – Impact of stepsize – Part 3

This is the final test for now, and personally I think this is the most interesting case as well. I have set the parameters of the Rosenbrock function to \(a=1,b=5\), which makes it less hard for Gradient Descent but still hard enough.

In dimension \(n=20\) everythings looks somewhat unsurprising. In terms of iterations the exact stepsize is the best choice, while widening makes almost no difference at all given the total number of iterations. But in dimension \(n=25\) several interesting things start to happen.

First of all, you should forget about the exact stepsize for Gradient Descent applied to the Rosenbrock function even if it has been slightly better for \(n=20\). On a general level the reason for this simply is that the search direction is bad anyway, so there is not much you can expect when optimizing in this direction. When you look at the stepsizes you will notice that the behaviour of Armijo and the exaxct stepsize rule is rather different. After the first few iterations Armijo takes an almost constant stepsize, while the exact method exhibits some kind of alternating behaviour. If your search direction is not good you should not overoptimize, it is better to not try making larger steps at any cost, but this is what the exact method is doing regularly. After such a long step, the next step can be orders of magnitude smaller which implies that the situation has become even worse. The path of steps using the exact method proceeds almost at the bottom of the banana-shaped valley. The inexact methods slightly stay away from the very bottom so the convergence is better in the end. If you want to see this on your own, just print out the first 50-100 stepsizes and compare them to the stepsizes after 10000 iterations or so. In the beginning the exact method is taking larger steps from time to time but once it is trapped deep down in the valley the stepsizes remain very small. After some thousands of iterations the Armijo stepsize leads to larger steps on average which is what enables faster convergence.

But the most interesting aspect is the effect of the three(!) widening steps. This tiny change – one widening step every 12000 steps – improves the perfomance by more than 13%. This shows the sometimes extreme sensitivity of Gradient Descent once more and it proves that it can massively pay off when investing in this kind of optimization. That’s exactly the message here.

## Do it yourself

If you are interested in experiencing these things on your own feel free to download the code and give it a try. And if you want to see the sometimes strange behaviour of Gradient Descent in action – even stranger than what we have already seen in the above tests, then I can recommend the following setup: Choose dimension \(n=8\) (or \(n=80\) if you really have a lot of spare time, don’t forget to adapt the max iterations parameter) and try to minimize the Rosenbrock function \(f_4\) with the parameters \(a=1,b=100\), the standard Armijo rule and the average starting point (this is important, it only works with the average point). Then slightly change the starting point by adding \(10^{-2}\) (\(10^{-1}\) if you can’t wait anymore and want a quick result) to any coordinate of the average starting point (R does this for you automatically, you can just add it as a constant: + 1e-2 or 1e-1). You need approximately ten minutes for this in \(n=8\) dimensions, just in case you are planning to test the \(n=80\) scenario first.

### Fixed stepsize

One comment on fixed stepsizes: Quite often you can find descriptions of Gradient Descent without any stepsize rule, just using a fixed small stepsize, see e.g. the Wikipedia entry for Gradient Descent. While this in theory works if the stepsize is small enough I cannot recommend this by any means. The small effort invested in choosing a reasonable stepsize definitely pays off and the method runs orders of magnitude faster. You can test this yourself if you like, just follow the instructions in the Readme. I did not present any numbers here, but if you want to have some fun then try minimizing \(f_3\) in dimension \(n=7\) starting from the average point in less than 1 million iterations – and don’t just relax \(tol\).

## Summary

Finally, we have our first interesting results at hand. In particular we have learned something about stepsizes and how Gradient Descent behaves in different settings. We have seen best and worst cases, sensitivity with respect to initial values and the massive impact of minimal changes, e.g. Armijo with widening in Test 3.

Let me thank you for your attention, it’s been a long post but I hope it’s been worth the time. In the next part of this series we will see how the Nonlinear Conjugate Gradient method performs when applied to our test functions, especially the Rosenbrock function. And of course we will learn how Nonlinear CG works algorithmically and why and when it might be a good idea to choose this approach. See you soon…

Other articles in this series:

Machinery – Linear Regression

Machinery – Part 1