The Machinery behind Machine Learning – Part 2

20.4.2015 | 17 minutes of reading time

Machine Learning is one of the hottest topic on the web. But how do machines learn? And do they learn at all? The answer in most cases is: Machines do not learn, they optimize.
This is part 2 of the series, and after all the preparation it definitely is time to do the first numerical experiments. You can find the respective R code here .

As promised in part 1 we now start with the obvious choice of the descent direction – the negative gradient. From this choice the name Gradient Descent is somewhat obvious but the method has another name: Steepest Descent. Let’s figure out why this name is adequate and what Gradient Descent can do for us.

Gradient Descent

The idea of Gradient Descent is following the path that ensures the quickest win near a given point. The direction specifying this path is given by the negative gradient at the current point:

$s = – \nabla f(p)$

The negative gradient corresponds to an angle of $\alpha=180^{\circ}=\pi$, thus is exactly at the center of all possible descent directions given by \(90^{\circ} < \alpha < 270^{\circ}[/latex]. It's an almost trivial, natural choice since it obviously satisfies [latex]s^T\nabla f(p)<0[ latex],="" the="" defining="" property="" of="" a="" decent="" direction.="" Due="" to="" this="" negative="" gradient="" stands="" out,="" but="" there="" is="" more.="" In="" Optimization="" Theory="" we="" love="" formulate="" anything="" as="" minimization="" problem="" and="" in="" fact="" (scaled)="" always="" solution="" following="" auxiliary="" constrained minimization problem

[latex]\min\ s^T\nabla f(p),\quad {s\in \mathbb{R}^n,\ \|s \|=1}.\)

We have already seen that any $s$ with $s^T\nabla f(p)<0[ latex]="" is="" a="" descent="" direction="" at="" the="" point="" [latex]p[="" latex].="" Furthermore,="" if="" we="" fix="" length="" of="" [latex]s[="" as="" done="" here="" by="" [latex]\|s="" \|="1[/latex]," value="" [latex]s^T\nabla="" f(p)[="" direct="" estimate="" for="" progress="" that="" can="" be="" made="" near="" in="" terms="" achievable="" reduction="" function="" when="" actually="" performing="" step="" This="" follows="" from="" Taylor series expansion of [latex]f$ around $p$ in direction $s$ which is beyond the scope of this post. But we keep in mind that the solution of this auxiliary problem – the negative gradient – guarantees “optimal” progress near the given point. This is the reason why it is called the direction of steepest descent, the “best” descent direction in a local sense because it allows for the maximum immediate progress.

Greedy methods

Algorithms like Gradient Descent that are based on locally optimal steps are often called greedy because they take the quickest win offered at the current point. It would not be easy to convince us that we might consider something else than this greedy choice if there is an urgent need for progress – e.g. if we just are thirsty enough, to refer to our find-water-in-the-desert scenario from part 1 . But – as it is the case with greedy choices quite often – taking the quick win might not be the best in the long run. The most prominent example for this phenomenon probably is the Travelling Salesman Problem . Here, the greedy approach can even lead to the worst possible solution. The situation with Gradient Descent is somewhat comparable. Fortunately, it always works in the sense that finally you end up near a critical point. But depending on the curvature of the function the path to the minimum can be arbitrarily long and the number of steps can be very high even if you start “close” to the optimum.

Test functions

We now demonstrate the typical behaviour of Gradient Descent on four test functions, starting with a trivial case and ending with a worst case scenario for Gradient Descent. For each of the functions there will be a contour plot for the $2-D$ case with some additional information. A blue arrow is indicating the optimal search direction pointing directly towards the optimum – the red plus – and a grey arrow indicates the negative gradient.
The first test function is our simple example function, half of the squared norm, that we now call

$f_1(p)=f_1(p_1,…,p_n) := 0.5* \sum\limits_{i=1}^n p_i^2=0.5*\left\|p \right\|^2.$

It is a so-called radial function , i.e. the function value only depends on the norm of $p$ and not on the concrete coordinate entries, thus minimizing it essentially boils down to a one-dimensional problem. We expect our algorithms to notice that and to benefit from it.

The contour lines are circles, and here the negative gradient always coincides with the optimal decent direction, it always points directly to the optimum. This example is the best case scenario, we will see that the method converges immediately to the unique global minimum, there is no dependence on the dimensionality of the problem nor the initial value. The function is invariant under all rotations that leave the origin fixed, in particular symmetric, i.e. you can interchange $p_i$ and $p_j$ and the function value remains the same.
The second function $f_2$ is an anisotropic variant of $f_1$, given by

$f_2(p)=f_2(p_1,…,p_n) := 0.5* \sum\limits_{i=1}^n (i*p_i)^2$,

i.e. the function value changes depending on the direction. There is no rotation invariance and no symmetry, when interchanging variables. This should slow down the convergence of our algorithms a bit but in the end the function still is, well, mostly harmless.

The contour lines are ellipses, and depending on the scaling of the different axes – in our case the scaling is represented by the factors $i$ – the negative gradient can be almost orthogonal to the optimal direction. Here, dimensionality and initial conditions start to matter significantly. In fact Gradient descent in general is highly sensitive to the starting conditions. Concerning $f_2$ this dependency can roughly be characterized as follows: If the starting point is located on one of the coordinate axes, then Gradient descent is optimal, but if the starting point is far away from the axes then the negative direction is clearly suboptimal as already mentioned.
The third function is

$f_3(p):=f_3(p_1,…,p_n):=\sum\limits_{i=1}^n p_i^4$,

again a simple symmetric function, but not radial or rotational invariant and thus slightly less trivial than $f_1$.

The contour lines look like squares with rounded corners, and again the negative gradient can differ significantly from the optimal direction depending on the current point.
The last and most interesting function $f_4$ is a variant of the Rosenbrock function, a well-known test function for optimization routines, see e.g. the Wikipedia entry for the Rosenbrock function :

$f_4(p):=f_4(p_1,…,p_n):=\sum\limits_{i=1}^{n-1} [(a-p_i)^2 + b* (p_{i+1}-p_i^2)^2]$,

the standard parameter choice being $a=1$ and $b=100$. The Rosenbrock function is highly anisotropic as we will see from the contour plot. Furthermore, there are “mixed” terms, i.e. nonlinear dependencies between the different variables $p_1,…,p_n$, take e.g. the term $(p_2-p_1^2)^2 = p_2^2 -2*p_2p_1^2+ p_1^4$. The Rosenbrock function is also known as Rosenbrock’s banana function, a name that indicates the somewhat special shape which exhibits a noticeable curvature.

You might have noticed that the contour plot has a different parameter setting than the standard choice as mentioned in the definition of the function $f_4$. The plot shows the contour lines for $a=1,b=5$ and not $a=1,b=100$. It’s somewhat funny that I actually had to change the parameter $b$ to a much less critical value in order to make the extreme curvature visible. With $b=100$ the curvature is so extreme that a standard contour plot cannot resolve it accurately, everything seems to be symmetric, take a look yourself.

From this plot it is rather obvious that the negative gradient – they grey arrow – can be a very poor search direction despite its property of being locally optimal. Furthermore, due to the enormous length of the gradient the stepsize will be tiny resulting in a very high number of function evaluations as we will see later.

Stepsize methods

In order to get a feeling for the importance and the impact of stepsize methods we are going to compare three different variants, the standard Armijo rule, the Armijo rule with the widening approach I have already mentioned in part 1 ) of this series, and an exact stepsize method based on a simple but rather robust bisection method, that basically finds the root of a univariate function in an appropriate interval. We apply this bisection method to our line search problem, i.e. finding the root of the directional derivative with the direction being the negative gradient, which is a one-dimensional problem. The exact method is in general not competitive in terms of runtime or numerical costs because it needs a huge amount of gradient evaluations but here we are mostly interested in how the number of iterations is affected by this choice, a qualitative comparison. Intuitively one would guess that solving the line search problem exact would result in faster convergence, i.e. less iterations. Let’s figure out how good our intuition is.

The first numerical tests

For a sound comparison of the performance of the different methods that we intend to introduce in this series one cannot look at the runtime only. Since all our methods are iterative methods we have to consider at least the number of steps and the most important computations that are involved. To get a basic overview we use four different values: runtime, number of iterations, number of function evaluations and number of gradient evaluations. At first we try to get a feeling for how the dimension of the problem and the choice of the initial values affects the algorithmic behaviour. The optimization procedure stops if the norm of the gradient at the current iterate is smaller than $tol=1e^{-6}$.

Initial values for $f_1,f_2,f_3$

For these functions that are rather similar anyway and share the same minimum at $(0,…,0)$ we use three initial values with the meaningful names good_point / average_point / bad_point that are chosen to cause an algorithmic behaviour that hopefully justifies these names. The concrete values are

$latex
\begin{matrix}
good\_point & = & \dfrac{1}{10}*(1,…,1)\\
& & \\
average\_point & = & -\dfrac{10}{n} * (1,2,…,n)\\
& & \\
bad\_point & = & \dfrac{100}{n} * (1,2,…,n)
\end{matrix}
$

Initial values for Rosenbrock function $f_4$

The Rosenbrock function $f_4$ is rather different from the others, so I have prepared some special initial values as well. The minimum is at $(1,…,1)$ and due to the curvature of the function I have chosen the following set of points

$latex
\begin{matrix}
good\_point & = & \dfrac{9}{10}*(1,…,1)\\
& & \\
average\_point & = &-\dfrac{100}{n} * (1,2,…,n) + (2,…,2)\\
& & \\
bad\_point & = & \dfrac{100}{n} * (1,2,…,n)
\end{matrix}
$

We do not use all points in all tests, but you can easily do experiments on your own.

Test 1 – Impact of problem dimension

This is just to get a feeling how things scale in higher dimensions. We have used the bad starting point for $f_1,f_2,f_3$, the good starting point for $f_4$ and always the standard Armijo rule.

The trivial problem $\min f_1$ – essentially one-dimensional – is solved in a constant number of iterations regardless of the dimension which is the perfect scaling. Due to our parameter choice it takes exactly one step since a full step in direction of the negative gradient always ends in the global minimum.
The second problem $\min f_2$ – with anisotropies getting more serious with increasing dimension – shows a quadratic complexity as the number of iterations, function and gradient evaluations increases by a factor of roughly $4=2^2$ when the problem dimensions doubles. Such a scaling is problematic as it clearly limits the applicability to small or medium dimensions. Typically one wants to achieve linear or log-linear scaling, i.e. linear up to logarithmic factors.
The third problem $\min f_3$ that is supposed to be more complex as it has a higher degree seems to be nicer. The number of iterations etc. increases only by a factor of roughly $1.26$ when the problem dimensions doubles. This is a sublinear growth, which implies good scalability.
The fourth problem $\min f_4$ appears to be surprisingly harmless, in fact it also shows perfect scaling. Of course the runtimes always increase at a higher rate because the function/gradient evaluations etc. induce a higher number of arithmetical operations in higher dimensions but this is not our primary focus here.

As a general note: We should be careful when drawing conclusion from our data. All we have seen yet is a single data point, so we should not overemphasize these results or generalize from these values too much. Nevertheless, there is something we can take with us: The dimensionality of the minimization problem and the complexity of the objective function are not necessarily related. This we can safely derive as we need only one counterexample – in this case $f_4$ – when disproving hypotheses. It’s much harder to prove them.
Furthermore, the results depend on factors that we tend to overlook easily. For this first test the termination criterion plays a crucial rule. In higher dimensions it is much harder to reduce the norm below an absolute bound like $tol=1e^{-6}$, and if we would e.g. change it such that we apply the criterion to every component, i.e. replacing the Euclidean norm by the maximum norm, we would see a different behaviour. Especially in really high dimensions one has to carefully think about the implications of such things as e.g. criteria using the Euclidean norm can imply accuracy requirements in each coordinate that are way beyond machine precision. Quite often it makes sense to use relative criteria or a mixture of both absolute and relative criteria, or to choose problem-specific norms.

Test 2 – Impact of stepsize – Part 1

Now I would like to draw your attention to the stepsize rules, that are the second-most important ingredients in our algorithmic setup. First we try to evaluate how good our intuition was concerning the difference between exact and inexact stepsizes.

Actually – and if one wants to judge things well-balanced this is quite often the case – we cannot say much about it in general. There are cases where the exact stepsize is dramatically better, but the opposite can happen as well, see Table 4. We take with us, that it’s always a good idea to test things yourself that are not clear a priori. Each stepsize method has it’s applications and you will achieve the best results when finding the one that fits to your specific problem best.

Test 3 – Impact of stepsize – Part 2

In the last test there has been one case – $\min f_3$ – where the Armijo rule turned out to be a surprisingly bad choice. But what exactly is causing this? When you take a closer look you will notice that the algorithm is almost always (except 4 times) accepting the full stepsize, i.e. effectively the Armijo rule is not doing anything. The reason for this is that the length of the negative gradient is rather small, while the search direction is quite good for this problem. And here comes the widening idea into play. If the full negative gradient step leads to sufficient decrease, why not trying even larger steps? Here are the results.

As you can see here, a minimal modification led to a massive improvement, and the widening approach even outperforms the exact stepsize in terms of iterations. It turns out that not the general idea of the Armijo rule – iteratively adapting the steplength – has been “wrong”, it’s just the artificial restriction of not trying larger stepsizes that resulted in a severe degradation of performance. In some sense the standard Armijo rule is greedy as well, it always accepts the first admissible stepsize. Again, this turns out to not be the best choice in general.

Test 4 – Impact of stepsize – Part 3

This is the final test for now, and personally I think this is the most interesting case as well. I have set the parameters of the Rosenbrock function to $a=1,b=5$, which makes it less hard for Gradient Descent but still hard enough.

In dimension $n=20$ everythings looks somewhat unsurprising. In terms of iterations the exact stepsize is the best choice, while widening makes almost no difference at all given the total number of iterations. But in dimension $n=25$ several interesting things start to happen.
First of all, you should forget about the exact stepsize for Gradient Descent applied to the Rosenbrock function even if it has been slightly better for $n=20$. On a general level the reason for this simply is that the search direction is bad anyway, so there is not much you can expect when optimizing in this direction. When you look at the stepsizes you will notice that the behaviour of Armijo and the exaxct stepsize rule is rather different. After the first few iterations Armijo takes an almost constant stepsize, while the exact method exhibits some kind of alternating behaviour. If your search direction is not good you should not overoptimize, it is better to not try making larger steps at any cost, but this is what the exact method is doing regularly. After such a long step, the next step can be orders of magnitude smaller which implies that the situation has become even worse. The path of steps using the exact method proceeds almost at the bottom of the banana-shaped valley. The inexact methods slightly stay away from the very bottom so the convergence is better in the end. If you want to see this on your own, just print out the first 50-100 stepsizes and compare them to the stepsizes after 10000 iterations or so. In the beginning the exact method is taking larger steps from time to time but once it is trapped deep down in the valley the stepsizes remain very small. After some thousands of iterations the Armijo stepsize leads to larger steps on average which is what enables faster convergence.

But the most interesting aspect is the effect of the three(!) widening steps. This tiny change – one widening step every 12000 steps – improves the perfomance by more than 13%. This shows the sometimes extreme sensitivity of Gradient Descent once more and it proves that it can massively pay off when investing in this kind of optimization. That’s exactly the message here.

Do it yourself

If you are interested in experiencing these things on your own feel free to download the code and give it a try. And if you want to see the sometimes strange behaviour of Gradient Descent in action – even stranger than what we have already seen in the above tests, then I can recommend the following setup: Choose dimension $n=8$ (or $n=80$ if you really have a lot of spare time, don’t forget to adapt the max iterations parameter) and try to minimize the Rosenbrock function $f_4$ with the parameters $a=1,b=100$, the standard Armijo rule and the average starting point (this is important, it only works with the average point). Then slightly change the starting point by adding $10^{-2}$ ($10^{-1}$ if you can’t wait anymore and want a quick result) to any coordinate of the average starting point (R does this for you automatically, you can just add it as a constant: + 1e-2 or 1e-1). You need approximately ten minutes for this in $n=8$ dimensions, just in case you are planning to test the $n=80$ scenario first.

Fixed stepsize

One comment on fixed stepsizes: Quite often you can find descriptions of Gradient Descent without any stepsize rule, just using a fixed small stepsize, see e.g. the Wikipedia entry for Gradient Descent . While this in theory works if the stepsize is small enough I cannot recommend this by any means. The small effort invested in choosing a reasonable stepsize definitely pays off and the method runs orders of magnitude faster. You can test this yourself if you like, just follow the instructions in the Readme . I did not present any numbers here, but if you want to have some fun then try minimizing $f_3$ in dimension $n=7$ starting from the average point in less than 1 million iterations – and don’t just relax $tol$.

Summary

Finally, we have our first interesting results at hand. In particular we have learned something about stepsizes and how Gradient Descent behaves in different settings. We have seen best and worst cases, sensitivity with respect to initial values and the massive impact of minimal changes, e.g. Armijo with widening in Test 3.
Let me thank you for your attention, it’s been a long post but I hope it’s been worth the time. In the next part of this series we will see how the Nonlinear Conjugate Gradient method performs when applied to our test functions, especially the Rosenbrock function. And of course we will learn how Nonlinear CG works algorithmically and why and when it might be a good idea to choose this approach. See you soon…

Other articles in this series:
Machinery – Linear Regression
Machinery – Part 1

Was this post helpful?

Likes

Blog author

Stefan Kühn

Do you still have questions? Just send me a message.

fromStefan Kühn

The Machinery behind Machine Learning – A Benchmark for Linear Regression

This is the third post in the “Machinery behind Machine Learning” series and after all the “academic” discussions it is time to show meaningful results for some of the most prominent Machine Learning algorithms – there have been quite a few requests ...

Machine Learning

14.1.2016 | 10 Minuten Lesezeit

Stefan Kühn

data2day 2015 – Ein Konferenzbericht

Auch dieses Jahr war wieder ein Team der codecentric AG auf der data2day. Die Konferenz hat sich sehr positiv weiterentwickelt und hat sich in unseren Augen als fester Bestandteil der deutschen Data-, Big-Data- und Data-Science-Konferenzlandschaft etabliert...

14.10.2015 | 3 Minuten Lesezeit

Stefan Kühn

The Machinery behind Machine Learning – Part 1

Machine Learning is one of the hottest topics on the web. But how do machines learn? And do they learn at all? The answer in most cases is: Machines do not learn, they optimize. This is the first blog post in a small series about the machinery that ...

Machine Learning

19.3.2015 | 15 Minuten Lesezeit

Stefan Kühn

Your job at codecentric?

Jobs

Agile Developer und Consultant (w/d/m)

Alle Standorte

Eine Einführung in Federated Learning im industriellen Kontext: Fortgeschritten

Im Bereich des maschinellen Lernens wurde eine lange Zeit angenommen, dass die Eingabedaten von Modellen und Gewichten sicher sei und nicht extrahiert werden könnten. In den letzten Jahren veröffentlichte Forschung hat diese Annahme in Frage gestellt...

Machine Learning
Big Data
Data Science
Data

18.9.2023 | 8 Minuten Lesezeit

Ihsan Kisi

Eine Einführung in Federated Learning im industriellen Kontext: Grundlagen

Mithilfe von Daten können Unternehmen fundiertere Entscheidungen treffen, ihre Arbeitsabläufe optimieren und mit der Kraft des maschinellen Lernens (ML) einen Vorteil in der wettbewerbsintensiven Geschäftswelt erlangen. Allerdings ist der Umgang mit ...

Machine Learning
Data Science
Data
Big Data

25.8.2023 | 7 Minuten Lesezeit

Ihsan Kisi

Große Sprachmodelle: Was ist ein LLM?

Große Sprachmodelle (Large Language Models oder LLM) haben in den letzten Jahren enorme Fortschritte gemacht und spielen eine entscheidende Rolle in verschiedenen Anwendungen. Aber was ist ein LLM? Es ist sinnvoll zu erklären, was ein „einfaches“ Sprachmodell...

Machine Learning

20.6.2023 | 4 Minuten Lesezeit

Elvira Siegel

Smart DistancR – Perspektivisch korrekte Distanzmessung zwischen Personen

Die Corona-Krise ist weiterhin in aller Munde und wird uns mit hoher Wahrscheinlichkeit noch etwas länger begleiten. Wie man aus unterschiedlichen Statistiken erfährt, schwanken die Fallzahlen weiter und sorgen für zusätzliche Restriktionen. Diese werden...

Computer Vision
Künstliche Intelligenz
IoT
Machine Learning

13.12.2021 | 7 Minuten Lesezeit

Michel Ehmen

Machine-Learning-Modelle bewerten – Quality Gates etablieren

Die Qualität bzw. Nützlichkeit von Machine-Learning-Modellen lässt sich mit Hilfe von Testdaten und Metriken bewerten. Allerdings in welchem Umfang? Manuell, automatisiert, einmalig, regelmäßig? Manuell lassen sich die ersten Modelle als Ergebnis eines...

Data
Machine Learning
Softwareentwicklung
CI/CD

7.12.2021 | 7 Minuten Lesezeit

Berthold Schulte

Kürzere Time-to-Market für ML-Modelle durch Googles BigQuery ML

Machine Learning (ML) erzeugt erst dann realen Mehrwert, wenn es in Produktion benutzt wird. Allerdings kann die Zeitspanne zwischen der Entwicklung eines belastbaren Modells und dessen Einsatz frustrierend lange sein. Insbesondere in schnelllebigen ...

Agile Methoden
Cloud
Machine Learning

26.7.2021 | 5 Minuten Lesezeit

Timo Böhm

Niklas Haas

Schnelles Training eines Recommendation-Modells durch BigQuery ML

Machine Learning (ML) kann nur durch Modelle in der Produktion Business Value erzeugen. Allerdings kann die Zeitspanne zwischen der Entwicklung der nächsten Iteration eines Modells und dessen Einsatz in einer Produktionsumgebung massiv sein. Dies gilt...

Accelerate
Cloud
Data
Google Cloud
Machine Learning

26.7.2021 | 11 Minuten Lesezeit

Niklas Haas

Timo Böhm

KI, Daten und Infrastruktur – ML-Systeme schnell Ende-zu-Ende verproben...

Heutzutage steht fast alles, was mit den Labels „künstliche Intelligenz (KI)“ oder „Machine Learning (ML)“ versehen ist, für Fortschritt. Seltsamerweise schließt diese Assoziation jedoch häufig die Themen Daten und Dateninfrastruktur nicht ausreichend...

Kultur
Data
Machine Learning

21.6.2021 | 12 Minuten Lesezeit

Marcel Mikl

Schnelles KI-Prototyping mit Google Cloud AutoML Vision

Bei klassischen Machine-Learning-(ML-)Projekten beschäftigen sich Data Scientists häufig längere Zeit (mehrere Monate) mit der Entwicklung eines ML-Modells. Dabei werden hohe Kosten verursacht und die Zeit, bis ein erstes Modell zur Verfügung steht, ...

Cloud
Computer Vision
Data
Künstliche Intelligenz
Google Cloud
Machine Learning

17.5.2021 | 5 Minuten Lesezeit

Nils Bauroth

Sven Rediske

KI in der Praxis: Fehlerhafte Bauteile mit Rekognition auf AWS identifizieren

Noch vor kurzer Zeit mussten für den Einsatz von künstlicher Intelligenz (KI) unter großem Aufwand eigene KI-Modelle erstellt werden. Heute ist für viele Anwendungsfälle die Einstiegshürde in die Welt der KI durch Cloud-Computing-Dienste stark gesunken...

Cloud
Computer Vision
Data
Künstliche Intelligenz
Machine Learning
Python

29.7.2020 | 11 Minuten Lesezeit

Marcel Mikl

Nico Axtmann

KI in der Praxis: Fehlerhafte Bauteile mit AutoML in der Google Cloud ...

Noch vor kurzer Zeit war der Einsatz von künstlicher Intelligenz (KI) nur mit großem Aufwand und Konstruktion eigener neuronaler Netze möglich. Heute ist die Einstiegshürde in die Welt der KI durch Cloud-Computing-Dienste stark gesunken. So kann man ...

Cloud
Computer Vision
Data
Python
Machine Learning
Google Cloud
Künstliche Intelligenz

8.7.2020 | 11 Minuten Lesezeit

Nico Axtmann

Marcel Mikl

KI für KMU: (Teil-)Automatisierung der Qualitätskontrolle von Bauteilen

Noch vor kurzer Zeit war der Einsatz von künstlicher Intelligenz (KI) nur mit großem Aufwand und ausreichend Spezialwissen möglich. Hauptsächlich große Internet-Konzerne wie Google, Apple und Facebook hatten das Geld, die Daten und die Expertise, um ...

Data
Machine Learning
Künstliche Intelligenz

6.7.2020 | 7 Minuten Lesezeit

Marcel Mikl

Nico Axtmann

BIE Spotty – unsere Lösung beim BIE City Hackathon

Typischerweise sind bei Hackathons viele Soft- und Hardware-Entwickler zu finden, die innerhalb eines begrenzten Zeitraums versuchen, kreative und ungewöhnliche Lösungen in Form von Code und ersten Prototypen für vorher definierte Challenges zu erarbeiten...

IoT
Computer Vision
IT-Security
Machine Learning

2.7.2020 | 5 Minuten Lesezeit

Meike Wocken

Machine Learning in der Praxis. Eine Mate mit … Matthias Niehoff #EineMateMit

Machine Learning und künstliche Intelligenz sind aktuell in aller Munde und versprechen vielfältige Einsatzmöglichkeiten im Unternehmen. Trotzdem tun sich viele Unternehmen aktuell noch schwer, das Potential der Technologie zu nutzen. „Der Fokus liegt...

Künstliche Intelligenz
Data
Community
Machine Learning

27.5.2020 | 1 Minuten Lesezeit

Matthias Niehoff

Wie man Data-Science-Projekte nicht in die PoC-Sackgasse manövriert

Warum gelingt es Data-Science-Initiativen häufig nicht, einen echten Mehrwert zu schaffen? Wir haben einige Ursachen dafür ausgemacht. In diesem Blogpost stellen wir vier typische Fallen für Data-Science-Projekte vor und geben Tipps, wie Du sie umschiffen...

Machine Learning
Data
Künstliche Intelligenz
Softwareentwicklung

27.3.2020 | 11 Minuten Lesezeit

Marcel Mikl

Machine-Learning-Modelle bewerten – die Crux mit den Testdaten

Machine-Learning-Technologien lassen sich erfolgreich und praxisnah im Unternehmensumfeld einsetzen. Ein konkreter, überschaubarer Anwendungsfall und somit fokussierter Einsatz von Machine-Learning-Modellen kann dabei echten Mehrwert generieren. Dieser...

Data
Machine Learning
Data Science

25.3.2020 | 5 Minuten Lesezeit

Berthold Schulte

Deployment von Machine-Learning-Modellen mit Seldon Core

In diesem Artikel sehen wir uns an, wie wir Machine-Learning- und Deep-Learning-Modelle mit Seldon Core deployen können. Seldon Core ist eine Open-Source-Plattform, um Modelle auf einem Kubernetes-Cluster in Betrieb zu nehmen. Bevor wir uns Seldon Core...

Softwarearchitektur
Data
Künstliche Intelligenz
Machine Learning

9.9.2019 | 7 Minuten Lesezeit

Nico Axtmann

Data Science in der Praxis: Häufige Fehler und Vorgehen

In diesem Artikel gehen wir auf die Besonderheiten von Data Science in der Praxis ein. Wir konzentrieren uns auf die technischen Unterschiede, häufige Fehler und Herausforderungen. Dabei lassen wird die sozialen und kommunikativen Aspekte außen vor. ...

Agilität
Machine Learning
Data

28.8.2019 | 11 Minuten Lesezeit

Nico Axtmann

Inbetriebnahme eines scikit-learn-Modells mit ONNX und FastAPI

Dieser Artikel befasst sich mit dem Deployment eines Machine-Learning-Modells, das den Wert eines Hauses in Boston anhand gewisser Merkmale wie der Kriminalitätsrate des Bezirks und der Anzahl der Räume in einer Wohnung bestimmen kann. Im ersten Schritt...

Data
Python
Künstliche Intelligenz
Machine Learning

6.8.2019 | 3 Minuten Lesezeit

Nico Axtmann

Machine-Learning-Modelle bewerten – die Crux mit der Metrik

Ist ein Modell erst einmal trainiert, kann es auf verschiedene Art und Weise und mit mehr oder weniger komplexen und aussagekräftigen Verfahren und Metriken bewertet werden. Die Anzahl und möglichen Kriterien, ein Modell zu bewerten, sind allerdings....

Data
Machine Learning
Softwareentwicklung

1.7.2019 | 13 Minuten Lesezeit

Berthold Schulte

Gemeinsam bessere Projekte umsetzen.

Wir helfen deinem Unternehmen.

Du stehst vor einer großen IT-Herausforderung? Wir sorgen für eine maßgeschneiderte Unterstützung. Informiere dich jetzt.

Hilf uns, noch besser zu werden.

Wir sind immer auf der Suche nach neuen Talenten. Auch für dich ist die passende Stelle dabei.

Contact

Send

The Machinery behind Machine Learning – Part 2

Gradient Descent

Greedy methods

Test functions

Stepsize methods

The first numerical tests

Initial values for \(f_1,f_2,f_3\)

Initial values for Rosenbrock function \(f_4\)

Test 1 – Impact of problem dimension

Test 2 – Impact of stepsize – Part 1

Test 3 – Impact of stepsize – Part 2

Test 4 – Impact of stepsize – Part 3

Do it yourself

Fixed stepsize

Summary

Was this post helpful?

Ja

Blog author

Get in contact

Get in contact

More articles

The Machinery behind Machine Learning – A Benchmark for Linear Regression

data2day 2015 – Ein Konferenzbericht

The Machinery behind Machine Learning – Part 1

Your job at codecentric?

Agile Developer und Consultant (w/d/m)

View Job

More articles in this subject area

Eine Einführung in Federated Learning im industriellen Kontext: Fortgeschritten

Eine Einführung in Federated Learning im industriellen Kontext: Grundlagen

Große Sprachmodelle: Was ist ein LLM?

Smart DistancR – Perspektivisch korrekte Distanzmessung zwischen Personen

Machine-Learning-Modelle bewerten – Quality Gates etablieren

Kürzere Time-to-Market für ML-Modelle durch Googles BigQuery ML

Schnelles Training eines Recommendation-Modells durch BigQuery ML

KI, Daten und Infrastruktur – ML-Systeme schnell Ende-zu-Ende verproben...

Schnelles KI-Prototyping mit Google Cloud AutoML Vision

KI in der Praxis: Fehlerhafte Bauteile mit Rekognition auf AWS identifizieren

KI in der Praxis: Fehlerhafte Bauteile mit AutoML in der Google Cloud ...

KI für KMU: (Teil-)Automatisierung der Qualitätskontrolle von Bauteilen

BIE Spotty – unsere Lösung beim BIE City Hackathon

Machine Learning in der Praxis. Eine Mate mit … Matthias Niehoff #EineMateMit

Wie man Data-Science-Projekte nicht in die PoC-Sackgasse manövriert

Machine-Learning-Modelle bewerten – die Crux mit den Testdaten

Deployment von Machine-Learning-Modellen mit Seldon Core

Data Science in der Praxis: Häufige Fehler und Vorgehen

Inbetriebnahme eines scikit-learn-Modells mit ONNX und FastAPI

Machine-Learning-Modelle bewerten – die Crux mit der Metrik

Gemeinsam bessere Projekte umsetzen.

Wir helfen deinem Unternehmen.

Unsere Leistungen

Hilf uns, noch besser zu werden.

Zu den Jobangeboten