Welcome to roadstat.com on July 5 2009.
This is an internet experiment running to monitor browsing habbits of individuals through wikipedia contents.

Gradient descent

From Wikipedia, the free encyclopedia

Jump to: navigation, search

Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or the approximate gradient) of the function at the current point. If instead one takes steps proportional to the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent.

Gradient descent is also known as steepest descent, or the method of steepest descent. When known as the latter, gradient descent should not be confused with the method of steepest descent for approximating integrals.

Contents

[edit] Description

Illustration of gradient descent

Gradient descent is based on the observation that if the real-valued function F(\mathbf{x}) is defined and differentiable in a neighborhood of a point \mathbf{a}, then F(\mathbf{x}) decreases fastest if one goes from \mathbf{a} in the direction of the negative gradient of F at \mathbf{a}, -\nabla F(\mathbf{a}). It follows that, if

\mathbf{b}=\mathbf{a}-\gamma\nabla F(\mathbf{a})

for γ > 0 a small enough number, then F(\mathbf{a})\geq F(\mathbf{b}). With this observation in mind, one starts with a guess \mathbf{x}_0 for a local minimum of F, and considers the sequence \mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \dots such that

\mathbf{x}_{n+1}=\mathbf{x}_n-\gamma_n \nabla F(\mathbf{x}_n),\ n \ge 0.

We have

F(\mathbf{x}_0)\ge F(\mathbf{x}_1)\ge F(\mathbf{x}_2)\ge \dots,

so hopefully the sequence (\mathbf{x}_n) converges to the desired local minimum. Note that the value of the step size γ is allowed to change at every iteration.

This process is illustrated in the picture to the right. Here F is assumed to be defined on the plane, and that its graph has a bowl shape. The blue curves are the contour lines, that is, the regions on which the value of F is constant. A red arrow originating at a point shows the direction of the negative gradient at that point. Note that the (negative) gradient at a point is orthogonal to the contour line going through that point. We see that gradient descent leads us to the bottom of the bowl, that is, to the point where the value of the function F is minimal.

[edit] Examples

Gradient descent has problems with pathological functions such as the Rosenbrock function shown here. The Rosenbrock function has a narrow curved valley which contains the minimum. The bottom of the valley is very flat. Because of the curved flat valley the optimization is zig-zagging slowly with small stepsizes towards the minimum.

The gradient ascent method applied to F(x,y)=\sin\left(\frac{1}{2} x^2 - \frac{1}{4} y^2 + 3 \right) \cos(2 x+1-e^y):

The gradient descent algorithm in action. (1: contour)The gradient descent algorithm in action. (2: surface)



Gradient descent can also be used to solve a system of nonlinear equations. Below is an example that shows how to use the gradient descent to solve for three unknown variables, x1, x2, and x3. This example shows one iteration of the gradient descent.

Consider a nonlinear system of equations:


Image:Steepestdescent1.PNG



There must be an initial assumption as to what the solutions for x1, x2, and x3 are. In this case assume that all the solutions are zero. The next step is to form the Jacobian matrix of the given system of equations.


Image:Steepestdescent2.PNG


Plug in the initial assumption of zero into the Jacobian matrix and original system to get the answers below.

Image:Steepestdescent3.PNG

Now the gradient and z0 must be found. This is done by using the formula below and the previous two calculations made. The value for z0 is found by simply calculating the p-norm of the gradient, where p is equal to 2.


Image:Steepestdescent5.PNG

Now the value for z must be found.

Image:Steepestdescent6.PNG

The next step is to find solutions for g1, g2, and g3. First there must be a value chosen for α1. Set α1 to zero and α3 to one.

Image:Steepestdescent7.PNG

There is a condition that must be met when the solutions for g1 and g3 are found. That condition is that g3 must be less than g1 and if this is true we accept α1 and α3. If this is not true then a different value must be chosen for α3, one which will satisfy this condition. Try choosing a smaller value for α3 like 1/2 or 1/4.

To find α2 all that needs to be done is to divide α3 by 2 → α2 = α3 / 2 . With α2 found, g2 can be calculated.


Image:Steepestdescent8.PNG

The next step is to form a Newton polynomial, Newton's forward divided difference interpolating polynomial.

Image:Steepestdescent9.PNG

Using all the calculations of α1, α2, α3, g1, g2, and g3 the unknown values of h1, h2, and h3 can be found, which is shown below.

Image:Steepestdescent10.PNG

With all the values for h1, h2, and h3 known, the next step is to plug them into the polynomial and solve for α. A value for g0 must be found and must satisfy the condition of being less than g1 and g3. If this condition is met then use the following equation below to solve for x(1) which involves the initial x(0), α which becomes α0, and z.

Image:Steepestdescent11.PNG

And now there is a set of solutions that satisfy the system of nonlinear equations.

[edit] Comments

Gradient descent works in spaces of any number of dimensions, even in infinite-dimensional ones. In the latter case the search space is typically a function space, and one calculates the Gâteaux derivative of the functional to be minimized to determine the descent direction.

Two weaknesses of gradient descent are:

  1. The algorithm can take many iterations to converge towards a local minimum, if the curvature in different directions is very different.
  2. Finding the optimal γ per step can be time-consuming. Conversely, using a fixed γ can yield poor results. Methods based on Newton's method and inversion of the Hessian using conjugate gradient techniques are often a better alternative.

A more powerful algorithm is given by the BFGS method which consists in calculating on every step a matrix by which the gradient vector is multiplied to go into a "better" direction, combined with a more sophisticated line search algorithm, to find the "best" value of γ.

Gradient descent is in fact Euler's method for solving ordinary differential equations applied to a gradient flow. As the goal is to find the minimum, not the flow line, the error in finite methods is less significant.

[edit] A computational example

The gradient descent algorithm is applied to find a local minimum of the function f(x)=x4-3x3+2 , with derivative f'(x)=4x3-9x2. Here is an implementation in the C programming language.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
 
int main ()
{
	// From calculation, we expect that the local minimum occurs at x=9/4
	// The algorithm starts at x=6
 
	double xOld = 0;
	double xNew = 6;
	double eps = 0.01; // step size
	double precision = 0.00001;
	while (fabs(xNew - xOld) > precision)
	{
             xOld = xNew;
             xNew = xNew - eps*(4*xNew*xNew*xNew-9*xNew*xNew);
	}
 
	printf ("Local minimum occurs at %lg\n", xNew);
 
}

With this precision, the algorithm converges to a local minimum of 2.24996 in 70 iterations.

A more robust implementation of the algorithm would also check whether the function value indeed decreases at every iteration and would make the step size smaller otherwise. One can also use an adaptive step size which may make the algorithm converge faster.

[edit] See also

[edit] References

  • Mordecai Avriel (2003). Nonlinear Programming: Analysis and Methods. Dover Publishing. ISBN 0-486-43227-0.
  • Jan A. Snyman (2005). Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-Based Algorithms. Springer Publishing. ISBN 0-387-24348-8
Personal tools

Visit joltnews for the latest headlines
Visit bloit.com for company information
Geed Media does computer consulting on long island.
This page viewed times. See Logs