increases as we move to the right and decreases as we move to the left. This means
f′(x − ε) < 0 and f′(x + ε) > 0 for small enough ε. In other words, as we move
right, the slope begins to point uphill to the right, and as we move left, the slope
begins to point uphill to the left. Thus, when f′(x) = 0 and f″(x) > 0, we can
conclude that x is a local minimum. Similarly, when f′(x) = 0 and f″(x) < 0, we
can conclude that x is a local maximum. This is known as the second derivative
test. Unfortunately, when f″(x) = 0, the test is inconclusive. In this case x may
be a saddle point or a part of a flat region.
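The test is straightforward to check numerically. The sketch below is illustrative only (the finite-difference step eps and the tolerance tol are arbitrary choices, not values from the text); it approximates f′ and f″ with central differences and reports the conclusion of the second derivative test.

```python
import numpy as np

def classify_critical_point_1d(f, x, eps=1e-5, tol=1e-8):
    """Apply the second derivative test at x using central finite differences."""
    # f'(x) ~ (f(x + eps) - f(x - eps)) / (2 eps)
    first = (f(x + eps) - f(x - eps)) / (2 * eps)
    # f''(x) ~ (f(x + eps) - 2 f(x) + f(x - eps)) / eps^2
    second = (f(x + eps) - 2 * f(x) + f(x - eps)) / eps**2
    if abs(first) > tol:
        return "not a critical point"
    if second > tol:
        return "local minimum"
    if second < -tol:
        return "local maximum"
    return "inconclusive (saddle point or flat region)"

print(classify_critical_point_1d(lambda x: x**2, 0.0))   # local minimum
print(classify_critical_point_1d(lambda x: -x**2, 0.0))  # local maximum
print(classify_critical_point_1d(lambda x: x**3, 0.0))   # inconclusive
```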
In multiple dimensions, we need to examine all the second derivatives of the
function. Using the eigendecomposition of the Hessian matrix, we can generalize
the second derivative test to multiple dimensions. At a critical point, where
∇_x f(x) = 0, we can examine the eigenvalues of the Hessian to determine whether
the critical point is a local maximum, local minimum, or saddle point. When the
Hessian is positive definite (all its eigenvalues are positive), the point is a local
minimum. This can be seen by observing that the directional second derivative
in any direction must be positive, and making reference to the univariate second
derivative test. Likewise, when the Hessian is negative definite (all its eigenvalues
are negative), the point is a local maximum. In multiple dimensions, it is actually
possible to find positive evidence of saddle points in some cases. When at least
one eigenvalue is positive and at least one eigenvalue is negative, we know that
x
is a local maximum on one cross section of
f
but a local minimum on another
cross section. See figure 4.5 for an example. Finally, the multidimensional second
derivative test can be inconclusive, just as the univariate version can. The test
is inconclusive whenever all the nonzero eigenvalues have the same sign but at
least one eigenvalue is zero. This is because the univariate second derivative test is
inconclusive in the cross section corresponding to the zero eigenvalue.
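As a concrete illustration (a sketch, not code from the text), the multidimensional test can be carried out by inspecting the eigenvalues of the Hessian at a critical point; the tolerance tol below is an arbitrary threshold for treating an eigenvalue as numerically zero.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-8):
    """Classify a critical point (where the gradient is zero) from its Hessian."""
    eigvals = np.linalg.eigvalsh(hessian)   # the Hessian is symmetric, so use eigvalsh
    if np.all(eigvals > tol):
        return "local minimum"               # positive definite
    if np.all(eigvals < -tol):
        return "local maximum"               # negative definite
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"                # at least one positive and one negative eigenvalue
    return "inconclusive"                    # some eigenvalue is (numerically) zero

# f(x) = x1^2 + x2^2 has Hessian 2I at its critical point (0, 0): a local minimum.
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 2.0]])))
# f(x) = x1^2 - x2^2 has Hessian diag(2, -2) at (0, 0): a saddle point.
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]])))
```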
In multiple dimensions, there is a different second derivative for each direction
at a single point. The condition number of the Hessian at this point measures
how much the second derivatives differ from each other. When the Hessian has a
poor condition number, gradient descent performs poorly. This is because in one
direction, the derivative increases rapidly, while in another direction, it increases
slowly. Gradient descent is unaware of this change in the derivative, so it does not
know that it needs to explore preferentially in the direction where the derivative
remains negative for longer. A poor condition number also makes it difficult to
choose a good step size. The step size must be small enough to avoid overshooting
the minimum and going uphill in directions with strong positive curvature. This
usually means that the step size is too small to make significant progress in other
directions with less curvature. See figure 4.6 for an example.
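A tiny numerical illustration (hypothetical numbers, not the example of figure 4.6): on a quadratic whose Hessian has condition number 1000, any step size small enough to be stable in the high-curvature direction makes progress in the low-curvature direction very slow.

```python
import numpy as np

# Quadratic bowl f(x) = 1/2 x^T H x with curvatures 1 and 1000, so the Hessian's
# condition number is 1000. (Hypothetical values chosen for illustration.)
H = np.diag([1.0, 1000.0])
grad = lambda x: H @ x          # gradient of the quadratic

x = np.array([10.0, 10.0])
lr = 0.0019                     # any step size above 2/1000 would overshoot and diverge
for _ in range(50):             # 50 gradient descent steps
    x = x - lr * grad(x)

print(x)  # the high-curvature coordinate is near 0; the low-curvature one has barely moved
```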