probability distribution over images, text strings, and sounds that occur in real life is
highly concentrated. Uniform noise essentially never resembles structured inputs
from these domains. Figure 5.12 shows how, instead, uniformly sampled points
look like the patterns of static that appear on analog television sets when no signal
is available. Similarly, if you generate a document by picking letters uniformly at
random, what is the probability that you will get a meaningful English-language
text? Almost zero, again, because most long sequences of letters do not
correspond to a natural language sequence: the distribution over natural language
sequences occupies only a very small volume in the total space of sequences of letters.
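The following is a minimal sketch of this thought experiment, not taken from the text: it samples short strings with each letter drawn uniformly at random and estimates how often they happen to be English words. The tiny word set below is a hypothetical stand-in for a real dictionary, and the string length and sample count are arbitrary choices for illustration.

```python
# Monte Carlo sketch of the "random letters" thought experiment.
import random
import string

# Hypothetical stand-in for a real English dictionary.
english_words = {"the", "cat", "dog", "and", "for", "are", "was", "you"}

def random_string(length):
    """Draw each letter uniformly and independently from a-z."""
    return "".join(random.choice(string.ascii_lowercase) for _ in range(length))

num_samples = 100_000
length = 3
hits = sum(random_string(length) in english_words for _ in range(num_samples))

# Even for length-3 strings, only a tiny fraction of the 26**3 = 17,576
# possibilities are words; for sentence-length sequences the fraction of
# meaningful text is astronomically smaller.
print(f"fraction of uniform strings that are words: {hits / num_samples:.6f}")
```

For longer sequences the estimated fraction quickly becomes indistinguishable from zero, which is the sense in which natural language occupies a negligible volume of the space of letter sequences.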
Of course, concentrated probability distributions are not sufficient to show that
the data lies on a reasonably small number of manifolds. We must also establish
that the examples we encounter are connected to each other by other examples,
with each example surrounded by other highly similar examples that can be reached
by applying transformations to traverse the manifold. The second argument in
favor of the manifold hypothesis is that we can imagine such neighborhoods and
transformations, at least informally. In the case of images, we can certainly think
of many possible transformations that allow us to trace out a manifold in image
space: we can gradually dim or brighten the lights, gradually move or rotate
objects in the image, gradually alter the colors on the surfaces of objects, and so
forth. Multiple manifolds are likely involved in most applications. For example,
the manifold of human face images may not be connected to the manifold of cat
face images.
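As a rough illustration of such transformations, the sketch below applies small brightness and translation changes to an image array, producing a sequence of nearby, highly similar images. The image here is a synthetic gradient rather than a natural photograph, and the transformation functions are hypothetical helpers written for this example.

```python
# Sketch: small, gradual transformations trace out a path along an image manifold.
import numpy as np

# Synthetic grayscale "image" with values in [0, 1].
image = np.linspace(0.0, 1.0, 64 * 64).reshape(64, 64)

def brighten(img, delta):
    """Move along the 'lighting' direction of the manifold."""
    return np.clip(img + delta, 0.0, 1.0)

def shift_right(img, pixels):
    """Move along the 'object position' direction of the manifold."""
    return np.roll(img, shift=pixels, axis=1)

# Applying many small transformations yields a discrete walk along the manifold:
# each image differs only slightly from its neighbors.
trajectory = [image]
for step in range(10):
    trajectory.append(shift_right(brighten(trajectory[-1], 0.01), 1))

print(np.abs(trajectory[1] - trajectory[0]).mean())  # small neighbor-to-neighbor change
```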
These thought experiments convey some intuitive reasons supporting the mani-
fold hypothesis. More rigorous experiments (Cayton, 2005; Narayanan and Mitter,
2010; Schölkopf et al., 1998; Roweis and Saul, 2000; Tenenbaum et al., 2000; Brand,
2003; Belkin and Niyogi, 2003; Donoho and Grimes, 2003; Weinberger and Saul,
2004) clearly support the hypothesis for a large class of datasets of interest in AI.
When the data lies on a low-dimensional manifold, it can be most natural for
machine learning algorithms to represent the data in terms of coordinates on the
manifold, rather than in terms of coordinates in $\mathbb{R}^n$. In everyday life, we can think
of roads as 1-D manifolds embedded in 3-D space. We give directions to specific
addresses in terms of address numbers along these 1-D roads, not in terms of
coordinates in 3-D space. Extracting these manifold coordinates is challenging but
holds the promise of improving many machine learning algorithms. This general
principle is applied in many contexts. Figure 5.13 shows the manifold structure of
a dataset consisting of faces. By the end of this book, we will have developed the
methods necessary to learn such a manifold structure. In figure 20.6, we will see
how a machine learning algorithm can successfully accomplish this goal.
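As a small, self-contained illustration of extracting manifold coordinates, the sketch below runs Isomap (Tenenbaum et al., 2000), one of the algorithms cited above, on a synthetic "swiss roll" dataset using scikit-learn. This is an assumption-laden example of the general principle with a standard library implementation, not the method developed later in the book.

```python
# Sketch: recover 2-D manifold coordinates from 3-D ambient coordinates.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# 3-D points that actually lie on a 2-D manifold rolled up in R^3.
X, color = make_swiss_roll(n_samples=2000, noise=0.05, random_state=0)

# Represent each point by coordinates on the manifold rather than in R^3.
embedding = Isomap(n_neighbors=10, n_components=2)
X_manifold = embedding.fit_transform(X)

print(X.shape)           # (2000, 3)  ambient coordinates
print(X_manifold.shape)  # (2000, 2)  manifold coordinates
```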