Chapter 3

Probability and Information

Theory

In this chapter, we describe probability theory. Probability theory is a mathe-

matical framework for representing uncertain statements. It provides a means

of quantifying uncertainty and axioms for deriving new uncertain statements. In

artiﬁcial intelligence applications, we use probability theory in two major ways.

First, the laws of probability tell us how AI systems should reason, so we de-

sign our algorithms to compute or approximate various expressions derived using

probability theory. Second, we can use probability and statistics to theoretically

analyze the behavior of proposed AI systems.

Probability theory is a fundamental tool of many disciplines of science and

engineering. We provide this chapter to ensure that readers whose background is

primarily in software engineering with limited exposure to probability theory can

understand the material in this book. If you are already familiar with probability

theory, feel free to skip this chapter. If you have absolutely no prior experi-

ence with probability, this chapter should be suﬃcient to successfully carry out

deep learning research projects, but we do suggest that you consult an additional

resource, such as (Jaynes, 2003).

3.1 Why Probability?

Many branches of computer science deal mostly with entities that are entirely

deterministic and certain. A programmer can usually safely assume that a CPU

will execute each machine instruction ﬂawlessly. Errors in hardware do occur, but

are rare enough that most software applications do not need to be designed to

account for them. Given that many computer scientists and software engineers

work in a relatively clean and certain environment, it can be surprising that

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

machine learning makes heavy use of probability theory.

This is because machine learning must always deal with uncertain quantities,

and sometimes may also need to deal with stochastic quantities. Uncertainty

and stochasticity can arise from many sources. Researchers have made com-

pelling arguments for quantifying uncertainty using probability since at least the

1980s. Many of the arguments presented here are summarized from or inspired

by (Pearl, 1988). Much earlier work in probability and engineering introduced

and developed the underlying fundamental notions, such as the notion of ex-

changeability (de Finetti, 1937), Cox’s theorem as the foundations of Bayesian

inference (Cox, 1946), and the theory of stochastic processes (Doob, 1953).

Nearly all activities require some ability to reason in the presence of uncer-

tainty. In fact, beyond mathematical statements that are true by deﬁnition, it is

diﬃcult to think of any proposition that is absolutely true or any event that is

absolutely guaranteed to occur.

One source of uncertainty is incomplete observability. When we cannot ob-

serve something, we are uncertain about its true nature. In machine learning, it

is often the case that we can observe a large amount of data, but there is not a

data instance for every situation we care about. We are also generally not able to

observe directly what process generates the data. Since we are uncertain about

what process generates the data, we are also uncertain about what happens in

the situations for which we have not observed data points. Lack of observability

can also give rise to apparent stochasticity. Deterministic systems can appear

stochastic when we cannot observe all of the variables that drive the behavior of

the system. For example, consider a game of Russian roulette. The outcome is

deterministic if you know which chamber of the revolver is loaded. If you do not

know this important information, then it is a game of chance. In many cases, we

are able to observe some quantity, but our measurement is itself uncertain. For

example, laser range ﬁnders may have several centimeters of random error.

Uncertainty can also arise from the simpliﬁcations we make in order to model

real-world processes. For example, if we discretize space, then we immediately

become uncertain about the precise position of objects: each object could be

anywhere within the discrete cell that we know it occupies.

Conceivably, the universe itself could have stochastic dynamics, but we make

no claim on this subject.

In many cases, it is more practical to use a simple but uncertain rule rather

than a complex but certain one, even if our modeling system has the ﬁdelity to

accommodate a complex rule. For example, the simple rule “Most birds ﬂy” is

cheap to develop and is broadly useful, while a rule of the form, “Birds ﬂy, except

for very young birds that have not yet learned to ﬂy, sick or injured birds that have

lost the ability to ﬂy, ﬂightless species of birds including the cassowary, ostrich,

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

and kiwi. . . ” is expensive to develop, maintain, and communicate, and after all

of this eﬀort is still very brittle and prone to failure.

Given that we need a means of representing and reasoning about uncertainty,

it is not immediately obvious that probability theory can provide all of the tools

we want for artiﬁcial intelligence applications. Probability theory was originally

developed to analyze the frequencies of events. It is easy to see how probability

theory can be used to study events like drawing a certain hand of cards in a

game of poker. These kinds of events are often repeatable, and when we say that

an outcome has a probability p of occurring, it means that if we repeated the

experiment (e.g., draw a hand of cards) inﬁnitely many times, then proportion

p of the repetitions would result in that outcome. This kind of reasoning does

not seem immediately applicable to propositions that are not repeatable. If a

doctor analyzes a patient and says that the patient has a 40% chance of having

the ﬂu, this means something very diﬀerent—we can not make inﬁnitely many

replicas of the patient, nor is there any reason to believe that diﬀerent replicas of

the patient would present with the same symptoms yet have varying underlying

conditions. In the case of the doctor diagnosing the patient, we use probability to

represent a degree of belief, with 1 indicating absolute certainty, and 0 indicating

absolute uncertainty. The former kind of probability, related directly to the rates

at which events occur, is known as frequentist probability, while the latter, related

to qualitative levels of certainty, is known as Bayesian probability.

It turns out that if we list several properties that we expect common sense

reasoning about uncertainty to have, then the only way to satisfy those properties

is to treat Bayesian probabilities as behaving exactly the same as frequentist

probabilities. For example, if we want to compute the probability that a player

will win a poker game given that she has a certain set of cards, we use exactly the

same formulas as when we compute the probability that a patient has a disease

given that she has certain symptoms. For more details about why a small set of

common sense assumptions implies that the same axioms must control both kinds

of probability, see (Ramsey, 1926).

Probability can be seen as the extension of logic to deal with uncertainty.

Logic provides a set of formal rules for determining what propositions are implied

to be true or false given the assumption that some other set of propositions is

true or false. Probability theory provides a set of formal rules for determining the

likelihood of a proposition being true given the likelihood of other propositions.

3.2 Random Variables

A random variable is a variable that can take on diﬀerent values randomly. We

typically denote the random variable itself with a lower case letter in plain type-

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

face, and the values it can take on with lower case script letters. For example,

and x

are both possible values that the random variable x can take on. For

vector-valued variables, we would write the random variable as x and one of its

values as x. On its own, a random variable is just a description of the states that

are possible; it must be coupled with a probability distribution that speciﬁes how

likely each of these states are.

Random variables may be discrete or continuous. A discrete random variable

is one that has a ﬁnite or countably inﬁnite number of states. Note that these

states are not necessarily the integers; they can also just be named states that

are not considered to have any numerical value. A continuous random variable is

associated with a real value.

3.3 Probability Distributions

A probability distribution is a description of how likely a random variable or set

of random variables is to take on each of its possible states. The way we de-

scribe probability distributions depends on whether the variables are discrete or

continuous.

3.3.1 Discrete Variables and Probability Mass Functions

A probability distribution over discrete variables may be described using a proba-

bility mass function (PMF). We typically denote probability mass functions with

a capital P. Often we associate each random variable with a diﬀerent probabil-

ity mass function and the reader must infer which probability mass function to

use based on the identity of the random variable, rather than the name of the

function; P (x) is usually not the same as P (y).

The probability mass function maps from a state of a random variable to

the probability of that random variable taking on that state. P (x) denotes the

probability that x = x, with a probability of 1 indicating that x = x is certain and

a probability of 0 indicating that x = x is impossible. Sometimes to disambiguate

which PMF to use, we write the name of the random variable explicitly: P (x =

x). Sometimes we deﬁne a variable ﬁrst, then use ∼ notation to specify which

distribution it follows later: x ∼ P (x).

Probability mass functions can act on many variables at the same time. Such

a probability distribution over many variables is known as a joint probability

distribution. P (x = x, y = y) denotes the probability that x = x and y = y

simultaneously. We may also write P (x, y) for brevity.

To be a probability mass function on a set of random variables x, a function

f must meet the following properties:

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

• The domain of f must be the set of all possible states of x.

• The range of f must be a subset of the real interval [0, 1]. (No state can be

more likely than a guaranteed event or less likely than an impossible event.)

•

x∈x

f(x) = 1. (f must guarantee that some state occurs.)

For example, consider a single discrete random variable x with k diﬀerent

states. We can place a uniform distribution on x — that is, make each of its

states equally likely, by setting its probability mass function to

P (x = x

) =

for all i. We can see that this ﬁts the requirements for a probability mass function.

The value

is positive because k is a positive integer. We also see that

P (x =

) =

= 1, so the distribution is properly normalized.

3.3.2 Continuous Variables and Probability Density Functions

When working with continuous random variables, we describe probability distri-

butions using a probability density function (PDF) rather than a probability mass

function. A probability density function must satisfy the following properties:

• It must map from the domain of the random variable whose distribution it

describes to the real numbers.

• ∀x, p(x) ≥ 0. Note that we do not require p(x) ≤ 1.

•

p(x)dx = 1.

A probability density function does not give the probability of a speciﬁc state

directly, instead the probability of landing inside an inﬁnitesimal region with

volume δx is given by p(x)δx.

We can integrate the density function to ﬁnd the actual probability mass of

a set of points. Speciﬁcally, the probability that x lies in some set S is given by

the integral of p(x) over that set. In the univariate example, the probability that

x lies in the interval [a, b] is given by

[a,b]

p(x)dx.

For an example of a probability density function corresponding to a speciﬁc

probability density over a continuous random variable, consider a uniform dis-

tribution on an interval of the real numbers. We can do this with a function

u(x; a, b), where a and b are the endpoints of the interval, with b > a. (The

“;” notation means “parametrized by”; we consider x to be the argument of the

function, while a and b are parameters that deﬁne the function) To ensure that

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

there is no probability mass outside the interval, we say u(x; a, b) = 0 for all

x 6∈ [a, b]. Within [a, b], u(x; a, b) =

b−a

. We can see that this is nonnegative

everywhere. Additionally, it integrates to 1. We often denote that x follows the

uniform distribution on [a, b] by writing x ∼ U(a, b).

3.4 Marginal Probability

Sometimes we know the probability distribution over a set of variables, and we

want to know the probability distribution over just a subset of them. The prob-

ability distribution over the subset is known as the marginal probability.

For example, suppose we have discrete random variables x and y, and we know

P (x, y). We can ﬁnd P (x) with the sum rule:

∀x ∈ x, P(x = x) =

P (x = x, y = y).

The name “marginal probability” comes from the process of computing marginal

probabilities on paper. When the values of P(x, y) are written in a grid with dif-

ferent values of x in rows and diﬀerent values of y in columns, it is natural to sum

across a row of the grid, then write P (x) in the margin of the paper just to the

right of the row.

For continuous variables, we need to use integration instead of summation:

p(x) =

p(x, y)dy.

3.5 Conditional Probability

In many cases, we are interested in the probability of some event, given that some

other event has happened. This is called a conditional probability. We denote the

conditional probability that y = y given x = x as P (y = y | x = x). This

conditional probability can be computed with the formula

P (y = y | x = x) = P (y = y, x = x)/P(x = x).

Note that this is only deﬁned when P (x = x) > 0. We cannot compute the

conditional probability conditioned on an event that never happens.

It is important not to confuse conditional probability with computing what

would happen if some action were undertaken. The conditional probability that

a person is from Germany given that they speak German is quite high, but if

a randomly selected person is taught to speak German, their country of origin

does not change. Computing the consequences of an action is called making an

intervention query. Intervention queries are the domain of causal modeling, which

we do not explore in this book.

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

3.6 The Chain Rule of Conditional Probabilities

Any joint probability distribution over many random variables may be decom-

posed into conditional distributions over only one variable:

P (x

(1)

, . . . , x

(n)

) = P (x

(1)

)Π

i=2

P (x

(i)

| x

(1)

, . . . , x

(i−1)

This observation is known as the chain rule or product rule of probability. It

follows immediately from the deﬁnition of conditional probability. For example,

applying the deﬁnition twice, we get

P (a, b, c) = P (a | b, c)P (b, c)

P (b, c) = P (b | c)P (c)

P (a, b, c) = P (a | b, c)P (b | c)P (c).

Note how every statement about probabilities remains true if we add conditions

(stuﬀ on the right-hand side of the vertical bar) consistently on all the “P”’s in

the statement. We can use this to derive the same thing diﬀerently:

P (a, b | c) = P (a | b, c)P (b | c)

P (a, b, c) = P (a, b | c)P (c) = P (a | b, c)P (b | c)P (c).

3.7 Independence and Conditional Independence

Two random variables x and y are independent if their probability distribution can

be expressed as a product of two factors, one involving only x and one involving

only y:

∀x ∈ x, y ∈ y, p(x = x, y = y) = p(x = x)p(y = y).

Two random variables x and y are conditionally independent given a random

variable z if the conditional probability distribution over x and y factorizes in this

way for every value of z:

∀x ∈ x, y ∈ y, z ∈ z, p(x = x, y = y | z = z) = p(x = x | z = z)p(y = y | z = z).

We can denote independence and conditional independence with compact no-

tation: x⊥y means that x and y are independent, while x⊥y | z means that x and

y are conditionally independent given z.

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

3.8 Expectation, Variance, and Covariance

The expectation or expected value of some function f(x) with respect to a prob-

ability distribution P (x) is the average or mean value that f takes on when x is

drawn from P . For discrete variables this can be computed with a summation:

x∼P

[f(x)] =

P (x)f(x),

while for continuous variables, it is computed with an integral:

x∼P

[f(x)] =

p(x)f(x)dx.

When the identity of the distribution is clear from the context, we may simply

write the name of the random variable that the expectation is over, e.g. E

[f(x)].

If it is clear which random variable the expectation is over, we may omit the

subscript entirely, e.g. E[f(x)]. By default, we can assume that E[·] averages over

the values of all the random variables inside the brackets. Likewise, when there

is no ambiguity, we may omit the square brackets.

Expectations are linear, for example, E[αf(x)+βg(x)] = αE[f (x)]+βE[g(x)],

when α and β are ﬁxed (not random and not depending on x).

The variance gives a measure of how much the diﬀerent values of a function

are spread apart:

Var(f(x)) = E

(f(x) −E[f(x)])

When the variance is low, the values of f(x) cluster near their expected value.

The square root of the variance is known as the standard deviation.

The covariance gives some sense of how much two values are linearly related

to each other, as well as the scale of these variables:

Cov(f(x), g(y)) = E [(f(x) −E [f(x)]) (g(y) − E [g(y)])] .

High absolute values of the covariance mean that the values change a lot and

are both far from their respective means at the same time. If the sign of the

covariance is positive, then the values tend to change in the same direction, while

if it is negative, they tend to change in opposite directions. Other measures such

as correlation normalize the contribution of each variable in order to measure only

how much the variables are related, rather than also being aﬀected by the scale

of the separate variables.

The notions of covariance and dependence are conceptually related, but are in

fact distinct concepts. Two random variables that have non-zero covariance are

dependent. However, they may have zero covariance without being independent.

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

For example, suppose we ﬁrst generate x, then generate s ∈ {−1, 1} with each

state having probability 0.5, then generate y as s(x − E[x]). Clearly, x and y

are not independent, because y only has two possible values given x. However,

Cov(x, y) = 0.

The covariance matrix of a random vector x ∈ R

is an n × n matrix, such

that

Cov(x)

i,j

= Cov(x

, x

Note that the diagonal elements give Cov(x

, x

) = Var(x

3.9 Information Theory

Information theory is a branch of applied mathematics that revolves around quan-

tifying how much information is present in a signal. It was originally invented

to study sending messages from discrete alphabets over a noisy channel, such as

communication via radio transmission. In this context, information theory tells

how to design optimal codes and calculate the expected length of messages sam-

pled from speciﬁc probability distributions using various encoding schemes. In

the context of machine learning, we can also apply information theory to contin-

uous variables where some of these message length interpretations do not apply.

This ﬁeld is fundamental to many areas of electrical engineering and computer

science. In this textbook, we mostly use a few key ideas from information theory

to characterize probability distributions or quantify similarity between probabil-

ity distributions. For more detail on information theory, see (Cover and Thomas,

2006; MacKay, 2003).

The basic intuition behind information theory is that learning that an un-

likely event has occurred is more informative than learning that a likely event has

occurred. A message saying “the sun rose this morning” is so uninformative as

to be unnecessary to send, but a message saying “there was a solar eclipse this

morning” is very informative.

We would like to quantify information in a way that formalizes this intuition.

Speciﬁcally,

• Likely events should have low information content, and in the extreme case,

events that are guaranteed to happen should have no information content

whatsoever.

• Less likely events should have higher information content.

• Independent events should have additive information. For example, ﬁnding

out that a tossed coin has come up as heads twice should convey twice as

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

much information as ﬁnding out that a tossed coin has come up as heads

once.

In order to satisfy all three of these properties, we deﬁne the self-information

of an event x = x to be

I(x) = −log P (x). (3.1)

In this book, we always use log to mean the natural logarithm, with base e. Our

deﬁnition of I(x) is therefore written in units of nats. One nat is the amount of

information gained by observing an event of probability

. Other texts use base-2

logarithms and units called bits or shannons; information measured in bits is just

a rescaling of information measured in nats.

When x is continuous, we use the same deﬁnition of information by analogy,

but some of the properties from the discrete case are lost. For example, an event

with unit density still has zero information, despite not being an event that is

guaranteed to occur.

Self-information deals only with a single outcome. We can quantify the

amount of uncertainty in an entire probability distribution using the Shannon

entropy

H(x) = E

x∼P

[I(x)] = −E

x∼P

[log P(x)]. (3.2)

also denoted H(P ). In other words, the Shannon entropy of a distribution is

the expected amount of information in an event drawn from that distribution. It

actually gives a lower bound on the number of bits (if the logarithm is base 2,

otherwise the units are diﬀerent) needed in average to encode symbols drawn from

a distribution P. Distributions that are nearly deterministic (where the outcome

is nearly certain) have low entropy; distributions that are closer to uniform have

high entropy. See Fig. 3.1 for a demonstration. When x is continous, the Shannon

entropy is known as the diﬀerential entropy.

If we have two separate probability distributions P (x) and Q(x) over the same

random variable x, we can measure how diﬀerent these two distributions are using

the Kullback-Leibler (KL) divergence:

(QkP ) = E

x∼Q



log

Q(x)

P (x)



. (3.3)

In the case of discrete variables, it is the extra amount of information needed to

send a message containing symbols drawn with probability Q, when we incorrectly

believe that they were drawn with probability P , i.e., it measures how much it

hurts to use P as a model when Q is the ground truth. The KL divergence has

Shannon entropy is named for Claude Shannon, the father of information theory (Shannon,

1948, 1949). For an interesting biographical account of Shannon and some of his contemporaries,

see Fortune’s Formula by William Poundstone (Poundstone, 2005).

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

Figure 3.1: This plot shows how distributions that are closer to deterministic have low

Shannon entropy while distributions that are close to uniform have high Shannon entropy.

On the horizontal axis, we plot p, the probability of a binary random variable being equal

to 1. When p is near 0, the distribution is nearly deterministic, because the random

variable is nearly always 0. When p is near 1, the distribution is nearly deterministic,

because the random variable is nearly always 1. When p = 0.5, the entropy is maximal,

because the distribution is uniform over the two outcomes.

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

many useful properties, most notably that it is non-negative. The KL divergence

is 0 if and only if P and Q are the same distribution in the case of discrete

variables, or equal “almost everywhere” in the case of continuous variables (see

section 3.13 for details). Because the KL divergence is non-negative and measures

the diﬀerence between two distributions, it is often conceptualized as measuring

some sort of distance between these distributions. However, it is not a true

distance measure because it is not symmetric, i.e. D

(P kQ) 6= D

(QkP ) for

some P and Q.

When computing many of these quantities, it is common to encounter expres-

sions of the form 0 log 0. By convention, in the context of information theory, we

treat these expressions as lim

x→0

x log x = 0.

3.10 Common Probability Distributions

Several simple probability distributions are useful in many contexts in machine

learning.

3.10.1 Bernoulli Distribution

The Bernoulli distribution is a distribution over a single binary random variable.

It is controlled by a single parameter φ ∈ [0, 1], which gives the probability of the

random variable being equal to 1. It has the following properties:

P (x = 1) = φ

P (x = 0) = 1 − φ

P (x = x) = φ

(1 − φ)

1−x

[x] = φ

Var

(x) = φ(1 − φ)

H(x) = (φ −1) log(1 −φ) − φ log φ.

3.10.2 Multinoulli Distribution

The multinoulli or categorical distribution is a distribution over a single discrete

variable with k diﬀerent states, where k is ﬁnite

. The multinoulli distribution

“Multinoulli” is a recently coined term. The multinoulli distribution is a special case of

the multinomial distribution. A multinomial distribution is the distribution over vectors in

{0, . . . , k}

representing how many times each of the k categories is visited when n samples

are drawn from a multinoulli distribution. Many texts use the term “multinomial” to refer to

multinoulli distributions without clarifying that they refer only to the n = 1 case.

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

is parametrized by a vector p ∈ [0, 1]

k−1

, where p

gives the probability of the

i-th state. The ﬁnal, k-th state’s probability is given by 1 − 1

p. Note that

we must constrain 1

p ≤ 1. Multinoulli distributions are often used to refer to

distributions over categories of objects, so we do not usually assume that state 1

has numerical value 1, etc. For this reason, we do not usually need to compute

the expectation or variance of multinoulli-distributed random variables.

The Bernoulli and multinoulli distributions are suﬃcient to describe any distri-

bution over their domain. This is because they model discrete variables for which

it is feasible to simply enumerate all of the states. When dealing with continuous

variables, there are uncountably many states, so any distribution described by a

small number of parameters must impose strict limits on the distribution.

3.10.3 Gaussian Distribution

The most commonly used distribution over real numbers is the normal distribu-

tion, also known as the Gaussian distribution:

N(x | µ, σ

) =

2πσ

exp



−

2σ

(x − µ)



See Fig. 3.2 for a schematic.

The two parameters µ ∈ R and σ ∈ R

control the normal distribution. µ

gives the coordinate of the central peak. This is also the mean of the distribution,

i.e. E[x] = µ. The standard deviation of the distribution is given by σ, i.e.

Var(x) = σ

Note that when we evaluate the PDF, we need to square and invert σ. When

we need to frequently evaluate the PDF with diﬀerent parameter values, a more

eﬃcient way of parametrizing the distribution is to use a parameter β ∈ R

control the precision or inverse variance of the distribution:

N(x | µ, β

−1

) =

2π

exp



−

β(x − µ)



Normal distributions are a sensible choice for many applications. In the ab-

sence of prior knowledge about what form a distribution over the real numbers

should take, the normal distribution is a good default choice for two major rea-

sons.

First, many distributions we wish to model are truly close to being normal

distributions. The central limit theorem shows that the sum of many indepen-

dent random variables is approximately normally distributed. This means that

in practice, many complicated systems can be modeled successfully as normally

distributed noise, even if the system can be decomposed into parts with more

structured behavior.

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

Figure 3.2: The normal distribution: The normal distribution N(x | µ, σ

) exhibits a

classic “bell curve” shape, with the x coordinate of its central peak given by µ, and

the width of its peak controlled by σ. In this example, we depict the standard normal

distribution, with µ = 0 and σ = 1.

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

Second, the normal distribution in some sense makes the fewest assumptions

of any distribution over the reals, so choosing to use it inserts the least amount of

prior knowledge into a model. Out of all distributions with the same variance, the

normal distribution has the highest entropy. It is not possible to place a uniform

distribution on all of R. The closest we can come to doing so is to use a normal

distribution with high variance.

The normal distribution generalizes to R

, in which case it is known as the

multivariate normal distribution. It may be parametrized with a positive deﬁnite

symmetric matrix Σ:

N(x | µ, Σ) =

(2π)

det(Σ)

exp



−

(x − µ)

−1

(x − µ)



The parameter µ still gives the mean of the distribution, though now it is

vector-valued. The parameter Σ gives the covariance matrix of the distribution.

As in the univariate case, the covariance is not necessarily the most computation-

ally eﬃcient way to parametrize the distribution, since we need to invert Σ to

evaluate the PDF. We can instead use a precision matrix β:

N(x | µ, β

−1

) =

det(β)

(2π)

exp



−

(x − µ)

β(x −µ)



3.10.4 Dirac Distribution

In some cases, we wish to specify that all of the mass in a probability distribution

clusters around a single point. This can be accomplished by deﬁning a PDF using

the Dirac delta function, δ(x):

p(x) = δ(x −µ).

The Dirac delta function is deﬁned such that it is zero-valued everywhere but 0,

yet integrates to 1. By deﬁning p(x) to be δ shifted by −µ we obtain an inﬁnitely

narrow and inﬁnitely high peak of probability mass where x = µ.

A common use of the Dirac delta distribution is as a component of the so-called

empirical distribution,

ˆp(x) =

i=1

δ(x −x

) (3.4)

which puts probability mass

on each of the n points x

, . . . x

forming a given

data set or collection of samples. The Dirac delta distribution is only necessary to

deﬁne the empirical distribution over continuous variables. For discrete variables,

the situation is simpler: an empirical distribution can be conceptualized as a

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

multinoulli distribution, with a probability associated to each possible input value

that is simply equal to the empirical frequency of that value in the training set.

We can view the empirical distribution formed from a dataset of training ex-

amples as specifying the distribution that we sample from when we train a model

on this dataset. Another important perspective on the empirical distribution is

that it is the probability density that maximizes the likelihood of the training data

(see Section 5.8). Many machine learning algorithms can be conﬁgured to have

arbitrarily high capacity. If given enough capacity, these algorithms will simply

learn the empirical distribution. This is a bad outcome because the model does

not generalize at all and assigns inﬁnitesimal probability to any point in space

that did not occur in the training set. A central problem in machine learning is

studying how to limit the capacity of a model in a way that prevents it from sim-

ply learning the empirical distribution while also allowing it to learn complicated

functions.

The empirical distribution is a particular form of mixture, discussed next.

3.10.5 Mixtures of Distributions and Gaussian Mixture

It is also common to deﬁne probability distributions by composing other sim-

pler probability distributions. One common way of combining distributions is to

construct a mixture distribution. A mixture distribution is made up of several

component distributions. On each trial, the choice of which component distribu-

tion generates the sample is determined by sampling a component identity from

a multinoulli distribution:

P (x) =

P (c = i)P (x | c = i)

where P (c) is the multinoulli distribution over component identities. In chap-

ter 13, we explore the art of building complex probability distributions from sim-

ple ones in more detail. Note that we can think of the variable c as a non-observed

(or latent) random variable that is related to x through their joint distribution

P (x, c) = P (x | c)P (c). Latent variables are discussed further in Section 13.4.2.

A very powerful and common type of mixture model is the Gaussian mixture

model, in which the components P (x | c = i) are Gaussians, each with its mean µ

and covariance Σ

. Some mixtures can have more constraints, for example, the co-

variances could be shared across components, i.e., Σ

= Σ

= Σ, or the covariance

matrices could be constrained to be diagonal or simply equal to a scalar times the

identity. A Gaussian mixture model is a universal approximator of densities, in

the sense that any smooth density can be approximated to a particular precision

by a Gaussian mixture model with enough components. Gaussian mixture models

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

have been used in many settings, and are particularly well known for their use as

acoustic models in speech recognition (Bahl et al., 1987).

3.11 Useful Properties of Common Functions

Certain functions arise often while working with probability distributions, espe-

cially the probability distributions used in deep learning models.

One of these functions is the logistic sigmoid:

σ(x) =

1 + exp(−x)

The logistic sigmoid is commonly used to produce the φ parameter of a Bernoulli

distribution because its range is (0, 1), which lies within the valid range of values

for the φ parameter. See Fig. 3.3 for a graph of the sigmoid function.

Figure 3.3: The logistic sigmoid function.

Another commonly encountered function is the softplus function (Dugas et al.,

2001):

ζ(x) = log (1 + exp(x)) .

The softplus function can be useful for producing the β or σ parameter of a normal

distribution because its range is R

. It also arises commonly when manipulating

expressions involving sigmoids, as it is the primitive of the sigmoid, i.e., the

integral from −∞ to x of the sigmoid. The name of the softplus function comes

from the fact that it is a smoothed or “softened” version of

= max(0, x).

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

Figure 3.4: The softplus function.

See Fig. 3.4 for a graph of the softplus function.

The following properties are all useful enough that you may wish to memorize

them:

σ(x) =

exp(x)

exp(x) + exp(0)

σ(x) = σ(x)(1 − σ(x))

1 − σ(x) = σ(−x)

log σ(x) = −ζ(−x)

ζ(x) = σ(x)

∀x ∈ (0, 1), σ

−1

(x) = log



1 − x



∀x > 0, ζ

−1

(x) = log (exp(x) − 1)

ζ(x) − ζ(−x) = x

The function σ

−1

(x) is called the logit in statistics, but this term is more rarely

used in machine learning. The ﬁnal property provides extra justiﬁcation for the

name “softplus”, since x

− x

−

= x.

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

3.12 Bayes’ Rule

We often ﬁnd ourselves in a situation where we know P (y | x) and need to know

P (x | y). Fortunately, if we also know P (x), we can compute the desired quantity

using Bayes’ rule:

P (x | y) =

P (x)P (y | x)

P (y)

Note that while P (y) appears in the formula, it is usually feasible to compute

P (y) =

P (y | x)P (x), so we do not need to begin with knowledge of P (y).

Bayes’ rule is straightforward to derive from the deﬁnition of conditional prob-

ability, but it is useful to know the name of this formula since many texts refer to

it by name. It is named after the Reverend Thomas Bayes, who ﬁrst discovered a

special case of the formula. The general version presented here was independently

discovered by Pierre-Simon Laplace.

3.13 Technical Details of Continuous Variables

A proper formal understanding of continuous random variables and probability

density functions requires developing probability theory in terms of a branch of

mathematics known as measure theory. Measure theory is beyond the scope of

this textbook, but we can brieﬂy sketch some of the issues that measure theory

is employed to resolve.

In section 3.3.2, we saw that the probability of x lying in some set S is

given by the integral of p(x) over the set S. Some choices of set S can produce

paradoxes. For example, it is possible to construct two sets S

and S

such

that P(S

) + P(S

) > 1 but S

∩ S

= ∅. These sets are generally constructed

making very heavy use of the inﬁnite precision of real numbers, for example by

making fractal-shaped sets or sets that are deﬁned by transforming the set of

rational numbers

. One of the key contributions of measure theory is to provide a

characterization of the set of sets that we can compute the probability of without

encountering paradoxes. In this book, we only integrate over sets with relatively

simple descriptions, so this aspect of measure theory never becomes a relevant

concern.

For our purposes, measure theory is more useful for describing theorems that

apply to most points in R

but do not apply to some corner cases. Measure theory

provides a rigorous way of describing that a set of points is negligibly small. Such

a set is said to have “measure zero”. We do not formally deﬁne this concept in this

textbook. However, it is useful to understand the intuition that a set of measure

zero occupies no volume in the space we are measuring. For example, within R

The Banach-Tarski theorem provides a fun example of such sets.

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

a line has measure zero, while a ﬁlled polygon has positive measure. Likewise, an

individual point has measure zero. Any union of countably many sets that each

have measure zero also has measure zero (so the set of all the rational numbers

has measure zero, for instance).

Another useful term from measure theory is “almost everywhere”. A property

that holds almost everywhere holds throughout all of space except for on a set

of measure zero. Because the exceptions occupy a negligible amount of space,

they can be safely ignored for many applications. Some important results in

probability theory hold for all discrete values but only hold “almost everywhere”

for continuous values.

One other detail we must be aware of relates to handling random variables

that are deterministic functions of one another. Suppose we have two random

variables, x and y, such that y = g(x). You might think that p

(y) = p

−1

(y)).

This is actually not the case.

Suppose y =

and x ∼ U(0, 1). If we use the rule p

(y) = p

(2y) then p

will be 0 everywhere except the interval [0,

], and it will be 1 on this interval.

This means

(y)dy =

which violates the deﬁnition of a probability distribution.

This common mistake is wrong because it fails to account for the distortion

of space introduced by the function g(x). Recall that the probability of x lying

in an inﬁnitesimally small region with volume δx is given by p(x)δx. Since g can

expand or contract space, the inﬁnitesimal volume surrounding x in x space may

have diﬀerent volume in y space. To correct the problem, we need to preserve

the property

(g(x))dy| = |p

(x)dx|.

Solving from this, we obtain

(y) = p

−1

(y))|

∂x

∂y

or equivalently

(x) = p

(g(x))|

∂g(x)

∂x

|. (3.5)

In higher dimensions, the absolute value of the derivative generalizes to the de-

terminant of the Jacobian matrix — the matrix with J

i,j

∂x

∂y

3.14 Structured Probabilistic Models

Machine learning algorithms often involve probability distributions over a very

large number of random variables. Often, these probability distributions involve

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

direct interactions between relatively few variables. Using a single function to

describe the entire joint probability distribution can be very ineﬃcient (both

computationally and statistically).

Instead of using a single function to represent a probability distribution, we

can split a probability distribution into many factors that we multiply together.

For example, suppose we have three random variables, a, b, and c. Suppose that

a inﬂuences the value of b and b inﬂuences the value of c, but that a and c are

independent given b. We can represent the probability distribution over all three

variables as a product of probability distributions over two variables:

p(a, b, c) = p(a)p(b | a)p(c | a).

These factorizations can greatly reduce the number of parameters needed to

describe the distribution. Each factor uses a number of parameters that is expo-

nential in the number of variables in the factor. This means that we can greatly

reduce the cost of representing a distribution if we are able to ﬁnd a factorization

into distributions over fewer variables.

We can describe these kinds of factorizations using graphs. Here we use the

word “graph” in the sense of graph theory, i.e. a set of vertices that may be

connected to each other with edges. When we represent the factorization of a

probability distribution with a graph, we call it a structured probabilistic model

or graphical model.

There are two main kinds of structured probabilistic models: directed and

undirected. Both kinds of graphical models use a graph in which each node in

the graph corresponds to a random variable, and an edge connecting two ran-

dom variables means that the probability distribution is able to represent direct

interactions between those two random variables.

Directed models use graphs with directed edges, and they represent factoriza-

tions into conditional probability distributions, as in the example above. Specif-

ically, a directed model contains one factor for every random variable x

in the

distribution, and that factor consists of the conditional distribution over x

given

the parents of x

p(x) =

p (x

| P a

)) .

See Fig. 3.5 for an example of a directed graph and the factorization of probability

distributions it represents.

Undirected models use graphs with undirected edges, and they represent fac-

torizations into a set of functions; unlike in the directed case, these functions are

usually not probability distributions of any kind. Any set of nodes that are all

connected to each other in G is called a clique. Each clique C

(i)

in an undirected

model is associated with a factor φ

(i)

). These factors are just functions, not

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

Figure 3.5: A directed graphical model over random variables a, b, c, d, and e. This

graph corresponds to probability distributions that can be factored as p(a, b, c, d, e) =

p(a)p(b)p(c | a, b)p(d | b)p(e | c). This graph allows us to quickly see some properties

of the distribution. For example, a and c interact directly, but a and e interact only

indirectly via c.

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

probability distributions. The output of each factor must be non-negative, but

there is no constraint that the factor must sum or integrate to 1 like a probability

distribution.

The probability of a conﬁguration of random variables is proportional to the

product of all of these factors—assignments that result in larger factor values are

more likely. Of course, there is no guarantee that this product will sum to 1. We

therefore divide by a normalizing constant Z, deﬁned to be the sum or integral

over all states of the product of the φ functions, in order to obtain a normalized

probability distribution:

p(x) =

(i)



(i)



See Fig. 3.6 for an example of an undirected graph and the factorization of prob-

ability distributions it represents.

Keep in mind that these graphical representations of factorizations are a lan-

guage for describing probability distributions. They are not mutually exclusive

families of probability distributions. Being directed or undirected is not a prop-

erty of a probability distribution; it is a property of a particular description of

a probability distribution, but any probability distribution may be described in

both ways.

Throughout part I and part II of this book, we will use structured probabilistic

models merely as a language to describe which direct probabilistic relationships

diﬀerent machine learning algorithms choose to represent. No further understand-

ing of structured probabilistic models is needed until the discussion of research

topics, in part III, where we will explore structured probabilistic models in much

greater detail.

3.15 Example: Naive Bayes

We now know enough probability theory that we can perform some simple appli-

cations with a probabilistic model. In this example, we will show how to infer the

probability that a patient has the ﬂu using a simple probabilistic model. For now,

we will assume that we just know the correct model somehow. Later chapters will

cover the concepts needed to learn the model from data.

The Naive Bayes model is a simple probabilistic model that is often used to

recognize patterns. The model consists of one random variable c representing a

category, and a set of random variables F = {f

(1)

, . . . , f

(n)

} representing features

of objects in each category. In this example, we’ll use Naive Bayes to diagnose

patients as having the ﬂu or not. The random variable c can thus have two

values: c

representing the category of patients who do not have the ﬂu, and c

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

Figure 3.6: An undirected graphical model over random variables a, b, c, d, and e. This

graph corresponds to probability distributions that can be factored as p(a, b, c, d, e) =

(1)

(a, b, c)φ

(2)

(b, d)φ

(3)

(c, e). This graph allows us to quickly see some properties of

the distribution. For example, a and c interact directly, but a and e interact only indirectly

via c.

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

representing the category of patients who do. Suppose f

(1)

is the random variable

representing whether the patient has a sore throat, with f

(1)

representing no sore

throat, and f

(1)

representing a sore throat. Suppose f

(2)

∈ R is the patient’s

temperature in degrees Celsius.

When using the Naive Bayes model, we assume that all of the features are

independent from each other given the category:

P (c, f

(1)

, . . . , f

(n)

) = P (c)

P (f

(i)

| c).

See Fig. 3.7 for a directed graphical model that expresses these conditional in-

dependence assumptions. These assumptions are very strong and unlikely to be

true in naturally occuring situations, hence the name “naive”. Surprisingly, Naive

Bayes often produces good predictions in practice (even though the assumptions

do not hold precisely), and is a good baseline model to start with when tackling

a new problem.

Beyond these conditional independence assumptions, the Naive Bayes frame-

work does not specify anything about the probability distribution. The speciﬁc

choice of distributions is left up to the designer. In our ﬂu example, let’s make

P (c) a Bernoulli distribution, with P(c = c

) = φ

(c)

. We can also make P (f

(1)

| c)

a Bernoulli distribution, with

P (f

(1)

= f

(1)

| c = c) = φ

In other words, the Bernoulli parameter changes depending on the value of c.

Finally, we need to choose the distribution over f

(2)

. Since f

(2)

is real-valued, a

normal distribution is a good choice. Because f

(2)

is a temperature, there are

hard limits to the values it can take on—it cannot go below 0K, for example.

Fortunately, these values are so far from the values measured in human patients

that we can safely ignore these hard limits. Values outside the hard limits will

receive extremely low probability under the normal distribution so long as the

mean and variance are set correctly. As with f

(1)

, we need to use diﬀerent param-

eters for diﬀerent values of c, to represent that patients with the ﬂu have diﬀerent

temperatures than patients without it:

(2)

∼ N(f

(2)

| µ

, σ

Now we are ready to determine how likely a patient is to have the ﬂu. To do

this, we want to compute P(c | F), but we know P (c) and P(F | c). This suggests

that we should use Bayes’ rule to determine the desired distribution. The word

“Bayes” in the name “Naive Bayes” comes from this frequent use of Bayes’ rule

in conjunction with the model. We begin by applying Bayes’ rule:

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

(1)

(2)

Figure 3.7: A directed graphical model depicting the conditional independence assump-

tions used by the Naive Bayes model.

CHAPTER 3. PROBABILITY AND INFORMATION THEORY

P (c | F) =

P (c)P (F | c)

P (F)

. (3.6)

We do not know P (F). Fortunately, it is easy to compute:

P (F) =

c∈c

P (c = c, F) (by the sum rule)

c∈c

P (c = c)P (F | c = c) (by the chain rule).

Substituting this result back into equation 3.6, we obtain

P (c | F) =

P (c)P (F | c)

c∈c

P (c = c)P (F | c = c)

P (c)Π

P (f

(i)

| c)

c∈c

P (c = c)Π

P (f

(i)

| c = c)

by the Naive Bayes assumptions. This is as far as we can simplify the expression

for a general Naive Bayes model.

We can simplify the expression further by substituting in the deﬁnitions of the

particular probability distributions we have deﬁned for our ﬂu diagnosis example:

P (c = c | f

(1)

= f

, f

(2)

= f

) =

g(c)

∈c

g(c

)

where

g(c) = P (c = c)P (f

(1)

= f

(1)

| c = c)P (f

(2)

= f

(2)

| c = c).

Since c only has two possible values in our example, we can simplify this to:

P (c = 1 | f

(1)

= f

, f

(2)

= f

) =

g(1)

g(0) + g(1)

1 +

g(0)

g(1)

1 + exp (log g(0) − log g(1))

= σ (log g(1) −log g(0)) . (3.7)

To go further, let’s simplify log g(i):

log g(i) = log

(c)i

(1 − φ

(c)

)

1−i

(f)f

(1 − φ

(f)

)

1−f

2πσ

exp



−

2σ

− µ

)



CHAPTER 3. PROBABILITY AND INFORMATION THEORY

= i log φ

(c)

+(1−i) log



1 − φ

(c)



log φ

(f)

+(1−f

) log(1−φ

(f)

log

2πσ

−

2σ

− µ

)

Substituting this back into equation 3.7, we obtain

P (c = c | f

(1)

= f

, f

(2)

= f

) =



log φ

(c)

− log(1 −φ

(c)

) + f

log φ

(f)

+ (1 −f

) log(1 − φ

(f)

)

−f

log φ

(f)

+ (1 −f

) log(1 − φ

(f)

)

−

log 2πσ

−

2σ

− µ

)

2σ

− µ

)



From this formula, we can read oﬀ various intuitive properties of the Naive

Bayes classiﬁer’s behavior on this example problem, regarding the inference that

can be drawn from a trained model. The probability of the patient having the ﬂu

grows like a sigmoidal curve. We move farther to the left as f

, the patient’s tem-

perature, moves farther away from µ

, the average temperature of a ﬂu patient.