Chapter 3
Probability and Information Theory
In this chapter, we describe probability theory. Probability theory is a mathematical
framework for representing uncertain statements. It provides a means of quantifying
uncertainty and axioms for deriving new uncertain statements. In artificial intelligence
applications, we use probability theory in two major ways. First, the laws of probabil-
ity tell us how AI systems should reason, so we design our algorithms to compute or
approximate various expressions derived using probability theory. Second, we can use
probability and statistics to theoretically analyze the behavior of proposed AI systems.
Probability theory is a fundamental tool of many disciplines of science and engi-
neering. We provide this chapter to ensure that readers whose background is primarily
in software engineering with limited exposure to probability theory can understand the
material in this book. If you are already familiar with probability theory, feel free to skip
this chapter. If you have absolutely no prior experience with probability, this chapter
should be sufficient to successfully carry out deep learning research projects, but we do
suggest that you consult an additional resource, such as (Jaynes, 2003).
3.1 Why probability?
Many branches of computer science deal mostly with entities that are entirely determin-
istic and certain. A programmer can usually safely assume that a CPU will execute each
machine instruction flawlessly. Errors in hardware do occur, but are rare enough that
most software applications do not need to be designed to account for them. Given that
many computer scientists and software engineers work in a relatively clean and certain
environment, it can be surprising that machine learning makes heavy use of probability
theory.
This is because machine learning must always deal with uncertain quantities, and
sometimes may also need to deal with stochastic quantities. Uncertainty and stochas-
ticity can arise from many sources. Researchers have made compelling arguments for
quantifying uncertainty using probability since at least the 1980s. Many of the argu-
ments presented here are summarized from or inspired by (Pearl, 1988).
Nearly all activities require some ability to reason in the presence of uncertainty. In
fact, beyond mathematical statements that are true by definition, it is difficult to think
of any proposition that is absolutely true or any event that is absolutely guaranteed to
occur.
One source of uncertainty is incomplete observability. When we cannot observe
something, we are uncertain about its true nature. In machine learning, it is often the
case that we can observe a large amount of data, but there is not a data instance for
every situation we care about. We are also generally not able to observe directly what
process generates the data. Since we are uncertain about what process generates the
data, we are also uncertain about what happens in the situations for which we have not
observed data points. Lack of observability can also give rise to apparent stochasticity.
Deterministic systems can appear stochastic when we cannot observe all of the variables
that drive the behavior of the system. For example, consider a game of Russian roulette.
The outcome is deterministic if you know which chamber of the revolver is loaded. If
you do not know this important information, then it is a game of chance. In many
cases, we are able to observe some quantity, but our measurement is itself uncertain.
For example, laser range finders may have several centimeters of random error.
Uncertainty can also arise from the simplifications we make in order to model real-
world processes. For example, if we discretize space, then we immediately become
uncertain about the precise position of objects: each object could be anywhere within
the discrete cell that we know it occupies.
Conceivably, the universe itself could have stochastic dynamics, but we make no
claim on this subject.
In many cases, it is more practical to use a simple but uncertain rule rather than a
complex but certain one, even if our modeling system has the fidelity to accommodate
a complex rule. For example, the simple rule “Most birds fly” is cheap to develop and
is broadly useful, while a rule of the form, “Birds fly, except for very young birds that
have not yet learned to fly, sick or injured birds that have lost the ability to fly, flightless
species of birds including the cassowary, ostrich, and kiwi. . . ” is expensive to develop,
maintain, and communicate, and after all of this effort is still very brittle and prone to
failure.
Given that we need a means of representing and reasoning about uncertainty, it is
not immediately obvious that probability theory can provide all of the tools we want
for artificial intelligence applications. Probability theory was originally developed to
analyze the frequencies of events. It is easy to see how probability theory can be used
to study events like drawing a certain hand of cards in a game of poker. These kinds
of events are often repeatable, and when we say that an outcome has a probability p
of occurring, it means that if we repeated the experiment (e.g., draw a hand of cards)
infinitely many times, then proportion p of the repetitions would result in that outcome.
This kind of reasoning does not seem immediately applicable to propositions that are
not repeatable. If a doctor analyzes a patient and says that the patient has a 40% chance
of having the flu, this means something very different—we can not make infinitely many
replicas of the patient, nor is there any reason to believe that different replicas of the
patient would present with the same symptoms yet have varying underlying conditions.
In the case of the doctor diagnosing the patient, we use probability to represent a degree
of belief, with 1 indicating absolute certainty that the patient has the flu, and 0 indicating
absolute certainty that the patient does not. The
former kind of probability, related directly to the rates at which events occur, is known
as frequentist probability, while the latter, related to qualitative levels of certainty, is
known as Bayesian probability.
It turns out that if we list several properties that we expect common sense reasoning
about uncertainty to have, then the only way to satisfy those properties is to treat
Bayesian probabilities as behaving exactly the same as frequentist probabilities. For
example, if we want to compute the probability that a player will win a poker game
given that she has a certain set of cards, we use exactly the same formulas as when we
compute the probability that a patient has a disease given that she has certain symptoms.
For more details about why a small set of common sense assumptions implies that the
same axioms must control both kinds of probability, see (Ramsey, 1926).
Probability can be seen as the extension of logic to deal with uncertainty. Logic
provides a set of formal rules for determining what propositions are implied to be true or
false given the assumption that some other set of propositions is true or false. Probability
theory provides a set of formal rules for determining the likelihood of a proposition being
true given the likelihood of other propositions.
3.2 Random variables
A random variable is a variable that can take on different values randomly. We typically
denote the random variable itself with an uppercase script letter, and the values it
can take on with lower case script letters. For example, x_1 and x_2 are both possible
values that the random variable X can take on. On its own, a random variable is just
a description of the states that are possible; it must be coupled with a probability
distribution that specifies how likely each of these states are.
Random variables may be discrete or continuous. A discrete random variable is one
that has a finite or countably infinite number of states. Note that these states are not
necessarily the integers; they can also just be named states that are not considered to
have any numerical value. A continuous random variable is associated with a real value.
3.3 Probability distributions
A probability distribution is a description of how likely a random variable or set of random
variables is to take on each of its possible states. The way we describe probability
distributions depends on whether the variables are discrete or continuous.
3.3.1 Discrete variables and probability mass functions
A probability distribution over discrete variables may be described using a probability
mass function (PMF). We typically denote probability mass functions with a capital P .
Often we associate each random variable with a different probability mass function and
the reader must infer which probability mass function to use based on the identity of the
random variable, rather than the name of the function; P(X) is usually not the same
as P (Y ).
The probability mass function maps from a state of a random variable to the prob-
ability of that random variable taking on that state. P (x) denotes the probability that
X = x, with a probability of 1 indicating that X = x is certain and a probability of
0 indicating that X = x is impossible. Sometimes to disambiguate which PMF to use,
we write the name of the random variable explicitly: P (X = x). Sometimes we de-
fine a variable first, then use notation to specify which distribution it follows later:
x ∼ P(X).
Probability mass functions can act on many variables at the same time. Such a
probability distribution over many variables is known as a joint probability distribution.
P (X = x, Y = y) denotes the probability that X = x and Y = y simultaneously. We
may also write P(x, y) for brevity.
To be a probability mass function on a set of random variables X, a function f must
meet the following properties:

• The domain of f must be the set of all possible states of X.
• The range of f must be a subset of the real interval [0, 1]. (No state can be more
likely than a guaranteed event or less likely than an impossible event.)
• ∑_{x∈X} f(x) = 1. (f must guarantee that some state occurs.)
For example, consider a single discrete random variable X with k different states.
We can place a uniform distribution on X, that is, make each of its states equally
likely, by setting its probability mass function to

P(X = x_i) = 1/k

for all i. We can see that this fits the requirements for a probability mass function. The
value 1/k is positive because k is a positive integer. We also see that ∑_i P(X = x_i) =
∑_i 1/k = k/k = 1, so the distribution is properly normalized.
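To make the arithmetic concrete, here is a minimal Python sketch (assuming NumPy is available; the state names are arbitrary placeholders) that builds such a uniform PMF and checks the requirements above.

import numpy as np

# Uniform PMF over k named states; the state labels are arbitrary.
states = ["red", "green", "blue", "yellow"]
k = len(states)
pmf = {s: 1.0 / k for s in states}

# Every probability lies in [0, 1] and the probabilities sum to 1.
assert all(0.0 <= p <= 1.0 for p in pmf.values())
assert np.isclose(sum(pmf.values()), 1.0)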
3.3.2 Continuous variables and probability density functions
When working with continuous random variables, we describe probability distributions
using a probability density function (PDF) rather than a probability mass function. A
probability density function must satisfy the following properties:
• It must map from the domain of the random variable whose distribution it describes
to the real numbers.
• ∀x, p(x) ≥ 0. Note that we do not require p(x) ≤ 1.
• ∫ p(x) dx = 1.
A probability density function does not give the probability of a specific state directly;
instead, the probability of landing inside an infinitesimal region with volume δx is given
by p(x)δx.
We can integrate the density function to find the actual probability mass of a set of
points. Specifically, the probability that x lies in some set S is given by the integral of
p(x) over that set. In the univariate example, the probability that x lies in the interval
[a, b] is given by ∫_a^b p(x) dx.
For an example of a probability density function over a continuous random variable,
consider a uniform distribution on an interval of the real numbers. We can do this with
a function u(x; a, b), where a and b are the endpoints of the interval, with b > a. (The
“;” notation means “parameterized by”; we consider x to be the argument of the function,
while a and b are parameters that define the function.) To ensure that there is no
probability mass outside the interval, we say u(x; a, b) = 0 for all x ∉ [a, b]. Within [a, b],
u(x; a, b) = 1/(b − a). We can see that this is nonnegative everywhere. Additionally, it
integrates to 1. We often denote that x follows the uniform distribution on [a, b] by
writing x ∼ U(a, b).
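A small Python sketch of this density, assuming NumPy is available and with arbitrarily chosen endpoints, checks nonnegativity and approximate normalization with a simple Riemann sum.

import numpy as np

def uniform_pdf(x, a, b):
    """Density u(x; a, b): 1/(b - a) inside [a, b], 0 outside."""
    x = np.asarray(x, dtype=float)
    return np.where((x >= a) & (x <= b), 1.0 / (b - a), 0.0)

# Riemann-sum check that the density integrates to (approximately) 1.
a, b = 2.0, 5.0
grid = np.linspace(a - 1, b + 1, 100_001)
dx = grid[1] - grid[0]
print(np.sum(uniform_pdf(grid, a, b)) * dx)  # ~1.0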
3.4 Marginal probability
Sometimes we know the probability distribution over a set of variables, and we want to
know the probability distribution over just a subset of them. The probability distribution
over the subset is known as the marginal probability.
For example, suppose we have discrete random variables X and Y , and we know
P (X, Y ). We can find P (X) with the sum rule:
∀x ∈ X, P(X = x) = ∑_y P(X = x, Y = y).
The name “marginal probability” comes from the process of computing marginal
probabilities on paper. When the values of P(X, Y ) are written in a grid with different
values of x in rows and different values of y in columns, it is natural to sum across a
row of the grid, then write P (x) in the margin of the paper just to the right of the row.
For continuous variables, we need to use integration instead of summation:
p(x) = ∫ p(x, y) dy.
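As an illustration of the sum rule, the following hypothetical Python example (the joint table values are made up) marginalizes a discrete joint distribution stored as a NumPy array.

import numpy as np

# Joint distribution P(X, Y) as a table: rows index x, columns index y.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.05, 0.25, 0.30]])
assert np.isclose(joint.sum(), 1.0)

# Sum rule: marginalize out Y by summing across each row, and X by summing columns.
p_x = joint.sum(axis=1)   # P(X = x)
p_y = joint.sum(axis=0)   # P(Y = y)
print(p_x, p_y)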
3.5 Conditional probability
In many cases, we are interested in the probability of some event, given that some other
event has happened. This is called a conditional probability. We denote the conditional
probability that Y = y given X = x as P (Y = y | X = x). This conditional probability
can be computed with the formula
P (Y = y | X = x) = P (Y = y, X = x)/P (X = x).
Note that this is only defined when P(X = x) > 0. We cannot compute the conditional
probability conditioned on an event that never happens.
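Continuing the same made-up joint table, a short sketch of the conditional probability formula might look as follows; note that each conditional distribution over Y sums to 1.

import numpy as np

joint = np.array([[0.10, 0.20, 0.10],    # P(X = x, Y = y), rows are x
                  [0.05, 0.25, 0.30]])
p_x = joint.sum(axis=1)

# P(Y = y | X = x) = P(Y = y, X = x) / P(X = x), defined only when P(X = x) > 0.
cond_y_given_x = joint / p_x[:, np.newaxis]
print(cond_y_given_x.sum(axis=1))  # each row sums to 1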
It is important not to confuse conditional probability with computing what would
happen if some action were undertaken. The conditional probability that a person is
from Germany given that they speak German is quite high, but if a randomly selected
person is taught to speak German, their country of origin does not change. Computing
the consequences of an action is called making an intervention query. Intervention
queries are the domain of causal modeling, which we do not explore in this book.
3.6 The chain rule
Any joint probability distribution over many random variables may be decomposed into
conditional distributions over only one variable:
P(X^{(1)}, . . . , X^{(n)}) = P(X^{(1)}) ∏_{i=2}^{n} P(X^{(i)} | X^{(1)}, . . . , X^{(i−1)}).
This observation is known as the chain rule or product rule of probability. It follows
immediately from the definition of conditional probability. For example, applying the
definition twice, we get
P (A, B, C) = P (A|B, C)P (B, C)
P (B, C) = P (B|C)P(C)
P (A, B, C) = P (A|B, C)P (B|C)P (C).
Note how every statement about probabilities remains true if we add the same conditioning
variables (on the right-hand side of the vertical bar) consistently to every “P” in the statement.
We can use this to derive the same thing differently:
P (A, B|C) = P (A|B, C)P (B|C)
P (A, B, C) = P (A, B|C)P (C) = P(A|B, C)P (B|C)P (C).
3.7 Independence and conditional independence
Two random variables X and Y are independent if their probability distribution can be
expressed as a product of two factors, one involving only X and one involving only Y :
∀x ∈ X, ∀y ∈ Y, p(X = x, Y = y) = p(X = x) p(Y = y).
Two random variables X and Y are conditionally independent given a random vari-
able Z if the conditional probability distribution over X and Y factorizes in this way
for every value of Z:
∀x ∈ X, ∀y ∈ Y, ∀z ∈ Z, p(X = x, Y = y | Z = z) = p(X = x | Z = z) p(Y = y | Z = z).
3.8 Expectation, variance, and covariance
The expectation or expected value of some function f(x) with respect to a probability
distribution P (X) is the average or mean value that f takes on when x is drawn from
P . For discrete variables this can be computed with a summation:
E_{x∼P}[f(x)] = ∑_x P(x) f(x),
while for continuous variables, it is computed with an integral:
E_{x∼p}[f(x)] = ∫ p(x) f(x) dx.
When the identity of the distribution is clear from the context, we may simply write the
name of the random variable that the expectation is over, e.g. E_X[f(x)]. If it is clear
which random variable the expectation is over, we may omit the subscript entirely, e.g.
E[f(x)].
Expectations are linear, for example, E[αf(x) + βg(x)] = αE[f(x)] + βE[g(x)].
The variance gives a measure of how much the different values of a function are
spread apart:
Var(f(x)) = E[(f(x) − E[f(x)])²].
When the variance is low, the values of f(x) cluster near their expected value. The
square root of the variance is known as the standard deviation.
The covariance gives some sense of how much two values are linearly related to each
other, as well as the scale of these variables:
Cov(f(x), g(y)) = E[(f(x) − E[f(x)]) (g(y) − E[g(y)])].
High absolute values of the covariance mean that the values change a lot and are both
far from their respective means at the same time. If the sign of the covariance is
positive, then the values tend to change in the same direction, while if it is negative,
they tend to change in opposite directions. Other measures such as correlation normalize
the contribution of each variable in order to measure only how much the variables are
related, rather than also being affected by the scale of the separate variables.
The notions of covariance and dependence are related, but are in
fact distinct concepts. Two random variables have non-zero covariance only if they
are dependent. However, they may have zero covariance without being independent.
For example, suppose we first generate x, then generate s ∈ {−1, 1} with each state
having probability 0.5, then generate y as E[x] + s(x − E[x]). Clearly, x and y are not
independent, because y only has two possible values given x. However, Cov(x, y) = 0.
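This construction is easy to check empirically. The sketch below (assuming NumPy; the choice of a uniform distribution for x is arbitrary) draws samples and confirms that the sample covariance is close to zero even though y is a deterministic function of x and s.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# x is arbitrary here; a uniform distribution on [0, 1] works for the illustration.
x = rng.uniform(0.0, 1.0, size=n)
s = rng.choice([-1.0, 1.0], size=n)     # each sign with probability 0.5
y = x.mean() + s * (x - x.mean())       # y = E[x] + s (x - E[x]), with E[x] estimated

# Sample covariance is (approximately) zero, yet y is clearly dependent on x.
print(np.cov(x, y)[0, 1])               # close to 0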
The covariance matrix of a random vector x ∈ R^n is an n × n matrix, such that

Cov(x)_{i,j} = Cov(x_i, x_j).

Note that the diagonal elements give Cov(x_i, x_i) = Var(x_i).
3.9 Information theory
Information theory is a branch of applied mathematics that revolves around quantifying
how much information is present in a signal. This field is fundamental to many areas
of electrical engineering and computer science. In this textbook, we mostly use a few
key ideas from information theory to characterize probability distributions or quantify
similarity between probability distributions. For more detail on information theory,
see (Cover, 2006).
The basic intuition behind information theory is that learning that an unlikely event
has occurred is more informative than learning that a likely event has occurred. A
message saying “the sun rose this morning” is so uninformative as to be unnecessary to
send, but a message saying “there was a solar eclipse this morning” is very informative.
We would like to quantify information in a way that formalizes this intuition. Specif-
ically,
• Likely events should have low information content.
• Unlikely events should have high information content.
• Independent events should have additive information content. For example, finding
out that a tossed coin has come up heads twice should convey twice as much
information as finding out that it has come up heads once.
In order to satisfy all three of these properties, we define the self-information of an
event X = x to be I(x) = −log P(x). In this book, we always use log to mean the
natural logarithm, with base e. Our definition of I(x) is therefore written in units of
nats. One nat is the amount of information gained by observing an event of probability
1/e. Other texts use base-2 logarithms and units called bits or shannons; information
measured in bits is just a rescaling of information measured in nats.
Self-information deals only with a single outcome. We can quantify the amount of
uncertainty in an entire probability distribution using the Shannon entropy[1]:

H(X) = E_{x∼P}[I(x)].
In other words, the Shannon entropy of a distribution is the expected amount of infor-
mation in an event drawn from that distribution. Distributions that are nearly deter-
ministic have low entropy; distributions that are closer to uniform have high entropy.
See Fig. 3.1 for a demonstration.
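A minimal sketch of these two quantities in Python (in nats, assuming NumPy) might look as follows; the input distributions are arbitrary examples.

import numpy as np

def self_information(p):
    """I(x) = -log P(x), in nats."""
    return -np.log(p)

def entropy(pmf):
    """Shannon entropy H = E[I(x)] of a discrete distribution, in nats."""
    pmf = np.asarray(pmf, dtype=float)
    nonzero = pmf > 0                      # convention: 0 log 0 = 0
    return -np.sum(pmf[nonzero] * np.log(pmf[nonzero]))

print(self_information(1.0))               # certain event: 0 nats
print(entropy([0.5, 0.5]))                 # maximal for a binary variable: log 2
print(entropy([0.99, 0.01]))               # nearly deterministic: low entropy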
If we have two separate probability distributions P (X) and Q(X) over the same
random variable X, we can measure how different these two distributions are using the
Kullback-Leibler (KL) divergence:
D_{KL}(P ‖ Q) = E_{x∼P}[log(P(x)/Q(x))].
[1] Shannon entropy is named for Claude Shannon, the father of information theory. For an interesting
biographical account of Shannon and some of his contemporaries, see Fortune’s Formula by William
Poundstone.
Figure 3.1: This plot shows how distributions that are closer to deterministic have
low Shannon entropy while distributions that are close to uniform have high Shannon
entropy. On the horizontal axis, we plot p, the probability of a binary random variable
being equal to 1. When p is near 0, the distribution is nearly deterministic, because
the random variable is nearly always 0. When p is near 1, the distribution is nearly
deterministic, because the random variable is nearly always 1. When p = 0.5, the
entropy is maximal, because the distribution is uniform over the two outcomes.
The KL divergence has many useful properties, most notably that it is non-negative.
The KL divergence is 0 if and only if P and Q are the same distribution in the case
of discrete variables, or equal “almost everywhere” in the case of continuous variables
(see section 3.13 for details). Because the KL divergence is non-negative and measures
the difference between two distributions, it is often conceptualized as measuring some
sort of distance between these distributions. However, it is not a true distance measure
because it is not symmetric: D_{KL}(P ‖ Q) ≠ D_{KL}(Q ‖ P) for some P and Q.
When computing many of these quantities, it is common to encounter expressions
of the form 0 log 0. By convention, in the context of information theory, we treat these
expressions as lim_{x→0} x log x = 0.
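A small Python sketch of the KL divergence between two discrete distributions, using this 0 log 0 convention and arbitrary example distributions, also makes the asymmetry easy to see.

import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) (log P(x) - log Q(x)), in nats.

    Terms with P(x) = 0 contribute 0, following the 0 log 0 convention."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask])))

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])
print(kl_divergence(p, q), kl_divergence(q, p))   # non-negative and not symmetric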
Many information theoretic concepts can be interpreted in terms of the number of
bits required to send certain messages. While these interpretations are valuable in other
fields, they have not usually been of much practical importance in the context of deep
learning.
3.10 Common probability distributions
Several simple probability distributions are useful in many contexts in machine learning.
3.10.1 Bernoulli Distribution
The Bernoulli distribution is a distribution over a single binary random variable. It is
controlled by a single parameter φ ∈ [0, 1], which gives the probability of the random
variable being equal to 1. It has the following properties:

P(X = 1) = φ
P(X = 0) = 1 − φ
P(X = x) = φ^x (1 − φ)^{1−x}
E_X[X] = φ
Var_X(X) = φ(1 − φ)
H(X) = (φ − 1) log(1 − φ) − φ log φ.
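These properties can be reproduced directly; the sketch below assumes NumPy and uses an arbitrary value of φ.

import numpy as np

phi = 0.3  # parameter of the Bernoulli distribution (arbitrary choice)

def bernoulli_pmf(x, phi):
    """P(X = x) = phi^x (1 - phi)^(1 - x) for x in {0, 1}."""
    return phi ** x * (1 - phi) ** (1 - x)

mean = phi
var = phi * (1 - phi)
ent = (phi - 1) * np.log(1 - phi) - phi * np.log(phi)
print(bernoulli_pmf(1, phi), bernoulli_pmf(0, phi), mean, var, ent)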
3.10.2 Multinoulli Distribution
The Multinoulli distribution is a distribution over a single discrete variable with k different
states, where k is finite.[2] The Multinoulli distribution is parameterized by a vector
p ∈ [0, 1]^{k−1}, where p_i gives the probability of the i-th state. The final, k-th state’s
probability is given by 1 − 1^⊤p. Note that we must constrain 1^⊤p ≤ 1. Multinoulli
distributions are often used to describe distributions over categories of objects, so we do
not usually assume that state 1 has numerical value 1, etc. For this reason, we do not
usually need to compute the expectation or variance of Multinoulli-distributed random
variables.

[2] “Multinoulli” is a recently coined term. The Multinoulli distribution is a special case of the
multinomial distribution. A multinomial distribution is the distribution over vectors in {0, . . . , n}^k
representing how many times each of the k categories is visited when n samples are drawn from a
Multinoulli distribution. Many texts use the term “multinomial” to refer to Multinoulli distributions
without clarifying that they refer only to the n = 1 case.
The Bernoulli and Multinoulli distributions are sufficient to describe any distribution
over their domain. This is because they model discrete variables for which it is feasible
to simply enumerate all of the states. When dealing with continuous variables, there are
uncountably many states, so any distribution described by a small number of parameters
must impose strict limits on the distribution.
3.10.3 Gaussian Distribution
The most commonly used distribution over real numbers is the normal distribution, also
known as the Gaussian distribution:
N(x | µ, σ²) = √(1/(2πσ²)) exp(−(1/(2σ²)) (x − µ)²).
See Fig. 3.2 for a schematic.
The two parameters µ ∈ R and σ ∈ R^+ control the normal distribution. µ gives the
coordinate of the central peak. This is also the mean of the distribution, i.e. E[X] = µ.
The standard deviation of the distribution is given by σ, i.e. Var(X) = σ².
Note that when we evaluate the PDF, we need to square and invert σ. When we
need to frequently evaluate the PDF with different parameter values, a more efficient
way of parameterizing the distribution is to use a parameter β ∈ R^+ to control the
precision or inverse variance of the distribution:

N(x | µ, β^{−1}) = √(β/(2π)) exp(−(1/2) β (x − µ)²).
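The two parameterizations agree, as a quick numerical sketch shows (assuming NumPy; the inputs are arbitrary).

import numpy as np

def normal_pdf(x, mu, sigma2):
    """N(x | mu, sigma^2) with variance sigma2."""
    return np.sqrt(1.0 / (2.0 * np.pi * sigma2)) * np.exp(-(x - mu) ** 2 / (2.0 * sigma2))

def normal_pdf_precision(x, mu, beta):
    """Same density, parameterized by the precision beta = 1 / sigma^2."""
    return np.sqrt(beta / (2.0 * np.pi)) * np.exp(-0.5 * beta * (x - mu) ** 2)

x, mu, sigma2 = 1.3, 0.0, 4.0
print(normal_pdf(x, mu, sigma2), normal_pdf_precision(x, mu, 1.0 / sigma2))  # equal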
Normal distributions are a sensible choice for many applications. In the absence of
prior knowledge about what form a distribution over the real numbers should take, the
normal distribution is a good default choice for two major reasons.
First, many distributions we wish to model are truly close to being normal distri-
butions. The central limit theorem shows that the sum of many independent random
variables is approximately normally distributed. This means that in practice, many
complicated systems can be modeled successfully as normally distributed noise, even if
the system can be decomposed into parts with more structured behavior.
Second, the normal distribution in some sense makes the fewest assumptions of
any distribution over the reals, so choosing to use it inserts the least amount of prior
knowledge into a model. Out of all distributions with the same variance, the normal
distribution has the highest entropy. It is not possible to place a uniform distribution
on all of R. The closest we can come to doing so is to use a normal distribution with
high variance.
Figure 3.2: The normal distribution: The normal distribution N(x | µ, σ²) exhibits a
classic “bell curve” shape, with the x coordinate of its central peak given by µ, and
the width of its peak controlled by σ. In this example, we depict the standard normal
distribution, with µ = 0 and σ = 1.
The normal distribution generalizes to R^n, in which case it is known as the multivariate
normal distribution. It may be parameterized with a positive definite symmetric
matrix Σ:

N(x | µ, Σ) = √(1/((2π)^n det(Σ))) exp(−(1/2) (x − µ)^⊤ Σ^{−1} (x − µ)).
The parameter µ still gives the mean of the distribution, though now it is vector-
valued. The parameter Σ gives the covariance matrix of the distribution. As in the
univariate case, the covariance is not necessarily the most computationally efficient way
to parameterize the distribution, since we need to invert Σ to evaluate the PDF. We
can instead use a precision matrix β:
N(x | µ, β^{−1}) = √(det(β)/(2π)^n) exp(−(1/2) (x − µ)^⊤ β (x − µ)).
3.10.4 Dirac Distribution
In some cases, we wish to specify that all of the mass in a probability distribution
clusters around a single point. This can be accomplished by defining a PDF using the
Dirac delta function, δ(x):
p(x) = δ(x − µ).
The Dirac delta function is defined such that it is zero-valued everywhere but 0, yet
integrates to 1. By defining p(x) to be δ shifted by µ we obtain an infinitely narrow
and infinitely high peak of probability mass where x = µ.
A common use of the Dirac delta distribution is as a component of the so-called
empirical distribution,

p̂(x) = (1/n) ∑_{i=1}^{n} δ(x − x_i),

which puts probability mass 1/n on each of the n points x_1, . . . , x_n forming a given data
set or collection of samples. The Dirac delta distribution is only necessary to define the
set or collection of samples. The Dirac delta distribution is only necessary to define the
empirical distribution over continuous variables. For discrete variables, the situation is
simpler: an empirical distribution can be conceptualized as a Multinoulli distribution,
with a probability associated to each possible input value that is simply equal to the
empirical frequency of that value in the training set.
We can view the empirical distribution formed from a dataset of training examples as
specifying the distribution that we sample from when we train a model on this dataset.
Another important perspective on the empirical distribution is that it is the probability
density that maximizes the likelihood of the training data (see Section 5.6). Many
machine learning algorithms can be configured to have arbitrarily high capacity. If given
enough capacity, these algorithms will simply learn the empirical distribution. This is
a bad outcome because the model does not generalize at all and assigns infinitesimal
probability to any point in space that did not occur in the training set. A central
problem in machine learning is studying how to limit the capacity of a model in a way
that prevents it from simply learning the empirical distribution while also allowing it to
learn complicated functions.
The empirical distribution is a particular form of mixture, discussed next.
3.10.5 Mixtures of Distributions and Gaussian Mixture
It is also common to define probability distributions by composing other simpler prob-
ability distributions. One common way of combining distributions is to construct a
mixture distribution. A mixture distribution is made up of several component distribu-
tions. On each trial, the choice of which component distribution generates the sample
is determined by sampling a component identity from a Multinoulli distribution:
P(X) = ∑_i P(C = i) P(X | C = i)
where P (C) is the Multinoulli distribution over component identities. In chapter 9, we
explore the art of building complex probability distributions from simple ones in more
detail. Note that we can think of the variable C as a non-observed (or latent) random
variable that is related to X through their joint distribution P (X, C) = P (X|C)P (C).
Latent variables are discussed further in Section 9.4.2.
A very powerful and common type of mixture model is the Gaussian mixture model,
in which the components P(X | C = i) are Gaussians, each with its mean µ_i and
covariance Σ_i. Some mixtures can have more constraints, for example, the covariances
could be shared across components, i.e., Σ_i = Σ_j = Σ, or the covariance matrices could
be constrained to be diagonal or simply equal to a scalar times the identity. A Gaussian
mixture model is a universal approximator of densities, in the sense that any smooth
density can be approximated to any desired nonzero error by a Gaussian mixture model with
enough components. Gaussian mixture models have been used in many settings, and
are particularly well known for their use as acoustic models in speech recognition (Bahl
et al., 1987).
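Sampling from a mixture follows the generative story directly: first draw the component identity from the Multinoulli, then draw from the chosen Gaussian. The following sketch assumes NumPy, and the mixture parameters are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)

# A three-component univariate Gaussian mixture; the parameters are made up.
weights = np.array([0.5, 0.3, 0.2])        # Multinoulli P(C = i)
means = np.array([-2.0, 0.0, 3.0])         # mu_i
stds = np.array([0.5, 1.0, 0.3])           # sigma_i

def sample_mixture(n):
    # First sample the latent component identity C, then sample X | C.
    c = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[c], stds[c])

samples = sample_mixture(10_000)
print(samples.mean(), samples.std())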
3.11 Useful properties of common functions
Certain functions arise often while working with probability distributions, especially the
probability distributions used in deep learning models.
One of these functions is the logistic sigmoid:
σ(x) = 1/(1 + exp(−x)).
The logistic sigmoid is commonly used to produce the φ parameter of a Bernoulli dis-
tribution because its range is (0, 1), which lies within the valid range of values for the φ
parameter. See Fig. 3.3 for a graph of the sigmoid function.
Another commonly encountered function is the softplus function:
ζ(x) = log (1 + exp(x)) .
Figure 3.3: The logistic sigmoid function.
Figure 3.4: The softplus function.
The softplus function can be useful for producing the β or σ parameter of a normal
distribution because its range is R^+. It also arises commonly when manipulating
expressions involving sigmoids. The name of the softplus function comes from the fact
that it is a smoothed or “softened” version of

x^+ = max(0, x).
See Fig. 3.4 for a graph of the softplus function.
The following properties are all useful to have memorized:
σ(x) = exp(x) / (exp(x) + exp(0))
d/dx σ(x) = σ(x)(1 − σ(x))
1 − σ(x) = σ(−x)
log σ(x) = −ζ(−x)
d/dx ζ(x) = σ(x)
ζ(x) − ζ(−x) = x
∀x ∈ (0, 1), σ^{−1}(x) = log(x/(1 − x))
∀x > 0, ζ^{−1}(x) = log(exp(x) − 1)
The property ζ(x) − ζ(−x) = x provides extra justification for the name “softplus”, since
it parallels the identity x^+ − x^− = x, where x^− = max(0, −x).
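Several of these identities are easy to verify numerically; the sketch below assumes NumPy and checks three of them on an arbitrary grid of inputs.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.log1p(np.exp(x))   # log(1 + exp(x))

x = np.linspace(-5, 5, 11)
print(np.allclose(1.0 - sigmoid(x), sigmoid(-x)))        # 1 - sigma(x) = sigma(-x)
print(np.allclose(np.log(sigmoid(x)), -softplus(-x)))    # log sigma(x) = -zeta(-x)
print(np.allclose(softplus(x) - softplus(-x), x))        # zeta(x) - zeta(-x) = x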
3.12 Bayes’ rule
We often find ourselves in a situation where we know P (Y | X) and need to know
P (X | Y ). Fortunately, if we also know P (X), we can compute the desired quantity
using Bayes’ rule:
P(X | Y) = P(X) P(Y | X) / P(Y).

Note that while P(Y) appears in the formula, it is usually feasible to compute P(Y) =
∑_x P(Y | x) P(x), so we do not need to begin with knowledge of P(Y).
Bayes’ rule is straightforward to derive from the definition of conditional probability,
but it is useful to know the name of this formula since many texts refer to it by name.
It is named after the Reverend Thomas Bayes, who first discovered a special case of the
formula. The general version presented here was independently discovered by Pierre-
Simon Laplace.
3.13 Technical details of continuous variables
A proper formal understanding of continuous random variables and probability density
functions requires developing probability theory in terms of a branch of mathematics
known as measure theory. Measure theory is beyond the scope of this textbook, but we
can briefly sketch some of the issues that measure theory is employed to resolve.
In section 3.3.2, we saw that the probability of x lying in some set S is given by
the integral of p(x) over the set S. Some choices of set S can produce paradoxes. For
example, it is possible to construct two sets S_1 and S_2 such that P(S_1) + P(S_2) > 1 but
S_1 ∩ S_2 = ∅. These sets are generally constructed making very heavy use of the infinite
precision of real numbers, for example by making fractal-shaped sets or sets that are
defined by transforming the set of rational numbers.[3] One of the key contributions of
measure theory is to provide a characterization of the set of sets that we can compute
the probability of without encountering paradoxes. In this book, we only integrate over
sets with relatively simple descriptions, so this aspect of measure theory never becomes
a relevant concern.
[3] The Banach-Tarski theorem provides a fun example of such sets.

For our purposes, measure theory is more useful for describing theorems that apply
to most points in R^n but do not apply to some corner cases. Measure theory provides a
rigorous way of describing that a set of points is negligibly small. Such a set is said to
have “measure zero”. We do not formally define this concept in this textbook. However,
it is useful to understand the intuition that a set of measure zero occupies no volume
in the space we are measuring. For example, within R², a line has measure zero, while
a filled polygon has positive measure. Likewise, an individual point has measure zero.
Any union of countably many sets that each have measure zero also has measure zero
(so the set of all the rational numbers has measure zero, for instance).
Another useful term from measure theory is “almost everywhere”. A property that
holds almost everywhere holds throughout all of space except for on a set of measure
zero. Because the exceptions occupy a negligible amount of space, they can be safely
ignored for many applications. Some important results in probability theory hold for all
discrete values but only hold “almost everywhere” for continuous values.
One other detail we must be aware of relates to handling random variables that are
deterministic functions of one another. Suppose we have two random variables, x and
y, such that y = g(x). You might think that p_y(y) = p_x(g^{−1}(y)). This is actually not
the case.
Suppose y = x/2 and x ∼ U(0, 1). If we use the rule p_y(y) = p_x(2y) then p_y will be
0 everywhere except the interval [0, 1/2], and it will be 1 on this interval. This means

∫_{−∞}^{∞} p_y(y) dy = 1/2,

which violates the definition of a probability distribution.
This common mistake is wrong because it fails to account for the distortion of space
introduced by the function g(x). Recall that the probability of x lying in an infinitesi-
mally small region with volume δx is given by p(x)δx. Since g can expand or contract
space, the infinitesimal volume surrounding x in x space may have different volume in
y space. To correct the problem, we need to preserve the property
|p_y(g(x)) dy| = |p_x(x) dx|.
Solving from this, we obtain

p_y(y) = p_x(g^{−1}(y)) |∂x/∂y|

or equivalently

p_x(x) = p_y(g(x)) |∂g(x)/∂x|. (3.1)
In higher dimensions, the absolute value of the derivative generalizes to the determinant
of the Jacobian matrix, the matrix with J_{i,j} = ∂x_i/∂y_j.
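The worked example above with y = x/2 can be checked numerically. The sketch below (assuming NumPy; the grid and sample sizes are arbitrary) shows that omitting the Jacobian factor |dx/dy| = 2 gives a "density" that integrates to 1/2, while including it restores normalization and matches a histogram of samples of y.

import numpy as np

rng = np.random.default_rng(0)

# x ~ U(0, 1) and y = g(x) = x / 2, so g^{-1}(y) = 2y and |dx/dy| = 2.
x = rng.uniform(0.0, 1.0, size=1_000_000)
y = x / 2.0

def p_y_wrong(y):
    return np.where((2 * y >= 0) & (2 * y <= 1), 1.0, 0.0)   # p_x(g^{-1}(y)), no Jacobian

def p_y_correct(y):
    return p_y_wrong(y) * 2.0                                # multiply by |dx/dy| = 2

grid = np.linspace(-0.5, 1.0, 30_001)
dy = grid[1] - grid[0]
print(np.sum(p_y_wrong(grid)) * dy)      # ~0.5, violates normalization
print(np.sum(p_y_correct(grid)) * dy)    # ~1.0
print(np.histogram(y, bins=10, range=(0, 0.5), density=True)[0])  # empirically ~2 on [0, 0.5]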
3.14 Example: Naive Bayes
We now know enough probability theory that we can derive some simple machine learn-
ing tools.
The Naive Bayes model is a simple probabilistic model that is often used to recognize
patterns. The model consists of one random variable C representing a category, and a
set of random variables F = {F^{(1)}, . . . , F^{(n)}} representing features of objects in each
category. In this example, we’ll use Naive Bayes to diagnose patients as having the flu
or not. C can thus have two values: c_0 representing the category of patients who do
not have the flu, and c_1 representing the category of patients who do. Suppose F^{(1)}
is the random variable representing whether the patient has a sore throat, with f^{(1)}_0
representing no sore throat, and f^{(1)}_1 representing a sore throat. Suppose F^{(2)} ∈ R is
the patient’s temperature in degrees Celsius.
When using the Naive Bayes model, we assume that all of the features are indepen-
dent from each other given the category:
P(C, F^{(1)}, . . . , F^{(n)}) = P(C) ∏_i P(F^{(i)} | C).
These assumptions are very strong and unlikely to be true in practice, hence the name
“naive”. Surprisingly, Naive Bayes often produces good predictions in practice, and is
a good baseline model to start with when tackling a new problem.
Beyond these conditional independence assumptions, the Naive Bayes framework
does not specify anything about the probability distribution. The specific choice of
distributions is left up to the designer. In our flu example, let’s make P (C) a Bernoulli
distribution, with P(C = c_1) = φ^{(C)}. We can also make P(F^{(1)} | C) a Bernoulli
distribution, with

P(F^{(1)} = f^{(1)}_1 | C = c) = φ^{(F)}_c.
In other words, the Bernoulli parameter changes depending on the value of C. Finally, we
need to choose the distribution over F^{(2)}. Since F^{(2)} is real-valued, a normal distribution
is a good choice. Because F^{(2)} is a temperature, there are hard limits to the values it
can take on: it cannot go below 0 K, for example. Fortunately, these values are so far
from the values measured in human patients that we can safely ignore these hard limits.
Values outside the hard limits will receive extremely low probability under the normal
distribution so long as the mean and variance are set correctly. As with F^{(1)}, we need
to use different parameters for different values of C, to represent that patients with the
flu have different temperatures than patients without it:

F^{(2)} ∼ N(F^{(2)} | µ_c, σ²_c).
Now we are ready to determine how likely a patient is to have the flu. To do this,
we want to compute P(C | F ), but we know P (C) and P (F | C). This suggests that
we should use Bayes’ rule to determine the desired distribution. The word “Bayes” in
the name “Naive Bayes” comes from this frequent use of Bayes’ rule in conjunction with
the model. We begin by applying Bayes’ rule:
P(C | F) = P(C) P(F | C) / P(F). (3.2)
We do not know P(F). Fortunately, it is easy to compute:

P(F) = ∑_{c∈C} P(C = c, F)   (by the sum rule)
     = ∑_{c∈C} P(C = c) P(F | C = c)   (by the chain rule).

Substituting this result back into equation 3.2, we obtain

P(C | F) = P(C) P(F | C) / ∑_{c∈C} P(C = c) P(F | C = c)
         = [P(C) ∏_i P(F^{(i)} | C)] / [∑_{c∈C} P(C = c) ∏_i P(F^{(i)} | C = c)]
by the Naive Bayes assumptions. This is as far as we can simplify the expression for a
general Naive Bayes model.
We can simplify the expression further by substituting in the definitions of the par-
ticular probability distributions we have defined for our flu diagnosis example:
P(C = c | F^{(1)} = f_1, F^{(2)} = f_2) = g(c) / ∑_{c′∈C} g(c′)

where

g(c) = P(C = c) P(F^{(1)} = f_1 | C = c) p(F^{(2)} = f_2 | C = c).
Since C only has two possible values in our example, we can simplify this to:

P(C = c_1 | F^{(1)} = f_1, F^{(2)} = f_2) = g(1) / (g(0) + g(1))
   = 1 / (1 + g(0)/g(1))
   = 1 / (1 + exp(log g(0) − log g(1)))
   = σ(log g(1) − log g(0)). (3.3)
To go further, let’s simplify log g(i):

log g(i) = log[ (φ^{(C)})^i (1 − φ^{(C)})^{1−i} (φ^{(F)}_i)^{f_1} (1 − φ^{(F)}_i)^{1−f_1} √(1/(2πσ²_i)) exp(−(f_2 − µ_i)²/(2σ²_i)) ]

   = i log φ^{(C)} + (1 − i) log(1 − φ^{(C)}) + f_1 log φ^{(F)}_i + (1 − f_1) log(1 − φ^{(F)}_i)
     + (1/2) log(1/(2πσ²_i)) − (f_2 − µ_i)²/(2σ²_i).
Substituting this back into equation 3.3, we obtain

P(C = c_1 | F^{(1)} = f_1, F^{(2)} = f_2) =
   σ( log φ^{(C)} − log(1 − φ^{(C)})
      + f_1 log φ^{(F)}_1 + (1 − f_1) log(1 − φ^{(F)}_1)
      − f_1 log φ^{(F)}_0 − (1 − f_1) log(1 − φ^{(F)}_0)
      − (1/2) log 2πσ²_1 + (1/2) log 2πσ²_0
      − (f_2 − µ_1)²/(2σ²_1) + (f_2 − µ_0)²/(2σ²_0) ).
From this formula, we can read off various intuitive properties of the Naive Bayes
classifier’s behavior on this example problem, regarding the inference that can be drawn
from a trained model. The probability of the patient having the flu grows like a sigmoidal
curve. We move farther to the left as f_2, the patient’s temperature, moves farther away
from µ_1, the average temperature of a flu patient.
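To make the whole pipeline concrete, here is a hypothetical Python sketch of the resulting classifier. The parameter values are invented for illustration (in practice they would be estimated from data), and the function names are not part of any library.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters for the flu example.
phi_C = 0.2                    # P(C = c1), prior probability of flu
phi_F = {0: 0.1, 1: 0.7}       # P(sore throat | C = c), for c = 0 and c = 1
mu = {0: 36.8, 1: 38.5}        # mean temperature (Celsius) per class
sigma2 = {0: 0.25, 1: 0.64}    # temperature variance per class

def log_g(c, f1, f2):
    """log of P(C = c) P(F1 = f1 | C = c) p(F2 = f2 | C = c)."""
    log_prior = np.log(phi_C) if c == 1 else np.log(1 - phi_C)
    log_sore = f1 * np.log(phi_F[c]) + (1 - f1) * np.log(1 - phi_F[c])
    log_temp = -0.5 * np.log(2 * np.pi * sigma2[c]) - (f2 - mu[c]) ** 2 / (2 * sigma2[c])
    return log_prior + log_sore + log_temp

def p_flu(f1, f2):
    """P(C = c1 | F1 = f1, F2 = f2) = sigma(log g(1) - log g(0))."""
    return sigmoid(log_g(1, f1, f2) - log_g(0, f1, f2))

print(p_flu(f1=1, f2=39.0))   # sore throat and high fever: high probability
print(p_flu(f1=0, f2=36.7))   # no symptoms: low probability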