Chapter 3
Probability and Information Theory
In this chapter, we describe probability theory. Probability theory is a mathe-
matical framework for representing uncertain statements. It provides a means
of quantifying uncertainty and axioms for deriving new uncertain statements. In
artificial intelligence applications, we use probability theory in two major ways.
First, the laws of probability tell us how AI systems should reason, so we de-
sign our algorithms to compute or approximate various expressions derived using
probability theory. Second, we can use probability and statistics to theoretically
analyze the behavior of proposed AI systems.
Probability theory is a fundamental tool of many disciplines of science and
engineering. We provide this chapter to ensure that readers whose background is
primarily in software engineering with limited exposure to probability theory can
understand the material in this book. If you are already familiar with probability
theory, feel free to skip this chapter. If you have absolutely no prior experi-
ence with probability, this chapter should be sufficient to successfully carry out
deep learning research projects, but we do suggest that you consult an additional
resource, such as (Jaynes, 2003).
3.1 Why Probability?
Many branches of computer science deal mostly with entities that are entirely
deterministic and certain. A programmer can usually safely assume that a CPU
will execute each machine instruction flawlessly. Errors in hardware do occur, but
are rare enough that most software applications do not need to be designed to
account for them. Given that many computer scientists and software engineers
work in a relatively clean and certain environment, it can be surprising that
machine learning makes heavy use of probability theory.
This is because machine learning must always deal with uncertain quantities,
and sometimes may also need to deal with stochastic quantities. Uncertainty
and stochasticity can arise from many sources. Researchers have made com-
pelling arguments for quantifying uncertainty using probability since at least the
1980s. Many of the arguments presented here are summarized from or inspired
by (Pearl, 1988). Much earlier work in probability and engineering introduced
and developed the underlying fundamental notions, such as the notion of ex-
changeability (de Finetti, 1937), Cox’s theorem as the foundations of Bayesian
inference (Cox, 1946), and the theory of stochastic processes (Doob, 1953).
Nearly all activities require some ability to reason in the presence of uncer-
tainty. In fact, beyond mathematical statements that are true by definition, it is
difficult to think of any proposition that is absolutely true or any event that is
absolutely guaranteed to occur.
One source of uncertainty is incomplete observability. When we cannot ob-
serve something, we are uncertain about its true nature. In machine learning, it
is often the case that we can observe a large amount of data, but there is not a
data instance for every situation we care about. We are also generally not able to
observe directly what process generates the data. Since we are uncertain about
what process generates the data, we are also uncertain about what happens in
the situations for which we have not observed data points. Lack of observability
can also give rise to apparent stochasticity. Deterministic systems can appear
stochastic when we cannot observe all of the variables that drive the behavior of
the system. For example, consider a game of Russian roulette. The outcome is
deterministic if you know which chamber of the revolver is loaded. If you do not
know this important information, then it is a game of chance. In many cases, we
are able to observe some quantity, but our measurement is itself uncertain. For
example, laser range finders may have several centimeters of random error.
Uncertainty can also arise from the simplifications we make in order to model
real-world processes. For example, if we discretize space, then we immediately
become uncertain about the precise position of objects: each object could be
anywhere within the discrete cell that we know it occupies.
Conceivably, the universe itself could have stochastic dynamics, but we make
no claim on this subject.
In many cases, it is more practical to use a simple but uncertain rule rather
than a complex but certain one, even if our modeling system has the fidelity to
accommodate a complex rule. For example, the simple rule “Most birds fly” is
cheap to develop and is broadly useful, while a rule of the form, “Birds fly, except
for very young birds that have not yet learned to fly, sick or injured birds that have
lost the ability to fly, flightless species of birds including the cassowary, ostrich,
and kiwi, . . . ” is expensive to develop, maintain, and communicate, and after all
of this effort is still very brittle and prone to failure.
Given that we need a means of representing and reasoning about uncertainty,
it is not immediately obvious that probability theory can provide all of the tools
we want for artificial intelligence applications. Probability theory was originally
developed to analyze the frequencies of events. It is easy to see how probability
theory can be used to study events like drawing a certain hand of cards in a
game of poker. These kinds of events are often repeatable, and when we say that
an outcome has a probability p of occurring, it means that if we repeated the
experiment (e.g., draw a hand of cards) infinitely many times, then proportion
p of the repetitions would result in that outcome. This kind of reasoning does
not seem immediately applicable to propositions that are not repeatable. If a
doctor analyzes a patient and says that the patient has a 40% chance of having
the flu, this means something very different—we cannot make infinitely many
replicas of the patient, nor is there any reason to believe that different replicas of
the patient would present with the same symptoms yet have varying underlying
conditions. In the case of the doctor diagnosing the patient, we use probability to
represent a degree of belief, with 1 indicating absolute certainty, and 0 indicating
absolute uncertainty. The former kind of probability, related directly to the rates
at which events occur, is known as frequentist probability, while the latter, related
to qualitative levels of certainty, is known as Bayesian probability.
It turns out that if we list several properties that we expect common sense
reasoning about uncertainty to have, then the only way to satisfy those properties
is to treat Bayesian probabilities as behaving exactly the same as frequentist
probabilities. For example, if we want to compute the probability that a player
will win a poker game given that she has a certain set of cards, we use exactly the
same formulas as when we compute the probability that a patient has a disease
given that she has certain symptoms. For more details about why a small set of
common sense assumptions implies that the same axioms must control both kinds
of probability, see (Ramsey, 1926).
Probability can be seen as the extension of logic to deal with uncertainty.
Logic provides a set of formal rules for determining what propositions are implied
to be true or false given the assumption that some other set of propositions is
true or false. Probability theory provides a set of formal rules for determining the
likelihood of a proposition being true given the likelihood of other propositions.
3.2 Random Variables
A random variable is a variable that can take on different values randomly. We
typically denote the random variable itself with a lower case letter in plain type-
face, and the values it can take on with lower case script letters. For example,
$x_1$ and $x_2$ are both possible values that the random variable x can take on. For
vector-valued variables, we would write the random variable as x and one of its
values as x. On its own, a random variable is just a description of the states that
are possible; it must be coupled with a probability distribution that specifies how
likely each of these states are.
Random variables may be discrete or continuous. A discrete random variable
is one that has a finite or countably infinite number of states. Note that these
states are not necessarily the integers; they can also just be named states that
are not considered to have any numerical value. A continuous random variable is
associated with a real value.
3.3 Probability Distributions
A probability distribution is a description of how likely a random variable or set
of random variables is to take on each of its possible states. The way we de-
scribe probability distributions depends on whether the variables are discrete or
continuous.
3.3.1 Discrete Variables and Probability Mass Functions
A probability distribution over discrete variables may be described using a proba-
bility mass function (PMF). We typically denote probability mass functions with
a capital P. Often we associate each random variable with a different probabil-
ity mass function and the reader must infer which probability mass function to
use based on the identity of the random variable, rather than the name of the
function; P (x) is usually not the same as P (y).
The probability mass function maps from a state of a random variable to
the probability of that random variable taking on that state. P (x) denotes the
probability that x = x, with a probability of 1 indicating that x = x is certain and
a probability of 0 indicating that x = x is impossible. Sometimes to disambiguate
which PMF to use, we write the name of the random variable explicitly: P (x =
x). Sometimes we define a variable first, then use notation to specify which
distribution it follows later: $\mathrm{x} \sim P(\mathrm{x})$.
Probability mass functions can act on many variables at the same time. Such
a probability distribution over many variables is known as a joint probability
distribution. P (x = x, y = y) denotes the probability that x = x and y = y
simultaneously. We may also write P (x, y) for brevity.
To be a probability mass function on a set of random variables x, a function
f must meet the following properties:
• The domain of f must be the set of all possible states of x.
• The range of f must be a subset of the real interval [0, 1]. (No state can be
more likely than a guaranteed event or less likely than an impossible event.)
• $\sum_{x \in \mathrm{x}} f(x) = 1$. (f must guarantee that some state occurs.)
For example, consider a single discrete random variable x with k different
states. We can place a uniform distribution on x, that is, make each of its
states equally likely, by setting its probability mass function to
$$P(\mathrm{x} = x_i) = \frac{1}{k}$$
for all i. We can see that this fits the requirements for a probability mass function.
The value $\frac{1}{k}$ is positive because k is a positive integer. We also see that
$$\sum_i P(\mathrm{x} = x_i) = \sum_i \frac{1}{k} = \frac{k}{k} = 1,$$
so the distribution is properly normalized.
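The following short Python sketch makes this concrete (the choice k = 6 is an arbitrary illustrative value): it represents the uniform PMF as an array and checks the requirements listed above numerically.

import numpy as np

k = 6                                # number of states; any positive integer works
pmf = np.full(k, 1.0 / k)            # P(x = x_i) = 1/k for every state i

# The requirements for a probability mass function:
assert np.all((pmf >= 0.0) & (pmf <= 1.0))   # every value lies in [0, 1]
assert np.isclose(pmf.sum(), 1.0)            # the probabilities sum to 1

# Drawing samples shows each state occurring with roughly equal frequency.
rng = np.random.default_rng(0)
samples = rng.choice(k, size=100_000, p=pmf)
print(np.bincount(samples, minlength=k) / samples.size)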
3.3.2 Continuous Variables and Probability Density Functions
When working with continuous random variables, we describe probability distri-
butions using a probability density function (PDF) rather than a probability mass
function. A probability density function must satisfy the following properties:
• It must map from the domain of the random variable whose distribution it
describes to the real numbers.
• $\forall x,\ p(x) \geq 0$. Note that we do not require $p(x) \leq 1$.
• $\int p(x)\,dx = 1$.
A probability density function does not give the probability of a specific state
directly; instead, the probability of landing inside an infinitesimal region with
volume δx is given by p(x)δx.
We can integrate the density function to find the actual probability mass of
a set of points. Specifically, the probability that x lies in some set S is given by
the integral of p(x) over that set. In the univariate example, the probability that
x lies in the interval [a, b] is given by $\int_{[a,b]} p(x)\,dx$.
For an example of a probability density function corresponding to a specific
probability density over a continuous random variable, consider a uniform dis-
tribution on an interval of the real numbers. We can do this with a function
u(x; a, b), where a and b are the endpoints of the interval, with b > a. (The
“;” notation means “parametrized by”; we consider x to be the argument of the
function, while a and b are parameters that define the function.) To ensure that
there is no probability mass outside the interval, we say u(x; a, b) = 0 for all
$x \notin [a, b]$. Within [a, b], $u(x; a, b) = \frac{1}{b - a}$. We can see that this is nonnegative
everywhere. Additionally, it integrates to 1. We often denote that x follows the
uniform distribution on [a, b] by writing $\mathrm{x} \sim U(a, b)$.
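As a small numerical check (a sketch of our own, with a = 0 and b = 2 chosen arbitrarily), we can code u(x; a, b) directly and verify that it is nonnegative and integrates to 1.

import numpy as np

def uniform_pdf(x, a, b):
    """Density u(x; a, b): 1/(b - a) inside [a, b], zero outside."""
    x = np.asarray(x, dtype=float)
    inside = (x >= a) & (x <= b)
    return np.where(inside, 1.0 / (b - a), 0.0)

a, b = 0.0, 2.0
grid = np.linspace(a - 1.0, b + 1.0, 100_001)   # cover a region larger than [a, b]
density = uniform_pdf(grid, a, b)

assert np.all(density >= 0.0)                   # nonnegative everywhere
dx = grid[1] - grid[0]
print((density * dx).sum())                     # Riemann-sum approximation of the integral, close to 1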
3.4 Marginal Probability
Sometimes we know the probability distribution over a set of variables, and we
want to know the probability distribution over just a subset of them. The prob-
ability distribution over the subset is known as the marginal probability.
For example, suppose we have discrete random variables x and y, and we know
P (x, y). We can find P (x) with the sum rule:
$$\forall x \in \mathrm{x},\ P(\mathrm{x} = x) = \sum_y P(\mathrm{x} = x, \mathrm{y} = y).$$
The name “marginal probability” comes from the process of computing marginal
probabilities on paper. When the values of P(x, y) are written in a grid with dif-
ferent values of x in rows and different values of y in columns, it is natural to sum
across a row of the grid, then write P (x) in the margin of the paper just to the
right of the row.
For continuous variables, we need to use integration instead of summation:
$$p(x) = \int p(x, y)\,dy.$$
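To make the sum rule concrete, here is a minimal Python sketch (the 2 × 3 joint table is an arbitrary illustrative choice) that recovers both marginals from a joint PMF stored as a grid, mirroring the pencil-and-paper procedure described above.

import numpy as np

# Joint PMF P(x, y): rows index values of x, columns index values of y.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.05, 0.30]])
assert np.isclose(joint.sum(), 1.0)

p_x = joint.sum(axis=1)   # sum across each row: P(x = x) = sum_y P(x = x, y = y)
p_y = joint.sum(axis=0)   # summing down each column gives the marginal over y

print(p_x)                # [0.4, 0.6]
print(p_y)                # [0.35, 0.25, 0.4]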
3.5 Conditional Probability
In many cases, we are interested in the probability of some event, given that some
other event has happened. This is called a conditional probability. We denote the
conditional probability that y = y given x = x as P (y = y | x = x). This
conditional probability can be computed with the formula
P (y = y | x = x) = P (y = y, x = x)/P(x = x).
Note that this is only defined when P (x = x) > 0. We cannot compute the
conditional probability conditioned on an event that never happens.
It is important not to confuse conditional probability with computing what
would happen if some action were undertaken. The conditional probability that
a person is from Germany given that they speak German is quite high, but if
a randomly selected person is taught to speak German, their country of origin
does not change. Computing the consequences of an action is called making an
intervention query. Intervention queries are the domain of causal modeling, which
we do not explore in this book.
3.6 The Chain Rule of Conditional Probabilities
Any joint probability distribution over many random variables may be decom-
posed into conditional distributions over only one variable:
$$P(\mathrm{x}^{(1)}, \ldots, \mathrm{x}^{(n)}) = P(\mathrm{x}^{(1)}) \prod_{i=2}^{n} P(\mathrm{x}^{(i)} \mid \mathrm{x}^{(1)}, \ldots, \mathrm{x}^{(i-1)}).$$
This observation is known as the chain rule or product rule of probability. It
follows immediately from the definition of conditional probability. For example,
applying the definition twice, we get
P (a, b, c) = P (a|b, c)P (b, c)
P (b, c) = P (b|c)P (c)
P (a, b, c) = P (a|b, c)P (b|c)P (c).
Note how every statement about probabilities remains true if we add conditions
(stuff on the right-hand side of the vertical bar) consistently on all the “P”’s in
the statement. We can use this to derive the same thing differently:
P (a, b|c) = P (a|b, c)P (b|c)
P (a, b, c) = P (a, b|c)P (c) = P (a|b, c)P (b|c)P (c).
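A brief numerical check of the chain rule (a sketch of our own, using a small random joint table over three binary variables): we factor P(a, b, c) as P(a | b, c)P(b | c)P(c) and confirm that the product reproduces the joint.

import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))          # unnormalized table indexed by (a, b, c)
joint /= joint.sum()                   # normalize so it is a valid joint PMF

p_c = joint.sum(axis=(0, 1))                        # P(c)
p_bc = joint.sum(axis=0)                            # P(b, c)
p_b_given_c = p_bc / p_c                            # P(b | c)
p_a_given_bc = joint / p_bc                         # P(a | b, c), broadcasting over a

reconstructed = p_a_given_bc * p_b_given_c * p_c    # P(a | b, c) P(b | c) P(c)
assert np.allclose(reconstructed, joint)
print("chain rule factorization reproduces the joint")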
3.7 Independence and Conditional Independence
Two random variables x and y are independent if their probability distribution can
be expressed as a product of two factors, one involving only x and one involving
only y:
$$\forall x \in \mathrm{x}, y \in \mathrm{y},\ p(\mathrm{x} = x, \mathrm{y} = y) = p(\mathrm{x} = x)\,p(\mathrm{y} = y).$$
Two random variables x and y are conditionally independent given a random
variable z if the conditional probability distribution over x and y factorizes in this
way for every value of z:
$$\forall x \in \mathrm{x}, y \in \mathrm{y}, z \in \mathrm{z},\ p(\mathrm{x} = x, \mathrm{y} = y \mid \mathrm{z} = z) = p(\mathrm{x} = x \mid \mathrm{z} = z)\,p(\mathrm{y} = y \mid \mathrm{z} = z).$$
We can denote independence and conditional independence with compact no-
tation: x⊥y means that x and y are independent, while x⊥y | z means that x and
y are conditionally independent given z.
3.8 Expectation, Variance, and Covariance
The expectation or expected value of some function f(x) with respect to a prob-
ability distribution P (x) is the average or mean value that f takes on when x is
drawn from P . For discrete variables this can be computed with a summation:
$$E_{\mathrm{x} \sim P}[f(x)] = \sum_x P(x) f(x),$$
while for continuous variables, it is computed with an integral:
$$E_{\mathrm{x} \sim p}[f(x)] = \int p(x) f(x)\,dx.$$
When the identity of the distribution is clear from the context, we may simply
write the name of the random variable that the expectation is over, e.g. $E_{\mathrm{x}}[f(x)]$.
If it is clear which random variable the expectation is over, we may omit the
subscript entirely, e.g. E[f(x)]. By default, we can assume that E[·] averages over
the values of all the random variables inside the brackets. Likewise, when there
is no ambiguity, we may omit the square brackets.
Expectations are linear; for example, E[αf(x) + βg(x)] = αE[f(x)] + βE[g(x)],
when α and β are fixed (not random and not depending on x).
The variance gives a measure of how much the different values of a function
are spread apart:
$$\mathrm{Var}(f(x)) = E\left[(f(x) - E[f(x)])^2\right].$$
When the variance is low, the values of f(x) cluster near their expected value.
The square root of the variance is known as the standard deviation.
The covariance gives some sense of how much two values are linearly related
to each other, as well as the scale of these variables:
$$\mathrm{Cov}(f(x), g(y)) = E\big[(f(x) - E[f(x)])(g(y) - E[g(y)])\big].$$
High absolute values of the covariance mean that the values change a lot and
are both far from their respective means at the same time. If the sign of the
covariance is positive, then the values tend to change in the same direction, while
if it is negative, they tend to change in opposite directions. Other measures such
as correlation normalize the contribution of each variable in order to measure only
how much the variables are related, rather than also being affected by the scale
of the separate variables.
The notions of covariance and dependence are conceptually related, but are in
fact distinct concepts. Two random variables that have non-zero covariance are
dependent. However, they may have zero covariance without being independent.
For example, suppose we first generate x, then generate s ∈ {−1, 1} with each
state having probability 0.5, then generate y as s(x − E[x]). Clearly, x and y
are not independent, because y only has two possible values given x. However,
Cov(x, y) = 0.
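The following simulation is a sketch of exactly this construction (with x uniform on [0, 1] as an arbitrary choice, and the sample mean standing in for E[x]): the empirical covariance is near zero even though y is a deterministic function of x and s.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

x = rng.random(n)                               # arbitrary choice: x ~ U(0, 1)
s = rng.choice([-1.0, 1.0], size=n)             # s is -1 or 1, each with probability 0.5
y = s * (x - x.mean())                          # y = s (x - E[x]), using the sample mean

print(np.cov(x, y)[0, 1])                       # empirical Cov(x, y): close to 0
print(np.corrcoef(np.abs(y), np.abs(x - x.mean()))[0, 1])  # exactly 1: y clearly depends on x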
The covariance matrix of a random vector $\mathbf{x} \in \mathbb{R}^n$ is an $n \times n$ matrix, such
that
$$\mathrm{Cov}(\mathbf{x})_{i,j} = \mathrm{Cov}(x_i, x_j).$$
Note that the diagonal elements give $\mathrm{Cov}(x_i, x_i) = \mathrm{Var}(x_i)$.
3.9 Information Theory
Information theory is a branch of applied mathematics that revolves around quan-
tifying how much information is present in a signal. It was originally invented
to study sending messages from discrete alphabets over a noisy channel, such as
communication via radio transmission. In this context, information theory tells
how to design optimal codes and calculate the expected length of messages sam-
pled from specific probability distributions using various encoding schemes. In
the context of machine learning, we can also apply information theory to contin-
uous variables where some of these message length interpretations do not apply.
This field is fundamental to many areas of electrical engineering and computer
science. In this textbook, we mostly use a few key ideas from information theory
to characterize probability distributions or quantify similarity between probabil-
ity distributions. For more detail on information theory, see (Cover and Thomas,
2006; MacKay, 2003).
The basic intuition behind information theory is that learning that an un-
likely event has occurred is more informative than learning that a likely event has
occurred. A message saying “the sun rose this morning” is so uninformative as
to be unnecessary to send, but a message saying “there was a solar eclipse this
morning” is very informative.
We would like to quantify information in a way that formalizes this intuition.
Specifically,
• Likely events should have low information content, and in the extreme case,
events that are guaranteed to happen should have no information content
whatsoever.
• Less likely events should have higher information content.
• Independent events should have additive information. For example, finding
out that a tossed coin has come up as heads twice should convey twice as
much information as finding out that a tossed coin has come up as heads
once.
In order to satisfy all three of these properties, we define the self-information
of an event x = x to be
$$I(x) = -\log P(x). \tag{3.1}$$
In this book, we always use log to mean the natural logarithm, with base e. Our
definition of I(x) is therefore written in units of nats. One nat is the amount of
information gained by observing an event of probability $\frac{1}{e}$. Other texts use base-2
. Other texts use base-2
logarithms and units called bits or shannons; information measured in bits is just
a rescaling of information measured in nats.
When x is continuous, we use the same definition of information by analogy,
but some of the properties from the discrete case are lost. For example, an event
with unit density still has zero information, despite not being an event that is
guaranteed to occur.
Self-information deals only with a single outcome. We can quantify the
amount of uncertainty in an entire probability distribution using the Shannon
entropy¹:
$$H(\mathrm{x}) = E_{\mathrm{x} \sim P}[I(x)] = -E_{\mathrm{x} \sim P}[\log P(x)], \tag{3.2}$$
also denoted H(P). In other words, the Shannon entropy of a distribution is
the expected amount of information in an event drawn from that distribution. It
actually gives a lower bound on the number of bits (if the logarithm is base 2;
otherwise the units are different) needed on average to encode symbols drawn from
a distribution P. Distributions that are nearly deterministic (where the outcome
is nearly certain) have low entropy; distributions that are closer to uniform have
high entropy. See Fig. 3.1 for a demonstration. When x is continuous, the Shannon
entropy is known as the differential entropy.
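A short sketch of our own that computes the Shannon entropy (in nats) of a Bernoulli distribution as a function of p reproduces the behavior shown in Fig. 3.1: near-deterministic distributions have low entropy, and the uniform case p = 0.5 has the highest.

import numpy as np

def bernoulli_entropy(p):
    """Shannon entropy, in nats, of a binary variable with P(x = 1) = p."""
    p = np.asarray(p, dtype=float)
    safe_p = np.where(p > 0, p, 1.0)        # avoids log(0); implements the 0 log 0 = 0 convention
    safe_q = np.where(p < 1, 1.0 - p, 1.0)
    return -(p * np.log(safe_p) + (1.0 - p) * np.log(safe_q))

for p in [0.0, 0.1, 0.5, 0.9, 1.0]:
    print(p, bernoulli_entropy(p))          # maximal at p = 0.5 (log 2, about 0.693 nats)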
If we have two separate probability distributions P (x) and Q(x) over the same
random variable x, we can measure how different these two distributions are using
the Kullback-Leibler (KL) divergence:
$$D_{\mathrm{KL}}(Q \| P) = E_{\mathrm{x} \sim Q}\left[\log \frac{Q(x)}{P(x)}\right]. \tag{3.3}$$
In the case of discrete variables, it is the extra amount of information needed to
send a message containing symbols drawn with probability Q, when we incorrectly
believe that they were drawn with probability P , i.e., it measures how much it
hurts to use P as a model when Q is the ground truth. The KL divergence has
¹ Shannon entropy is named for Claude Shannon, the father of information theory (Shannon,
1948, 1949). For an interesting biographical account of Shannon and some of his contemporaries,
see Fortune’s Formula by William Poundstone (Poundstone, 2005).
Figure 3.1: This plot shows how distributions that are closer to deterministic have low
Shannon entropy while distributions that are close to uniform have high Shannon entropy.
On the horizontal axis, we plot p, the probability of a binary random variable being equal
to 1. When p is near 0, the distribution is nearly deterministic, because the random
variable is nearly always 0. When p is near 1, the distribution is nearly deterministic,
because the random variable is nearly always 1. When p = 0.5, the entropy is maximal,
because the distribution is uniform over the two outcomes.
many useful properties, most notably that it is non-negative. The KL divergence
is 0 if and only if P and Q are the same distribution in the case of discrete
variables, or equal “almost everywhere” in the case of continuous variables (see
section 3.13 for details). Because the KL divergence is non-negative and measures
the difference between two distributions, it is often conceptualized as measuring
some sort of distance between these distributions. However, it is not a true
distance measure because it is not symmetric, i.e. $D_{\mathrm{KL}}(P \| Q) \neq D_{\mathrm{KL}}(Q \| P)$ for
some P and Q.
When computing many of these quantities, it is common to encounter expres-
sions of the form 0 log 0. By convention, in the context of information theory, we
treat these expressions as $\lim_{x \to 0} x \log x = 0$.
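Here is a minimal Python sketch of the discrete KL divergence of equation 3.3, using the 0 log 0 = 0 convention; the two example distributions are arbitrary, and the asymmetry of the KL divergence is visible in the output.

import numpy as np

def kl_divergence(q, p):
    """D_KL(Q || P) = sum_x Q(x) log(Q(x) / P(x)), in nats, for discrete PMFs.

    Terms with Q(x) = 0 contribute 0; assumes P(x) > 0 wherever Q(x) > 0."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

q = np.array([0.5, 0.4, 0.1])      # arbitrary example distributions
p = np.array([0.3, 0.3, 0.4])

print(kl_divergence(q, p))         # nonnegative
print(kl_divergence(p, q))         # generally different: the KL divergence is not symmetric
print(kl_divergence(q, q))         # zero when the two distributions are identical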
3.10 Common Probability Distributions
Several simple probability distributions are useful in many contexts in machine
learning.
3.10.1 Bernoulli Distribution
The Bernoulli distribution is a distribution over a single binary random variable.
It is controlled by a single parameter φ ∈ [0, 1], which gives the probability of the
random variable being equal to 1. It has the following properties:
$$P(\mathrm{x} = 1) = \phi$$
$$P(\mathrm{x} = 0) = 1 - \phi$$
$$P(\mathrm{x} = x) = \phi^x (1 - \phi)^{1 - x}$$
$$E_{\mathrm{x}}[\mathrm{x}] = \phi$$
$$\mathrm{Var}_{\mathrm{x}}(\mathrm{x}) = \phi(1 - \phi)$$
$$H(\mathrm{x}) = (\phi - 1)\log(1 - \phi) - \phi \log \phi.$$
3.10.2 Multinoulli Distribution
The multinoulli or categorical distribution is a distribution over a single discrete
variable with k different states, where k is finite.² The multinoulli distribution
² “Multinoulli” is a recently coined term. The multinoulli distribution is a special case of
the multinomial distribution. A multinomial distribution is the distribution over vectors in
$\{0, \ldots, n\}^k$ representing how many times each of the k categories is visited when n samples
are drawn from a multinoulli distribution. Many texts use the term “multinomial” to refer to
multinoulli distributions without clarifying that they refer only to the n = 1 case.
is parametrized by a vector $\mathbf{p} \in [0, 1]^{k-1}$, where $p_i$ gives the probability of the
i-th state. The final, k-th state's probability is given by $1 - \mathbf{1}^\top \mathbf{p}$. Note that
we must constrain $\mathbf{1}^\top \mathbf{p} \leq 1$. Multinoulli distributions are often used to refer to
distributions over categories of objects, so we do not usually assume that state 1
has numerical value 1, etc. For this reason, we do not usually need to compute
the expectation or variance of multinoulli-distributed random variables.
The Bernoulli and multinoulli distributions are sufficient to describe any distri-
bution over their domain. This is because they model discrete variables for which
it is feasible to simply enumerate all of the states. When dealing with continuous
variables, there are uncountably many states, so any distribution described by a
small number of parameters must impose strict limits on the distribution.
3.10.3 Gaussian Distribution
The most commonly used distribution over real numbers is the normal distribu-
tion, also known as the Gaussian distribution:
$$\mathcal{N}(x \mid \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}}\, \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right).$$
See Fig. 3.2 for a schematic.
The two parameters $\mu \in \mathbb{R}$ and $\sigma \in \mathbb{R}_{+}$ control the normal distribution. The parameter µ
gives the coordinate of the central peak. This is also the mean of the distribution,
i.e. E[x] = µ. The standard deviation of the distribution is given by σ, i.e.
Var(x) = $\sigma^2$.
Note that when we evaluate the PDF, we need to square and invert σ. When
we need to frequently evaluate the PDF with different parameter values, a more
efficient way of parametrizing the distribution is to use a parameter $\beta \in \mathbb{R}_{+}$ to
control the precision, or inverse variance, of the distribution:
$$\mathcal{N}(x \mid \mu, \beta^{-1}) = \sqrt{\frac{\beta}{2\pi}}\, \exp\left(-\frac{1}{2}\beta(x - \mu)^2\right).$$
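The following sketch (the parameter values are arbitrary) evaluates the univariate normal density in both forms and confirms that the variance parametrization with σ² and the precision parametrization with β = 1/σ² give the same values.

import numpy as np

def normal_pdf_variance(x, mu, sigma2):
    """N(x | mu, sigma^2), parametrized by the variance sigma^2."""
    return np.sqrt(1.0 / (2.0 * np.pi * sigma2)) * np.exp(-0.5 * (x - mu) ** 2 / sigma2)

def normal_pdf_precision(x, mu, beta):
    """N(x | mu, beta^{-1}), parametrized by the precision beta = 1 / sigma^2."""
    return np.sqrt(beta / (2.0 * np.pi)) * np.exp(-0.5 * beta * (x - mu) ** 2)

mu, sigma2 = 1.5, 4.0              # arbitrary parameter values
x = np.linspace(-10.0, 10.0, 5)

print(normal_pdf_variance(x, mu, sigma2))
print(normal_pdf_precision(x, mu, 1.0 / sigma2))   # identical values; this form avoids repeatedly inverting sigma^2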
Normal distributions are a sensible choice for many applications. In the ab-
sence of prior knowledge about what form a distribution over the real numbers
should take, the normal distribution is a good default choice for two major rea-
sons.
First, many distributions we wish to model are truly close to being normal
distributions. The central limit theorem shows that the sum of many indepen-
dent random variables is approximately normally distributed. This means that
in practice, many complicated systems can be modeled successfully as normally
distributed noise, even if the system can be decomposed into parts with more
structured behavior.
Figure 3.2: The normal distribution. The normal distribution $\mathcal{N}(x \mid \mu, \sigma^2)$ exhibits a
classic “bell curve” shape, with the x coordinate of its central peak given by µ, and
the width of its peak controlled by σ. In this example, we depict the standard normal
distribution, with µ = 0 and σ = 1.
Second, the normal distribution in some sense makes the fewest assumptions
of any distribution over the reals, so choosing to use it inserts the least amount of
prior knowledge into a model. Out of all distributions with the same variance, the
normal distribution has the highest entropy. It is not possible to place a uniform
distribution on all of R. The closest we can come to doing so is to use a normal
distribution with high variance.
The normal distribution generalizes to $\mathbb{R}^n$, in which case it is known as the
multivariate normal distribution. It may be parametrized with a positive definite
symmetric matrix Σ:
$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sqrt{\frac{1}{(2\pi)^n \det(\boldsymbol{\Sigma})}}\, \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right).$$
The parameter µ still gives the mean of the distribution, though now it is
vector-valued. The parameter Σ gives the covariance matrix of the distribution.
As in the univariate case, the covariance is not necessarily the most computation-
ally efficient way to parametrize the distribution, since we need to invert Σ to
evaluate the PDF. We can instead use a precision matrix β:
$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\beta}^{-1}) = \sqrt{\frac{\det(\boldsymbol{\beta})}{(2\pi)^n}}\, \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\beta} (\mathbf{x} - \boldsymbol{\mu})\right).$$
3.10.4 Dirac Distribution
In some cases, we wish to specify that all of the mass in a probability distribution
clusters around a single point. This can be accomplished by defining a PDF using
the Dirac delta function, δ(x):
$$p(x) = \delta(x - \mu).$$
The Dirac delta function is defined such that it is zero-valued everywhere but 0,
yet integrates to 1. By defining p(x) to be δ shifted by µ we obtain an infinitely
narrow and infinitely high peak of probability mass where x = µ.
A common use of the Dirac delta distribution is as a component of the so-called
empirical distribution,
$$\hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(x - x_i), \tag{3.4}$$
which puts probability mass $\frac{1}{n}$ on each of the n points $x_1, \ldots, x_n$ forming a given
data set or collection of samples. The Dirac delta distribution is only necessary to
data set or collection of samples. The Dirac delta distribution is only necessary to
define the empirical distribution over continuous variables. For discrete variables,
the situation is simpler: an empirical distribution can be conceptualized as a
multinoulli distribution, with a probability associated to each possible input value
that is simply equal to the empirical frequency of that value in the training set.
We can view the empirical distribution formed from a dataset of training ex-
amples as specifying the distribution that we sample from when we train a model
on this dataset. Another important perspective on the empirical distribution is
that it is the probability density that maximizes the likelihood of the training data
(see Section 5.7). Many machine learning algorithms can be configured to have
arbitrarily high capacity. If given enough capacity, these algorithms will simply
learn the empirical distribution. This is a bad outcome because the model does
not generalize at all and assigns infinitesimal probability to any point in space
that did not occur in the training set. A central problem in machine learning is
studying how to limit the capacity of a model in a way that prevents it from sim-
ply learning the empirical distribution while also allowing it to learn complicated
functions.
The empirical distribution is a particular form of mixture, discussed next.
3.10.5 Mixtures of Distributions and Gaussian Mixture
It is also common to define probability distributions by composing other sim-
pler probability distributions. One common way of combining distributions is to
construct a mixture distribution. A mixture distribution is made up of several
component distributions. On each trial, the choice of which component distribu-
tion generates the sample is determined by sampling a component identity from
a multinoulli distribution:
$$P(\mathrm{x}) = \sum_i P(\mathrm{c} = i)\, P(\mathrm{x} \mid \mathrm{c} = i),$$
where P (c) is the multinoulli distribution over component identities. In chap-
ter 14, we explore the art of building complex probability distributions from sim-
ple ones in more detail. Note that we can think of the variable c as a non-observed
(or latent) random variable that is related to x through their joint distribution
P (x, c) = P (x|c)P (c). Latent variables are discussed further in Section 14.4.2.
A very powerful and common type of mixture model is the Gaussian mixture
model, in which the components P(x | c = i) are Gaussians, each with its mean $\mu_i$
and covariance $\Sigma_i$. Some mixtures can have more constraints; for example, the
covariances could be shared across components, i.e., $\Sigma_i = \Sigma_j = \Sigma$, or the covariance
matrices could be constrained to be diagonal or simply equal to a scalar times the
identity. A Gaussian mixture model is a universal approximator of densities, in
the sense that any smooth density can be approximated to a particular precision
by a Gaussian mixture model with enough components. Gaussian mixture models
have been used in many settings, and are particularly well known for their use as
acoustic models in speech recognition (Bahl et al., 1987).
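As a small illustration of this two-step sampling process (all of the mixture weights, means, and standard deviations below are arbitrary made-up values), the following Python sketch first draws the component identity c from a multinoulli distribution and then draws x from the corresponding Gaussian component.

import numpy as np

rng = np.random.default_rng(0)

weights = np.array([0.2, 0.5, 0.3])      # multinoulli P(c = i) over component identities
means = np.array([-4.0, 0.0, 3.0])       # mean of each Gaussian component
stds = np.array([0.5, 1.0, 2.0])         # standard deviation of each component

n = 100_000
c = rng.choice(len(weights), size=n, p=weights)   # sample the (latent) component identity
x = rng.normal(loc=means[c], scale=stds[c])       # then sample x from the chosen component

# The samples reflect both the component frequencies and the component shapes.
print(np.bincount(c) / n)                 # close to the mixture weights
print(x.mean(), (weights * means).sum())  # sample mean vs. the mixture mean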
3.11 Useful Properties of Common Functions
Certain functions arise often while working with probability distributions, espe-
cially the probability distributions used in deep learning models.
One of these functions is the logistic sigmoid:
$$\sigma(x) = \frac{1}{1 + \exp(-x)}.$$
The logistic sigmoid is commonly used to produce the φ parameter of a Bernoulli
distribution because its range is (0, 1), which lies within the valid range of values
for the φ parameter. See Fig. 3.3 for a graph of the sigmoid function.
Figure 3.3: The logistic sigmoid function.
Another commonly encountered function is the softplus function (Dugas et al.,
2001):
ζ(x) = log (1 + exp(x)) .
The softplus function can be useful for producing the β or σ parameter of a normal
distribution because its range is $\mathbb{R}_{+}$. It also arises commonly when manipulating
expressions involving sigmoids, as it is the primitive of the sigmoid, i.e., the
integral from −∞ to x of the sigmoid. The name of the softplus function comes
from the fact that it is a smoothed or “softened” version of
$$x^{+} = \max(0, x).$$
Figure 3.4: The softplus function.
See Fig. 3.4 for a graph of the softplus function.
The following properties are all useful enough that you may wish to memorize
them:
$$\sigma(x) = \frac{\exp(x)}{\exp(x) + \exp(0)}$$
$$\frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x))$$
$$1 - \sigma(x) = \sigma(-x)$$
$$\log \sigma(x) = -\zeta(-x)$$
$$\frac{d}{dx}\zeta(x) = \sigma(x)$$
$$\forall x \in (0, 1),\ \sigma^{-1}(x) = \log\left(\frac{x}{1 - x}\right)$$
$$\forall x > 0,\ \zeta^{-1}(x) = \log\left(\exp(x) - 1\right)$$
$$\zeta(x) - \zeta(-x) = x$$
The function $\sigma^{-1}(x)$ is called the logit in statistics, but this term is more rarely
used in machine learning. The final property provides extra justification for the
name “softplus”, since $x^{+} - x^{-} = x$, where $x^{-} = \max(0, -x)$.
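These identities are easy to confirm numerically. The sketch below implements σ and ζ in a numerically stable way and spot-checks several of the properties; the test points are arbitrary.

import numpy as np

def sigmoid(x):
    """Logistic sigmoid: sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    """Softplus: zeta(x) = log(1 + exp(x)), computed stably as log(exp(0) + exp(x))."""
    return np.logaddexp(0.0, x)

x = np.linspace(-5.0, 5.0, 11)        # arbitrary test points
p = np.linspace(0.1, 0.9, 9)          # arbitrary points in (0, 1) for the inverse

assert np.allclose(1.0 - sigmoid(x), sigmoid(-x))        # 1 - sigma(x) = sigma(-x)
assert np.allclose(np.log(sigmoid(x)), -softplus(-x))    # log sigma(x) = -zeta(-x)
assert np.allclose(softplus(x) - softplus(-x), x)        # zeta(x) - zeta(-x) = x
assert np.allclose(sigmoid(np.log(p / (1.0 - p))), p)    # sigma(logit(p)) = p
print("all identities hold at the test points")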
3.12 Bayes’ Rule
We often find ourselves in a situation where we know P (y | x) and need to know
P (x | y). Fortunately, if we also know P (x), we can compute the desired quantity
using Bayes’ rule:
$$P(\mathrm{x} \mid \mathrm{y}) = \frac{P(\mathrm{x})\, P(\mathrm{y} \mid \mathrm{x})}{P(\mathrm{y})}.$$
Note that while P(y) appears in the formula, it is usually feasible to compute
$P(\mathrm{y}) = \sum_x P(\mathrm{y} \mid x) P(x)$, so we do not need to begin with knowledge of P(y).
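As a tiny worked example (all numbers are made up for illustration), suppose x indicates whether a patient has a disease, with P(x = 1) = 0.01, and y is a test result with P(y = 1 | x = 1) = 0.9 and P(y = 1 | x = 0) = 0.05. Bayes' rule, with P(y) computed by the sum above, gives the probability of disease given a positive test.

p_x1 = 0.01                      # prior P(x = 1): having the disease (made-up number)
p_y1_given_x1 = 0.90             # likelihood P(y = 1 | x = 1) (made-up)
p_y1_given_x0 = 0.05             # likelihood P(y = 1 | x = 0) (made-up)

# P(y = 1) = sum_x P(y = 1 | x) P(x)
p_y1 = p_y1_given_x1 * p_x1 + p_y1_given_x0 * (1.0 - p_x1)

# Bayes' rule: P(x = 1 | y = 1) = P(x = 1) P(y = 1 | x = 1) / P(y = 1)
p_x1_given_y1 = p_x1 * p_y1_given_x1 / p_y1
print(p_x1_given_y1)             # roughly 0.15: still fairly unlikely despite the positive test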
Bayes’ rule is straightforward to derive from the definition of conditional prob-
ability, but it is useful to know the name of this formula since many texts refer to
it by name. It is named after the Reverend Thomas Bayes, who first discovered a
special case of the formula. The general version presented here was independently
discovered by Pierre-Simon Laplace.
3.13 Technical Details of Continuous Variables
A proper formal understanding of continuous random variables and probability
density functions requires developing probability theory in terms of a branch of
mathematics known as measure theory. Measure theory is beyond the scope of
this textbook, but we can briefly sketch some of the issues that measure theory
is employed to resolve.
In section 3.3.2, we saw that the probability of x lying in some set S is
given by the integral of p(x) over the set S. Some choices of set S can produce
paradoxes. For example, it is possible to construct two sets $S_1$ and $S_2$ such
that $P(S_1) + P(S_2) > 1$ but $S_1 \cap S_2 = \emptyset$. These sets are generally constructed
making very heavy use of the infinite precision of real numbers, for example by
making fractal-shaped sets or sets that are defined by transforming the set of
rational numbers. (The Banach-Tarski theorem provides a fun example of such
sets.) One of the key contributions of measure theory is to provide a
characterization of the set of sets that we can compute the probability of without
encountering paradoxes. In this book, we only integrate over sets with relatively
simple descriptions, so this aspect of measure theory never becomes a relevant
concern.
For our purposes, measure theory is more useful for describing theorems that
apply to most points in $\mathbb{R}^n$ but do not apply to some corner cases. Measure theory
provides a rigorous way of describing that a set of points is negligibly small. Such
a set is said to have “measure zero”. We do not formally define this concept in this
textbook. However, it is useful to understand the intuition that a set of measure
zero occupies no volume in the space we are measuring. For example, within $\mathbb{R}^2$,
a line has measure zero, while a filled polygon has positive measure. Likewise, an
individual point has measure zero. Any union of countably many sets that each
have measure zero also has measure zero (so the set of all the rational numbers
has measure zero, for instance).
Another useful term from measure theory is “almost everywhere”. A property
that holds almost everywhere holds throughout all of space except for on a set
of measure zero. Because the exceptions occupy a negligible amount of space,
they can be safely ignored for many applications. Some important results in
probability theory hold for all discrete values but only hold “almost everywhere”
for continuous values.
One other detail we must be aware of relates to handling random variables
that are deterministic functions of one another. Suppose we have two random
variables, x and y, such that y = g(x). You might think that $p_y(y) = p_x(g^{-1}(y))$.
This is actually not the case.
Suppose $y = \frac{x}{2}$ and $\mathrm{x} \sim U(0, 1)$. If we use the rule $p_y(y) = p_x(2y)$, then $p_y$
will be 0 everywhere except the interval $[0, \frac{1}{2}]$, and it will be 1 on this interval.
This means
$$\int p_y(y)\,dy = \frac{1}{2},$$
which violates the definition of a probability distribution.
This common mistake is wrong because it fails to account for the distortion
of space introduced by the function g(x). Recall that the probability of x lying
in an infinitesimally small region with volume δx is given by p(x)δx. Since g can
expand or contract space, the infinitesimal volume surrounding x in x space may
have different volume in y space. To correct the problem, we need to preserve
the property
$$|p_y(g(x))\,dy| = |p_x(x)\,dx|.$$
Solving from this, we obtain
$$p_y(y) = p_x(g^{-1}(y)) \left|\frac{\partial x}{\partial y}\right|,$$
or equivalently,
$$p_x(x) = p_y(g(x)) \left|\frac{\partial g(x)}{\partial x}\right|. \tag{3.5}$$
In higher dimensions, the absolute value of the derivative generalizes to the
determinant of the Jacobian matrix, the matrix with $J_{i,j} = \frac{\partial x_i}{\partial y_j}$.
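The correction in equation 3.5 is easy to verify numerically on the example above. In this sketch, y = x/2 with x ∼ U(0, 1), so g⁻¹(y) = 2y and |∂x/∂y| = 2: the corrected density is 2 on [0, 1/2] and integrates to 1, whereas the naive rule integrates to only 1/2.

import numpy as np

def p_x(x):
    """Density of x ~ U(0, 1)."""
    x = np.asarray(x, dtype=float)
    return np.where((x >= 0.0) & (x <= 1.0), 1.0, 0.0)

y = np.linspace(-0.5, 1.5, 200_001)
dy = y[1] - y[0]

naive = p_x(2.0 * y)                 # the incorrect rule p_y(y) = p_x(g^{-1}(y))
corrected = p_x(2.0 * y) * 2.0       # p_y(y) = p_x(g^{-1}(y)) |dx/dy|, with |dx/dy| = 2

print((naive * dy).sum())            # about 0.5: not a valid density
print((corrected * dy).sum())        # about 1.0: properly normalized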
3.14 Example: Naive Bayes
We now know enough probability theory that we can derive some simple machine
learning tools.
The Naive Bayes model is a simple probabilistic model that is often used to
recognize patterns. The model consists of one random variable c representing a
category, and a set of random variables $F = \{\mathrm{f}^{(1)}, \ldots, \mathrm{f}^{(n)}\}$ representing features
of objects in each category. In this example, we’ll use Naive Bayes to diagnose
patients as having the flu or not. The random variable c can thus have two
values: $c_0$ representing the category of patients who do not have the flu, and $c_1$
representing the category of patients who do. Suppose $\mathrm{f}^{(1)}$ is the random variable
representing whether the patient has a sore throat, with $f^{(1)}_0$ representing no sore
throat, and $f^{(1)}_1$ representing a sore throat. Suppose $\mathrm{f}^{(2)} \in \mathbb{R}$ is the patient's
temperature in degrees Celsius.
When using the Naive Bayes model, we assume that all of the features are
independent from each other given the category:
$$P(\mathrm{c}, \mathrm{f}^{(1)}, \ldots, \mathrm{f}^{(n)}) = P(\mathrm{c}) \prod_i P(\mathrm{f}^{(i)} \mid \mathrm{c}).$$
These assumptions are very strong and unlikely to be true in naturally occurring
situations, hence the name “naive”. Surprisingly, Naive Bayes often produces
good predictions in practice (even though the assumptions do not hold precisely),
and is a good baseline model to start with when tackling a new problem.
Beyond these conditional independence assumptions, the Naive Bayes frame-
work does not specify anything about the probability distribution. The specific
choice of distributions is left up to the designer. In our flu example, let’s make
P(c) a Bernoulli distribution, with $P(\mathrm{c} = c_1) = \phi^{(c)}$. We can also make $P(\mathrm{f}^{(1)} \mid \mathrm{c})$
a Bernoulli distribution, with
$$P(\mathrm{f}^{(1)} = f^{(1)}_1 \mid \mathrm{c} = c) = \phi^{(f)}_c.$$
In other words, the Bernoulli parameter changes depending on the value of c.
Finally, we need to choose the distribution over $\mathrm{f}^{(2)}$. Since $\mathrm{f}^{(2)}$ is real-valued, a
normal distribution is a good choice. Because $\mathrm{f}^{(2)}$ is a temperature, there are
hard limits to the values it can take on—it cannot go below 0 K, for example.
Fortunately, these values are so far from the values measured in human patients
that we can safely ignore these hard limits. Values outside the hard limits will
receive extremely low probability under the normal distribution so long as the
mean and variance are set correctly. As with $\mathrm{f}^{(1)}$, we need to use different
parameters for different values of c, to represent that patients with the flu have
different temperatures than patients without it:
$$\mathrm{f}^{(2)} \sim \mathcal{N}(f^{(2)} \mid \mu_c, \sigma^2_c).$$
Now we are ready to determine how likely a patient is to have the flu. To do
this, we want to compute P(c | F), but we know P (c) and P(F | c). This suggests
that we should use Bayes’ rule to determine the desired distribution. The word
“Bayes” in the name “Naive Bayes” comes from this frequent use of Bayes’ rule
in conjunction with the model. We begin by applying Bayes’ rule:
$$P(\mathrm{c} \mid F) = \frac{P(\mathrm{c})\, P(F \mid \mathrm{c})}{P(F)}. \tag{3.6}$$
We do not know P (F). Fortunately, it is easy to compute:
$$P(F) = \sum_{c \in \mathrm{c}} P(\mathrm{c} = c, F) \quad \text{(by the sum rule)}$$
$$= \sum_{c \in \mathrm{c}} P(\mathrm{c} = c)\, P(F \mid \mathrm{c} = c) \quad \text{(by the chain rule)}.$$
Substituting this result back into equation 3.6, we obtain
$$P(\mathrm{c} \mid F) = \frac{P(\mathrm{c})\, P(F \mid \mathrm{c})}{\sum_{c \in \mathrm{c}} P(\mathrm{c} = c)\, P(F \mid \mathrm{c} = c)} = \frac{P(\mathrm{c}) \prod_i P(\mathrm{f}^{(i)} \mid \mathrm{c})}{\sum_{c \in \mathrm{c}} P(\mathrm{c} = c) \prod_i P(\mathrm{f}^{(i)} \mid \mathrm{c} = c)}$$
by the Naive Bayes assumptions. This is as far as we can simplify the expression
for a general Naive Bayes model.
We can simplify the expression further by substituting in the definitions of the
particular probability distributions we have defined for our flu diagnosis example:
$$P(\mathrm{c} = c \mid \mathrm{f}^{(1)} = f_1, \mathrm{f}^{(2)} = f_2) = \frac{g(c)}{\sum_{c' \in \mathrm{c}} g(c')}$$
where
$$g(c) = P(\mathrm{c} = c)\, P(\mathrm{f}^{(1)} = f_1 \mid \mathrm{c} = c)\, P(\mathrm{f}^{(2)} = f_2 \mid \mathrm{c} = c).$$
Since c only has two possible values in our example, we can simplify this to:
$$P(\mathrm{c} = 1 \mid \mathrm{f}^{(1)} = f_1, \mathrm{f}^{(2)} = f_2) = \frac{g(1)}{g(0) + g(1)}$$
$$= \frac{1}{1 + \frac{g(0)}{g(1)}}$$
$$= \frac{1}{1 + \exp\left(\log g(0) - \log g(1)\right)}$$
$$= \sigma\left(\log g(1) - \log g(0)\right). \tag{3.7}$$
To go further, let's simplify log g(i):
$$\log g(i) = \log\left[ \left(\phi^{(c)}\right)^i \left(1 - \phi^{(c)}\right)^{1-i} \left(\phi^{(f)}_i\right)^{f_1} \left(1 - \phi^{(f)}_i\right)^{1-f_1} \sqrt{\frac{1}{2\pi\sigma^2_i}}\, \exp\left(-\frac{1}{2\sigma^2_i}(f_2 - \mu_i)^2\right) \right]$$
$$= i \log \phi^{(c)} + (1 - i) \log\left(1 - \phi^{(c)}\right) + f_1 \log \phi^{(f)}_i + (1 - f_1) \log\left(1 - \phi^{(f)}_i\right) + \frac{1}{2}\log\frac{1}{2\pi\sigma^2_i} - \frac{1}{2\sigma^2_i}(f_2 - \mu_i)^2.$$
Substituting this back into equation 3.7, we obtain
$$P(\mathrm{c} = 1 \mid \mathrm{f}^{(1)} = f_1, \mathrm{f}^{(2)} = f_2) =$$
$$\sigma\Big( \log \phi^{(c)} - \log\left(1 - \phi^{(c)}\right) + f_1 \log \phi^{(f)}_1 + (1 - f_1)\log\left(1 - \phi^{(f)}_1\right)$$
$$\quad - f_1 \log \phi^{(f)}_0 - (1 - f_1)\log\left(1 - \phi^{(f)}_0\right) - \frac{1}{2}\log 2\pi\sigma^2_1 + \frac{1}{2}\log 2\pi\sigma^2_0$$
$$\quad - \frac{1}{2\sigma^2_1}(f_2 - \mu_1)^2 + \frac{1}{2\sigma^2_0}(f_2 - \mu_0)^2 \Big).$$
From this formula, we can read off various intuitive properties of the Naive
Bayes classifier’s behavior on this example problem, regarding the inference that
can be drawn from a trained model. The probability of the patient having the flu
grows like a sigmoidal curve. We move farther to the left as $f_2$, the patient's
temperature, moves farther away from $\mu_1$, the average temperature of a flu patient.
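To make the whole pipeline concrete, the sketch below (every parameter value is invented purely for illustration; in practice they would be estimated from data) computes log g(0), log g(1), and P(c = 1 | f1, f2) exactly as in the derivation above, working in the log domain for numerical stability.

import numpy as np

# Invented parameters for illustration only.
phi_c = 0.3                          # P(c = 1): prior probability of having the flu
phi_f = {0: 0.1, 1: 0.8}             # phi^(f)_i = P(sore throat | c = i)
mu = {0: 36.8, 1: 38.5}              # mean temperature (Celsius) for each class
sigma2 = {0: 0.25, 1: 0.64}          # temperature variance for each class

def log_g(i, f1, f2):
    """log g(i) for class i, sore-throat indicator f1 in {0, 1}, and temperature f2."""
    log_prior = i * np.log(phi_c) + (1 - i) * np.log(1.0 - phi_c)
    log_throat = f1 * np.log(phi_f[i]) + (1 - f1) * np.log(1.0 - phi_f[i])
    log_temp = 0.5 * np.log(1.0 / (2.0 * np.pi * sigma2[i])) \
               - (f2 - mu[i]) ** 2 / (2.0 * sigma2[i])
    return log_prior + log_throat + log_temp

def p_flu(f1, f2):
    """P(c = 1 | f1, f2) = sigma(log g(1) - log g(0)), as in equation 3.7."""
    return 1.0 / (1.0 + np.exp(-(log_g(1, f1, f2) - log_g(0, f1, f2))))

print(p_flu(f1=1, f2=39.0))   # sore throat and a high fever: probability near 1
print(p_flu(f1=0, f2=36.7))   # no sore throat, normal temperature: probability near 0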