Chapter 3
Probability and Information Theory
In this chapter, we describe probability theory. Probability theory is a mathematical
framework for representing uncertain statements. It provides a means of quantifying
uncertainty and axioms for deriving new uncertain statements. In artificial intelligence
applications, we use probability theory in two major ways. First, the laws of probabil-
ity tell us how AI systems should reason, so we design our algorithms to compute or
approximate various expressions derived using probability theory. Second, we can use
probability and statistics to theoretically analyze the behavior of proposed AI systems.
Probability theory is a fundamental tool of many disciplines of science and engi-
neering. We provide this chapter to ensure that readers whose background is primarily
in software engineering with limited exposure to probability theory can understand the
material in this book. If you are already familiar with probability theory, feel free to skip
this chapter. If you have absolutely no prior experience with probability, this chapter
should be sufficient to successfully carry out deep learning research projects, but we do
suggest that you consult an additional resource, such as (Jaynes, 2003).
3.1 Why probability?
Many branches of computer science deal mostly with entities that are entirely determin-
istic and certain. A programmer can usually safely assume that a CPU will execute each
machine instruction flawlessly. Errors in hardware do occur, but are rare enough that
most software applications do not need to be designed to account for them. Given that
many computer scientists and software engineers work in a relatively clean and certain
environment, it can be surprising that machine learning makes heavy use of probability
theory.
This is because machine learning must always deal with uncertain quantities, and
sometimes may also need to deal with stochastic quantities. Uncertainty and stochas-
ticity can arise from many sources. Researchers have made compelling arguments for
quantifying uncertainty using probability since at least the 1980s. Many of the argu-
ments presented here are summarized from or inspired by (Pearl, 1988).
Nearly all activities require some ability to reason in the presence of uncertainty. In
fact, beyond mathematical statements that are true by definition, it is difficult to think
of any proposition that is absolutely true or any event that is absolutely guaranteed to
occur.
One source of uncertainty is incomplete observability. When we cannot observe
something, we are uncertain about its true nature. In machine learning, it is often the
case that we can observe a large amount of data, but there is not a data instance for
every situation we care about. We are also generally not able to observe directly what
process generates the data. Since we are uncertain about what process generates the
data, we are also uncertain about what happens in the situations for which we have not
observed data points. Lack of observability can also give rise to apparent stochasticity.
Deterministic systems can appear stochastic when we cannot observe all of the variables
that drive the behavior of the system. For example, consider a game of Russian roulette.
The outcome is deterministic if you know which chamber of the revolver is loaded. If
you do not know this important information, then it is a game of chance. In many
cases, we are able to observe some quantity, but our measurement is itself uncertain.
For example, laser range finders may have several centimeters of random error.
Uncertainty can also arise from the simplifications we make in order to model real-
world processes. For example, if we discretize space, then we immediately become
uncertain about the precise position of objects: each object could be anywhere within
the discrete cell that we know it occupies.
Conceivably, the universe itself could have stochastic dynamics, but we make no
claim on this subject.
In many cases, it is more practical to use a simple but uncertain rule rather than a
complex but certain one, even if our modeling system has the fidelity to accommodate
a complex rule. For example, the simple rule “Most birds fly” is cheap to develop and
is broadly useful, while a rule of the form, “Birds fly, except for very young birds that
have not yet learned to fly, sick or injured birds that have lost the ability to fly, flightless
species of birds including the cassowary, ostrich, and kiwi. . . ” is expensive to develop,
maintain, and communicate, and after all of this effort is still very brittle and prone to
failure.
Given that we need a means of representing and reasoning about uncertainty, it is
not immediately obvious that probability theory can provide all of the tools we want
for artificial intelligence applications. Probability theory was originally developed to
analyze the frequencies of events. It is easy to see how probability theory can be used
to study events like drawing a certain hand of cards in a game of poker. These kinds
of events are often repeatable, and when we say that an outcome has a probability p
of occurring, it means that if we repeated the experiment (e.g., draw a hand of cards)
infinitely many times, then proportion p of the repetitions would result in that outcome.
This kind of reasoning does not seem immediately applicable to propositions that are
not repeatable. If a doctor analyzes a patient and says that the patient has a 40% chance
of having the flu, this means something very different—we can not make infinitely many
replicas of the patient, nor is there any reason to believe that different replicas of the
patient would present with the same symptoms yet have varying underlying conditions.
In the case of the doctor diagnosing the patient, we use probability to represent a degree
of belief, with 1 indicating absolute certainty that the patient has the flu, and 0 indicating
absolute certainty that the patient does not. The
former kind of probability, related directly to the rates at which events occur, is known
as frequentist probability, while the latter, related to qualitative levels of certainty, is
known as Bayesian probability.
It turns out that if we list several properties that we expect common sense reasoning
about uncertainty to have, then the only way to satisfy those properties is to treat
Bayesian probabilities as behaving exactly the same as frequentist probabilities. For
example, if we want to compute the probability that a player will win a poker game
given that she has a certain set of cards, we use exactly the same formulas as when we
compute the probability that a patient has a disease given that she has certain symptoms.
For more details about why a small set of common sense assumptions implies that the
same axioms must control both kinds of probability, see (Ramsey, 1926).
Probability can be seen as the extension of logic to deal with uncertainty. Logic
provides a set of formal rules for determining what propositions are implied to be true or
false given the assumption that some other set of propositions is true or false. Probability
theory provides a set of formal rules for determining the likelihood of a proposition being
true given the likelihood of other propositions.
3.2 Random variables
A random variable is a variable that can take on different values randomly. We typically
denote the random variable itself with an uppercase script letter, and the values it
can take on with lower case script letters. For example, x_1 and x_2 are both possible
values that the random variable X can take on. On its own, a random variable is just
a description of the states that are possible; it must be coupled with a probability
distribution that specifies how likely each of these states are.
Random variables may be discrete or continuous. A discrete random variable is one
that has a finite or countably infinite number of states. Note that these states are not
necessarily the integers; they can also just be named states that are not considered to
have any numerical value. A continuous random variable is associated with a real value.
3.3 Probability distributions
A probability distribution is a description of how likely a random variable or set of random
variables is to take on each of its possible states. The way we describe probability
distributions depends on whether the variables are discrete or continuous.
3.3.1 Discrete variables and probability mass functions
A probability distribution over discrete variables may be described using a probability
mass function (PMF). We typically denote probability mass functions with a capital P .
Often we associate each random variable with a different probability mass function and
the reader must infer which probability mass function to use based on the identity of the
random variable, rather than the name of the function; P(X) is usually not the same
as P (Y ).
The probability mass function maps from a state of a random variable to the prob-
ability of that random variable taking on that state. P (x) denotes the probability that
X = x, with a probability of 1 indicating that X = x is certain and a probability of
0 indicating that X = x is impossible. Sometimes to disambiguate which PMF to use,
we write the name of the random variable explicitly: P (X = x). Sometimes we de-
fine a variable first, then use notation to specify which distribution it follows later:
x ∼ P(X).
Probability mass functions can act on many variables at the same time. Such a
probability distribution over many variables is known as a joint probability distribution.
P (X = x, Y = y) denotes the probability that X = x and Y = y simultaneously. We
may also write P(x, y) for brevity.
To be a probability mass function on a set of random variables X, a function f must
meet the following properties:

• The domain of f must be the set of all possible states of X.
• The range of f must be a subset of the real interval [0, 1]. (No state can be more
likely than a guaranteed event or less likely than an impossible event.)
• ∑_{x∈X} f(x) = 1. (f must guarantee that some state occurs.)
For example, consider a single discrete random variable X with k different states.
We can place a uniform distribution on X, that is, make each of its states equally
likely, by setting its probability mass function to

P(X = x_i) = 1/k

for all i. We can see that this fits the requirements for a probability mass function. The
value 1/k is positive because k is a positive integer. We also see that ∑_i P(X = x_i) =
∑_i 1/k = k/k = 1, so the distribution is properly normalized.
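To make the arithmetic concrete, here is a minimal Python sketch (assuming NumPy is available; the state names are arbitrary placeholders) that builds such a uniform PMF and checks the requirements above.

import numpy as np

# Uniform PMF over k named states; the state labels are arbitrary.
states = ["red", "green", "blue", "yellow"]
k = len(states)
pmf = {s: 1.0 / k for s in states}

# Every probability lies in [0, 1] and the probabilities sum to 1.
assert all(0.0 <= p <= 1.0 for p in pmf.values())
assert np.isclose(sum(pmf.values()), 1.0)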
3.3.2 Continuous variables and probability density functions
When working with continuous random variables, we describe probability distributions
using a probability density function (PDF) rather than a probability mass function. A
probability density function must satisfy the following properties:
• It must map from the domain of the random variable whose distribution it describes
to the real numbers.
• ∀x, p(x) ≥ 0. Note that we do not require p(x) ≤ 1.
• ∫ p(x) dx = 1.
A probability density function does not give the probability of a specific state directly;
instead, the probability of landing inside an infinitesimal region with volume δx is given
by p(x)δx.
We can integrate the density function to find the actual probability mass of a set of
points. Specifically, the probability that x lies in some set S is given by the integral of
p(x) over that set. In the univariate example, the probability that x lies in the interval
[a, b] is given by ∫_a^b p(x) dx.
For an example of a probability density function over a continuous random variable,
consider a uniform distribution on an interval of the real numbers. We can do this with
a function u(x; a, b), where a and b are the endpoints of the interval, with b > a. (The
“;” notation means “parameterized by”; we consider x to be the argument of the function,
while a and b are parameters that define the function.) To ensure that there is no
probability mass outside the interval, we say u(x; a, b) = 0 for all x ∉ [a, b]. Within [a, b],
u(x; a, b) = 1/(b − a). We can see that this is nonnegative everywhere. Additionally, it
integrates to 1. We often denote that x follows the uniform distribution on [a, b] by
writing x ∼ U(a, b).
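A small Python sketch of this density, assuming NumPy is available and with arbitrarily chosen endpoints, checks nonnegativity and approximate normalization with a simple Riemann sum.

import numpy as np

def uniform_pdf(x, a, b):
    """Density u(x; a, b): 1/(b - a) inside [a, b], 0 outside."""
    x = np.asarray(x, dtype=float)
    return np.where((x >= a) & (x <= b), 1.0 / (b - a), 0.0)

# Riemann-sum check that the density integrates to (approximately) 1.
a, b = 2.0, 5.0
grid = np.linspace(a - 1, b + 1, 100_001)
dx = grid[1] - grid[0]
print(np.sum(uniform_pdf(grid, a, b)) * dx)  # ~1.0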
3.4 Marginal probability
Sometimes we know the probability distribution over a set of variables, and we want to
know the probability distribution over just a subset of them. The probability distribution
over the subset is known as the marginal probability.
For example, suppose we have discrete random variables X and Y , and we know
P (X, Y ). We can find P (X) with the sum rule:
∀x ∈ X, P(X = x) = ∑_y P(X = x, Y = y).
The name “marginal probability” comes from the process of computing marginal
probabilities on paper. When the values of P(X, Y ) are written in a grid with different
values of x in rows and different values of y in columns, it is natural to sum across a
row of the grid, then write P (x) in the margin of the paper just to the right of the row.
For continuous variables, we need to use integration instead of summation:
p(x) = ∫ p(x, y) dy.
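As an illustration of the sum rule, the following hypothetical Python example (the joint table values are made up) marginalizes a discrete joint distribution stored as a NumPy array.

import numpy as np

# Joint distribution P(X, Y) as a table: rows index x, columns index y.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.05, 0.25, 0.30]])
assert np.isclose(joint.sum(), 1.0)

# Sum rule: marginalize out Y by summing across each row, and X by summing columns.
p_x = joint.sum(axis=1)   # P(X = x)
p_y = joint.sum(axis=0)   # P(Y = y)
print(p_x, p_y)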
3.5 Conditional probability
In many cases, we are interested in the probability of some event, given that some other
event has happened. This is called a conditional probability. We denote the conditional
probability that Y = y given X = x as P (Y = y | X = x). This conditional probability
can be computed with the formula
P (Y = y | X = x) = P (Y = y, X = x)/P (X = x).
Note that this is only defined when P(X = x) > 0. We cannot compute the conditional
probability conditioned on an event that never happens.
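Continuing the same made-up joint table, a short sketch of the conditional probability formula might look as follows; note that each conditional distribution over Y sums to 1.

import numpy as np

joint = np.array([[0.10, 0.20, 0.10],    # P(X = x, Y = y), rows are x
                  [0.05, 0.25, 0.30]])
p_x = joint.sum(axis=1)

# P(Y = y | X = x) = P(Y = y, X = x) / P(X = x), defined only when P(X = x) > 0.
cond_y_given_x = joint / p_x[:, np.newaxis]
print(cond_y_given_x.sum(axis=1))  # each row sums to 1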
It is important not to confuse conditional probability with computing what would
happen if some action were undertaken. The conditional probability that a person is
from Germany given that they speak German is quite high, but if a randomly selected
person is taught to speak German, their country of origin does not change. Computing
the consequences of an action is called making an intervention query. Intervention
queries are the domain of causal modeling, which we do not explore in this book.
3.6 The chain rule
Any joint probability distribution over many random variables may be decomposed into
conditional distributions over only one variable:
P(X^{(1)}, . . . , X^{(n)}) = P(X^{(1)}) ∏_{i=2}^{n} P(X^{(i)} | X^{(1)}, . . . , X^{(i−1)}).
This observation is known as the chain rule or product rule of probability. It follows
immediately from the definition of conditional probability. For example, applying the
definition twice, we get
P (A, B, C) = P (A|B, C)P (B, C)
P (B, C) = P (B|C)P(C)
P (A, B, C) = P (A|B, C)P (B|C)P (C).
Note how every statement about probabilities remains true if we add the same conditioning
variables (on the right-hand side of the vertical bar) consistently to every “P” in the statement.
We can use this to derive the same thing differently:
P (A, B|C) = P (A|B, C)P (B|C)
P (A, B, C) = P (A, B|C)P (C) = P(A|B, C)P (B|C)P (C).
3.7 Independence and conditional independence
Two random variables X and Y are independent if their probability distribution can be
expressed as a product of two factors, one involving only X and one involving only Y :
∀x ∈ X, ∀y ∈ Y, p(X = x, Y = y) = p(X = x) p(Y = y).
Two random variables X and Y are conditionally independent given a random vari-
able Z if the conditional probability distribution over X and Y factorizes in this way
for every value of Z:
∀x ∈ X, ∀y ∈ Y, ∀z ∈ Z, p(X = x, Y = y | Z = z) = p(X = x | Z = z) p(Y = y | Z = z).
3.8 Expectation, variance, and covariance
The expectation or expected value of some function f(x) with respect to a probability
distribution P (X) is the average or mean value that f takes on when x is drawn from
P . For discrete variables this can be computed with a summation:
E_{x∼P}[f(x)] = ∑_x P(x) f(x),
while for continuous variables, it is computed with an integral:
E_{x∼p}[f(x)] = ∫ p(x) f(x) dx.
When the identity of the distribution is clear from the context, we may simply write the
name of the random variable that the expectation is over, e.g. E_X[f(x)]. If it is clear
which random variable the expectation is over, we may omit the subscript entirely, e.g.
E[f(x)].
Expectations are linear, for example, E[αf(x) + βg(x)] = αE[f(x)] + βE[g(x)].
The variance gives a measure of how much the different values of a function are
spread apart:
Var(f(x)) = E[(f(x) − E[f(x)])²].
When the variance is low, the values of f(x) cluster near their expected value. The
square root of the variance is known as the standard deviation.
The covariance gives some sense of how much two values are linearly related to each
other, as well as the scale of these variables:
Cov(f(x), g(y)) = E[(f(x) − E[f(x)]) (g(y) − E[g(y)])].
High absolute values of the covariance mean that the values change a lot and are both
far from their respective means at the same time. If the sign of the covariance is
positive, then the values tend to change in the same direction, while if it is negative,
they tend to change in opposite directions. Other measures such as correlation normalize
the contribution of each variable in order to measure only how much the variables are
related, rather than also being affected by the scale of the separate variables.
The notions of covariance and dependence are related, but are in
fact distinct concepts. Two random variables have non-zero covariance only if they
are dependent. However, they may have zero covariance without being independent.
For example, suppose we first generate x, then generate s ∈ {−1, 1} with each state
having probability 0.5, then generate y as E[x] + s(x − E[x]). Clearly, x and y are not
independent, because y only has two possible values given x. However, Cov(x, y) = 0.
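This construction is easy to check empirically. The sketch below (assuming NumPy; the choice of a uniform distribution for x is arbitrary) draws samples and confirms that the sample covariance is close to zero even though y is a deterministic function of x and s.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# x is arbitrary here; a uniform distribution on [0, 1] works for the illustration.
x = rng.uniform(0.0, 1.0, size=n)
s = rng.choice([-1.0, 1.0], size=n)     # each sign with probability 0.5
y = x.mean() + s * (x - x.mean())       # y = E[x] + s (x - E[x]), with E[x] estimated

# Sample covariance is (approximately) zero, yet y is clearly dependent on x.
print(np.cov(x, y)[0, 1])               # close to 0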
The covariance matrix of a random vector x ∈ R^n is an n × n matrix, such that

Cov(x)_{i,j} = Cov(x_i, x_j).

Note that the diagonal elements give Cov(x_i, x_i) = Var(x_i).
3.9 Information theory
Information theory is a branch of applied mathematics that revolves around quantifying
how much information is present in a signal. This field is fundamental to many areas
of electrical engineering and computer science. In this textbook, we mostly use a few
key ideas from information theory to characterize probability distributions or quantify
similarity between probability distributions. For more detail on information theory,
see (Cover, 2006).
The basic intuition behind information theory is that learning that an unlikely event
has occurred is more informative than learning that a likely event has occurred. A
message saying “the sun rose this morning” is so uninformative as to be unnecessary to
send, but a message saying “there was a solar eclipse this morning” is very informative.
We would like to quantify information in a way that formalizes this intuition. Specif-
ically,
• Likely events should have low information content.
• Unlikely events should have high information content.
• Independent events should have additive information content. For example, finding
out that a tossed coin has come up heads twice should convey twice as much
information as finding out that it has come up heads once.
In order to satisfy all three of these properties, we define the self-information of an
event X = x to be I(x) = −log P(x). In this book, we always use log to mean the
natural logarithm, with base e. Our definition of I(x) is therefore written in units of
nats. One nat is the amount of information gained by observing an event of probability
1/e. Other texts use base-2 logarithms and units called bits or shannons; information
measured in bits is just a rescaling of information measured in nats.
Self-information deals only with a single outcome. We can quantify the amount of
uncertainty in an entire probability distribution using the Shannon entropy[1]:

H(X) = E_{x∼P}[I(x)].
In other words, the Shannon entropy of a distribution is the expected amount of infor-
mation in an event drawn from that distribution. Distributions that are nearly deter-
ministic have low entropy; distributions that are closer to uniform have high entropy.
See Fig. 3.1 for a demonstration.
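A minimal sketch of these two quantities in Python (in nats, assuming NumPy) might look as follows; the input distributions are arbitrary examples.

import numpy as np

def self_information(p):
    """I(x) = -log P(x), in nats."""
    return -np.log(p)

def entropy(pmf):
    """Shannon entropy H = E[I(x)] of a discrete distribution, in nats."""
    pmf = np.asarray(pmf, dtype=float)
    nonzero = pmf > 0                      # convention: 0 log 0 = 0
    return -np.sum(pmf[nonzero] * np.log(pmf[nonzero]))

print(self_information(1.0))               # certain event: 0 nats
print(entropy([0.5, 0.5]))                 # maximal for a binary variable: log 2
print(entropy([0.99, 0.01]))               # nearly deterministic: low entropy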
If we have two separate probability distributions P (X) and Q(X) over the same
random variable X, we can measure how different these two distributions are using the
Kullback-Leibler (KL) divergence:
D_{KL}(P ‖ Q) = E_{x∼P}[log(P(x)/Q(x))].
[1] Shannon entropy is named for Claude Shannon, the father of information theory. For an interesting
biographical account of Shannon and some of his contemporaries, see Fortune’s Formula by William
Poundstone.
Figure 3.1: This plot shows how distributions that are closer to deterministic have
low Shannon entropy while distributions that are close to uniform have high Shannon
entropy. On the horizontal axis, we plot p, the probability of a binary random variable
being equal to 1. When p is near 0, the distribution is nearly deterministic, because
the random variable is nearly always 0. When p is near 1, the distribution is nearly
deterministic, because the random variable is nearly always 1. When p = 0.5, the
entropy is maximal, because the distribution is uniform over the two outcomes.
The KL divergence has many useful properties, most notably that it is non-negative.
The KL divergence is 0 if and only if P and Q are the same distribution in the case
of discrete variables, or equal “almost everywhere” in the case of continuous variables
(see section 3.13 for details). Because the KL divergence is non-negative and measures
the difference between two distributions, it is often conceptualized as measuring some
sort of distance between these distributions. However, it is not a true distance measure
because it is not symmetric: D_{KL}(P ‖ Q) ≠ D_{KL}(Q ‖ P) for some P and Q.
When computing many of these quantities, it is common to encounter expressions
of the form 0 log 0. By convention, in the context of information theory, we treat these
expressions as lim_{x→0} x log x = 0.
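A small Python sketch of the KL divergence between two discrete distributions, using this 0 log 0 convention and arbitrary example distributions, also makes the asymmetry easy to see.

import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) (log P(x) - log Q(x)), in nats.

    Terms with P(x) = 0 contribute 0, following the 0 log 0 convention."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask])))

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])
print(kl_divergence(p, q), kl_divergence(q, p))   # non-negative and not symmetric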
Many information theoretic concepts can be interpreted in terms of the number of
bits required to send certain messages. While these interpretations are valuable in other
fields, they have not usually been of much practical importance in the context of deep
learning.
3.10 Common probability distributions
Several simple probability distributions are useful in many contexts in machine learning.
3.10.1 Bernoulli Distribution
The Bernoulli distribution is a distribution over a single binary random variable. It is
controlled by a single parameter φ ∈ [0, 1], which gives the probability of the random
variable being equal to 1. It has the following properties:

P(X = 1) = φ
P(X = 0) = 1 − φ
P(X = x) = φ^x (1 − φ)^{1−x}
E_X[X] = φ
Var_X(X) = φ(1 − φ)
H(X) = (φ − 1) log(1 − φ) − φ log φ.
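These properties can be reproduced directly; the sketch below assumes NumPy and uses an arbitrary value of φ.

import numpy as np

phi = 0.3  # parameter of the Bernoulli distribution (arbitrary choice)

def bernoulli_pmf(x, phi):
    """P(X = x) = phi^x (1 - phi)^(1 - x) for x in {0, 1}."""
    return phi ** x * (1 - phi) ** (1 - x)

mean = phi
var = phi * (1 - phi)
ent = (phi - 1) * np.log(1 - phi) - phi * np.log(phi)
print(bernoulli_pmf(1, phi), bernoulli_pmf(0, phi), mean, var, ent)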
3.10.2 Multinoulli Distribution
The Multinoulli distribution is a distribution over a single discrete variable with k different
states, where k is finite.[2] The Multinoulli distribution is parameterized by a vector
p ∈ [0, 1]^{k−1}, where p_i gives the probability of the i-th state. The final, k-th state’s
probability is given by 1 − 1^⊤p. Note that we must constrain 1^⊤p ≤ 1. Multinoulli
distributions are often used to describe distributions over categories of objects, so we do
not usually assume that state 1 has numerical value 1, etc. For this reason, we do not
usually need to compute the expectation or variance of Multinoulli-distributed random
variables.

[2] “Multinoulli” is a recently coined term. The Multinoulli distribution is a special case of the
multinomial distribution. A multinomial distribution is the distribution over vectors in {0, . . . , n}^k
representing how many times each of the k categories is visited when n samples are drawn from a
Multinoulli distribution. Many texts use the term “multinomial” to refer to Multinoulli distributions
without clarifying that they refer only to the n = 1 case.
The Bernoulli and Multinoulli distributions are sufficient to describe any distribution
over their domain. This is because they model discrete variables for which it is feasible
to simply enumerate all of the states. When dealing with continuous variables, there are
uncountably many states, so any distribution described by a small number of parameters
must impose strict limits on the distribution.
3.10.3 Gaussian Distribution
The most commonly used distribution over real numbers is the normal distribution, also
known as the Gaussian distribution:
N(x | µ, σ²) = √(1/(2πσ²)) exp(−(1/(2σ²)) (x − µ)²).
See Fig. 3.2 for a schematic.
The two parameters µ ∈ R and σ ∈ R^+ control the normal distribution. µ gives the
coordinate of the central peak. This is also the mean of the distribution, i.e. E[X] = µ.
The standard deviation of the distribution is given by σ, i.e. Var(X) = σ².
Note that when we evaluate the PDF, we need to square and invert σ. When we
need to frequently evaluate the PDF with different parameter values, a more efficient
way of parameterizing the distribution is to use a parameter β ∈ R^+ to control the
precision or inverse variance of the distribution:

N(x | µ, β^{−1}) = √(β/(2π)) exp(−(1/2) β (x − µ)²).
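The two parameterizations agree, as a quick numerical sketch shows (assuming NumPy; the inputs are arbitrary).

import numpy as np

def normal_pdf(x, mu, sigma2):
    """N(x | mu, sigma^2) with variance sigma2."""
    return np.sqrt(1.0 / (2.0 * np.pi * sigma2)) * np.exp(-(x - mu) ** 2 / (2.0 * sigma2))

def normal_pdf_precision(x, mu, beta):
    """Same density, parameterized by the precision beta = 1 / sigma^2."""
    return np.sqrt(beta / (2.0 * np.pi)) * np.exp(-0.5 * beta * (x - mu) ** 2)

x, mu, sigma2 = 1.3, 0.0, 4.0
print(normal_pdf(x, mu, sigma2), normal_pdf_precision(x, mu, 1.0 / sigma2))  # equal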
Normal distributions are a sensible choice for many applications. In the absence of
prior knowledge about what form a distribution over the real numbers should take, the
normal distribution is a good default choice for two major reasons.
First, many distributions we wish to model are truly close to being normal distri-
butions. The central limit theorem shows that the sum of many independent random
variables is approximately normally distributed. This means that in practice, many
complicated systems can be modeled successfully as normally distributed noise, even if
the system can be decomposed into parts with more structured behavior.
Second, the normal distribution in some sense makes the fewest assumptions of
any distribution over the reals, so choosing to use it inserts the least amount of prior
knowledge into a model. Out of all distributions with the same variance, the normal
distribution has the highest entropy. It is not possible to place a uniform distribution
on all of R. The closest we can come to doing so is to use a normal distribution with
high variance.
Figure 3.2: The normal distribution: The normal distribution N(x | µ, σ²) exhibits a
classic “bell curve” shape, with the x coordinate of its central peak given by µ, and
the width of its peak controlled by σ. In this example, we depict the standard normal
distribution, with µ = 0 and σ = 1.
The normal distribution generalizes to R^n, in which case it is known as the multivariate
normal distribution. It may be parameterized with a positive definite symmetric
matrix Σ:

N(x | µ, Σ) = √(1/((2π)^n det(Σ))) exp(−(1/2) (x − µ)^⊤ Σ^{−1} (x − µ)).
The parameter µ still gives the mean of the distribution, though now it is vector-
valued. The parameter Σ gives the covariance matrix of the distribution. As in the
univariate case, the covariance is not necessarily the most computationally efficient way
to parameterize the distribution, since we need to invert Σ to evaluate the PDF. We
can instead use a precision matrix β:
N(x | µ, β^{−1}) = √(det(β)/(2π)^n) exp(−(1/2) (x − µ)^⊤ β (x − µ)).
3.10.4 Dirac Distribution
In some cases, we wish to specify that all of the mass in a probability distribution
clusters around a single point. This can be accomplished by defining a PDF using the
Dirac delta function, δ(x):
p(x) = δ(x − µ).
The Dirac delta function is defined such that it is zero-valued everywhere but 0, yet
integrates to 1. By defining p(x) to be δ shifted by µ we obtain an infinitely narrow
and infinitely high peak of probability mass where x = µ.
A common use of the Dirac delta distribution is as a component of the so-called
empirical distribution,

p̂(x) = (1/n) ∑_{i=1}^{n} δ(x − x_i),

which puts probability mass 1/n on each of the n points x_1, . . . , x_n forming a given data
set or collection of samples. The Dirac delta distribution is only necessary to define the
set or collection of samples. The Dirac delta distribution is only necessary to define the
empirical distribution over continuous variables. For discrete variables, the situation is
simpler: an empirical distribution can be conceptualized as a Multinoulli distribution,
with a probability associated to each possible input value that is simply equal to the
empirical frequency of that value in the training set.
We can view the empirical distribution formed from a dataset of training examples as
specifying the distribution that we sample from when we train a model on this dataset.
Another important perspective on the empirical distribution is that it is the probability
density that maximizes the likelihood of the training data (see Section 5.6). Many
machine learning algorithms can be configured to have arbitrarily high capacity. If given
enough capacity, these algorithms will simply learn the empirical distribution. This is
a bad outcome because the model does not generalize at all and assigns infinitesimal
probability to any point in space that did not occur in the training set. A central
problem in machine learning is studying how to limit the capacity of a model in a way
that prevents it from simply learning the empirical distribution while also allowing it to
learn complicated functions.
The empirical distribution is a particular form of mixture, discussed next.
3.10.5 Mixtures of Distributions and Gaussian Mixture
It is also common to define probability distributions by composing other simpler prob-
ability distributions. One common way of combining distributions is to construct a
mixture distribution. A mixture distribution is made up of several component distribu-
tions. On each trial, the choice of which component distribution generates the sample
is determined by sampling a component identity from a Multinoulli distribution:
P(X) = ∑_i P(C = i) P(X | C = i)
where P (C) is the Multinoulli distribution over component identities. In chapter 9, we
explore the art of building complex probability distributions from simple ones in more
detail. Note that we can think of the variable C as a non-observed (or latent) random
variable that is related to X through their joint distribution P (X, C) = P (X|C)P (C).
Latent variables are discussed further in Section 9.4.2.
A very powerful and common type of mixture model is the Gaussian mixture model,
in which the components P(X | C = i) are Gaussians, each with its mean µ_i and
covariance Σ_i. Some mixtures can have more constraints, for example, the covariances
could be shared across components, i.e., Σ_i = Σ_j = Σ, or the covariance matrices could
be constrained to be diagonal or simply equal to a scalar times the identity. A Gaussian
mixture model is a universal approximator of densities, in the sense that any smooth
density can be approximated to any desired nonzero error by a Gaussian mixture model with
enough components. Gaussian mixture models have been used in many settings, and
are particularly well known for their use as acoustic models in speech recognition (Bahl
et al., 1987).
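Sampling from a mixture follows the generative story directly: first draw the component identity from the Multinoulli, then draw from the chosen Gaussian. The following sketch assumes NumPy, and the mixture parameters are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)

# A three-component univariate Gaussian mixture; the parameters are made up.
weights = np.array([0.5, 0.3, 0.2])        # Multinoulli P(C = i)
means = np.array([-2.0, 0.0, 3.0])         # mu_i
stds = np.array([0.5, 1.0, 0.3])           # sigma_i

def sample_mixture(n):
    # First sample the latent component identity C, then sample X | C.
    c = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[c], stds[c])

samples = sample_mixture(10_000)
print(samples.mean(), samples.std())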
3.11 Useful properties of common functions
Certain functions arise often while working with probability distributions, especially the
probability distributions used in deep learning models.
One of these functions is the logistic sigmoid:
σ(x) = 1/(1 + exp(−x)).
The logistic sigmoid is commonly used to produce the φ parameter of a Bernoulli dis-
tribution because its range is (0, 1), which lies within the valid range of values for the φ
parameter. See Fig. 3.3 for a graph of the sigmoid function.
Another commonly encountered function is the softplus function:
ζ(x) = log (1 + exp(x)) .
Figure 3.3: The logistic sigmoid function.
Figure 3.4: The softplus function.
The softplus function can be useful for producing the β or σ parameter of a normal
distribution because its range is R^+. It also arises commonly when manipulating
expressions involving sigmoids. The name of the softplus function comes from the fact
that it is a smoothed or “softened” version of

x^+ = max(0, x).
See Fig. 3.4 for a graph of the softplus function.
The following properties are all useful to have memorized:
σ(x) = exp(x) / (exp(x) + exp(0))
d/dx σ(x) = σ(x)(1 − σ(x))
1 − σ(x) = σ(−x)
log σ(x) = −ζ(−x)
d/dx ζ(x) = σ(x)
ζ(x) − ζ(−x) = x
∀x ∈ (0, 1), σ^{−1}(x) = log(x/(1 − x))
∀x > 0, ζ^{−1}(x) = log(exp(x) − 1)
The property ζ(x) − ζ(−x) = x provides extra justification for the name “softplus”, since
it parallels the identity x^+ − x^− = x, where x^− = max(0, −x).
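Several of these identities are easy to verify numerically; the sketch below assumes NumPy and checks three of them on an arbitrary grid of inputs.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.log1p(np.exp(x))   # log(1 + exp(x))

x = np.linspace(-5, 5, 11)
print(np.allclose(1.0 - sigmoid(x), sigmoid(-x)))        # 1 - sigma(x) = sigma(-x)
print(np.allclose(np.log(sigmoid(x)), -softplus(-x)))    # log sigma(x) = -zeta(-x)
print(np.allclose(softplus(x) - softplus(-x), x))        # zeta(x) - zeta(-x) = x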
3.12 Bayes’ rule
We often find ourselves in a situation where we know P (Y | X) and need to know
P (X | Y ). Fortunately, if we also know P (X), we can compute the desired quantity
using Bayes’ rule:
P(X | Y) = P(X) P(Y | X) / P(Y).

Note that while P(Y) appears in the formula, it is usually feasible to compute P(Y) =
∑_x P(Y | x) P(x), so we do not need to begin with knowledge of P(Y).
Bayes’ rule is straightforward to derive from the definition of conditional probability,
but it is useful to know the name of this formula since many texts refer to it by name.
It is named after the Reverend Thomas Bayes, who first discovered a special case of the
formula. The general version presented here was independently discovered by Pierre-
Simon Laplace.
3.13 Technical details of continuous variables
A proper formal understanding of continuous random variables and probability density
functions requires developing probability theory in terms of a branch of mathematics
known as measure theory. Measure theory is beyond the scope of this textbook, but we
can briefly sketch some of the issues that measure theory is employed to resolve.
In section 3.3.2, we saw that the probability of x lying in some set S is given by
the integral of p(x) over the set S. Some choices of set S can produce paradoxes. For
example, it is possible to construct two sets S_1 and S_2 such that P(S_1) + P(S_2) > 1 but
S_1 ∩ S_2 = ∅. These sets are generally constructed making very heavy use of the infinite
precision of real numbers, for example by making fractal-shaped sets or sets that are
defined by transforming the set of rational numbers.[3] One of the key contributions of
measure theory is to provide a characterization of the set of sets that we can compute
the probability of without encountering paradoxes. In this book, we only integrate over
sets with relatively simple descriptions, so this aspect of measure theory never becomes
a relevant concern.
[3] The Banach-Tarski theorem provides a fun example of such sets.

For our purposes, measure theory is more useful for describing theorems that apply
to most points in R^n but do not apply to some corner cases. Measure theory provides a
rigorous way of describing that a set of points is negligibly small. Such a set is said to
have “measure zero”. We do not formally define this concept in this textbook. However,
it is useful to understand the intuition that a set of measure zero occupies no volume
in the space we are measuring. For example, within R², a line has measure zero, while
a filled polygon has positive measure. Likewise, an individual point has measure zero.
Any union of countably many sets that each have measure zero also has measure zero
(so the set of all the rational numbers has measure zero, for instance).
Another useful term from measure theory is “almost everywhere”. A property that
holds almost everywhere holds throughout all of space except for on a set of measure
zero. Because the exceptions occupy a negligible amount of space, they can be safely
ignored for many applications. Some important results in probability theory hold for all
discrete values but only hold “almost everywhere” for continuous values.
One other detail we must be aware of relates to handling random variables that are
deterministic functions of one another. Suppose we have two random variables, x and
y, such that y = g(x). You might think that p_y(y) = p_x(g^{−1}(y)). This is actually not
the case.
Suppose y = x/2 and x ∼ U(0, 1). If we use the rule p_y(y) = p_x(2y) then p_y will be
0 everywhere except the interval [0, 1/2], and it will be 1 on this interval. This means

∫_{−∞}^{∞} p_y(y) dy = 1/2,

which violates the definition of a probability distribution.
This common mistake is wrong because it fails to account for the distortion of space
introduced by the function g(x). Recall that the probability of x lying in an infinitesi-
mally small region with volume δx is given by p(x)δx. Since g can expand or contract
space, the infinitesimal volume surrounding x in x space may have different volume in
y space. To correct the problem, we need to preserve the property
|p_y(g(x)) dy| = |p_x(x) dx|.
Solving from this, we obtain

p_y(y) = p_x(g^{−1}(y)) |∂x/∂y|

or equivalently

p_x(x) = p_y(g(x)) |∂g(x)/∂x|. (3.1)
In higher dimensions, the absolute value of the derivative generalizes to the determinant
of the Jacobian matrix, the matrix with J_{i,j} = ∂x_i/∂y_j.
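The worked example above with y = x/2 can be checked numerically. The sketch below (assuming NumPy; the grid and sample sizes are arbitrary) shows that omitting the Jacobian factor |dx/dy| = 2 gives a "density" that integrates to 1/2, while including it restores normalization and matches a histogram of samples of y.

import numpy as np

rng = np.random.default_rng(0)

# x ~ U(0, 1) and y = g(x) = x / 2, so g^{-1}(y) = 2y and |dx/dy| = 2.
x = rng.uniform(0.0, 1.0, size=1_000_000)
y = x / 2.0

def p_y_wrong(y):
    return np.where((2 * y >= 0) & (2 * y <= 1), 1.0, 0.0)   # p_x(g^{-1}(y)), no Jacobian

def p_y_correct(y):
    return p_y_wrong(y) * 2.0                                # multiply by |dx/dy| = 2

grid = np.linspace(-0.5, 1.0, 30_001)
dy = grid[1] - grid[0]
print(np.sum(p_y_wrong(grid)) * dy)      # ~0.5, violates normalization
print(np.sum(p_y_correct(grid)) * dy)    # ~1.0
print(np.histogram(y, bins=10, range=(0, 0.5), density=True)[0])  # empirically ~2 on [0, 0.5]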
3.14 Example: Naive Bayes
We now know enough probability theory that we can derive some simple machine learn-
ing tools.
The Naive Bayes model is a simple probabilistic model that is often used to recognize
patterns. The model consists of one random variable C representing a category, and a
set of random variables F = {F^{(1)}, . . . , F^{(n)}} representing features of objects in each
category. In this example, we’ll use Naive Bayes to diagnose patients as having the flu
or not. C can thus have two values: c_0 representing the category of patients who do
not have the flu, and c_1 representing the category of patients who do. Suppose F^{(1)}
is the random variable representing whether the patient has a sore throat, with f^{(1)}_0
representing no sore throat, and f^{(1)}_1 representing a sore throat. Suppose F^{(2)} ∈ R is
the patient’s temperature in degrees Celsius.
When using the Naive Bayes model, we assume that all of the features are indepen-
dent from each other given the category:
P(C, F^{(1)}, . . . , F^{(n)}) = P(C) ∏_i P(F^{(i)} | C).
These assumptions are very strong and unlikely to be true in practice, hence the name
“naive”. Surprisingly, Naive Bayes often produces good predictions in practice, and is
a good baseline model to start with when tackling a new problem.
Beyond these conditional independence assumptions, the Naive Bayes framework
does not specify anything about the probability distribution. The specific choice of
distributions is left up to the designer. In our flu example, let’s make P (C) a Bernoulli
distribution, with P(C = c_1) = φ^{(C)}. We can also make P(F^{(1)} | C) a Bernoulli
distribution, with

P(F^{(1)} = f^{(1)}_1 | C = c) = φ^{(F)}_c.
In other words, the Bernoulli parameter changes depending on the value of C. Finally, we
need to choose the distribution over F^{(2)}. Since F^{(2)} is real-valued, a normal distribution
is a good choice. Because F^{(2)} is a temperature, there are hard limits to the values it
can take on: it cannot go below 0 K, for example. Fortunately, these values are so far
from the values measured in human patients that we can safely ignore these hard limits.
Values outside the hard limits will receive extremely low probability under the normal
distribution so long as the mean and variance are set correctly. As with F^{(1)}, we need
to use different parameters for different values of C, to represent that patients with the
flu have different temperatures than patients without it:

F^{(2)} ∼ N(F^{(2)} | µ_c, σ²_c).
Now we are ready to determine how likely a patient is to have the flu. To do this,
we want to compute P(C | F ), but we know P (C) and P (F | C). This suggests that
we should use Bayes’ rule to determine the desired distribution. The word “Bayes” in
the name “Naive Bayes” comes from this frequent use of Bayes’ rule in conjunction with
the model. We begin by applying Bayes’ rule:
P(C | F) = P(C) P(F | C) / P(F). (3.2)
We do not know P(F). Fortunately, it is easy to compute:

P(F) = ∑_{c∈C} P(C = c, F)   (by the sum rule)
     = ∑_{c∈C} P(C = c) P(F | C = c)   (by the chain rule).

Substituting this result back into equation 3.2, we obtain

P(C | F) = P(C) P(F | C) / ∑_{c∈C} P(C = c) P(F | C = c)
         = [P(C) ∏_i P(F^{(i)} | C)] / [∑_{c∈C} P(C = c) ∏_i P(F^{(i)} | C = c)]
by the Naive Bayes assumptions. This is as far as we can simplify the expression for a
general Naive Bayes model.
We can simplify the expression further by substituting in the definitions of the par-
ticular probability distributions we have defined for our flu diagnosis example:
P(C = c | F^{(1)} = f_1, F^{(2)} = f_2) = g(c) / ∑_{c′∈C} g(c′)

where

g(c) = P(C = c) P(F^{(1)} = f_1 | C = c) p(F^{(2)} = f_2 | C = c).
Since C only has two possible values in our example, we can simplify this to:

P(C = c_1 | F^{(1)} = f_1, F^{(2)} = f_2) = g(1) / (g(0) + g(1))
   = 1 / (1 + g(0)/g(1))
   = 1 / (1 + exp(log g(0) − log g(1)))
   = σ(log g(1) − log g(0)). (3.3)
To go further, let’s simplify log g(i):

log g(i) = log[ (φ^{(C)})^i (1 − φ^{(C)})^{1−i} (φ^{(F)}_i)^{f_1} (1 − φ^{(F)}_i)^{1−f_1} √(1/(2πσ²_i)) exp(−(f_2 − µ_i)²/(2σ²_i)) ]

   = i log φ^{(C)} + (1 − i) log(1 − φ^{(C)}) + f_1 log φ^{(F)}_i + (1 − f_1) log(1 − φ^{(F)}_i)
     + (1/2) log(1/(2πσ²_i)) − (f_2 − µ_i)²/(2σ²_i).
Substituting this back into equation 3.3, we obtain

P(C = c_1 | F^{(1)} = f_1, F^{(2)} = f_2) =
   σ( log φ^{(C)} − log(1 − φ^{(C)})
      + f_1 log φ^{(F)}_1 + (1 − f_1) log(1 − φ^{(F)}_1)
      − f_1 log φ^{(F)}_0 − (1 − f_1) log(1 − φ^{(F)}_0)
      − (1/2) log 2πσ²_1 + (1/2) log 2πσ²_0
      − (f_2 − µ_1)²/(2σ²_1) + (f_2 − µ_0)²/(2σ²_0) ).
From this formula, we can read off various intuitive properties of the Naive Bayes
classifier’s behavior on this example problem, regarding the inference that can be drawn
from a trained model. The probability of the patient having the flu grows like a sigmoidal
curve. We move farther to the left as f_2, the patient’s temperature, moves farther away
from µ_1, the average temperature of a flu patient.
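To make the whole pipeline concrete, here is a hypothetical Python sketch of the resulting classifier. The parameter values are invented for illustration (in practice they would be estimated from data), and the function names are not part of any library.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters for the flu example.
phi_C = 0.2                    # P(C = c1), prior probability of flu
phi_F = {0: 0.1, 1: 0.7}       # P(sore throat | C = c), for c = 0 and c = 1
mu = {0: 36.8, 1: 38.5}        # mean temperature (Celsius) per class
sigma2 = {0: 0.25, 1: 0.64}    # temperature variance per class

def log_g(c, f1, f2):
    """log of P(C = c) P(F1 = f1 | C = c) p(F2 = f2 | C = c)."""
    log_prior = np.log(phi_C) if c == 1 else np.log(1 - phi_C)
    log_sore = f1 * np.log(phi_F[c]) + (1 - f1) * np.log(1 - phi_F[c])
    log_temp = -0.5 * np.log(2 * np.pi * sigma2[c]) - (f2 - mu[c]) ** 2 / (2 * sigma2[c])
    return log_prior + log_sore + log_temp

def p_flu(f1, f2):
    """P(C = c1 | F1 = f1, F2 = f2) = sigma(log g(1) - log g(0))."""
    return sigmoid(log_g(1, f1, f2) - log_g(0, f1, f2))

print(p_flu(f1=1, f2=39.0))   # sore throat and high fever: high probability
print(p_flu(f1=0, f2=36.7))   # no symptoms: low probability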