
CHAPTER 3. PROBABILITY AND INFORMATION THEORY
Machine learning makes heavy use of probability theory.
This is because machine learning must always deal with uncertain quantities,
and sometimes may also need to deal with stochastic quantities. Uncertainty
and stochasticity can arise from many sources. Researchers have made
compelling arguments for quantifying uncertainty using probability since at least the
1980s. Many of the arguments presented here are summarized from or inspired
by Pearl (1988). Much earlier work in probability and engineering introduced
and developed the underlying fundamental notions, such as exchangeability
(de Finetti, 1937), Cox’s theorem as a foundation of Bayesian
inference (Cox, 1946), and the theory of stochastic processes (Doob, 1953).
Nearly all activities require some ability to reason in the presence of
uncertainty. In fact, beyond mathematical statements that are true by definition, it is
difficult to think of any proposition that is absolutely true or any event that is
absolutely guaranteed to occur.
One source of uncertainty is incomplete observability. When we cannot
observe something, we are uncertain about its true nature. In machine learning, it
is often the case that we can observe a large amount of data, but there is not a
data instance for every situation we care about. We are also generally not able to
observe directly what process generates the data. Since we are uncertain about
what process generates the data, we are also uncertain about what happens in
the situations for which we have not observed data points. Lack of observability
can also give rise to apparent stochasticity. Deterministic systems can appear
stochastic when we cannot observe all of the variables that drive the behavior of
the system. For example, consider a game of Russian roulette. The outcome is
deterministic if you know which chamber of the revolver is loaded. If you do not
know this important information, then it is a game of chance. In many cases, we
are able to observe some quantity, but our measurement is itself uncertain. For
example, laser range finders may have several centimeters of random error.
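
The way a deterministic system can look stochastic when one of its driving
variables is hidden can be illustrated with a small sketch. Everything here is
hypothetical and chosen only for illustration: the rule `outcome`, the names
`visible` and `hidden`, and the assumption that the hidden variable takes each
of six values equally often across trials.

```python
import random

def outcome(visible, hidden):
    """A fully deterministic rule (hypothetical): depends on both inputs."""
    return (visible + hidden) % 6 == 0

def observed_frequency(visible, trials=10_000, seed=0):
    """Estimate how often outcome() is True when `hidden` is never observed.

    Assumption: across trials the hidden variable is equally likely to be
    any of 0..5. The random draw models our ignorance, not the rule itself.
    """
    rng = random.Random(seed)
    hits = sum(outcome(visible, rng.randrange(6)) for _ in range(trials))
    return hits / trials

# With both variables known, the result is certain.
certain = outcome(visible=2, hidden=4)  # True, since (2 + 4) % 6 == 0

# With `hidden` unobserved, the same rule behaves like a chance event
# that occurs in roughly one trial out of six.
freq = observed_frequency(visible=2)
```

The rule itself never changes between the two cases; only the observer's
knowledge does, which is why the second case is best described with a
probability rather than a definite answer.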
Uncertainty can also arise from the simplifications we make in order to model
real-world processes. For example, if we discretize space, then we immediately
become uncertain about the precise position of objects: each object could be
anywhere within the discrete cell that we know it occupies.
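
The discretization example can be made concrete with a minimal sketch,
assuming a one-dimensional space divided into equal-width cells. The function
names `to_cell` and `cell_bounds` and the cell width of 0.5 are illustrative
assumptions, not part of the text.

```python
import math

def to_cell(x, cell_width):
    """Map a continuous position to the index of the cell containing it."""
    return math.floor(x / cell_width)

def cell_bounds(index, cell_width):
    """Return the half-open interval [low, high) covered by a cell."""
    return index * cell_width, (index + 1) * cell_width

cell_width = 0.5
index = to_cell(1.37, cell_width)            # 1.37 / 0.5 = 2.74, so cell 2
low, high = cell_bounds(index, cell_width)   # (1.0, 1.5)

# Distinct true positions collapse to the same cell, so the cell index
# alone cannot tell them apart.
same_cell = to_cell(1.02, cell_width) == to_cell(1.49, cell_width)
```

Once only `index` is stored, all that is known about the object is that its
position lies somewhere between `low` and `high`.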
Conceivably, the universe itself could have stochastic dynamics, but we make
no claim on this subject.
In many cases, it is more practical to use a simple but uncertain rule rather
than a complex but certain one, even if our modeling system has the fidelity to
accommodate a complex rule. For example, the simple rule “Most birds fly” is
cheap to develop and is broadly useful, while a rule of the form, “Birds fly, except
for very young birds that have not yet learned to fly, sick or injured birds that have
lost the ability to fly, flightless species of birds including the cassowary, ostrich,