
CHAPTER 3. PROBABILITY AND INFORMATION THEORY
machine learning makes heavy use of probability theory.
This is because machine learning must always deal with uncertain quantities,
and sometimes may also need to deal with stochastic (non-deterministic) quantities.
Uncertainty and stochasticity can arise from many sources. Researchers
have made compelling arguments for quantifying uncertainty using probability
since at least the 1980s. Many of the arguments presented here are summarized
from or inspired by Pearl (1988).
Nearly all activities require some ability to reason in the presence of uncertainty.
In fact, beyond mathematical statements that are true by definition, it is
difficult to think of any proposition that is absolutely true or any event that is
absolutely guaranteed to occur.
There are three possible sources of uncertainty:
1. Inherent stochasticity in the system being modeled. For example, most
interpretations of quantum mechanics describe the dynamics of subatomic
particles as being probabilistic. We can also create theoretical scenarios that
we postulate to have random dynamics, such as a hypothetical card game
where we assume that the cards are truly shuffled into a random order.
2. Incomplete observability. Even deterministic systems can appear stochastic
when we cannot observe all of the variables that drive the behavior of the
system. For example, in the Monty Hall problem, a game show contestant is
asked to choose between three doors and wins a prize held behind the chosen
door. Two doors lead to a goat while a third leads to a car. The outcome
given the contestant’s choice is deterministic, but from the contestant’s point
of view, the outcome is uncertain.
3. Incomplete modeling. When we use a model that must discard some of the
information we have observed, the discarded information results in uncertainty
in the model’s predictions. For example, suppose we build a robot
that can exactly observe the location of every object around it. If the robot
discretizes space when predicting the future location of these objects, then
the discretization makes the robot immediately become uncertain about
the precise position of objects: each object could be anywhere within the
discrete cell that it was observed to occupy.
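The second and third sources can be made concrete with a small Python sketch. It is only an illustration under stated assumptions, not part of the original text: the function names (wins, win_fraction, discretize, cell_bounds), the uniform treatment of the hidden car position, and the cell size of 0.5 are all hypothetical choices made for this example.

```python
from fractions import Fraction

DOORS = (0, 1, 2)


def wins(car_door, chosen_door):
    """Incomplete observability: the outcome is a deterministic function
    of the hidden car position and the contestant's choice."""
    return chosen_door == car_door


def win_fraction(chosen_door):
    """Contestant's view: the car position is unobserved, so we average
    over every position it could have.
    Assumption (ours): each door is equally likely to hide the car."""
    outcomes = [wins(car_door, chosen_door) for car_door in DOORS]
    return Fraction(sum(outcomes), len(outcomes))


def discretize(x, cell_size):
    """Incomplete modeling: keep only the index of the grid cell that
    contains the exactly observed position x."""
    return int(x // cell_size)


def cell_bounds(index, cell_size):
    """Every position in [low, high) maps to the same cell index, so the
    model can no longer tell these positions apart."""
    return (index * cell_size, (index + 1) * cell_size)


if __name__ == "__main__":
    # With the hidden state known, nothing is random.
    print(wins(car_door=2, chosen_door=2))   # True
    print(wins(car_door=2, chosen_door=0))   # False
    # Without it, the same deterministic rule looks like a 1-in-3 chance.
    print(win_fraction(chosen_door=0))       # 1/3
    # An exact observation becomes an interval after discretization.
    cell = discretize(2.37, cell_size=0.5)
    print(cell, cell_bounds(cell, cell_size=0.5))  # 4 (2.0, 2.5)
```

In both halves of the sketch the underlying quantity is fully determined. The uncertainty comes from what the observer or the model retains, not from the system itself.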
In many cases, it is more practical to use a simple but uncertain rule rather
than a complex but certain one, even if the true rule is deterministic and our
modeling system has the fidelity to accommodate a complex rule. For example,
the simple rule “Most birds fly” is cheap to develop and is broadly useful, while a
rule of the form, “Birds fly, except for very young birds that have not yet learned
to fly, sick or injured birds that have lost the ability to fly, flightless species of birds