Notation
This section provides a concise reference describing the notation used throughout
this book. If you are unfamiliar with any of these mathematical concepts, this
notation reference may seem intimidating. However, do not despair: we describe
most of these ideas in chapters 1-3.
Numbers and Arrays
$a$    A scalar (integer or real) value with the name “a”
$\boldsymbol{a}$    A vector with the name “a”
$\boldsymbol{A}$    A matrix with the name “A”
$\mathsf{A}$    A tensor with the name “A”
$\boldsymbol{I}_n$    Identity matrix with n rows and n columns
$\boldsymbol{I}$    Identity matrix with dimensionality implied by context
$\boldsymbol{e}_i$    Standard basis vector [0, . . . , 0, 1, 0, . . . , 0] with a 1 at position i
$\text{diag}(\boldsymbol{a})$    A square, diagonal matrix with entries given by $\boldsymbol{a}$
$\mathrm{a}$    A scalar random variable with the name “a”
$\mathbf{a}$    A vector-valued random variable with the name “a”
$\mathbf{A}$    A matrix-valued random variable with the name “A”
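As a minimal sketch of the identity matrix, standard basis vector, and diag definitions above, the following Python code builds each object from plain nested lists. The function names `identity`, `basis_vector`, and `diag` are hypothetical helpers chosen for illustration; they are not part of this book's notation or of any particular library.

```python
def identity(n):
    """Return the n x n identity matrix I_n as a list of rows."""
    return [[1 if r == c else 0 for c in range(n)] for r in range(n)]


def basis_vector(i, n):
    """Return the standard basis vector e_i of length n.

    The position i is 1-based, matching the indexing convention of the text.
    """
    if not 1 <= i <= n:
        raise ValueError("i must satisfy 1 <= i <= n")
    return [1 if k == i - 1 else 0 for k in range(n)]


def diag(a):
    """Return the square diagonal matrix whose diagonal entries are given by a."""
    n = len(a)
    return [[a[r] if r == c else 0 for c in range(n)] for r in range(n)]
```

For instance, `diag([1, 1, 1])` and `identity(3)` produce the same matrix, and `basis_vector(2, 4)` has its single 1 in the second position.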
Sets and Graphs
$\mathbb{A}$    A set with the name “A”
$\mathbb{R}$    The set of real numbers
$\{0, 1\}$    The set containing 0 and 1
$\{0, 1, \dots, n\}$    The set of all integers between 0 and n
$[a, b]$    The real interval including a and b
$(a, b]$    The real interval excluding a but including b
$\mathbb{A} \backslash \mathbb{B}$    Set subtraction, i.e., the elements of $\mathbb{A}$ that are not in $\mathbb{B}$
$\mathcal{G}$    A graph with the name “G”
$Pa_{\mathcal{G}}(\mathrm{x}_i)$    The parents of $\mathrm{x}_i$ in $\mathcal{G}$
Indexing
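$a_i$    Element i of vector $\boldsymbol{a}$, with indexing starting at 1
$a_{-i}$    All elements of vector $\boldsymbol{a}$ except for element i
$A_{i,j}$    Element i, j of matrix $\boldsymbol{A}$
$\boldsymbol{A}_{i,:}$    Row i of matrix $\boldsymbol{A}$
$\boldsymbol{A}_{:,i}$    Column i of matrix $\boldsymbol{A}$
$\mathsf{A}_{i,j,k}$    Element (i, j, k) of a 3-D tensor $\mathsf{A}$
$\mathsf{A}_{:,:,i}$    2-D slice of a 3-D tensor
$\mathrm{a}_i$    Element i of the random vector $\mathbf{a}$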
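Because the indexing notation above starts at 1 while Python sequences start at 0, the following sketch shows one way to translate between the two conventions. The helper names (`element`, `all_but`, `row`, `column`) are hypothetical and exist only for this illustration; vectors are assumed to be lists and matrices lists of rows.

```python
def element(a, i):
    """Element i of vector a, with i counted from 1 as in the text."""
    return a[i - 1]


def all_but(a, i):
    """All elements of vector a except for element i (1-based)."""
    return a[:i - 1] + a[i:]


def row(A, i):
    """Row i of matrix A (1-based)."""
    return list(A[i - 1])


def column(A, i):
    """Column i of matrix A (1-based)."""
    return [r[i - 1] for r in A]
```

With `a = [10, 20, 30]`, `element(a, 1)` is the first entry and `all_but(a, 2)` drops the middle one.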
Linear Algebra Operations
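$\boldsymbol{A}^\top$    Transpose of matrix $\boldsymbol{A}$
$\boldsymbol{A}^+$    Moore-Penrose pseudoinverse of $\boldsymbol{A}$
$\boldsymbol{A} \odot \boldsymbol{B}$    Element-wise (Hadamard) product of $\boldsymbol{A}$ and $\boldsymbol{B}$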
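To make the transpose and the element-wise (Hadamard) product concrete, here is a small sketch that represents matrices as lists of rows. The function names `transpose` and `hadamard` are our own illustrative choices, and the shape check is an assumption about how mismatched inputs should be handled.

```python
def transpose(A):
    """Return the transpose of A: entry (i, j) of the result is entry (j, i) of A."""
    return [list(col) for col in zip(*A)]


def hadamard(A, B):
    """Return the element-wise product of A and B, which must have the same shape."""
    if len(A) != len(B) or any(len(ra) != len(rb) for ra, rb in zip(A, B)):
        raise ValueError("A and B must have the same shape")
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]
```

Note that the Hadamard product multiplies matching entries and is not the ordinary matrix product.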
Calculus
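$\frac{dy}{dx}$    Derivative of y with respect to x
$\frac{\partial y}{\partial x}$    Partial derivative of y with respect to x
$\nabla_{\boldsymbol{x}} y$    Gradient of y with respect to $\boldsymbol{x}$
$\nabla_{\boldsymbol{X}} y$    Matrix derivatives of y with respect to $\boldsymbol{X}$
$\frac{\partial f}{\partial \boldsymbol{x}}$    Jacobian matrix $\boldsymbol{J} \in \mathbb{R}^{m \times n}$ of a function $f : \mathbb{R}^n \to \mathbb{R}^m$
$H(f)(\boldsymbol{x})$    The Hessian matrix of f at input point $\boldsymbol{x}$
$\int f(\boldsymbol{x}) \, d\boldsymbol{x}$    Definite integral over the entire domain of $\boldsymbol{x}$
$\int_{\mathbb{S}} f(\boldsymbol{x}) \, d\boldsymbol{x}$    Definite integral with respect to $\boldsymbol{x}$ over the set $\mathbb{S}$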
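The shape convention for the Jacobian (m rows for the outputs, n columns for the inputs) can be checked numerically. The sketch below approximates the Jacobian with central finite differences; this is a generic numerical technique used here only for illustration, and the function name `jacobian` and the step size `h` are assumptions rather than anything prescribed by this book.

```python
def jacobian(f, x, h=1e-6):
    """Approximate the m x n Jacobian of f: R^n -> R^m at the point x.

    Entry (i, j) estimates the partial derivative of output i with respect
    to input j using a central difference with step size h.
    """
    n = len(x)
    m = len(f(x))
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += h
        x_minus[j] -= h
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return J


def example_f(x):
    """A hypothetical function from R^2 to R^3 used to exercise jacobian."""
    return [x[0] * x[1], x[0] + x[1], x[1] ** 2]
```

Calling `jacobian(example_f, [2.0, 3.0])` returns a list of 3 rows with 2 entries each, matching m = 3 and n = 2.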
Probability and Information Theory
$\mathrm{a} \perp \mathrm{b}$    The random variables a and b are independent
$\mathrm{a} \perp \mathrm{b} \mid \mathrm{c}$    They are conditionally independent given c
$\mathbb{E}_{\mathrm{x} \sim P}[f(x)]$ or $\mathbb{E} f(x)$    Expectation of f(x) with respect to P(x)
$\text{Var}(f(x))$    Variance of f(x) under P(x)
$\text{Cov}(f(x), g(x))$    Covariance of f(x) and g(x) under P(x)
$H(\mathrm{x})$    Shannon entropy of the random variable x
$D_{\text{KL}}(P \| Q)$    Kullback-Leibler divergence of P and Q
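For a discrete random variable, the quantities above reduce to finite sums, which the following sketch computes. It assumes a distribution is stored as a dictionary mapping each outcome to its probability, uses the natural logarithm (so entropy and divergence are in nats), and assumes Q assigns nonzero probability wherever P does. All function names are hypothetical.

```python
import math


def expectation(f, P):
    """Expectation of f(x) with respect to the discrete distribution P."""
    return sum(p * f(x) for x, p in P.items())


def variance(f, P):
    """Variance of f(x) under P: the expected squared deviation from the mean."""
    mu = expectation(f, P)
    return expectation(lambda x: (f(x) - mu) ** 2, P)


def covariance(f, g, P):
    """Covariance of f(x) and g(x) under P."""
    mu_f, mu_g = expectation(f, P), expectation(g, P)
    return expectation(lambda x: (f(x) - mu_f) * (g(x) - mu_g), P)


def entropy(P):
    """Shannon entropy of P in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in P.values() if p > 0)


def kl_divergence(P, Q):
    """Kullback-Leibler divergence of P and Q in nats (assumes Q[x] > 0 where P[x] > 0)."""
    return sum(p * math.log(p / Q[x]) for x, p in P.items() if p > 0)
```

For a fair coin `P = {0: 0.5, 1: 0.5}`, the entropy is log 2 and the divergence of P from itself is 0.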
Functions
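$f \circ g$    Composition of the functions f and g
$f(\boldsymbol{x}; \boldsymbol{\theta})$    A function of $\boldsymbol{x}$ parameterized by $\boldsymbol{\theta}$
$\log x$    Natural logarithm of x
$\sigma(x)$    Logistic sigmoid, $\frac{1}{1 + \exp(-x)}$
$\zeta(x)$    Softplus, $\log(1 + \exp(x))$
$\|\boldsymbol{x}\|_p$    $L^p$ norm of $\boldsymbol{x}$
$x^+$    Positive part of x, i.e., max(0, x)
$\mathbf{1}_{\text{condition}}$    is 1 if the condition is true, 0 otherwise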
Sometimes we write $f(\boldsymbol{x})$, $f(\boldsymbol{X})$, or $f(\mathsf{X})$ when f is a function of a scalar rather
than of a vector, matrix, or tensor. In this case, we mean to apply f to the array
element-wise. For example, if $\mathsf{C} = \sigma(\mathsf{X})$, then $\mathsf{C}_{i,j,k} = \sigma(\mathsf{X}_{i,j,k})$ for all valid values
of i, j and k.
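The element-wise convention can be sketched in a few lines of Python. The code below defines scalar versions of the standard logistic sigmoid 1/(1 + exp(-x)), the softplus, the positive part, and the L^p norm, then applies a scalar function to every entry of a nested list. The names are hypothetical, arrays are assumed to be nested lists, and the scalar functions are written naively, without guarding against overflow for inputs of very large magnitude.

```python
import math


def sigmoid(x):
    """Logistic sigmoid of a scalar x."""
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    """Softplus of a scalar x."""
    return math.log(1.0 + math.exp(x))


def positive_part(x):
    """Positive part of a scalar x, i.e., max(0, x)."""
    return max(0, x)


def lp_norm(x, p):
    """L^p norm of a vector x given as a list, for p >= 1."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def apply_elementwise(f, X):
    """Apply the scalar function f to every entry of a nested list X."""
    if isinstance(X, (list, tuple)):
        return [apply_elementwise(f, v) for v in X]
    return f(X)
```

For example, `apply_elementwise(sigmoid, X)` on a 3-D nested list returns a nested list of the same shape whose entry (i, j, k) is the sigmoid of entry (i, j, k) of `X`.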
Datasets and Distributions
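$\mathbb{X}$    A set of training examples
$\boldsymbol{x}^{(i)}$    The i-th example (input) from a dataset
$y^{(i)}$ or $\boldsymbol{y}^{(i)}$    The target associated with $\boldsymbol{x}^{(i)}$ for supervised learning
$\boldsymbol{X}$    The m × n matrix with input example $\boldsymbol{x}^{(i)}$ in row $\boldsymbol{X}_{i,:}$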