Chapter 2
Linear algebra
Linear algebra is a branch of mathematics that is widely used throughout science and
engineering. However, because linear algebra is a form of continuous rather than dis-
crete mathematics, many computer scientists have little experience with it. A good
understanding of linear algebra is essential for understanding deep learning and working
with deep learning algorithms. We therefore begin the technical content of the book
with a focused presentation of the key linear algebra ideas that are most important in
deep learning.
If you are already familiar with linear algebra, feel free to skip this chapter. If
you have previous experience with these concepts but need a detailed reference sheet
to review key formulas, we recommend The Matrix Cookbook (Petersen and Pedersen,
2006). If you have no exposure at all to linear algebra, this chapter will teach you
enough to get by, but we highly recommend that you also consult another resource
focused exclusively on teaching linear algebra, such as Shilov (1977).
2.1 Scalars, vectors, matrices and tensors
The study of linear algebra involves several types of mathematical objects:
Scalars: A scalar is just a single number, in contrast to most of the other objects
studied in linear algebra, which are usually arrays of multiple numbers. We write
scalars in italics. We usually give scalars lower-case variable names. When we
introduce them, we specify what kind of number they are. For example, we might
say “Let $s \in \mathbb{R}$ be the slope of the line,” while defining a real-valued scalar, or
“Let $n \in \mathbb{N}$ be the number of units,” while defining a natural number scalar.
Vectors: A vector is an array of numbers. The numbers have an order to them, and
we can identify each individual number by its index in that ordering. Typically we
give vectors lower case names written in bold typeface, such as x. The elements
of the vector are identified by writing its name in italic typeface, with a subscript.
The first element of $x$ is $x_1$, the second element is $x_2$, and so on. We also need
to say what kind of numbers are stored in the vector. If each element is in $\mathbb{R}$,
and the vector has $n$ elements, then the vector lies in the set formed by taking
the Cartesian product of $\mathbb{R}$ $n$ times, denoted as $\mathbb{R}^n$. When we need to explicitly
identify the elements of a vector, we write them as a column enclosed in square
brackets:
$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.$$
We can think of vectors as identifying points in space, with each element giving
the coordinate along a different axis.
Sometimes we need to index a set of elements of a vector. In this case, we define a
set containing the indices, and write the set as a subscript. For example, to access
$x_1$, $x_3$, and $x_6$, we define the set $S = \{1, 3, 6\}$ and write $x_S$. We use the $-$ sign
to index the complement of a set. For example, $x_{-1}$ is the vector containing all
elements of $x$ except for $x_1$, and $x_{-S}$ is the vector containing all of the elements
of $x$ except for $x_1$, $x_3$, and $x_6$.
Matrices: A matrix is a 2-D array of numbers, so each element is identified by two
indices instead of just one. We usually give matrices upper-case variable names
with bold typeface, such as A. If a real-valued matrix A has a height of m and
a width of $n$, then we say that $A \in \mathbb{R}^{m \times n}$. We usually identify the elements of a
matrix with its lower-case name, and the indices are listed with separating commas.
For example, $a_{1,1}$ is the upper left entry of $A$ and $a_{m,n}$ is the bottom right entry.
We can identify all of the numbers with vertical coordinate $i$ by writing a “:” for
the horizontal coordinate. For example, $A_{i,:}$ denotes the horizontal cross section
of $A$ with vertical coordinate $i$. This is known as the $i$-th row of $A$. Likewise, $A_{:,i}$
is the i-th column of A. When we need to explicitly identify the elements of
a matrix, we write them as an array enclosed in square brackets:
$$\begin{bmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{bmatrix}.$$
Sometimes we may need to index matrix-valued expressions that are not just a
single letter. In this case, we use subscripts after the expression, but do not
convert anything to lower case. For example, $f(A)_{i,j}$ gives element $(i, j)$ of the
matrix computed by applying the function $f$ to $A$.
Tensors: In some cases we will need an array with more than two axes. In the
general case, an array of numbers arranged on a regular grid with a variable number
of axes is known as a tensor.
One important operation on matrices is the transpose. The transpose of a matrix is
the mirror image of the matrix across a diagonal line running down and to the right,
starting from its upper left corner. See Fig. 2.1 for a graphical depiction of this operation.
We denote the transpose of a matrix $A$ as $A^\top$, and it is defined such that
$$(A^\top)_{i,j} = a_{j,i}.$$

Figure 2.1: The transpose of the matrix can be thought of as a mirror image across the main diagonal:
$$A = \begin{bmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \\ a_{3,1} & a_{3,2} \end{bmatrix} \;\Rightarrow\; A^\top = \begin{bmatrix} a_{1,1} & a_{2,1} & a_{3,1} \\ a_{1,2} & a_{2,2} & a_{3,2} \end{bmatrix}$$
Vectors can be thought of as matrices that contain only one column. The transpose
of a vector is therefore a matrix with only one row. Sometimes we define a vector by
writing out its elements in the text inline as a row matrix, then using the transpose
operator to turn it into a standard column vector, e.g. $x = [x_1, x_2, x_3]^\top$.
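As a concrete illustration of these conventions, here is a minimal sketch in NumPy (our own choice of library; the text itself does not prescribe one) showing a matrix, a vector, and their transposes.

```python
import numpy as np

# A 3x2 matrix A and a vector x with 3 elements.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
x = np.array([1.0, 2.0, 3.0])

print(A.T)                     # transpose: shape (2, 3), (A.T)[i, j] == A[j, i]
print(A.T[0, 2] == A[2, 0])    # True
print(x.reshape(-1, 1).T)      # an explicit row vector of shape (1, 3)
```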
2.2 Multiplying matrices and vectors
One of the most important operations involving matrices is multiplication of two ma-
trices. The matrix product of matrices A and B is a third matrix C. In order for this
product to be defined, A must have the same number of columns as B has rows. If A
is of shape m ×n and B is of shape n ×p, then C is of shape m ×p. We can write the
matrix product just by placing two or more matrices together, e.g.
C = AB.
The product operation is defined by
$$c_{i,j} = \sum_k a_{i,k} b_{k,j}.$$
Note that the standard product of two matrices is not just a matrix containing the
product of the individual elements. Such an operation exists and is called the element-wise
product or Hadamard product, and is denoted in this book¹ as $A \odot B$.
The dot product between two vectors $x$ and $y$ of the same dimensionality is the
matrix product $x^\top y$. We can think of the matrix product $C = AB$ as computing $c_{i,j}$
as the dot product between row $i$ of $A$ and column $j$ of $B$.
¹ The element-wise product is used relatively rarely, so the notation for it is not as standardized as
the other operations described in this chapter.
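To make the distinction between these products concrete, the following sketch (our own illustration, not part of the text, assuming NumPy) computes the matrix product, the element-wise (Hadamard) product, and the dot product.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

C = A @ B          # matrix product: C[i, j] = sum_k A[i, k] * B[k, j]
H = A * B          # element-wise (Hadamard) product

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])
d = x @ y          # dot product x^T y, a scalar

# Entry (i, j) of the matrix product is row i of A dotted with column j of B.
print(np.isclose(C[0, 1], A[0, :] @ B[:, 1]))   # True
```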
Matrix product operations have many useful properties that make mathematical
analysis of matrices more convenient. For example, matrix multiplication is distributive:
A(B + C) = AB + AC.
It is also associative:
A(BC) = (AB)C.
Matrix multiplication is not commutative, unlike scalar multiplication.
The transpose of a matrix product also has a simple form:
$$(AB)^\top = B^\top A^\top.$$
Since the focus of this textbook is not linear algebra, we do not attempt to develop a
comprehensive list of useful properties of the matrix product here, but the reader should
be aware that many more exist.
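A quick numerical check of these properties, as a sketch with arbitrary random matrices of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((3, 4))
D = rng.standard_normal((4, 2))

print(np.allclose(A @ (B + C), A @ B + A @ C))   # distributive
print(np.allclose(A @ (B @ D), (A @ B) @ D))     # associative
print(np.allclose((A @ B).T, B.T @ A.T))         # (AB)^T = B^T A^T

S = rng.standard_normal((3, 3))
T = rng.standard_normal((3, 3))
print(np.allclose(S @ T, T @ S))                 # almost surely False: not commutative
```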
We can also multiply matrices and vectors by scalars. To multiply by a scalar, we
just multiply every element of the matrix or vector by that scalar:
$$sA = \begin{bmatrix} s a_{1,1} & \cdots & s a_{1,n} \\ \vdots & \ddots & \vdots \\ s a_{m,1} & \cdots & s a_{m,n} \end{bmatrix}$$
We now know enough linear algebra notation to write down a system of linear equa-
tions:
Ax = b (2.1)
where $A \in \mathbb{R}^{m \times n}$ is a known matrix, $b \in \mathbb{R}^m$ is a known vector, and $x \in \mathbb{R}^n$ is a
vector of unknown variables we would like to solve for. Each element $x_i$ of $x$ is one
of these unknowns. Each row of $A$ and each element of $b$ provide another
constraint. We can rewrite equation 2.1 as:
$$A_{1,:} x = b_1$$
$$A_{2,:} x = b_2$$
$$\cdots$$
$$A_{m,:} x = b_m$$
or, even more explicitly, as:
$$a_{1,1} x_1 + a_{1,2} x_2 + \cdots + a_{1,n} x_n = b_1$$
$$a_{2,1} x_1 + a_{2,2} x_2 + \cdots + a_{2,n} x_n = b_2$$
$$\cdots$$
$$a_{m,1} x_1 + a_{m,2} x_2 + \cdots + a_{m,n} x_n = b_m.$$
Matrix-vector product notation provides a more compact representation for equations
of this form.
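To see the equivalence between the compact notation and the expanded equations, here is a small sketch (with made-up numbers, assuming NumPy) checking that each entry of $Ax$ is the dot product of one row of $A$ with $x$:

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0]])      # m = 2 constraints, n = 3 unknowns
x = np.array([1.0, 2.0, 3.0])
b = A @ x                            # b_i = A[i, :] @ x

for i in range(A.shape[0]):
    print(np.isclose(A[i, :] @ x, b[i]))   # each row gives one scalar equation
```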
$$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
Figure 2.2: Example identity matrix: This is $I_3$.
2.3 Identity and inverse matrices
Linear algebra offers a powerful tool called matrix inversion that allows us to solve
equation 2.1 for many values of A.
To describe matrix inversion, we first need to define the concept of an identity matrix.
An identity matrix is a matrix that does not change any vector when we multiply that
vector by that matrix. We denote the n-dimensional identity matrix as $I_n$. Formally,
$$\forall x \in \mathbb{R}^n, \; I_n x = x.$$
The structure of the identity matrix is simple: all of the entries along the main diagonal
(where the row coordinate is the same as the column coordinate) are 1, while all of the
other entries are zero. See Fig. 2.2 for an example.
The matrix inverse of $A$ is denoted as $A^{-1}$, and it is defined as the matrix such that
$$A^{-1} A = I_n.$$
We can now solve equation 2.1 by the following steps:
$$Ax = b$$
$$A^{-1} A x = A^{-1} b$$
$$I_n x = A^{-1} b$$
$$x = A^{-1} b.$$
Of course, this depends on it being possible to find $A^{-1}$. We discuss the conditions
for the existence of $A^{-1}$ in the following section.
When $A^{-1}$ exists, several different algorithms exist for finding it in closed form. In
theory, the same inverse matrix can then be used to solve the equation many times for
different values of $b$. However, $A^{-1}$ is primarily useful as a theoretical tool, and should
not actually be used in practice for most software applications. Because $A^{-1}$ can only
be represented with limited precision on a digital computer, algorithms that make use
of the value of $b$ can usually obtain more accurate estimates of $x$.
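This advice can be illustrated with a sketch (our own, assuming NumPy) contrasting an explicit inverse with a dedicated solver; np.linalg.solve works from a factorization of $A$ and the given $b$ rather than forming $A^{-1}$, which is generally the more accurate and efficient route.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
b = rng.standard_normal(4)

x_inv = np.linalg.inv(A) @ b     # solve via an explicit inverse (discouraged in practice)
x_solve = np.linalg.solve(A, b)  # solve directly from A and b

print(np.allclose(A @ x_inv, b))
print(np.allclose(A @ x_solve, b))
```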
2.4 Linear dependence, span, and rank
In order for $A^{-1}$ to exist, equation 2.1 must have exactly one solution for every value
of $b$. However, it is also possible for the system of equations to have no solutions or
infinitely many solutions for some values of $b$. It is not possible to have more than one
but less than infinitely many solutions for a particular $b$; if both $x$ and $y$ are solutions
then
$$z = \alpha x + (1 - \alpha) y$$
is also a solution for any real $\alpha$.
To analyze how many solutions the equation has, we can think of the columns of A
as specifying different directions we can travel from the origin (the point specified by
the vector of all zeros), and determine how many ways there are of reaching b. In this
view, each element of x specifies how far we should travel in each of these directions,
i.e. $x_i$ specifies how far to move in the direction of column $i$:
$$Ax = \sum_i x_i A_{:,i}.$$
In general, this kind of operation is called a linear combination. Formally, a linear
combination of some set of vectors $\{v^{(1)}, \ldots, v^{(n)}\}$ is given by multiplying each vector
$v^{(i)}$ by a corresponding scalar coefficient and adding the results:
$$\sum_i c_i v^{(i)}.$$
The span of a set of vectors is the set of all points obtainable by linear combination of
the original vectors.
Determining whether Ax = b has a solution thus amounts to testing whether b is
in the span of the columns of A. This particular span is known as the column space or
the range of A.
In order for the system $Ax = b$ to have a solution for all values of $b \in \mathbb{R}^m$, we
therefore require that the column space of $A$ be all of $\mathbb{R}^m$. If any point in $\mathbb{R}^m$ is excluded
from the column space, that point is a potential value of $b$ that has no solution. This
implies immediately that $A$ must have at least $m$ columns, i.e., $n \geq m$. Otherwise, the
dimensionality of the column space must be less than $m$. For example, consider a $3 \times 2$
matrix. The target $b$ is 3-D, but $x$ is only 2-D, so modifying the value of $x$ at best
allows us to trace out a 2-D plane within $\mathbb{R}^3$. The equation has a solution if and only if
$b$ lies on that plane.
Having $n \geq m$ is only a necessary condition for every point to have a solution. It is
not a sufficient condition, because it is possible for some of the columns to be redundant.
Consider a $2 \times 2$ matrix where both of the columns are equal to each other. This has the
same column space as a $2 \times 1$ matrix containing only one copy of the replicated column.
In other words, the column space is still just a line, and fails to encompass all of $\mathbb{R}^2$,
even though there are two columns.
Formally, this kind of redundancy is known as linear dependence. A set of vectors is
linearly independent if no vector in the set is a linear combination of the other vectors.
If we add a vector to a set that is a linear combination of the other vectors in the set, the
new vector does not add any points to the set’s span. This means that for the column
space of the matrix to encompass all of $\mathbb{R}^m$, the matrix must have at least $m$ linearly
independent columns. This condition is both necessary and sufficient for equation 2.1
to have a solution for every value of b.
In order for the matrix to have an inverse, we additionally need to ensure that
equation 2.1 has at most one solution for each value of b. To do so, we need to ensure
that the matrix has at most m columns. Otherwise there is more than one way of
parameterizing each solution.
Together, this means that the matrix must be square, that is, we require that m = n,
and that all of the columns must be linearly independent. A square matrix with linearly
dependent columns is known as singular.
If A is not square or is square but singular, it can still be possible to solve the
equation. However, we can not use the method of matrix inversion to find the solution.
So far we have discussed matrix inverses as being multiplied on the left. It is also
possible to define an inverse that is multiplied on the right:
$$A A^{-1} = I.$$
For square matrices, the left inverse and right inverse are the same.
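A sketch of linear dependence in practice (our own NumPy illustration): a square matrix with a repeated column is singular, which both np.linalg.matrix_rank and the failure of np.linalg.inv reveal.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [2.0, 2.0]])     # second column equals the first: linearly dependent

print(np.linalg.matrix_rank(A))   # 1, not 2: the column space is only a line
try:
    np.linalg.inv(A)
except np.linalg.LinAlgError as e:
    print("cannot invert:", e)    # singular matrices have no inverse
```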
2.5 Norms
Sometimes we need to measure the size of a vector. In machine learning, we usually
measure the size of vectors using an $l^p$ norm:
$$||x||_p = \left( \sum_i |x_i|^p \right)^{\frac{1}{p}}$$
for $p \in \mathbb{R}$, $p \geq 1$.
Norms, including the $l^p$ norm, are functions mapping vectors to non-negative values,
satisfying these properties that make them behave like distances between points:
• $f(x) = 0 \Rightarrow x = 0$
• $f(x + y) \leq f(x) + f(y)$ (the triangle inequality)
• $\forall \alpha \in \mathbb{R}, \; f(\alpha x) = |\alpha| f(x)$
The $l^2$ norm, with $p = 2$, is known as the Euclidean norm. It is simply the Euclidean
distance from the origin to the point identified by $x$. This is probably the most common
norm used in machine learning. It is also common to measure the size of a vector using
the squared $l^2$ norm, which can be calculated simply as $x^\top x$.
The squared $l^2$ norm is more convenient to work with mathematically and computationally
than the $l^2$ norm itself. For example, the derivatives of the squared $l^2$ norm with
respect to each element of $x$ each depend only on the corresponding element of $x$, while
all of the derivatives of the $l^2$ norm depend on the entire vector. In many contexts, the
squared $l^2$ norm may be undesirable because it increases very slowly near the origin. In several
machine learning applications, it is important to discriminate between elements that are
exactly zero and elements that are small but nonzero. In these cases, we turn to a function
that grows at the same rate in all locations, but retains mathematical simplicity:
the $l^1$ norm. The $l^1$ norm may be simplified to
$$||x||_1 = \sum_i |x_i|.$$
The $l^1$ norm is commonly used in machine learning when the difference between zero
and nonzero elements is very important. Every time an element of $x$ moves away from
0 by $\epsilon$, the $l^1$ norm increases by $\epsilon$.
We sometimes measure the size of the vector by counting its number of nonzero
elements (and when we use the $l^1$ norm, we often use it as a proxy for this function).
Some authors refer to this function as the “$l^0$ norm,” but this is incorrect terminology,
because scaling the vector by $\alpha$ does not change the number of nonzero entries.
One other norm that commonly arises in machine learning is the $l^\infty$ norm, also
known as the max norm. This norm simplifies to
$$||x||_\infty = \max_i |x_i|,$$
i.e., the absolute value of the element with the largest magnitude in the vector.
Sometimes we may also wish to measure the size of a matrix. In the context of deep
learning, the most common way to do this is with the otherwise obscure Frobenius norm
$$||A||_F = \sqrt{\sum_{i,j} a_{i,j}^2},$$
which is analogous to the $l^2$ norm of a vector.
The dot product of two vectors can be rewritten in terms of norms. Specifically,
$$x^\top y = ||x||_2 \, ||y||_2 \cos\theta,$$
where $\theta$ is the angle between $x$ and $y$.
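The norms above map directly onto np.linalg.norm; a minimal sketch (our own, assuming NumPy):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(np.linalg.norm(x, 1))        # l1 norm: 7.0
print(np.linalg.norm(x, 2))        # l2 (Euclidean) norm: 5.0
print(x @ x)                       # squared l2 norm: 25.0
print(np.linalg.norm(x, np.inf))   # max norm: 4.0
print(np.linalg.norm(A, 'fro'))    # Frobenius norm of a matrix
```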
2.6 Special kinds of matrices and vectors
Some special kinds of matrices and vectors are particularly useful.
Diagonal matrices only have non-zero entries along the main diagonal. Formally, a
matrix $D$ is diagonal if and only if $d_{i,j} = 0$ for all $i \neq j$. We’ve already seen one example
of a diagonal matrix: the identity matrix, where all of the diagonal entries are 1. In this
book², we write $\text{diag}(v)$ to denote a square diagonal matrix whose diagonal entries are
given by the entries of the vector $v$. Diagonal matrices are of interest in part because
multiplying by a diagonal matrix is very computationally efficient. To compute $\text{diag}(v)\,x$,
we only need to scale each element $x_i$ by $v_i$. In other words, $\text{diag}(v)\,x = v \odot x$. Inverting
a diagonal matrix is also efficient. The inverse exists only if every diagonal entry is
nonzero, and in that case, $\text{diag}(v)^{-1} = \text{diag}([1/v_1, \ldots, 1/v_n]^\top)$. In many cases, we may
derive some very general machine learning algorithm in terms of arbitrary matrices, but
obtain a less expensive (and less descriptive) algorithm by restricting some matrices to
be diagonal.
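A sketch of the efficiency claim (our own NumPy illustration): multiplying by $\text{diag}(v)$ is just element-wise scaling, and its inverse just inverts the diagonal entries.

```python
import numpy as np

v = np.array([2.0, 3.0, 5.0])
x = np.array([1.0, 1.0, 1.0])

D = np.diag(v)                       # a dense diagonal matrix, built only for comparison
print(np.allclose(D @ x, v * x))     # diag(v) x equals the element-wise product v * x
print(np.allclose(np.linalg.inv(D), np.diag(1.0 / v)))   # diag(v)^{-1} = diag(1/v)
```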
A symmetric matrix is any matrix that is equal to its own transpose:
$$A = A^\top.$$
Symmetric matrices often arise when the entries are generated by some function of two
arguments that does not depend on the order of the arguments. For example, if A is a
matrix of distance measurements, with $a_{i,j}$ giving the distance from point $i$ to point $j$,
then $a_{i,j} = a_{j,i}$ because distance functions are symmetric.
A unit vector is a vector with unit norm:
$$||x||_2 = 1.$$
A vector $x$ and a vector $y$ are orthogonal to each other if $x^\top y = 0$. If both vectors
have nonzero norm, this means that they are at 90 degree angles to each other. In $\mathbb{R}^n$,
at most $n$ vectors may be mutually orthogonal with nonzero norm. If the vectors are
not only orthogonal but also have unit norm, we call them orthonormal.
An orthogonal matrix is a square matrix whose rows are mutually orthonormal and
whose columns are mutually orthonormal:
$$A^\top A = A A^\top = I.$$
This implies that
$$A^{-1} = A^\top,$$
so orthogonal matrices are of interest because their inverse is very cheap to compute.
Counterintuitively, there is no special term for a matrix whose rows or columns are
orthogonal but not orthonormal.
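As a sketch (our own, assuming NumPy): the Q factor of a QR decomposition is an orthogonal matrix, so its transpose acts as its inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # Q is orthogonal

print(np.allclose(Q.T @ Q, np.eye(3)))
print(np.allclose(Q @ Q.T, np.eye(3)))
print(np.allclose(np.linalg.inv(Q), Q.T))          # the inverse is just the transpose
```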
2.7 Eigendecomposition
An eigenvector of a square matrix A is a non-zero vector v such that multiplication by
A alters only the scale of v:
Av = λv.
The scalar λ is known as the eigenvalue corresponding to this eigenvector. (One can
also find a left eigenvector such that $v^\top A = \lambda v^\top$, but we are usually concerned with right
eigenvectors.)
² There is not a standardized notation for constructing a diagonal matrix from a vector.
Figure 2.3: An example of the effect of eigenvectors and eigenvalues. Here, we have
a matrix $A$ with two orthonormal eigenvectors, $v^{(1)}$ with eigenvalue $\lambda_1$ and $v^{(2)}$ with
eigenvalue $\lambda_2$. Left) We plot the set of all unit vectors $u \in \mathbb{R}^2$ as a blue circle. Right)
We plot the set of all points $Au$. By observing the way that $A$ distorts the unit circle,
we can see that it scales space in direction $v^{(i)}$ by $\lambda_i$.
Note that if $v$ is an eigenvector of $A$, then so is any rescaled vector $sv$ for $s \in \mathbb{R}$, $s \neq 0$.
Moreover, $sv$ still has the same eigenvalue. For this reason, we usually only look for
unit eigenvectors.
We can construct a matrix $A$ with eigenvectors $\{v^{(1)}, \ldots, v^{(n)}\}$ and corresponding
eigenvalues $\{\lambda_1, \ldots, \lambda_n\}$ by concatenating the eigenvectors into a matrix $V$, one column
per eigenvector, and concatenating the eigenvalues into a vector $\lambda$. Then the matrix
$$A = V \,\text{diag}(\lambda)\, V^{-1}$$
has the desired eigenvalues and eigenvectors. If we make $V$ an orthogonal matrix, then
we can think of $A$ as scaling space by $\lambda_i$ in direction $v^{(i)}$. See Fig. 2.3 for an example.
We have seen that constructing matrices with specific eigenvalues and eigenvectors
allows us to stretch space in desired directions. However, we often want to decompose
matrices into their eigenvalues and eigenvectors. Doing so can help us to analyze certain
properties of the matrix, much as decomposing an integer into its prime factors can help
us understand the behavior of that integer.
Not every matrix can be decomposed into eigenvalues and eigenvectors. In some
cases, the decomposition exists, but may involve complex rather than real numbers.
Fortunately, in this book, we usually need to decompose only a specific class of matrices
that have a simple decomposition. Specifically, every real symmetric matrix can be
decomposed into an expression using only real-valued eigenvectors and eigenvalues:
$$A = Q \Lambda Q^\top,$$
where $Q$ is an orthogonal matrix composed of eigenvectors of $A$, and $\Lambda$ is a diagonal
matrix, with $\Lambda_{i,i}$ being the eigenvalue corresponding to $Q_{:,i}$.
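A sketch of this decomposition for a real symmetric matrix (our own example, assuming NumPy), using np.linalg.eigh, which exploits symmetry and returns real eigenvalues along with an orthogonal matrix of eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
A = B + B.T                          # a real symmetric matrix

eigvals, Q = np.linalg.eigh(A)       # columns of Q are orthonormal eigenvectors
Lam = np.diag(eigvals)

print(np.allclose(A, Q @ Lam @ Q.T))        # A = Q Lambda Q^T
print(np.allclose(Q.T @ Q, np.eye(3)))      # Q is orthogonal
```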
While any real symmetric matrix A is guaranteed to have an eigendecomposition,
the eigendecomposition is not unique. If any two or more eigenvectors share the same
eigenvalue, then any set of orthogonal vectors lying in their span are also eigenvectors
with that eigenvalue, and we could equivalently choose a Q using those eigenvectors
instead. By convention, we usually sort the entries of Λ in descending order. Under this
convention, the eigendecomposition is unique only if all of the eigenvalues are unique.
The eigendecomposition of a matrix tells us many useful facts about the matrix. The
matrix is singular if and only if any of the eigenvalues are 0. The eigendecomposition
can also be used to optimize quadratic expressions of the form $f(x) = x^\top A x$ subject
to $||x||_2 = 1$. Whenever $x$ is equal to an eigenvector of $A$, $f$ takes on the value of
the corresponding eigenvalue. The maximum value of $f$ within the constraint region
is the maximum eigenvalue and its minimum value within the constraint region is the
minimum eigenvalue.
A matrix whose eigenvalues are all positive is called positive definite. A matrix whose
eigenvalues are all positive or zero-valued is called positive semidefinite. Likewise, if all
eigenvalues are negative, the matrix is negative definite, and if all eigenvalues are negative
or zero-valued, it is negative semidefinite. Positive semidefinite matrices are interesting
because they guarantee that $\forall x, \; x^\top A x \geq 0$. Positive definite matrices additionally
guarantee that $x^\top A x = 0 \Rightarrow x = 0$.
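A sketch checking these facts numerically (our own construction, assuming NumPy): for a unit vector $x$, $x^\top A x$ lies between the extreme eigenvalues of a symmetric $A$, and a Gram matrix $B^\top B$ is positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
A = B + B.T                                   # symmetric matrix
eigvals = np.linalg.eigvalsh(A)

x = rng.standard_normal(3)
x /= np.linalg.norm(x)                        # a random unit vector
q = x @ A @ x
print(eigvals.min() <= q <= eigvals.max())    # True

G = B.T @ B                                   # a positive semidefinite Gram matrix
print(np.all(np.linalg.eigvalsh(G) >= -1e-10))  # eigenvalues nonnegative, up to rounding
```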
2.8 The trace operator
The trace operator gives the sum of all of the diagonal entries of a matrix:
$$\text{Tr}(A) = \sum_i a_{i,i}.$$
The trace operator is useful for a variety of reasons. Some operations that are
difficult to specify without resorting to summation notation can be specified using matrix
products and the trace operator. For example, the trace operator provides an alternative
way of writing the Frobenius norm of a matrix:
$$||A||_F = \sqrt{\text{Tr}(A A^\top)}.$$
The trace operator also has many useful properties that make it easy to manipulate
expressions involving the trace operator. For example, the trace operator is invariant to
the transpose operator:
the transpose operator:
$$\text{Tr}(A) = \text{Tr}(A^\top).$$
The trace of a square matrix composed of many factors is also invariant to moving
the last factor into the first position:
Tr(ABC) = Tr(BCA) = Tr(CAB)
or more generally,
$$\text{Tr}\left( \prod_{i=1}^{n} F^{(i)} \right) = \text{Tr}\left( F^{(n)} \prod_{i=1}^{n-1} F^{(i)} \right).$$
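A sketch of these trace identities (our own NumPy illustration with arbitrary shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 2))

print(np.isclose(np.linalg.norm(A, 'fro'),
                 np.sqrt(np.trace(A @ A.T))))                 # Frobenius norm via trace

S = rng.standard_normal((3, 3))
print(np.isclose(np.trace(S), np.trace(S.T)))                 # invariance to transpose

print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))   # cyclic property
print(np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A)))
```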
2.9 Determinant
The determinant of a square matrix, denoted det(A), is a function mapping matrices
to real scalars. The determinant is equal to the product of all the matrix’s eigenvalues.
The absolute value of the determinant can be thought of as a measure of how much
multiplication by the matrix expands or contracts space. If the determinant is 0, then
space is contracted completely along at least one dimension, causing it to lose all of its
volume. If the determinant is 1, then the transformation is volume-preserving.
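A sketch verifying that the determinant equals the product of the eigenvalues, and that a rotation is volume-preserving (our own examples, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))

eigvals = np.linalg.eigvals(A)                     # possibly complex in general
print(np.isclose(np.linalg.det(A), np.prod(eigvals).real))

R = np.array([[0.0, -1.0],
              [1.0,  0.0]])                        # a 90-degree rotation
print(np.isclose(np.linalg.det(R), 1.0))           # determinant 1: volume-preserving
```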
2.10 Example: Principal components analysis
One simple machine learning algorithm, principal components analysis (PCA), can be
derived using only knowledge of basic linear algebra.
Suppose we have a collection of $m$ points $\{x^{(1)}, \ldots, x^{(m)}\}$ in $\mathbb{R}^n$. Suppose we would
like to apply lossy compression to these points, i.e. we would like to find a way of storing
the points that requires less memory but may lose some precision. We would like to lose
as little precision as possible.
One way we can encode these points is to represent a lower-dimensional version of
them. For each point $x^{(i)} \in \mathbb{R}^n$ we will find a corresponding code vector $c^{(i)} \in \mathbb{R}^l$. If $l$ is
smaller than $n$, it will take less memory to store the code points than the original data.
We can use matrix multiplication to map the code back into $\mathbb{R}^n$. Let the reconstruction
$r(c) = Dc$, where $D \in \mathbb{R}^{n \times l}$ is the matrix defining the decoding.
To simplify the computation of the optimal encoding, we constrain the columns of
$D$ to be orthogonal to each other. (Note that $D$ is still not technically “an orthogonal
matrix” unless $l = n$.)
With the problem as described so far, many solutions are possible, because we can
increase the scale of $D_{:,i}$ if we decrease $c_i$ proportionally for all points. To give the
problem a unique solution, we constrain all of the columns of $D$ to have unit norm.
In order to turn this basic idea into an algorithm we can implement, the first thing
we need to do is figure out how to generate the optimal code point $c^*$ for each input
point $x$. One way to do this is to minimize the distance between the input point $x$ and
its reconstruction, $r(c)$. We can measure this distance using a norm. In the principal
components algorithm, we use the $l^2$ norm:
$$c^* = \arg\min_c ||x - r(c)||_2.$$
We can switch to the squared $l^2$ norm instead of the $l^2$ norm itself, because both are
minimized by the same value of $c$. This is because the $l^2$ norm is non-negative and the
squaring operation is monotonically increasing for non-negative arguments.
$$c^* = \arg\min_c ||x - r(c)||_2^2$$
The function being minimized simplifies to
$$(x - r(c))^\top (x - r(c))$$
(by the definition of the $l^2$ norm)
$$= x^\top x - x^\top r(c) - r(c)^\top x + r(c)^\top r(c)$$
(by the distributive property)
$$= x^\top x - 2 x^\top r(c) + r(c)^\top r(c)$$
(because a scalar is equal to the transpose of itself).
We can now change the function being minimized again, to omit the first term, since
this term does not depend on $c$:
$$c^* = \arg\min_c -2 x^\top r(c) + r(c)^\top r(c).$$
To make further progress, we must substitute in the definition of $r(c)$:
$$c^* = \arg\min_c -2 x^\top D c + c^\top D^\top D c$$
$$= \arg\min_c -2 x^\top D c + c^\top I_l c$$
(by the orthogonality and unit norm constraints on $D$)
$$= \arg\min_c -2 x^\top D c + c^\top c$$
We can solve this optimization problem using vector calculus (see section 4.3 if you
do not know how to do this):
$$\nabla_c \left( -2 x^\top D c + c^\top c \right) = 0$$
$$-2 D^\top x + 2c = 0$$
$$c = D^\top x. \quad (2.2)$$
This is good news: we can optimally encode $x$ using just a matrix-vector product operation.
Next, we need to choose the encoding matrix $D$. To do so, we revisit the idea of
minimizing the $l^2$ distance between inputs and reconstructions. However, since we will
use the same matrix $D$ to decode all of the points, we can no longer consider the points
in isolation. Instead, we must minimize the $l^2$ distance for all the points simultaneously:
$$D^* = \arg\min_D \sum_{i,j} \left( x^{(i)}_j - r^{(i)}_j \right)^2 \text{ subject to } D^\top D = I_l \quad (2.3)$$
where $r^{(i)}$ denotes the reconstruction of $x^{(i)}$.
To derive the algorithm for finding $D^*$, we will start by considering the case where $l =
1$. In this case, $D$ is just a single vector, $d$. Substituting equation 2.2 into equation 2.3
and simplifying $D$ into $d$, the problem reduces to
$$d^* = \arg\min_d \sum_i ||x^{(i)} - d d^\top x^{(i)}||_2^2 \text{ subject to } ||d||_2 = 1.$$
At this point, it can be helpful to rewrite the problem in terms of matrices. This
will allow us to use more compact notation. Let $X \in \mathbb{R}^{m \times n}$ be the matrix defined by
stacking all of the vectors describing the points, such that $X_{i,:} = x^{(i)\top}$. We can now
rewrite the problem as
$$d^* = \arg\min_d ||X - X d d^\top||_F^2 \text{ subject to } d^\top d = 1$$
$$= \arg\min_d \text{Tr}\left( \left( X - X d d^\top \right)^\top \left( X - X d d^\top \right) \right) \text{ subject to } d^\top d = 1$$
(by the alternate definition of the Frobenius norm)
$$= \arg\min_d \text{Tr}(X^\top X - X^\top X d d^\top - d d^\top X^\top X + d d^\top X^\top X d d^\top) \text{ subject to } d^\top d = 1$$
$$= \arg\min_d \text{Tr}(X^\top X) - \text{Tr}(X^\top X d d^\top) - \text{Tr}(d d^\top X^\top X) + \text{Tr}(d d^\top X^\top X d d^\top) \text{ subject to } d^\top d = 1$$
$$= \arg\min_d -\text{Tr}(X^\top X d d^\top) - \text{Tr}(d d^\top X^\top X) + \text{Tr}(d d^\top X^\top X d d^\top) \text{ subject to } d^\top d = 1$$
(because terms not involving $d$ do not affect the $\arg\min$)
$$= \arg\min_d -2\,\text{Tr}(X^\top X d d^\top) + \text{Tr}(d d^\top X^\top X d d^\top) \text{ subject to } d^\top d = 1$$
(because the trace is invariant to transpose)
$$= \arg\min_d -2\,\text{Tr}(X^\top X d d^\top) + \text{Tr}(X^\top X d d^\top d d^\top) \text{ subject to } d^\top d = 1$$
(because we can cycle the order of the matrices inside a trace)
$$= \arg\min_d -2\,\text{Tr}(X^\top X d d^\top) + \text{Tr}(X^\top X d d^\top) \text{ subject to } d^\top d = 1$$
(due to the constraint)
$$= \arg\min_d -\text{Tr}(X^\top X d d^\top) \text{ subject to } d^\top d = 1$$
$$= \arg\max_d \text{Tr}(X^\top X d d^\top) \text{ subject to } d^\top d = 1$$
$$= \arg\max_d \text{Tr}(d^\top X^\top X d) \text{ subject to } d^\top d = 1$$
This optimization problem may be solved using eigendecomposition. Specifically, the
optimal $d$ is given by the eigenvector of $X^\top X$ corresponding to the largest eigenvalue.
In the general case, where $l > 1$, $D$ is given by the $l$ eigenvectors corresponding to
the largest eigenvalues. This may be shown using proof by induction. We recommend
writing this proof as an exercise.
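The derivation above translates into a few lines of NumPy. This is a sketch under our own assumptions (the data are already centered, and the $l$ principal directions are taken from the eigenvectors of $X^\top X$ with the largest eigenvalues); the function name and setup are ours, not code from the text.

```python
import numpy as np

def pca_encode_decode(X, l):
    """PCA as derived above: D holds the top-l eigenvectors of X^T X."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # ascending eigenvalues, orthonormal eigenvectors
    D = eigvecs[:, -l:]                          # columns: eigenvectors with the largest eigenvalues
    C = X @ D                                    # codes: each row is c = D^T x for one data point
    X_rec = C @ D.T                              # reconstructions: r(c) = D c, one per row
    return D, C, X_rec

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X = X - X.mean(axis=0)                           # assume centered data
D, C, X_rec = pca_encode_decode(X, l=2)

print(np.allclose(D.T @ D, np.eye(2)))           # columns of D are orthonormal
print(np.mean((X - X_rec) ** 2))                 # average reconstruction error
```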