Chapter 2
Linear algebra
Linear algebra is a branch of mathematics that is widely used throughout science and
engineering. However, because linear algebra is a form of continuous rather than dis-
crete mathematics, many computer scientists have little experience with it. A good
understanding of linear algebra is essential for understanding deep learning and working
with deep learning algorithms. We therefore begin the technical content of the book
with a focused presentation of the key linear algebra ideas that are most important in
deep learning.
If you are already familiar with linear algebra, feel free to skip this chapter. If
you have previous experience with these concepts but need a detailed reference sheet
to review key formulas, we recommend The Matrix Cookbook (Petersen and Pedersen,
2006). If you have no exposure at all to linear algebra, this chapter will teach you
enough to get by, but we highly recommend that you also consult another resource
focused exclusively on teaching linear algebra, such as Shilov (1977).
2.1 Scalars, vectors, matrices and tensors
The study of linear algebra involves several types of mathematical objects:
Scalars: A scalar is just a single number, in contrast to most of the other objects
studied in linear algebra, which are usually arrays of multiple numbers. We write
scalars in italics. We usually give scalars lower-case variable names. When we
introduce them, we specify what kind of number they are. For example, we might
say “Let $s \in \mathbb{R}$ be the slope of the line,” while defining a real-valued scalar, or
“Let $n \in \mathbb{N}$ be the number of units,” while defining a natural number scalar.
Vectors: A vector is an array of numbers. The numbers have an order to them, and
we can identify each individual number by its index in that ordering. Typically we
give vectors lower case names written in bold typeface, such as x. The elements
of the vector are identified by writing its name in italic typeface, with a subscript.
The first element of $x$ is $x_1$, the second element is $x_2$, and so on. We also need
to say what kind of numbers are stored in the vector. If each element is in $\mathbb{R}$,
and the vector has $n$ elements, then the vector lies in the set formed by taking
the Cartesian product of $\mathbb{R}$ $n$ times, denoted as $\mathbb{R}^n$. When we need to explicitly
identify the elements of a vector, we write them as a column enclosed in square
brackets:
$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.$$
We can think of vectors as identifying points in space, with each element giving
the coordinate along a different axis.
Sometimes we need to index a set of elements of a vector. In this case, we define a
set containing the indices, and write the set as a subscript. For example, to access
$x_1$, $x_3$, and $x_6$, we define the set $S = \{1, 3, 6\}$ and write $x_S$. We use the $-$ sign
to index the complement of a set. For example, $x_{-1}$ is the vector containing all
elements of $x$ except for $x_1$, and $x_{-S}$ is the vector containing all of the elements
of $x$ except for $x_1$, $x_3$, and $x_6$.
Matrices: A matrix is a 2-D array of numbers, so each element is identified by two
indices instead of just one. We usually give matrices upper-case variable names
with bold typeface, such as A. If a real-valued matrix A has a height of m and
a width of $n$, then we say that $A \in \mathbb{R}^{m \times n}$. We usually identify the elements of a
matrix with its lower-case name, and the indices are listed with separating commas.
For example, $a_{1,1}$ is the upper left entry of $A$ and $a_{m,n}$ is the bottom right entry.
We can identify all of the numbers with vertical coordinate $i$ by writing a “:” for
the horizontal coordinate. For example, $A_{i,:}$ denotes the horizontal cross section
of $A$ with vertical coordinate $i$. This is known as the $i$-th row of $A$. Likewise, $A_{:,i}$
is the i-th column of A. When we need to explicitly identify the elements of
a matrix, we write them as an array enclosed in square brackets:
$$\begin{bmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{bmatrix}.$$
Sometimes we may need to index matrix-valued expressions that are not just a
single letter. In this case, we use subscripts after the expression, but do not
convert anything to lower case. For example, $f(A)_{i,j}$ gives element $(i, j)$ of the
matrix computed by applying the function $f$ to $A$.
Tensors: In some cases we will need an array with more than two axes. In the
general case, an array of numbers arranged on a regular grid with a variable number
of axes is known as a tensor.
One important operation on matrices is the transpose. The transpose of a matrix is
the mirror image of the matrix across a diagonal line running down and to the right,
starting from its upper left corner. See Fig. 2.1 for a graphical depiction of this operation.
We denote the transpose of a matrix $A$ as $A^\top$, and it is defined such that
$$(A^\top)_{i,j} = a_{j,i}.$$

Figure 2.1: The transpose of the matrix can be thought of as a mirror image across the main diagonal:
$$A = \begin{bmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \\ a_{3,1} & a_{3,2} \end{bmatrix} \;\Rightarrow\; A^\top = \begin{bmatrix} a_{1,1} & a_{2,1} & a_{3,1} \\ a_{1,2} & a_{2,2} & a_{3,2} \end{bmatrix}$$
Vectors can be thought of as matrices that contain only one column. The transpose
of a vector is therefore a matrix with only one row. Sometimes we define a vector by
writing out its elements in the text inline as a row matrix, then using the transpose
operator to turn it into a standard column vector, e.g. $x = [x_1, x_2, x_3]^\top$.
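As a concrete illustration of these conventions, here is a minimal sketch in NumPy (our own choice of library; the text itself does not prescribe one) showing a matrix, a vector, and their transposes.

```python
import numpy as np

# A 3x2 matrix A and a vector x with 3 elements.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
x = np.array([1.0, 2.0, 3.0])

print(A.T)                     # transpose: shape (2, 3), (A.T)[i, j] == A[j, i]
print(A.T[0, 2] == A[2, 0])    # True
print(x.reshape(-1, 1).T)      # an explicit row vector of shape (1, 3)
```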
2.2 Multiplying matrices and vectors
One of the most important operations involving matrices is multiplication of two ma-
trices. The matrix product of matrices A and B is a third matrix C. In order for this
product to be defined, A must have the same number of columns as B has rows. If A
is of shape m ×n and B is of shape n ×p, then C is of shape m ×p. We can write the
matrix product just by placing two or more matrices together, e.g.
C = AB.
The product operation is defined by
$$c_{i,j} = \sum_k a_{i,k} b_{k,j}.$$
Note that the standard product of two matrices is not just a matrix containing the
product of the individual elements. Such an operation exists and is called the element-wise
product or Hadamard product, and is denoted in this book¹ as $A \odot B$.
The dot product between two vectors $x$ and $y$ of the same dimensionality is the
matrix product $x^\top y$. We can think of the matrix product $C = AB$ as computing $c_{i,j}$
as the dot product between row $i$ of $A$ and column $j$ of $B$.
¹ The element-wise product is used relatively rarely, so the notation for it is not as standardized as
the other operations described in this chapter.
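To make the distinction between these products concrete, the following sketch (our own illustration, not part of the text, assuming NumPy) computes the matrix product, the element-wise (Hadamard) product, and the dot product.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

C = A @ B          # matrix product: C[i, j] = sum_k A[i, k] * B[k, j]
H = A * B          # element-wise (Hadamard) product

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])
d = x @ y          # dot product x^T y, a scalar

# Entry (i, j) of the matrix product is row i of A dotted with column j of B.
print(np.isclose(C[0, 1], A[0, :] @ B[:, 1]))   # True
```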
Matrix product operations have many useful properties that make mathematical
analysis of matrices more convenient. For example, matrix multiplication is distributive:
A(B + C) = AB + AC.
It is also associative:
A(BC) = (AB)C.
Matrix multiplication is not commutative, unlike scalar multiplication.
The transpose of a matrix product also has a simple form:
$$(AB)^\top = B^\top A^\top.$$
Since the focus of this textbook is not linear algebra, we do not attempt to develop a
comprehensive list of useful properties of the matrix product here, but the reader should
be aware that many more exist.
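A quick numerical check of these properties, as a sketch with arbitrary random matrices of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((3, 4))
D = rng.standard_normal((4, 2))

print(np.allclose(A @ (B + C), A @ B + A @ C))   # distributive
print(np.allclose(A @ (B @ D), (A @ B) @ D))     # associative
print(np.allclose((A @ B).T, B.T @ A.T))         # (AB)^T = B^T A^T

S = rng.standard_normal((3, 3))
T = rng.standard_normal((3, 3))
print(np.allclose(S @ T, T @ S))                 # almost surely False: not commutative
```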
We can also multiply matrices and vectors by scalars. To multiply by a scalar, we
just multiply every element of the matrix or vector by that scalar:
$$sA = \begin{bmatrix} s a_{1,1} & \cdots & s a_{1,n} \\ \vdots & \ddots & \vdots \\ s a_{m,1} & \cdots & s a_{m,n} \end{bmatrix}$$
We now know enough linear algebra notation to write down a system of linear equa-
tions:
Ax = b (2.1)
where $A \in \mathbb{R}^{m \times n}$ is a known matrix, $b \in \mathbb{R}^m$ is a known vector, and $x \in \mathbb{R}^n$ is a
vector of unknown variables we would like to solve for. Each element $x_i$ of $x$ is one
of these unknowns. Each row of $A$ and each element of $b$ provide another
constraint. We can rewrite equation 2.1 as:
$$A_{1,:} x = b_1$$
$$A_{2,:} x = b_2$$
$$\cdots$$
$$A_{m,:} x = b_m$$
or, even more explicitly, as:
$$a_{1,1} x_1 + a_{1,2} x_2 + \cdots + a_{1,n} x_n = b_1$$
$$a_{2,1} x_1 + a_{2,2} x_2 + \cdots + a_{2,n} x_n = b_2$$
$$\cdots$$
$$a_{m,1} x_1 + a_{m,2} x_2 + \cdots + a_{m,n} x_n = b_m.$$
Matrix-vector product notation provides a more compact representation for equations
of this form.
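To see the equivalence between the compact notation and the expanded equations, here is a small sketch (with made-up numbers, assuming NumPy) checking that each entry of $Ax$ is the dot product of one row of $A$ with $x$:

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0]])      # m = 2 constraints, n = 3 unknowns
x = np.array([1.0, 2.0, 3.0])
b = A @ x                            # b_i = A[i, :] @ x

for i in range(A.shape[0]):
    print(np.isclose(A[i, :] @ x, b[i]))   # each row gives one scalar equation
```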
$$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
Figure 2.2: Example identity matrix: This is $I_3$.
2.3 Identity and inverse matrices
Linear algebra offers a powerful tool called matrix inversion that allows us to solve
equation 2.1 for many values of A.
To describe matrix inversion, we first need to define the concept of an identity matrix.
An identity matrix is a matrix that does not change any vector when we multiply that
vector by that matrix. We denote the n-dimensional identity matrix as $I_n$. Formally,
$$\forall x \in \mathbb{R}^n, \; I_n x = x.$$
The structure of the identity matrix is simple: all of the entries along the main diagonal
(where the row coordinate is the same as the column coordinate) are 1, while all of the
other entries are zero. See Fig. 2.2 for an example.
The matrix inverse of $A$ is denoted as $A^{-1}$, and it is defined as the matrix such that
$$A^{-1} A = I_n.$$
We can now solve equation 2.1 by the following steps:
$$Ax = b$$
$$A^{-1} A x = A^{-1} b$$
$$I_n x = A^{-1} b$$
$$x = A^{-1} b.$$
Of course, this depends on it being possible to find $A^{-1}$. We discuss the conditions
for the existence of $A^{-1}$ in the following section.
When $A^{-1}$ exists, several different algorithms exist for finding it in closed form. In
theory, the same inverse matrix can then be used to solve the equation many times for
different values of $b$. However, $A^{-1}$ is primarily useful as a theoretical tool, and should
not actually be used in practice for most software applications. Because $A^{-1}$ can only
be represented with limited precision on a digital computer, algorithms that make use
of the value of $b$ can usually obtain more accurate estimates of $x$.
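This advice can be illustrated with a sketch (our own, assuming NumPy) contrasting an explicit inverse with a dedicated solver; np.linalg.solve works from a factorization of $A$ and the given $b$ rather than forming $A^{-1}$, which is generally the more accurate and efficient route.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
b = rng.standard_normal(4)

x_inv = np.linalg.inv(A) @ b     # solve via an explicit inverse (discouraged in practice)
x_solve = np.linalg.solve(A, b)  # solve directly from A and b

print(np.allclose(A @ x_inv, b))
print(np.allclose(A @ x_solve, b))
```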
2.4 Linear dependence, span, and rank
In order for $A^{-1}$ to exist, equation 2.1 must have exactly one solution for every value
of $b$. However, it is also possible for the system of equations to have no solutions or
infinitely many solutions for some values of $b$. It is not possible to have more than one
but less than infinitely many solutions for a particular $b$; if both $x$ and $y$ are solutions
then
$$z = \alpha x + (1 - \alpha) y$$
is also a solution for any real $\alpha$.
To analyze how many solutions the equation has, we can think of the columns of A
as specifying different directions we can travel from the origin (the point specified by
the vector of all zeros), and determine how many ways there are of reaching b. In this
view, each element of x specifies how far we should travel in each of these directions,
i.e. $x_i$ specifies how far to move in the direction of column $i$:
$$Ax = \sum_i x_i A_{:,i}.$$
In general, this kind of operation is called a linear combination. Formally, a linear
combination of some set of vectors $\{v^{(1)}, \ldots, v^{(n)}\}$ is given by multiplying each vector
$v^{(i)}$ by a corresponding scalar coefficient and adding the results:
$$\sum_i c_i v^{(i)}.$$
The span of a set of vectors is the set of all points obtainable by linear combination of
the original vectors.
Determining whether Ax = b has a solution thus amounts to testing whether b is
in the span of the columns of A. This particular span is known as the column space or
the range of A.
In order for the system $Ax = b$ to have a solution for all values of $b \in \mathbb{R}^m$, we
therefore require that the column space of $A$ be all of $\mathbb{R}^m$. If any point in $\mathbb{R}^m$ is excluded
from the column space, that point is a potential value of $b$ that has no solution. This
implies immediately that $A$ must have at least $m$ columns, i.e., $n \geq m$. Otherwise, the
dimensionality of the column space must be less than $m$. For example, consider a $3 \times 2$
matrix. The target $b$ is 3-D, but $x$ is only 2-D, so modifying the value of $x$ at best
allows us to trace out a 2-D plane within $\mathbb{R}^3$. The equation has a solution if and only if
$b$ lies on that plane.
Having $n \geq m$ is only a necessary condition for every point to have a solution. It is
not a sufficient condition, because it is possible for some of the columns to be redundant.
Consider a $2 \times 2$ matrix where both of the columns are equal to each other. This has the
same column space as a $2 \times 1$ matrix containing only one copy of the replicated column.
In other words, the column space is still just a line, and fails to encompass all of $\mathbb{R}^2$,
even though there are two columns.
Formally, this kind of redundancy is known as linear dependence. A set of vectors is
linearly independent if no vector in the set is a linear combination of the other vectors.
If we add a vector to a set that is a linear combination of the other vectors in the set, the
new vector does not add any points to the set’s span. This means that for the column
space of the matrix to encompass all of $\mathbb{R}^m$, the matrix must have at least $m$ linearly
independent columns. This condition is both necessary and sufficient for equation 2.1
to have a solution for every value of b.
In order for the matrix to have an inverse, we additionally need to ensure that
equation 2.1 has at most one solution for each value of b. To do so, we need to ensure
that the matrix has at most m columns. Otherwise there is more than one way of
parameterizing each solution.
Together, this means that the matrix must be square, that is, we require that m = n,
and that all of the columns must be linearly independent. A square matrix with linearly
dependent columns is known as singular.
If A is not square or is square but singular, it can still be possible to solve the
equation. However, we can not use the method of matrix inversion to find the solution.
So far we have discussed matrix inverses as being multiplied on the left. It is also
possible to define an inverse that is multiplied on the right:
$$A A^{-1} = I.$$
For square matrices, the left inverse and right inverse are the same.
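A sketch of linear dependence in practice (our own NumPy illustration): a square matrix with a repeated column is singular, which both np.linalg.matrix_rank and the failure of np.linalg.inv reveal.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [2.0, 2.0]])     # second column equals the first: linearly dependent

print(np.linalg.matrix_rank(A))   # 1, not 2: the column space is only a line
try:
    np.linalg.inv(A)
except np.linalg.LinAlgError as e:
    print("cannot invert:", e)    # singular matrices have no inverse
```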
2.5 Norms
Sometimes we need to measure the size of a vector. In machine learning, we usually
measure the size of vectors using an $l^p$ norm:
$$||x||_p = \left( \sum_i |x_i|^p \right)^{\frac{1}{p}}$$
for $p \in \mathbb{R}$, $p \geq 1$.
Norms, including the $l^p$ norm, are functions mapping vectors to non-negative values,
satisfying these properties that make them behave like distances between points:
• $f(x) = 0 \Rightarrow x = 0$
• $f(x + y) \leq f(x) + f(y)$ (the triangle inequality)
• $\forall \alpha \in \mathbb{R}, \; f(\alpha x) = |\alpha| f(x)$
The $l^2$ norm, with $p = 2$, is known as the Euclidean norm. It is simply the Euclidean
distance from the origin to the point identified by $x$. This is probably the most common
norm used in machine learning. It is also common to measure the size of a vector using
the squared $l^2$ norm, which can be calculated simply as $x^\top x$.
The squared $l^2$ norm is more convenient to work with mathematically and computationally
than the $l^2$ norm itself. For example, the derivatives of the squared $l^2$ norm with
respect to each element of $x$ each depend only on the corresponding element of $x$, while
all of the derivatives of the $l^2$ norm depend on the entire vector. In many contexts, the
squared $l^2$ norm may be undesirable because it increases very slowly near the origin. In several
machine learning applications, it is important to discriminate between elements that are
exactly zero and elements that are small but nonzero. In these cases, we turn to a function
that grows at the same rate in all locations, but retains mathematical simplicity:
the $l^1$ norm. The $l^1$ norm may be simplified to
$$||x||_1 = \sum_i |x_i|.$$
The $l^1$ norm is commonly used in machine learning when the difference between zero
and nonzero elements is very important. Every time an element of $x$ moves away from
0 by $\epsilon$, the $l^1$ norm increases by $\epsilon$.
We sometimes measure the size of the vector by counting its number of nonzero
elements (and when we use the $l^1$ norm, we often use it as a proxy for this function).
Some authors refer to this function as the “$l^0$ norm,” but this is incorrect terminology,
because scaling the vector by $\alpha$ does not change the number of nonzero entries.
One other norm that commonly arises in machine learning is the $l^\infty$ norm, also
known as the max norm. This norm simplifies to
$$||x||_\infty = \max_i |x_i|,$$
i.e., the absolute value of the element with the largest magnitude in the vector.
Sometimes we may also wish to measure the size of a matrix. In the context of deep
learning, the most common way to do this is with the otherwise obscure Frobenius norm
$$||A||_F = \sqrt{\sum_{i,j} a_{i,j}^2},$$
which is analogous to the $l^2$ norm of a vector.
The dot product of two vectors can be rewritten in terms of norms. Specifically,
$$x^\top y = ||x||_2 \, ||y||_2 \cos\theta,$$
where $\theta$ is the angle between $x$ and $y$.
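The norms above map directly onto np.linalg.norm; a minimal sketch (our own, assuming NumPy):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(np.linalg.norm(x, 1))        # l1 norm: 7.0
print(np.linalg.norm(x, 2))        # l2 (Euclidean) norm: 5.0
print(x @ x)                       # squared l2 norm: 25.0
print(np.linalg.norm(x, np.inf))   # max norm: 4.0
print(np.linalg.norm(A, 'fro'))    # Frobenius norm of a matrix
```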
2.6 Special kinds of matrices and vectors
Some special kinds of matrices and vectors are particularly useful.
Diagonal matrices only have non-zero entries along the main diagonal. Formally, a
matrix $D$ is diagonal if and only if $d_{i,j} = 0$ for all $i \neq j$. We’ve already seen one example
of a diagonal matrix: the identity matrix, where all of the diagonal entries are 1. In this
book², we write $\text{diag}(v)$ to denote a square diagonal matrix whose diagonal entries are
given by the entries of the vector $v$. Diagonal matrices are of interest in part because
multiplying by a diagonal matrix is very computationally efficient. To compute $\text{diag}(v)\,x$,
we only need to scale each element $x_i$ by $v_i$. In other words, $\text{diag}(v)\,x = v \odot x$. Inverting
a diagonal matrix is also efficient. The inverse exists only if every diagonal entry is
nonzero, and in that case, $\text{diag}(v)^{-1} = \text{diag}([1/v_1, \ldots, 1/v_n]^\top)$. In many cases, we may
derive some very general machine learning algorithm in terms of arbitrary matrices, but
obtain a less expensive (and less descriptive) algorithm by restricting some matrices to
be diagonal.
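A sketch of the efficiency claim (our own NumPy illustration): multiplying by $\text{diag}(v)$ is just element-wise scaling, and its inverse just inverts the diagonal entries.

```python
import numpy as np

v = np.array([2.0, 3.0, 5.0])
x = np.array([1.0, 1.0, 1.0])

D = np.diag(v)                       # a dense diagonal matrix, built only for comparison
print(np.allclose(D @ x, v * x))     # diag(v) x equals the element-wise product v * x
print(np.allclose(np.linalg.inv(D), np.diag(1.0 / v)))   # diag(v)^{-1} = diag(1/v)
```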
A symmetric matrix is any matrix that is equal to its own transpose:
$$A = A^\top.$$
Symmetric matrices often arise when the entries are generated by some function of two
arguments that does not depend on the order of the arguments. For example, if A is a
matrix of distance measurements, with $a_{i,j}$ giving the distance from point $i$ to point $j$,
then $a_{i,j} = a_{j,i}$ because distance functions are symmetric.
A unit vector is a vector with unit norm:
$$||x||_2 = 1.$$
A vector $x$ and a vector $y$ are orthogonal to each other if $x^\top y = 0$. If both vectors
have nonzero norm, this means that they are at 90 degree angles to each other. In $\mathbb{R}^n$,
at most $n$ vectors may be mutually orthogonal with nonzero norm. If the vectors are
not only orthogonal but also have unit norm, we call them orthonormal.
An orthogonal matrix is a square matrix whose rows are mutually orthonormal and
whose columns are mutually orthonormal:
$$A^\top A = A A^\top = I.$$
This implies that
$$A^{-1} = A^\top,$$
so orthogonal matrices are of interest because their inverse is very cheap to compute.
Counterintuitively, there is no special term for a matrix whose rows or columns are
orthogonal but not orthonormal.
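As a sketch (our own, assuming NumPy): the Q factor of a QR decomposition is an orthogonal matrix, so its transpose acts as its inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # Q is orthogonal

print(np.allclose(Q.T @ Q, np.eye(3)))
print(np.allclose(Q @ Q.T, np.eye(3)))
print(np.allclose(np.linalg.inv(Q), Q.T))          # the inverse is just the transpose
```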
2.7 Eigendecomposition
An eigenvector of a square matrix A is a non-zero vector v such that multiplication by
A alters only the scale of v:
Av = λv.
The scalar λ is known as the eigenvalue corresponding to this eigenvector. (One can
also find a left eigenvector such that $v^\top A = \lambda v^\top$, but we are usually concerned with right
eigenvectors.)
² There is not a standardized notation for constructing a diagonal matrix from a vector.
Figure 2.3: An example of the effect of eigenvectors and eigenvalues. Here, we have
a matrix $A$ with two orthonormal eigenvectors, $v^{(1)}$ with eigenvalue $\lambda_1$ and $v^{(2)}$ with
eigenvalue $\lambda_2$. Left) We plot the set of all unit vectors $u \in \mathbb{R}^2$ as a blue circle. Right)
We plot the set of all points $Au$. By observing the way that $A$ distorts the unit circle,
we can see that it scales space in direction $v^{(i)}$ by $\lambda_i$.
Note that if $v$ is an eigenvector of $A$, then so is any rescaled vector $sv$ for $s \in \mathbb{R}$, $s \neq 0$.
Moreover, $sv$ still has the same eigenvalue. For this reason, we usually only look for
unit eigenvectors.
We can construct a matrix $A$ with eigenvectors $\{v^{(1)}, \ldots, v^{(n)}\}$ and corresponding
eigenvalues $\{\lambda_1, \ldots, \lambda_n\}$ by concatenating the eigenvectors into a matrix $V$, one column
per eigenvector, and concatenating the eigenvalues into a vector $\lambda$. Then the matrix
$$A = V \,\text{diag}(\lambda)\, V^{-1}$$
has the desired eigenvalues and eigenvectors. If we make $V$ an orthogonal matrix, then
we can think of $A$ as scaling space by $\lambda_i$ in direction $v^{(i)}$. See Fig. 2.3 for an example.
We have seen that constructing matrices with specific eigenvalues and eigenvectors
allows us to stretch space in desired directions. However, we often want to decompose
matrices into their eigenvalues and eigenvectors. Doing so can help us to analyze certain
properties of the matrix, much as decomposing an integer into its prime factors can help
us understand the behavior of that integer.
Not every matrix can be decomposed into eigenvalues and eigenvectors. In some
cases, the decomposition exists, but may involve complex rather than real numbers.
Fortunately, in this book, we usually need to decompose only a specific class of matrices
that have a simple decomposition. Specifically, every real symmetric matrix can be
decomposed into an expression using only real-valued eigenvectors and eigenvalues:
$$A = Q \Lambda Q^\top,$$
where $Q$ is an orthogonal matrix composed of eigenvectors of $A$, and $\Lambda$ is a diagonal
matrix, with $\Lambda_{i,i}$ being the eigenvalue corresponding to $Q_{:,i}$.
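A sketch of this decomposition for a real symmetric matrix (our own example, assuming NumPy), using np.linalg.eigh, which exploits symmetry and returns real eigenvalues along with an orthogonal matrix of eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
A = B + B.T                          # a real symmetric matrix

eigvals, Q = np.linalg.eigh(A)       # columns of Q are orthonormal eigenvectors
Lam = np.diag(eigvals)

print(np.allclose(A, Q @ Lam @ Q.T))        # A = Q Lambda Q^T
print(np.allclose(Q.T @ Q, np.eye(3)))      # Q is orthogonal
```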
While any real symmetric matrix A is guaranteed to have an eigendecomposition,
the eigendecomposition is not unique. If any two or more eigenvectors share the same
eigenvalue, then any set of orthogonal vectors lying in their span are also eigenvectors
with that eigenvalue, and we could equivalently choose a Q using those eigenvectors
instead. By convention, we usually sort the entries of Λ in descending order. Under this
convention, the eigendecomposition is unique only if all of the eigenvalues are unique.
The eigendecomposition of a matrix tells us many useful facts about the matrix. The
matrix is singular if and only if any of the eigenvalues are 0. The eigendecomposition
can also be used to optimize quadratic expressions of the form $f(x) = x^\top A x$ subject
to $||x||_2 = 1$. Whenever $x$ is equal to an eigenvector of $A$, $f$ takes on the value of
the corresponding eigenvalue. The maximum value of $f$ within the constraint region
is the maximum eigenvalue and its minimum value within the constraint region is the
minimum eigenvalue.
A matrix whose eigenvalues are all positive is called positive definite. A matrix whose
eigenvalues are all positive or zero-valued is called positive semidefinite. Likewise, if all
eigenvalues are negative, the matrix is negative definite, and if all eigenvalues are negative
or zero-valued, it is negative semidefinite. Positive semidefinite matrices are interesting
because they guarantee that $\forall x, \; x^\top A x \geq 0$. Positive definite matrices additionally
guarantee that $x^\top A x = 0 \Rightarrow x = 0$.
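A sketch checking these facts numerically (our own construction, assuming NumPy): for a unit vector $x$, $x^\top A x$ lies between the extreme eigenvalues of a symmetric $A$, and a Gram matrix $B^\top B$ is positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
A = B + B.T                                   # symmetric matrix
eigvals = np.linalg.eigvalsh(A)

x = rng.standard_normal(3)
x /= np.linalg.norm(x)                        # a random unit vector
q = x @ A @ x
print(eigvals.min() <= q <= eigvals.max())    # True

G = B.T @ B                                   # a positive semidefinite Gram matrix
print(np.all(np.linalg.eigvalsh(G) >= -1e-10))  # eigenvalues nonnegative, up to rounding
```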
2.8 The trace operator
The trace operator gives the sum of all of the diagonal entries of a matrix:
$$\text{Tr}(A) = \sum_i a_{i,i}.$$
The trace operator is useful for a variety of reasons. Some operations that are
difficult to specify without resorting to summation notation can be specified using matrix
products and the trace operator. For example, the trace operator provides an alternative
way of writing the Frobenius norm of a matrix:
$$||A||_F = \sqrt{\text{Tr}(A A^\top)}.$$
The trace operator also has many useful properties that make it easy to manipulate
expressions involving the trace operator. For example, the trace operator is invariant to
the transpose operator:
the transpose operator:
$$\text{Tr}(A) = \text{Tr}(A^\top).$$
The trace of a square matrix composed of many factors is also invariant to moving
the last factor into the first position:
Tr(ABC) = Tr(BCA) = Tr(CAB)
or more generally,
$$\text{Tr}\left( \prod_{i=1}^{n} F^{(i)} \right) = \text{Tr}\left( F^{(n)} \prod_{i=1}^{n-1} F^{(i)} \right).$$
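A sketch of these trace identities (our own NumPy illustration with arbitrary shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 2))

print(np.isclose(np.linalg.norm(A, 'fro'),
                 np.sqrt(np.trace(A @ A.T))))                 # Frobenius norm via trace

S = rng.standard_normal((3, 3))
print(np.isclose(np.trace(S), np.trace(S.T)))                 # invariance to transpose

print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))   # cyclic property
print(np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A)))
```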
2.9 Determinant
The determinant of a square matrix, denoted det(A), is a function mapping matrices
to real scalars. The determinant is equal to the product of all the matrix’s eigenvalues.
The absolute value of the determinant can be thought of as a measure of how much
multiplication by the matrix expands or contracts space. If the determinant is 0, then
space is contracted completely along at least one dimension, causing it to lose all of its
volume. If the determinant is 1, then the transformation is volume-preserving.
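A sketch verifying that the determinant equals the product of the eigenvalues, and that a rotation is volume-preserving (our own examples, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))

eigvals = np.linalg.eigvals(A)                     # possibly complex in general
print(np.isclose(np.linalg.det(A), np.prod(eigvals).real))

R = np.array([[0.0, -1.0],
              [1.0,  0.0]])                        # a 90-degree rotation
print(np.isclose(np.linalg.det(R), 1.0))           # determinant 1: volume-preserving
```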
2.10 Example: Principal components analysis
One simple machine learning algorithm, principal components analysis (PCA), can be
derived using only knowledge of basic linear algebra.
Suppose we have a collection of $m$ points $\{x^{(1)}, \ldots, x^{(m)}\}$ in $\mathbb{R}^n$. Suppose we would
like to apply lossy compression to these points, i.e. we would like to find a way of storing
the points that requires less memory but may lose some precision. We would like to lose
as little precision as possible.
One way we can encode these points is to represent a lower-dimensional version of
them. For each point $x^{(i)} \in \mathbb{R}^n$ we will find a corresponding code vector $c^{(i)} \in \mathbb{R}^l$. If $l$ is
smaller than $n$, it will take less memory to store the code points than the original data.
We can use matrix multiplication to map the code back into $\mathbb{R}^n$. Let the reconstruction
$r(c) = Dc$, where $D \in \mathbb{R}^{n \times l}$ is the matrix defining the decoding.
To simplify the computation of the optimal encoding, we constrain the columns of
$D$ to be orthogonal to each other. (Note that $D$ is still not technically “an orthogonal
matrix” unless $l = n$.)
With the problem as described so far, many solutions are possible, because we can
increase the scale of $D_{:,i}$ if we decrease $c_i$ proportionally for all points. To give the
problem a unique solution, we constrain all of the columns of $D$ to have unit norm.
In order to turn this basic idea into an algorithm we can implement, the first thing
we need to do is figure out how to generate the optimal code point $c^*$ for each input
point $x$. One way to do this is to minimize the distance between the input point $x$ and
its reconstruction, $r(c)$. We can measure this distance using a norm. In the principal
components algorithm, we use the $l^2$ norm:
$$c^* = \arg\min_c ||x - r(c)||_2.$$
We can switch to the squared $l^2$ norm instead of the $l^2$ norm itself, because both are
minimized by the same value of $c$. This is because the $l^2$ norm is non-negative and the
squaring operation is monotonically increasing for non-negative arguments.
$$c^* = \arg\min_c ||x - r(c)||_2^2$$
The function being minimized simplifies to
$$(x - r(c))^\top (x - r(c))$$
(by the definition of the $l^2$ norm)
$$= x^\top x - x^\top r(c) - r(c)^\top x + r(c)^\top r(c)$$
(by the distributive property)
$$= x^\top x - 2 x^\top r(c) + r(c)^\top r(c)$$
(because a scalar is equal to the transpose of itself).
We can now change the function being minimized again, to omit the first term, since
this term does not depend on $c$:
$$c^* = \arg\min_c -2 x^\top r(c) + r(c)^\top r(c).$$
To make further progress, we must substitute in the definition of $r(c)$:
$$c^* = \arg\min_c -2 x^\top D c + c^\top D^\top D c$$
$$= \arg\min_c -2 x^\top D c + c^\top I_l c$$
(by the orthogonality and unit norm constraints on $D$)
$$= \arg\min_c -2 x^\top D c + c^\top c$$
We can solve this optimization problem using vector calculus (see section 4.3 if you
do not know how to do this):
$$\nabla_c \left( -2 x^\top D c + c^\top c \right) = 0$$
$$-2 D^\top x + 2c = 0$$
$$c = D^\top x. \quad (2.2)$$
This is good news: we can optimally encode $x$ using just a matrix-vector product operation.
Next, we need to choose the encoding matrix $D$. To do so, we revisit the idea of
minimizing the $l^2$ distance between inputs and reconstructions. However, since we will
use the same matrix $D$ to decode all of the points, we can no longer consider the points
in isolation. Instead, we must minimize the $l^2$ distance for all the points simultaneously:
$$D^* = \arg\min_D \sum_{i,j} \left( x^{(i)}_j - r^{(i)}_j \right)^2 \text{ subject to } D^\top D = I_l \quad (2.3)$$
where $r^{(i)}$ denotes the reconstruction of $x^{(i)}$.
To derive the algorithm for finding $D^*$, we will start by considering the case where $l =
1$. In this case, $D$ is just a single vector, $d$. Substituting equation 2.2 into equation 2.3
and simplifying $D$ into $d$, the problem reduces to
$$d^* = \arg\min_d \sum_i ||x^{(i)} - d d^\top x^{(i)}||_2^2 \text{ subject to } ||d||_2 = 1.$$
At this point, it can be helpful to rewrite the problem in terms of matrices. This
will allow us to use more compact notation. Let $X \in \mathbb{R}^{m \times n}$ be the matrix defined by
stacking all of the vectors describing the points, such that $X_{i,:} = x^{(i)\top}$. We can now
rewrite the problem as
$$d^* = \arg\min_d ||X - X d d^\top||_F^2 \text{ subject to } d^\top d = 1$$
$$= \arg\min_d \text{Tr}\left( \left( X - X d d^\top \right)^\top \left( X - X d d^\top \right) \right) \text{ subject to } d^\top d = 1$$
(by the alternate definition of the Frobenius norm)
$$= \arg\min_d \text{Tr}(X^\top X - X^\top X d d^\top - d d^\top X^\top X + d d^\top X^\top X d d^\top) \text{ subject to } d^\top d = 1$$
$$= \arg\min_d \text{Tr}(X^\top X) - \text{Tr}(X^\top X d d^\top) - \text{Tr}(d d^\top X^\top X) + \text{Tr}(d d^\top X^\top X d d^\top) \text{ subject to } d^\top d = 1$$
$$= \arg\min_d -\text{Tr}(X^\top X d d^\top) - \text{Tr}(d d^\top X^\top X) + \text{Tr}(d d^\top X^\top X d d^\top) \text{ subject to } d^\top d = 1$$
(because terms not involving $d$ do not affect the $\arg\min$)
$$= \arg\min_d -2\,\text{Tr}(X^\top X d d^\top) + \text{Tr}(d d^\top X^\top X d d^\top) \text{ subject to } d^\top d = 1$$
(because the trace is invariant to transpose)
$$= \arg\min_d -2\,\text{Tr}(X^\top X d d^\top) + \text{Tr}(X^\top X d d^\top d d^\top) \text{ subject to } d^\top d = 1$$
(because we can cycle the order of the matrices inside a trace)
$$= \arg\min_d -2\,\text{Tr}(X^\top X d d^\top) + \text{Tr}(X^\top X d d^\top) \text{ subject to } d^\top d = 1$$
(due to the constraint)
$$= \arg\min_d -\text{Tr}(X^\top X d d^\top) \text{ subject to } d^\top d = 1$$
$$= \arg\max_d \text{Tr}(X^\top X d d^\top) \text{ subject to } d^\top d = 1$$
$$= \arg\max_d \text{Tr}(d^\top X^\top X d) \text{ subject to } d^\top d = 1$$
This optimization problem may be solved using eigendecomposition. Specifically, the
optimal $d$ is given by the eigenvector of $X^\top X$ corresponding to the largest eigenvalue.
In the general case, where $l > 1$, $D$ is given by the $l$ eigenvectors corresponding to
the largest eigenvalues. This may be shown using proof by induction. We recommend
writing this proof as an exercise.
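The derivation above translates into a few lines of NumPy. This is a sketch under our own assumptions (the data are already centered, and the $l$ principal directions are taken from the eigenvectors of $X^\top X$ with the largest eigenvalues); the function name and setup are ours, not code from the text.

```python
import numpy as np

def pca_encode_decode(X, l):
    """PCA as derived above: D holds the top-l eigenvectors of X^T X."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # ascending eigenvalues, orthonormal eigenvectors
    D = eigvecs[:, -l:]                          # columns: eigenvectors with the largest eigenvalues
    C = X @ D                                    # codes: each row is c = D^T x for one data point
    X_rec = C @ D.T                              # reconstructions: r(c) = D c, one per row
    return D, C, X_rec

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X = X - X.mean(axis=0)                           # assume centered data
D, C, X_rec = pca_encode_decode(X, l=2)

print(np.allclose(D.T @ D, np.eye(2)))           # columns of D are orthonormal
print(np.mean((X - X_rec) ** 2))                 # average reconstruction error
```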