3 Principal Components Analysis (PCA)
If we have a large number of $p$ variables collected on a set of $n$ observations
it can be hard to visualize and summarize the dataset.
Principal components analysis (PCA) is a useful method that can allow
us to summarize the dataset with a small number of representative variables
or dimensions. In other words, we would like to find a low-dimensional
representation of the data that captures as much of the information in the
dataset as possible. For instance, if we can find a two-dimensional (2D)
representation of the data that captures most of the information, then we can
plot the observations in this 2D space, and maybe carry out further analysis
within this 2D space. PCA provides a tool to do this. The mathematics
underlying PCA build on ideas and results that we have learned in the
Prelims Linear Algebra courses.
The question then is ‘how do we define a representative variable or dimension?’
Answering this question using PCA involves the following key ideas:
- Each of the dimensions found by PCA is a linear combination of the
variables.
- PCA seeks a small number of dimensions that are as interesting as
possible, where interesting is measured by how variable the observations
are along those dimensions. These dimensions are referred to as principal
components, or PCs for short.
Why is variability (or variance) a good metric for determining interestingness?
In situations where the data consist of separated clusters of observations
in $p$-dimensional space, directions that maximize variance can provide good
separation of the clusters. This is illustrated in Figure 15 which shows
an example of a dataset in 2 dimensions with 2 clear clusters of points.
When the data points are projected onto the axis labelled A this results in
clear separation of the points and a variance of 12.98. When the points are
projected onto the orthogonal axis labelled B this results in no separation
of the points and a much lower variance of 1.80. So in this example variance
seems to be a good way to choose a projection that separates the two clusters.
Another way to think about this is that we have found a rotation of the
data points to maximize the variance. A rotation is a set of orthogonal
projections. Figure 16 shows the data points before rotation (left) and after
rotation (right). The points in the two clusters are well separated on the
new x-axis.
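To make this concrete, here is a minimal sketch in Python/NumPy (the clusters and the two directions are made-up values, not the data behind Figures 15 and 16) that projects a two-cluster dataset onto two orthogonal unit vectors and compares the sample variances of the two projections.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up clusters of points in 2 dimensions.
cluster1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
cluster2 = rng.normal(loc=[4.0, 4.0], scale=0.5, size=(50, 2))
X = np.vstack([cluster1, cluster2])

# Two orthogonal unit vectors playing the roles of directions A and B.
a = np.array([1.0, 1.0]) / np.sqrt(2)   # roughly along the line joining the clusters
b = np.array([1.0, -1.0]) / np.sqrt(2)  # orthogonal to it

# Project the points onto each direction and compare sample variances.
print("variance along A:", (X @ a).var(ddof=1))  # large: the clusters separate
print("variance along B:", (X @ b).var(ddof=1))  # small: the clusters overlap
```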
Figure 15: Example showing the variance of points projected onto two orthogonal directions (A) and (B). The projection (A), which separates the points well, has the higher variance.
Figure 16: The data points before rotation (left panel, 'Raw data') and after rotation onto the principal component axes (right panel, 'Data rotated to Principal Components').
3.1 Finding the first principal component
The PCA components are found one at a time. The first principal component
is a linear combination of the set of $p$ variables, denoted $Z_1$, such that
$$
Z_{1}=\alpha_{11} X_{1}+\alpha_{12} X_{2}+\ldots+\alpha_{1 p} X_{p}
$$
We can write
$$
Z_{1}=\alpha_{1}^{T} X
$$
where $\alpha_{1}=\left(\alpha_{11}, \alpha_{12}, \ldots, \alpha_{1 p}\right)^{T}$ and $X=\left(X_{1}, \ldots, X_{p}\right)^{T}$ are both column vectors. Then it can be shown (Exercise Sheet 1) that
$$
\operatorname{var}\left(Z_{1}\right)=\alpha_{1}^{T} \Sigma \alpha_{1}
$$
where $\boldsymbol{\Sigma}$ is the $p \times p$ covariance matrix of the random variable $X$.
There are two problems with seeking to maximize this variance:
- Since we only have a sample of data from each of the $p$-variables we do not know the true covariance $\boldsymbol{\Sigma}$, but we can use the sample covariance matrix $\mathbf{S}$ instead.
- This variance is unbounded as the entries of $\alpha_{1}$ increase, so we need to add a constraint on $\alpha_{1}$ such that
$$
\sum_{j=1}^{p} \alpha_{1 j}^{2}=\alpha_{1}^{T} \alpha_{1}=1
$$
Making these changes results in the constrained maximization problem
$$
\max _{\alpha_{1}} \alpha_{1}^{T} \mathbf{S} \alpha_{1} \quad \text { subject to } \quad \alpha_{1}^{T} \alpha_{1}=1
$$
In other words, we try to maximize the sample variance of the first principal component $\left(\alpha_{1}^{T} \mathbf{S} \alpha_{1}\right)$ subject to the constraint $\alpha_{1}^{T} \alpha_{1}=1$. We can solve this using Lagrange Multipliers (see Prelims Introductory Calculus course for a reminder on this technique).
Let
$$
\mathcal{L}\left(\alpha_{1}, \lambda_{1}\right)=\alpha_{1}^{T} \mathbf{S} \alpha_{1}-\lambda_{1}\left(\alpha_{1}^{T} \alpha_{1}-1\right)
$$
We then need the vector of partial derivatives of $\mathcal{L}$ with respect to the vector $\alpha_{1}$. This is the gradient vector $\nabla \mathcal{L}$ seen in the Introductory Calculus course.
Vector calculus results
Let $a$ and $x$ be $p$-column vectors and $\mathbf{B}$ a $p \times p$ symmetric matrix with rows $\mathbf{B}_{1}, \mathbf{B}_{2}, \ldots, \mathbf{B}_{p}$. Then
- Let $y=a^{T} x=\sum_{i=1}^{p} a_{i} x_{i}$ be a scalar function, then
$$
\nabla y=\left(\frac{\partial y}{\partial x_{1}}, \frac{\partial y}{\partial x_{2}}, \ldots, \frac{\partial y}{\partial x_{p}}\right)^{T}=\left(a_{1}, a_{2}, \ldots, a_{p}\right)^{T}=a
$$
- Let $z=x^{T} \mathbf{B} x=\sum_{i=1}^{p} \sum_{j=1}^{p} \mathbf{B}_{i j} x_{i} x_{j}$ then consider
$$
\nabla z=\left(\frac{\partial z}{\partial x_{1}}, \frac{\partial z}{\partial x_{2}}, \ldots, \frac{\partial z}{\partial x_{p}}\right)^{T}
$$
Since $\frac{\partial z}{\partial x_{i}}=2 \sum_{j=1}^{p} \mathbf{B}_{i j} x_{j}=2 \mathbf{B}_{i} x$ we have that
$$
\nabla z=2 \mathbf{B} x
$$
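These two gradient results can be checked numerically; the sketch below (arbitrary made-up values for $a$, $x$ and $\mathbf{B}$) compares the analytic gradients $a$ and $2 \mathbf{B} x$ with central finite-difference approximations.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
a = rng.normal(size=p)
x = rng.normal(size=p)
M = rng.normal(size=(p, p))
B = (M + M.T) / 2  # a symmetric p x p matrix

def finite_diff_grad(f, x, eps=1e-5):
    """Central-difference approximation to the gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

# grad of y = a^T x is a; grad of z = x^T B x is 2 B x (B symmetric).
print(np.allclose(finite_diff_grad(lambda v: a @ v, x), a))
print(np.allclose(finite_diff_grad(lambda v: v @ B @ v, x), 2 * B @ x))
```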
Rewriting the Lagrangian, we have
$$
\mathcal{L}\left(\alpha_{1}, \lambda_{1}\right)=\alpha_{1}^{T} \mathbf{S} \alpha_{1}-\lambda_{1} \alpha_{1}^{T} I_{p} \alpha_{1}+\lambda_{1}
$$
where $I_{p}$ denotes the $p \times p$ identity matrix.
Using these results we have that
$$
\nabla \mathcal{L}\left(\alpha_{1}, \lambda_{1}\right)=2 \mathbf{S} \alpha_{1}-2 \lambda_{1} I_{p} \alpha_{1}=2 \mathbf{S} \alpha_{1}-2 \lambda_{1} \alpha_{1}
$$
Setting this equal to 0 results in the eigenvalue equation
\begin{equation}\label{1}
\mathbf{S} \alpha_{1}=\lambda_{1} \alpha_{1}
\end{equation}
which means $\lambda_{1}$ and $\alpha_{1}$ will be an eigenvalue and eigenvector of $\mathbf{S}$ respectively. We can find the eigenvalues of $\mathbf{S}$ by solving the characteristic polynomial
$$
\chi_{S}(x)=\operatorname{det}\left(\mathbf{S}-x I_{p}\right)=0
$$
There will be at most $p$ roots of this equation, so we need to determine which one to choose. If we pre-multiply equation \eqref{1} by $\alpha_{1}^{T}$ then we get
\begin{equation}\label{2}
\alpha_{1}^{T} \mathbf{S} \alpha_{1}=\lambda_{1}
\end{equation}
Thus $\lambda_{1}$ is the sample variance of the first principal component which we are trying to maximize, so we choose $\lambda_{1}$ to be the largest eigenvalue of $\mathbf{S}$, and $\alpha_{1}$ is the corresponding eigenvector.
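As a sketch of this in code (Python/NumPy, with a made-up $n \times p$ data matrix; np.linalg.eigh is used because $\mathbf{S}$ is symmetric, and it returns eigenvalues in ascending order):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))          # made-up n x p data matrix

S = np.cov(X, rowvar=False)            # p x p sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order

lambda1 = eigvals[-1]                  # largest eigenvalue
alpha1 = eigvecs[:, -1]                # corresponding unit-length eigenvector

# lambda_1 is the sample variance of the first principal component,
# and alpha_1 satisfies the unit-length constraint.
print(np.isclose(lambda1, alpha1 @ S @ alpha1))
print(np.isclose(alpha1 @ alpha1, 1.0))
```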
3.2 Finding the second (and subsequent) principal components
Now consider finding the second principal component. It will also be a linear combination of the set of $p$ variables, denoted $Z_{2}$, such that
$$
Z_{2}=\alpha_{2}^{T} X=\alpha_{21} X_{1}+\alpha_{22} X_{2}+\ldots+\alpha_{2 p} X_{p}
$$
where $\alpha_{2}=\left(\alpha_{21}, \alpha_{22}, \ldots, \alpha_{2 p}\right)^{T}$. As before we need the constraint
$$
\alpha_{2}^{T} \alpha_{2}=1
$$
However, we must add another constraint in order to find a distinct eigenvector. As described above, we wish the linear combinations of the $p$ variables to be orthogonal projections, so we add the further constraint that
$$
\alpha_{1}^{T} \alpha_{2}=0
$$
and this leads to another Lagrange multipliers problem where we seek to maximize
$$
\mathcal{L}\left(\alpha_{2}, \lambda_{2}, m\right)=\alpha_{2}^{T} \mathbf{S} \alpha_{2}-\lambda_{2}\left(\alpha_{2}^{T} \alpha_{2}-1\right)-m \alpha_{1}^{T} \alpha_{2}
$$
Taking partial derivatives with respect to $\alpha_{2}$ leads to
$$
\nabla \mathcal{L}\left(\alpha_{2}, \lambda_{2}, m\right)=2 \mathbf{S} \alpha_{2}-2 \lambda_{2} \alpha_{2}-m \alpha_{1}
$$
Setting equal to 0 gives
\begin{equation}\label{3}
\mathbf{S} \alpha_{2}-\lambda_{2} \alpha_{2}=\frac{1}{2} m \alpha_{1}
\end{equation}
Pre-multiplying by $\alpha_{1}^{T}$, remembering that $\alpha_{1}^{T} \alpha_{2}=0$ and $\alpha_{1}^{T} \alpha_{1}=1$, gives
$$
\alpha_{1}^{T} \mathbf{S} \alpha_{2}=\frac{1}{2} m
$$
However, pre-multiplying \eqref{1} by $\alpha_{2}^{T}$, we get
\begin{equation}\label{4}
\alpha_{2}^{T} \mathbf{S} \alpha_{1}=\alpha_{1}^{T} \mathbf{S} \alpha_{2}=0
\end{equation}
since $\alpha_{2}^{T} \mathbf{S} \alpha_{1}$ is a scalar and $\mathbf{S}$ is symmetric, which implies that $m=0$ and equation \eqref{3} reduces to the eigenvalue equation
\begin{equation}\label{5}
\mathbf{S} \alpha_{2}=\lambda_{2} \alpha_{2}
\end{equation}
So $\lambda_{2}$ is also an eigenvalue of $\mathbf{S}$ with associated eigenvector $\alpha_{2}$. As before we can show that $\lambda_{2}$ is equal to $\alpha_{2}^{T} \mathbf{S} \alpha_{2}$, which is the sample variance of the second principal component. We must also satisfy the constraint $\alpha_{1}^{T} \alpha_{2}=0$; since eigenvectors of the symmetric matrix $\mathbf{S}$ corresponding to distinct eigenvalues are orthogonal, this is achieved by setting $\lambda_{2}$ to be the second largest eigenvalue, with $\alpha_{2}$ being its corresponding eigenvector.
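The numerical sketch above extends to the second component (again with a made-up data matrix; the eigenvectors returned by np.linalg.eigh are orthonormal, so the constraint $\alpha_{1}^{T} \alpha_{2}=0$ holds automatically):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))          # made-up n x p data matrix, as before
S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)   # ascending eigenvalues, orthonormal eigenvectors

alpha1, alpha2 = eigvecs[:, -1], eigvecs[:, -2]
lambda2 = eigvals[-2]                  # second largest eigenvalue

print(np.isclose(lambda2, alpha2 @ S @ alpha2))   # sample variance of the second PC
print(np.isclose(alpha1 @ alpha2, 0.0))           # alpha_1^T alpha_2 = 0
print(np.isclose(alpha1 @ S @ alpha2, 0.0))       # equation (4)
```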
The above process can be continued for the other principal components. This results in a sequence of principal components ordered by their variance. In other words, if we consider the eigenvalue decomposition of $\mathbf{S}$ (see Linear Algebra recap)
$$
\mathbf{S}=\mathbf{V D V}^{T}
$$
where $\mathbf{D}$ is a diagonal matrix of eigenvalues ordered in decreasing value, and $\mathbf{V}$ is a $p \times p$ matrix of the corresponding eigenvectors as orthonormal columns $\left(v_{1}, \ldots, v_{p}\right)$, then we have that
$$
\alpha_{i}=v_{i} \text { and } \lambda_{i}=\mathbf{D}_{i i} \text { for } i=1, \ldots, p
$$
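In code, the full decomposition with eigenvalues ordered in decreasing value can be obtained by sorting the output of np.linalg.eigh; a sketch (illustrative function name, made-up data):

```python
import numpy as np

def covariance_eigendecomposition(X):
    """Eigendecomposition S = V D V^T of the sample covariance of X,
    with eigenvalues in decreasing order (column i of V is alpha_i)."""
    S = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(S)   # ascending order for symmetric S
    order = np.argsort(eigvals)[::-1]      # reorder to decreasing eigenvalues
    return eigvals[order], eigvecs[:, order]

# Example with a made-up data matrix.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
lam, V = covariance_eigendecomposition(X)
print(np.allclose(np.cov(X, rowvar=False), V @ np.diag(lam) @ V.T))  # S = V D V^T
```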
3.3 Plotting the principal components
If $\mathbf{X}$ is the $n \times p$ data matrix of the $n$ observations on $p$ variables, and $\mathbf{S}$ is the $p \times p$ sample covariance matrix with ordered eigendecomposition $\mathbf{S}=\mathbf{V D V}^{T}$, then the data can be transformed to the $p$ principal component directions.
If we define $\mathbf{Z}$ to be an $n \times p$ matrix containing the transformed data, such that $\mathbf{Z}_{i j}$ is the value of the $j$ th principal component for the $i$ th observation then we have
\begin{align}
\mathbf{Z}_{i j} &=\alpha_{j 1} \mathbf{X}_{i 1}+\alpha_{j 2} \mathbf{X}_{i 2}+\ldots+\alpha_{j p} \mathbf{X}_{i p}\label{6}\\
&=\mathbf{V}_{1 j} \mathbf{X}_{i 1}+\mathbf{V}_{2 j} \mathbf{X}_{i 2}+\ldots+\mathbf{V}_{p j} \mathbf{X}_{i p}\label{7}
\end{align}
In other words we take the inner product of the $i$th row of $\mathbf{X}$ and the $j$th column of $\mathbf{V}$. The matrix $\mathbf{V}$ is known as the loadings matrix. In matrix notation we can write this as
$$
\mathbf{Z}=\mathbf{X V}
$$
We can then plot the columns of $\mathbf{Z}$ against each other to visualize the data as represented by those pairs of principal components. The matrix $\mathbf{Z}$ is known as the scores matrix. The columns of this matrix contain the projections onto the principal components.
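A sketch of computing the scores matrix $\mathbf{Z}=\mathbf{X V}$ and plotting one pair of principal components against each other (Python with NumPy and matplotlib, using a made-up data matrix rather than the Crabs data):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))      # made-up n x p data matrix

S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
V = eigvecs[:, order]              # loadings matrix, columns alpha_1, ..., alpha_p

Z = X @ V                          # scores matrix: Z[i, j] = value of PC j+1 for observation i

# Plot the first two principal components against each other.
plt.scatter(Z[:, 0], Z[:, 1])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```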
Figure 17 shows a pairs plot of the 5 PCs for the Crabs dataset. The points have been coloured according to the 4 groups Blue Male (dark blue), Blue Female (light blue), Orange Male (orange), Orange Female (yellow). From this plot we can see that PCs 2 and 3 seem to show good discrimination of these 4 groups. This is shown more clearly in Figure 18.
Figure 17: Pairs plot of the principal components for the Crabs dataset.