
SVD (Singular Value Decomposition): Existence Proof



hbghlyj, posted 2022-6-23 14:32


en.wikipedia.org/wiki/Singular_value_decomposition#Existence_proofs

An eigenvalue λ of a matrix M is characterized by the algebraic relation Mu = λu. When M is Hermitian, a variational characterization is also available. Let M be a real n × n symmetric matrix. Define

$${\displaystyle {\begin{cases}f:\mathbb {R} ^{n}\to \mathbb {R} \\f:\mathbf {x} \mapsto \mathbf {x} ^{\textsf {T}}\mathbf {M} \mathbf {x} \end{cases}}}$$

By the extreme value theorem, this continuous function attains a maximum at some u when restricted to the unit sphere $\{\|x\| = 1\}$. By the Lagrange multipliers theorem, u necessarily satisfies

$${\displaystyle \nabla \mathbf {u} ^{\textsf {T}}\mathbf {M} \mathbf {u} -\lambda \cdot \nabla \mathbf {u} ^{\textsf {T}}\mathbf {u} =0}$$

for some real number λ. The nabla symbol, ∇, is the del operator (differentiation with respect to x). Using the symmetry of M we obtain

$${\displaystyle \nabla \mathbf {x} ^{\textsf {T}}\mathbf {M} \mathbf {x} -\lambda \cdot \nabla \mathbf {x} ^{\textsf {T}}\mathbf {x} =2(\mathbf {M} -\lambda \mathbf {I} )\mathbf {x} .}$$

Therefore Mu = λu, so u is a unit length eigenvector of M. For every unit length eigenvector v of M its eigenvalue is f(v), so λ is the largest eigenvalue of M. The same calculation performed on the orthogonal complement of u gives the next largest eigenvalue and so on. The complex Hermitian case is similar; there f(x) = x* M x is a real-valued function of 2n real variables.
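As a quick numerical check of this variational characterization (a sketch assuming NumPy; the random symmetric matrix and the crude random search over the unit sphere are illustrative choices, not part of the quoted proof):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
M = (A + A.T) / 2                      # a random real symmetric matrix

# Algebraic route: eigh returns the eigenvalues of M in ascending order.
eigvals, eigvecs = np.linalg.eigh(M)
lam_max = eigvals[-1]

# Variational route: evaluate f(x) = x^T M x on many random unit vectors.
xs = rng.standard_normal((100_000, n))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
f_vals = np.einsum('ij,jk,ik->i', xs, M, xs)

# The maximum of f over the sphere approaches the largest eigenvalue from below.
print(lam_max, f_vals.max())
```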

Singular values are similar in that they can be described algebraically or from variational principles, although, unlike the eigenvalue case, Hermiticity or symmetry of M is no longer required.

This section gives these two arguments for the existence of the singular value decomposition.


hbghlyj (OP), posted 2022-6-23 14:36


pda-tt22-notes-clean.pdf

3 Principal Components Analysis (PCA)

If we have a large number of $p$ variables collected on a set of $n$ observations it can be hard to visualize and summarize the dataset.
Principal components analysis (PCA) is a useful method that can allow us to summarize the dataset with a small number of representative variables or dimensions. In other words, we would like to find a low-dimensional representation of the data that captures as much of the information in the dataset as possible. For instance, if we can find a two-dimensional (2D) representation of the data that captures most of the information, then we can plot the observations in this 2D space, and maybe carry out further analysis within this 2D space. PCA provides a tool to do this. The mathematics underlying PCA builds on ideas and results that we have learned in the Prelims Linear Algebra courses.

The question then is ‘how do we define a representative variable or dimension?’ Answering this question using PCA involves the following key ideas :
  • Each of the dimensions found by PCA is a linear combination of the variables.
  • PCA seeks a small number of dimensions that are as interesting as possible, where interesting is measured by how variable the observations are along those dimensions. These dimensions are referred to as principal components, or PCs for short.
Why is variability (or variance) a good metric for determining interestingness? In situations where the data consist of separated clusters of observations in $p$-dimensional space, directions that maximize variance can provide good separation of the clusters. This is illustrated in Figure 15 which shows an example of a dataset in 2 dimensions with 2 clear clusters of points. When the data points are projected onto the axis labelled A this results in clear separation of the points and a variance of 12.98. When the points are projected onto the orthogonal axis labelled B this results in no separation of the points and a much lower variance of 1.80. So in this example variance seems to be a good way to choose a projection that separates the two clusters.

Another way to think about this is that we have found a rotation of the data points to maximize the variance. A rotation is a set of orthogonal projections. Figure 16 shows the data points before rotation (left) and after rotation (right). The points in the two clusters are well separated on the new x-axis.

Figure 15: Example showing the variance of points projected onto two (orthogonal) projections (A) and (B). Projection (A), which separates the points well, has the highest variance.

Figure 16: The raw data (left) and the data rotated to the principal components (right); the two clusters are well separated along the new x-axis.
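The point of Figures 15 and 16 can be reproduced with a small sketch (assuming NumPy; the two synthetic Gaussian clusters and the specific directions A and B are stand-ins, since the data behind the figures is not given here):

```python
import numpy as np

rng = np.random.default_rng(1)
# Two synthetic clusters in 2D, separated along the x-axis.
cluster1 = rng.normal(loc=[-3.0, 0.0], scale=0.7, size=(50, 2))
cluster2 = rng.normal(loc=[+3.0, 0.0], scale=0.7, size=(50, 2))
X = np.vstack([cluster1, cluster2])
X = X - X.mean(axis=0)                 # centre the cloud

# Two orthogonal unit directions playing the roles of projections A and B.
a = np.array([1.0, 0.0])               # along the cluster separation
b = np.array([0.0, 1.0])               # orthogonal to it

var_A = np.var(X @ a, ddof=1)          # large: the clusters are well separated
var_B = np.var(X @ b, ddof=1)          # small: the clusters overlap completely
print(var_A, var_B)
```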

3.1 Finding the first principal component

The PCA components are found one at a time. The first principal component is a linear combination of the set of $p$ variables, denoted $Z_1$, such that $$ Z_{1}=\alpha_{11} X_{1}+\alpha_{12} X_{2}+\ldots+\alpha_{1 p} X_{p} $$ We can write $$ Z_{1}=\alpha_{1}^{T} X $$ where $\alpha_{1}=\left(\alpha_{11}, \alpha_{12}, \ldots, \alpha_{1 p}\right)^{T}$ and $X=\left(X_{1}, \ldots, X_{p}\right)^{T}$ are both column vectors. Then it can be shown (Exercise Sheet 1) that $$ \operatorname{var}\left(Z_{1}\right)=\alpha_{1}^{T} \Sigma \alpha_{1} $$ where $\boldsymbol{\Sigma}$ is the $p \times p$ covariance matrix of the random variable $X$. There are two problems with seeking to maximize this variance:
  • Since we only have a sample of data from each of the $p$-variables we do not know the true covariance $\boldsymbol{\Sigma}$, but we can use the sample covariance matrix $\mathbf{S}$ instead.
  • This variance is unbounded as the entries of $\alpha_{1}$ increase, so we need to add a constraint on $\alpha_{1}$ such that $$ \sum_{j=1}^{p} \alpha_{1 j}^{2}=\alpha_{1}^{T} \alpha_{1}=1 $$
Making these changes results in the constrained maximization problem
$\max _{\alpha_{1}} \alpha_{1}^{T} \mathbf{S} \alpha_{1} \quad$ subject to $\quad \alpha_{1}^{T} \alpha_{1}=1$
In other words, we try to maximize the sample variance of the first principal component $\left(\alpha_{1}^{T} \mathbf{S} \alpha_{1}\right)$ subject to the constraint $\alpha_{1}^{T} \alpha_{1}=1$. We can solve this using Lagrange Multipliers (see Prelims Introductory Calculus course for a reminder on this technique).
Let $$ \mathcal{L}\left(\alpha_{1}, \lambda_{1}\right)=\alpha_{1}^{T} \mathbf{S} \alpha_{1}-\lambda_{1}\left(\alpha_{1}^{T} \alpha_{1}-1\right) $$ We then need the vector of partial derivatives of $\mathcal{L}$ with respect to the vector $\alpha_{1}$. This is the gradient vector $\nabla \mathcal{L}$ seen in Intro Calculus.

Vector calculus results

Let $a$ and $x$ be $p$-column vectors and $\mathbf{B}$ a $p \times p$ symmetric matrix with rows $\mathbf{B}_{1}, \mathbf{B}_{2}, \ldots, \mathbf{B}_{p}$. Then
  • Let $y=a^{T} x=\sum_{i=1}^{p} a_{i} x_{i}$ be a scalar function, then $$ \nabla y=\left(\frac{\partial y}{\partial x_{1}}, \frac{\partial y}{\partial x_{2}}, \ldots, \frac{\partial y}{\partial x_{p}}\right)^{T}=\left(a_{1}, a_{2}, \ldots, a_{p}\right)^{T}=a $$
  • Let $z=x^{T} \mathbf{B} x=\sum_{i=1}^{p} \sum_{j=1}^{p} \mathbf{B}_{i j} x_{i} x_{j}$ then consider $$ \nabla z=\left(\frac{\partial z}{\partial x_{1}}, \frac{\partial z}{\partial x_{2}}, \ldots, \frac{\partial z}{\partial x_{p}}\right)^{T} $$ Since $\frac{\partial z}{\partial x_{i}}=2 \sum_{j=1}^{p} \mathbf{B}_{i j} x_{j}=2 \mathbf{B}_{i} x$ we have that $$ \nabla z=2 \mathbf{B} x $$
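These two identities are easy to confirm numerically with central finite differences (a sketch assuming NumPy, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(2)
p = 4
a = rng.standard_normal(p)
B = rng.standard_normal((p, p))
B = (B + B.T) / 2                      # B must be symmetric for grad(x^T B x) = 2 B x
x = rng.standard_normal(p)

def num_grad(f, x, h=1e-6):
    """Numerical gradient of a scalar function f at x via central differences."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

print(np.allclose(num_grad(lambda v: a @ v, x), a, atol=1e-5))              # grad(a^T x) = a
print(np.allclose(num_grad(lambda v: v @ B @ v, x), 2 * B @ x, atol=1e-5))  # grad(x^T B x) = 2 B x
```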
Re-writing we have $$ \mathcal{L}\left(\alpha_{1}, \lambda_{1}\right)=\alpha_{1}^{T} \mathbf{S} \alpha_{1}-\lambda_{1} \alpha_{1}^{T} I_{p} \alpha_{1}+\lambda_{1} $$ where $I_{p}$ denotes the $p \times p$ identity matrix.
Using these results we have that $$ \nabla \mathcal{L}\left(\alpha_{1}, \lambda_{1}\right)=2 \mathbf{S} \alpha_{1}-2 \lambda_{1} I_{p} \alpha_{1}=2 \mathbf{S} \alpha_{1}-2 \lambda_{1} \alpha_{1} $$ Setting this equal to 0 results in the eigenvalue equation \begin{equation}\label1 \mathbf{S} \alpha_{1}=\lambda_{1} \alpha_{1} \end{equation} which means $\lambda_{1}$ and $\alpha_{1}$ will be an eigenvalue and eigenvector of $S$ respectively. We can find the eigenvalues of $\mathbf{S}$ by solving the characteristic polynomial $$ \chi_{S}(x)=\operatorname{det}\left(\mathbf{S}-x I_{p}\right)=0 $$ There will be at most $p$ roots of this equation, so we need to determine which one to choose. If we multiply equation \eqref{1} by $\alpha_{1}^{T}$ then we get \begin{equation}\label2 \alpha_{1}^{T} \mathbf{S} \alpha_{1}=\lambda_{1} \end{equation} Thus $\lambda_{1}$ is the sample variance of the first principal component which we are trying to maximize, so we choose $\lambda_{1}$ to be the largest eigenvalue of $\mathbf{S}$, and $\alpha_{1}$ is the corresponding eigenvector.
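In practice the first principal component is obtained exactly this way: take the largest eigenvalue of $\mathbf{S}$ and its eigenvector. A sketch assuming NumPy and simulated data (the covariance structure is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 5
# Simulated data whose variables have decreasing variances.
X = rng.multivariate_normal(np.zeros(p), np.diag([5.0, 3.0, 1.0, 0.5, 0.1]), size=n)

S = np.cov(X, rowvar=False)            # p x p sample covariance matrix

# Solve S alpha = lambda alpha; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(S)
lambda1 = eigvals[-1]                  # largest eigenvalue
alpha1 = eigvecs[:, -1]                # corresponding unit eigenvector

# lambda1 equals the sample variance of the first principal component Z1 = alpha1^T X.
Z1 = X @ alpha1
print(np.isclose(np.var(Z1, ddof=1), lambda1))
```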

3.2 Finding the second (and subsequent) principal components

Now consider finding the second principal component. It will also be a linear combination of the set of $p$ variables, denoted $Z_{2}$, such that $$ Z_{2}=\alpha_{2}^{T} X=\alpha_{21} X_{1}+\alpha_{22} X_{2}+\ldots+\alpha_{2 p} X_{p} $$ where $\alpha_{2}=\left(\alpha_{21}, \alpha_{22}, \ldots, \alpha_{2 p}\right)^{T}$. As before we need the constraint $$ \alpha_{2}^{T} \alpha_{2}=1 $$ However, we must add another constraint in order to find a distinct eigenvector. As described above we wish the linear combinations of $p$ variables to be orthogonal projections, so we add in the further constraint that $$ \alpha_{1}^{T} \alpha_{2}=0 $$ and this leads to another Lagrange multipliers problem where we seek to maximize $$ \mathcal{L}\left(\alpha_{2}, \lambda_{2}, m\right)=\alpha_{2}^{T} \mathbf{S} \alpha_{2}-\lambda_{2}\left(\alpha_{2}^{T} \alpha_{2}-1\right)-m \alpha_{1}^{T} \alpha_{2} $$ Taking partial derivatives with respect to $\alpha_{2}$ leads to $$ \nabla \mathcal{L}\left(\alpha_{2}, \lambda_{2}, m\right)=2 \mathbf{S} \alpha_{2}-2 \lambda_{2} \alpha_{2}-m \alpha_{1} $$ Setting equal to 0 gives \begin{equation}\label3 \mathbf{S} \alpha_{2}-\lambda_{2} \alpha_{2}=\frac{1}{2} m \alpha_{1} \end{equation} Pre-multiplying by $\alpha_{1}^{T}$, remembering that $\alpha_{1}^{T} \alpha_{2}=0$ and $\alpha_{1}^{T} \alpha_{1}=1$, gives $$ \alpha_{1}^{T} \mathbf{S} \alpha_{2}=\frac{1}{2} m $$ However, pre-multiplying \eqref{1} by $\alpha_{2}^{T}$, we get \begin{equation}\label4 \alpha_{2}^{T} \mathbf{S} \alpha_{1}=\alpha_{1}^{T} \mathbf{S} \alpha_{2}=0 \end{equation} since $\alpha_{2}^{T} \mathbf{S} \alpha_{1}$ is a scalar and $\mathbf{S}$ is symmetric, which implies that $m=0$ and equation \eqref{3} reduces to the eigenvalue equation \begin{equation}\label5 \mathbf{S} \alpha_{2}=\lambda_{2} \alpha_{2} \end{equation} So $\lambda_{2}$ is also an eigenvalue of $\mathbf{S}$ with associated eigenvector $\alpha_{2}$. As before we can show that $\lambda_{2}$ is equal to $\alpha_{2}^{T} \mathbf{S} \alpha_{2}$ which is the sample variance of the second principal component. We must also ensure that $\alpha_{1}^{T} \mathbf{S} \alpha_{2}=0$, so this is achieved by setting $\lambda_{2}$ to be the second largest eigenvalue, with $\alpha_{2}$ being its corresponding eigenvector.

The above process can be continued for the other principal components. This results in a sequence of principal components ordered by their variance. In other words, if we consider the eigenvalue decomposition of $S$ (see Linear Algebra recap) $$ \mathbf{S}=\mathbf{V D V}^{T} $$ where $\mathbf{D}$ is a diagonal matrix of eigenvalues ordered in decreasing value, and $\mathbf{V}$ is a $p \times p$ matrix of the corresponding eigenvectors as orthonormal columns $\left(v_{1}, \ldots, v_{p}\right)$, then we have that $$ \alpha_{i}=v_{i} \text { and } \lambda_{i}=\mathbf{D}_{i i} \text { for } i=1, \ldots, p $$

3.3 Plotting the principal components

If $\mathbf{X}$ is the $n \times p$ data matrix of the $n$ observations on $p$ variables, and $\mathbf{S}$ is the $p \times p$ sample covariance matrix with ordered eigendecomposition $\mathbf{S}=\mathbf{V D V}^{T}$, then the data can be transformed to the $p$ principal component directions.
If we define $\mathbf{Z}$ to be an $n \times p$ matrix containing the transformed data, such that $\mathbf{Z}_{i j}$ is the value of the $j$ th principal component for the $i$ th observation then we have \begin{align} \mathbf{Z}_{i j} &=\alpha_{j 1} \mathbf{X}_{i 1}+\alpha_{j 2} \mathbf{X}_{i 2}+\ldots \alpha_{j p} \mathbf{X}_{i p}\label6\\ &=\mathbf{V}_{1 j} \mathbf{X}_{i 1}+\mathbf{V}_{2 j} \mathbf{X}_{i 2}+\ldots \mathbf{V}_{p j} \mathbf{X}_{i p}\label7 \end{align} In other words we take the vector product of the $i$ th row of $\mathbf{X}$ and the $j$ th column of $\mathbf{V}$. The matrix $\mathbf{V}$ is known as the loadings matrix. In matrix notation we can write this as $$ \mathbf{Z}=\mathbf{X V} $$ We can then plot the columns of $\mathbf{Z}$ against each other to visualize the data as represented by those pairs of principal components. The matrix $\mathbf{Z}$ is known as the scores matrix. The columns of this matrix contain the projections onto the principal components.
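A compact sketch of the whole transformation (assuming NumPy; the notes leave column-centring implicit at this point, so the sketch centres $\mathbf{X}$ before forming $\mathbf{S}$ and the scores):

```python
import numpy as np

def pca_scores(X):
    """Return the loadings matrix V and the scores matrix Z = Xc V (a sketch).

    Columns of V are eigenvectors of the sample covariance matrix, ordered by
    decreasing eigenvalue; Xc is the column-centred data matrix.
    """
    Xc = X - X.mean(axis=0)                    # centre each variable
    S = np.cov(Xc, rowvar=False)               # p x p sample covariance
    eigvals, V = np.linalg.eigh(S)             # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # re-order: largest first
    V = V[:, order]
    Z = Xc @ V                                 # n x p scores matrix
    return V, Z

# Columns of Z (e.g. Z[:, 0] against Z[:, 1]) can then be plotted pairwise.
X = np.random.default_rng(4).standard_normal((100, 5))
V, Z = pca_scores(X)
print(Z.shape)                                 # (100, 5)
```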

Figure 17 shows a pairs plot of the 5 PCs for the Crabs dataset. The points have been coloured according to the 4 groups: Blue Male (dark blue), Blue Female (light blue), Orange Male (orange), Orange Female (yellow). From this plot we can see that PCs 2 and 3 seem to show good discrimination of these 4 groups. This is shown more clearly in Figure 18.
Figure 17: Pairs plot of PC components for the Crabs dataset.


hbghlyj (OP), posted 2022-6-23 15:14
Loadings are the covariances/correlations between the original variables and the unit-scaled components. The answer quoted below shows geometrically what loadings are, and what the coefficients associating components with variables are, in PCA or factor analysis.

stats.stackexchange.com/questions/143905/loadings-vs-eigenvector ... o-use-one-or-another


Trying to be non-technical... Imagine you have multivariate data, a multidimensional cloud of points. When you compute the covariance matrix of those points you actually (a) center the cloud, i.e. put the origin at the multidimensional mean, so that the coordinate system axes now cross in the centre of the cloud, and (b) encode the information about the shape of the cloud and how it is oriented in space by means of the variance-covariance entries. So, most of the important info about the shape of the data as a whole is stored in the covariance matrix.

Then you do eigen-decomposition of that matrix and obtain the list of eigenvalues and the corresponding eigenvectors. Now, the 1st principal component is the new, latent variable which can be displayed as the axis going through the origin and oriented along the direction of the maximal variance (thickness) of the cloud. The variance along this axis, i.e. the variance of the coordinates of all points on it, is the first eigenvalue, and the orientation of the axis in space referenced to the original axes (the variables) is defined by the 1st eigenvector: its entries are the cosines between it and those original axes. The aforementioned coordinates of data points on the 1st component are the 1st principal component values, or component scores; they are computed as the product of the (centered) data matrix and the eigenvector.

"After" the 1st pr. component got measured it is, to say, "removed" from the cloud with all the variance it accounted for, and the dimensionality of the cloud drops by one. Next, everything is repeated with the second eigenvalue and the second eigenvector - the 2nd pr. component is being recorded, and then "removed". Etc.

So, once again: eigenvectors are direction cosines for principal components, while eigenvalues are the magnitude (the variance) in the principal components. Sum of all eigenvalues is equal to the sum of variances which are on the diagonal of the variance-covariance matrix. If you transfer the "magnitudinal" information stored in eigenvalues over to eigenvectors to add it to the "orientational" information stored therein you get what is called principal component loadings; these loadings - because they carry both types of information - are the covariances between the original variables and the principal components.

stats.stackexchange.com/questions/2691/making-sense-of-principal ... envalues/35653#35653
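The relationship described above, loadings = eigenvectors scaled by the square roots of the eigenvalues = covariances between the original variables and the unit-scaled components, can be checked numerically (a sketch assuming NumPy; the simulated data is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((500, 4)) @ rng.standard_normal((4, 4))  # correlated variables
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

S = np.cov(Xc, rowvar=False)
eigvals, V = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

loadings = V * np.sqrt(eigvals)        # each eigenvector scaled by sqrt(its eigenvalue)

scores = Xc @ V                        # raw component scores, variances = eigvals
scores_std = scores / np.sqrt(eigvals) # unit-scaled (unit-variance) components
cov_var_comp = Xc.T @ scores_std / (n - 1)   # covariances: variables vs components

print(np.allclose(loadings, cov_var_comp))   # True: loadings are those covariances
```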


hbghlyj (OP), posted 2022-6-23 15:40
There is a faster way of obtaining the scores matrix $\mathbf{Z}=\mathbf{X} \mathbf{V}$ that contains the projections onto the PCs. We need to use the following result

Theorem (Singular Value Decomposition) If $\boldsymbol{X}$ is a $n \times p$ matrix of real numbers then there exists a factorization, called the Singular Value Decomposition (SVD) of $\boldsymbol{X}$, with the following form
$$
\boldsymbol{X}=\boldsymbol{P} \boldsymbol{\Lambda} \boldsymbol{Q}^{T}
$$
where
  • $\boldsymbol{P}$ is a $n \times n$ matrix such that $\boldsymbol{P} \boldsymbol{P}^{T}=\boldsymbol{P}^{T} \boldsymbol{P}=\boldsymbol{I}_{n}$
  • $\boldsymbol{Q}$ is a $p \times p$ matrix such that $\boldsymbol{Q} \boldsymbol{Q}^{T}=\boldsymbol{Q}^{T} \boldsymbol{Q}=\boldsymbol{I}_{p}$
  • $\boldsymbol{\Lambda}$ is an $n \times p$ diagonal matrix with $\min(n, p)$ non-negative real numbers on the diagonal.

The diagonal entries of $\boldsymbol{\Lambda}$ are known as the singular values of $\boldsymbol{X}$.
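A quick shape check of this factorization, assuming NumPy's np.linalg.svd with full_matrices=True (which returns $\boldsymbol{P}$, the singular values, and $\boldsymbol{Q}^{T}$ in the notation above):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 7, 4
X = rng.standard_normal((n, p))

# full_matrices=True gives P as n x n and Q^T as p x p, matching the theorem.
P, sing_vals, Qt = np.linalg.svd(X, full_matrices=True)

k = min(n, p)
Lam = np.zeros((n, p))                 # n x p "diagonal" matrix of singular values
Lam[:k, :k] = np.diag(sing_vals)

print(np.allclose(X, P @ Lam @ Qt))                                        # X = P Lambda Q^T
print(np.allclose(P.T @ P, np.eye(n)), np.allclose(Qt @ Qt.T, np.eye(p)))  # orthogonality
```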

Using the SVD we can write the following
\begin{align}
\boldsymbol{S}=\frac{1}{n-1} \boldsymbol{X}^{T} \boldsymbol{X} &=\frac{1}{n-1} \boldsymbol{Q} \boldsymbol{\Lambda}^{T} \boldsymbol{P}^{T} \boldsymbol{P} \boldsymbol{\Lambda} \boldsymbol{Q}^{T} \\
&=\frac{1}{n-1} \boldsymbol{Q} \boldsymbol{\Lambda}^{T} \boldsymbol{\Lambda} \boldsymbol{Q}^{T}
\end{align}
Note that $\boldsymbol{\Lambda}^{T} \boldsymbol{\Lambda}$ is a $p \times p$ diagonal matrix with entries that are the squares of the entries in $\boldsymbol{\Lambda}$. This has the same form as the eigendecomposition $\boldsymbol{S}=\mathbf{V D V}^{T}$, with $\boldsymbol{V}=\boldsymbol{Q}$ and $\boldsymbol{D}=\frac{1}{n-1} \boldsymbol{\Lambda}^{T} \boldsymbol{\Lambda}$.
Therefore
$$
\boldsymbol{Z}=\boldsymbol{X} \boldsymbol{V}=\boldsymbol{X} \boldsymbol{Q}=\boldsymbol{P} \boldsymbol{\Lambda}
$$
which implies that we only need to calculate $\boldsymbol{P}$ and $\boldsymbol{\Lambda}$. This can be achieved using the eigendecomposition of the $n \times n$ matrix $\boldsymbol{X} \boldsymbol{X}^{T}$ since
$$
\boldsymbol{X} \boldsymbol{X}^{T}=\boldsymbol{P} \boldsymbol{\Lambda} \boldsymbol{Q}^{T} \boldsymbol{Q} \boldsymbol{\Lambda}^{T} \boldsymbol{P}^{T}=\boldsymbol{P}\left(\boldsymbol{\Lambda} \boldsymbol{\Lambda}^{T}\right) \boldsymbol{P}^{T}
$$
In other words, we can eigendecompose $\boldsymbol{X} \boldsymbol{X}^{T}$ and obtain $\boldsymbol{P}$ and $\boldsymbol{\Lambda}$ from that. When $n\ll p$, calculating the eigendecomposition of $\boldsymbol{X} \boldsymbol{X}^{T}$ scales like $O\left(n^{3}\right)$, which is much cheaper than the eigendecomposition of $\boldsymbol{X}^{T} \boldsymbol{X}$, which scales like $O\left(p^{3}\right)$. It is also much easier to store and work with an $n \times n$ matrix than a $p \times p$ matrix in this setting.
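A sketch of this shortcut (assuming NumPy; the sizes are arbitrary and the direct SVD of $\boldsymbol{X}$ is used only as a reference answer):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 20, 2000                          # n much smaller than p
X = rng.standard_normal((n, p))
X = X - X.mean(axis=0)                   # column-centred data matrix

# Eigendecompose the small n x n matrix X X^T = P (Lambda Lambda^T) P^T.
G = X @ X.T
eigvals, P = np.linalg.eigh(G)
order = np.argsort(eigvals)[::-1]
eigvals, P = eigvals[order], P[:, order]
sing_vals = np.sqrt(np.clip(eigvals, 0, None))   # singular values of X
Z_fast = P * sing_vals                   # scores Z = P Lambda (the non-zero n x n block)

# Reference: scores from the SVD of X itself.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z_svd = U * s

# The two agree up to the arbitrary sign of each column (eigenvectors are sign-ambiguous).
print(np.allclose(np.abs(Z_fast), np.abs(Z_svd)))
```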


hbghlyj (OP), posted 2022-6-23 15:46


3.8 PCA as minimizing reconstruction error

There is another way to derive principal components analysis that uses the idea that we might try to find the low-dimensional approximation that is as close as possible to the original data.

If $\mathbf{X}$ is our $n \times p$ data matrix then a rank-1 approximation to the matrix can be constructed as $$ \widetilde{\mathbf{X}}=z_{1} w_{1}^{T} $$ where $z_{1}$ is an $n$-column vector and $w_{1}$ is a $p$-column vector. Together the product $z_{1} w_{1}^{T}$ is an $n \times p$ matrix, and the idea is that we find the best pair of vectors $z_{1}$ and $w_{1}$ to minimize the distance between $\mathbf{X}$ and $\widetilde{\mathbf{X}}$.

We can measure the distance between $\mathbf{X}$ and $\widetilde{\mathbf{X}}$ by the sum of the squared differences between the two matrices.
The $i$ th row of $\mathbf{X}$ is $x_{i}$. The $i$ th row of $\widetilde{\mathbf{X}}$ is $z_{i 1} w_{1}$. \begin{align} J\left(z_{1}, w_{1}\right) &=\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}-z_{i 1} w_{1}\right)^{T}\left(x_{i}-z_{i 1} w_{1}\right) \\ &=\frac{1}{n} \sum_{i=1}^{n}\left[x_{i}^{T} x_{i}-2 z_{i 1} w_{1}^{T} x_{i}+w_{1}^{T} z_{i 1}^{T} z_{i 1} w_{1}\right]\\ &=\frac{1}{n} \sum_{i=1}^{n}\left[x_{i}^{T} x_{i}-2 z_{i 1} w_{1}^{T} x_{i}+z_{i 1}^{2}\left(w_{1}^{T} w_{1}\right)\right] \end{align} Since we can arbitrarily scale $z_{1}$ and $w_{1}$ so that the product $\tilde{\mathbf{X}}=z_{1} w_{1}^{T}$ stays the same, we can add the constraint that $w_{1}^{T} w_{1}=1$. Taking derivatives wrt $z_{i 1}$ and equating to zero gives $$ \frac{\partial}{\partial z_{i 1}} J\left(z_{1}, w_{1}\right)=\frac{1}{n}\left[-2 w_{1}^{T} x_{i}+2 z_{i 1}\right]=0 \Rightarrow z_{i 1}=w_{1}^{T} x_{i}=x_{i}^{T} w_{1} $$ or in matrix form $$ z_{1}=\mathbf{X} w_{1} $$ Plugging this back in gives \begin{aligned} J\left(w_{1}\right)&=\frac{1}{n} \sum_{i=1}^{n} x_{i}^{T} x_{i}-\frac{1}{n} \sum_{i=1}^{n} z_{i 1}^{2}\\ &=\frac{1}{n} \sum_{i=1}^{n} x_{i}^{T} x_{i}-\frac{1}{n} \sum_{i=1}^{n}\left(w_{1}^{T} x_{i}\right)\left(w_{1}^{T} x_{i}\right)^{T} \\ &=\text { constant }-\frac{1}{n} \sum_{i=1}^{n} w_{1}^{T} x_{i} x_{i}^{T} w_{1} \end{aligned} Now if we assume that the columns of $\mathbf{X}$ have been mean centered then $$ \hat{\mathbf{\Sigma}}=\frac{1}{n} \mathbf{X}^{T} \mathbf{X}=\frac{1}{n} \sum_{i=1}^{n} x_{i} x_{i}^{T} $$ which gives \begin{align} J\left(w_{1}\right)=\quad \text { constant }-w_{1}^{T} \hat{\boldsymbol{\Sigma}} w_{1} \end{align} To minimize this we need to maximize $w_{1}^{T} \hat{\boldsymbol{\Sigma}} w_{1}$ subject to the constraint $w_{1}^{T} w_{1}=1$. We can replace $\hat{\boldsymbol{\Sigma}}$ with $\mathbf{S}$ since $\mathbf{S} \propto \hat{\boldsymbol{\Sigma}}$ and solve using Lagrange multipliers $$ \max _{w_{1}} w_{1}^{T} \mathbf{S} w_{1} \quad \text { subject to } \quad w_{1}^{T} w_{1}=1 $$ which leads to the eigenvalue equation $$ \mathbf{S} w_{1}=\lambda_{1} w_{1} $$ This is the same equation we had before (see eqn 1), so $w_{1}$ and $\lambda_{1}$ are an eigenvector and eigenvalue of $\mathbf{S}$ respectively. Since $w_{1}^{T} \mathbf{S} w_{1}=\lambda_{1}$ is what we are trying to maximize we choose $\lambda_{1}$ to be the largest eigenvalue of $\mathbf{S}$ and $w_{1}$ its associated eigenvector. Thus the best rank-1 approximation to $\mathbf{X}$ is $z_{1} w_{1}^{T}=\mathbf{X} w_{1} w_{1}^{T}$
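A sketch of the rank-1 result (assuming NumPy and column-centred data; the cross-check against the rank-1 truncation of the SVD is an extra step, not part of the notes' derivation):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.standard_normal((50, 6)) @ rng.standard_normal((6, 6))  # correlated variables
X = X - X.mean(axis=0)                  # the derivation assumes centred columns

S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
w1 = eigvecs[:, -1]                     # eigenvector of the largest eigenvalue
z1 = X @ w1                             # optimal scores z_i1 = x_i^T w_1
X_rank1 = np.outer(z1, w1)              # best rank-1 approximation X w_1 w_1^T

# Cross-check: it matches the rank-1 truncation of the SVD of X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_svd1 = s[0] * np.outer(U[:, 0], Vt[0])

print(np.allclose(X_rank1, X_svd1))
print(np.linalg.norm(X - X_rank1))      # the reconstruction error being minimized (up to 1/n)
```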

This approach can be extended to find the best rank-2 approximation as $$ \mathbf{X} \approx z_{1} w_{1}^{T}+z_{2} w_{2}^{T} $$ using the constraints $w_{2}^{T} w_{2}=1$ and $w_{2}^{T} w_{1}=0$ and assuming that we have already estimated $w_{1}$ and $z_{1}$. It can be shown that $w_{2}$ is the eigenvector associated with the second largest eigenvalue of $\mathbf{S}$.

A general extension to the best rank-$q$ approximation to $\mathbf{X}$ then follows from that, and can be written as $$ \boldsymbol{X} \approx \sum_{i=1}^{q} z_{i} w_{i}^{T} $$
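A short sketch of the rank-$q$ approximation (assuming NumPy and column-centred data): the sum $\sum_{i=1}^{q} z_{i} w_{i}^{T}$ is just $\mathbf{X} W_{q} W_{q}^{T}$, where $W_{q}$ holds the first $q$ eigenvectors of $\mathbf{S}$ as columns.

```python
import numpy as np

def rank_q_approx(X, q):
    """Rank-q PCA approximation of a column-centred matrix X (a sketch)."""
    S = np.cov(X, rowvar=False)
    eigvals, V = np.linalg.eigh(S)
    W = V[:, np.argsort(eigvals)[::-1][:q]]    # first q eigenvectors w_1, ..., w_q
    Z = X @ W                                  # their scores z_1, ..., z_q as columns
    return Z @ W.T                             # equals sum_i z_i w_i^T

rng = np.random.default_rng(9)
X = rng.standard_normal((100, 8)) @ rng.standard_normal((8, 8))
X = X - X.mean(axis=0)

for q in (1, 2, 4, 8):
    print(q, round(np.linalg.norm(X - rank_q_approx(X, q)), 3))  # error shrinks as q grows
```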

3.8.1 Using PCA for compression

The data matrix $\mathbf{X}$ has a total of $n p$ elements, and the best rank-$q$ approximation above has $q(n+p)$ elements. Also, the derivation assumed that $\mathbf{X}$ had been mean centered, so we really need to store the column mean vector $\hat{\mu}$ as well, which adds another $p$ values.

So if we can find a relatively small value of $q$, with $q(n+p)+p\ll n p$, for which the distance between the rank-$q$ approximation and $\mathbf{X}$ is small, then the approximation can be used to compactly store the majority of the information about the dataset $\mathbf{X}$. This is an example of data compression and can be useful when storing or transmitting a dataset.
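For a feel of the numbers (hypothetical sizes, not from the notes): with $n=1000$, $p=500$ and $q=10$ the compressed representation needs about 3% of the storage of the full matrix.

```python
# Illustrative storage counts for PCA-based compression (hypothetical sizes).
n, p, q = 1000, 500, 10
full_storage = n * p                    # elements in X
compressed = q * (n + p) + p            # q score vectors, q loading vectors, column means
print(full_storage, compressed, compressed / full_storage)   # 500000 15500 0.031
```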
