Principal Component Analysis (PCA) is a way to compress data lossily. I think it’s cool — way cooler than its dry Wikipedia page lets on — because it involves building a code defined by the structure of a dataset.

**The premise**: we have a set of $d$-dimensional vectors $\mathbf{x}$, sampled from a static probability distribution $p(\mathbf{x})$. $d$ might be uncomfortably large, and we want to find a more compact representation of the dataset that contains most of the relevant information. PCA attempts to do this using a linear encoding of the data.

You can describe PCA as doing any of the following:

- Reducing the data’s dimensionality with minimal reconstruction error
- Identifying high-variance modes in the input
- Compressing Gaussian data with minimal information loss

Under the hood, it’s just a linear transform applied to the data, like so:

$$ \bz = W \bx $$

where $W$ is a $(k, d)$ code matrix with rank $k$, and $\bz$ is the code variable. Dimensionality reduction means taking $k \le d$.

Note that if $W$ has rank $k$, so does $R = W^T W$. This *reconstruction* map therefore has $k$ eigenvectors with nonzero eigenvalues, and $d-k$ eigenvectors with zero eigenvalues, which span the null space.
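As a quick numerical sanity check (a NumPy sketch with made-up sizes $d = 5$, $k = 2$), a generic code matrix and its reconstruction map share the same rank:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 2  # illustrative sizes

W = rng.standard_normal((k, d))  # a generic (k, d) code matrix has rank k
R = W.T @ W                      # the (d, d) reconstruction map

print(np.linalg.matrix_rank(W))  # 2
print(np.linalg.matrix_rank(R))  # 2: W^T W inherits the rank of W
```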

Applying this encoding to all points $\bx$ induces a new probability distribution in the code space,

$$ p(\bz) = \int d\bx \, \delta( W \bx - \bz) p(\bx) $$

– in words, the probability of a particular code vector $\bz$ is equal to the sum of all the probabilities of the data vectors $\bx$ that are mapped to it.

The obvious question to ask is: how much information is preserved about the input distribution? Intuitively, it makes sense that if $k=d$, no information should be lost, but that as $k$ is reduced, some information *might* be. This can be quantified by the mutual information between the code’s input and output:

$$ I(\bx, \bz) = H(\bx) + H(\bz) - H(\bx, \bz) $$

where $H$ is the Shannon entropy. Since the mapping $\bx \to \bz$ is deterministic, this just reduces to $I(\bx, \bz) = H(\bz)$, the code entropy. Applying the definition $H = -\int d\bx p(\bx) \log p(\bx)$ and a little algebra gives

$$ H(\bz) = -\int d\bx \, p(\bx) \log \left[ \int d\bx' \delta(W(\bx' - \bx)) p(\bx') \right] $$
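The identity $I(\bx, \bz) = H(\bz)$ for a deterministic encoder is easiest to see in a discrete toy example (not from the derivation above, just an analogy): merge symbols and measure the surviving entropy.

```python
import numpy as np

# Discrete analogue: x uniform over 4 symbols, the code z = x mod 2 merges pairs.
p_x = np.array([0.25, 0.25, 0.25, 0.25])
z_of_x = np.array([0, 1, 0, 1])  # deterministic encoding x -> z

# p(z) sums the probabilities of all x mapped to z (the delta-function integral)
p_z = np.array([p_x[z_of_x == z].sum() for z in (0, 1)])
H_z = -np.sum(p_z * np.log2(p_z))  # code entropy in bits

print(H_z)  # 1.0: one bit of the original two survives, and I(x, z) = H(z)
```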

Incidentally, since

$$ \int d\bx' \, \delta(W(\bx' - \bx)) p(\bx') \ge p(\bx), $$

$H(\bz) \le H(\bx)$. This is as it should be: you can’t *add* entropy by applying deterministic transformations to data, but you can lose some, by squashing it down into fewer dimensions.

Note that the delta function in this expression serves just to restrict the integral to the region of input space where $W(\bx' - \bx) = 0$, that is, the null space of $W$. So it can be written a little more elegantly after a change of variables:

$$ H(\bz) = - \int d\bx \, p(\bx) \log \int_{\textrm{Null}(W)} d\bv \, p(\bv + \bx) $$

The mutual information therefore depends on *only the choice of null space*, not on how the code matrix acts on the surviving vectors. This means we are free to normalize the code matrix so that all the nonzero eigenvalues of $R$ are one. In Dirac notation, if \(R = \sum_i r_i \ket{r_i} \bra{r_i}\) for some eigenvalues \(r_i\) and corresponding eigenvectors \(\ket{r_i} \), we can always assume \(r_1, \dots, r_k = 1\) (the reconstruction map is positive semidefinite, so its nonzero eigenvalues can be rescaled to one without touching the null space), and that $r_{k+1}, \dots, r_d$ are zero. This is very convenient, because it turns the encoding into a geometric problem: which subspace do we map to zero, and which “code” subspace do we preserve?

Note that with all eigenvalues zero or one, the reconstruction operator is a *projection* satisfying $R^2 = R$.
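In NumPy terms (a sketch with made-up sizes), normalizing the code matrix amounts to giving $W$ orthonormal rows, after which $R = W^T W$ is idempotent:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 6, 3  # illustrative sizes

# Orthonormalize k random directions: the rows of W span the code subspace
Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
W = Q.T          # (k, d) with orthonormal rows, so R's nonzero eigenvalues are all 1
R = W.T @ W

print(np.allclose(R @ R, R))  # True: R is a projection
```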

One way to pick the code space is to minimize the error associated with applying the reconstruction operator: if, on average, applying $R$ doesn’t distort the input too much, the intermediate variables $\bz$ must be storing a lot of useful information.

Assume a set of $N$ samples $\bx^i$ have been drawn from the source distribution $p(\bx)$. The mean squared error of reconstruction is

$$ E = \frac{1}{N} \sum_{i=1}^N |\bx^i - R \bx^i | ^2 $$

Form the data matrix $X$ by stacking the samples as rows, $X_{ij} = (\bx^i)_j$, and suppose for convenience that the empirical mean has already been subtracted off these data:

$$ \frac{1}{N} \sum_i \bx^i =0 $$

Now we can define the covariance matrix $C = \frac{1}{N} X^T X$; since the data are zero-mean, $C_{ij}$ gives the empirical covariance of the $i$th and $j$th features. (Technically, with the $1/N$ normalization this isn’t an unbiased estimator, but it’s easier to remember, and for large $N$ it’s close enough.) This is a positive semidefinite matrix, and so can be written

$$
C = \sum_{i=1}^d \lambda_i \ket{\lambda_i} \bra{\lambda_i}
$$

for some eigenvalues $\lambda_i \ge 0$. With these definitions, the average reconstruction error can be written

$$ E = \textrm{Tr}\left[ C - CR - RC + RCR \right] $$

where \(\textrm{Tr}\) denotes the trace operation, \( \textrm{Tr} A = \sum_i A_{ii} \). As noted above, we can restrict ourselves to normalized code matrices, for which $R^2 = R$; using the cyclic property of the trace then gives

$$ E = \textrm{Tr} C - \textrm{Tr} \left[ RC \right] = \sum_{i=1}^d \lambda_i \left( 1 - \bra{\lambda_i} R \ket{\lambda_i} \right) $$

Now it’s easy to pick a $k$-dimensional subspace to minimize this: assuming the eigenvalues are listed in descending order, just take $R$ to be the projection onto the span of $\ket{\lambda_1} \dots \ket{\lambda_k}$. These are the eponymous principal components, and give the minimal reconstruction error among all $k$-dimensional linear codes:

$$ E_{\textrm{PCA}} = \sum_{i=k+1}^d \lambda_i $$

with the code matrix

$$ W = \sum_{i=1}^k \ket{v_i} \bra{\lambda_i} $$

where $\{ \ket{v_i} \}$ is any orthonormal basis for $\mathbb{R}^k$ (orthonormality ensures $W^T W = R$).
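Putting the pieces together, here’s a minimal NumPy sketch (synthetic data, illustrative sizes) that builds the projection from the top-$k$ eigenvectors and checks $E_{\textrm{PCA}}$ against the discarded eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, k = 500, 5, 2
X = rng.standard_normal((N, d)) @ rng.standard_normal((d, d))  # correlated data
X = X - X.mean(axis=0)                                         # center

C = X.T @ X / N
lam, V = np.linalg.eigh(C)         # eigh returns eigenvalues in ascending order
lam, V = lam[::-1], V[:, ::-1]     # re-sort descending

Vk = V[:, :k]                      # principal components |lambda_1> ... |lambda_k>
R = Vk @ Vk.T                      # projection onto the code subspace
E = np.mean(np.sum((X - X @ R) ** 2, axis=1))

print(np.isclose(E, lam[k:].sum()))  # True: E_PCA is the sum of discarded eigenvalues
```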

Incidentally, an autoencoder with a single hidden layer, linear activations, and mean-squared-error cost function is trained precisely to minimize the reconstruction error on the dataset as defined above – the weight matrix just learns the code subspace.

The eigenpairs of the covariance matrix $C$ have a very simple ‘physical’ interpretation: the eigenvectors give you a set of directions in input space which are linearly uncorrelated, and the eigenvalues give you the variance of the data along each direction.

Concretely, any direction in feature space can be specified by a unit vector $\hat u = (u_1, \dots, u_d)$, which corresponds to the random variable $U = \sum_i u_i x_i$, $x_i$ being the components of $\bx$. Then, since the data are zero-mean, the covariance of $U$ with $V = \sum_i v_i x_i$ is

$$ \textrm{Cov}(U,V) = \sum_{i,j} u_i v_j \mathbb{E}[x_i x_j] = \hat{u} \cdot C \hat{v} $$

If $\hat{u}, \hat{v}$ are taken to be two distinct eigenvectors of $C$, this is zero, meaning the principal directions are uncorrelated. If they’re equal, we get the variance along a single direction. Writing the unit vector $\hat u$ in the eigenbasis of $C$, $\hat u = \sum_{\lambda} u_{\lambda} \ket{\lambda}$,

$$ \textrm{Var}(U) = \sum_{\lambda} |u_{\lambda}|^2 \lambda $$

— a weighted sum over the eigenvalues. Clearly the direction with highest variance is just the $\ket{\lambda}$ with the highest eigenvalue, and so on.
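This is easy to verify numerically (again a sketch with synthetic data): pick a random unit direction and compare the empirical variance of $U$ with the eigenvalue-weighted sum.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 2000, 3
X = rng.standard_normal((N, d)) * np.array([3.0, 1.0, 0.3])  # anisotropic data
X = X - X.mean(axis=0)
C = X.T @ X / N
lam, V = np.linalg.eigh(C)

u = rng.standard_normal(d)
u /= np.linalg.norm(u)              # a unit direction in feature space

var_direct = np.var(X @ u)          # empirical variance of U = u . x
u_lam = V.T @ u                     # components of u in the eigenbasis of C
var_spectral = np.sum(u_lam**2 * lam)

print(np.isclose(var_direct, var_spectral))  # True
```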

Suppose that instead of minimizing reconstruction error, we want to perform dimensionality reduction in a fashion that maximizes the mutual information between the inputs and code variables. How do we find $W$ in a data-dependent fashion?

I don’t know how to do this for a generic input distribution. But it turns out you can solve it exactly for a special case of input, a Gaussian, and the answer corresponds exactly to PCA.

Assume the data $\bx$ are drawn from a $d$-dimensional Gaussian distribution, that it’s zero mean, and that the coordinate axes have been rotated so as to make its covariance matrix diagonal: $\mathbb{E}[x_i x_j] = \delta_{ij} \sigma_i^2$.

The game we’re playing is finding the null space that minimizes the information loss. Well, the Gaussian distribution is like an ellipsoidal blob aligned with the coordinate axes, so by symmetry the optimal choice of null space must be defined by a subset of the coordinate unit vectors. Suppose we choose to ‘keep’ the first $k$ axes, and allow the rest of them to constitute the null space of $W$. Then the mutual information integral above is easily evaluated, since the Gaussian factorizes:

$$ \int_{\textrm{Null}(W)} d\bv \, p(\bv + \bx) = \prod_{i=1}^{k} \frac{1}{\sqrt{2 \pi \sigma_i^2}} \, e^{-x_i^2 / 2 \sigma_i^2} $$

which is just the marginal distribution over the kept axes. Taking the expectation of its negative log (the $x_i^2 / 2\sigma_i^2$ terms average to $1/2$) gives the entropy in the code space,

$$ H(\bz) = \frac{1}{2} \sum_{i=1}^k \log(2 \pi e \sigma_i^2 ) $$

Therefore, the optimal null space is defined by the directions with the smallest variance, which leads to PCA. In this case, the information loss in encoding is known exactly, from the sum of the log-variances that were discarded.

What if the data aren’t Gaussian? It’s hard to say exactly how much information the code retains, but you can get an upper bound on it. From the inequality

$$ H(\bz) = H(z_1, \dots, z_k) \le H(z_1) + \dots + H(z_k) $$

and the fact that each $z_i$ has known mean (zero) and variance ($\sigma_i^2 = \lambda_i$), we can use the maximum-entropy property of the Gaussian at fixed mean and variance:

$$ H(z_i) \le H_{\textrm{gaussian}}(0, \sigma_i^2) = \frac{1}{2} \log \left( 2 \pi e \sigma_i^2 \right) $$

so that

$$ H(\bz) \le \sum_{i=1}^k \frac{1}{2} \log \left( 2 \pi e \sigma_i^2 \right) $$

So once you’ve got the eigenvalues of the covariance matrix, you can always (in principle) check whether PCA provides you with enough bandwidth.
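As a sketch of that check (the spectrum below is made up), both the bound and the exact Gaussian loss come straight from the eigenvalues, in nats:

```python
import numpy as np

# Hypothetical covariance spectrum, eigenvalues in descending order
lam = np.array([4.0, 2.0, 1.0, 0.5, 0.1])
k = 3  # size of the code space

# Upper bound on the code entropy: sum of per-coordinate Gaussian entropies
H_bound = 0.5 * np.sum(np.log(2 * np.pi * np.e * lam[:k]))

# If the data really were Gaussian, the information lost in encoding is the
# entropy of the discarded directions
H_lost = 0.5 * np.sum(np.log(2 * np.pi * np.e * lam[k:]))

print(H_bound, H_lost)
```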