The Beauty in Simplicity

Manifold learning

This week’s reading focused on manifold learning in Optimal Manifold Representation of Data: An Information Theoretic Approach by Chigirev and Bialek (2003). Central to manifold learning is the notion that many high-dimensional datasets can be explained by a few latent variables: rather than filling the entire probability space, the data lies on a lower-dimensional manifold. The dimensionality of this manifold is then the dimensionality of the latent space, and its coordinate system represents the latent variables. Before delving further into manifold learning, it may be helpful to review how one might approach dimensionality reduction in a simpler setting. Let’s have a look at principal component analysis (PCA).

Principal component analysis

Suppose we have a design matrix $X \in \mathbb{R}^{n \times p}$, where $x_1, \dots, x_n$ are the row vectors of $X$. Then, the singular value decomposition (SVD) of $X$ gives us

$$ \begin{align} X &= U \Sigma W^{\intercal} \nonumber \end{align} $$

where the columns of $U \in \mathbb{R}^{n \times n}$ are the left singular vectors of $X$, the columns of $W \in \mathbb{R}^{p \times p}$ are the right singular vectors of $X$, and $\Sigma \in \mathbb{R}^{n \times p}$ is a rectangular diagonal matrix with the singular values $\sigma_{k}$, $k = 1, \dots, p$, along its main diagonal. Since the columns of $U$ and $W$ are orthonormal, we can also write $X^{\intercal}X$ as

$$ \begin{align} X^\intercal X &= W \Sigma^{\intercal} U^{\intercal} U \Sigma W^{\intercal} \nonumber \newline &= W \Sigma^\intercal \Sigma W^{\intercal} \nonumber \newline &= W \Sigma^2 W^{\intercal} \label{svd} \end{align} $$

Observe that $X^{\intercal}X$ is square (with dimension $p$) and symmetric (and thus diagonalisable). Let $Q \in \mathbb{R}^{p \times p}$ be the orthogonal matrix of eigenvectors of $X^{\intercal}X$ and $\Lambda \in \mathbb{R}^{p \times p}$ be the diagonal matrix of corresponding eigenvalues. Then the resulting eigendecomposition of $X^{\intercal}X$ is

$$ \begin{align} X^{\intercal} X &= Q \Lambda Q^{-1} \label{eigendecomp} \end{align} $$

Equating $\eqref{eigendecomp}$ and $\eqref{svd}$, we have thus shown that the eigenvectors of $X^{\intercal}X$ are the right singular vectors of $X$, and that the square roots of the eigenvalues of $X^{\intercal}X$ are the singular values of $X$.
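
This relationship is easy to verify numerically. Below is a minimal numpy sketch (the random $X$ is just a placeholder dataset): the right singular vectors of $X$ match the eigenvectors of $X^{\intercal}X$ up to sign, and the squared singular values match its eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                        # n = 100 samples, p = 5 features

U, s, Wt = np.linalg.svd(X, full_matrices=False)     # X = U diag(s) W^T
evals, Q = np.linalg.eigh(X.T @ X)                   # eigendecomposition of X^T X
order = np.argsort(evals)[::-1]                      # sort eigenvalues in decreasing order
evals, Q = evals[order], Q[:, order]

print(np.allclose(s ** 2, evals))                    # squared singular values == eigenvalues
print(np.allclose(np.abs(Wt), np.abs(Q.T)))          # right singular vectors == eigenvectors (up to sign)
```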

Now, recall that PCA is defined as an orthogonal linear transformation that maps the data to a new coordinate system such that the variance of the data along the $k^{th}$ coordinate is the $k^{th}$ largest, for $k = 1, \dots, p$. More precisely, let the transformation be defined by a set of unit vectors $w_{k} \in \mathbb{R}^{p}$ applied to each row $x_{i} \in \mathbb{R}^{p}$ of $X$, such that the vector of principal component scores of $x_i$ is $t_{i} = [t_{i,1}, \dots, t_{i,p}]^{\intercal}$, where each entry of $t_{i}$ is given by

$$ \begin{align} t_{i,k} &= x_{i} \cdot w_{k} = \sum_{j=1}^{p} x_{i,j} \, w_{k,j} \qquad \forall k = 1, \dots, p \nonumber \end{align} $$

Define $X_{(k)}$ as $X$ after the first $k - 1$ principal components are removed. It follows that

$$ \begin{align} X_{(k)} &= X - \sum_{s=1}^{k-1} X w_{s} w_{s}^{\intercal} \nonumber \end{align} $$

and the $k^{th}$ weight vector $w_{k}$ can be computed as

$$ \begin{align} w_{k} &= \underset{\Vert w \Vert = 1}{\mathrm{argmax}} \left \{ \Vert X_{(k)} w \Vert ^2 \right \} \nonumber \newline &= \underset{w}{\mathrm{argmax}} \left\{\frac{w^{\intercal} X_{(k)}^{\intercal} X_{(k)} w }{w^{\intercal}w} \right \} \label{optimprob} \end{align} $$

Since $X_{(k)}^{\intercal}X_{(k)}$ is symmetric and positive semi-definite, the solution to $\eqref{optimprob}$ is its leading eigenvector, which (by the deflation above) is the eigenvector of $X^{\intercal}X$ associated with its $k^{th}$ largest eigenvalue, and the value of the objective is that eigenvalue. It follows that the columns of $W$ are $w_{1}, \dots, w_{p}$, and the full principal components decomposition of $X$ can be given as

$$ \begin{align} T &= XW \label{fullpca} \end{align} $$

which maps a data vector $x_{i}$ from an original space of $p$ features to a new space of $p$ features that are uncorrelated over the dataset. Dimensionality reduction can thus be achieved by keeping only the first $l < p$ principal components in $\eqref{fullpca}$, which gives the truncated transformation

$$ \begin{align} T_{l} &= X W_{l} \nonumber \end{align} $$

where $W_{l} \in \mathbb{R}^{p \times l}$ and $T_{l} \in \mathbb{R}^{n \times l}$; the columns of $W_{l}$ are the first $l$ columns of $W$, and therefore the first $l$ eigenvectors of $X^{\intercal}X$.
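
As a concrete illustration, here is a minimal numpy sketch of the truncated transformation $T_{l} = X W_{l}$ computed via the SVD (the three-feature toy data, the centring step and the choice $l = 2$ are illustrative assumptions, not part of the derivation above):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])  # one direction carries little variance
X = X - X.mean(axis=0)                                    # centre the columns, as is conventional for PCA

l = 2
U, s, Wt = np.linalg.svd(X, full_matrices=False)          # X = U diag(s) W^T
W_l = Wt[:l].T                                            # first l right singular vectors (p x l)
T_l = X @ W_l                                             # principal component scores (n x l)

print(T_l.shape)                                          # (200, 2)
print(np.round(np.cov(T_l, rowvar=False), 6))             # off-diagonals ~ 0: scores are uncorrelated
```

The printed covariance matrix of the scores is (numerically) diagonal, which is exactly the sense in which the new features are uncorrelated over the dataset.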

Generalising dimensionality reduction

In summary, PCA and other traditional tools for dimensionality reduction project the original dataset onto a hyperplane, which makes the reduced coordinates easy to interpret. However, they are unable to account for nonlinear correlations in a dataset; doing so requires relaxing the assumption that the data can be modelled as a hyperplane and instead allowing it to lie on a general low-dimensional manifold of unknown shape and dimensionality. Other attempts to accommodate nonlinear correlations map the data into a higher-dimensional feature space in the hope that the correlations become linearised there, so that PCA can then be applied.

Chigirev and Bialek (2003) formulate dimensionality reduction as a compression problem, where the bottleneck is the amount of information the manifold coordinates can carry about the data. The optimal manifold is then the best reconstruction of the original dataset under the restriction that its coordinates can only be transmitted through a channel of fixed capacity.

The optimal manifold

Suppose that we have a dataset $X$ in a high-dimensional state space $\mathbb{R}^{D}$ described by density $p(x)$, and we want to find a simpler description of the data. One possibility is to define a manifold $\mathcal{M}$ and a stochastic map $P_{\mathcal{M}}: x \rightarrow P_{\mathcal{M}}(\mu \mid x)$ to points $\mu$ on the manifold, which we say defines a manifold description of $X$. Now, let the distortion measure $D(\mathcal{M}, P_{\mathcal{M}}, p)$ be defined as

$$ \begin{align} D(\mathcal{M}, P_{\mathcal{M}}, p) &= \int_{x \in \mathbb{R}^{D}} \int_{\mu \in \mathcal{M}} p(x) P_{\mathcal{M}} (\mu \mid x) | x - \mu |^2 \quad d\mu \, d^{D} x \label{distortion} \end{align} $$

The stochastic map $P_{\mathcal{M}}(\mu \mid x)$ and the density $p(x)$ define a joint probability function $P(\mathcal{M}, X)$ that allows us to calculate the mutual information between the data and its manifold representation:

$$ \begin{align} I(X, \mathcal{M}) &= \int_{x \in \mathbb{R}^{D}} \int_{\mu \in \mathcal{M}} P(x, \mu) \log \left[ \frac{P(x, \mu)}{p(x) P_{\mathcal{M}}(\mu)} \right] \quad d\mu \, d^{D} x \label{mutualinfo} \end{align} $$

This information measure tells us how many bits (on average) are required to encode $x$ into $\mu$, or equivalently the capacity of the channel needed to transmit the compressed data. Ideally, we want a manifold description $\{\mathcal{M}, P_{\mathcal{M}}\}$ with low distortion $D(\mathcal{M}, P_{\mathcal{M}}, p)$ and good compression (low $I(X, \mathcal{M})$); the more bits we are willing to spend on describing the data, the more detailed a manifold we can construct. This is formalised in the concept of an optimal manifold: given a dataset $X$ and channel capacity $I$, a manifold description $\{\mathcal{M}, P_{\mathcal{M}}\}$ that minimises the distortion $D(\mathcal{M}, P_{\mathcal{M}}, p)$ while requiring only information $I$ to represent an element of $X$ is called an optimal manifold $\mathcal{M}(I,X)$.
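
To make the trade-off concrete, the sketch below evaluates the empirical distortion and mutual information for a finite sample and a discrete set of candidate manifold points, anticipating the discretisation used later on (numpy; the toy data, the choice of $K = 5$ candidate points and the uniform map are all illustrative assumptions):

```python
import numpy as np

def distortion_and_information(X, gamma, P):
    """Empirical D and I for samples X (N x D), manifold points gamma (K x D),
    and a stochastic map P of shape (K, N) with P[k, i] = P(mu_k | x_i)."""
    N = X.shape[0]
    sq_dists = ((X[None, :, :] - gamma[:, None, :]) ** 2).sum(-1)  # (K, N): |x_i - gamma_k|^2
    D = (P * sq_dists).sum() / N                                   # distortion, averaged over p(x) = 1/N
    Pk = P.mean(axis=1, keepdims=True)                             # marginal P(mu_k)
    with np.errstate(divide="ignore", invalid="ignore"):
        I = np.where(P > 0, P * np.log(P / Pk), 0.0).sum() / N     # mutual information, in nats
    return D, I

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2))
gamma = X[rng.choice(50, size=5, replace=False)]    # 5 candidate manifold points
P = np.full((5, 50), 1 / 5)                         # uninformative map, so I = 0
print(distortion_and_information(X, gamma, P))
```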

To find the optimal manifold, we must solve a constrained optimisation problem. For Lagrange multiplier $\lambda$, consider the functional $\mathcal{F}(\mathcal{M}, P_{\mathcal{M}}) = D + \lambda I$. Then, for $d \le D$, let $\gamma: \mathbb{R}^{d} \rightarrow \mathcal{M}$ map points $t \in \mathbb{R}^{d}$ onto the manifold, so that reparameterising $\mathcal{M}$ with $t$ gives

$$ \begin{align} D &= \int_{x \in \mathbb{R}^{D}} \int_{t \in \mathbb{R}^{d}} p(x) P(t \mid x) | x - \gamma(t) |^2 \quad d^{d} t \, d^{D} x \nonumber \newline I &= \int_{x \in \mathbb{R}^{D}} \int_{t \in \mathbb{R}^{d}} p(x) P(t \mid x) \log \left[ \frac{P(t \mid x)}{P(t)} \right] \quad d^{d} t \, d^{D} x \nonumber \newline \mathcal{F}(\gamma(t), P(t \mid x)) &= D + \lambda I \nonumber \end{align} $$

First-order conditions stipulate that $\frac{\delta \mathcal{F}}{\delta \gamma(t)} = 0$ and $\frac{\delta \mathcal{F}}{\delta P(t \mid x)} = 0$, giving the following set of self-consistent equations

$$ \begin{align} P(t) &= \int_{x \in \mathbb{R}^{D}} p(x) P(t \mid x) \quad d^{D} x \label{foc1} \newline \gamma(t) &= \frac{1}{P(t)} \int_{x \in \mathbb{R}^{D}} x p(x) P(t \mid x) \quad d^{D} x \label{foc2} \newline P(t \mid x) &= \frac{P(t)}{\Pi(x)} \exp \left( -\frac{1}{\lambda} |x - \gamma(t) |^2 \right) \label{foc3} \end{align} $$

where $\Pi(x)$ is a normalising constant for $P(t \mid x)$.
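
As a quick sanity check of $\eqref{foc2}$: only the distortion term of $\mathcal{F}$ depends on $\gamma(t)$, so setting its functional derivative to zero gives

$$ \begin{align} \frac{\delta \mathcal{F}}{\delta \gamma(t)} &= -2 \int_{x \in \mathbb{R}^{D}} p(x) P(t \mid x) \left( x - \gamma(t) \right) \quad d^{D} x = 0 \nonumber \newline \gamma(t) \int_{x \in \mathbb{R}^{D}} p(x) P(t \mid x) \quad d^{D} x &= \int_{x \in \mathbb{R}^{D}} x \, p(x) P(t \mid x) \quad d^{D} x \nonumber \end{align} $$

which, combined with $\eqref{foc1}$, recovers $\eqref{foc2}$: each manifold point $\gamma(t)$ is the mean of the data assigned to it, weighted by $P(t \mid x)$.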

A Blahut–Arimoto algorithm

In practice, we do not have the complete density $p(x)$, only a finite number of samples. The density is thus approximated as $p(x) = \frac{1}{N} \sum_{i=1}^{N} \delta (x - x_i)$, where $N$ is the number of samples, $i$ is the sample label, and $x_i$ is the vector describing the $i^{th}$ sample. Similarly, instead of $t$ being a continuous variable, we use a discrete set $t \in \{t_1, \dots, t_K\}$ of $K$ points to model the manifold. Then $P(t \mid x)$ becomes $P_k(x_i)$, $\gamma(t)$ becomes $\gamma_k$, and $P(t)$ becomes $P_k$. Let $n$ be the iteration number, and $\alpha = 1, \dots, D$ be a coordinate index in $\mathbb{R}^{D}$. Then, in place of $\eqref{foc1}$, $\eqref{foc2}$ and $\eqref{foc3}$, we have the iterative scheme

$$ \begin{align} P_k^{(n)} &= \frac{1}{N} \sum_{i=1}^{N} P_k^{(n)}(x_i) \label{alg1} \newline \gamma_{k, \alpha}^{(n)} &= \frac{1}{P_k^{(n)}} \frac{1}{N} \sum_{i=1}^{N} x_{i,\alpha} P_{k}^{(n)} (x_i) \label{alg2} \newline P_{k}^{(n+1)}(x_i) &= \frac{P_k^{(n)}}{\Pi^{(n)}(x_i)} \exp \left\{-\frac{1}{\lambda} | x_i - \gamma_k^{(n)} |^2 \right\} \label{alg3} \end{align} $$

where $\gamma_k^{(0)}$ and $P_k^{(0)}(x_i)$ can be initialised by choosing $K$ points at random from the dataset and letting $\gamma_k^{(0)} = x_{i,(k)}$ and $P_k^{(0)}(x_i) = \frac{1}{K}$. The stopping criterion is $\max_{k} |\gamma_k^{(n)} - \gamma_k^{(n-1)}| < \epsilon$, where $\epsilon$ determines the precision with which the manifold points are located. Note that the algorithm also takes the trade-off $\lambda = -\frac{\delta D}{\delta I}$ between distortion and information as a parameter. That is, for a given value of $I$, we can tune $\lambda$ and rerun the algorithm to obtain a manifold description $(\mathcal{M}, P(\mathcal{M} \mid X))$ whose information cost is arbitrarily close to the given $I$.
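
A compact numpy sketch of this iterative scheme is given below (the choices of $K$, $\lambda$, $\epsilon$ and the noisy-circle toy data are placeholders, not values from the paper):

```python
import numpy as np

def optimal_manifold(X, K=20, lam=0.05, eps=1e-6, max_iter=1000, seed=0):
    """Iterate (alg1)-(alg3) on samples X of shape (N, D); returns the manifold
    points gamma (K, D) and the stochastic map P (K, N)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    gamma = X[rng.choice(N, size=K, replace=False)].copy()  # gamma_k^(0): K random data points
    Pk = np.full(K, 1.0 / K)                                # P_k^(0), implied by P_k^(0)(x_i) = 1/K

    for _ in range(max_iter):
        # (alg3): update the stochastic map from the current marginal and manifold points
        sq = ((X[None, :, :] - gamma[:, None, :]) ** 2).sum(-1)   # (K, N): |x_i - gamma_k|^2
        logP = np.log(Pk)[:, None] - sq / lam                     # unnormalised map, in log space
        P = np.exp(logP - logP.max(axis=0))                       # stabilise before exponentiating
        P /= P.sum(axis=0)                                        # normalise over k (the Pi(x_i) factor)
        # (alg1) and (alg2): update the marginal P_k and the manifold points gamma_k
        Pk = np.maximum(P.mean(axis=1), 1e-300)                   # guard against empty manifold points
        gamma_new = (P @ X) / (N * Pk[:, None])                   # P_k-weighted means of the data
        if np.linalg.norm(gamma_new - gamma, axis=1).max() < eps: # stopping criterion on the gamma_k
            gamma = gamma_new
            break
        gamma = gamma_new
    return gamma, P

# Toy usage: noisy samples of a unit circle, a 1-d manifold embedded in R^2.
rng = np.random.default_rng(4)
theta = rng.uniform(0, 2 * np.pi, size=500)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(500, 2))
gamma, P = optimal_manifold(X, K=30, lam=0.05)
```

Smaller values of $\lambda$ penalise the information cost less, so the manifold points spread out and capture more detail; larger values collapse them towards a coarser description.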

Some intuition

The Blahut-Arimoto algorithm in $\eqref{alg1}$, $\eqref{alg2}$ and $\eqref{alg3}$ produces a collection of $K$ manifold points, $\gamma_k \in \mathcal{M} \subset \mathbb{R}^{D}$, and a stochastic projection map $P_k(x_i)$. Consider a ball of radius $r$ centered at some point on the manifold. As we grow the ball, the number of manifold points that fall inside it scales as $r^{d}$, where $d$ is the intrinsic dimensionality of the manifold. This will not necessarily be true for the points in the original dataset, which locally resemble the whole embedding space $\mathbb{R}^{D}$. More precisely, the correlation dimension of the points lying on the manifold remains constant, while the correlation dimension of the original dataset quickly approaches that of the embedding space as $r$ decreases.
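
This suggests a simple numerical diagnostic: estimate the correlation dimension as the slope of $\log C(r)$ against $\log r$, where $C(r)$ is the fraction of point pairs closer than $r$. A rough numpy sketch (the radii and the toy datasets below are arbitrary choices):

```python
import numpy as np

def correlation_dimension(points, radii):
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))[np.triu_indices(len(points), k=1)]  # pairwise distances
    C = np.array([(dists < r).mean() for r in radii])                          # correlation integral C(r)
    return np.polyfit(np.log(radii), np.log(C + 1e-12), 1)[0]                  # slope ~ correlation dimension

rng = np.random.default_rng(5)
theta = rng.uniform(0, 2 * np.pi, 1000)
circle = np.c_[np.cos(theta), np.sin(theta)]      # points on a 1-d manifold in R^2
blob = rng.normal(size=(1000, 2))                 # points that fill the embedding space

radii = np.logspace(-1.5, -0.5, 10)
print(correlation_dimension(circle, radii))       # roughly 1
print(correlation_dimension(blob, radii))         # roughly 2
```

The same comparison, applied to the manifold points $\gamma_k$ returned by the algorithm versus the raw data, is what reveals the intrinsic dimensionality $d$.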

More is not always better

The key idea in this paper is a more general alternative for finding low-dimensional manifolds in high-dimensional data: rather than constraining the geometry of the manifold, restrict the mutual information between points on the learned manifold and points in the original dataset. This is in contrast to most other methods, which employ explicit regularisation or deliberately constrain the geometric features of the resulting manifold. Sections 5 and 6 of the paper provide some examples of the algorithm in action and a more detailed discussion of its performance.
