Dimensionality Reduction Methods

Dimensionality reduction methods have a number of important practical applications. The more extreme form of dimensionality reduction, in which high-dimensional data is reduced to two dimensions, is often used for visualization. There are a number of approaches; we will mention two of them, each based on a different principle and with a slightly different aim.

Principal Component Analysis (PCA)

One of the best-known and most straightforward dimensionality reduction methods is principal component analysis. It is a linear method that transforms the original data into a new orthogonal basis, i.e. it rotates the axes so that the first axis captures the direction along which the data varies the most, the second axis captures the most of the remaining variation, and so on. The idea is illustrated in the figure below. Clearly, there is most variation along the direction indicated by the larger arrow; this is the first principal component. The smaller arrow corresponds to the second principal component.

PCA: the principle.

To reduce dimensionality, one simply drops some of the last, least informative principal components. The original data can then no longer be reconstructed exactly, but its most important characteristics should be preserved.
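The procedure can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `pca_reduce`, the toy data and the choice of computing the components through a singular value decomposition of the centered data are all assumptions made for this example.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project the rows of X onto the first n_components principal components.

    Hypothetical helper written for this example only.
    """
    mean = X.mean(axis=0)
    Xc = X - mean  # PCA operates on centered data
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    # Variance of the data along every principal direction.
    variances = S ** 2 / (X.shape[0] - 1)
    Z = Xc @ components.T  # coordinates in the reduced space
    X_approx = Z @ components + mean  # approximate reconstruction
    return Z, components, variances, X_approx


# Toy data (assumed): 2D points stretched along the diagonal.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, t + 0.1 * rng.normal(size=200)])

# Keep only the first principal component, i.e. reduce 2D to 1D.
Z, components, variances, X_approx = pca_reduce(X, n_components=1)
print("variance along each component:", variances)
print("mean squared reconstruction error:", np.mean((X - X_approx) ** 2))
```

In this toy data set almost all of the variance lies along the first component, so dropping the second one loses very little: the squared reconstruction error equals exactly the variation that was carried by the discarded component.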

Uniform Manifold Approximation and Projection (UMAP)

Uniform Manifold Approximation and Projection (UMAP) is a very popular nonlinear method for dimensionality reduction. It is built on a different intuition and has a slightly different aim than PCA, and it is particularly well suited to visualization.

One of the key insights is that, when trying to preserve the structure of a high-dimensional space, one should focus on small rather than large distances. If points are close to each other, they should be kept close in the reduced space as well, if possible. Larger distances may be misleading, as shown in the figure below: note the straight-line distance between the two data points indicated by the red arrow. If we look at the smaller distances between closely packed points, we discover that there is a 2D manifold embedded in the 3D space, and that the distance between the two points should actually be measured along it (as indicated by the green arrow).

Large distances can be misleading in dimensionality reduction.
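The difference between the two ways of measuring distance can be made concrete with a small sketch. It is not UMAP itself: it uses a simplified setting (a 1D curve in 2D instead of a 2D manifold in 3D) and approximates the distance along the manifold by the shortest path through a nearest-neighbour graph, which is closer to what Isomap-style methods do. The helper names `knn_graph` and `shortest_path_length`, the data and the value of `k` are assumptions made for this example.

```python
import heapq
import math


def knn_graph(points, k):
    """Undirected graph linking each point to its k nearest neighbours."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def shortest_path_length(graph, source, target):
    """Length of the shortest path in the graph (Dijkstra's algorithm)."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == target:
            return d
        if d > best.get(node, math.inf):
            continue  # stale queue entry
        for nxt, w in graph[node].items():
            nd = d + w
            if nd < best.get(nxt, math.inf):
                best[nxt] = nd
                heapq.heappush(queue, (nd, nxt))
    return math.inf  # target not reachable


# Toy manifold (assumed): 100 points along three quarters of a unit circle.
n = 100
angles = [1.5 * math.pi * i / (n - 1) for i in range(n)]
points = [(math.cos(a), math.sin(a)) for a in angles]

graph = knn_graph(points, k=2)
straight = math.dist(points[0], points[-1])  # the "red arrow"
along = shortest_path_length(graph, 0, n - 1)  # the "green arrow"
print("straight-line distance:", straight)
print("distance along the curve:", along)
```

The two end points of the curve look fairly close when measured in a straight line, yet they are the most distant pair when the distance is built up from small hops between neighbouring points.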

UMAP and similar methods cast dimensionality reduction as an optimization problem: they optimize the placement of points in the low-dimensional space so that nearby points remain close to each other in the reduced space (relying on small distances), while some of the global structure of the space is preserved as far as possible.
