Multidimensional scaling refers to a wide range of techniques which, generally speaking, all try to take measurements of distances (or similarities) and turn those measurements into coordinates of points in a low-dimensional space.
There are many, many variants of “multidimensional scaling”, with different assumptions on the measurements, and different ways to reconstruct the values. Here we will focus on only one of them, typically known as “Classical” multidimensional scaling, or CMDS. For more, the best reference is Borg and Groenen’s Modern Multidimensional Scaling^{1}.
Imagine that, instead of knowing the coordinates of a set of points, all you knew were the distance from each point to each other point. Intuitively, it seems like it should be possible to convert this information back into coordinates which respect the distances. Classical MDS attempts to do precisely that. Let’s start with the algorithm itself:
The decision of how many coordinates to use for the MDS is similar to the decision of how many principal components to choose: the smaller the values in $\Lambda$, the more precise the result will be (that is, the squared distance between $X_i$ and $X_j$ will become closer to the actual value in $D_{ij}$).
In fact, the easiest way to understand how CMDS works is to compare it directly to PCA. Specifically, let’s look at what happens if we started from an actual matrix $X$ from which to build $D$. In this case,
\[\begin{align} D_{ij} &=& || X_i - X_j ||^2 \\ D_{ij} &=& \langle X_i - X_j, X_i - X_j \rangle\\ D_{ij} &=& \langle X_i, X_i \rangle + \langle X_j, X_j \rangle - 2 \langle X_i, X_j \rangle. \end{align}\]So what we get is that every entry $i, j$ of the matrix $D$ can be represented by a sum of three terms. So we write the matrix $D$ as a sum of three matrices:
\[\begin{align} D &=& (\langle X_i, X_i \rangle)_{ij} + (\langle X_j, X_j \rangle)_{ij} + (-2 \langle X_i, X_j \rangle)_{ij}\\ D &=& A + B + C\end{align}\]Now we follow step 2 of the algorithm above, and let
\[\begin{align} P &=& -1/2 H D H \\ P &=& -1/2 H (A + B + C) H \\ P &=& -(1/2) H A H -(1/2) H B H -(1/2) H C H \end{align}\]Now note that all of the columns of $A$ are identical, because the values only depend on the row index $i$. Similarly, all of the rows in $B$ are identical, because the values only depend on the column index $j$. This means that $HAH = HBH = 0$!, since column-centering $A$ will subtract the average column value; the analogous thing happens to row-centering $B$. As a result,
\[\begin{align} P &=& -1/2 H D H \\ P &=& H (\langle X_i, X_j)_{ij} \rangle H \end{align}\]So our $P$ matrix is now exactly equal to a centered matrix of inner products of $X$, even though we never used the inner products directly – all we had access to was squared distances. So if we take $P$ to be the matrix $M$ on step 3 of our second PCA algorithm, we get to recover $X$ exactly! This is really neat.
The same algorithm also applies directly when all we have access to is a notion of similarity between points. Here the idea is even simpler. Let’s say that we have a way to give a numerical value of similarity between points such that if two points are similar to each other, the value is relatively large, and if two points are relatively dissimilar, the value is relatively small. Then we can pretend that this value is “like an inner product”, create a matrix of similarities, center the matrix in both rows and columns, and just compute the PCA directly like above. This will actually recover good coordinates the respect the similarities!
Borg, I. and Groenen, P.J.F. (2005). Modern multidimensional scaling. 2nd edition. New York: Springer. ↩