# Intro

High-dimensional geometry is one of those topics that some researchers spend their entire careers on. While we will only be able to spend one lecture on the topic, this should be enough for you to gain the intuition we need to explain a lot of the trouble we find ourselves in when doing ML.

The goal for this lecture is to convince you that Euclidean spaces with many dimensions look nothing like the three-dimensional space we live in. We will spend some time unlearning things that seem like they should be true, and some time learning “what high-dimensional space feels like”.

## Background: tail bounds

We need to use a little bit of probability in this lecture. Specifically, we will need to know about tail bounds: ways to characterize the behavior of random variables far away from the center (at “the tails”). These are all inequalities, and they all say, in different ways, that it’s unlikely (i.e., low probability) that a draw from a random variable will be far from its expectation.

Tail bounds are exceedingly useful because calculating expectations of random variables is an easy thing to do, but calculating probabilities of extreme events directly is a very hard thing to do. Tail bounds provide us a way to convert statements about expectations to statements about probabilities, at the expense of some looseness coming from the inequality. We will see that more sophisticated tail bounds are progressively less loose.

### Markov’s inequality

I find it easier to remember Markov’s inequality by using a mnemonic: “If the average population height is 5 feet, a random person is at most 10% likely to be taller than 50 feet.” Markov’s inequality is merely a generalization of the idea. Instead of height and 5 feet, it’s any non-negative random variable $X$ and its expectation. Instead of 50 feet, it’s any threshold $a > 0$:

$P(X \ge a) \le \frac{E[X]}{a}$
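
As a quick sanity check, here is a minimal sketch that compares the Markov bound with an empirical tail frequency. The exponential distribution with mean 5 is only an illustrative assumption (any non-negative variable would do), and the helper names are made up for this example.

```python
import random

def markov_bound(mean, a):
    """Markov's upper bound on P(X >= a) for a non-negative X with E[X] = mean."""
    return min(1.0, mean / a)

def empirical_tail(samples, a):
    """Fraction of samples that are at least a."""
    return sum(x >= a for x in samples) / len(samples)

rng = random.Random(0)
# Assumed example distribution: exponential with mean 5 (rate 1/5).
samples = [rng.expovariate(1 / 5) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)

for a in (10, 25, 50):
    # The empirical tail should never exceed the bound, usually by a wide margin.
    print(a, empirical_tail(samples, a), markov_bound(sample_mean, a))
```

The gap between the two printed columns is the looseness of the bound for this particular distribution.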

While Markov’s inequality is the most fundamental tail bound, it’s fairly loose. For example, in the US the average male is $5.86$ feet tall, but clearly the probability of being $58$ feet tall is far less than $10\%$. What we’re not taking into account here is the standard deviation of the population: (roughly) the expected distance of any point from the mean of the population.¹

If we know that the standard deviation is bounded, how can we take advantage of that?

### Chebyshev’s inequality

Chebyshev’s inequality comes out directly from applying Markov’s inequality to the squared distance of a draw from a variable $X$ to its mean $\mu$, $(X - \mu)^2$, using the probability that this is larger than $(k\sigma)^2$:

$\begin{eqnarray*} P((X - \mu)^2 \ge (k\sigma)^2) & \le & \frac{E[(X - \mu)^2]}{(k\sigma)^2} \\ P (|X - \mu| \ge k\sigma) & \le & \frac{\sigma^2}{k^2 \sigma^2} \\ & = & \frac{1}{k^2} \end{eqnarray*}$

Now we can add more information to our original problem, namely that the standard deviation of male height is about 4 inches, or $1/3$ of a foot. In our original problem, 58 feet is $58 - 5.86 = 52.14$ feet above the mean, which is about 156 standard deviations, so the probability of seeing someone that tall is at most $1/156^2 \approx 1/24{,}000$. This is much closer to our intuition than 1/10, but it’s still not great, right? Clearly it’s not the case that about one in twenty-four thousand men is 58 feet tall.

So what do we do? This is where the math gets more intricate. The important intuition to keep in mind is that the next tail bounds we will study bound not only the standard deviation (which is about the expectation of squared distances to the mean), but the expectations of all powers of the distance to the mean. This will let us get exponentially better bounds.

### Exponential tail bounds, Chernoff bound

The exposition here follows that of Martin Wainwright’s course, and specifically, Chapter 2 of the course notes.

The main theoretical trick to achieve better bounds is to study not only the deviations of the values and their squares, but all powers at the same time. We do this by studying the random variable $Y = e^{\lambda(X - \mu)}$ for some $\lambda > 0$.² Through the Taylor series of the exponential, this gives a non-negative random variable that combines all powers, which we plug into Markov’s inequality to get

$P[X - \mu \ge t] = P[e^{\lambda(X - \mu)} \ge e^{\lambda t}] \le \frac{E[e^{\lambda(X - \mu)}]}{e^{\lambda t}}$

Taking the log of the left and right hand sides, we get an expression that is true for all values of $\lambda$ for which the expectation is finite, say $\lambda \in [0, b]$, and in which only the right side depends on $\lambda$. So we should pick $\lambda$ to get the tightest bound possible. This is “the” Chernoff bound:

$\log P[X - \mu \ge t] \le - \sup_{\lambda \in [0, b]} \left \{ \lambda t - \log E [e^{\lambda (X - \mu)}] \right \}$

Where things get tricky is that there are different Chernoff bounds for different random variables (since when you plug in an actual specific random variable, you also need to pick the best $\lambda$). As a result, instead of a single bound, everyone who uses Chernoff bounds picks whatever specific bound they need, and yet they all call it the same thing. So one lesson that helps understand the literature is that you should think of it as “a Chernoff-style bound” rather than “the Chernoff bound”.

In addition, we’re not going to derive all of these bounds ourselves because it’s not very useful for where we’re going. We’re just going to use a few of them.

(Upper) Chernoff bound for Gaussians:

$P[X - \mu \ge t] \le \exp \left \{ -\frac{t^2}{2\sigma^2} \right \}, \forall t \ge 0.$
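
To show what “picking the best $\lambda$” looks like in one concrete case, here is a short derivation sketch. It relies on the standard fact that a Gaussian $X \sim N(\mu, \sigma^2)$ has $E[e^{\lambda(X - \mu)}] = e^{\sigma^2 \lambda^2 / 2}$ for every real $\lambda$, so the supremum can range over all $\lambda \ge 0$:

$\begin{eqnarray*} \sup_{\lambda \ge 0} \left \{ \lambda t - \log E [e^{\lambda (X - \mu)}] \right \} & = & \sup_{\lambda \ge 0} \left \{ \lambda t - \frac{\sigma^2 \lambda^2}{2} \right \} \\ & = & \frac{t^2}{\sigma^2} - \frac{t^2}{2\sigma^2} \quad \text{(attained at } \lambda = t/\sigma^2 \text{)} \\ & = & \frac{t^2}{2\sigma^2} \end{eqnarray*}$

Plugging this value into the general Chernoff bound gives $\log P[X - \mu \ge t] \le -\frac{t^2}{2\sigma^2}$, which is the statement above after exponentiating.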

The height distribution in adult males is clearly not a Gaussian, but for the sake of comparison with the Chebyshev inequality, let’s pretend it is. In that case, $t = 58 - 5.86 = 52.14$ feet and $\sigma = 1/3$ of a foot (4 inches), so $t > 156\sigma$ and $P[X \ge 58] \le e^{-156^2/2} = e^{-12168}$, which is below $10^{-5000}$. That’s exceedingly rare, and much more in line with our intuition.
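
To see how quickly the exponential bound pulls away from Chebyshev, the sketch below tabulates both as a function of $k = t/\sigma$, next to the exact upper tail of a standard Gaussian. The function names are illustrative. Note that Chebyshev as stated bounds the two-sided deviation, while the other two columns are one-sided.

```python
import math

def chebyshev_bound(k):
    """Chebyshev: P(|X - mu| >= k*sigma) <= 1/k^2 (capped at 1)."""
    return min(1.0, 1.0 / k**2)

def gaussian_chernoff_bound(k):
    """Gaussian Chernoff bound with t = k*sigma: P(X - mu >= k*sigma) <= exp(-k^2/2)."""
    return math.exp(-k**2 / 2)

def gaussian_upper_tail(k):
    """Exact P(Z >= k) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(k / math.sqrt(2))

for k in (1, 2, 4, 8, 16):
    print(f"k={k:2d}  Chebyshev={chebyshev_bound(k):.3e}  "
          f"Chernoff={gaussian_chernoff_bound(k):.3e}  "
          f"exact={gaussian_upper_tail(k):.3e}")
```

The Chebyshev column shrinks polynomially in $k$, while the Chernoff column tracks the exponential decay of the true tail.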

Chernoff bound for sub-Gaussians:

Most random variables are not Gaussians, and so we need something a little more general. It turns out that a large class of random variables do support a single “kind” of Chernoff bound, and they’re called “sub-Gaussian random variables”. A random variable is sub-Gaussian if there exists a positive number $\sigma$ such that

$E[e^{\lambda(X - \mu)}] \le e^{\sigma^2 \lambda^2/2}, \forall \lambda \in \mathbb{R}$

Whenever a random variable is sub-Gaussian with parameter $\sigma$, the Chernoff bound for Gaussians also holds for that variable, using $\sigma^2$ instead of the variance. Gaussian random variables are sub-Gaussian with the parameter $\sigma^2$ being equal to the variance of the Gaussian.

Chernoff bound for Rademachers ($+1$ or $-1$ random variables):

Rademachers are handy random variables for analysis: a Rademacher random variable takes the values $-1$ and $+1$ with equal probability. Rademachers are sub-Gaussian with $\sigma = 1$.
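
For the curious, the Rademacher claim can be checked directly against the sub-Gaussian definition. This is a sketch of the standard argument: a Rademacher $X$ has $\mu = 0$, and we compare Taylor series term by term using $(2k)! \ge 2^k k!$:

$\begin{eqnarray*} E[e^{\lambda X}] & = & \frac{e^{\lambda} + e^{-\lambda}}{2} \\ & = & \sum_{k=0}^{\infty} \frac{\lambda^{2k}}{(2k)!} \\ & \le & \sum_{k=0}^{\infty} \frac{\lambda^{2k}}{2^k k!} \\ & = & e^{\lambda^2/2} \end{eqnarray*}$

This is exactly the sub-Gaussian condition with $\sigma = 1$.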

Chernoff bound for bounded random variables: Bounded random variables in $[a, b]$ are sub-Gaussian with $\sigma = (b-a)/2$.

Hoeffding bounds are “just” Chernoff bounds for sums of independent random variables, where each random variable under the sum has its own expectation $\mu_i$ and sub-Gaussian parameter $\sigma_i$:

$P \left [\sum_{i=1}^n (X_i - \mu_i) \ge t \right ] \le \exp \left \{-\frac{t^2}{2 \sum_{i=1}^n \sigma_i^2 } \right \}$
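
A small simulation makes the Hoeffding bound concrete. The sketch below assumes the simplest case, a sum of $n$ independent Rademachers (so $\mu_i = 0$ and $\sigma_i = 1$). The function names, sample sizes, and thresholds are arbitrary choices for illustration.

```python
import math
import random

def hoeffding_bound(t, sigmas):
    """Upper bound on P(sum_i (X_i - mu_i) >= t) for independent sub-Gaussian X_i."""
    return math.exp(-t**2 / (2 * sum(s**2 for s in sigmas)))

def empirical_tail(n, t, trials, rng):
    """Fraction of trials in which a sum of n Rademacher draws is at least t."""
    hits = 0
    for _ in range(trials):
        total = sum(rng.choice((-1, 1)) for _ in range(n))
        hits += total >= t
    return hits / trials

rng = random.Random(0)
n, trials = 100, 5_000
for t in (10, 20, 30):
    # Observed frequency next to the bound exp(-t^2 / (2n)).
    print(t, empirical_tail(n, t, trials, rng), hoeffding_bound(t, [1.0] * n))
```

The observed frequencies should sit below the bound, and both should fall off rapidly once $t$ exceeds a few multiples of $\sqrt{n}$.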

### Notes

We’ve worked here with the Chernoff bounds for additive bounds, of the form $P[X \ge \mu + t]$. There is a different class of Chernoff bounds for multiplicative bounds, of the form $P[X \ge (1 + \delta) \mu]$. I personally find it harder to remember what they look like so I just look them up when I need them, and I didn’t cover them here. You should know they exist so you can look them up if you need them.

## High-dimensional spaces

Here are five different observations about high-dimensional spaces and Euclidean geometry. These are meant to show you how truly weird high-dimensional spaces are, and to give you a few pictures that will help understand why ML is particularly challenging in the absence of the DGD.

### In high dimensions, lemons are “all” rind

The formula for the volume of a sphere of radius $r$ in $d$-dimensional space is simple (if ugly):

$V_d(r) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} r^d$

$\Gamma(x)$ is the Gamma function, which generalizes factorials to real numbers: $\Gamma(x) = (x-1)!$ whenever $x$ is a positive integer. More precisely, $\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\; dt$. (If you remember your integration by parts and induction, you can prove the formula for the volume of the $d$-dimensional sphere in a straightforward, if tedious, way.)
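
The formula is easy to evaluate with the standard library’s `math.gamma`. The sketch below (the name `ball_volume` is illustrative) can be checked against the familiar low-dimensional cases: length $2$ for $d = 1$, area $\pi$ for $d = 2$, and volume $4\pi/3$ for $d = 3$.

```python
import math

def ball_volume(d, r=1.0):
    """Volume of a d-dimensional ball of radius r: pi^(d/2) / Gamma(d/2 + 1) * r^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r**d

for d in (1, 2, 3, 5, 10, 20, 50):
    print(d, ball_volume(d))
```

Running it also shows a first oddity: for a fixed radius of $1$, the volume does not keep growing with $d$ but eventually shrinks toward zero.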

To see why it’s the case that in high dimensions, most of the volume of a sphere is near the surface (and hence, “lemons are all rind”), we won’t need to actually do anything with $\Gamma$. Simply consider the ratio between the volumes of a sphere of radius $0.9$ and one of radius $1$, as $d$ increases. The $\Gamma$ and $\pi$ factors cancel, so this ratio is just $0.9^d$, which decreases fast as $d$ increases.

In other words, the volume of the “90% core by radius” of the sphere very quickly goes to zero.
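
To put numbers on this, here is a tiny sketch that tabulates the core and rind fractions for a few dimensions (the helper name and the chosen values of $d$ are arbitrary):

```python
def core_fraction(d, inner_radius=0.9):
    """Fraction of a unit ball's volume that lies within inner_radius of the center."""
    return inner_radius ** d

for d in (1, 2, 3, 10, 50, 100, 500):
    core = core_fraction(d)
    print(f"d={d:3d}  core={core:.3e}  rind={1 - core:.6f}")
```

In three dimensions the core still holds about 73% of the volume, but by $d = 100$ it holds less than one part in ten thousand.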

### In high dimensions, Gaussians are really like spherical shells

(The proof of this is not particularly insightful to me and, annoyingly, I don’t know of a proof of a similar bound using only the Chernoff bounds we just learned about.)

Let’s just state Theorem 2.9 from FODS: for a $d$-dimensional Gaussian with unit variance in each direction and any $\beta \le \sqrt{d}$, at most $3 e^{-\beta^2/96}$ of the probability mass lies outside the shell $\sqrt{d} - \beta \le |x| \le \sqrt{d} + \beta$. Because of this, we call $\sqrt{d}$ the “radius” of the Gaussian.
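
The theorem is easy to probe empirically. The sketch below draws points from a $d$-dimensional standard Gaussian and looks at their norms. The function names, sample counts, and the choice $\beta = 3$ are illustrative assumptions, not part of the theorem.

```python
import math
import random

def gaussian_norms(d, n_samples, rng):
    """Euclidean norms of n_samples draws from a d-dimensional standard Gaussian."""
    return [math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))
            for _ in range(n_samples)]

def fraction_outside_shell(norms, d, beta):
    """Fraction of norms outside the shell [sqrt(d) - beta, sqrt(d) + beta]."""
    return sum(abs(r - math.sqrt(d)) > beta for r in norms) / len(norms)

rng = random.Random(0)
for d in (10, 100, 1000):
    norms = gaussian_norms(d, 500, rng)
    mean_norm = sum(norms) / len(norms)
    print(f"d={d:4d}  sqrt(d)={math.sqrt(d):7.3f}  mean norm={mean_norm:7.3f}  "
          f"outside beta=3: {fraction_outside_shell(norms, d, 3.0):.4f}")
```

The mean norm should track $\sqrt{d}$ closely while the spread of the norms stays roughly constant, so the shell becomes relatively thinner as $d$ grows. The constant $96$ in the theorem is quite loose, and the simulated concentration is typically much tighter than the stated bound.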