The Higher Dimensions Series — Part Four: The Probability Theory Connection
Welcome to Part Four of the Higher Dimensions Series, where we explore some of the strange and delightful curiosities of higher dimensional space. Currently, I have completed Parts One, Two, Three, Four, and Five, with hopefully more to come. If you have not already done so, I encourage you to read the earlier parts before continuing on with our current expedition.
Today we will be entering the world of probability theory to try to shed a bit of light on some of the bewildering phenomena we have encountered in earlier parts of the series. My mind was completely blown the first time I was introduced to the relationship between these higher dimensional wonders and some fundamental concepts in probability theory. In truth, the links between higher dimensional space and probability theory are very deep, highly technical, and mostly beyond my current intellectual reach. However, by introducing some relatively simple concepts in probability theory and relating them to scenarios we’ve already explored together, we can get a tiny glimpse as to why some things behave as they do in higher dimensions.
I hope you have come prepared because today’s journey is suited for the most intrepid of travelers, and it may alter your mind in unexpected ways. If you are anything like me, you will retire at the end of our journey with more questions than answers and a renewed dedication to exploring the strange world of higher dimensional space!
To start, it would be helpful to introduce the concept of a random variable. First, a variable is just a symbol, such as X or Y, that represents an arbitrary value from some set (such as all the real numbers, or all non-negative integers). There are some formal and rather technical definitions of random variables, but for our purposes we can think of them as variables whose values depend on the outcome of some random phenomenon. A random variable has a set of possible values, as well as a probability distribution that defines the probability of each of those possible values occurring. One prototypical example of a random variable is the result of a single roll of a six-sided die. Here, our random variable has six possible values: 1, 2, 3, 4, 5, or 6, each of which occurs with a probability of 1/6, or roughly 17%. If we roll a three, we can think of that value as a realization of the random variable, but before we rolled the die we only knew its possible values and the probabilities associated with seeing each of those values. Another common example of a random variable is the result of tossing a fair coin — does it land heads or tails? Here, our random variable has only two possible values, heads or tails, each of which occurs with a probability of 1/2, or 50%.
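If you’d like to see this in action, here is a small Python sketch (using only the standard library; the fixed seed is just for reproducibility) that rolls a simulated six-sided die many times and tallies how often each face comes up:

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible

# Simulate 60,000 rolls of a fair six-sided die and tally the outcomes.
rolls = [random.randint(1, 6) for _ in range(60_000)]
counts = Counter(rolls)

# Each face should appear with an empirical frequency close to 1/6 (~17%).
for face in range(1, 7):
    print(face, counts[face] / len(rolls))
```

With enough rolls, each face’s empirical frequency settles near 1/6, which is exactly what the uniform probability distribution over the six faces predicts.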
Random variables can get a lot more complicated than die rolls and coin tosses, but the most fundamental concepts remain the same. A random variable is defined by its possible values as well as the probabilities of each of those values occurring.
Points in Space as Realizations of Random Variables
Although it may not have been clear at the time, when we were selecting random points in space from within an n-ball or n-cube in earlier parts of the series, each of the n individual coordinates associated with a single point was in fact a different realization of a single random variable!
Let’s be a bit more explicit here because there are a lot of components involved. Think about the n-cube centered at the origin with edges of length 1, which is a bit more straightforward than the n-ball in this context (for reasons I will touch on later). To generate a random point inside an n-cube, we need to randomly generate the value of each of the n coordinates associated with the point, which corresponds to randomly selecting a value between -0.5 and 0.5 for each coordinate. Now, consider again the random variable associated with a single roll of a six-sided die. If we roll this die n times and record the outcomes, we will have a sequence of n integers that range from 1 to 6. Are you beginning to see the connection? This is the exact same idea as randomly generating n coordinates between -0.5 and 0.5! In the case of spatial coordinates, the random variable is a bit more complex and can take on a lot more values (an infinite number of values, actually) than a die roll, but the fundamental idea is the same: we have a random variable that can take on one of many numeric values, each of which is associated with a particular probability. In the case of the die roll or a single coordinate of a hypercube, each possible value has an equal probability of occurring. This is known as a uniform probability distribution, and it can exist over a discrete set of outcomes (e.g., the six possible integer values of a die roll) or a continuous set of outcomes (e.g., all the possible values between -0.5 and 0.5). This distinction isn’t particularly important to our journey, but I just thought I’d include some key probability concepts for those who might be interested.
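To make the point-generation procedure concrete, here is a minimal Python sketch (the function name random_point_in_cube is my own invention for illustration) that produces a random point inside the n-cube with edge length 1 centered at the origin:

```python
import random

random.seed(1)  # for reproducibility

def random_point_in_cube(n):
    """A random point in the n-cube centered at the origin with edge length 1:
    each coordinate is an independent uniform draw from [-0.5, 0.5]."""
    return [random.uniform(-0.5, 0.5) for _ in range(n)]

# Five independent draws from the same uniform distribution,
# exactly like recording five rolls of the same die.
point = random_point_in_cube(5)
print(point)
```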
To reiterate, we have just established that each individual coordinate of a random point inside an n-cube is a random variable, as are the unrealized outcomes of n tosses of a die. Perhaps you think this is cool (I do!), or perhaps not; if you find yourself in the latter camp, I assure you that your patience will be rewarded.
Sums of Random Variables
The concept of summing up random variables is fundamental to a wide variety of topics in probability and statistics, and it is critical to our current adventure. Perhaps one of the most prominent places where this comes up is in the context of statistical inference, which is the process of learning about a population by studying a sample from that population. I’ll be skipping the gritty details here, but it turns out that there are some incredibly fascinating and powerful things that occur when you sum up random variables, something we will soon see for ourselves.
Consider again a fair coin. Let’s say we are interested in counting the number of times heads comes up in a given number of coin tosses, say ten coin tosses to start. To make things more quantitative, let’s code an outcome of heads as a one and an outcome of tails as a zero. Earlier, we saw that a single coin toss is a random variable with two possible values; thus, the number of heads in ten coin tosses is simply the sum of ten instances of that random variable. To be technical about it, we say that the number of heads in ten coin tosses is the sum of ten independent and identically distributed (i.i.d.) random variables. Why independent? Well, independent just means that the outcome of each coin toss does not depend on the outcome of any other coin toss. Why identically distributed? This just refers to the fact that each of the coin tosses has the same (i.e., identical) probability distribution associated with its outcomes: heads (which we are coding as a one) or tails (which we are coding as zero), each of which occurs with a probability of 50%. So, again, the number of heads in ten coin tosses is a sum of ten i.i.d. variables! This i.i.d. concept isn’t essential to our journey today, but I wanted to point it out because sums of i.i.d. random variables have some particularly elegant properties that are absolutely fundamental to the field of statistics (e.g., the central limit theorem).
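In code, the "sum of i.i.d. random variables" framing is almost a direct transcription. The sketch below (Python, standard library only) codes heads as one and tails as zero:

```python
import random

random.seed(2)  # for reproducibility

# One coin toss: a random variable taking the value 1 (heads) or
# 0 (tails), each with probability 1/2.
def toss():
    return random.randint(0, 1)

# The number of heads in ten tosses is just the sum of ten such variables.
heads = sum(toss() for _ in range(10))
print(heads)
```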
Anyway, let’s get back on track and consider the number of heads in ten coin tosses. How many times would we expect the coin to land on heads? Well, given that each toss has a 50% chance of landing on heads, perhaps a decent guess would be that the coin will land on heads in five out of the ten tosses. We know from experience or intuition that the coin may land on heads five times, but it could also land on heads zero times, ten times, or any number in between. After all, it’s random!
Let’s simulate tossing a coin 10 times and recording the number of times it lands on heads, and then repeating this for 10,000 trials! This way, we can visualize the distribution of outcomes we might expect in a single trial of 10 tosses. Specifically, it will give us an idea of how often we might expect to see zero heads, one head, two heads, etc. Let’s go!
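Here is roughly how such a simulation might look in Python (a sketch, not the exact code behind the plots; the helper name heads_in_tosses is mine):

```python
import random
from collections import Counter

random.seed(3)  # for reproducibility

def heads_in_tosses(n_tosses):
    # Sum of n_tosses i.i.d. coin-toss variables (heads = 1, tails = 0).
    return sum(random.randint(0, 1) for _ in range(n_tosses))

# 10,000 trials of tossing a coin 10 times and counting the heads.
results = [heads_in_tosses(10) for _ in range(10_000)]
distribution = Counter(results)

# How often did each possible outcome (0 through 10 heads) occur?
for k in range(11):
    print(k, distribution.get(k, 0))
```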
This looks reasonable, right? This is the distribution of outcomes among 10,000 trials of tossing a coin ten times and recording the number of heads. It appears very uncommon to get either zero heads or ten heads, and it is most common to get five heads, which was our initial guess! However, let’s take note of the fact that we do see a good amount of variability in the outcome. We didn’t always get five heads, or even just somewhere between four and six heads; rather, we saw the full range of outcomes occur among our 10,000 trials. Although it’s hard to tell from the plot, there were about ten trials (0.1% of all trials) that ended up with zero heads and another ten that ended with ten heads.
I’m sure you’re thinking that this is all well and good, but what does it have to do with higher dimensional space? Worry not, for we are soon getting there! Instead of tossing the coin ten times and counting the heads, let’s toss it 100 times and count the number of heads, and repeat that for 10,000 trials! Remember, we are now talking about the sum of 100 i.i.d random variables. Here is the result of our 10,000 trials:
Wow! That definitely looks a bit different. If we toss a coin 100 times and record the number of heads, the possible number of heads we see is any number between zero and 100, right? However, based on our 10,000 trials, it looks extremely uncommon for a trial to have fewer than about 30 heads or more than about 70 heads. We no longer see trials resulting in all the possible outcomes, as we did when we tossed the coin 10 times. Indeed, it looks like the number of heads in each trial is more highly concentrated around the average, which happens to be 50 heads in this case.
Let’s do this one more time, but instead of tossing the coin 100 times, let’s toss it 1000 times, which corresponds to the sum of 1000 i.i.d random variables. Here’s the result of our 10,000 trials:
Okay, there’s an obvious trend here. It appears that as we increase the number of coin tosses (e.g., from 10 to 100 to 1000), the number of heads that we see in each of our trials gets more and more concentrated around the average value, relative to all the possible values. Coin tosses may seem trivial now, but we will soon see that the phenomenon we are seeing is anything but.
Concentration of Measure and Higher Dimensional Spaces
What we demonstrated above is not specific to coin tosses but is a manifestation of a general phenomenon known as concentration of measure. In full transparency, a complete technical understanding of all the theory and implications of this general phenomenon is far beyond my reach. However, one of the most notable and perhaps most widely applicable manifestations of this phenomenon is what we observed above with our coin tosses; that is, as the number of coin tosses (i.e., random variables) being summed increases, the observed outcomes grow more tightly concentrated around their average value. Another way to frame this is that the probability mass grows more highly concentrated among a small number of all possible outcomes as we increase the number of random variables that we are summing. When the variables being summed are i.i.d. (as with coin tosses), this is described by an incredibly important theorem in probability and statistics called the law of large numbers. The law of large numbers formalizes the idea that as you increase the number of coin tosses (the number of i.i.d. variables being summed), you’re going to get closer and closer to half heads and half tails (the true mean, or expectation, of the probability distribution).
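The law of large numbers is easy to watch in action. The sketch below (my own illustration in Python) measures how spread out the proportion of heads is across 2,000 trials as the number of tosses grows; the standard deviation of that proportion should shrink toward zero:

```python
import random
import statistics

random.seed(4)  # for reproducibility

# For increasing numbers of tosses, measure how spread out the
# *proportion* of heads is across many trials.
spreads = []
for n_tosses in (10, 100, 1000):
    proportions = [
        sum(random.randint(0, 1) for _ in range(n_tosses)) / n_tosses
        for _ in range(2_000)
    ]
    spreads.append(statistics.stdev(proportions))
    print(n_tosses, round(spreads[-1], 4))
```

For a fair coin, the spread of the proportion shrinks like 0.5/√n, so each tenfold increase in the number of tosses cuts the spread by a factor of about three.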
Anyway, for those who are familiar, are the above plots reminiscent of any of the plots we’ve encountered on our earlier adventures, particularly those in Parts One and Three? Let’s refresh our memory. In Part One of the series, we saw that points randomly selected from within n-balls were more highly concentrated at the outer boundary of the ball as n increased. Put another way, the distance between the origin and each point was almost always close to one as the dimensionality of the space increased; that is, the points were all roughly the same distance from the origin, with little variability. In Part Three, we saw that pairs of points selected randomly from within n-balls or n-cubes were more highly concentrated around the average distance as n increased. Basically, as we moved into higher and higher dimensions, all points were roughly the same distance from all other points, with little variability.
Consider the n-cube. Let’s run an experiment where we select a large number of random points from within an n-cube centered at the origin and measure the distance between each point and the origin, similar to what we did in Part One. We’ll run this experiment for a 10-cube, 100-cube, and 1000-cube. The plots below show the distribution of distances from the origin of 10,000 random points within each n-cube:
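A sketch of this experiment in Python (again only the standard library, and with smaller trial counts than the plots, but the trend is the same) might look like this:

```python
import math
import random
import statistics

random.seed(5)  # for reproducibility

def distance_to_origin(n):
    # A random point in the n-cube with edge 1 centered at the origin,
    # and its distance to the origin (generalized Pythagorean theorem).
    point = [random.uniform(-0.5, 0.5) for _ in range(n)]
    return math.sqrt(sum(c * c for c in point))

# For each dimension, the relative spread (std/mean) of the distances
# shrinks: the points concentrate at a common distance from the origin.
relative_spreads = []
for n in (10, 100, 1000):
    distances = [distance_to_origin(n) for _ in range(2_000)]
    mean = statistics.mean(distances)
    relative_spreads.append(statistics.stdev(distances) / mean)
    print(n, round(mean, 3), round(relative_spreads[-1], 4))
```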
Look at all that concentration of measure! As we’ve seen before, it looks like the points become more tightly clustered around a particular distance from the origin as we move into higher and higher dimensions. If your mind isn’t going wild with the connections that have been presented, I encourage you to scroll up and look at the plots of the numbers of heads in 10, 100, and 1000 coin tosses. They’re almost identical to these plots of the distances between the origin and random points inside 10-, 100-, and 1000-dimensional hypercubes! Is this a coincidence? I think not! Let’s dig a little deeper into what might be going on here.
We already know that the number of heads in a given number of coin tosses is a sum of i.i.d. random variables, and we have seen that the probability mass of the sum of i.i.d random variables gets more highly concentrated among a smaller number of possible values as we increase the number of items being summed (i.e., the law of large numbers). But what does this have to do with higher dimensional space? More specifically, what does it have to do with the distance between a point inside an n-cube and the origin?
Remember how we calculated the distance from a point to the origin using the generalization of the Pythagorean theorem? Let’s take another look at that:
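Written out for a point with coordinates (x₁, x₂, …, xₙ), the distance to the origin is:

```latex
d = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}
```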
What have we here? In words, the distance to the origin for a given point in n-space is the square root of the sum of the squares of each individual coordinate of that point! Remember earlier when we characterized each coordinate of a random point in space as the realization of a random variable? So, if the coordinates of a random point are random variables, then the squares of those coordinates are also random variables. And even further, to calculate the distance from a point to the origin, we are summing the squares of each individual coordinate, so we are summing random variables! We can ignore the square root here as it doesn’t change the fundamental ideas at play.
As a brief aside, I suggested earlier that the n-cube is a bit more straightforward than the n-ball. The reason for this is that each of the individual coordinates of a random point inside an n-cube is an i.i.d. random variable. Let’s think about why that is: no matter where the point is along its first dimensional axis, the distribution of the point’s coordinate on its second dimensional axis is unchanged, and the same goes for every other axis. Thus, each coordinate is independent of every other coordinate. In addition, the coordinates range from -0.5 to 0.5 (or whatever size cube we are dealing with) for every dimension, so the coordinates are identically distributed. Conversely, the coordinates of a random point inside an n-ball are different in that they are not independent: a point’s location along one axis affects the distribution of possible values for coordinates along the other axes. If that’s not clear, just think through simple 2- and 3-dimensional cubes and balls. The details of exactly how this affects measure concentration lie beyond my cognitive horizon, but I just want to point out that there are some subtle differences at play.
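This dependence is easy to see numerically even in two dimensions. The sketch below (a rejection-sampling illustration of my own) draws points uniformly from the unit disk and compares how far the y-coordinate can range when x is near the center versus near the edge:

```python
import random

random.seed(6)  # for reproducibility

# Rejection-sample points uniformly from the unit 2-ball (a disk):
# draw from the enclosing square and keep only points inside the disk.
points = []
while len(points) < 50_000:
    x, y = random.uniform(-1, 1), random.uniform(-1, 1)
    if x * x + y * y <= 1:
        points.append((x, y))

near_center = [abs(y) for x, y in points if abs(x) < 0.1]
near_edge = [abs(y) for x, y in points if abs(x) > 0.9]

# When x is near the edge, y is forced into a narrow band.
print(max(near_center), max(near_edge))
```

Near the edge, |y| is capped at √(1 − 0.9²) ≈ 0.44, while near the center it can range almost all the way to 1; knowing one coordinate tells you something about the other, so the coordinates are not independent. In the n-cube, by contrast, no such cap ever kicks in.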
Okay, back to it! Are you still with me? Let’s be explicit here. We just made the realization that calculating the distance from a random point inside an n-cube to the origin involves summing up i.i.d random variables. Also, as we increase the dimension n, we are increasing the number of items (i.e., individual coordinates) that we are summing. We also saw above in our coin toss experiments that as we increase the number of i.i.d random variables being summed, we see more and more measure concentration in terms of the probability mass associated with the possible outcomes; that is, the outcomes we tend to observe get more and more concentrated around the average outcome! So, the fact that all the points in high dimensional n-cubes cluster about the same distance from the origin has something to do with the probabilistic phenomenon of measure concentration. To be honest, I don’t even know exactly what to make of this but it’s insane and it absolutely blows my mind.
In the realm of higher dimensions, this phenomenon of measure concentration is not exclusive to the distance between the origin and random points in an n-cube. If you recall from our earlier adventures, we also saw this sort of concentration when looking at the distance between the origin and random points in an n-ball, as well as between pairs of points in higher dimensional space. While the technical details can vary depending on the type of shapes (and thus the underlying probability distributions of the coordinates) that we are working with, these are all manifestations of the deep and (at least for me) mystical connections between probability theory and higher dimensional space.
What a magnificent world!
Wrapping Up and Looking Forward
We just touched on some staggering and wondrous concepts. To recap, we found that some of the strange phenomena we have been encountering in higher dimensional space have something to do with probability theory and measure concentration. Specifically, the fact that random points tend to cluster at the same distance from both the origin and from other points is fundamentally linked to the fact that the probability mass of sums of random variables tends to concentrate among a small set of possible outcomes as the number of variables being summed increases. In short, many of the outlandish characteristics we have observed in higher dimensional space are akin to what we see when we toss a bunch of coins or roll a bunch of dice!
I encourage you to really meditate on this. Because we are randomly generating points, it may seem obvious that we would expect random variables and probability theory to be involved in some way, but could we have imagined that they would be so fundamentally linked to the “physical” characteristics of these higher dimensional spaces? That is, the concentration of random points within spaces communicates information about the distribution of volume within those spaces. In two and three dimensions, we can rely on our human senses (e.g., sight and touch) to learn about the physical characteristics of space. However, this is a luxury that fails us as we venture forth into higher dimensions. Here, we must rely on other tools to make sense of what we find, and I am completely amazed by the fact that probability theory can serve as such a prominent tool to understand these spaces.
We have seen again and again that higher dimensional space is absolutely wild and exhibits some very strange characteristics, but we have not really explored why this is the case. To be frank, I don’t have the depth of knowledge to fully understand why we have seen some of the things that we have seen. I’m sure I have just as many questions as you do. However, those questions and the strange but delightful feelings I get as I ponder these spaces are what keep me pushing forward. I hope it is the same for you.