Using the Monte Carlo method to visualize the behavior of observations with a large number of features
Think of a data set made of a certain number of observations, where each observation has N features. If you convert all the features to a numerical representation, you can say that each observation is a point in N-dimensional space.
When N is small, the relationships between the points are just what one would intuitively expect. But sometimes N grows very large; this can happen, for example, when you create many features via one-hot encoding. For very large values of N, the observations behave as if they are sparse, or as if the distances between them are somehow larger than you might expect.
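As a concrete illustration of how this happens, here is a minimal sketch (not from the original text, assuming pandas and made-up column names) showing that one-hot encoding even a couple of categorical features already multiplies the number of columns:

import pandas as pd

# a tiny data set with two categorical features (hypothetical names and values)
df = pd.DataFrame({
    "color": ["red", "green", "blue", "red"],
    "country": ["US", "FR", "JP", "FR"],
})

# one-hot encoding creates one column per (feature, category) pair
encoded = pd.get_dummies(df)
print(df.shape[1], "->", encoded.shape[1])  # 2 features become 6

With high-cardinality categorical features, the same mechanism can easily push N into the hundreds or thousands.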
The phenomenon is real. As the number of dimensions N grows, all else being equal, the N-dimensional volume containing your observations increases in some sense (or at least the number of degrees of freedom increases), and the Euclidean distances between observations increase as well. The cloud of points actually becomes sparser. This is the geometric basis for the curse of dimensionality, and the behavior of models and techniques applied to the data set changes as a consequence.
Many things can go wrong when the number of features is too large. Having more features than observations is a typical setup for models that overfit in training. Any brute-force search in such a space (e.g., grid search) becomes less efficient: you need more trials to cover the same interval along each axis. A more subtle effect hits any model based on distance or proximity: as the number of dimensions grows very large, from the vantage point of any one observation all the other points appear very far away and almost equidistant. Since these models rely on distances to do their job, leveling out the distance differences makes that job much harder. Clustering, for example, does not work as well when all the points appear nearly equidistant.
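To make the near-equidistance effect concrete, here is a minimal sketch (not part of the original code, using NumPy) that measures how the relative spread of distances from one reference point to all the others shrinks as the number of dimensions grows:

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    # 1,000 points uniformly distributed in the d-dimensional unit cube
    points = rng.random((1000, d))
    # distances from the first point to all the others
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    # relative spread: how much farther the farthest point is than the nearest one
    print(d, (dists.max() - dists.min()) / dists.min())

As d grows, the printed ratio drops toward 0, which is exactly the leveling of distance differences described above.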
For all these reasons, and more, techniques such as PCA, LDA, etc. have been created to move away from the peculiar geometry of high-dimensional spaces and to distill the data set down to a number of dimensions more compatible with the actual information it contains.
It is difficult to intuitively grasp the true magnitude of this phenomenon, and spaces with more than 3 dimensions are extremely hard to visualize, so let's do some simple 2D visualizations to help our intuition. There is a geometric reason why dimensionality can become a problem, and that is what we will visualize here. If you haven't seen this before, the results may be surprising: the geometry of high-dimensional spaces is much more complex than typical intuition suggests.
Consider a square of size 1, centered at the origin. In the square, you inscribe a circle.
That is the configuration in 2 dimensions. Now think about the general N-dimensional case. In 3 dimensions, you have a sphere inscribed in a cube. Beyond that, you have an N-sphere inscribed in an N-cube, which is the most general way to put it. For simplicity, we will refer to these objects as “sphere” and “cube”, regardless of how many dimensions they have.
The volume of the cube is fixed: it is always 1. The question is: as the number of dimensions N varies, what happens to the volume of the sphere?
Let’s answer the question experimentally, using the Monte Carlo method. We will generate a large number of points, uniformly distributed at random within the cube. For each point we compute its distance from the origin: if that distance is no greater than 0.5 (the radius of the sphere), the point is inside the sphere.
If we divide the number of points inside the sphere by the total number of points, it will approximate the ratio of the volume of the sphere to the volume of the cube. Since the volume of the cube is 1, the ratio will be equal to the volume of the sphere. The approximation improves when the total number of points is large.
In other words, the ratio inside_points / total_points will approximate the volume of the sphere.
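For reference, the volume of an N-dimensional ball also has a known closed form, pi^(N/2) * r^N / Γ(N/2 + 1), which can be used to sanity-check the Monte Carlo estimates. Here is a minimal sketch (not part of the original code, standard library only):

from math import gamma, pi

def ball_volume(d, r=0.5):
    # closed-form volume of a d-dimensional ball of radius r
    return pi ** (d / 2) * r ** d / gamma(d / 2 + 1)

for d in (2, 3, 10, 22):
    print(d, ball_volume(d))

For d = 2 this gives pi/4 ≈ 0.785, the area of a circle of radius 0.5, so the d = 2 entry of the Monte Carlo estimate should land near that value.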
The code is quite simple. Since we need a lot of points, explicit loops should be avoided. We could use NumPy, but it is CPU-only and single-threaded, so it would be slow. Potential alternatives: CuPy (GPU), JAX (CPU or GPU), PyTorch (CPU or GPU), etc. We'll use PyTorch, but the NumPy code would look almost identical.
If you follow the nested torch statements, we generate 100 million random points, calculate their distances from the origin, count the points inside the sphere, and divide that count by the total number of points. The ratio array will end up containing the volume of the sphere for the different numbers of dimensions.
Tunable parameters are set for a GPU with 24 GB of memory; adjust them if your hardware is different.
import numpy as np
import torch
from tqdm import tqdm

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# force CPU
# device = 'cpu'

# reduce d_max if too many ratio values are 0.0
d_max = 22
# reduce n if you run out of memory
n = 10**8

ratio = np.zeros(d_max)

for d in tqdm(range(d_max, 0, -1)):
    torch.manual_seed(0)
    # combine large tensor statements for better memory allocation
    ratio[d - 1] = (
        torch.sum(
            torch.sqrt(
                torch.sum(torch.pow(torch.rand((n, d), device=device) - 0.5, 2), dim=1)
            )
            <= 0.5
        ).item()
        / n
    )
    # clean up memory
    torch.cuda.empty_cache()
Let’s visualize the results: