Table of Contents
1. Dimensionality reduction
2. How does principal component analysis work?
3. Implementation in Python
4. Evaluation and Interpretation
5. Conclusions and next steps
Many real machine learning problems involve data sets with thousands or even millions of features. Training models on these data sets can be computationally demanding, and interpreting the resulting solutions can be even more challenging.
As the number of features increases, data points become sparse and distance metrics lose meaning: the differences between pairwise distances shrink, making it difficult to tell which points are close and which are far apart. This is known as the curse of dimensionality.
Sparse data makes models harder to train and more likely to overfit, capturing noise rather than the underlying patterns, which leads to poor generalization to new, unseen data.
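As a rough illustration of this effect (a sketch using made-up uniform data, not part of the original analysis), the ratio between the nearest and farthest distance from a query point gets closer to 1 as the number of dimensions grows:

import numpy as np

rng = np.random.default_rng(0)

# As dimensionality grows, the relative gap between the nearest and
# farthest neighbor of a query point shrinks (distances "concentrate").
for d in (2, 10, 100, 1000):
    points = rng.random((500, d))   # 500 random points in d dimensions
    query = rng.random(d)           # one random query point
    dists = np.linalg.norm(points - query, axis=1)
    print(f"d={d:5d}  min/max distance ratio: {dists.min() / dists.max():.3f}")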
Dimensionality reduction is used in data science and machine learning to reduce the number of variables or features in a data set while retaining as much of the original information as possible. This technique is useful for simplifying complex data sets, improving computational efficiency, and assisting with data visualization.
One of the most widely used techniques to mitigate the curse of dimensionality is Principal Component Analysis (PCA). PCA reduces the number of features in a data set while keeping the most useful information by finding the axes that capture the most variation in the data. These axes are called principal components.
Since PCA aims to find a low-dimensional representation of a data set that preserves a large portion of the variance, rather than to make predictions, it is considered an unsupervised learning algorithm.
But why does maintaining variation mean preserving important information?
Imagine that you are analyzing a data set about crimes in a city. The data has numerous features, including “crime against a person – with injuries” and “crime against a person – without injuries.” Places with a high rate of the first type will very likely also have a high rate of the second.
In other words, the two features in the example are highly correlated, so it is possible to reduce the dimensionality of the data set by removing that redundancy (in this case, the presence or absence of injuries to the victim).
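As a quick, purely hypothetical illustration of that redundancy (the feature names and numbers below are invented), the snippet builds two synthetic crime-rate features that move together; a correlation close to 1 means a single axis could summarize most of what both features say:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: two crime-rate features that move together
with_injuries = rng.poisson(lam=20, size=1000).astype(float)
without_injuries = 2.5 * with_injuries + rng.normal(scale=3.0, size=1000)

# A correlation close to 1 signals redundancy that PCA can exploit
print(np.corrcoef(with_injuries, without_injuries)[0, 1])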
The PCA algorithm is nothing more than a sophisticated way of doing it.
Now, let's discuss how the PCA algorithm works in the following steps:
Step 1: Center the data
PCA is sensitive to the scale and offset of the data, so the first thing to do is subtract the mean of each feature in the data set, ensuring that all features have a mean equal to 0.
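Here is a minimal sketch of this step on a tiny made-up data matrix x, with samples in rows and features in columns:

import numpy as np

# Hypothetical data matrix: 3 samples, 2 features
x = np.array([[2.0, 100.0],
              [4.0, 140.0],
              [6.0, 120.0]])

# Step 1: subtract each feature's mean so every column has mean 0
x_centered = x - x.mean(axis=0)
print(x_centered.mean(axis=0))   # ~[0. 0.]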
Step 2: Calculate the covariance matrix
Now, we need to calculate the covariance matrix to capture how each pair of features varies together. If the data set has n features, the resulting covariance matrix has shape n × n.
In the image below, the most correlated features have colors closer to red. Of course, each feature is perfectly correlated with itself.
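As a rough sketch of this step, assuming a made-up centered matrix, np.cov with rowvar=False treats columns as features, and matplotlib can draw a heatmap like the one described:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical centered data: 100 samples, 5 features
x_centered = np.random.randn(100, 5)

# Step 2: covariance matrix of shape (n_features, n_features)
cov_matrix = np.cov(x_centered, rowvar=False)
print(cov_matrix.shape)   # (5, 5)

# Heatmap of the covariance matrix (warmer colors = stronger covariance)
plt.imshow(cov_matrix, cmap='coolwarm')
plt.colorbar(label='Covariance')
plt.show()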
Step 3: Eigenvalue Decomposition
Next, we have to perform the eigenvalue decomposition of the covariance matrix. In case you don't remember, given the covariance matrix Σ (a square matrix), eigenvalue decomposition is the process of finding a set of scalars (eigenvalues) and vectors (eigenvectors) such that:
Σv = λv
Where:
- Σ is the n × n covariance matrix.
- v is a non-zero vector called an eigenvector.
- λ is a scalar called the eigenvalue associated with the eigenvector v.
Eigenvectors indicate the directions of maximum variance in the data (the principal components), while eigenvalues quantify the variance captured by each principal component.
If a matrix A can be decomposed into eigenvalues and eigenvectors, it can be represented as:
A = QΛQ⁻¹
Where:
- Q is a matrix whose columns are the eigenvectors of A.
- Λ is a diagonal matrix whose diagonal elements are the eigenvalues of A.
In this way, we can apply the same decomposition to the covariance matrix to find its eigenvalues and eigenvectors.
In the image above, we can see that the first eigenvector points to the direction with the largest variance of the data, and the second eigenvector points to the direction with the second largest variance.
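As a quick sketch, the decomposition can be checked on a made-up 2 × 2 covariance matrix with np.linalg.eig, including a verification that Σ = QΛQ⁻¹ holds:

import numpy as np

# Hypothetical 2 x 2 covariance matrix
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Step 3: eigenvalue decomposition (Sigma v = lambda v, one eigenvector per column)
eigenvalues, eigenvectors = np.linalg.eig(sigma)
print(eigenvalues)

# Sanity check of Sigma = Q Lambda Q^-1
q = eigenvectors
lam = np.diag(eigenvalues)
print(np.allclose(sigma, q @ lam @ np.linalg.inv(q)))   # True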
Step 4: Select the principal components
As stated above, eigenvalues quantify the variance of the data in the direction of their corresponding eigenvectors. Therefore, we sort the eigenvalues in descending order and keep only the eigenvectors corresponding to the largest ones: these are the principal components we retain.
The following image illustrates the proportion of variance captured by each principal component in a two-dimensional PCA.
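The same proportions can also be computed directly. Here is a small sketch with made-up eigenvalues and placeholder eigenvectors (identity columns, just for illustration):

import numpy as np

# Hypothetical eigenvalues and matching eigenvectors (one per column)
eigenvalues = np.array([0.3, 2.1, 0.9])
eigenvectors = np.eye(3)

# Step 4: sort from largest to smallest eigenvalue and keep the top k
order = np.argsort(eigenvalues)[::-1]
top_k = 2
selected = eigenvectors[:, order[:top_k]]

# Share of the total variance captured by each kept component
explained_variance_ratio = eigenvalues[order[:top_k]] / eigenvalues.sum()
print(explained_variance_ratio)   # ~[0.636 0.273]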
Step 5: Project the data
Finally, we have to project the original data onto the dimensions represented by the selected principal components. To do this, we multiply the centered data set by the matrix of selected eigenvectors found in the decomposition of the covariance matrix.
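A minimal sketch of the projection on made-up data, reusing the sorting logic from the previous step, could look like this:

import numpy as np

# Hypothetical data: 100 samples, 5 features
x = np.random.randn(100, 5)
x_centered = x - x.mean(axis=0)

# Eigen-decompose the covariance matrix and keep the top 2 eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(np.cov(x_centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order[:2]]

# Step 5: project the centered data onto the selected components
x_projected = x_centered @ components
print(x_projected.shape)   # (100, 2)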
Now that we deeply understand the key concepts of Principal Component Analysis, it's time to create some code.
First, we have to configure the environment by importing the numpy package for mathematical calculations and matplotlib for visualization:
import numpy as np
import matplotlib.pyplot as plt
Next, we will encapsulate all the concepts covered in the previous section in a Python class with the following methods:
- A constructor that initializes the algorithm's parameters: the number of desired components, an array to store the component vectors, and an array to store the explained variance of each selected dimension.
- A fit method that implements in code the first four steps introduced in the previous section and also computes the explained variance of each component.
- A transform method that performs the last step presented in the previous section: projecting the data onto the selected dimensions.
- A helper method that plots the explained variance of each selected principal component as a bar chart.
Here is the complete code:
class PCA:
    def __init__(self, n_components):
        self.n_components = n_components
        self.components = None
        self.mean = None
        self.explained_variance = None

    def fit(self, x):
        # Step 1: Center the data (subtract the mean of each feature)
        self.mean = np.mean(x, axis=0)
        X_centered = x - self.mean

        # Step 2: Compute the covariance matrix
        cov_matrix = np.cov(X_centered, rowvar=False)

        # Step 3: Compute the eigenvalues and eigenvectors
        eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

        # Step 4: Sort the eigenvalues and corresponding eigenvectors
        # in descending order and select the top n_components
        sorted_indices = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[sorted_indices]
        eigenvectors = eigenvectors[:, sorted_indices]
        self.components = eigenvectors[:, :self.n_components]

        # Calculate the explained variance ratio of each selected component
        total_variance = np.sum(eigenvalues)
        self.explained_variance = eigenvalues[:self.n_components] / total_variance

    def transform(self, x):
        # Step 5: Project the centered data onto the selected components
        X_centered = x - self.mean
        return np.dot(X_centered, self.components)

    def plot_explained_variance(self):
        # Create labels for each principal component
        labels = [f'PCA{i+1}' for i in range(self.n_components)]

        # Create a bar plot of the explained variance ratio
        plt.figure(figsize=(8, 6))
        plt.bar(range(1, self.n_components + 1), self.explained_variance,
                alpha=0.7, align='center', color='blue', tick_label=labels)
        plt.xlabel('Principal Component')
        plt.ylabel('Explained Variance Ratio')
        plt.title('Explained Variance by Principal Components')
        plt.show()
Now it's time to use the class we just implemented on a mock data set created with the numpy package. The data set has 10 features and 100 samples.
# create simulated data for analysis
np.random.seed(42)
# Generate a low-dimensional signal
low_dim_data = np.random.randn(100, 4)

# Create a random projection matrix to project into higher dimensions
projection_matrix = np.random.randn(4, 10)
# Project the low-dimensional data to higher dimensions
high_dim_data = np.dot(low_dim_data, projection_matrix)
# Add some noise to the high-dimensional data
noise = np.random.normal(loc=0, scale=0.5, size=(100, 10))
data_with_noise = high_dim_data + noise
x = data_with_noise
Before performing the PCA, one question remains: how do we choose the correct, or optimal, number of dimensions? A common rule of thumb is to keep the smallest number of components whose explained variance adds up to at least 95% of the data set's total variance.
To do that, let's take a look at how each principal component contributes to the total variance of the data set:
# Apply PCA
pca = PCA(n_components=10)
pca.fit(x)
X_transformed = pca.transform(x)

print("Explained Variance (%):\n", np.round(pca.explained_variance * 100, 3))
>> Explained Variance (%):
[55.406 25.223 11.137  5.298  0.641  0.626  0.511  0.441  0.401  0.317]
Next, let's plot the cumulative sum of the explained variance and check at how many dimensions we reach the 95% threshold of the total variance.
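One possible way to compute and plot that cumulative sum, reusing the pca object fitted above and the 95% threshold mentioned before, is sketched below:

# Cumulative explained variance of the sorted components
cumulative_variance = np.cumsum(pca.explained_variance)

# First number of components whose cumulative variance reaches 95%
optimal_n = int(np.argmax(cumulative_variance >= 0.95)) + 1
print("Components needed for 95% of the variance:", optimal_n)

plt.figure(figsize=(8, 6))
plt.plot(range(1, pca.n_components + 1), cumulative_variance, marker='o')
plt.axhline(y=0.95, color='red', linestyle='--', label='95% threshold')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.legend()
plt.show()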
As shown in the graph above, the optimal number of dimensions for this data set is 4, which together account for 97.064% of the explained variance. In other words, we transformed a data set with 10 features into one with only 4 dimensions while keeping more than 97% of the original information.
That means that most of the original 10 features were highly correlated and the algorithm transformed that high-dimensional data into uncorrelated principal components.
We created a PCA class using only the numpy package that successfully reduced the dimensionality of a data set of 10 features to just 4 while preserving approximately 97% of the data variance.
Additionally, we explored a method to obtain an optimal number of principal components, which can be customized depending on the problem at hand (we might be interested in retaining only 90% of the variance, for example).
This shows the potential of PCA analysis to address the curse of dimensionality explained above. Additionally, I would like to leave a few points for further exploration:
- Perform classification or regression tasks with other machine learning algorithms on the PCA-reduced data set, and compare the performance of models trained on the original data with models trained on the transformed data to evaluate the impact of the dimensionality reduction.
- Use PCA for data visualization to make high-dimensional data more interpretable and discover patterns that were not evident in the original feature space.
- Consider exploring other dimensionality reduction techniques, such as t-distributed Stochastic Neighbor Embedding (t-SNE) and Linear Discriminant Analysis (LDA).
The full code is available in the PCA folder of the ai-from-scratch repository.