To demonstrate the theory and mathematics in action, we will program our own LDA from scratch using only numpy.
import numpy as np

class LDA_fs:
    """
    Performs a Linear Discriminant Analysis (LDA)

    Methods
    =======
    fit_transform():
        Fits the model to the data x and Y, derives the transformation matrix W
        and projects the feature matrix x onto the m LDA axes
    """

    def __init__(self, m):
        """
        Parameters
        ==========
        m : int
            Number of LDA axes onto which the data will be projected

        Returns
        =======
        None
        """
        self.m = m

    def fit_transform(self, x, Y):
        """
        Parameters
        ==========
        x : array(n_samples, n_features)
            Feature matrix of the dataset
        Y : array(n_samples)
            Label vector of the dataset

        Returns
        =======
        X_transform : New feature matrix projected onto the m LDA axes
        """
        # Get number of features (columns)
        self.n_features = x.shape[1]
        # Get unique class labels
        class_labels = np.unique(Y)
        # Get the overall mean vector (independent of the class labels)
        mean_overall = np.mean(x, axis=0)  # Mean of each feature
        # Initialize both scatter matrices with zeros
        SW = np.zeros((self.n_features, self.n_features))  # Within-class scatter matrix
        SB = np.zeros((self.n_features, self.n_features))  # Between-class scatter matrix
        # Iterate over all classes and select the corresponding data
        for c in class_labels:
            # Filter x for class c
            X_c = x[Y == c]
            # Calculate the mean vector for class c
            mean_c = np.mean(X_c, axis=0)
            # Calculate within-class scatter for class c
            SW += (X_c - mean_c).T.dot((X_c - mean_c))
            # Number of samples in class c
            n_c = X_c.shape[0]
            # Difference between the overall mean and the mean of class c --> between-class scatter
            mean_diff = (mean_c - mean_overall).reshape(self.n_features, 1)
            SB += n_c * (mean_diff).dot(mean_diff.T)
        # Determine SW^-1 * SB
        A = np.linalg.inv(SW).dot(SB)
        # Get the eigenvalues and eigenvectors of (SW^-1 * SB)
        eigenvalues, eigenvectors = np.linalg.eig(A)
        # Keep only the real parts of eigenvalues and eigenvectors
        eigenvalues = np.real(eigenvalues)
        eigenvectors = np.real(eigenvectors.T)
        # Sort the eigenvalues descending (high to low)
        idxs = np.argsort(np.abs(eigenvalues))[::-1]
        self.eigenvalues = np.abs(eigenvalues[idxs])
        self.eigenvectors = eigenvectors[idxs]
        # Store the first m eigenvectors as transformation matrix W
        self.W = self.eigenvectors[0:self.m]
        # Transform the feature matrix x onto the LD axes
        return np.dot(x, self.W.T)
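Before applying the class to real data, here is a minimal sanity check on made-up toy data (two well-separated Gaussian clusters; the names X_demo and y_demo are purely illustrative). With m=1, the two classes should end up clearly apart along the single discriminant axis.

# Minimal sanity check on synthetic toy data (illustrative only)
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 3)),  # class "A" centered at 0
    rng.normal(loc=3.0, scale=1.0, size=(50, 3)),  # class "B" centered at 3
])
y_demo = np.array(["A"] * 50 + ["B"] * 50)

lda_demo = LDA_fs(m=1)                          # one discriminant axis
X_proj = lda_demo.fit_transform(X_demo, y_demo)
print(X_proj.shape)                             # (100, 1)
print(X_proj[y_demo == "A"].mean(), X_proj[y_demo == "B"].mean())  # clearly separated means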
To see LDA in action, we'll apply it to a typical task in a production environment. We have data from a simple manufacturing line with only 7 stations. Each of these stations records a single measurement per part (yes, I know, just one measurement per station is very unrealistic). Unfortunately, our line produces a significant number of defective parts, and we want to find out which stations are responsible for this.
First, we load the data and take an initial look.
import pandas as pd

# URL to the GitHub repository
url = "https://raw.githubusercontent.com/IngoNowitzky/LDA_Medium/main/production_line_data.csv"
# Read csv to DataFrame
data = pd.read_csv(url)
# Print first 5 lines
data.head()
Next, we study the distribution of the data using the Pandas .describe() method.
# Show average, min and max of numerical values
data.describe()
We see that we have 20,000 data points and that the measurements range between -5 and +150. We therefore note for later that we need to normalize the data set: otherwise, the different magnitudes of the features would negatively affect the LDA.
How many good parts and how many bad parts do we have?
# Count the number of good and bad parts
label_counts = data['Label'].value_counts()

# Display the results
print("Number of Good and Bad Parts:")
print(label_counts)
We have 19,031 good parts and 969 defective parts. Such a strong imbalance is an issue we have to address before the analysis. Therefore, we select all defective parts and an equal number of randomly chosen good parts for further processing.
# Select all bad parts
bad_parts = data[data['Label'] == 'Bad']

# Randomly select an equal number of good parts
good_parts = data[data['Label'] == 'Good'].sample(n=len(bad_parts), random_state=42)

# Combine both subsets to create a balanced dataset
balanced_data = pd.concat([bad_parts, good_parts])

# Shuffle the combined dataset
balanced_data = balanced_data.sample(frac=1, random_state=42).reset_index(drop=True)

# Display the number of good and bad parts in the balanced dataset
print("Number of Good and Bad Parts in the balanced dataset:")
print(balanced_data['Label'].value_counts())
Now, let's apply our LDA from scratch to the balanced data set. We use the StandardScaler from sklearn to normalize the measurements of each feature so that it has a mean of 0 and a standard deviation of 1. We choose only one linear discriminant axis (m=1) onto which we project the data. This helps us clearly see which features are most relevant in distinguishing good parts from bad ones, and we visualize the projected data in a histogram.
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Separate features and labels
x = balanced_data.drop(columns=['Label'])
y = balanced_data['Label']

# Normalize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(x)

# Perform LDA
lda = LDA_fs(m=1)                       # Instantiate LDA object with 1 axis
X_lda = lda.fit_transform(X_scaled, y)  # Fit the model and project the data

# Plot the LDA projection
plt.figure(figsize=(10, 6))
plt.hist(X_lda[y == 'Good'], bins=20, alpha=0.7, label='Good', color='green')
plt.hist(X_lda[y == 'Bad'], bins=20, alpha=0.7, label='Bad', color='red')
plt.title("LDA Projection of Good and Bad Parts")
plt.xlabel("LDA Component")
plt.ylabel("Frequency")
plt.legend()
plt.show()
# Examine feature contributions to the LDA component
feature_importance = pd.DataFrame({'Feature': x.columns, 'LDA Coefficient': lda.W[0]})
feature_importance = feature_importance.sort_values(by='LDA Coefficient', ascending=False)
# Display feature importance
print("Feature Contributions to LDA Component:")
print(feature_importance)
The histogram shows that we can separate good parts from bad parts very well, with only a small overlap. This is already a positive result and indicates that our LDA was successful.
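To put a rough number on that overlap, here is a small sketch (reusing X_lda and y from the block above) that classifies each part by which side of the midpoint between the two class means it falls on along the LDA axis. It is only a quick check, not a tuned classifier.

# Quick check: classify by the midpoint between the class means on the LDA axis
good_mean = X_lda[(y == 'Good').to_numpy()].mean()
bad_mean = X_lda[(y == 'Bad').to_numpy()].mean()
threshold = (good_mean + bad_mean) / 2

# A part is predicted 'Bad' if it lies on the same side of the threshold as the bad-class mean
on_bad_side = (X_lda[:, 0] > threshold) == (bad_mean > threshold)
predictions = np.where(on_bad_side, 'Bad', 'Good')
accuracy = (predictions == y.to_numpy()).mean()
print(f"Share of parts on the correct side of the threshold: {accuracy:.3f}")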
The “LDA Coefficients” in the “Feature Contributions to LDA Component” table are the components of the first (and, since m=1, only) eigenvector stored in our transformation matrix W. They indicate the direction and magnitude with which the normalized measurements of the stations are projected onto the linear discriminant axis. The values in the table are sorted in descending order, so we need to read it from the top and the bottom simultaneously: the absolute value of a coefficient indicates a station's importance in separating the classes and, consequently, its contribution to the production of defective parts, while the sign indicates whether smaller or larger measurements increase the probability of defective parts (a short snippet after the list ranks the stations by magnitude directly). Let's take a closer look at our example:
The highest absolute value comes from Station 4, with a coefficient of -0.672. This means that Station 4 has the greatest influence on part failure. Due to the negative sign, high positive measurements are projected onto negative linear discriminant (LD) values, and the histogram shows that negative LD values are associated with good (green) parts. Conversely, low and negative measurements at this station increase the probability of part failure.
The second highest absolute value is that of Station 2, with a coefficient of 0.557. This station is therefore the second largest contributor to part failures. The positive sign indicates that high positive measurements are projected onto positive LD values, and from the histogram we know that high positive LD values are associated with a high probability of failure. In other words, high measurements at Station 2 lead to part failures.
The third highest coefficient comes from Station 7, with a value of -0.486. This makes Station 7 the third largest contributor to part failures. The negative sign again indicates that high positive values at this station lead to negative LD values (corresponding to good parts). Conversely, low and negative values at this station lead to part failures.
All other LDA coefficients are an order of magnitude smaller than the three mentioned above, so the associated stations have no noticeable influence on part failure.
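Since importance is determined by the absolute value of a coefficient, the following small addition (reusing the feature_importance DataFrame from above) prints the same table ranked by magnitude, so the most influential stations always appear at the top:

# Rank stations by the magnitude of their LDA coefficient
feature_importance['|LDA Coefficient|'] = feature_importance['LDA Coefficient'].abs()
ranked = feature_importance.sort_values(by='|LDA Coefficient|', ascending=False)
print(ranked[['Feature', 'LDA Coefficient']])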
Are the results of our LDA analysis correct? As you may have already guessed, the production data set was generated synthetically. I labeled a part as defective when the measurement at Station 2 was greater than 0.5, the value at Station 4 was less than -2.5, and the value at Station 7 was less than 3. The LDA hit the mark perfectly!
# Determine if a sample is a good or bad part based on the conditions
data['Label'] = np.where(
    (data['Station_2'] > 0.5) & (data['Station_4'] < -2.5) & (data['Station_7'] < 3),
    'Bad',
    'Good'
)
Linear discriminant analysis (LDA) not only reduces the complexity of data sets, it also highlights the key features that drive class separation, making it highly effective for identifying the causes of failures in production systems. It is a simple but powerful method with practical applications, and it is readily available in libraries such as scikit-learn.
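For comparison, here is a short sketch of how the same projection could be obtained with scikit-learn, assuming the X_scaled matrix and label vector y from the earlier preprocessing step; up to scale and sign, the fitted scalings_ attribute plays the role of our transformation matrix W.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Fit scikit-learn's LDA with a single discriminant axis and project the data
sk_lda = LinearDiscriminantAnalysis(n_components=1)
X_lda_sk = sk_lda.fit_transform(X_scaled, y)

# The learned axis (comparable to our W up to scale and sign)
print(sk_lda.scalings_.ravel())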
To achieve optimal results, it is essential to balance the data set (ensure a similar number of samples in each class) and to normalize it (scale each feature to a mean of 0 and a standard deviation of 1).
The next time you work with a large data set containing class labels and numerous features, why not try LDA?