Table of Contents
1. Introduction
2. What does the K-Means algorithm do?
3. Implementation in Python
4. Evaluation and Interpretation
5. Conclusions and next steps
Most of the widely used machine learning algorithms, such as linear regression, logistic regression, and decision trees, are useful for making predictions from labeled data, that is, each input comprises feature values with an associated label. This is called supervised learning.
However, we often have to deal with large data sets without any associated labels. Imagine a company that needs to understand different customer groups based on purchasing behavior, demographics, address and other information, in order to offer better services, products and promotions.
These types of problems can be addressed with the use of Unsupervised learning techniques. The K-Means algorithm is an unsupervised learning algorithm widely used in Machine Learning. Its simple and elegant approach makes it possible to separate a data set into a desired number of K distinct clusters, allowing patterns to be learned from unlabeled data.
As stated above, the K-Means algorithm seeks to divide the data points into a given number of groups. The points within each group are similar, while the points in different groups have considerable differences.
That being said, a question arises: how do we define similarity or difference? In K-Means clustering, Euclidean distance is the most common metric to measure similarity.
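For two points stored as NumPy arrays, this distance is straightforward to compute (a minimal illustration, separate from the class we will build later):

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
dist = np.sqrt(np.sum((a - b) ** 2))  # Euclidean distance: 5.0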
In the following figure we can clearly see 3 different groups. In this way, we could determine the centers of each group and each point would be associated with the closest center.
By doing that, mathematically speaking, the idea is to minimize the within-cluster variance, i.e. the measure of similarity between each point and its nearest center.
Performing the task in the previous example was easy because the data was two-dimensional and the groups were clearly different. However, as the number of dimensions increases and different values of K are considered, we need an algorithm to handle the complexity.
Step 1: Choose the starting centers (randomly)
We need to seed the algorithm with initial center vectors, which can be chosen randomly from the data or generated as random vectors with the same dimensions as the original data. See the white diamonds in the image below.
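A sketch of this initialization, assuming the data is stored in a NumPy array x:

import numpy as np

rng = np.random.default_rng(0)
K = 3
# pick K distinct data points as the initial centers
centers = x[rng.choice(len(x), size=K, replace=False)]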
Step 2: Find the distances of each point from the centers.
Now, we will calculate the distance of each data point to the K centers. We then associate each point with the center closest to that point.
Given a data set with $n$ entries and $m$ features, the distance from a point to each center $c_k$ can be given by the following equation:

$$d_k = \sqrt{\sum_{j=1}^{m}\left(x_j - c_{k,j}\right)^2}$$

where:

k varies from 1 to K;
d_k is the distance from point x to center c_k;
x is the point vector;
c_k is the center vector.

Therefore, for each data point $n$ we will have $K$ distances, and we then label the point with the center that gives the smallest distance:

$$\text{label} = \operatorname{argmin}_{k}(d_k)$$

where $d$ is the vector of the $K$ distances.
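Continuing the sketch from Step 1, with x an (n, m) NumPy array and centers a (K, m) array, this assignment step can be written in a vectorized way:

# distances[i, k] = Euclidean distance from point i to center k
distances = np.sqrt(((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
# assign each point to the index of its closest center
labels = np.argmin(distances, axis=1)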
Step 3: Find the K centroids and iterate
For each of the k groups, recalculate the centroid. The new centroid is the mean of all data points assigned to that group. Then update the centroid positions to the newly calculated ones.
Check if the centroids have changed significantly from the previous iteration. This can be done by comparing the positions of the centroids in the current iteration with those of the last iteration.
If the centroids have changed significantly, return to Step 2. Otherwise, the algorithm has converged and the process stops. See the image below.
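Continuing the same sketch, the update and the convergence check might look like this (tol is an assumed small tolerance):

tol = 1e-6
# recompute each center as the mean of the points assigned to it
new_centers = np.array([x[labels == k].mean(axis=0) for k in range(K)])
# total movement of the centers since the previous iteration
movement = np.sqrt(((new_centers - centers) ** 2).sum(axis=1)).sum()
if movement < tol:
    print("converged")  # stop
else:
    centers = new_centers  # repeat from Step 2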
Now that we know the fundamental concepts of the K-Means algorithm, it's time to implement a Python class. The packages used are NumPy for mathematical calculations, Matplotlib for visualization, and Scikit-learn's make_blobs function for generating simulated data.
# import required packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
The class will have the following methods:

- A constructor method to initialize the basic parameters of the algorithm: the number of clusters k, the maximum number of iterations max_iter, and the tolerance tol used to stop the optimization when there is no significant improvement.
- Helper methods for the optimization process during training: calculating the Euclidean distance, randomly choosing the initial centroids, assigning the closest centroid to each point, updating the centroid values, and checking whether the optimization has converged.
- Since K-Means is an unsupervised learning technique that does not require labeled data during training, a single fit_predict method is needed to both fit the data and predict which group each data point belongs to.
- A method to evaluate the quality of the optimization by calculating the total squared error. This will be explored in the next section.
Here is the complete code:
# helper function to calculate the Euclidean distance between two vectors
def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))


class Kmeans:
    # constructor method for hyperparameter initialization
    def __init__(self, k=3, max_iter=100, tol=1e-06):
        self.k = k
        self.max_iter = max_iter
        self.tol = tol

    # randomly picks the initial centroids from the input data
    def pick_centers(self, x):
        centers_idxs = np.random.choice(self.n_samples, self.k, replace=False)
        return x[centers_idxs]

    # finds the closest centroid for each data point
    def get_closest_centroid(self, x, centroids):
        distances = [euclidean_distance(x, centroid) for centroid in centroids]
        return np.argmin(distances)

    # creates a list of lists containing the idxs of the points in each cluster
    def create_clusters(self, centroids, x):
        clusters = [[] for _ in range(self.k)]
        labels = np.empty(self.n_samples, dtype=int)
        for i, point in enumerate(x):
            centroid_idx = self.get_closest_centroid(point, centroids)
            clusters[centroid_idx].append(i)
            labels[i] = centroid_idx
        return clusters, labels

    # calculates the centroids for each cluster using the mean value
    def compute_centroids(self, clusters, x):
        centroids = np.empty((self.k, self.n_features))
        for i, cluster in enumerate(clusters):
            centroids[i] = np.mean(x[cluster], axis=0)
        return centroids

    # helper function to verify if the centroids changed significantly
    def is_converged(self, old_centroids, new_centroids):
        distances = [euclidean_distance(old_centroids[i], new_centroids[i])
                     for i in range(self.k)]
        return sum(distances) < self.tol

    # method to train on the data, find the optimized centroids
    # and label each data point according to its cluster
    def fit_predict(self, x):
        self.n_samples, self.n_features = x.shape
        self.centroids = self.pick_centers(x)
        for _ in range(self.max_iter):
            self.clusters, self.labels = self.create_clusters(self.centroids, x)
            new_centroids = self.compute_centroids(self.clusters, x)
            if self.is_converged(self.centroids, new_centroids):
                break
            self.centroids = new_centroids

    # method for evaluating the intracluster variance of the optimization
    def clustering_errors(self, x):
        cluster_values = [x[cluster] for cluster in self.clusters]
        squared_distances = []
        # calculation of total squared Euclidean distance
        for i, cluster_array in enumerate(cluster_values):
            squared_distances.append(np.sum((cluster_array - self.centroids[i]) ** 2))
        total_error = np.sum(squared_distances)
        return total_error
Now we will use the Kmeans class to perform clustering on simulated data. For this, we will use Scikit-learn's make_blobs function. The data consists of 500 two-dimensional points around 4 fixed centers.
# create simulated data for examples
x, _ = make_blobs(n_samples=500, n_features=2, centers=4,
shuffle=False, random_state=0)
After performing training using four clusters, we achieved the following result.
model = Kmeans(k=4)
model.fit_predict(x)
labels = model.labels
centroids = model.centroids
plot_clusters(x, labels, centroids)
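Note that plot_clusters is not part of the Kmeans class; a minimal version of this helper with Matplotlib could look like this (the name and signature are simply the ones assumed above):

def plot_clusters(x, labels, centroids):
    # scatter the points, colored by their cluster label
    plt.scatter(x[:, 0], x[:, 1], c=labels, s=15)
    # mark the final centroids as diamonds
    plt.scatter(centroids[:, 0], centroids[:, 1], c="red", marker="D", s=80)
    plt.xlabel("feature 1")
    plt.ylabel("feature 2")
    plt.show()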
In this case, the algorithm successfully computed the clusters in 18 iterations. However, we must keep in mind that we already know the optimal number of clusters from the simulated data. In real-world applications, we often don't know that value.
As mentioned before, the K-Means algorithm aims to make the within-cluster variance as small as possible. The metric used to calculate this variance is the total squared Euclidean distance, given by:

$$E = \sum_{k=1}^{K}\sum_{i=1}^{p}\left\lVert x_i - c_k \right\rVert^2$$

where:

p is the number of data points in a group;
c_k is the centroid vector of a group;
K is the number of groups.
In words, the above formula sums the squared distances of the data points to their nearest centroid. The error decreases as K increases.
In the extreme case of K = N, we have one group for each data point, and the error will be zero (Wilmott, 2019).
If we plot the error as a function of the number of clusters and look at where the graph “bends”, we can find the optimal number of clusters.
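As a sketch, such a plot can be produced by training the class over a range of K values and collecting the total error from the clustering_errors method (the range 1 to 10 is an arbitrary choice):

errors = []
k_values = range(1, 11)
for k in k_values:
    model = Kmeans(k=k)
    model.fit_predict(x)
    errors.append(model.clustering_errors(x))

plt.plot(k_values, errors, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("total squared error")
plt.show()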
As we can see, the graph has an "elbow" shape and bends at K = 4, meaning that for larger values of K the decrease in total error becomes less significant.
In this article, we covered the fundamental concepts behind the K-Means algorithm, its uses, and its applications. Additionally, using these concepts, we implemented a Python class from scratch that performed clustering on simulated data, and we showed how to find the optimal value of K using a scree plot.
However, since this is an unsupervised technique, there is an additional step. The algorithm can successfully assign a label to each group, but interpreting the meaning of each label is a task that the data scientist or machine learning engineer must perform by analyzing the data in each group.
Also, I'll leave a few points for further exploration:
- Our simulated data used two-dimensional points. Try using the algorithm for other data sets and find the optimal values for K.
- There are other widely used unsupervised learning algorithms, such as hierarchical clustering.
- Depending on the problem domain, it may be necessary to use other distance metrics, such as Manhattan distance and cosine similarity. Try investigating them.
Full code available here: ai-from-scratch/tree/main/K-Means