This is the most common method and probably the one you already know. Let's study it anyway, because I will show some advanced analysis techniques along the way. The Jupyter notebook with the complete procedure is called kmeans.ipynb.
Preprocessing
We preprocess the variables as follows:
- Convert the categorical variables into numerical ones. We could apply a one-hot encoder (the usual choice), but in this case we will apply an OrdinalEncoder.
- Try to give the numerical variables a Gaussian-like distribution. To do this we will apply a PowerTransformer.
Let’s see how it looks in the code.
import pandas as pd # dataframe manipulation
import numpy as np # linear algebra
# data visualization
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import shap
# sklearn
from sklearn.cluster import KMeans
from sklearn.preprocessing import PowerTransformer, OrdinalEncoder
from sklearn.pipeline import Pipeline
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score, silhouette_samples, accuracy_score, classification_report
from pyod.models.ecod import ECOD
from yellowbrick.cluster import KElbowVisualizer
import lightgbm as lgb
import prince
df = pd.read_csv("train.csv", sep = ";")
df = df.iloc[:, 0:8]
pipe = Pipeline([('ordinal', OrdinalEncoder()), ('scaler', PowerTransformer())])
pipe_fit = pipe.fit(df)
data = pd.DataFrame(pipe_fit.transform(df), columns = df.columns)
data
Output:
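As a side note, if you prefer the usual one-hot encoding mentioned above, a minimal sketch of an equivalent pipeline (not part of the original notebook) could look like this; cat_cols, num_cols and data_onehot are illustrative names, and the categorical/numerical split is inferred from the column dtypes:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# Hypothetical alternative: one-hot encode the categorical columns
# and power-transform the numerical ones
cat_cols = df.select_dtypes(include="object").columns.tolist()
num_cols = df.select_dtypes(exclude="object").columns.tolist()
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("power", PowerTransformer(), num_cols),
])
data_onehot = preprocess.fit_transform(df)  # array/sparse output, not a DataFrame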
Outliers
It is crucial to have as few outliers as possible in our data, since K-means is very sensitive to them. We could apply the typical approach of flagging outliers with a z-score, but in this post I will show you a much more advanced (and cooler) method.
So, what is this method? We will use the Python Outlier Detection (PyOD) library, which focuses on detecting outliers for different use cases. To be more specific, we will use the ECOD method ("Empirical Cumulative distribution functions for Outlier Detection").
This method estimates the distribution of the data in order to find the values where the probability density is lowest (the outliers). Take a look at the GitHub repository if you like.
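To build some intuition before running the library, here is a toy sketch of the idea behind ECOD (a deliberate simplification, not PyOD's implementation): for each dimension, we measure how extreme a point is under the empirical cumulative distribution, and points that sit in a low-probability tail across dimensions get a high outlier score.
import numpy as np
# Toy sketch of the ECOD intuition (simplified; not PyOD's actual code)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X[0] = [6.0, 6.0]  # plant an obvious outlier
left  = (X[None, :, :] <= X[:, None, :]).mean(axis=1)   # empirical CDF per dimension
right = (X[None, :, :] >= X[:, None, :]).mean(axis=1)   # empirical survival function
score = (-np.log(np.minimum(left, right))).sum(axis=1)  # higher = more extreme
print(score.argmax())  # -> 0, the planted outlier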
from pyod.models.ecod import ECOD

clf = ECOD()
clf.fit(data)
outliers = clf.predict(data)
data["outliers"] = outliers
# Data without outliers
data_no_outliers = data[data["outliers"] == 0]
data_no_outliers = data_no_outliers.drop(["outliers"], axis = 1)
# Data with outliers
data_with_outliers = data.copy()
data_with_outliers = data_with_outliers.drop(["outliers"], axis = 1)
print(data_no_outliers.shape)    # (40691, 8)
print(data_with_outliers.shape)  # (45211, 8)
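One detail worth knowing (not shown in the original code): PyOD detectors expose a contamination parameter, the expected proportion of outliers, which defaults to 0.1. If you want the detector to be stricter or more lenient, you can adjust it; for example:
# Flag roughly 5% of the samples as outliers instead of the default ~10%
clf_strict = ECOD(contamination=0.05)
clf_strict.fit(data_with_outliers)
print(clf_strict.predict(data_with_outliers).mean())  # fraction flagged, roughly 0.05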
Modeling
One of the disadvantages of using the K-means algorithm is that you must choose the number of clusters. In this case we will use the elbow method, which consists of calculating the distortion between the points of each cluster and its centroid. The objective is clear: to obtain the least possible distortion. We use the following code:
from yellowbrick.cluster import KElbowVisualizer

# Instantiate the clustering model and visualizer
km = KMeans(init="k-means++", random_state=0, n_init="auto")
visualizer = KElbowVisualizer(km, k=(2,10))
visualizer.fit(data_no_outliers) # Fit the data to the visualizer
visualizer.show()
Output:
We see that from k=5 onwards the distortion does not vary drastically. Ideally, the behavior from k=5 would be almost flat; this rarely happens, and other methods can be applied to be more confident about the optimal number of clusters. For example, we can create silhouette plots. The code is the following:
from sklearn.metrics import davies_bouldin_score, silhouette_score, silhouette_samples
import matplotlib.cm as cm

def make_Silhouette_plot(X, n_clusters):
    plt.xlim([-0.1, 1])
    plt.ylim([0, len(X) + (n_clusters + 1) * 10])
    # Fit K-means and get the cluster label of each sample
    clusterer = KMeans(n_clusters=n_clusters, max_iter=1000, n_init=10, init='k-means++', random_state=10)
    cluster_labels = clusterer.fit_predict(X)
    silhouette_avg = silhouette_score(X, cluster_labels)
    print(
        "For n_clusters =", n_clusters,
        "The average silhouette_score is :", silhouette_avg,
    )
    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)
    y_lower = 10
    for i in range(n_clusters):
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        color = cm.nipy_spectral(float(i) / n_clusters)
        plt.fill_betweenx(
            np.arange(y_lower, y_upper),
            0,
            ith_cluster_silhouette_values,
            facecolor=color,
            edgecolor=color,
            alpha=0.7,
        )
        plt.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        y_lower = y_upper + 10
    plt.title(f"The Silhouette Plot for n_cluster = {n_clusters}", fontsize=26)
    plt.xlabel("The silhouette coefficient values", fontsize=24)
    plt.ylabel("Cluster label", fontsize=24)
    # Vertical line marking the average silhouette score of all samples
    plt.axvline(x=silhouette_avg, color="red", linestyle="--")
    plt.yticks([])
    plt.xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

range_n_clusters = list(range(2, 10))
for n_clusters in range_n_clusters:
    print(f"N cluster: {n_clusters}")
    make_Silhouette_plot(data_no_outliers, n_clusters)
    plt.savefig('Silhouette_plot_{}.png'.format(n_clusters))
    plt.close()
OUTPUT:
"""
N cluster: 2
For n_clusters = 2 The average silhouette_score is : 0.1775761520337095
N cluster: 3
For n_clusters = 3 The average silhouette_score is : 0.20772622268785523
N cluster: 4
For n_clusters = 4 The average silhouette_score is : 0.2038116470937145
N cluster: 5
For n_clusters = 5 The average silhouette_score is : 0.20142888327171368
N cluster: 6
For n_clusters = 6 The average silhouette_score is : 0.20252892716996912
N cluster: 7
For n_clusters = 7 The average silhouette_score is : 0.21185490763840265
N cluster: 8
For n_clusters = 8 The average silhouette_score is : 0.20867816457291538
N cluster: 9
For n_clusters = 9 The average silhouette_score is : 0.21154289421300868
"""
It can be seen that the highest silhouette score is obtained with n_clusters=9, but the variation with respect to the other scores is quite small, so this result alone does not give us much information. On the other hand, the previous code also creates the silhouette visualizations, which tell us more:
Since understanding these plots in depth is not the goal of this post, I will just conclude that there does not seem to be a clear-cut winner. Looking at the visualizations above, we can choose K=5 or K=6, because for those values most groups have a silhouette score above the average and there is no strong imbalance in group size. Additionally, in some situations the marketing department may be interested in having the smallest possible number of customer groups/types (this may or may not apply).
Finally, we can create our K-means model with K=5.
km = KMeans(n_clusters=5,
            init='k-means++',
            n_init=10,
            max_iter=100,
            random_state=42)

clusters_predict = km.fit_predict(data_no_outliers)

"""
clusters_predict -> array([4, 2, 0, ..., 3, 4, 3])
np.unique(clusters_predict) -> array([0, 1, 2, 3, 4])
"""
Assessment
The way of evaluating K-means models is somewhat more open-ended than for other models. We can use:
- metrics
- visualizations
- interpretation (something very important for companies).
Regarding the model evaluation metrics, we can use the following code:
from sklearn.metrics import silhouette_score
from sklearn.metrics import calinski_harabasz_score
from sklearn.metrics import davies_bouldin_score

"""
The Davies Bouldin index is defined as the average similarity measure
of each cluster with its most similar cluster, where similarity
is the ratio of within-cluster distances to between-cluster distances.
The minimum value of the DB Index is 0, whereas a smaller
value (closer to 0) represents a better model that produces better clusters.
"""
print(f"Davies bouldin score: {davies_bouldin_score(data_no_outliers,clusters_predict)}")
"""
Calinski Harabaz Index -> Variance Ratio Criterion.
Calinski Harabaz Index is defined as the ratio of the
sum of between-cluster dispersion and of within-cluster dispersion.
The higher the index the more separable the clusters.
"""
print(f"Calinski Score: {calinski_harabasz_score(data_no_outliers,clusters_predict)}")
"""
The silhouette score is a metric used to calculate the goodness of
fit of a clustering algorithm, but can also be used as
a method for determining an optimal value of k (see here for more).
Its value ranges from -1 to 1.
A value of 0 indicates clusters are overlapping and either
the data or the value of k is incorrect.
1 is the ideal value and indicates that clusters are very
dense and nicely separated.
"""
print(f"Silhouette Score: {silhouette_score(data_no_outliers,clusters_predict)}")
OUTPUT:
"""
Davies bouldin score: 1.5480952939773156
Calinski Score: 7646.959165727562
Silhouette Score: 0.2013600389183821
"""
As can be seen, we do not have an especially good model. The Davies-Bouldin score is telling us that the distance between the clusters is quite small.
This can be due to several factors, but keep in mind that the fuel of a model is its data; if the data does not have sufficient predictive power, you cannot expect to achieve exceptional results.
For the visualizations, we can use a dimensionality reduction method, PCA. For this we are going to use the prince library, focused on exploratory analysis and dimensionality reduction. If you prefer, you can use scikit-learn's PCA; they are identical.
We will first compute the principal components in 3D and then make the plot. These are the functions that perform those steps:
import prince
import plotly.express as px

def get_pca_2d(df, predict):
    pca_2d_object = prince.PCA(
        n_components=2,
        n_iter=3,
        rescale_with_mean=True,
        rescale_with_std=True,
        copy=True,
        check_input=True,
        engine='sklearn',
        random_state=42
    )
    pca_2d_object.fit(df)
    df_pca_2d = pca_2d_object.transform(df)
    df_pca_2d.columns = ["comp1", "comp2"]
    df_pca_2d["cluster"] = predict
    return pca_2d_object, df_pca_2d

def get_pca_3d(df, predict):
    pca_3d_object = prince.PCA(
        n_components=3,
        n_iter=3,
        rescale_with_mean=True,
        rescale_with_std=True,
        copy=True,
        check_input=True,
        engine='sklearn',
        random_state=42
    )
    pca_3d_object.fit(df)
    df_pca_3d = pca_3d_object.transform(df)
    df_pca_3d.columns = ["comp1", "comp2", "comp3"]
    df_pca_3d["cluster"] = predict
    return pca_3d_object, df_pca_3d

def plot_pca_3d(df, title="PCA Space", opacity=0.8, width_line=0.1):
    df = df.astype({"cluster": "object"})
    df = df.sort_values("cluster")
    fig = px.scatter_3d(
        df,
        x='comp1',
        y='comp2',
        z='comp3',
        color='cluster',
        template="plotly",
        # symbol = "cluster",
        color_discrete_sequence=px.colors.qualitative.Vivid,
        title=title
    ).update_traces(
        # mode = 'markers',
        marker={
            "size": 4,
            "opacity": opacity,
            # "symbol" : "diamond",
            "line": {
                "width": width_line,
                "color": "black",
            }
        }
    ).update_layout(
        width=800,
        height=800,
        autosize=True,
        showlegend=True,
        legend=dict(title_font_family="Times New Roman",
                    font=dict(size=20)),
        scene=dict(xaxis=dict(title='comp1', titlefont_color='black'),
                   yaxis=dict(title='comp2', titlefont_color='black'),
                   zaxis=dict(title='comp3', titlefont_color='black')),
        font=dict(family="Gilroy", color='black', size=15)
    )
    fig.show()
Don’t worry too much about those functions; use them as follows:
pca_3d_object, df_pca_3d = get_pca_3d(data_no_outliers, clusters_predict)
plot_pca_3d(df_pca_3d, title = "PCA Space", opacity=1, width_line = 0.1)
print("The variability is :", pca_3d_object.eigenvalues_summary)
Output:
It can be seen that the clusters have almost no separation between them and there is no clear division. This is in accordance with the information provided by the metrics.
Something to take into account, and that very few people do, is the variability explained by the PCA eigenvectors.
Roughly speaking, each principal component captures a share of the information contained in the data. If the cumulative variability of the 3 main components adds up to around 80%, we can consider it acceptable and the representations will be informative. If the value is lower, we have to take the visualizations with caution, since we are missing a lot of information contained in the remaining eigenvectors.
The next question is obvious: what is the variability of the PCA we just ran?
The answer is the following:
As can be seen, we have 48.37% of the variability with the first 3 components, which is insufficient to draw reliable conclusions.
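If you want to verify this figure yourself, prince already prints it through eigenvalues_summary, but you can also get the cumulative explained variance with scikit-learn's PCA, which, as mentioned, behaves the same on this standardized data. A minimal sketch (pca_check is an illustrative name):
from sklearn.decomposition import PCA
# Cumulative share of variance captured by the first 3 components
pca_check = PCA(n_components=3).fit(data_no_outliers)
print(pca_check.explained_variance_ratio_.cumsum())  # should end close to ~0.48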
It turns out that when a PCA is run, the spatial structure is not fully preserved. Luckily there is a lesser-known method, called t-SNE, that reduces dimensionality while also preserving the local structure of the data. This can help us visualize, since with the previous method we did not have much success.
If you try it on your computer, keep in mind that it has a higher computational cost. For this reason, I sampled my original data set, and it still took about 5 minutes to get the result. The code is the following:
from sklearn.manifold import TSNE

sampling_data = data_no_outliers.sample(frac=0.5, replace=True, random_state=1)
sampling_clusters = pd.DataFrame(clusters_predict).sample(frac=0.5, replace=True, random_state=1)[0].values

df_tsne_3d = TSNE(
    n_components=3,
    learning_rate=500,
    init='random',
    perplexity=200,
    n_iter=5000).fit_transform(sampling_data)

df_tsne_3d = pd.DataFrame(df_tsne_3d, columns=["comp1", "comp2", "comp3"])
df_tsne_3d["cluster"] = sampling_clusters
plot_pca_3d(df_tsne_3d, title = "PCA Space", opacity=1, width_line = 0.1)
As a result, I got the image below. It shows greater separation between clusters and allows us to draw conclusions more clearly.
In fact, we can compare the reductions made by PCA and by t-SNE in 2 dimensions. The improvement with the second method is clear.
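The code for that 2D comparison is not shown above, but a minimal sketch could reuse get_pca_2d together with a 2D t-SNE on the same sample, plotted side by side (tsne_2d, fig and axes are illustrative names; the t-SNE parameters mirror the 3D call):
# PCA (left) vs t-SNE (right) in 2 dimensions
_, df_pca_2d = get_pca_2d(data_no_outliers, clusters_predict)
tsne_2d = TSNE(n_components=2, learning_rate=500, init='random',
               perplexity=200, n_iter=5000).fit_transform(sampling_data)
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
axes[0].scatter(df_pca_2d["comp1"], df_pca_2d["comp2"], c=clusters_predict, cmap="viridis", s=3)
axes[0].set_title("PCA (2D)")
axes[1].scatter(tsne_2d[:, 0], tsne_2d[:, 1], c=sampling_clusters, cmap="viridis", s=3)
axes[1].set_title("t-SNE (2D)")
plt.show()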
Finally, let’s explore a little how the model works: which features are the most important and what the main characteristics of the clusters are.
To see the importance of each variable we will use a typical “trick” in this kind of situation: create a classification model where the “X” is the input of the K-means model and the “y” is the clusters it predicted.
The chosen model is an LGBMClassifier. This model is quite powerful and works well with categorical and numerical variables. Once the new model is trained, using the SHAP library we can obtain the importance of each feature in the prediction. The code is:
import lightgbm as lgb
import shap

# We create the LGBMClassifier model and train it
clf_km = lgb.LGBMClassifier(colsample_bytree=0.8)
clf_km.fit(X=data_no_outliers, y=clusters_predict)

# SHAP values
explainer_km = shap.TreeExplainer(clf_km)
shap_values_km = explainer_km.shap_values(data_no_outliers)
shap.summary_plot(shap_values_km, data_no_outliers, plot_type="bar", plot_size=(15, 10))
Output:
You can see that the housing feature has the greatest predictive power. It can also be seen that cluster number 4 (in green) is mainly differentiated by the loan variable.
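Before trusting these importances, it is worth checking that the surrogate classifier actually reproduces the K-means labels well; a quick check (using the accuracy_score and classification_report imports from the beginning, not shown in the original code) could be:
# How faithfully does the LGBM surrogate reproduce the K-means clusters?
pred_km = clf_km.predict(data_no_outliers)
print("Accuracy vs. K-means labels:", accuracy_score(clusters_predict, pred_km))
print(classification_report(clusters_predict, pred_km))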
Finally, we must analyze the characteristics of the clusters. This part of the study is decisive for the business. To do so, we will obtain the mean (for the numerical variables) and the most frequent value (for the categorical variables) of each feature of the data set, per cluster:
# Attach the outlier flag computed earlier with ECOD to the original dataframe
df["outliers"] = outliers
df_no_outliers = df[df.outliers == 0]
df_no_outliers["cluster"] = clusters_predict

df_no_outliers.groupby('cluster').agg(
    {
        'job': lambda x: x.value_counts().index[0],
        'marital': lambda x: x.value_counts().index[0],
        'education': lambda x: x.value_counts().index[0],
        'housing': lambda x: x.value_counts().index[0],
        'loan': lambda x: x.value_counts().index[0],
        'contact': lambda x: x.value_counts().index[0],
        'age': 'mean',
        'balance': 'mean',
        'default': lambda x: x.value_counts().index[0],
    }
).reset_index()
Output:
We see that the clusters whose most frequent job is blue-collar show little differentiation between their characteristics. This is not desirable, since it is hard to tell the clients of those clusters apart. The clusters dominated by management jobs, on the other hand, are better differentiated.
After carrying out the analysis in different ways, the conclusions converge on the same point: “We need to improve the results.”