Now, let's focus on internal validation and external validation. Below I list a selection of metrics with hyperlinks; since I will not be covering the formulas for these metrics, readers are encouraged to follow the links for their detailed definitions and formulas.
A. Metrics used for Internal Validation
The goal of internal validation is to establish the quality of the clustering structure based solely on the data set provided.
Classification of internal evaluation methods:
Internal validation methods can be classified according to the class of clustering methodology they are designed for. A typical classification of clustering methods can be formulated as follows:
- Partition methods (e.g. K-means),
- Hierarchical methods (e.g. agglomerative clustering),
- Density-based methods (e.g. DBSCAN), and
- the rest
Here, I cover the first two: partition methods and hierarchical methods.
a) Partition methods: e.g. K-means
For partitioning methods, there are three bases of evaluation metrics: cohesion, separation, and their hybrid.
Cohesion:
Cohesion evaluates how tightly packed the data points within a cluster are. The lower the value of a cohesion metric, the better the quality of the clusters. An example of a cohesion metric is:
- SSW: Sum of squared errors within the cluster.
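As a quick illustration, here is a minimal scikit-learn sketch of SSW (a toy data set from make_blobs is assumed; for K-means, scikit-learn exposes SSW as the fitted model's inertia_ attribute):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data purely for illustration.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# SSW: sum of squared distances of samples to their closest cluster center;
# scikit-learn calls this `inertia_`.
ssw = kmeans.inertia_
print(f"SSW (within-cluster sum of squares): {ssw:.2f}")
```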
Separation:
Separation is a metric between clusters and evaluates the dispersion of the data structure between clusters. The idea behind a separation metric is to maximize the distance between groups. An example of a separation metric is:
- SSB: Sum of squared errors between clusters.
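Again as a sketch, SSB can be computed directly with NumPy under its usual definition (size-weighted squared distances between each cluster centroid and the global centroid), reusing X and kmeans from the SSW snippet:

```python
import numpy as np

# SSB: size-weighted sum of squared distances between each cluster
# centroid and the global centroid (X and kmeans come from the SSW sketch).
global_centroid = X.mean(axis=0)
labels = kmeans.labels_

ssb = sum(
    np.sum(labels == k) * np.sum((center - global_centroid) ** 2)
    for k, center in enumerate(kmeans.cluster_centers_)
)
print(f"SSB (between-cluster sum of squares): {ssb:.2f}")
```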
Hybrid of cohesion and separation:
The hybrid type quantifies the level of separation and cohesion in a single metric. Here is a list of examples:
i) The silhouette coefficient: in the range of [-1, 1]
For each sample, this metric compares the average distance to the points in its own cluster with the average distance to the points in the nearest neighboring cluster.
Here is a general interpretation of the metric:
- Best value: 1.
- Worst value: -1.
- Values close to 0: overlapping clusters.
- Negative values: high possibility that a sample has been assigned to an incorrect cluster.
Below is an example use case for the metric: https://www.geeksforgeeks.org/silhouette-index-cluster-validity-index-set-2/?ref=ml_lbp
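In code, scikit-learn provides this as silhouette_score; a minimal sketch reusing X and kmeans from the earlier snippets:

```python
from sklearn.metrics import silhouette_score

# Mean silhouette coefficient over all samples; closer to 1 is better.
score = silhouette_score(X, kmeans.labels_)
print(f"Silhouette coefficient: {score:.3f}")
```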
ii) The Calinski-Harabasz coefficient:
Also known as the Variance Ratio Criterion, this metric measures the ratio of the between-group dispersion to the within-group dispersion over all groups.
For a given assignment of clusters, the higher the value of the metric, the better the clustering result, since a higher value indicates that the resulting clusters are compact and well separated.
Below is an example use case for the metric: https://www.geeksforgeeks.org/dunn-index-and-db-index-cluster-validity-indices-set-1/?ref=ml_lbp
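scikit-learn also ships this one as calinski_harabasz_score; a minimal sketch, again reusing X and kmeans:

```python
from sklearn.metrics import calinski_harabasz_score

# Higher is better: compact, well-separated clusters.
score = calinski_harabasz_score(X, kmeans.labels_)
print(f"Calinski-Harabasz score: {score:.2f}")
```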
iii) The Dunn index:
The Dunn index is defined as the ratio of the minimum inter-cluster distance to the maximum intra-cluster diameter. For a given cluster assignment, a higher Dunn index indicates better clustering.
Below is an example use case for the metric: https://www.geeksforgeeks.org/dunn-index-and-db-index-cluster-validity-indices-set-1/?ref=ml_lbp
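scikit-learn has no built-in Dunn index, so below is a minimal NumPy/SciPy sketch under the common definition above (minimum inter-cluster distance divided by maximum intra-cluster diameter); the helper name dunn_index is mine:

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    """Minimum inter-cluster distance / maximum intra-cluster diameter."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    # Largest pairwise distance within any single cluster.
    max_diameter = max(pdist(c).max() for c in clusters if len(c) > 1)
    # Smallest distance between points of two different clusters.
    min_separation = min(
        cdist(clusters[i], clusters[j]).min()
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )
    return min_separation / max_diameter

print(f"Dunn index: {dunn_index(X, kmeans.labels_):.3f}")
```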
iv) The Davies-Bouldin score:
The metric measures the average similarity between each cluster and its most similar neighbor, trading off within-cluster dispersion against between-cluster separation. A lower score indicates a denser intra-cluster structure and better-separated clusters, hence a better clustering result.
Below is an example use case for the metric: https://www.geeksforgeeks.org/davies-bouldin-index/
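This one is available in scikit-learn as davies_bouldin_score:

```python
from sklearn.metrics import davies_bouldin_score

# Lower is better; the best possible score is 0.
score = davies_bouldin_score(X, kmeans.labels_)
print(f"Davies-Bouldin score: {score:.3f}")
```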
b) Hierarchical methods: e.g. the agglomerative clustering algorithm
i) Human judgment based on the visual representation of the dendrogram.
Although Palacio-Niño & Berzal did not include human judgment, it is one of the most useful tools for the internal validation of hierarchical clustering: plot the dendrogram and inspect it, as in the sketch below.
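Here is a minimal SciPy sketch for plotting a dendrogram, reusing the toy X from earlier (Ward linkage is just one possible choice):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Build the linkage matrix and plot the dendrogram for visual inspection.
Z = linkage(X, method="ward")
dendrogram(Z)
plt.title("Dendrogram (Ward linkage)")
plt.xlabel("Sample index")
plt.ylabel("Merge distance")
plt.show()
```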
Instead, the co-authors listed the following two correlation coefficient metrics specialized in evaluating the results of hierarchical clustering.
For both, higher values indicate better results, and both take values in the range of [-1, 1].
ii) The cophenetic correlation coefficient (CPCC): [-1, 1]
It measures the correlation between the original pairwise distances of the observations and the cophenetic distances implied by the dendrogram's linkage.
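SciPy ships this as cophenet; a minimal sketch reusing the linkage matrix Z from the dendrogram snippet:

```python
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

# Correlation between the original pairwise distances and the
# cophenetic distances implied by the linkage matrix Z.
cpcc, cophenetic_dists = cophenet(Z, pdist(X))
print(f"CPCC: {cpcc:.3f}")
```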
iii) Hubert's statistic: [-1, 1]
A higher Hubert value corresponds to better clustering of data.
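There is no standard SciPy/scikit-learn function for Hubert's statistic, so below is a sketch of one common normalized variant only: the Pearson correlation between the pairwise distances and a binary "the pair falls in different clusters" indicator. The helper name hubert_gamma is mine, and other formulations exist:

```python
import numpy as np
from scipy.spatial.distance import pdist

def hubert_gamma(X, labels):
    # One normalized variant: Pearson correlation between pairwise
    # distances and a 0/1 "pair falls in different clusters" indicator.
    distances = pdist(X)
    different_cluster = pdist(labels.reshape(-1, 1), metric="hamming")
    return np.corrcoef(distances, different_cluster)[0, 1]

print(f"Hubert's statistic: {hubert_gamma(X, kmeans.labels_):.3f}")
```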
c) Potential category: Self-supervised learning
Self-supervised learning can generate feature representations that can then be used for clustering. It does not rely on explicit labels; instead, it uses the input data itself as the supervisory signal for learning. Palacio-Niño and Berzal did not include self-supervised frameworks, such as autoencoders and GANs, in this section of their proposal; after all, they are not clustering algorithms per se. However, I'll keep this particular domain on my note. Time will tell whether any specialized metrics emerge from it.
Before closing the internal validation section, here is a warning from Gere (2023).
“Choosing the appropriate hierarchical clustering algorithm and number of clusters is always a key issue… In many cases, researchers do not publish any reason why a given distance measure and linkage rule were chosen along with the number of clusters. The reason behind this could be that different cluster comparison and validation techniques give contradictory results in most cases. …The results of the validation methods deviate, suggesting that the clustering is highly dependent on the data set in question. Although Euclidean distance with Ward's method seems like a safe option, it is highly recommended to test and validate different clustering combinations.”
Yes, it is a difficult task.
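Still, to act on Gere's advice, one possible (and certainly not definitive) sketch is to sweep a few linkage rules on the same data and compare a validation metric such as the silhouette coefficient:

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import silhouette_score

# Compare a few linkage rules on the same toy X; contradictory rankings
# across metrics and data sets are exactly what Gere (2023) warns about.
for method in ["ward", "complete", "average", "single"]:
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=4, criterion="maxclust")
    print(f"{method:>8}: silhouette = {silhouette_score(X, labels):.3f}")
```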
Now, let's move on to external validation.