Why Clustering Fails and How to Fix It | by Ryan Feather | Jul, 2024


You had a data interpretation problem, so you tried clustering. Now you have a cluster interpretation problem! You suspected there might be patterns in the data, and you reasonably hoped that adding some structure through unsupervised learning would yield some insight. Clustering is the go-to tool for finding structure, so you set out on your journey. You spent a considerable amount of money on compute. You invested a lot of sweat in fiddling with the tuning parameters of the clustering algorithm. Just to be sure, you tried several algorithms. But at the end of the day you are left with rainbow graphs of clustered data that might have some meaning, just maybe, if you squint hard enough. You go home with the nagging suspicion that it was all for naught. Sadly, this is all too often the case. But why should it be?

Some real clusters. Image released in the public domain by NASA and STScI.

When a clustering project fails to produce value, the cause is often one or more of three things: poor understanding of the data, insufficient focus on the desired outcome, and poor choice of tools. We will discuss each of these in turn. To motivate the discussion, it helps to understand why clustering techniques exist in the first place, so we will first review what clustering is and some of the issues that drove the development of clustering techniques.