There are many data science problems where you don’t have labelled data and need to use clustering to find related points. For these clustering problems, HDBSCAN is a great algorithm. It was originally created by Campello et al. and there is a fast Python implementation written by Leland Mcinnes and John Healy. When I refer to HDBSCAN I’ll be talking about the python implementation/package. It is generally the first clustering method I try for a variety of reasons:
- You don’t need to specify the number of clusters. Other clustering methods such as k-means require that you specify the number of clusters to find in your data, and this is hard to know ahead of time. HDBSCAN will find the natural number of clusters in your data. All you need to specify is the minimum number of points that a cluster should have (which is much easier to have an intuition for).
- Many other clustering algorithms make assumptions about the shape of the clusters (e.g. they must fit in a circle) or they are all the same density. In real data this is generally not true and HDBSCAN finds clusters with varying shapes/densities.
- HDBSCAN will label points as noise/outliers. Many clustering algorithms force every point into a cluster. However, real world data is messy and having outliers improves the quality of the clusters (since they aren’t polluted by noise).
- It generally just works. I find I spend much less time fiddling with parameters and spend more time looking at my actual data.
Let’s look at an example
The first thing we need is an embedding, which as you might recall is just a numeric representation of your data with a way to measure distance between points. Shown below is a plot of a sample dataset. While it is an artificial dataset, it has many properties of real data:
- There are a lot of noisy/outlying points which don’t belong in any cluster
- The groups of points are different shapes and you can see in some clusters that the points are much closer together, while in others they are less dense.
Let’s try clustering this data. First we import HDBSCAN and load our data
import hdbscan import numpy as np data = np.load('clusterable_data.npy')
Clustering the data is as simple as
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, metric='euclidean') clusterer.fit_predict(data)
Here we are saying that there must be at least 15 points close together before we say that something is a cluster. How do we measure “close together”? We also specified a Euclidean distance metric. This is the default metric but HDBSCAN can use many other metrics. Euclidean is the default metric, but it is always better to explicitly state your distance measure for other people reading the code. If we wanted to use cosine distance instead of Euclidean (despite Euclidean being the better choice in this case) we could do
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, metric='cosine') clusterer.fit_predict(data)
We can see which cluster each point belongs to using
labels = clusterer.labels_ print(labels) # [ 5 5 5 ... -1 -1 5]
The first points shown are part of cluster 5. Points that are outliers are given a label of “-1” so they are easy to filter out. Let’s remake the plot above but colour the points based on their cluster label. The outlying points (part of the -1 cluster) will be grey.
You can see that the resulting clusters are pretty good. More importantly, they match what we would intuitively pick as the clusters if we had to draw lines around the groups of points.
HDBSCAN and its python implementation is a fast clustering algorithm that is easy to use. It naturally handles a lot of the messiness of real world data and lets you spend more time focussing on the problem you are trying to solve. If you want to learn more about how HDBSCAN works and see other examples check out the resources below.
- Blog on understanding HDBSCAN which is similar to this blog but goes into much more detail
- How HDBSCAN works from the official documentation
- HDBSCAN, Fast Density Based Clustering, the How and the Why - John Healy