Clustering
scikit-learn implements many clustering algorithms. Below is a comparison adapted from its documentation.
Clustering is an unsupervised learning method that groups data in such a way that data points in the same group are more similar to each other than to those from other groups
Similarity is usually defined using a distance measure (e.g., Euclidean, cosine, Jaccard); a short example follows below
The goal is usually to discover the underlying structure within the data (often high dimensional)
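As a quick illustration of those distance measures, here is a minimal sketch using SciPy's distance functions; the two vectors are made-up values chosen purely for illustration.

```python
import numpy as np
from scipy.spatial.distance import euclidean, cosine, jaccard

# Two illustrative feature vectors (made-up values)
a = np.array([1.0, 0.0, 2.0, 1.0])
b = np.array([0.0, 1.0, 2.0, 1.0])

print("Euclidean:", euclidean(a, b))  # straight-line distance
print("Cosine:   ", cosine(a, b))     # 1 - cosine similarity of the two vectors

# Jaccard distance is defined on boolean vectors (set membership)
a_bool = np.array([True, False, True, True])
b_bool = np.array([False, True, True, True])
print("Jaccard:  ", jaccard(a_bool, b_bool))
```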
The most common clustering algorithm is K-means, where we define K (the number of clusters) and the algorithm iteratively finds the cluster each data point belongs to
Input:
  K (the number of clusters)
  Training set {x(1), x(2), ..., x(m)}
Algorithm:
  Randomly initialize K cluster centroids μ1, μ2, ..., μK
  Repeat {
    for i = 1 to m
      c(i) := index (from 1 to K) of the cluster centroid closest to x(i)
    for k = 1 to K
      μk := average (mean) of the points assigned to cluster k
  }
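A minimal NumPy sketch of this loop, assuming the training set is an (m, n) array X; the function name, fixed iteration count, and seed handling are illustrative choices rather than anything specified in the notes above:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: repeatedly assign points to the nearest centroid, then re-average."""
    rng = np.random.default_rng(seed)
    # Initialize the centroids as K randomly chosen training examples
    # (the recommended initialization discussed below).
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        # Cluster assignment step: c(i) = index of the centroid closest to x(i)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = np.argmin(dists, axis=1)
        # Move-centroid step: μk = mean of the points assigned to cluster k
        for k in range(K):
            if np.any(c == k):
                centroids[k] = X[c == k].mean(axis=0)
    return c, centroids
```

In practice, sklearn.cluster.KMeans does the same job with a smarter initialization (k-means++) and a convergence check.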
How we initialize K-means
And how to avoid local optima
Consider the K-means clustering algorithm
We never discussed how to initialize the centroids
There are a few ways to do it - one method is most recommended
Set the number of centroids to less than the number of examples (K < m) (if K > m we have a problem)
Randomly pick K training examples
Set μ1 up to μK to these examples' values
K-means can converge to different solutions depending on the initialization
These local optima are valid convergence points, but they are local optima, not global ones
If this is a concern
We can do multiple random initializations
See if we get the same result - getting the same result many times suggests we have found the global optimum
Algorithmically we can do this as follows:
A typical number of times to initialize K-means is 50-1000
For each of (say) 100 random initializations:
  Randomly initialize K-means
  Run K-means to convergence
  Compute the distortion (the average squared distance from each x(i) to its assigned centroid μc(i)) on the resulting cluster assignments and centroids
This ends with 100 ways of clustering the data
Pick the clustering that gave the lowest distortion (see the sketch below)
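A sketch of that procedure, reusing the kmeans function sketched earlier; the helper names and the choice of 100 runs are just the example values from above:

```python
import numpy as np

def distortion(X, c, centroids):
    """Average squared distance from each point to its assigned centroid."""
    return np.mean(np.sum((X - centroids[c]) ** 2, axis=1))

def kmeans_best_of(X, K, n_inits=100):
    """Run K-means from many random initializations, keep the lowest-distortion result."""
    best_J, best_c, best_centroids = np.inf, None, None
    for seed in range(n_inits):
        c, centroids = kmeans(X, K, seed=seed)  # kmeans() as sketched above
        J = distortion(X, c, centroids)
        if J < best_J:
            best_J, best_c, best_centroids = J, c, centroids
    return best_c, best_centroids
```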
If you're running K-means with 2-10 clusters, multiple random initializations can help find a better optimum
If K is larger than 10, then multiple random initializations are less likely to be necessary
The first solution is probably good enough (with many clusters the clustering is already fine-grained, so restarts make less difference)
The technique to determine K, the number of clusters, is called the elbow method. With a bit of imagination, you can see an elbow in the chart below.
We’ll plot:
values for K on the horizontal axis
the distortion on the vertical axis (the value calculated with the cost function). This results in:
When K increases, the centroids move closer to the points in their clusters, so the distortion keeps decreasing. The improvement declines, at some point sharply, creating the elbow shape. That point is the optimal value for K. In the image above, K=3.
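A minimal sketch of producing such a chart with scikit-learn and matplotlib, assuming the data live in an array X; note that scikit-learn's inertia_ is the summed (not averaged) squared distance, which changes the scale but not the location of the elbow:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def plot_elbow(X, k_values=range(1, 11)):
    """Fit K-means for each K and plot distortion against K to look for the elbow."""
    distortions = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        distortions.append(km.inertia_)  # sum of squared distances to the closest centroid
    plt.plot(list(k_values), distortions, marker="o")
    plt.xlabel("K (number of clusters)")
    plt.ylabel("Distortion")
    plt.show()
```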
Using K-means for market segmentation
Running K-means for a later/downstream purpose
See how well different numbers of clusters serve your later needs
e.g.
T-shirt size example
If you have three sizes (S,M,L)
Or five sizes (XS, S, M, L, XL)
Run K-means with K = 3 and with K = 5
How does each clustering look?
This gives a way to choose the number of clusters
Could consider the cost of making extra sizes vs. how well distributed the products are
How important are those sizes though? (e.g. more sizes might make the customers happier)
So the applied problem may help guide the choice of the number of clusters
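A brief sketch of that comparison; the height/weight data here are randomly generated stand-ins purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer measurements: columns are height (cm) and weight (kg)
rng = np.random.default_rng(0)
X = rng.normal(loc=[170, 70], scale=[10, 12], size=(500, 2))

# Candidate sizing schemes: 3 sizes (S, M, L) vs. 5 sizes (XS, S, M, L, XL)
for k in (3, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"K = {k}: centroids (candidate size specs):")
    print(np.round(km.cluster_centers_, 1))
    print("customers per cluster:", np.bincount(km.labels_))
```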
DBSCAN is a density-based clustering algorithm similar to mean-shift, but with a couple of notable advantages. Check out another fancy graphic below and let's get started!
DBSCAN begins with an arbitrary starting data point that has not been visited. The neighborhood of this point is extracted using a distance epsilon ε (All points which are within the ε distance are neighborhood points).
If there are a sufficient number of points (according to minPoints) within this neighborhood, then the clustering process starts and the current data point becomes the first point in the new cluster. Otherwise, the point is labeled as noise (later this noisy point might still become part of a cluster). In both cases the point is marked as “visited”.
For this first point in the new cluster, the points within its ε distance neighborhood also become part of the same cluster. This procedure of making all points in the ε neighborhood belong to the same cluster is then repeated for all of the new points that have been just added to the cluster group.
This process of steps 2 and 3 is repeated until all points in the cluster are determined, i.e., all points within the ε neighborhood of the cluster have been visited and labeled.
Once we’re done with the current cluster, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or noise. This process repeats until all points are marked as visited. Since at the end of this all points have been visited, each point will have been marked as either belonging to a cluster or being noise.
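A minimal scikit-learn sketch of running DBSCAN, where eps corresponds to ε and min_samples to minPoints; the dataset and the parameter values are arbitrary examples, not recommendations:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters plus a little noise: a shape K-means handles poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)  # eps = ε neighborhood radius, min_samples = minPoints
labels = db.labels_                         # label -1 marks points classified as noise

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("noise points:  ", int(np.sum(labels == -1)))
```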
DBSCAN poses some great advantages over other clustering algorithms. Firstly, it does not require a pre-set number of clusters at all. It also identifies outliers as noise, unlike mean-shift, which simply throws them into a cluster even if the data point is very different. Additionally, it can find arbitrarily sized and arbitrarily shaped clusters quite well.
The main drawback of DBSCAN is that it doesn’t perform as well as others when the clusters are of varying density. This is because the setting of the distance threshold ε and minPoints for identifying the neighborhood points will vary from cluster to cluster when the density varies. This drawback also occurs with very high-dimensional data since again the distance threshold ε becomes challenging to estimate.
Keep in mind the risk of converging to a local optimum discussed above; for K-means, multiple random initializations remain the standard remedy.