How to validate clusters
How to determine the number of clusters for your problem statement.
Clustering is an unsupervised technique, so there are no labels to tell us how many groups exist; for an algorithm such as K-means we have to set the number of clusters ourselves.
The goal of clustering is to maximize the similarity of the data points within a cluster and to maximize the dissimilarity between clusters.
Therefore, we need to be careful when deciding the number of clusters.
The outcome of an unsupervised technique is only useful if it solves your business problem.
The output we get is something that we define ourselves.
For our example, we use K-means clustering.
To evaluate our clusters, we need to focus on the points below:
- What is K-means Clustering
K-means clustering partitions the data into the number of clusters we specify; that number is our value ‘K’.
Question: How does K-means clustering know which data point belongs to which cluster?
Answer: It uses Euclidean distance. The algorithm measures the distance from each data point to every centroid and assigns the point to the cluster whose centroid is nearest.
As we can see in the diagram, this is how data points are assigned to clusters in order to maximize similarity within each cluster and dissimilarity between clusters.
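The assignment step described above can be sketched in a few lines of Python. This is a minimal illustration, not a full K-means implementation: the function name `assign_to_clusters`, the sample points, and the two centroids are all made up for the example.

```python
import math

def assign_to_clusters(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        # Euclidean distance from this point to every centroid
        distances = [math.dist(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

# Hypothetical data: two points near (1, 1.5) and two near (8.5, 9)
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(1.0, 1.5), (8.5, 9.0)]

labels = assign_to_clusters(points, centroids)
print(labels)  # [0, 0, 1, 1]
```

In a full K-means run this step alternates with recomputing each centroid as the mean of its assigned points, until the assignments stop changing.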
Let's see how to determine the value of K, i.e. our number of clusters.
WCSS (Within-Cluster Sum of Squares): It measures the sum of squared distances between each point in a cluster and that cluster's centroid, added up over all clusters. It is similar to our SSE (Sum of Squared Errors).
It is a measure developed within the ANOVA framework.
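As a rough illustration of that definition, the sketch below computes WCSS for a tiny hand-made dataset. The function name `wcss`, the points, the labels, and the centroids are assumptions chosen so the arithmetic is easy to check by hand.

```python
def wcss(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its own centroid."""
    total = 0.0
    for p, label in zip(points, labels):
        c = centroids[label]
        total += sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return total

# Hypothetical data: two clusters, each point exactly 1 unit from its centroid
points = [(1.0, 1.0), (1.0, 3.0), (8.0, 8.0), (10.0, 8.0)]
labels = [0, 0, 1, 1]
centroids = [(1.0, 2.0), (9.0, 8.0)]

score = wcss(points, labels, centroids)
print(score)  # 4.0
```

Each of the four points contributes a squared distance of 1, so the total is 4.0. Tighter clusters give a smaller value.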
Our goal here, however, is not to minimize the WCSS to 0, because that only happens when every data point becomes its own cluster. Instead, we want the WCSS to be as low as possible while still keeping the number of clusters small.
So the optimal number of clusters in the above diagram would be 4, i.e. K=4.
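To see how WCSS behaves as K grows, the sketch below runs a basic K-means for several values of K and records the WCSS of each. It is only an illustration under stated assumptions: the data is synthetic (random points drawn around four made-up centers), `kmeans_wcss` is a hypothetical helper written for this example, and keeping the best of ten random starts is a simple choice, not a prescribed method.

```python
import math
import random

def kmeans_wcss(points, k, seed, iterations=100):
    """Run a basic K-means once from a random start and return its WCSS."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(axis) / len(members)
                                     for axis in zip(*members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged: centroids no longer move
        centroids = updated
    return sum(math.dist(p, centroids[lab]) ** 2
               for p, lab in zip(points, labels))

# Synthetic data (an assumption for this example): 25 points around each of 4 centers
rng = random.Random(42)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
points = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
          for cx, cy in centers for _ in range(25)]

wcss_by_k = {}
for k in range(1, 7):
    # Keep the lowest WCSS over several random starts
    wcss_by_k[k] = min(kmeans_wcss(points, k, seed) for seed in range(10))

for k, value in wcss_by_k.items():
    print(k, round(value, 1))
```

Plotting these values against K gives the kind of curve used to pick K: the WCSS falls steeply at first, and the value of K after which further clusters bring only small improvements is the one to choose.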
- Market Segmentation
- Image segmentation
- Object detection