We will look at the fundamental concept of clustering, the main types of clustering methods, and their weaknesses. Clustering is an unsupervised learning technique that groups data points into partitions based on similarity. The ultimate goal is to find groups of similar objects.
What You'll Learn
- Types of clustering methods: Centroid-based clustering, Connectivity-based clustering, Distribution-based clustering, and Density-based clustering
Hello everyone, my name is Arham. In this video, we will look at the fundamental concept of clustering and the types of clustering methods.
Clustering is grouping data points into partitions based on similarity. If two things are similar in some ways, they often share other characteristics. Almost everything we perceive comes in the form of clusters: when we look up at the night sky, we see clusters of stars and name them after the shapes they resemble. Similarly, a cluster is a set of data points that are more similar to each other than to points in other clusters. Clustering is classified as an unsupervised learning technique, and the key difference from other machine learning techniques is that clustering does not have a response class.
After grouping observations, a human needs to visually inspect the clusters and optionally attach meaning to each one. The ultimate output is the set of clusters themselves, and this technique works only with data in numeric form. This means that any categorical variable needs to be converted to a numeric variable by binarization, popularly known as one-hot encoding. There are many methods that predict clusters by calculating similarity, and I will now introduce you to four different types of clustering methods.
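The one-hot encoding step mentioned above can be sketched with pandas. This is a minimal example on made-up data; the column names (`height`, `color`) are hypothetical:

```python
import pandas as pd

# Hypothetical dataset with one categorical column ("color").
df = pd.DataFrame({
    "height": [1.2, 3.4, 2.2],
    "color": ["red", "blue", "red"],
})

# One-hot encoding: each category becomes its own 0/1 column,
# so the data is fully numeric and ready for clustering.
encoded = pd.get_dummies(df, columns=["color"])
print(encoded.columns.tolist())  # ['height', 'color_blue', 'color_red']
```

After this transformation, every column is numeric, so distance-based clustering methods can be applied.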
The first one is centroid-based clustering. Each cluster is represented by a centroid, and clusters are derived from the distance of each data point to the centroid of its cluster. One of the most widely used centroid-based algorithms is K-Means, where K stands for the number of clusters and needs to be defined by the user. This method starts by randomly placing centroids and iterates until the sum of distances from each point to its cluster center is minimized. Because it minimizes the aggregate intra-cluster distance starting from a random placement, different runs can result in different clusters.
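A minimal sketch of K-Means using scikit-learn on made-up data (the two synthetic groups and all parameter values are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic groups of 2-D points (hypothetical data).
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal([0, 0], 0.5, (50, 2)),
    rng.normal([5, 5], 0.5, (50, 2)),
])

# K (n_clusters) must be supplied by the user; here we know there are 2.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.cluster_centers_)  # two centroids, near (0, 0) and (5, 5)
print(kmeans.labels_[:5])       # cluster index assigned to each point
```

Fixing `random_state` makes a single run reproducible; without it, the random initial centroid placement can lead to different final clusters from run to run.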
The second one is connectivity-based clustering, also known as hierarchical clustering. Clusters are defined by grouping nearest neighbors based on the distance between data points; the idea is that nearby data points are more related to each other than to points farther away. The key aspect is that a cluster can contain other clusters, so the clusters form a hierarchy. This method works in two ways: it either starts from the smallest clusters and at each step merges the two most similar clusters into a bigger one in a bottom-up manner, or it starts from one big cluster and at each step divides a cluster in two in a top-down manner. The clusters are represented by a dendrogram, which explicitly shows the hierarchy of clusters.
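The bottom-up variant can be sketched with scipy's hierarchical clustering routines; the data and the choice of Ward linkage are assumptions for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two hypothetical groups of 2-D points.
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal([0, 0], 0.3, (20, 2)),
    rng.normal([4, 4], 0.3, (20, 2)),
])

# Bottom-up (agglomerative) clustering: repeatedly merge the two
# closest clusters. `merges` encodes the full hierarchy, which
# scipy.cluster.hierarchy.dendrogram can draw.
merges = linkage(points, method="ward")

# Cut the hierarchy to obtain a flat partition with 2 clusters.
labels = fcluster(merges, t=2, criterion="maxclust")
print(labels)  # each point's cluster id (1 or 2)
```

Unlike K-Means, the full hierarchy is computed once; choosing a different number of clusters only changes where the dendrogram is cut.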
The third one is distribution-based clustering. In this method, each cluster is assumed to follow a distribution, typically a normal distribution, and data points are grouped by the probability of belonging to the same distribution. It is similar to centroid-based clustering, except that distribution-based clustering uses probability to compute the clusters rather than just the mean. The user needs to define the number of clusters. This method goes through an iterative process of optimizing the clusters; a popular example is the expectation-maximization (EM) algorithm, which uses normal distributions to cluster the data points.
The fourth one is density-based clustering. Clusters here are defined by areas of concentrated density. This method begins by searching for areas of dense data points and assigns those areas to the same clusters. It is based on connecting points that lie within a certain distance of each other: a cluster contains all linked data points within the distance threshold, while sparse areas are treated as noise or as borders between clusters. I will now go through some clustering weaknesses. In most clustering methods, we need to supply the number of clusters; to estimate it, we can use an approximation technique called the elbow method.
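Both ideas can be sketched with scikit-learn: DBSCAN as the density-based method, and a K-Means inertia sweep for the elbow method. The data, `eps`, and `min_samples` values are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Two dense areas plus a few isolated points (hypothetical data).
rng = np.random.default_rng(3)
dense_a = rng.normal([0, 0], 0.2, (40, 2))
dense_b = rng.normal([3, 3], 0.2, (40, 2))
noise = np.array([[-2.0, -2.0], [5.0, -2.0], [-2.0, 5.0], [5.0, 5.0]])
points = np.vstack([dense_a, dense_b, noise])

# DBSCAN links points that lie within eps of each other; points in
# sparse areas that cannot be linked to a dense area are labeled -1.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(points)
print(sorted(set(labels.tolist())))  # [-1, 0, 1]: noise plus two clusters

# Elbow method: compute K-Means inertia for several values of K and
# look for the "elbow" where adding clusters stops helping much.
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(points).inertia_
    for k in range(1, 6)
]
print(inertias)  # drops sharply from K=1 to K=2, then flattens
```

Note that DBSCAN discovers the number of clusters on its own from the density structure, which is why it does not share the weakness of needing K up front.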
Lastly, remember that clustering algorithms are sensitive to outliers. When you search for something on Google or go to Amazon to buy something, you are presented with links or products relevant to your search by means of clustering. All of the methods we looked at today boil down to the basic idea that we want to find groups of similar objects. If you have any other topics you'd like us to cover, leave a comment down below.
Give us a like if you found this useful, and if you want to see more, check out other videos at online.datasciencedojo.com. Thanks for watching!
Arham Akheel - Arham holds a Master's degree in Technology Management from Texas A&M University and has a background in managing information systems.