Clustering is one of the tasks that fall under unsupervised learning. Clustering methods help to identify clusters (groups) of points in a dataset. A cluster is generally understood to be a group of points such that the points within the cluster are closer to each other (in terms of some distance measure) than w.r.t. points outside the cluster.

Clustering methods can be divided into two main categories by the structure of their output [mmds2014]:

  • Point-assignment methods: Each point of the dataset is assigned into a cluster based on the relative distances among them (in terms of some distance measure). The method returns a cluster identifier for each point.
  • Hierarchical methods: Rather than assign points into clusters, the method orders them within a hierarchical structure. This structure can be cut at any depth to get the kind of assignment provided by point-assignment methods. A diagram of this hierarchical structure is called a dendrogram.

Another way to divide clustering methods is by how many times each point is considered [mmds2014]:

  • Iterative methods: they need to iterate over the points several times;
  • Scan methods: They only need to go through the dataset once.

This distinction makes a big difference especially in the case of very large datasets that do not fit in memory, because loading the data from the hard drive into memory over and over again can have a drastic impact on performance.

Literature

  1. [mmds2014] Rajaraman, A. and Ullman, J.D., 2011. Mining of massive datasets. Cambridge University Press. URL: <http://infolab.stanford.edu/~ullman/mmds/book.pdf>.