Clustering is the task of grouping data so that points in the same cluster are highly similar to each other, while points in different clusters are dissimilar. Clustering is a form of unsupervised learning because there is no target variable indicating which groups the training data belong to.
The GraphLab clustering toolkit includes two models: K-Means and DBSCAN. K-Means
finds cluster centers for a predetermined number of clusters ("K") by minimizing
the sum of squared distances from each point to the center of its assigned
cluster. Each point is assigned to the cluster whose center is closest. K-Means
is usually the faster of the two options, and can be accelerated further by
setting a parameter that makes each training iteration use only a small subset
of the data.
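
To make the assign-then-update loop concrete, here is a minimal sketch in plain Python. It is not the GraphLab implementation: the function names `squared_distance` and `kmeans`, the `num_iterations` and `seed` arguments, and the random choice of initial centers are all assumptions made for illustration, and the sketch uses the full dataset in every iteration rather than a subset.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, num_iterations=20, seed=0):
    """Illustrative K-Means (Lloyd's algorithm); all names here are hypothetical."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centers = rng.sample(points, k)
    assignments = None
    for _ in range(num_iterations):
        # Assignment step: each point goes to the cluster with the closest center.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # No point changed cluster, so the centers are stable.
        assignments = new_assignments
        # Update step: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # Leave the center of an empty cluster where it is.
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignments


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignments = kmeans(points, k=2)
print(centers)
print(assignments)
```

Each pass through the loop can only lower, or leave unchanged, the sum of squared distances from the points to their assigned centers, which is the quantity K-Means minimizes. The result can depend on the initial centers, so the sketch fixes a seed to stay reproducible.
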
DBSCAN is a widely used density-based clustering method. It creates clusters by connecting neighboring points that lie in regions of high estimated probability density. Although less computationally efficient than K-Means, DBSCAN captures more flexible cluster shapes, determines the number of clusters automatically, and detects outliers.
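
The idea of connecting dense neighborhoods can be sketched in plain Python as follows. This is an illustration of the general DBSCAN procedure, not the GraphLab implementation: the names `dbscan`, `region_query`, `radius`, and `min_neighbors` are hypothetical, and the sketch assumes that a point counts as its own neighbor and that density is estimated by counting the points within `radius`.

```python
from collections import deque

NOISE = -1  # Label for points that belong to no cluster (outliers).


def region_query(points, i, radius):
    """Indices of all points within `radius` of point i (including i itself)."""
    limit = radius ** 2
    return [
        j for j, q in enumerate(points)
        if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= limit
    ]


def dbscan(points, radius, min_neighbors):
    """Illustrative DBSCAN; returns one cluster label per point."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # Already labeled as part of a cluster or as noise.
        neighbors = region_query(points, i, radius)
        if len(neighbors) < min_neighbors:
            labels[i] = NOISE  # Low density; may still become a border point.
            continue
        # Point i is a core point: start a new cluster and grow it outward.
        labels[i] = cluster_id
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # Border point: joins but does not expand.
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, radius)
            if len(j_neighbors) >= min_neighbors:
                queue.extend(j_neighbors)  # j is also a core point.
        cluster_id += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, radius=1.5, min_neighbors=3))
# prints [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The number of clusters is never passed in. It falls out of how many separate dense regions the data contains, and the isolated point at (50, 50) is reported as an outlier rather than forced into a cluster.
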