graphlab.kmeans.create

graphlab.kmeans.create(dataset, num_clusters=None, features=None, label=None, initial_centers=None, max_iterations=10, batch_size=None, verbose=True)
Create a k-means clustering model. The KmeansModel object contains the computed cluster centers and the cluster assignment for each instance in the input ‘dataset’.
Given a number of clusters, k-means alternates between two steps: it assigns each point to its nearest cluster center, then recomputes each center as the mean of the points assigned to it. If no points change cluster membership between iterations, the algorithm terminates.
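The iteration described above can be sketched in plain Python. This is an illustrative version of the standard assign-then-update loop, not GraphLab's implementation; the function names (lloyd_kmeans, assign, update) are hypothetical, and the handling of empty clusters (leaving the center where it is) is an assumption.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def update(points, labels, centers):
    """Move each center to the mean of the points assigned to it.

    Assumption: a center with no assigned points is left unchanged.
    """
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centers.append(
                [sum(m[d] for m in members) / len(members) for d in range(len(old))]
            )
        else:
            new_centers.append(list(old))
    return new_centers


def lloyd_kmeans(points, initial_centers, max_iterations=10):
    """Run at most max_iterations update/assign rounds.

    Stops early once no point changes cluster membership.
    """
    centers = [list(c) for c in initial_centers]
    labels = assign(points, centers)
    for _ in range(max_iterations):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:
            break  # converged: assignments are stable
        labels = new_labels
    return centers, labels
```

With max_iterations=0 the sketch returns the initial centers and the assignments to them, which mirrors the behavior documented for the max_iterations parameter.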
Parameters:
dataset : SFrame
Each row in the SFrame is an observation.
num_clusters : int
Number of clusters. This is the ‘k’ in k-means.
features : list[str], optional
Names of feature columns to use in computing distances between observations and cluster centers. ‘None’ (the default) indicates that all columns should be used as features. Columns may be of the following types:
 Numeric: values of numeric type integer or float.
 Array: list of numeric (int or float) values. Each list element is treated as a distinct feature in the model.
 Dict: dictionary of keys mapped to numeric values. Each unique key is treated as a distinct feature in the model.
Note that columns of type list are not supported. Convert them to array columns if all entries in the list are of numeric types.
label : str, optional
Name of the column to use as row labels in the k-means output. The values in this column must be integers or strings. If not specified, row numbers are used by default.
initial_centers : SFrame, optional
Initial centers to use when starting the k-means algorithm. If specified, this parameter overrides the num_clusters parameter. The ‘initial_centers’ SFrame must contain the same features used in the input ‘dataset’.
If not specified (the default), initial centers are chosen with the k-means++ algorithm.
max_iterations : int, optional
The maximum number of iterations to run. Prints a warning if the algorithm does not converge after max_iterations iterations. If set to 0, the model returns clusters defined by the initial centers and assignments to those centers.
batch_size : int, optional
Number of randomly chosen data points to use in each iteration. If ‘None’ (the default) or greater than the number of rows in ‘dataset’, then this parameter is ignored: all rows of ‘dataset’ are used in each iteration and model training terminates once point assignments stop changing or max_iterations is reached.
verbose : bool, optional
If True, print model training progress to the screen.
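When initial_centers is not given, the centers are seeded with k-means++, as noted for that parameter. A minimal sketch of the seeding idea from Arthur and Vassilvitskii (2007) follows: the first center is drawn uniformly, and each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far. The function name kmeans_pp_centers and its seed argument are hypothetical, and this is not necessarily how GraphLab implements the step.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_centers(points, num_clusters, seed=None):
    """Pick num_clusters starting centers from points with k-means++ seeding."""
    rng = random.Random(seed)
    # First center: uniform random choice.
    centers = [list(rng.choice(points))]
    # Squared distance from each point to its nearest chosen center so far.
    nearest = [squared_distance(p, centers[0]) for p in points]
    while len(centers) < num_clusters:
        if sum(nearest) == 0:
            # Assumption: if every point coincides with a chosen center,
            # fall back to a uniform draw.
            chosen = rng.choice(points)
        else:
            # Draw proportionally to squared distance, so far-away points
            # are more likely to become centers.
            chosen = rng.choices(points, weights=nearest, k=1)[0]
        centers.append(list(chosen))
        nearest = [
            min(d, squared_distance(p, chosen)) for d, p in zip(nearest, points)
        ]
    return centers
```

Points that already coincide with a chosen center have weight zero, so duplicates are picked only when no other option remains.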
Returns:
out : KmeansModel
A Model object containing a cluster id for each observation in ‘dataset’, and the centers of the clusters.
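The batch_size parameter described above corresponds to mini-batch training. The following sketch follows the update scheme in Sculley (2010), listed in the references: each iteration samples a batch, caches the nearest center for every batch point, then nudges each center toward its points with a per-center learning rate of 1/count. Whether GraphLab uses exactly this update is an assumption, and the name minibatch_kmeans is hypothetical. Unlike the documented behavior, this sketch simply clamps an oversized batch_size to the number of rows rather than switching to full-batch training.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def minibatch_kmeans(points, initial_centers, batch_size,
                     max_iterations=10, seed=None):
    """Mini-batch k-means with per-center learning rates (illustrative)."""
    rng = random.Random(seed)
    centers = [list(c) for c in initial_centers]
    counts = [0] * len(centers)  # points seen so far by each center
    for _ in range(max_iterations):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Cache the nearest center of every batch point before moving any center.
        nearest = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in batch
        ]
        for p, j in zip(batch, nearest):
            counts[j] += 1
            rate = 1.0 / counts[j]
            # Convex combination: the center becomes the running mean
            # of all points assigned to it so far.
            centers[j] = [(1.0 - rate) * c + rate * x for c, x in zip(centers[j], p)]
    return centers
```

Because the learning rate shrinks as a center accumulates points, early batches move the centers a lot and later batches only refine them.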
Notes
 Integer features in the ‘dataset’ or ‘initial_centers’ inputs are converted internally to float type, and the corresponding features in the output centers are float-typed.
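 Because k-means relies on distances between observations and centers, features on larger scales dominate the result. It can therefore be important to standardize the features so they have the same scale. This function does not standardize automatically.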
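Since standardization is left to the caller, one common choice is to rescale each feature to zero mean and unit standard deviation before clustering. The sketch below does this for plain lists of numeric rows; the helper name standardize is hypothetical, and applying the same transformation to SFrame columns is left to the reader.

```python
import math


def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    n = len(rows)
    dim = len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((r[d] - means[d]) ** 2 for r in rows) / n)
        for d in range(dim)
    ]
    # Constant columns have zero spread; divide by 1 so they map to 0.
    scales = [s if s > 0 else 1.0 for s in stds]
    return [[(r[d] - means[d]) / scales[d] for d in range(dim)] for r in rows]
```

After this step, a feature measured in the hundreds no longer outweighs one measured in single digits when distances are computed.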
References
 Wikipedia - k-means clustering
 Arthur, D. and Vassilvitskii, S. (2007) k-means++: The Advantages of Careful Seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. pp. 1027-1035.
 Elkan, C. (2003) Using the triangle inequality to accelerate k-means. In Proceedings of the Twentieth International Conference on Machine Learning, Volume 3, pp. 147-153.
 Sculley, D. (2010) Web-Scale K-Means Clustering. In Proceedings of the 19th International Conference on World Wide Web. pp. 1177-1178.
Examples
>>> import graphlab
>>> sf = graphlab.SFrame({
...     'x1': [0.6777, 9.391, 7.0385, 2.2657, 7.7864, 10.16, 8.162,
...            8.8817, 9.525, 9.153, 2.0860, 7.6619, 6.5511, 2.7020],
...     'x2': [5.6110, 8.5139, 5.3913, 5.4743, 8.3606, 7.8843, 2.7305,
...            5.1679, 6.7231, 3.7051, 1.7682, 7.4608, 3.1270, 6.5624]})
>>> model = graphlab.kmeans.create(sf, num_clusters=3)