distances

The GraphLab Create distances module provides access to the standard distance functions and utilities for working with composite distances. Distance functions are used in all toolkits based on a nearest neighbors search, including nearest_neighbors itself, nearest_neighbor_classifier, and nearest_neighbor_deduplication.

Warning

The ‘dot_product’ distance is deprecated and will be removed in future versions of GraphLab Create. Please use the ‘transformed_dot_product’ distance instead, although note that this is more than a name change; it is a different transformation of the dot product of two vectors. Please see the distances module documentation for more details.

Standard distance functions measure the dissimilarity between two data points consisting of only a single type.

  • euclidean, squared_euclidean, manhattan, cosine, and transformed_dot_product distances work for integer and floating point data, which can be thought of as vectors.
  • These distances, as well as the jaccard and weighted_jaccard distances, work on data contained in dictionaries.
  • The levenshtein distance works for string data, although another strategy that often works well is to turn strings into dictionaries with the graphlab.text_analytics.count_ngrams() function and then use Jaccard or weighted Jaccard distance.
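The string-to-dictionary strategy in the last item can be sketched in plain Python. This is an illustration under the textbook definitions of Jaccard and weighted Jaccard distance, not GraphLab Create's implementation; char_ngrams, jaccard_distance, and weighted_jaccard_distance are hypothetical helpers, and char_ngrams only stands in for count_ngrams().

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Count overlapping character n-grams in a string (hypothetical helper)."""
    return dict(Counter(text[i:i + n] for i in range(len(text) - n + 1)))


def jaccard_distance(x, y):
    """1 - |shared keys| / |all keys|; the counts themselves are ignored."""
    keys_x, keys_y = set(x), set(y)
    union = keys_x | keys_y
    if not union:
        return 0.0
    return 1.0 - len(keys_x & keys_y) / float(len(union))


def weighted_jaccard_distance(x, y):
    """1 - sum of per-key minima / sum of per-key maxima (missing key = 0)."""
    keys = set(x) | set(y)
    smaller = sum(min(x.get(k, 0), y.get(k, 0)) for k in keys)
    larger = sum(max(x.get(k, 0), y.get(k, 0)) for k in keys)
    if larger == 0:
        return 0.0
    return 1.0 - smaller / float(larger)


a = char_ngrams("giraffe")    # 6 distinct bigrams
b = char_ngrams("giraffes")   # the same 6 plus "es"
d_jaccard = jaccard_distance(a, b)
d_weighted = weighted_jaccard_distance(a, b)
```

Here the two strings share six of seven distinct bigrams and every bigram occurs once, so both distances come to 1/7.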

These functions may be passed to a model by specifying either the name or the handle of the function in this module. For example, suppose we have the following SFrame of data:

>>> sf = graphlab.SFrame({'X1': [0.98, 0.62, 0.11, 1.4, 0.88],
...                       'X2': [0.69, 0.58, 0.36, 1.23, 0.2],
...                       'species': ['cat', 'dog', 'elephant', 'fossa', 'giraffe']})

To find the nearest neighbors of each row, we create a nearest neighbors model, and we have to indicate how we want to measure the distance between any pair of rows. Suppose we only want to use the numeric features ‘X1’ and ‘X2’; then we can use any of the standard numeric distances.

>>> m = graphlab.nearest_neighbors.create(sf, features=['X1', 'X2'],
...                                       distance='euclidean')
...
>>> m2 = graphlab.nearest_neighbors.create(sf, features=['X1', 'X2'],
...                                        distance=graphlab.distances.euclidean)

Composite distances provide greater flexibility because they allow distances on features that have different types. A composite distance is simply a weighted sum of standard distance functions, each of which is applied to a particular subset of features. To represent this in code, we use a Python list. Each member of a composite distance list contains three things:

  1. a list or tuple of feature names
  2. the name of a standard distance function
  3. a weight

The weight is a single scalar value (integer or float) that multiplies the contribution of each component of the distance.
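Written out in general form, the definition above is a weighted sum over the members of the list. The symbols \(K\) (number of members), \(F_k\) (feature names of member \(k\)), \(d_k\) (its standard distance function), and \(w_k\) (its weight) are notation introduced here for illustration:

\[D(a, b) = \sum_{k=1}^{K} w_k * d_k(a[F_k], b[F_k])\]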

For a concrete example, suppose we want to measure the distance between two rows \(a\) and \(b\) in the SFrame above using a combination of Euclidean distance on the numeric features and Levenshtein distance on the species name. To increase the relative contribution of the numeric features we can up-weight the Euclidean distance by a factor of 2, and down-weight the Levenshtein distance by a factor of 0.3. Our composite distance is

\[D(a, b) = 2 * d_{euclidean}(a[X1, X2], b[X1, X2]) + 0.3 * d_{levenshtein}(a[species], b[species])\]

This is represented in Python code as:

>>> dist_spec = [[('X1', 'X2'), 'euclidean', 2],
...              [('species',), 'levenshtein', 0.3]]

Composite distances can be used with models that support them as a drop-in replacement for the standard distance name or function.

When a composite distance is used, we no longer need to specify the features, because the composite distance already contains that information. Models that use composite distances store the specification so it can be retrieved, modified, and reused. For example, suppose we decided that the Levenshtein distance on species name should have a higher weight. We don’t have to construct a composite distance from scratch; we can modify the one we used previously.

>>> m3 = graphlab.nearest_neighbors.create(sf, distance=dist_spec)
...
>>> dist_spec2 = m3['composite_params']
>>> dist_spec2[1][2] = 0.7
>>> m4 = graphlab.nearest_neighbors.create(sf, distance=dist_spec2)
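A composite distance is an ordinary nested Python list, so assigning to an element changes that list in place. Whether the list returned by a model is independent of the one passed in is not stated here; if the original specification must stay unchanged, one option is to work on a deep copy. A minimal sketch using only the standard library:

```python
import copy

dist_spec = [[('X1', 'X2'), 'euclidean', 2],
             [('species',), 'levenshtein', 0.3]]

# deepcopy also duplicates the inner lists, so the edit below
# cannot leak back into dist_spec.
dist_spec2 = copy.deepcopy(dist_spec)
dist_spec2[1][2] = 0.7

original_weight = dist_spec[1][2]   # still 0.3
modified_weight = dist_spec2[1][2]  # now 0.7
```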

Specifying a composite distance can be tricky. Often we have a general sense of which features and standard distances to use, but only a vague idea of how much each component should be weighted. The compute_composite_distance function can help with this by evaluating a composite distance on two specific data points.

>>> d1 = graphlab.distances.compute_composite_distance(dist_spec, sf[0], sf[1])
>>> d2 = graphlab.distances.compute_composite_distance(dist_spec, sf[0], sf[2])
>>> print "d1:", d1, "d2:", d2
d1: 1.65286120899 d2: 3.66096749031

This tells us that under our first composite distance, the ‘cat’ and ‘dog’ data points are closer than the ‘cat’ and ‘elephant’ data points.
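The two values can be checked by hand from the weighted-sum definition. The sketch below is plain Python rather than GraphLab Create's implementation: euclidean, levenshtein, and composite_distance are hypothetical stand-ins, the rows are written as dictionaries, and the Levenshtein component is assumed to act on a single string feature.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length lists of numbers."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def levenshtein(s, t):
    """Edit distance (insert, delete, substitute) by dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Each stand-in receives the list of values selected by the feature names.
DISTANCES = {
    'euclidean': euclidean,
    'levenshtein': lambda x, y: levenshtein(x[0], y[0]),
}


def composite_distance(spec, a, b):
    """Weighted sum of standard distances over the listed feature subsets."""
    total = 0.0
    for features, name, weight in spec:
        x = [a[f] for f in features]
        y = [b[f] for f in features]
        total += weight * DISTANCES[name](x, y)
    return total


dist_spec = [[('X1', 'X2'), 'euclidean', 2],
             [('species',), 'levenshtein', 0.3]]

cat = {'X1': 0.98, 'X2': 0.69, 'species': 'cat'}
dog = {'X1': 0.62, 'X2': 0.58, 'species': 'dog'}
elephant = {'X1': 0.11, 'X2': 0.36, 'species': 'elephant'}

d1 = composite_distance(dist_spec, cat, dog)
d2 = composite_distance(dist_spec, cat, elephant)
```

The Levenshtein distances are 3 for ‘cat’ and ‘dog’ and 6 for ‘cat’ and ‘elephant’, so d1 is 2 * 0.3764... + 0.3 * 3 and d2 is 2 * 0.9304... + 0.3 * 6, which agree with the printed values of d1 and d2 to the displayed precision.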

distances

euclidean(x, y) Compute the Euclidean distance between two dictionaries or two lists of equal length.
squared_euclidean(x, y) Compute the squared Euclidean distance between two dictionaries or two lists of equal length.
manhattan(x, y) Compute the Manhattan distance between two dictionaries or two lists of equal length.
cosine(x, y) Compute the cosine distance between two dictionaries or two lists of equal length.
levenshtein(x, y) Compute the Levenshtein distance between two strings.
dot_product(x, y) Compute the dot_product between two dictionaries or two lists of equal length.
transformed_dot_product(x, y) Compute the “transformed_dot_product” distance between two dictionaries or two lists of equal length.
jaccard(x, y) Compute the Jaccard distance between two dictionaries.
weighted_jaccard(x, y) Compute the weighted Jaccard distance between two dictionaries.

utilities

compute_composite_distance(distance, x, y) Compute the value of a composite distance function on two dictionaries, typically SFrame rows.
build_address_distance([number, street, ...]) Construct a composite distance appropriate for matching address data.