graphlab.data_matching.autotagger.create

graphlab.data_matching.autotagger.create(dataset, tag_name=None, features=None, verbose=True)

Create an autotagger model, which can be used to quickly apply tags from a reference set of text labels to a new query set using the _AutoTagger.tag method.

Parameters:

dataset : SFrame

Reference data. This SFrame must contain at least one column. By default, only the tag_name column is used as the basis for tagging. You may optionally include additional columns with the features parameter.

tag_name : string, optional

Name of the column in dataset with the tags. This column must contain string values. If dataset contains more than one column, tag_name must be specified.

features : list[string], optional

Names of the columns with features to use as the basis for tagging. None (the default) indicates that only the column specified by the tag_name parameter should be used. Only columns of type str or list are allowed. If a column of type list is specified, all of its values must be strings or convertible to strings.

verbose : bool, optional

If True, print verbose output during model creation.
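
The rules for tag_name and features described above can be summarized in a small plain-Python sketch. The helper resolve_columns is hypothetical and is not part of GraphLab; it only mirrors the documented behavior. Whether GraphLab also adds the tag_name column to an explicitly supplied features list is not stated here, so the sketch returns that list unchanged.

```python
def resolve_columns(column_names, tag_name=None, features=None):
    """Hypothetical helper mirroring the documented parameter rules.

    This is an illustration only, not GraphLab code.
    """
    if not column_names:
        raise ValueError("dataset must contain at least one column")

    if tag_name is None:
        # tag_name may be omitted only when there is a single column.
        if len(column_names) > 1:
            raise ValueError(
                "tag_name must be specified when dataset has more than one column")
        tag_name = column_names[0]
    elif tag_name not in column_names:
        raise ValueError("tag_name is not a column of dataset")

    if features is None:
        # Default: only the tag column is used as the basis for tagging.
        features = [tag_name]

    missing = [name for name in features if name not in column_names]
    if missing:
        raise ValueError("unknown feature columns: {}".format(missing))

    # Assumption: an explicit features list is used exactly as given.
    return tag_name, list(features)


# Single-column dataset: both parameters can be left at their defaults.
print(resolve_columns(["actor"]))
# Multi-column dataset: tag_name is required, features is optional.
print(resolve_columns(["actor", "bio"], tag_name="actor", features=["actor", "bio"]))
```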

Returns:

out : model

A model for quickly tagging new query observations with entries from dataset. Currently, the only implementation is the following:

  • NearestNeighborAutoTagger

Examples

First construct a toy SFrame of actor names, which will serve as the reference set for our autotagger model.

>>> import graphlab as gl
>>> actors_sf = gl.SFrame(
        {"actor": ["Will Smith", "Tom Hanks", "Bradley Cooper",
                   "Tom Cruise", "Jude Law", "Robert Pattinson",
                   "Matt Damon", "Brad Pitt", "Johnny Depp",
                   "Leonardo DiCaprio", "Jennifer Aniston",
                   "Jessica Alba", "Emma Stone", "Cameron Diaz",
                   "Scarlett Johansson", "Mila Kunis", "Julia Roberts",
                   "Charlize Theron", "Marion Cotillard",
                   "Angelina Jolie"]})
>>> m = gl.data_matching.autotagger.create(actors_sf, tag_name="actor")

Next, we load some IMDb movie reviews into an SFrame and tag the first 10 of them using the model we created above. The score column in the output is a similarity score, indicating the strength of the match between the query data and the suggested reference tag.

>>> reviews_sf = gl.SFrame(
        "https://static.turi.com/datasets/imdb_reviews/reviews.sframe")
>>> m.tag(reviews_sf.head(10), query_name="review", verbose=False)
+-----------+-------------------------------+------------------+-----------------+
| review_id |             review            |      actor       |      score      |
+-----------+-------------------------------+------------------+-----------------+
|     0     | Story of a man who has unn... |   Cameron Diaz   | 0.0769230769231 |
|     0     | Story of a man who has unn... |  Angelina Jolie  | 0.0666666666667 |
|     0     | Story of a man who has unn... | Charlize Theron  |      0.0625     |
|     0     | Story of a man who has unn... | Robert Pattinson | 0.0588235294118 |
|     1     | Bromwell High is a cartoon... |   Jessica Alba   |      0.125      |
|     1     | Bromwell High is a cartoon... | Jennifer Aniston |       0.1       |
|     1     | Bromwell High is a cartoon... | Charlize Theron  |       0.05      |
|     1     | Bromwell High is a cartoon... | Robert Pattinson |  0.047619047619 |
|     1     | Bromwell High is a cartoon... | Marion Cotillard |  0.047619047619 |
|     2     | Airport '77 starts as a br... |  Julia Roberts   | 0.0961538461538 |
|    ...    |              ...              |       ...        |       ...       |
+-----------+-------------------------------+------------------+-----------------+

The initial results look a little noisy. To filter out obviously spurious matches, we can set the tag method’s similarity_threshold parameter.

>>> m.tag(reviews_sf.head(1000), query_name="review", verbose=False,
          similarity_threshold=.8)
+-----------+-------------------------------+------------------+----------------+
| review_id |             review            |      actor       |     score      |
+-----------+-------------------------------+------------------+----------------+
|    341    | I caught this film at a te... |  Julia Roberts   | 0.857142857143 |
|    657    | Fairly funny Jim Carrey ve... | Jennifer Aniston | 0.882352941176 |
|    668    | A very funny movie. It was... | Jennifer Aniston | 0.833333333333 |
|    673    | This film is the best film... | Jennifer Aniston |     0.9375     |
+-----------+-------------------------------+------------------+----------------+

In this second example, you’ll notice that the review_id column is much sparser. This is because all results whose score was below the specified similarity threshold (0.8) were excluded from the output, so reviews with no sufficiently strong match do not appear at all.
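
The effect of the threshold can be illustrated without GraphLab. The sketch below applies a cutoff to rows copied from the first output table above; filter_by_threshold is a hypothetical helper, and keeping scores exactly equal to the threshold is an assumption, since this page does not say whether the comparison is inclusive.

```python
# Rows copied from the first output table: (review_id, actor, score).
matches = [
    (0, "Cameron Diaz", 0.0769230769231),
    (0, "Angelina Jolie", 0.0666666666667),
    (0, "Charlize Theron", 0.0625),
    (0, "Robert Pattinson", 0.0588235294118),
    (1, "Jessica Alba", 0.125),
    (1, "Jennifer Aniston", 0.1),
    (1, "Charlize Theron", 0.05),
    (2, "Julia Roberts", 0.0961538461538),
]


def filter_by_threshold(rows, threshold):
    """Hypothetical helper: drop rows whose score is below the threshold."""
    # Assumption: a score exactly equal to the threshold is kept.
    return [row for row in rows if row[2] >= threshold]


kept = filter_by_threshold(matches, threshold=0.09)
remaining_ids = sorted({review_id for review_id, _, _ in kept})

for review_id, actor, score in kept:
    print(review_id, actor, score)
print("review ids still present:", remaining_ids)
```

With a cutoff of 0.09, review 0 drops out entirely because none of its candidate tags scores high enough, which is the same sparsity seen in the review_id column of the thresholded output.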