Data Matching

Data matching is the identification and aggregation of data records that correspond to the same real-world entity. Data matching problems often arise when aggregating datasets from different sources, but the field encompasses several tasks with quite different data contexts. The GraphLab Create data matching toolkit provides four tools to help you quickly accomplish the most common data matching tasks.

Record linking is the most straightforward data matching task: linking structured query records to a fixed reference set, also in tabular form.
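As a rough illustration of the task (this is not the GraphLab Create API), the sketch below links one query record to a small reference table using Python's standard `difflib`. The helper names `record_text` and `link_record`, the sample records, and the use of character-level string similarity are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def record_text(record, fields):
    # Hypothetical helper: flatten the chosen fields into one lowercase string.
    return " ".join(str(record.get(f, "")).lower() for f in fields)


def link_record(query, reference, fields):
    """Return (index, score) of the reference row most similar to the query."""
    query_text = record_text(query, fields)
    best_index, best_score = None, -1.0
    for i, ref in enumerate(reference):
        score = SequenceMatcher(None, query_text, record_text(ref, fields)).ratio()
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score


# Illustrative data only: a fixed reference set and one query record.
reference = [
    {"name": "Acme Hardware", "city": "Seattle"},
    {"name": "Blue Bottle Coffee", "city": "Oakland"},
]
query = {"name": "ACME Hardware Inc", "city": "Seattle"}

match_index, match_score = link_record(query, reference, ["name", "city"])
```

The reference set never changes here; only query records are scored against it, which is what separates record linking from the other tasks.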

Deduplication also works with structured datasets, but differs from record linking in that there is no fixed reference dataset. Instead, all records from the input datasets are matched to each other, with duplicate records given the same entity label (akin to a cluster label). Deduplication examples include combining records about customers who sign up for a service multiple times, or aggregating location information about businesses obtained from multiple listing services.
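The idea of entity labels can be sketched in a few lines of plain Python. The example below is an assumption-laden stand-in, not the toolkit's method: it compares every pair of records with `difflib`, merges pairs whose similarity passes a hypothetical `threshold`, and reports one label per record. The function name `deduplicate` and the sample customer strings are invented for illustration.

```python
from difflib import SequenceMatcher


def deduplicate(records, threshold=0.8):
    """Assign an entity label to each record; similar records share a label."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the representative, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    normalized = [r.lower().strip() for r in records]
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            ratio = SequenceMatcher(None, normalized[i], normalized[j]).ratio()
            if ratio >= threshold:
                # Merge the two groups so both records resolve to one label.
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(records))]


# Illustrative data only: the first two strings describe the same customer.
customers = [
    "Jane Doe, 12 Main St",
    "jane doe, 12 main street",
    "Raj Patel, 4 Elm Ave",
]
labels = deduplicate(customers)
```

Comparing all pairs is quadratic in the number of records, so this approach is only practical for small inputs.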

Autotagging involves matching documents to a fixed set of tags, listed and described in a tabular dataset. Examples of this include finding product names in unstructured customer reviews or blog posts, and matching unstructured merchant product offers to a product catalog.
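A minimal sketch of the matching step, assuming a very simple rule: a tag applies when a large enough fraction of its tokens appears in the document. The names `tokenize`, `autotag`, and `min_overlap`, along with the sample tags and review text, are hypothetical and do not reflect how the toolkit scores matches.

```python
import re


def tokenize(text):
    # Lowercase alphanumeric tokens, returned as a set for overlap tests.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def autotag(document, tags, min_overlap=1.0):
    """Return (tag, overlap) pairs for tags sufficiently covered by the document."""
    doc_tokens = tokenize(document)
    matches = []
    for tag in tags:
        tag_tokens = tokenize(tag)
        if not tag_tokens:
            continue
        # Fraction of the tag's tokens that occur somewhere in the document.
        overlap = len(tag_tokens & doc_tokens) / len(tag_tokens)
        if overlap >= min_overlap:
            matches.append((tag, overlap))
    return sorted(matches, key=lambda m: m[1], reverse=True)


# Illustrative data only: a tiny tag list and one unstructured review.
tags = ["Canon EOS 70D", "Nikon D3300"]
review = "I upgraded to the Canon EOS 70D last month and love it."
tagged = autotag(review, tags)
```

Lowering `min_overlap` admits partial matches, at the cost of more false positives.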

Similarity search takes high-level data objects such as images, documents, or even combinations of the two, and finds similar items in a reference set. Typically, implementing a system to accomplish this task requires substantial domain knowledge to convert the raw data objects into numeric vectors, followed by a nearest-neighbor search.
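The two stages described above, vectorization followed by a nearest-neighbor search, can be sketched for short text documents with the standard library alone. This is an illustrative assumption, not the toolkit's pipeline: word counts stand in for learned features, and a brute-force cosine-similarity scan stands in for an indexed search. The names `vectorize`, `cosine`, and `nearest`, and the sample product descriptions, are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    # Stage 1: turn raw text into a sparse numeric vector of word counts.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def nearest(query, reference, k=1):
    """Stage 2: return indices of the k reference items most similar to query."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(doc)), i) for i, doc in enumerate(reference)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


# Illustrative data only: a small reference set of product descriptions.
reference = [
    "red leather hiking boots",
    "stainless steel water bottle",
    "waterproof hiking jacket",
]
top = nearest("leather boots for hiking", reference, k=2)
```

For images or mixed data, only `vectorize` would need to change; the search stage operates on vectors regardless of where they came from.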