Introduction to Feature Engineering

Feature engineering is an important part of designing an effective machine learning pipeline. It is best described as the process of transforming data from its raw form into something more useful to the predictive model. Well-chosen features can substantially improve a model's performance on your task.

The basic idea is to construct an object, fit it to a dataset, and transform any new data.

import graphlab as gl

# Construct a transformer
sf = gl.SFrame({'docs': ["This is a document!", "This one's also a document."]})
f = gl.feature_engineering.TFIDF(features=['docs'])

# Fit it to a dataset
f.fit(sf)

# Now the object is ready to transform new data
transformed_sf = f.transform(sf)

Feature engineering objects can be combined into pipelines and deployed on predictive services (see below for more). There is also a fit_transform helper method that combines the last two steps.
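To make the construct, fit, and transform steps concrete without depending on GraphLab Create, here is a minimal pure-Python sketch of the same pattern. The class name SimpleTFIDF and its weighting formula (term count times log(N / document frequency)) are assumptions chosen for illustration; the real TFIDF object may tokenize and weight differently.

```python
import math
from collections import Counter


class SimpleTFIDF(object):
    """Hypothetical stand-in for a fit/transform feature engineering object."""

    def __init__(self):
        self.idf = None  # learned by fit()

    def fit(self, docs):
        # Learn one inverse document frequency weight per word.
        n_docs = len(docs)
        doc_freq = Counter()
        for doc in docs:
            doc_freq.update(set(doc.lower().split()))
        self.idf = {w: math.log(float(n_docs) / df) for w, df in doc_freq.items()}
        return self

    def transform(self, docs):
        # Apply the learned weights; words unseen during fit() are dropped.
        if self.idf is None:
            raise RuntimeError("call fit() before transform()")
        result = []
        for doc in docs:
            counts = Counter(doc.lower().split())
            result.append({w: c * self.idf[w] for w, c in counts.items() if w in self.idf})
        return result

    def fit_transform(self, docs):
        # Convenience: fit on docs, then transform the same docs.
        return self.fit(docs).transform(docs)


train = ["this is a document", "this one is also a document"]
t = SimpleTFIDF()
same = t.fit_transform(train)                  # fit and transform in one call
new = t.transform(["a brand new document"])    # reuse the fitted state on new data
```

The point of the split is that fit learns state from one dataset (here, the document frequencies) and transform reuses that state on any later data, which is why the fitted object can be shipped elsewhere and applied to new rows.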

GraphLab Create (GLC) has a collection of feature engineering objects that are helpful for transforming SFrames of various types. These objects are grouped by feature type:

Numeric Features

Categorical Features

Image Features

Text Features


Transforming single columns

Many of the above transformations have a corresponding one-liner function whose input is an SArray and output is an SArray. Internally it simply runs fit_transform on the corresponding transformation.

# data is an SFrame with a text column named 'docs'
tfidf_transforms = gl.text_analytics.tf_idf(data['docs'])
bag_of_words_transforms = gl.text_analytics.count_words(data['docs'])
bag_of_ngrams_transforms = gl.text_analytics.count_ngrams(data['docs'])
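To show how a bag of words differs from a bag of n-grams, the following pure-Python sketch computes both for a list of strings. The function names count_words_sketch and count_ngrams_sketch are hypothetical, and the lowercase, whitespace-split tokenization is an assumption; the library functions may handle punctuation and casing differently.

```python
from collections import Counter


def count_words_sketch(text):
    # Bag of words: lowercase, split on whitespace, count each token.
    return dict(Counter(text.lower().split()))


def count_ngrams_sketch(text, n=2):
    # Bag of n-grams: count each run of n consecutive tokens.
    tokens = text.lower().split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return dict(Counter(grams))


docs = ["this is a document", "another doc"]
bags_of_words = [count_words_sketch(d) for d in docs]
bags_of_bigrams = [count_ngrams_sketch(d, n=2) for d in docs]
```

Each input string maps to one dictionary, mirroring the one-value-in, one-value-out shape of the SArray functions.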

Transforming multiple columns

TF-IDF is an example of a feature engineering object that performs a transformation for each feature (i.e., column name) provided in the features argument. Other one-to-one transformations include CategoricalImputer, CountThreshold, FeatureBinner, NGramCounter, NumericImputer, Tokenizer, and WordCounter. If you would prefer to add each transformed column to the SFrame (rather than replacing the original column), you can use the column_name_prefix argument to add a prefix to the names of the transformed columns.
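The replace-versus-prefix behavior can be sketched in plain Python. Here a dictionary of column lists stands in for an SFrame, and the helper apply_per_column and its prefix parameter are hypothetical names used only to illustrate the idea; the exact naming of prefixed columns in the library is not shown here.

```python
def apply_per_column(table, features, func, prefix=None):
    """Apply func to every value of each listed column.

    table is a dict mapping column name -> list of values (a hypothetical
    stand-in for an SFrame). With prefix=None the transformed column replaces
    the original; otherwise it is added under prefix + column name.
    """
    out = dict(table)  # shallow copy so the input is left untouched
    for name in features:
        target = name if prefix is None else prefix + name
        out[target] = [func(value) for value in table[name]]
    return out


def tokenize(text):
    return text.split()


table = {"docs": ["This is a document", "Another doc"], "id": [1, 2]}

replaced = apply_per_column(table, ["docs"], tokenize)
added = apply_per_column(table, ["docs"], tokenize, prefix="tokens.")
```

In replaced, the docs column holds token lists; in added, the original docs column is kept and the token lists appear under a new, prefixed column name.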

Other transformations take a set of columns and create a single column. Examples include FeatureHasher, OneHotEncoder, and QuadraticFeatures. You may change the name of the output column using the output_column_name argument.
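As a rough illustration of a many-to-one transformation, the sketch below one-hot encodes several categorical columns into a single sparse dictionary column. The function one_hot_sketch, the default column name, and the {index: 1} output layout are assumptions for illustration and are not taken from the library.

```python
def one_hot_sketch(rows, features, output_column_name="encoded_features"):
    """Collapse several categorical columns into one sparse dictionary column."""
    # "Fit": assign an integer index to every (column, value) pair seen.
    index = {}
    for row in rows:
        for name in features:
            index.setdefault((name, row[name]), len(index))
    # "Transform": replace the input columns with a single {index: 1} column.
    out = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in features}
        new_row[output_column_name] = {index[(name, row[name])]: 1 for name in features}
        out.append(new_row)
    return out, index


rows = [
    {"id": 1, "color": "red", "size": "S"},
    {"id": 2, "color": "blue", "size": "S"},
]
encoded, index = one_hot_sketch(rows, ["color", "size"], output_column_name="onehot")
```

Both input columns disappear from the result and their information is carried by the single onehot column, which is the sense in which the transformation is many-to-one.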

Finally, the RandomProjection tool does a "many-to-many" transformation. It takes a set of columns as input and returns a smaller set of (randomly projected) columns.
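A many-to-many projection can be sketched with the standard library alone. The function random_projection_sketch below is hypothetical: it assumes a dense Gaussian projection matrix scaled by 1/sqrt(output_dim), which is one common construction and not necessarily the one RandomProjection uses.

```python
import random


def random_projection_sketch(rows, output_dim, seed=0):
    """Project each row (a list of floats) onto output_dim random directions."""
    rng = random.Random(seed)
    input_dim = len(rows[0])
    scale = 1.0 / (output_dim ** 0.5)
    # One Gaussian random vector per output column, shared by every row.
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(input_dim)]
              for _ in range(output_dim)]
    return [[sum(w * x for w, x in zip(direction, row)) for direction in matrix]
            for row in rows]


rows = [[1.0, 0.0, 2.0, 3.0], [0.5, 1.5, 0.0, 1.0], [2.0, 2.0, 2.0, 2.0]]
projected = random_projection_sketch(rows, output_dim=2)  # 4 columns in, 2 out
```

Fixing the seed keeps the projection matrix identical across calls, so rows transformed at different times land in the same projected space.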

Deploying feature engineering transformations

The feature engineering toolkit also makes it easy to deploy your feature engineering models and pipelines.

Suppose we have a simple tokenizer:

import graphlab as gl
data = gl.SFrame({'docs': ["This is a document", "Another doc"]})
m = gl.feature_engineering.Tokenizer(features=['docs'])

Now suppose we have created a Predictive Service object ps. (For more on that, see the Predictive Services chapter of the user guide.) Then we can take a feature engineering model and add it as a service, apply those changes, and query the model that has been deployed as a service.

ps = gl.deploy.predictive_service.load(my_ps_url)
ps.add('my_transformation', m)
ps.apply_changes()  # Push the newly added model to the running service

d = [row for row in data]  # Create JSON-serializable version of data
ps.query('my_transformation', method='transform', data={'data': d})

The resulting JSON will have a "response" field containing the data transformed by the deployed tokenizer model.

{u'from_cache': False,
 u'model': u'chris_tmp_wordcounter',
 u'response': [{u'docs': [u'This', u'is', u'a', u'document']},
  {u'docs': [u'Another', u'doc']}],
 u'uuid': u'6bb3627b-708d-4398-9afd-b13dd170d8e3',
 u'version': 0}
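Once the reply is in hand, the transformed rows can be pulled out of the "response" field like any other dictionary. The sketch below uses a hand-written dictionary with the same shape as the reply above (only some fields are reproduced), so it runs without a deployed service; the variable names are illustrative.

```python
import json

# Same shape as the reply shown above, written as a plain dictionary.
result = {
    "from_cache": False,
    "response": [{"docs": ["This", "is", "a", "document"]},
                 {"docs": ["Another", "doc"]}],
    "version": 0,
}

# One list of tokens per input row, in the order the rows were sent.
tokens_per_doc = [row["docs"] for row in result["response"]]

# The payload only holds lists, dicts, strings, booleans, and numbers,
# so it survives a JSON round trip unchanged.
round_tripped = json.loads(json.dumps(result))
```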


Feedback about the feature engineering toolkit is very welcome. Please post questions and comments on our forum or send a note to