feature_engineering

A transformer is a stateful object that transforms input data (as an SFrame) from one form to another. Transformers are commonly used for feature engineering. In addition to the modules provided in GraphLab Create, users can write their own transformers that integrate seamlessly with the existing ones.

Each transformer has the following methods:

Method Description
__init__ Construct the object.
fit Fit the object using training data.
transform Transform training/test data using the fitted object.
fit_transform First perform fit() and then transform() on the same data.
save Save the model to a GraphLab Create archive.
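
The fit/transform contract above can be sketched in plain Python. This is an illustration only, not the GraphLab Create API: `MeanImputer` is a hypothetical class, and a list of dicts stands in for an SFrame. The point it shows is that `fit` stores state learned from training data and `transform` reuses that state on any later data.

```python
class MeanImputer:
    """Hypothetical transformer: fills None with the mean learned in fit()."""

    def __init__(self, feature):
        self.feature = feature
        self.mean_ = None  # state learned by fit()

    def fit(self, rows):
        # Learn the mean of the non-missing training values.
        values = [r[self.feature] for r in rows if r[self.feature] is not None]
        self.mean_ = sum(values) / float(len(values))
        return self

    def transform(self, rows):
        if self.mean_ is None:
            raise RuntimeError("call fit() before transform()")
        out = []
        for r in rows:
            new_row = dict(r)  # copy, so the input rows are not modified
            if new_row[self.feature] is None:
                new_row[self.feature] = self.mean_
            out.append(new_row)
        return out

    def fit_transform(self, rows):
        return self.fit(rows).transform(rows)


train = [{"x": 1.0}, {"x": None}, {"x": 3.0}]
test = [{"x": None}]

imputer = MeanImputer("x")
train_out = imputer.fit_transform(train)
test_out = imputer.transform(test)  # reuses the mean learned from train
```

Because the mean is learned only in `fit`, the test row is filled with the training mean rather than a statistic computed from the test data.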

creating transformer(s)

create Create a Transformer object to transform data for feature engineering.

numeric features

QuadraticFeatures Calculates quadratic interaction terms between features.
FeatureBinner Feature binning is a method of turning continuous variables into categorical values.
NumericImputer Impute missing values with feature means.
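
As a rough illustration of two of the numeric transformations listed above, the plain-Python sketch below computes pairwise interaction terms and assigns values to bins. The function names, the tuple keys, the `bin_N` labels, and the choice to include only products of distinct features are assumptions of this sketch; the output format of `QuadraticFeatures` and `FeatureBinner` may differ.

```python
from bisect import bisect_right
from itertools import combinations


def quadratic_features(row, features):
    """Product of every pair of distinct features in one row (a dict)."""
    return {(a, b): row[a] * row[b] for a, b in combinations(features, 2)}


def bin_value(value, breaks):
    """Map a continuous value to a categorical bin label.

    `breaks` is a sorted list of break points; a value equal to a
    break point falls into the bin to its right.
    """
    return "bin_%d" % bisect_right(breaks, value)


row = {"a": 2, "b": 3, "c": 4}
interactions = quadratic_features(row, ["a", "b", "c"])
labels = [bin_value(v, [1, 5]) for v in (0.5, 1, 7)]
```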

categorical features

OneHotEncoder Encode a collection of categorical features using a 1-of-K encoding scheme.
CountThresholder Map infrequent categorical variables to a new/separate category.
CategoricalImputer Fill missing values (None) in data sets that have categorical data.
CountFeaturizer Replaces a collection of categorical columns with counts of a target column.
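
The sketch below illustrates, in plain Python, the ideas behind 1-of-K encoding and count thresholding. All function names and the `"__other__"` label are hypothetical and chosen for this example; they are not taken from `OneHotEncoder` or `CountThresholder`.

```python
from collections import Counter


def fit_one_hot(values):
    """Learn a category -> index mapping from training values."""
    return {cat: i for i, cat in enumerate(sorted(set(values)))}


def one_hot(value, mapping):
    """Encode one value as a 1-of-K vector; unseen categories map to all zeros."""
    vec = [0] * len(mapping)
    if value in mapping:
        vec[mapping[value]] = 1
    return vec


def fit_count_threshold(values, threshold):
    """Return the set of categories seen at least `threshold` times."""
    counts = Counter(values)
    return {cat for cat, n in counts.items() if n >= threshold}


def apply_count_threshold(value, frequent, other="__other__"):
    """Keep frequent categories; map everything else to a single category."""
    return value if value in frequent else other


colors = ["red", "blue", "red", "green"]
mapping = fit_one_hot(colors)
frequent = fit_count_threshold(colors, threshold=2)
```

In both cases the mapping is learned once from training data and then applied unchanged to new values, which is what makes these operations fit the transformer interface.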

text features

BM25 Transform an SFrame into BM25 scores for a given query.
NGramCounter Transform string/dict/list columns of an SFrame into their respective bag-of-ngrams representation.
PartOfSpeechExtractor Takes SFrame columns of type string and list, and transforms them into a nested dictionary.
RareWordTrimmer Remove words that occur below a certain number of times in a given column.
TFIDF Transform an SFrame into TF-IDF scores.
SentenceSplitter Takes SFrame columns of type string or list, and transforms them into a list of strings, where each element is a single sentence.
Tokenizer Tokenizing is a method of breaking natural language text into its smallest standalone and meaningful components (in English, usually space-delimited words, but not always).
WordCounter Transform string/dict/list columns of an SFrame into their respective bag-of-words representation.
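
To make the bag-of-words, bag-of-ngrams, and TF-IDF representations above concrete, here is a minimal plain-Python sketch. It assumes whitespace tokenization and the common `tf * log(N / df)` weighting; the tokenization and exact weighting used by GraphLab Create may differ, and the function names are hypothetical.

```python
import math
from collections import Counter


def word_count(text):
    """Bag-of-words: lowercase, split on whitespace, count tokens."""
    return Counter(text.lower().split())


def ngram_count(text, n=2):
    """Bag-of-ngrams over whitespace-delimited tokens."""
    tokens = text.lower().split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)


def tf_idf(docs):
    """One dict of word -> tf * log(N / df) per document."""
    counts = [word_count(d) for d in docs]
    n_docs = len(docs)
    df = Counter()
    for c in counts:
        df.update(c.keys())  # each document counts a word at most once
    return [
        {w: tf * math.log(n_docs / float(df[w])) for w, tf in c.items()}
        for c in counts
    ]


docs = ["the cat sat", "the dog sat down"]
scores = tf_idf(docs)
```

With this weighting, a word that appears in every document (such as "the" here) gets a score of zero, while words unique to one document are weighted most heavily.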

image features

DeepFeatureExtractor Takes an input dataset, propagates each example through the network, and returns an SArray of dense feature vectors, each of which is the concatenation of all the hidden unit values at layer[layer_id].

misc.

AutoVectorizer Creates a feature transformer based on the content in the provided data that turns arbitrary content into informative features usable by any GraphLab ML algorithm.
FeatureHasher Hashes an input feature space to an n-bit feature space.
RandomProjection Project high-dimensional numeric features into a low-dimensional subspace.
TransformerChain Sequentially apply a list of transforms.
TransformerBase An abstract base class for user defined Transformers.
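
The idea of sequentially applying a list of transforms can be sketched in plain Python as follows. `Chain`, `Lowercase`, and `VocabularyIndexer` are hypothetical classes written for this illustration and are not the `TransformerChain` or `TransformerBase` API. The sketch assumes that, during fitting, each step is fitted on the output of the previous step.

```python
class Lowercase:
    """Stateless step: nothing to learn in fit()."""

    def fit(self, data):
        return self

    def transform(self, data):
        return [s.lower() for s in data]


class VocabularyIndexer:
    """Stateful step: learns a word -> integer id mapping in fit()."""

    def __init__(self):
        self.index_ = {}

    def fit(self, data):
        self.index_ = {w: i for i, w in enumerate(sorted(set(data)))}
        return self

    def transform(self, data):
        # Words not seen during fit() map to -1.
        return [self.index_.get(w, -1) for w in data]


class Chain:
    """Apply a list of steps in order."""

    def __init__(self, steps):
        self.steps = list(steps)

    def fit(self, data):
        # Fit each step on the output of the steps before it.
        for step in self.steps:
            data = step.fit(data).transform(data)
        return self

    def transform(self, data):
        for step in self.steps:
            data = step.transform(data)
        return data

    def fit_transform(self, data):
        return self.fit(data).transform(data)


chain = Chain([Lowercase(), VocabularyIndexer()])
train_ids = chain.fit_transform(["Cat", "dog", "CAT"])
test_ids = chain.transform(["DOG", "bird"])
```

Because the chain exposes the same `fit`, `transform`, and `fit_transform` methods as its steps, it can itself be used wherever a single transformer is expected.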