Linear Regression

GraphLab's linear regression module is used to predict a continuous target as a linear function of features. This is a two-stage process, analogous to many other GraphLab toolkits. First, a model is created (or trained) using training data. Once the model is created, it can then be used to make predictions on new examples that were not seen in training (the test data). Model creation, prediction, and evaluation work with data that is contained in an SFrame. The following figure illustrates how linear regression works. Notice that the functional form learned here is a linear function (unlike the previous figure, where the predicted function was non-linear).

Background

Given a set of features x_i and a label y_i (where i indexes a single example in the training data), linear regression assumes that the target is a linear combination of the features, i.e.

    y_i = θᵀ x_i + ε_i

where ε_i is random Gaussian noise with mean 0 and variance σ². An intercept term is added by appending a column of 1's to the features. Regularization is often required to prevent over-fitting by penalizing models with extreme parameter values. The linear regression module supports L1 and L2 regularization, which are added to the loss function.

The composite objective being optimized for is the following:

    min_θ  Σ_i (θᵀ x_i − y_i)² + λ₁ ‖θ‖₁ + λ₂ ‖θ‖₂²

where λ₁ is the l1_penalty and λ₂ is the l2_penalty.
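
The objective above can be evaluated directly for any candidate coefficient vector. The following is a minimal pure-Python sketch (the variable names and toy data are illustrative, not part of the GraphLab API):

```python
def objective(theta, X, y, l1_penalty=0.0, l2_penalty=0.0):
    """Regularized least-squares objective for candidate coefficients theta."""
    # Squared-error loss summed over all examples
    loss = sum((sum(t * xj for t, xj in zip(theta, x)) - yi) ** 2
               for x, yi in zip(X, y))
    # L1 term: l1_penalty * ||theta||_1;  L2 term: l2_penalty * ||theta||_2^2
    l1 = l1_penalty * sum(abs(t) for t in theta)
    l2 = l2_penalty * sum(t * t for t in theta)
    return loss + l1 + l2

X = [[1.0, 2.0], [1.0, 3.0]]   # first column is the appended intercept
y = [5.0, 7.0]
print(objective([1.0, 2.0], X, y))            # exact fit -> 0.0
print(objective([1.0, 2.0], X, y, 1.0, 1.0))  # penalties: (1+2) + (1+4) = 8.0
```

Note that the L1 term penalizes the absolute values of the coefficients while the L2 term penalizes their squares; this difference is why L1 regularization drives some coefficients exactly to zero.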

Introductory Example

In this chapter, we will use a sampled version of the data from the Yelp Dataset Challenge. The task is to predict the 'star rating' for a restaurant for a given user. The entire dataset is available on the Yelp website.

We will first split the data into a train-test split and then create a linear regression model that can predict the star rating for each review using:

  • Average rating of a given business
  • Average rating made by a user
  • Number of reviews made by a user
  • Number of reviews that concern a business
import graphlab as gl

# Load the data
data = gl.SFrame('https://static.turi.com/datasets/regression/yelp-data.csv')

# Make a train-test split
train_data, test_data = data.random_split(0.8)

# Create a model.
model = gl.linear_regression.create(train_data, target='stars',
                                    features = ['user_avg_stars',
                                                'business_avg_stars',
                                                'user_review_count',
                                                'business_review_count'])

# Save predictions (an SArray of predicted target values)
predictions = model.predict(test_data)

# Evaluate the model and save the results into a dictionary
results = model.evaluate(test_data)
Additional Features

We will now go over some more advanced options with the linear regression module. This includes regularization, evaluation options, model interpretation, and missing value handling. Note that logistic regression and support vector machines (SVM) conform to almost all of the API discussed below.

Accessing attributes of the model

The attributes of all GraphLab Create models, which include training statistics, model hyper-parameters, and model results, can be accessed in the same way as Python dictionaries. To get a list of all fields that can be accessed, use the list_fields() function:

fields = model.list_fields()
print fields
['auto_tuning',
 'coefficients',
 'convergence_threshold',
 'feature_rescaling',
 'features',
 'l1_penalty',
 'l2_penalty',
 'lbfgs_memory_level',
 'max_iterations',
 'mini_batch_size',
 'num_coefficients',
 'num_examples',
 'num_features',
 'num_unpacked_features',
 'solver',
 'step_size',
 'target',
 'training_iterations',
 'training_loss',
 'training_rmse',
 'training_solver_status',
 'training_time',
 'unpacked_features']

Each of these fields can be accessed using dictionary-like lookups. For example, the training_rmse measures the root-mean-squared error during training. It can be accessed as follows:

print model['training_rmse']
0.971003765928

The API docs provide a detailed description of each of the model's attributes.

Interpreting results

Linear regression can provide valuable insight about the relationships between the target and feature columns in your data, revealing why your model returns the predictions that it does. The coefficients (θ) are what the training procedure learns. Each model coefficient describes the expected change in the target variable associated with a unit change in the feature. The bias term indicates the "inherent" or "average" target value if all feature values were set to zero.

The coefficients often tell an interesting story of how much each feature matters in predicting target values. The magnitude (absolute value) of the coefficient for each feature indicates the strength of the feature's association to the target variable, holding all other features constant. The sign on the coefficient (positive or negative) gives the direction of the association.
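
This interpretation can be checked mechanically: holding all other features fixed, a unit change in one feature moves the prediction by exactly that feature's coefficient. A small sketch (the coefficient values below are illustrative, not the trained values):

```python
def predict(coefs, intercept, features):
    """Linear prediction: intercept plus the weighted sum of feature values."""
    return intercept + sum(coefs[name] * value for name, value in features.items())

coefs = {'user_avg_stars': 0.81, 'business_avg_stars': 0.78}
base = predict(coefs, -2.23, {'user_avg_stars': 3.0, 'business_avg_stars': 4.0})
bumped = predict(coefs, -2.23, {'user_avg_stars': 4.0, 'business_avg_stars': 4.0})

# Increasing user_avg_stars by 1 moves the prediction by its coefficient
print(round(bumped - base, 2))  # 0.81
```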

When a GraphLab Create regression model is trained, the model.summary() output shows the largest positive and negative coefficients. For a trained model, we can access the coefficients as follows:

coefs = model['coefficients']
print coefs
+-----------------------+-------+-------------------+-------------------+
|          name         | index |       value       |       stderr      |
+-----------------------+-------+-------------------+-------------------+
|      (intercept)      |  None |   -2.22993154648  |  0.0199750718984  |
|     user_avg_stars    |  None |   0.809669227421  |  0.00414406824672 |
|   business_avg_stars  |  None |   0.781099022798  |  0.00419064940637 |
|   user_review_count   |  None | 1.85023282568e-05 | 1.15481783547e-05 |
| business_review_count |  None | 7.06842770902e-05 | 1.72332966093e-05 |
+-----------------------+-------+-------------------+-------------------+
[5 rows x 3 columns]

Note that the index column in the coefficients is only applicable for categorical features, lists, and dictionaries. In the SFrame above, there is an extra column for standard errors on the estimated coefficients (see section below for a more detailed explanation).

Standard-errors

The standard error is the empirical standard deviation of the estimated coefficient. For example, a coefficient of 1 with a standard error of 0.5 implies a standard deviation of 0.5 on the estimated values of that coefficient. Standard errors capture the confidence we have in the coefficients estimated during model training: smaller standard errors imply more confidence in the values returned by the model.

Standard errors on coefficients are only available when solver=newton or when the default auto solver chooses the newton method, and only if the number of examples in the training data exceeds the number of coefficients. If standard errors cannot be estimated, a column of None values is returned.
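
A common use of the standard error is to form an approximate confidence interval: under the usual normality assumptions, coefficient ± 1.96 × stderr is roughly a 95% interval. A sketch, using the user_avg_stars row of the coefficient table above (rounded values):

```python
def conf_interval(coef, stderr, z=1.96):
    """Approximate 95% normal confidence interval for a coefficient."""
    return (coef - z * stderr, coef + z * stderr)

lo, hi = conf_interval(0.8097, 0.0041)
print(round(lo, 3), round(hi, 3))  # 0.802 0.818
# The interval excludes zero, so the association with the target is meaningful
```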

Categorical features

Categorical variables are features that can take one of a limited, and usually fixed, number of possible values. Each category is referred to as a level. As an example, consider the variable city that a given restaurant is in. This dataset has about 60 unique strings for the city (the number of unique values in the output of GraphLab Canvas is approximate).

city = train_data['city']
city.show()

Categorical variable summary

The regression module in GraphLab Create uses simple encoding while training models with string features. Simple encoding compares each category to an arbitrary reference category (we choose the first category of the data as the reference). In other words, we add dummy coefficients to encode each category. The number of these dummy coefficients is equal to the total number of categories minus 1 (for the reference category). The following figure illustrates how categorical variables are encoded using the simple encoding scheme. This article provides more details about simple encoding for regression models.

Categorical variable summary
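
The encoding scheme above can be sketched in a few lines of plain Python; the first category observed serves as the reference and receives the all-zeros code. This is a sketch of the idea, not GraphLab's internal implementation:

```python
def simple_encode(column):
    """Dummy-encode a string column: the reference category maps to all zeros."""
    # Preserve first-appearance order; the first category is the reference
    levels = []
    for v in column:
        if v not in levels:
            levels.append(v)
    dummies = levels[1:]  # one dummy indicator per non-reference level
    return [[1.0 if v == d else 0.0 for d in dummies] for v in column]

cities = ['Tempe', 'Mesa', 'Tempe', 'Chandler']
print(simple_encode(cities))
# Tempe (reference) -> [0, 0]; Mesa -> [1, 0]; Chandler -> [0, 1]
```

With 3 distinct cities we get 2 dummy coefficients, matching the "categories minus 1" rule above.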

Now let us look at a simple example of using city as a variable in the linear regression model. Notice that there is no need to do any special pre-processing: all SFrame columns of type str are automatically transformed into categorical variables. Note that the number of coefficients and the number of features aren't the same.

In this example, there are 60 unique cities, so there are 59 dummy coefficients. GraphLab Create can handle categorical variables with millions of categories.

# Create a model with categorical features.
model = gl.linear_regression.create(train_data, target='stars',
                                    features = ['user_avg_stars',
                                                'business_avg_stars',
                                                'user_review_count',
                                                'business_review_count',
                                                'city'])

# Number of feature columns
print "Number of features          : %s" % model['num_features']

# Number of features (including expanded lists and dictionaries)
print "Number of unpacked features : %s" % model['num_unpacked_features']

# Number of coefficients in the model
print "Number of coefficients      : %s" % model['num_coefficients']

# A coefficient is included for each category
print model['coefficients']
Number of features          : 5
Number of unpacked features : 5
Number of coefficients      : 64

+-----------------------+------------+-------------------+
|          name         |   index    |       value       |
+-----------------------+------------+-------------------+
|      (intercept)      |    None    |   -2.23521785747  |
|     user_avg_stars    |    None    |   0.812827992012  |
|   business_avg_stars  |    None    |   0.778447236096  |
|   user_review_count   |    None    | 2.48618619187e-05 |
| business_review_count |    None    | 0.000102740183746 |
|          city         |   Tempe    |  -0.0214425060978 |
|          city         | Scottsdale | -0.00858553630484 |
|          city         |    Mesa    |  0.0109508448699  |
|          city         |  Chandler  |  0.0390579171569  |
|          city         |  Gilbert   | -0.00525039510189 |
|          ...          |    ...     |        ...        |
+-----------------------+------------+-------------------+
[64 rows x 3 columns]

Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
Sparse features

Sparsity is one of the most important things to consider when working with a lot of data. Sparse features are encoded using a dictionary where the keys represent the feature names and the values are the feature values. A coefficient is added for each key in the union of the set of keys for all dictionaries across all examples. For a particular example, only non-zero values need to be stored in the dictionary. The following illustration describes how GraphLab Create dictionary features leverage sparsity in your data.

Dictionary variable summary

Often, the sparse representation using dictionaries is much more efficient than using the dense representation.
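
The efficiency gain comes from iterating only over the keys that are present: keys absent from an example's dictionary are implicitly zero and contribute nothing. A sketch of a sparse dot product between a dictionary-encoded example and a coefficient map (names and values are illustrative):

```python
def sparse_dot(example, coefs):
    """Dot product over only the keys present in a sparse example."""
    # Absent keys are implicitly zero; keys unseen in training are ignored
    return sum(value * coefs.get(key, 0.0) for key, value in example.items())

coefs = {'Restaurants': 0.27, 'Pizza': 0.10, 'Italian': -0.05}
example = {'Restaurants': 1, 'Pizza': 1}  # only two of many possible tags
print(sparse_dot(example, coefs))
```

The work per example is proportional to the number of tags it actually has, not to the total number of tags in the dataset.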

Yelp reviews contain useful tags that describe each business. Unlike categorical variables, a business may have several tags. These tags are stored using dictionaries. Not all tags are associated with every restaurant, so we store only the tags associated with each restaurant and treat all tags not explicitly provided as features with value zero. We illustrate with the following example:

print train_data['categories_dict'].head(3)
[{'Breakfast & Brunch': 1, 'Restaurants': 1},
 {'Restaurants': 1, 'Pizza': 1, 'Italian': 1},
 {'Dog Parks': 1, 'Parks': 1, 'Active Life': 1}]

We can now create a linear regression model with categories_dict included in the feature list. A coefficient is added for each key encountered in the input data. If the test data contains keys that were not present during training, they are silently ignored.

# Create a model with dictionary features.
model = gl.linear_regression.create(train_data, target='stars',
                                    features = ['user_avg_stars',
                                                'business_avg_stars',
                                                'user_review_count',
                                                'business_review_count',
                                                'categories_dict'])

# Number of feature columns
print "Number of features          : %s" % model['num_features']

# Number of features (including expanded lists and dictionaries)
print "Number of unpacked features : %s" % model['num_unpacked_features']

# Number of coefficients in the model
print "Number of coefficients      : %s" % model['num_coefficients']

# A coefficient is included for each key in the dictionary
print model['coefficients']
Number of features          : 5
Number of unpacked features : 511
Number of coefficients      : 512

+-----------------------+--------------------+--------------------+
|          name         |       index        |       value        |
+-----------------------+--------------------+--------------------+
|      (intercept)      |        None        |   0.872679609626   |
|     user_avg_stars    |        None        |   0.415478142682   |
|   business_avg_stars  |        None        |   0.363118585079   |
|   user_review_count   |        None        | -1.13051967735e-05 |
| business_review_count |        None        | 0.000205358202417  |
|    categories_dict    | Breakfast & Brunch |  -0.120010958039   |
|    categories_dict    |    Restaurants     |   0.271655570815   |
|    categories_dict    |   Middle Eastern   |  -0.229361372902   |
|    categories_dict    |     Dog Parks      |  -0.509799508435   |
|    categories_dict    |       Parks        |  -0.0565527745702  |
|          ...          |        ...         |        ...         |
+-----------------------+--------------------+--------------------+
[512 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
List features

GraphLab Create can also handle lists of numerical values as features without preprocessing. The following illustration shows that a numeric-list feature is equivalent to adding a separate column for each individual value. This is especially useful for image data, where the number of features generated is large and the features may not have natural interpretations.

List variable summary

As an example, we convert the votes (which stores the number of cool, funny, and useful votes for each review) column into a list of features for each vote type.

from array import array

# List of features
train_data['votes_list'] = train_data['votes'].apply(lambda x: x.values())
print train_data['votes_list'].head(3)
[array('d', [0.0, 5.0, 2.0]),
 array('d', [0.0, 0.0, 0.0]),
 array('d', [0.0, 2.0, 1.0])]

Notice that the SArray of numeric lists must be of dtype array, and the array in each row must be of the same length. The astype() function of the SArray can help convert your data from type list to array.
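
The rows above are built on Python's standard-library array module. A minimal sketch of what a valid numeric-list column looks like (the data values are illustrative):

```python
from array import array

# Each row: a fixed-length array of doubles, same length across all rows
rows = [[0.0, 5.0, 2.0], [0.0, 0.0, 0.0], [0.0, 2.0, 1.0]]
votes = [array('d', r) for r in rows]

# All rows must share a single length, or model creation would fail
assert len({len(v) for v in votes}) == 1
print(votes[0])  # array('d', [0.0, 5.0, 2.0])
```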

# Create a model with numeric-list features.
model = gl.linear_regression.create(train_data, target='stars',
                                    features = ['user_avg_stars',
                                                'business_avg_stars',
                                                'user_review_count',
                                                'business_review_count',
                                                'votes_list'])


# Number of feature columns
print "Number of features          : %s" % model['num_features']

# Number of features (including expanded lists and dictionaries)
print "Number of unpacked features : %s" % model['num_unpacked_features']

# Number of coefficients in the model
print "Number of coefficients      : %s" % model['num_coefficients']

# A coefficient is included for each index in the list
print model['coefficients']
Number of features          : 5
Number of unpacked features : 7
Number of coefficients      : 8

+-----------------------+-------+-------------------+
|          name         | index |       value       |
+-----------------------+-------+-------------------+
|      (intercept)      |  None |   -2.02163219079  |
|     user_avg_stars    |  None |   0.792122828319  |
|   business_avg_stars  |  None |   0.756871316569  |
|   user_review_count   |  None | 3.46394815858e-06 |
| business_review_count |  None | 7.55781795366e-05 |
|       votes_list      |   0   |  -0.0919094290658 |
|       votes_list      |   1   |  -0.113808771262  |
|       votes_list      |   2   |   0.201216924886  |
+-----------------------+-------+-------------------+
[8 rows x 3 columns]
Feature rescaling

Feature rescaling is the process of scaling individual features to a comparable scale. It is particularly useful when features vary widely in their ranges, and it is well known that rescaling can have an enormous impact on accuracy. By default, GraphLab Create rescales all the features so that they have the same L2-norm. The data is not centered (columns are not shifted to have zero mean). Categorical features, lists, and sparse features (dictionaries) are also rescaled columnwise. Feature rescaling can be disabled (recommended only for advanced users) by setting feature_rescaling = False. The returned coefficients are scaled back to the original scale of the problem.

model = gl.linear_regression.create(train_data, target='stars',
                                    features = ['user_avg_stars',
                                                'business_avg_stars',
                                                'user_review_count',
                                                'business_review_count'],
                                    feature_rescaling = False)
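
What unit-L2 rescaling does can be sketched column by column; a coefficient learned in the rescaled space maps back to the original scale by dividing by the same norm. This is a sketch of the idea, not GraphLab's exact implementation:

```python
import math

def rescale_columns(X):
    """Scale each column to unit L2 norm; return the scaled data and the norms."""
    ncols = len(X[0])
    # Guard against all-zero columns by substituting a norm of 1.0
    norms = [math.sqrt(sum(row[j] ** 2 for row in X)) or 1.0 for j in range(ncols)]
    scaled = [[row[j] / norms[j] for j in range(ncols)] for row in X]
    return scaled, norms

X = [[3.0, 100.0], [4.0, 0.0]]
scaled, norms = rescale_columns(X)
print(norms)   # column norms: [5.0, 100.0]
print(scaled)  # each column now has unit L2 norm

# A coefficient theta_j learned on the scaled data maps back as theta_j / norms[j]
```

After rescaling, both columns contribute on the same scale even though their raw ranges differed by orders of magnitude.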
Choosing the solver

The optimization method used to train the model is automatically picked, based on the input data, from a carefully engineered collection of solvers. The newton method works best for datasets with plenty of examples and few features (long datasets). Limited-memory BFGS (lbfgs) is a robust solver for wide datasets (i.e. datasets with many coefficients). fista is the default solver for L1-regularized linear regression. Gradient descent (gd) is another well-tuned method that can work really well on L1-regularized problems. The solvers are all automatically tuned and the default options should function well. Solver options can, however, be changed. In this example, we will increase max_iterations to 20. This is useful when model['training_solver_status'] is 'TERMINATED: Iteration limit reached' and the accuracy of the model is not at the desired level.

# Provide solver options in a dictionary
model = gl.linear_regression.create(train_data, target='stars',
                                    features = ['user_id',
                                                'business_id',
                                                'user_avg_stars',
                                                'business_avg_stars'],
                                    solver = 'lbfgs',
                                    max_iterations = 20)

What happens when the default solver fails?

Whenever you have convergence trouble and the number of features is < 5000, then solver=newton is a good choice. The Newton method is one of the most stable optimization methods and works quite well if you have few coefficients. The downside of the Newton method is its large computational complexity per iteration.
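
For unregularized least squares, a single Newton step from zero solves the normal equations (XᵀX)θ = Xᵀy exactly; the per-iteration cost grows with the number of coefficients (cubic in general), which is the downside noted above. A sketch for a two-coefficient problem, solved with a closed-form 2×2 inverse:

```python
def newton_least_squares_2d(X, y):
    """Solve (X^T X) theta = X^T y for a 2-coefficient least-squares problem."""
    # Accumulate the 2x2 matrix X^T X and the vector X^T y
    a = sum(x[0] * x[0] for x in X)
    b = sum(x[0] * x[1] for x in X)
    d = sum(x[1] * x[1] for x in X)
    g0 = sum(x[0] * yi for x, yi in zip(X, y))
    g1 = sum(x[1] * yi for x, yi in zip(X, y))
    det = a * d - b * b  # invert the 2x2 system directly
    return [(d * g0 - b * g1) / det, (a * g1 - b * g0) / det]

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # intercept column + one feature
y = [2.0, 4.0, 6.0]                       # exactly y = 2 * feature
theta = newton_least_squares_2d(X, y)
print([round(t, 6) for t in theta])       # [0.0, 2.0]
```

With d coefficients the matrix is d×d, so forming and solving it is what makes Newton expensive per iteration on wide datasets.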

Solver options guide

The solvers are automatically tuned and the default options should function well. Changing the default settings may help in some cases.

  • step_size: The starting step size to use for the fista and gd solvers. The default is 1.0, which is an aggressive setting. If the first iteration is taking a considerable amount of time, then reducing the step size may speed up model training. If the step size of 1.0 is accepted in the first iteration, then consider increasing this step size to a more aggressive value.
  • max_iterations: The maximum number of passes through the data allowed. Increasing this can result in a more accurately trained model. Consider increasing this (the default value is 10) if the training accuracy is low and the Grad-Norm in the display is large.
  • convergence_threshold: Convergence is tested using the variation in the training objective, measured as the relative change in the objective value between two successive iterations. Consider reducing this below the default value (0.01) for a better fit to the training data. Beware of overfitting (i.e. a model that works well only on the training data) if this parameter is set to a very low value.
  • lbfgs_memory_level: The L-BFGS algorithm keeps track of gradient information from the previous lbfgs_memory_level iterations. The storage requirement for each of these gradients is proportional to num_coefficients in the problem. Increasing lbfgs_memory_level can improve the quality of the trained model at the cost of more memory. Setting this to more than max_iterations has the same effect as setting it to max_iterations.
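
The convergence_threshold test described above can be sketched as a relative-change check on successive objective values (a sketch, assuming convergence is measured as relative objective change between iterations):

```python
def converged(prev_obj, curr_obj, convergence_threshold=0.01):
    """Stop when the relative change in the objective falls below the threshold."""
    return abs(prev_obj - curr_obj) / max(abs(prev_obj), 1e-12) < convergence_threshold

print(converged(100.0, 99.5))  # 0.5% change -> True  (stop)
print(converged(100.0, 90.0))  # 10% change  -> False (keep iterating)
```

Lowering the threshold makes the stopping rule stricter, which is why very small values can overfit the training data.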
Making predictions with missing data

Once the model is created and ready to make predictions, the input data used for prediction is often noisy or incomplete. GraphLab Create can recognize the following types of missing data (stored as None). Note that the following options only apply to missing values during prediction or evaluation. Missing values during model creation raise an error.

  • Missing values in individual features: Individual features including numerical, categorical, list of numeric, and dictionary features may contain missing data. The option missing_value_action determines the way in which these features are handled. The default is missing_value_action='impute', which imputes each of the features to the mean value observed during training. Setting missing_value_action='error' throws an error if any missing values are encountered during prediction.
  • Missing feature columns: Sometimes, an entire feature column is missing at prediction time. In this case, missing_value_action='impute' imputes the mean value observed during training for each row in the input data. Again, setting missing_value_action='error' throws an error if an entire feature column is missing during prediction.
  • New categories in categorical features: New categories are often observed at prediction time. These new categories are treated as the reference category, and thus have no effect on the model prediction.
  • New keys in dictionary features: New keys in dictionary features at prediction time are silently ignored and do not affect the prediction made.
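
Mean imputation of missing numerical features can be sketched as follows (an illustrative helper, not the library's internals; the feature means below are made up):

```python
def impute_means(rows, feature_means):
    """Replace None values with the per-feature mean observed during training."""
    return [{k: (feature_means[k] if v is None else v) for k, v in row.items()}
            for row in rows]

feature_means = {'user_avg_stars': 3.7, 'business_avg_stars': 3.6}
rows = [{'user_avg_stars': None, 'business_avg_stars': 4.0}]
print(impute_means(rows, feature_means))
# -> [{'user_avg_stars': 3.7, 'business_avg_stars': 4.0}]
```
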
# Create a model with categorical and dictionary features.
model = gl.linear_regression.create(train_data, target='stars',
                                    features = ['user_avg_stars',
                                                'business_avg_stars',
                                                'user_review_count',
                                                'business_review_count',
                                                'city',
                                                'categories_dict'])

# An SFrame with a single row that has a missing value in 'user_avg_stars'.
# predict() will impute the missing value with the mean observed during training.
new_data = gl.SFrame({'user_avg_stars': [None],
                      'business_avg_stars': [30.0],
                      'user_review_count': [10],
                      'business_review_count': [20],
                       'city': ['Tuscon'],
                       'categories_dict': [{'Pizza': 1}]
                      })
prediction = model.predict(new_data)
print prediction

# An SFrame without the 'user_avg_stars' feature. predict() will impute the
# entire feature to the mean value observed during training.
new_data = gl.SFrame({'business_avg_stars': [30.0],
                      'user_review_count': [10],
                      'business_review_count': [20],
                       'city': ['Tuscon'],
                       'categories_dict': [{'Pizza': 1}]
                      })
prediction = model.predict(new_data)
Regularizing Models (Lasso, Ridge, and Elastic Net regression)

Regularization is the process of using problem structure to solve ill-posed problems or to prevent overfitting. The structure is imposed by adding a penalty term to the objective function. The linear regression module supports L1 and L2 regularization, which are added to the loss function.

As discussed above, the composite objective being optimized for is the following:

    min_θ  Σ_i (θᵀ x_i − y_i)² + λ₁ ‖θ‖₁ + λ₂ ‖θ‖₂²

where λ₁ is the l1_penalty and λ₂ is the l2_penalty.

# Lasso
model = gl.linear_regression.create(train_data, target='stars',
                                    features = ['user_avg_stars',
                                                'business_avg_stars',
                                                'user_review_count',
                                                'business_review_count'],
                                    l1_penalty = 1.0,
                                    l2_penalty = 0.0)

# Ridge regression
model = gl.linear_regression.create(train_data, target='stars',
                                    features = ['user_avg_stars',
                                                'business_avg_stars',
                                                'user_review_count',
                                                'business_review_count'],
                                    l2_penalty = 1.0)

# Elastic net regression
model = gl.linear_regression.create(train_data, target='stars',
                                    features = ['user_avg_stars',
                                                'business_avg_stars',
                                                'user_review_count',
                                                'business_review_count'],
                                    l1_penalty = 1.0,
                                    l2_penalty = 1.0)