Churn prediction

Churn prediction is the task of identifying whether users are likely to stop using a service, product, or website. With this toolkit, you can start with raw (or processed) usage metrics and forecast the probability that a given customer will churn.

Note: Follow the steps in the sample-churn-predictor GitHub repo to get the code and data used in this chapter.

Introduction

A churn predictor model learns patterns in historical user behavior and uses them to forecast the probability that a user will have no activity in the future (which is how churn is defined here).

How is churn defined?

Customer churn can be defined in many ways. In this toolkit, a user is said to have churned if there is no activity for a fixed duration of time known as the churn_period (by default, 30 days). The following figure illustrates this concept.

[Figure: churn illustration]

A churn forecast is always associated with a particular timestamp, known as the time_boundary, at which the churn_period starts. For example, a user is said to have churned at the time_boundary Jan 2015 if the user had no activity during the churn_period (say, 30 days) that follows Jan 2015.
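The definition can be made concrete with a small sketch in plain Python. It is an illustration only, not the toolkit's implementation: the helper name `has_churned` and the sample activity log are hypothetical, and the sketch assumes the churn window opens strictly after the time_boundary and includes its end point.

```python
import datetime


def has_churned(activity_times, time_boundary, churn_period):
    """Return True if no activity falls in (time_boundary, time_boundary + churn_period]."""
    window_end = time_boundary + churn_period
    return not any(time_boundary < t <= window_end for t in activity_times)


# Hypothetical activity log: user_id -> timestamps of observed activity.
activity = {
    "user_a": [datetime.datetime(2014, 12, 20), datetime.datetime(2015, 1, 10)],
    "user_b": [datetime.datetime(2014, 12, 5), datetime.datetime(2015, 3, 1)],
}

time_boundary = datetime.datetime(2015, 1, 1)
churn_period = datetime.timedelta(days=30)

# Label every user relative to the same time_boundary.
labels = {
    user: has_churned(times, time_boundary, churn_period)
    for user, times in activity.items()
}
```

Under this definition, user_b counts as churned at this time_boundary even though the user returns later, because the later activity falls outside the churn_period.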

Input Data

A churn prediction model can be trained on a time series of observation data (observation_data). The time series must contain a column that represents the user_id and at least one other column that can be treated as a feature column. The following example shows a typical dataset that can be consumed directly by the churn predictor toolkit.

+---------------------+------------+----------+
|     InvoiceDate     | CustomerID | Quantity |
+---------------------+------------+----------+
| 2010-12-01 08:26:00 |   17850    |    6     |
| 2010-12-01 08:26:00 |   17850    |    6     |
| 2010-12-01 08:26:00 |   17850    |    8     |
| 2010-12-01 08:26:00 |   17850    |    6     |
| 2010-12-01 08:26:00 |   17850    |    6     |
| 2010-12-01 08:26:00 |   17850    |    2     |
| 2010-12-01 08:26:00 |   17850    |    6     |
| 2010-12-01 08:28:00 |   17850    |    6     |
| 2010-12-01 08:28:00 |   17850    |    6     |
| 2010-12-01 08:34:00 |   13047    |    32    |
| 2010-12-01 08:34:00 |   13047    |    6     |
| 2010-12-01 08:34:00 |   13047    |    6     |
| 2010-12-01 08:34:00 |   13047    |    8     |
| 2010-12-01 08:34:00 |   13047    |    6     |
| 2010-12-01 08:34:00 |   13047    |    6     |
| 2010-12-01 08:34:00 |   13047    |    3     |
| 2010-12-01 08:34:00 |   13047    |    2     |
| 2010-12-01 08:34:00 |   13047    |    3     |
| 2010-12-01 08:34:00 |   13047    |    3     |
| 2010-12-01 08:34:00 |   13047    |    4     |
+---------------------+------------+----------+
[532618 rows x 5 columns]

In the above dataset, let us assume that the last timestamp was October 1, 2011. If the churn_period is set to 1 month, a churn forecast predicts the probability that a user will have no activity for a 1 month period after October 1, 2011.
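As a rough sketch of this arithmetic, the window covered by a forecast can be computed from the observation rows. The rows below are hypothetical values shaped like the table above, the time boundary is assumed to be the latest timestamp in the data, and "1 month" is approximated as 30 days.

```python
import datetime

# Hypothetical rows shaped like the table above: (InvoiceDate, CustomerID, Quantity).
rows = [
    (datetime.datetime(2011, 9, 28, 10, 15), 17850, 6),
    (datetime.datetime(2011, 10, 1, 9, 0), 13047, 32),
    (datetime.datetime(2011, 9, 30, 16, 45), 17850, 2),
]

# Assumption: the time boundary is the latest timestamp in the data.
time_boundary = max(invoice_date for invoice_date, _, _ in rows)

# Assumption: a churn_period of "1 month" is approximated as 30 days.
churn_period = datetime.timedelta(days=30)

# The forecast concerns activity between time_boundary and window_end.
window_end = time_boundary + churn_period
```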

Example

In this example, we will explore the task of predicting churn directly from customer activity logs. The following dataset contains transactions that occurred between 01/12/2010 and 09/12/2011 (day/month/year) for a UK-based, registered, non-store online retailer.

import graphlab as gl
import datetime

# Load a data set.
sf = gl.SFrame(
   'https://static.turi.com/datasets/churn-prediction/online_retail.csv')

# Convert InvoiceDate from string to a Python datetime.
from dateutil import parser
sf['InvoiceDate'] = sf['InvoiceDate'].apply(parser.parse)

# Convert the SFrame into TimeSeries with InvoiceDate as the index.
time_series = gl.TimeSeries(sf, 'InvoiceDate')


# Split the data into training and validation sets using the
# churn predictor's own splitter.
train, valid = gl.churn_predictor.random_split(time_series,
                              user_id='CustomerID', fraction=0.9)

# Define the period of inactivity that constitutes churn.
churn_period = datetime.timedelta(days=30)

# Train a churn prediction model.
model = gl.churn_predictor.create(train, user_id='CustomerID',
                      features=['Quantity'],
                      churn_period=churn_period)

# Make a churn forecast.
predictions = model.predict(time_series)

# Make a churn forecast as of a specific time boundary.
evaluation_time = datetime.datetime(2011, 9, 1)
predictions = model.predict(time_series, evaluation_time)

# Visualize the results
views = model.views.overview(time_series, evaluation_time)
views.show()
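The split step in the example takes a user_id argument, which suggests that the data is divided by user rather than by row. The sketch below shows one way such a user-level split could work in plain Python. It is an assumption for illustration, not the toolkit's implementation: the helper `split_by_user`, its parameters, and the sample rows are all hypothetical.

```python
import random


def split_by_user(rows, user_index, fraction, seed=0):
    """Split rows so that every row of a given user lands on the same side."""
    users = sorted({row[user_index] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(users)
    # Roughly `fraction` of the users go to the first (training) side.
    n_train = int(round(fraction * len(users)))
    train_users = set(users[:n_train])
    train = [row for row in rows if row[user_index] in train_users]
    valid = [row for row in rows if row[user_index] not in train_users]
    return train, valid


# Hypothetical rows: (InvoiceDate, CustomerID, Quantity), 10 users with 3 rows each.
rows = [
    ("2010-12-0%d 08:26:00" % day, 17000 + user, 6)
    for user in range(10)
    for day in (1, 2, 3)
]

train, valid = split_by_user(rows, user_index=1, fraction=0.9)
```

Keeping each user's history on one side prevents the same user's activity from appearing in both the training and validation sets.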

Learn more

The following sections provide more information about the churn prediction model: