# Churn prediction

Churn prediction is the task of identifying users who are likely to stop using a service, product, or website. With this toolkit, you can start from raw (or processed) usage metrics and forecast the probability that a given customer will churn.

Note: Follow the steps in the sample-churn-predictor GitHub repo to get the code and data used in this chapter.

#### Introduction

A churn prediction model learns patterns in historical user behavior and uses them to forecast the probability that a user will have no activity in the future (that is, will churn).

#### How is churn defined?

Customer churn can be defined in many ways. In this toolkit, a user is said to have churned if there is no activity for a fixed period of time called the churn_period (30 days by default). The following figure illustrates this concept.

A churn forecast is always associated with a particular timestamp, known as the time_boundary, at which the churn_period starts. For example, a user is said to have churned at the time_boundary Jan 2015 if the user had no activity for a full churn_period (say, 30 days) after Jan 2015.
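The definition above can be sketched in plain Python. This is only an illustration of the labeling rule, not the toolkit's implementation: the helper `has_churned`, the sample users, and the choice to treat the window as open at the time_boundary and closed at its end are all assumptions made for this example.

```python
import datetime

def has_churned(activity_times, time_boundary, churn_period):
    """Return True if no activity falls in (time_boundary, time_boundary + churn_period]."""
    window_end = time_boundary + churn_period
    return not any(time_boundary < t <= window_end for t in activity_times)

time_boundary = datetime.datetime(2015, 1, 1)
churn_period = datetime.timedelta(days=30)

# Hypothetical activity logs for two users.
activity = {
    'user_a': [datetime.datetime(2014, 12, 20), datetime.datetime(2015, 1, 10)],
    'user_b': [datetime.datetime(2014, 12, 5), datetime.datetime(2015, 3, 2)],
}

# user_a is active inside the window; user_b only returns after it ends.
churned = {user: has_churned(times, time_boundary, churn_period)
           for user, times in activity.items()}
print(churned)
```

Note that activity before the time_boundary, or after the churn_period has elapsed, does not affect the label under this rule.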

#### Input Data

A churn prediction model can be trained on a time series of observations (observation_data). The time series must contain a user_id column and at least one other column that can be used as a feature. The following example shows a typical dataset that can be consumed directly by the churn predictor toolkit.

+---------------------+------------+----------+
|     InvoiceDate     | CustomerID | Quantity |
+---------------------+------------+----------+
| 2010-12-01 08:26:00 |   17850    |    6     |
| 2010-12-01 08:26:00 |   17850    |    6     |
| 2010-12-01 08:26:00 |   17850    |    8     |
| 2010-12-01 08:26:00 |   17850    |    6     |
| 2010-12-01 08:26:00 |   17850    |    6     |
| 2010-12-01 08:26:00 |   17850    |    2     |
| 2010-12-01 08:26:00 |   17850    |    6     |
| 2010-12-01 08:28:00 |   17850    |    6     |
| 2010-12-01 08:28:00 |   17850    |    6     |
| 2010-12-01 08:34:00 |   13047    |    32    |
| 2010-12-01 08:34:00 |   13047    |    6     |
| 2010-12-01 08:34:00 |   13047    |    6     |
| 2010-12-01 08:34:00 |   13047    |    8     |
| 2010-12-01 08:34:00 |   13047    |    6     |
| 2010-12-01 08:34:00 |   13047    |    6     |
| 2010-12-01 08:34:00 |   13047    |    3     |
| 2010-12-01 08:34:00 |   13047    |    2     |
| 2010-12-01 08:34:00 |   13047    |    3     |
| 2010-12-01 08:34:00 |   13047    |    3     |
| 2010-12-01 08:34:00 |   13047    |    4     |
+---------------------+------------+----------+
[532618 rows x 5 columns]
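To make the expected shape concrete, here is a minimal sketch that groups a few rows from the table above by user, using only the standard library. It is not how the toolkit processes its input; the variable names (`rows`, `events_by_user`) are hypothetical and chosen for this example.

```python
import datetime
from collections import defaultdict

# A few rows copied from the table above: (InvoiceDate, CustomerID, Quantity).
rows = [
    ('2010-12-01 08:26:00', 17850, 6),
    ('2010-12-01 08:28:00', 17850, 6),
    ('2010-12-01 08:34:00', 13047, 32),
    ('2010-12-01 08:34:00', 13047, 6),
]

# Group observations by user: the timestamp orders the events,
# CustomerID plays the role of user_id, and Quantity is a feature.
events_by_user = defaultdict(list)
for stamp, customer_id, quantity in rows:
    when = datetime.datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S')
    events_by_user[customer_id].append((when, quantity))

# Keep each user's events in time order.
for events in events_by_user.values():
    events.sort()

for customer_id, events in sorted(events_by_user.items()):
    total_quantity = sum(quantity for _, quantity in events)
    print('{}: {} events, total quantity {}'.format(
        customer_id, len(events), total_quantity))
```

Each user ends up with a time-ordered list of feature values, which is the kind of per-user history a churn model learns from.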


In the above dataset, let us assume that the last timestamp was October 1, 2011. If the churn_period is set to 1 month, a churn forecast predicts the probability that a user will have no activity for a 1 month period after October 1, 2011.
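The forecast window described above can be computed with a few lines of standard-library Python. This is a sketch under two assumptions: the last timestamp is the October 1, 2011 value assumed in the text, and "1 month" is approximated as 30 days. The names `window_start` and `window_end` are hypothetical.

```python
import datetime

# Last timestamp in the dataset, as assumed in the text.
last_timestamp = datetime.datetime(2011, 10, 1)

# Assumption: "1 month" is approximated as 30 days.
churn_period = datetime.timedelta(days=30)

# The forecast covers the churn_period that starts at the last timestamp.
window_start = last_timestamp
window_end = window_start + churn_period

print('{} to {}'.format(window_start.date(), window_end.date()))
```

A user with no activity anywhere in this window would count as churned for this forecast.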

#### Example

In this example, we will explore the task of predicting churn directly from customer activity logs. The following dataset contains transactions that occurred between 1 December 2010 and 9 December 2011 at a UK-based, registered, non-store online retailer.

import datetime

import graphlab as gl
from dateutil import parser

# Load the online retail transactions into an SFrame.
sf = gl.SFrame(
    'https://static.turi.com/datasets/churn-prediction/online_retail.csv')

# Convert InvoiceDate from string to a Python datetime.
sf['InvoiceDate'] = sf['InvoiceDate'].apply(parser.parse)

# Convert the SFrame into TimeSeries with InvoiceDate as the index.
time_series = gl.TimeSeries(sf, 'InvoiceDate')

# Split the data by user into training and validation sets.
train, valid = gl.churn_predictor.random_split(time_series,
    user_id='CustomerID', fraction=0.9)

# Define the period of inactivity that constitutes churn.
churn_period = datetime.timedelta(days=30)

# Train a churn prediction model.
model = gl.churn_predictor.create(train, user_id='CustomerID',
    features=['Quantity'],
    churn_period=churn_period)

# Make a churn forecast relative to the last timestamp in the data.
predictions = model.predict(time_series)

# Make a churn forecast relative to an explicit time boundary.
evaluation_time = datetime.datetime(2011, 9, 1)
predictions = model.predict(time_series, evaluation_time)

# Visualize the results
views = model.views.overview(time_series, evaluation_time)
views.show()