graphlab.recommender.util.random_split_by_user

graphlab.recommender.util.random_split_by_user(dataset, user_id='user_id', item_id='item_id', max_num_users=1000, item_test_proportion=0.2, random_seed=0)

Create a recommender-friendly train-test split of the provided data set.

The test set is built by first choosing up to max_num_users users from all the users in dataset. Then, for each chosen test user, a portion of that user’s items (determined by item_test_proportion) is randomly selected for the test set. Because only part of each test user’s items is held out, the training data retains enough information about the users in the test set for adequate recommendations to be made. The test set may contain fewer than max_num_users users, because a user can be chosen for the test set and still have none of their items selected.
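The two-stage procedure described above can be sketched in plain Python. This is an illustrative approximation only, not GraphLab's implementation: the function name split_by_user_sketch is hypothetical, the data is a list of (user, item) tuples rather than an SFrame, and each item is assumed to be selected independently with probability item_test_proportion.

```python
import random


def split_by_user_sketch(pairs, max_num_users=1000,
                         item_test_proportion=0.2, random_seed=0):
    """Hypothetical sketch of a per-user train/test split.

    `pairs` is a list of (user, item) tuples. Not GraphLab's actual code.
    """
    rng = random.Random(random_seed)

    # Unique users in first-seen order, so the result is reproducible.
    users = list(dict.fromkeys(user for user, _ in pairs))

    # Stage 1: choose at most max_num_users test users (None means all).
    if max_num_users is None or max_num_users >= len(users):
        test_users = set(users)
    else:
        test_users = set(rng.sample(users, max_num_users))

    # Stage 2: for test users only, hold out each item with the given
    # probability (assumed independent per item); all else is training data.
    train, test = [], []
    for user, item in pairs:
        if user in test_users and rng.random() < item_test_proportion:
            test.append((user, item))
        else:
            train.append((user, item))
    return train, test


# Toy data: 50 users with 20 items each.
pairs = [(u, i) for u in range(50) for i in range(20)]
train, test = split_by_user_sketch(pairs, max_num_users=10)
```

In this sketch every (user, item) pair lands in exactly one of the two outputs, and users not chosen in the first stage appear only in the training data.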

Parameters:

dataset : SFrame

An SFrame containing (user, item) pairs.

user_id : str, optional

The name of the column in dataset that contains user ids.

item_id : str, optional

The name of the column in dataset that contains item ids.

max_num_users : int, optional

The maximum number of users to use to construct the test set. If None, all available users are used.

item_test_proportion : float, optional

The desired probability that a test user’s item will be chosen for the test set.

random_seed : int, optional

The random seed to use for randomization. If None, the random seed is different every time; if numeric, subsequent calls with the same dataset and random seed will have the same split.
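Since item_test_proportion is described as a per-item probability rather than an exact fraction, a chosen user with few items can end up with nothing in the test set. The short calculation below illustrates this, assuming (the documentation does not state it) that each item is selected independently; the helper name prob_user_missing_from_test is hypothetical.

```python
def prob_user_missing_from_test(num_items, item_test_proportion=0.2):
    """Chance that none of a chosen user's items is held out.

    Assumes each item is selected independently with the given probability.
    """
    return (1.0 - item_test_proportion) ** num_items


# Probability of an empty test contribution for users with 1, 5 and 20 items.
for num_items in (1, 5, 20):
    print(num_items, round(prob_user_missing_from_test(num_items), 4))
```

Under that assumption, with the default proportion of 0.2 a user with a single item is dropped from the test set 80% of the time, a user with 5 items about 33% of the time, and a user with 20 items about 1% of the time.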

Returns:

train, test : SFrame

A tuple with two datasets to be used for training and testing.

Examples

>>> import graphlab as gl
>>> sf = gl.SFrame('https://static.turi.com/datasets/audioscrobbler')
>>> train, test = gl.recommender.util.random_split_by_user(sf, max_num_users=100)