In the code block below, import the StackOverflow dataset SFrame that you saved during earlier exercises. Note that this data is shared courtesy of StackExchange and is under the Creative Commons Attribution-ShareAlike 3.0 Unported License. This particular version of the data set was used in a recent Kaggle competition.
import os
import graphlab

# Load the locally saved copy if it exists; otherwise download it and cache it.
if os.path.exists('stack_overflow'):
    sf = graphlab.SFrame('stack_overflow')
else:
    sf = graphlab.SFrame('https://static.turi.com/datasets/stack_overflow')
    sf.save('stack_overflow')
Question 1: Visually explore the above data using GraphLab Canvas.
In this section we will make a model that can be used to recommend new tags to users.
Question 2: Create a new column called `Tags` where each element is a list of all the tags used for that question. (Hint: Check out
sf = sf.pack_columns(column_prefix='Tag', new_column_name='Tags')
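To make the effect of packing columns concrete, here is a minimal pure-Python sketch of the same idea on two made-up rows. It does not use GraphLab: the `Tag1` to `Tag3` column names, the sample values, and the helper `pack_tag_columns` are assumptions for illustration only.

```python
def pack_tag_columns(row, prefix='Tag', new_name='Tags'):
    """Collect every column whose name starts with `prefix` into one list."""
    tag_keys = sorted(k for k in row if k.startswith(prefix))
    # Keep the non-tag columns untouched.
    packed = {k: v for k, v in row.items() if k not in tag_keys}
    # Replace the separate tag columns with a single list-valued column.
    packed[new_name] = [row[k] for k in tag_keys]
    return packed

# Hypothetical rows; the real data set has more columns.
rows = [
    {'OwnerUserId': 1, 'Tag1': 'python', 'Tag2': 'pandas', 'Tag3': ''},
    {'OwnerUserId': 2, 'Tag1': 'java', 'Tag2': '', 'Tag3': ''},
]
packed_rows = [pack_tag_columns(r) for r in rows]
print(packed_rows[0])
```

Note that unused tag slots survive as empty strings in the packed list, which is why a later step filters them out.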
Question 3: Make your SFrame only contain the `OwnerUserId` column and the `Tags` column you created in the previous step.
sf = sf[['OwnerUserId', 'Tags']]
Question 4: Use the following Python function to modify the `Tags` column so that its lists contain no empty strings.
def remove_empty(tags):
    return [tag for tag in tags if tag != '']
sf['Tags'] = sf['Tags'].apply(remove_empty)
Question 5: Create a new SFrame called `user_tag` that has a row for every (user, tag) pair.
user_tag = sf.stack(column_name='Tags', new_column_name='Tag')
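If the stacking step is unclear, the following pure-Python sketch mimics it on made-up rows: each element of a list-valued column becomes its own row. The helper `stack_tags` and the sample data are hypothetical. In this sketch a row with an empty list produces no output rows, so check the `stack` documentation for how GraphLab treats that case.

```python
def stack_tags(rows, column_name='Tags', new_column_name='Tag'):
    """Emit one output row per element of the list stored in `column_name`."""
    stacked = []
    for row in rows:
        # Columns other than the list column are copied to every output row.
        rest = {k: v for k, v in row.items() if k != column_name}
        for value in row[column_name]:
            new_row = dict(rest)
            new_row[new_column_name] = value
            stacked.append(new_row)
    return stacked

# Hypothetical input: one row per question, tags already packed into a list.
rows = [
    {'OwnerUserId': 1, 'Tags': ['python', 'pandas']},
    {'OwnerUserId': 2, 'Tags': ['java']},
]
user_tag_rows = stack_tags(rows)
for r in user_tag_rows:
    print(r)
```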
Question 6: Create a new SFrame called `user_tag_count` that has three columns:
- `OwnerUserId`
- `Tag`
- `Count`

`Count` contains the number of times the given `Tag` was used by that `OwnerUserId`. Hint: See
user_tag_count = user_tag.groupby(['OwnerUserId', 'Tag'], graphlab.aggregate.COUNT)
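The grouped count can be pictured with the standard library alone. This sketch, on made-up rows, counts how often each (user, tag) pair occurs. It illustrates the aggregation, not GraphLab's implementation, and the variable names are assumptions.

```python
from collections import Counter

# Hypothetical stacked rows: one row per (question, tag) occurrence.
user_tag_rows = [
    {'OwnerUserId': 1, 'Tag': 'python'},
    {'OwnerUserId': 1, 'Tag': 'python'},
    {'OwnerUserId': 1, 'Tag': 'pandas'},
    {'OwnerUserId': 2, 'Tag': 'java'},
]

# Group by the (user, tag) key and count the rows in each group.
pair_counts = Counter((r['OwnerUserId'], r['Tag']) for r in user_tag_rows)

# Turn the counts back into rows with three columns.
user_tag_count_rows = [
    {'OwnerUserId': user, 'Tag': tag, 'Count': n}
    for (user, tag), n in sorted(pair_counts.items())
]
for r in user_tag_count_rows:
    print(r)
```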
Question 7: Visually explore this summarized version of your data set with GraphLab Canvas.
Question 8: Use `graphlab.recommender.create()` to create a model that can be used to recommend tags to each user.
m = graphlab.recommender.create(user_tag_count, user_id='OwnerUserId', item_id='Tag')
Question 9: Print a summary of the model by simply entering the name of the object.
Question 10: Get all unique users from the first 10000 observations and save them as a variable called `users`.
users = user_tag_count.head(10000)['OwnerUserId'].unique()
Question 11: Get 20 recommendations for each user in your list of users. Save these as a new SFrame called `recs`.
recs = m.recommend(users, k=20)
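To see what "top k recommendations per user" means mechanically, here is a small sketch that ranks items and keeps the first k each user has not already used. It deliberately swaps in a plain popularity score, which is not necessarily what the GraphLab model uses. The function `recommend_popular` and the sample data are hypothetical.

```python
from collections import Counter

def recommend_popular(user_items, users, k=20):
    """Recommend the k most popular items each user has not used yet."""
    # Popularity = number of users who used the item.
    popularity = Counter(item for items in user_items.values() for item in items)
    # Sort by descending popularity; break ties alphabetically for determinism.
    ranked = sorted(popularity, key=lambda item: (-popularity[item], item))
    recs = {}
    for user in users:
        seen = user_items.get(user, set())
        recs[user] = [item for item in ranked if item not in seen][:k]
    return recs

# Hypothetical user -> set of tags used.
user_items = {
    1: {'python', 'pandas'},
    2: {'python', 'numpy'},
    3: {'java', 'python'},
    4: {'numpy'},
}
toy_recs = recommend_popular(user_items, users=[1, 4], k=2)
print(toy_recs)
```

The structure is the same whatever the scoring rule: score candidate items, drop the ones the user already has, and keep the k best.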
When people use recommendation systems for online commerce, it's often useful to be able to recommend products from a single category of items, e.g. recommending shoes to somebody who typically buys shirts.
Create a variable called
have used the
Question 13: Use the model you created above to find the 20 most similar items to the tag "python". Create a variable called `python_items` containing just these similar items.
python_items = m.get_similar_items(['python'], k=20)
python_items = python_items['similar']
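One common way to define "similar items" is Jaccard similarity between the sets of users who used each item. Whether the model created above uses this measure depends on which model `graphlab.recommender.create()` selected, so treat the following as an illustrative sketch only. The helper `jaccard_similar_items` and the data are hypothetical.

```python
def jaccard_similar_items(item_users, item, k=20):
    """Return the k items whose user sets overlap most with `item`'s user set."""
    target = item_users[item]
    scores = []
    for other, users in item_users.items():
        if other == item:
            continue
        union = target | users
        # Jaccard similarity: size of intersection over size of union.
        score = len(target & users) / float(len(union)) if union else 0.0
        scores.append((other, score))
    # Highest score first; ties broken alphabetically.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]

# Hypothetical tag -> set of users who used it.
item_users = {
    'python': {1, 2, 3},
    'pandas': {1, 2},
    'numpy': {2, 3, 4},
    'java': {5},
}
similar = jaccard_similar_items(item_users, 'python', k=2)
print(similar)
```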
For each user in
Question 15: Use GraphLab Canvas to find out the 10 most often recommended items.
python_recs.show() # Then click on the Summary tab and look at the histogram in the second column.
Question 16: Save your model to a file.
Question 17: Create a train/test split of the `user_tag_count` data from the section above.
train, test = graphlab.recommender.util.random_split_by_user(user_tag_count, user_id='OwnerUserId', item_id='Tag')
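The idea behind splitting by user is that a fraction of each user's observations is held out for testing while the rest stays in the training set, so the model still sees every test user. The sketch below is a simplified stand-in: it holds out rows for every user, and the name `split_by_user` and its defaults are assumptions. GraphLab's function may differ in details such as how many users it samples.

```python
import random

def split_by_user(rows, user_key='OwnerUserId', item_test_proportion=0.2, seed=0):
    """Hold out a fraction of each user's rows for testing."""
    rng = random.Random(seed)
    by_user = {}
    for row in rows:
        by_user.setdefault(row[user_key], []).append(row)
    train, test = [], []
    for user in sorted(by_user):
        user_rows = by_user[user][:]
        rng.shuffle(user_rows)
        n_test = int(len(user_rows) * item_test_proportion)
        test.extend(user_rows[:n_test])
        train.extend(user_rows[n_test:])
    return train, test

# Hypothetical data: two users with ten distinct tags each.
rows = [{'OwnerUserId': u, 'Tag': 't%d' % i} for u in (1, 2) for i in range(10)]
train_rows, test_rows = split_by_user(rows)
print(len(train_rows), len(test_rows))
```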
Question 18: Create a recommender model like you did above that only uses the training set.
m1 = graphlab.recommender.create(train, user_id='OwnerUserId', item_id='Tag')
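Question 19: Create a matrix factorization model that is better at ranking by setting the `unobserved_rating_regularization` argument to 1. (The call below passes this setting as `ranking_regularization`, the argument name accepted by `graphlab.ranking_factorization_recommender.create`.)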
m2 = graphlab.ranking_factorization_recommender.create(train, user_id='OwnerUserId', item_id='Tag', target='Count', ranking_regularization=1)
Question 20: Retrieve the coefficients for each user that were learned by this algorithm.
Question 21: Compare the predictive performance of the two models. Given the ability to make 10 recommendations, which model predicted the highest proportion of items in the test set (on average)?
results = graphlab.recommender.util.compare_models(test, [m1, m2], metric='precision_recall')
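When reading the comparison, it helps to know how the two metrics are defined for a single user. Precision at k is the share of the k recommendations that appear in the user's test items. Recall at k is the share of the user's test items that were recommended, which is the "proportion of items in the test set" the question asks about. A minimal sketch with made-up data follows; the helper name is an assumption.

```python
def precision_recall_at_k(recommended, relevant, k=10):
    """Compute precision and recall of the top-k recommendations for one user."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / float(k)
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall

# Hypothetical ranked recommendations and held-out test items for one user.
recommended = ['python', 'numpy', 'java', 'pandas', 'c#']
relevant = {'numpy', 'pandas', 'scipy'}
precision, recall = precision_recall_at_k(recommended, relevant, k=4)
print(precision, recall)
```

Averaging these per-user values over all test users gives one number per model for each cutoff k.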