The data for these exercises is culled from Wikipedia's Database Download. Wikipedia's text and many of its images are co-licensed under the Creative Commons Attribution-Sharealike 3.0 Unported License (CC-BY-SA).

Load the first Wikipedia text file, called "w0". Each line in the file represents a single document, and there is no header line. Name the variable documents.

# Download the data from S3 if you haven't already, and cache it locally.
import os
import graphlab

if os.path.exists('wikipedia_w0'):
    documents = graphlab.SFrame('wikipedia_w0')
else:
    documents = graphlab.SFrame.read_csv('https://static.turi.com/datasets/wikipedia/raw/w0', header=False)
    documents.save('wikipedia_w0')

Question 1:

Create an SArray that represents the documents in "bag-of-words format", where each element of the SArray is a dictionary with each unique word as a key and its number of occurrences as the value. Hint: look at the text analytics method count_words.

bow = graphlab.text_analytics.count_words(documents['X1'])

Question 2: Create a trimmed version of this dataset that excludes all words in each document that occur just once.

docs = bow.dict_trim_by_values(2)

Question 3: Remove all stopwords from the dataset. Hint: you'll find a predefined set of stopwords in stopwords.

docs = docs.dict_trim_by_keys(graphlab.text_analytics.stopwords(), exclude=True)

Question 4: Remove all documents from docs and documents that now have fewer than 10 unique words. Hint: You can use SArray's logical filter.

ix = docs.apply(lambda x: len(x.keys()) >= 10)
docs = docs[ix]
documents = documents[ix]

Question 5: What proportion of documents have we removed from the dataset?

1 - ix.mean()
Topic Modeling

Question 6: Create a topic model using your processed version of the dataset, docs. Have the model learn 30 topics and let the algorithm run for 30 iterations. Hint: use the topic modeling toolkit.

m = graphlab.topic_model.create(docs, num_topics=30, num_iterations=30)

Question 7: Print information about the model.


Question 8: Find out how many words the model has used while learning the topic model.


Use the following code to get the 10 most probable words in each topic. Typically we hope that each list is a cohesive set of words, one that represents a general theme present in the dataset.

topics = m.get_topics(num_words=10).unstack(['word','score'], new_column_name='topic_words')['topic_words'].apply(lambda x: x.keys())
for topic in topics:
    print(topic)

Question 9: Predict the topic for the first 5 documents in docs.


Sometimes it is useful to manually fix words to be associated with a particular topic. For this we can use the associations argument.

Question 10: Create a new topic model that uses the following SFrame which will associate the words "law", "court", and "business" to topic 0. Use verbose=False, 30 topics, and let the algorithm run for 20 iterations.

fixed_associations = graphlab.SFrame()
fixed_associations['word'] = ['law', 'court', 'business']
fixed_associations['topic'] = 0
m2 = graphlab.topic_model.create(docs,
                                 associations=fixed_associations,
                                 num_topics=30, verbose=False, num_iterations=20)

Question 11: Get the top 20 most likely words for topic 0. Ideally, we will see the words "law", "court", and "business". What other words appear to be related to this topic?

m2.get_topics([0], num_words=20)
Transforming word counts

Remove all the documents from docs and documents that have 0 words, then transform the remaining word counts into TF-IDF scores. Hint: use the text analytics method tf_idf.

ix = docs.apply(lambda x: len(x) > 0)
docs = docs[ix]
documents = documents[ix]

tf_idf_docs = graphlab.text_analytics.tf_idf(docs)

Question 12: Use GraphLab Canvas to explore the distribution of TF-IDF scores given to the word "year".


Question 13: Create an SFrame with the following columns:

  • id: a string column containing the range of numbers from 0 to the number of documents
  • word_score: the SArray containing TF-IDF scores you created above
  • text: the original text from each document
doc_data = graphlab.SFrame()
doc_data['id'] = graphlab.SArray(range(len(tf_idf_docs))).astype(str)
doc_data['word_score'] = tf_idf_docs
doc_data['text'] = documents['X1']

Question 14: Create a model that allows you to query the nearest neighbors to a given document. Use the id column above as your label for each document, and use the word_score column of TF-IDF scores as your features. Hint: use the new nearest_neighbors toolkit.

nn = graphlab.nearest_neighbors.create(doc_data, label='id', features=['word_score'])

Question 15: Find all the nearest documents for the first two documents in the data set.

nearest = nn.query(doc_data.head(2), label='id')

Question 16: Make an SFrame that contains the original text for the query points and the original text for each query's nearest neighbors. Hint: Use SFrame.join.

nearest_docs = nearest[['query_label', 'reference_label']]
query_text = doc_data[['id', 'text']].rename({'id': 'query_label', 'text': 'query_text'})
reference_text = doc_data[['id', 'text']].rename({'id': 'reference_label', 'text': 'reference_text'})
nearest_docs.join(query_text, on='query_label')\
            .join(reference_text, on='reference_label')\
            .sort('query_label')[['query_text', 'reference_text']]