This module provides utilities for text processing.

Note that standard SArray utilities can be used to transform text data into “bag of words” format, where a document is represented as a dictionary mapping each unique word to the number of times that word occurs in the document. See count_words() for more details. Also, see pack_columns() and unstack() for ways of creating SArrays containing dictionary types.
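To make the “bag of words” format concrete, here is a minimal sketch in plain Python. It does not use SArray or count_words(); the helper name bag_of_words and the lowercase, whitespace-based splitting are assumptions made for illustration only, and the library's own tokenization may differ.

```python
from collections import Counter

def bag_of_words(text):
    """Hypothetical helper: map each unique word to its count in `text`.

    Assumes lowercase, whitespace-delimited words; this is only an
    illustration of the format, not the library's tokenizer.
    """
    return dict(Counter(text.lower().split()))

docs = ["The quick brown fox", "the fox jumps over the lazy dog"]
bags = [bag_of_words(d) for d in docs]
```

Each element of bags is a dictionary of (word, count) pairs, one dictionary per document.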

We provide methods for learning topic models, which can be useful for modeling large document collections. See topic_model.create() for more details, as well as the How-Tos, data science Gallery, and text analysis chapter of the User Guide.

term frequency transformations

bm25 For a given query and set of documents, compute the BM25 score for each document.
tf_idf Compute the TF-IDF scores for each word in each document.
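
As a rough illustration of what a TF-IDF transformation computes, the sketch below scores bag-of-words dictionaries in plain Python. The function name sketch_tf_idf is hypothetical, and the weighting used here (raw count times log(N / document frequency)) is one common variant chosen as an assumption; the exact formula used by tf_idf may differ.

```python
import math

def sketch_tf_idf(bags):
    """Hypothetical sketch: score each word in each bag-of-words dict.

    Assumed weighting: count * log(N / df), where N is the number of
    documents and df is the number of documents containing the word.
    """
    n_docs = len(bags)
    doc_freq = {}
    for bag in bags:
        for word in bag:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return [
        {word: count * math.log(n_docs / doc_freq[word])
         for word, count in bag.items()}
        for bag in bags
    ]

bags = [{"the": 1, "fox": 2}, {"the": 3, "dog": 1}]
scores = sketch_tf_idf(bags)
```

Under this variant, a word that appears in every document (here "the") receives a score of zero, while words confined to fewer documents are weighted up.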

topic models

topic_model.create Create a topic model from the given data set.
topic_model.TopicModel TopicModel objects can be used to predict the underlying topic of a document.


count_words Convert the content of string/dict/list type SArrays to a dictionary of (word, count) pairs.
count_ngrams Return an SArray of dict type where each element contains the count for each of the n-grams that appear in the corresponding input element.
parse_sparse Parse a file that’s in libSVM format.
parse_docword Parse a file that’s in “docword” format.
random_split Utility for performing a random split for text data that is already in bag-of-words format.
stopwords Get common words that are often removed during preprocessing of text data.
tokenize Tokenize the input SArray of text strings and return the list of tokens.
trim_rare_words Remove words that occur fewer than a given number of times in an SArray.
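
The ideas behind n-gram counting and rare-word trimming can be sketched in plain Python as follows. The helper names sketch_count_ngrams and sketch_trim_rare_words are hypothetical and are not the library functions listed above; the sketch assumes word-level n-grams and assumes the threshold applies to a word's total count across the whole collection.

```python
from collections import Counter

def sketch_count_ngrams(tokens, n=2):
    """Hypothetical sketch: count word n-grams in a list of tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return dict(Counter(" ".join(gram) for gram in grams))

def sketch_trim_rare_words(bags, threshold=2):
    """Hypothetical sketch: drop words whose total count across all
    bag-of-words dicts is below `threshold` (an assumed interpretation)."""
    totals = Counter()
    for bag in bags:
        totals.update(bag)  # adds each word's count to the running total
    return [
        {word: count for word, count in bag.items()
         if totals[word] >= threshold}
        for bag in bags
    ]

bigrams = sketch_count_ngrams(["a", "b", "a", "b"], n=2)
trimmed = sketch_trim_rare_words(
    [{"the": 1, "fox": 1}, {"the": 2, "dog": 1}], threshold=2
)
```

Here bigrams holds the counts of adjacent word pairs, and trimmed keeps only the words that occur at least twice across both documents.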