Features

text2vec provides an efficient framework with a concise API for text analysis and natural language processing (NLP) in R.

This package is efficient because it is carefully written in C++, which also means that text2vec is memory friendly. Some parts, such as training GloVe word embeddings, are fully parallelized using the excellent RcppParallel package. This means that the word embeddings are computed in parallel on OS X, Linux, Windows, and Solaris (x86) without any additional tuning or tricks. Finally, a streaming API means that users do not have to load all the data into RAM.
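
Since the parallel parts are built on RcppParallel, the number of worker threads can be controlled in the usual RcppParallel way. This is optional (a sensible default is picked automatically); the call below is only a sketch of how you would set it explicitly.

library(RcppParallel)
# use four worker threads for the parallelized parts (such as GloVe training)
setThreadOptions(numThreads = 4)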

This vignette explains how to use text2vec to vectorize text on arbitrary n-grams using either a vocabulary or feature hashing. See the GloVe vignette for an explanation of how to use state-of-the-art GloVe word embeddings with this package.

Text vectorization

Most text mining and NLP modeling relies on bag-of-words or bag-of-n-grams methods. Despite their simplicity, these models usually demonstrate good performance on text categorization and classification tasks. But in spite of their theoretical simplicity and practical efficiency, building bag-of-words models involves technical challenges. This is especially the case in R because of its copy-on-modify semantics.

Text analysis pipeline

Let’s briefly review some details of a typical text analysis pipeline:

  1. The researcher usually begins by constructing a document-term matrix (DTM) from input documents. In other words, the first step is to vectorize text by creating a map from words or n-grams to a vector space.
  2. The researcher fits a model to that DTM. These models might include text classification, topic modeling, or word embedding. Fitting the model will include tuning and validating the model.
  3. Finally the researcher applies the model to new data.

In this vignette we will primarily discuss the first stage. Texts themselves can take up a lot of memory, but vectorized texts usually do not, because they are stored as sparse matrices. Because of R’s copy-on-modify semantics, it is not easy to grow a DTM iteratively. So constructing a DTM, even for a small collection of documents, can be a serious bottleneck for analysts and researchers. It involves reading the whole collection of text documents into RAM and processing it as a single vector, which can easily increase memory use by a factor of 2 to 4. The text2vec package solves this problem by providing a better way of constructing a document-term matrix.

Example: Sentiment analysis on IMDB movie review dataset

This package provides the movie_review dataset. It consists of 5000 movie reviews, each of which is marked as positive or negative.

library(text2vec)
data("movie_review")
set.seed(42L)

To represent documents in vector space, we first have to create term -> term_id mappings. We call them terms instead of words because they can be arbitrary n-grams, not just single words. We represent a set of documents as a sparse matrix, where each row corresponds to a document and each column corresponds to a term. This can be done in two ways: using the vocabulary itself or by feature hashing.
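
To make this representation concrete, here is a small hand-built example of such a sparse matrix, constructed directly with the Matrix package. The documents, terms, and counts are made up purely for illustration; text2vec builds matrices like this for you.

library(Matrix)
# rows are documents, columns are terms;
# entry [i, j] is the count of term j in document i
dtm_toy <- sparseMatrix(
  i = c(1, 1, 2, 2, 3),   # document (row) indices
  j = c(1, 2, 2, 3, 4),   # term (column) indices
  x = c(2, 1, 1, 1, 3),   # term counts
  dims = c(3, 4),
  dimnames = list(c("doc_1", "doc_2", "doc_3"),
                  c("good", "movie", "bad", "plot")))
dtm_toy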

Vocabulary based vectorization

Let’s first create a vocabulary-based DTM. Here we collect unique terms from all documents and mark each of them with a unique ID, using the create_vocabulary() function. We use an iterator to create the vocabulary.

it <- itoken(movie_review$review, 
             preprocess_function = tolower, 
             tokenizer = word_tokenizer, 
             ids = movie_review$id)

sw <- c("i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours")
vocab <- create_vocabulary(it, stopwords = sw)
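
At this point you can inspect what was collected. The exact structure of the vocabulary object depends on the text2vec version, but a quick overview is always available with str():

# look at the collected terms and their term/document counts
str(vocab, max.level = 2)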

Alternatively, if your data fits in RAM, you can create the list of tokens once and then reuse it in further steps:

# Each element of the list represents a document
tokens <- movie_review$review %>% 
  tolower() %>% 
  word_tokenizer()
it <- itoken(tokens, ids = movie_review$id)
vocab <- create_vocabulary(it, stopwords = sw)

Now that we have a vocabulary, we can construct a document-term matrix. (We could instead use create_corpus() and get_dtm()).

it <- itoken(tokens, ids = movie_review$id)
# Or
# it <- itoken(movie_review$review, tolower, word_tokenizer, ids = movie_review$id)
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(it, vectorizer)

Now we have a DTM and can check its dimensions.

str(dtm)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:706047] 4999 4999 4999 4999 4999 4998 4998 4998 4998 49..
##   ..@ p       : int [1:42642] 0 1 2 3 4 5 6 7 8 9 ...
##   ..@ Dim     : int [1:2] 5000 42641
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:5000] "5814_8" "2381_9" "7759_3" "3630_4" ...
##   .. ..$ : chr [1:42641] "decent.the" "nudity.i" "fantastic.it" "mentors" ...
##   ..@ x       : num [1:706047] 1 1 1 1 1 1 1 1 3 2 ...
##   ..@ factors : list()
identical(rownames(dtm), movie_review$id)
## [1] TRUE

As you can see, the DTM has 5000 rows, equal to the number of documents, and 42641 columns, equal to the number of unique terms.

Now we are ready to fit our first model. Here we will use the glmnet package to fit a logistic regression model with an L1 penalty.

library(glmnet)
fit <- cv.glmnet(x = dtm, y = movie_review[['sentiment']], 
                 family = 'binomial', 
                 # lasso penalty
                 alpha = 1,
                 # interested in the area under ROC curve
                 type.measure = "auc",
                 # 5-fold cross-validation
                 nfolds = 5,
                 # a higher value is less accurate, but gives faster training
                 thresh = 1e-3,
                 # again, a lower number of iterations gives faster training
                 maxit = 1e3)
plot(fit)

print(paste("max AUC =", round(max(fit$cvm), 4)))
## [1] "max AUC = 0.919"

We have successfully fit a model to our DTM.
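
To actually use the fitted model, predicted probabilities can be obtained with the standard glmnet predict() method. The sketch below scores the training DTM itself; for genuinely new documents you would first build a DTM with the same vectorizer so that the columns line up.

# predicted probability of a positive review at the lambda chosen by cross-validation
preds <- predict(fit, newx = dtm, s = "lambda.min", type = "response")
head(preds)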

Pruning vocabulary

Note, however, that the training time for our model was quite high. We can reduce it, and also significantly improve accuracy, by pruning the vocabulary.

For example, we can find the words “a”, “the”, and “in” in almost all documents, but they do not provide much useful information. These are usually called stop words. On the other hand, the corpus also contains very uncommon terms that appear in only a few documents. These terms are also not very useful, because we don’t have sufficient statistics for them. Here we will remove both very common and very unusual terms.

pruned_vocab <- prune_vocabulary(vocab, term_count_min = 10,
 doc_proportion_max = 0.5, doc_proportion_min = 0.001)
it <- itoken(tokens, ids = movie_review$id)
vectorizer <- vocab_vectorizer(pruned_vocab)
dtm <- create_dtm(it, vectorizer)
dim(dtm)
## [1] 5000 7656

Note that the new DTM has many fewer columns than the original DTM.

TF-IDF

We can (and usually should!) also apply a TF-IDF transformation to our DTM, which will increase the weight of terms that are specific to a single document or a handful of documents and decrease the weight of terms used in most documents:

dtm <- dtm %>% transform_tfidf()
## idf scaling matrix not provided, calculating it form input matrix
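
For intuition, here is a minimal hand-rolled version of this kind of weighting, built with the Matrix package on a tiny toy matrix. It uses one common TF-IDF formulation (row-normalized term frequencies scaled by log inverse document frequencies); the exact scaling applied by transform_tfidf() may differ in detail.

library(Matrix)
# a tiny count matrix standing in for a DTM (3 documents, 4 terms)
counts <- sparseMatrix(i = c(1, 1, 2, 3, 3), j = c(1, 2, 2, 3, 4),
                       x = c(2, 1, 1, 3, 1), dims = c(3, 4))
tf    <- Diagonal(x = 1 / rowSums(counts)) %*% counts   # row-normalized term frequencies
idf   <- log(nrow(counts) / colSums(counts > 0))        # inverse document frequencies
tfidf <- tf %*% Diagonal(x = idf)                       # scale each column by its idf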

Now, let’s fit our model again:

t1 <- Sys.time()
fit <- cv.glmnet(x = dtm, y = movie_review[['sentiment']], 
                 family = 'binomial', 
                 alpha = 1,
                 type.measure = "auc",
                 nfolds = 5,
                 thresh = 1e-3,
                 maxit = 1e3)
print(difftime(Sys.time(), t1, units = 'sec'))
## Time difference of 2.372514 secs
plot(fit)

print(paste("max AUC =", round(max(fit$cvm), 4)))
## [1] "max AUC = 0.9198"

We trained this model faster and obtained a slightly higher AUC.

Can we improve the model?

We can try to improve our model by using n-grams instead of words. We will use up to 3-grams:

it <- itoken(tokens, ids = movie_review$id)

vocab <- create_vocabulary(it, ngram = c(1L, 3L)) %>% 
  prune_vocabulary(term_count_min = 10, 
                   doc_proportion_max = 0.5, 
                   doc_proportion_min = 0.001)

vectorizer <- vocab_vectorizer(vocab)

dtm <- tokens %>% 
  itoken() %>% 
  create_dtm(vectorizer) %>% 
  transform_tfidf()
## idf scaling matrix not provided, calculating it form input matrix
dim(dtm)
## [1]  5000 27226
fit <- cv.glmnet(x = dtm, y = movie_review[['sentiment']], 
                 family = 'binomial', 
                 alpha = 1,
                 type.measure = "auc",
                 nfolds = 5,
                 thresh = 1e-3,
                 maxit = 1e3)

plot(fit)

print(paste("max AUC =", round(max(fit$cvm), 4)))
## [1] "max AUC = 0.9193"

In this case, using n-grams left the AUC essentially unchanged. Further tuning is left up to the reader.

Feature hashing

If you are not familiar with feature hashing (the so-called “hashing trick”), I recommend that you start with the Wikipedia article and then read the original paper by a Yahoo! research team. This technique is very fast because we don’t have to perform a lookup over an associative array. Another benefit is a very low memory footprint, since we can map an arbitrary number of features into a much more compact space. This method was popularized by Yahoo! and is widely used in Vowpal Wabbit.
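
To illustrate just the indexing idea (text2vec uses a proper fast hash function internally, not the toy one below), each term is mapped straight to a column index by hashing it modulo the table size, so no vocabulary lookup is needed.

hash_size <- 2 ^ 16
# toy hash: sum of character codes modulo the table size; purely illustrative
toy_hash <- function(term, size = hash_size) {
  (sum(utf8ToInt(term)) %% size) + 1L
}
toy_hash("movie")  # every occurrence of "movie" always lands in the same column
toy_hash("good")   # different terms usually land in different columns (collisions are possible)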

Here is how to use feature hashing in text2vec.

it <- itoken(tokens, ids = movie_review$id)

vectorizer <- hash_vectorizer(hash_size = 2 ^ 16, ngram = c(1L, 3L))
dtm <- create_dtm(it, vectorizer) %>% 
  transform_tfidf()
## idf scaling matrix not provided, calculating it form input matrix
fit <- cv.glmnet(x = dtm, y = movie_review[['sentiment']], 
                 family = 'binomial', 
                 alpha = 1,
                 type.measure = "auc",
                 nfolds = 5,
                 thresh = 1e-3,
                 maxit = 1e3)

plot(fit)

print(paste("max AUC =", round(max(fit$cvm), 4)))
## [1] "max AUC = 0.9027"

As you can see, our AUC is a bit worse, but the DTM construction time was considerably lower. On large collections of documents this can be a significant advantage.