Working with files

In this section I will show how to use text2vec to vectorize large collections of text stored in files.

Imagine we want to build a topic model with the lda package, and our collection of movie reviews is stored in multiple text files on disk.

For this vignette we will create such files from the embedded movie_review dataset:

library(text2vec)
library(magrittr)
data("movie_review")
# remove all internal EOL to simplify reading
movie_review$review <- gsub(pattern = '\n', replacement = ' ', 
                            x = movie_review$review, fixed = TRUE)
N_FILES <- 10
CHUNK_LEN <- nrow(movie_review) / N_FILES
files <- sapply(1:N_FILES, function(x) tempfile())
chunks <- split(movie_review, rep(1:N_FILES, each = CHUNK_LEN))
for (i in 1:N_FILES) {
  write.table(chunks[[i]], files[[i]], quote = TRUE, row.names = FALSE,
              col.names = TRUE, sep = '|')
}
# note what the data looks like
str(movie_review, strict.width = 'cut')
## 'data.frame':    5000 obs. of  3 variables:
##  $ id       : chr  "5814_8" "2381_9" "7759_3" "3630_4" ...
##  $ sentiment: int  1 1 0 0 1 1 0 0 0 1 ...
##  $ review   : chr  "With all this stuff going down at the moment with MJ"..
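
Each file now holds a pipe-delimited chunk of the data frame with a header row. To see what actually ended up on disk, we can peek at the first lines of one of the generated files (just a sanity check, not part of the pipeline):

# header line followed by one quoted, pipe-separated review per line
readLines(files[[1]], n = 2)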

text2vec provides functions to easily work with files.

The user needs to perform only a few steps:

  1. Construct an iterator over the files with the ifiles function.
  2. Provide a reader function to ifiles. text2vec doesn’t know anything about the underlying files; they can be plain text or some binary format.
  3. Construct a token iterator from the files iterator via the itoken function.

Let's see how it works:

library(data.table)
reader <- function(x, ...) {
  # read
  chunk <- fread(x, header = TRUE, sep = '|')
  # select column with review
  res <- chunk$review
  # assign ids to reviews
  names(res) <- chunk$id
  res
}
# create iterator over files
it_files  <- ifiles(files, reader_function = reader)
# create iterator over tokens from files iterator
it_tokens <- itoken(it_files, preprocess_function = tolower, 
                    tokenizer = word_tokenizer, progessbar = FALSE)

vocab <- create_vocabulary(it_tokens)
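
It can be useful to glance at the vocabulary we just collected. In the text2vec version used here the per-term statistics live in vocab$vocab (the same slot we pass to the LDA sampler below):

# vocab$vocab holds one row per term with its corpus statistics
str(vocab$vocab, list.len = 3)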

Now we are able to construct the DTM in lda_c format (as required by the lda package):

# need to reinitialise iterators!
# they are mutable and already empty!
# try(it_files$nextElem())
it_files  <- ifiles(files, reader_function = reader)
it_tokens <- itoken(it_files, preprocess_function = tolower, 
                    tokenizer = word_tokenizer, progessbar = FALSE)

dtm <- create_dtm(it_tokens, vectorizer = vocab_vectorizer(vocab), type = 'lda_c')
str(dtm, list.len = 5)
## List of 5000
##  $ 5814_8  : int [1:2, 1:228] 10387 1 10389 1 11265 1 12207 2 12444 1 ...
##  $ 2381_9  : int [1:2, 1:109] 10652 1 11274 1 13406 2 14679 2 15394 1 ...
##  $ 7759_3  : int [1:2, 1:253] 10389 1 10414 1 10843 1 11336 1 12193 1 ...
##  $ 3630_4  : int [1:2, 1:218] 10615 1 10652 1 10866 1 10958 1 11247 1 ...
##  $ 9495_8  : int [1:2, 1:256] 10355 1 10667 1 11247 2 13372 1 13659 1 ...
##   [list output truncated]

Note that the DTM has document ids. They are inherited from the document names we assigned in the reader function. This is a convenient way to assign document ids when working with files.
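
For example, we can quickly check that the document ids of the DTM match the ids from the original dataset (assuming movie_review is still in memory):

# every name in the lda_c list should be one of the original review ids
all(names(dtm) %in% movie_review$id)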

Now we can fit an LDA model using the lda::lda.collapsed.gibbs.sampler() function:

library(lda)
# prior for document-topic proportions
alpha <- 0.1
# prior for topic-word distributions
eta <- 0.001
# fit a model with 30 topics, run 30 Gibbs sampling iterations
lda_fit <- lda.collapsed.gibbs.sampler(documents = dtm, K = 30, 
                                       vocab = vocab$vocab$terms, 
                                       alpha = alpha, 
                                       eta = eta,
                                       num.iterations = 30, 
                                       trace = 2L)
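
After the sampler finishes we can, for example, look at the most probable words in each topic. Below is a minimal sketch that relies on the lda package's top.topic.words() helper:

# top 5 words for each of the 30 topics (columns = topics)
top_words <- top.topic.words(lda_fit$topics, num.words = 5, by.score = TRUE)
top_words[, 1:3]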

Parallel mode - using multiple cores

create_dtm, create_tcm, and create_vocabulary are able to take advantage of multicore machines, and they do so in a transparent manner. In contrast to GloVe fitting, which uses low-level thread parallelism via RcppParallel, these functions use standard high-level R parallelism on top of the foreach package. They are flexible and can use different parallel backends such as doParallel, doRedis, etc. But keep in mind that such high-level parallelism can involve significant overhead.

The user only needs to do two things manually to take advantage of a multicore machine:

  1. Register a parallel backend.
  2. Prepare splits of the input data as a list of itoken iterators.

Here is a simple example:

N_WORKERS <- 4
library(doParallel)
# register parallel backend
registerDoParallel(N_WORKERS)

#  prepare splits
# "jobs" is a list of itoken iterators!
N_SPLITS <- 4

jobs <- files %>% 
  split_into(N_SPLITS) %>% 
  lapply(ifiles, reader_function = reader) %>% 
  # it is worth setting chunks_number to 1 because we have already split the input
  lapply(itoken, chunks_number = 1, preprocess_function = tolower, 
         tokenizer = word_tokenizer, progessbar = FALSE)

# Alternatively, when the data is in memory, we can perform the split in the following way:
#
# review_chunks <- split_into(movie_review$review, N_SPLITS)
# review_ids <- split_into(movie_review$id, N_SPLITS)
#
# jobs <- Map(function(doc, ids) {
#  itoken(iterable = doc, ids = ids, preprocess_function = tolower, 
#         tokenizer = word_tokenizer, chunks_number = 1, progessbar = FALSE) 
# }, review_chunks, review_ids)

# Now all of the function calls below will benefit from a multicore machine.
# Each job will be evaluated in a separate process.

# vocabulary creation
vocab <- create_vocabulary(jobs)

# dtm vocabulary vectorization
v_vectorizer <- vocab_vectorizer(vocab)
vocab_dtm_parallel <- create_dtm(jobs, vectorizer = v_vectorizer)

# dtm hash vectorization
h_vectorizer <- hash_vectorizer()
hash_dtm_parallel <- create_dtm(jobs, vectorizer = h_vectorizer)

# co-occurrence statistics
tcm_vectorizer <- vocab_vectorizer(vocab, grow_dtm = FALSE, skip_grams_window = 5)
tcm_parallel <- create_tcm(jobs, vectorizer = tcm_vectorizer)
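
As a quick sanity check, the parallel results should have the same shapes as their single-process counterparts: one row per review in the DTMs, and one row and column per vocabulary term in the TCM:

# dimensions of the objects built in parallel mode
dim(vocab_dtm_parallel)
dim(hash_dtm_parallel)
dim(tcm_parallel)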