In this vignette I will show how to use text2vec
to vectorize a large collection of text stored in files.
Imagine we want to build a topic model with the lda package, and our collection of movie reviews is stored in multiple text files on disk.
For this vignette we will create the files from the embedded movie_review
dataset:
library(text2vec)
library(magrittr)
data("movie_review")
# remove all internal EOLs to simplify reading
movie_review$review <- gsub(pattern = '\n', replacement = ' ',
                            x = movie_review$review, fixed = TRUE)
N_FILES <- 10
CHUNK_LEN <- nrow(movie_review) / N_FILES
files <- sapply(1:N_FILES, function(x) tempfile())
chunks <- split(movie_review, rep(1:N_FILES, each = CHUNK_LEN))
for (i in 1:N_FILES) {
  write.table(chunks[[i]], files[[i]], quote = TRUE, row.names = FALSE,
              col.names = TRUE, sep = '|')
}
# Note how the data looks
str(movie_review, strict.width = 'cut')
## 'data.frame': 5000 obs. of 3 variables:
## $ id : chr "5814_8" "2381_9" "7759_3" "3630_4" ...
## $ sentiment: int 1 1 0 0 1 1 0 0 0 1 ...
## $ review : chr "With all this stuff going down at the moment with MJ"..
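To make the on-disk format concrete, we can peek at the first lines of one of the generated files (output omitted; each file is a pipe-separated table with a quoted header row):
# inspect the beginning of the first written file
readLines(files[[1]], n = 2)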
text2vec
provides functions to easily work with files.
The user needs to perform only a few steps:

1. Create an iterator over the files with the ifiles function.
2. Provide a reader function to ifiles. text2vec doesn't know anything about the underlying files; they can be in plain text or some binary format.
3. Create an iterator over tokens from the files iterator with the itoken function.

Let's see how it works:
library(data.table)
reader <- function(x, ...) {
  # read one file with pipe-separated reviews
  chunk <- fread(x, header = TRUE, sep = '|')
  # select the column with the review text
  res <- chunk$review
  # assign ids to reviews
  names(res) <- chunk$id
  res
}
# create iterator over files
it_files <- ifiles(files, reader_function = reader)
# create iterator over tokens from files iterator
it_tokens <- itoken(it_files, preprocess_function = tolower,
tokenizer = word_tokenizer, progessbar = FALSE)
vocab <- create_vocabulary(it_tokens)
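Because the reader is an ordinary R function, any file format will do. If the files were plain text with one document per line, base R's readLines would suffice as a reader. A minimal sketch (plain_text_files here is a hypothetical vector of paths to such files):
# hypothetical plain-text case: one document per line, no ids
it_plain <- ifiles(plain_text_files, reader_function = readLines)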
Now we are able to construct a DTM in lda_c
format (as required by the lda
package):
# we need to reinitialise the iterators!
# they are mutable and were already exhausted by create_vocabulary!
# try(it_files$nextElem())
it_files <- ifiles(files, reader_function = reader)
it_tokens <- itoken(it_files, preprocess_function = tolower,
tokenizer = word_tokenizer, progessbar = FALSE)
dtm <- create_dtm(it_tokens, vectorizer = vocab_vectorizer(vocab), type = 'lda_c')
str(dtm, list.len = 5)
## List of 5000
## $ 5814_8 : int [1:2, 1:228] 10387 1 10389 1 11265 1 12207 2 12444 1 ...
## $ 2381_9 : int [1:2, 1:109] 10652 1 11274 1 13406 2 14679 2 15394 1 ...
## $ 7759_3 : int [1:2, 1:253] 10389 1 10414 1 10843 1 11336 1 12193 1 ...
## $ 3630_4 : int [1:2, 1:218] 10615 1 10652 1 10866 1 10958 1 11247 1 ...
## $ 9495_8 : int [1:2, 1:256] 10355 1 10667 1 11247 2 13372 1 13659 1 ...
## [list output truncated]
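In the lda_c format each document is a 2-row integer matrix: the first row holds 0-based indices into the vocabulary and the second row holds the corresponding term counts. As a small illustrative sketch, we can decode the first document back into terms:
# map the 0-based term indices of the first document back to terms
doc_1 <- dtm[[1]]
head(vocab$vocab$terms[doc_1[1, ] + 1])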
Note that the DTM has document ids. They are inherited from the document names we assigned in the reader
function. This is a convenient way to assign document ids when working with files.
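For example, a quick sanity check that the document names really are the review ids:
# all document names in the DTM should be review ids
all(names(dtm) %in% movie_review$id)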
Now we can fit the LDA
model using the lda::lda.collapsed.gibbs.sampler()
function:
library(lda)
# prior for topics
alpha <- 0.1
# prior for words
eta <- 0.001
# fit a model with 30 topics using 30 Gibbs sampling iterations
lda_fit <- lda.collapsed.gibbs.sampler(documents = dtm, K = 30,
                                       vocab = vocab$vocab$terms,
                                       alpha = alpha,
                                       eta = eta,
                                       num.iterations = 30,
                                       trace = 2L)
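To get a feel for the fitted model, one option is to inspect the most probable words of each topic with lda::top.topic.words() (output omitted):
# show the 5 highest-scoring words per topic
top.topic.words(lda_fit$topics, num.words = 5, by.score = TRUE)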
create_dtm
, create_tcm
and create_vocabulary
are able to take advantage of multicore machines, and they do so in a transparent manner. In contrast to GloVe fitting, which uses low-level thread parallelism via RcppParallel
, these functions use standard high-level R parallelism on top of the foreach
package. They are flexible and can use different parallel backends, such as doParallel
, doRedis
, etc. But the user should keep in mind that such high-level parallelism can involve significant overhead.
The user needs to do only two things manually to take advantage of a multicore machine:

1. Register a parallel backend.
2. Prepare splits of the input data as a list
of itoken
iterators.

Here is a simple example:
N_WORKERS <- 4
library(doParallel)
# register parallel backend
registerDoParallel(N_WORKERS)
# prepare splits
# "jobs" is a list of itoken iterators!
N_SPLITS <- 4
jobs <- files %>%
  split_into(N_SPLITS) %>%
  lapply(ifiles, reader_function = reader) %>%
  # worth setting chunks_number = 1 because we have already split the input
  lapply(itoken, chunks_number = 1, preprocess_function = tolower,
         tokenizer = word_tokenizer, progessbar = FALSE)
# Alternatively, when the data is in memory, we can perform the split as follows:
#
# review_chunks <- split_into(movie_review$review, N_SPLITS)
# review_ids <- split_into(movie_review$id, N_SPLITS)
#
# jobs <- Map(function(doc, ids) {
#   itoken(iterable = doc, ids = ids, preprocess_function = tolower,
#          tokenizer = word_tokenizer, chunks_number = 1, progessbar = FALSE)
# }, review_chunks, review_ids)
# Now all function calls below will benefit from the multicore machine;
# each job will be evaluated in a separate process
# vocabulary creation
vocab <- create_vocabulary(jobs)
# dtm vocabulary vectorization
v_vectorizer <- vocab_vectorizer(vocab)
vocab_dtm_parallel <- create_dtm(jobs, vectorizer = v_vectorizer)
# dtm hash vectorization
h_vectorizer <- hash_vectorizer()
hash_dtm_parallel <- create_dtm(jobs, vectorizer = h_vectorizer)
# co-occurrence statistics
tcm_vectorizer <- vocab_vectorizer(vocab, grow_dtm = FALSE, skip_grams_window = 5)
tcm_parallel <- create_tcm(jobs, vectorizer = tcm_vectorizer)
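Finally, a quick sanity check that the parallel runs covered the full corpus, plus cleanup of the workers we registered above:
# both should cover all 5000 reviews
length(dtm) == nrow(vocab_dtm_parallel)
dim(hash_dtm_parallel)
# release the workers registered via registerDoParallel
stopImplicitCluster()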