It’s good to get data into and out of R! More compatibility, more data, more better. But which is fastest?
Of course, speed might not be your only criterion. Different data formats have different capabilities and purposes:
- serialize() is your best bet for getting R objects out of and back into R the way you had them, but doesn't do much for you when communicating with other systems.
- dump() is your next best bet, and has the advantage of being a somewhat human-readable text format.
- JSON is more widely understood by other systems, but R features like dims and class don't have equivalents in Javascript. Different packages also take different approaches in representing JSON objects in R, and vice versa.

All that said… which is fastest? Which is most efficient? Under which set of arbitrarily chosen tests? I'll give more details below, but first I'll show the results. In both graphs, lower is better.
load_cache("timings","timing.plot")
On the horizontal axis is the normalized size of the object being transmitted. On the vertical axis is the time taken to transmit the test dataset. There are four scenarios corresponding to how the dataset is transmitted or stored. All tests use the default options for each respective library.
The following graph shows the amount of storage used to encode the test dataset.
load_cache("datasize","datasize.plot")
This document is included as a static vignette, because running benchmark code would not be kind to package builders.
I’ll try picking a reference dataset and timing how long it takes to encode and decode it. I’ll start with nycflights13, which consists of five data frames and is large enough to take a measurable time to process.
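The test object is built by the benchmarking code introduced below; roughly, it is a named list of the package's five tables. (This is a sketch of its shape, not necessarily the exact construction used in inst/benchmarking.R.)
library(nycflights13)
# Roughly how the test object is shaped: a named list of the five tables.
dataset <- list(airlines = airlines, airports = airports, flights = flights,
                planes = planes, weather = weather)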
The file inst/benchmarking.R
contains code for test generation and timing.
source(system.file("benchmarking.R", package = "msgpack"))
I will test each encoder under the following scenarios:
- convert is for just converting an R object to an R raw or character object and back, in memory.
- conn also does an in-memory conversion, but through R textConnection or rawConnection objects. (The rawConnection objects seem to work faster, even with text formats.) A minimal sketch of this scenario appears below.
- file writes to a temp file and then reads it back.
- fifo forks the R process and writes from one process while reading from the other over a UNIX named pipe.
- tcp forks the R process and writes from one process to the other over a TCP socket.
- remote connects to an R process on a remote machine, and sends data from the local host to the remote over a TCP socket.

Each test records the OS-reported CPU time, as well as elapsed clock time, for reading, writing, and the total.
For fifo
, tcp
, and remote
tests, the reading and writing may happen concurrently, so that total.elapsed < read.elapsed + write.elapsed
. On the other hand, some packages cannot cope with reading from asynchronous connections (more on this in the details). In these cases I need to make sure all data is read into a buffer before calling the decoder.
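For concreteness, the conn scenario amounts to something like the following for R serialization (a simplified sketch; the real harness in inst/benchmarking.R handles the timing and the other formats):
# Write into an in-memory rawConnection, then read back from a second one.
out <- rawConnection(raw(0), "wb")
serialize(mtcars, out)
bytes <- rawConnectionValue(out)
close(out)
inp <- rawConnection(bytes, "rb")
identical(unserialize(inp), mtcars)
close(inp)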
These packages turn out to vary in performance by a couple of orders of magnitude, so the test harness has to vary the input size dynamically to avoid taking forever to run. It starts with a small subsample of the dataset and increases its size until one encode/decode round trip takes 10 seconds (or the entire dataset is transmitted). (See the function testCurve in inst/benchmarking.R.)
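Roughly, the adaptive sizing works like the sketch below. This is hypothetical and simplified: growUntilBudget and run_once are illustrative names (run_once stands in for one encode/decode round trip at a given subsample fraction), and the real testCurve may grow the size differently.
# Double the subsample fraction until one round trip exceeds the time budget,
# or the whole dataset has been transmitted.
growUntilBudget <- function(run_once, budget = 10, start = 0.001) {
  size <- start
  repeat {
    elapsed <- system.time(run_once(size))[["elapsed"]]
    if (elapsed > budget || size >= 1) return(size)
    size <- min(1, size * 2)
  }
}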
On to the per-package benchmarks.
serialize
and unserialize
produce faithful replications of R objects, including R-specific structures like closures, environments, and attributes. But generally only R code can read the result, so it is mainly useful for communicating between R processes.
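A quick illustration of that fidelity (not part of the benchmark): a closure comes back with its enclosing environment intact.
f <- local({ n <- 2; function(x) x * n })
g <- unserialize(serialize(f, NULL))
g(21)
## [1] 42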
First, timing the in-memory conversion:
unserialize.inmem <- timeConvert(dataset,
unserialize,
function(data) serialize(data, NULL))
showTimings(unserialize.inmem, "R serialization (in memory)")
R serialization (in memory) | |
---|---|
total.elapsed (s) | 0.35 |
read.cpu (s) | 0.35 |
read.elapsed (s) | 0.35 |
write.cpu (s) | 0.23 |
write.elapsed (s) | 0.23 |
data size (MB) | 46.38 |
This is reasonably speedy, too.
However, I run into a problem if I try to transfer too large an object over a fifo or socket and read it with unserialize().
unserialize.bad <- timeFifoTransfer(dataset, unserialize, serialize, catch=FALSE)
## Error in writer(data, conn): ignoring SIGPIPE signal
This appears to be because unserialize
doesn’t operate concurrently, in the sense that it can’t recover from running out of input having read only part of a message. Meanwhile, on my machine fifo()
and socketConnection()
do not seem to block even if blocking = TRUE
is set. They always wait for at least one byte, but may return fewer than requested. So unserialize
does not work easily with socket connections.
One workaround is to exhaustively read the connection before handing off the data to unserialize
. I handle that off screen in bufferBytes
in inst/benchmarking.R
.
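Roughly, such a wrapper looks like the sketch below. Here bufferAllBytes is an illustrative stand-in, not the actual bufferBytes helper, and it assumes (as described above) that each read returns at least one byte until the end of input.
# Slurp the whole connection into one raw vector, then hand it to the decoder.
bufferAllBytes <- function(decode) {
  function(con) {
    buf <- raw(0)
    repeat {
      piece <- readBin(con, what = "raw", n = 2^16)
      if (length(piece) == 0) break
      buf <- c(buf, piece)
    }
    decode(buf)
  }
}
Since unserialize also accepts a raw vector, bufferAllBytes(unserialize) would be a drop-in reader.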
unserialize.socket <- timeSocketTransfer(dataset,
bufferBytes(unserialize),
serialize)
showTimings(unserialize.socket,
c("R serialization (over TCP, same host, blocking read)"))
R serialization (over TCP, same host, blocking read) | |
---|---|
total.elapsed (s) | 1.22 |
read.cpu (s) | 1.15 |
read.elapsed (s) | 1.22 |
write.cpu (s) | 0.27 |
write.elapsed (s) | 0.84 |
data size (MB) | NA |
But this strategy only works for transmitting one object per connection, and for one connection at a time.
Another way you can do it is to wrap R serialization with msgpack::msgConnection
. This only adds a few bytes to each message, and msgpack will handle assembling complete messages. This also allows you to send several values per connection, or to poll several connections until one of them returns a decoded message.
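For instance, wrapping an already-serialized payload as one msgpack message adds only the framing bytes (a quick check, assuming raw vectors map to msgpack's bin type):
payload <- serialize(mtcars, NULL)
framed <- msgpack::packMsg(payload)   # the raw payload becomes a single message
length(framed) - length(payload)      # just a few bytes of framing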
unserialize.wrapped_socket <- timeSocketTransfer(dataset,
unserialize,
serialize,
wrap = msgpack::msgConnection)
showTimings(unserialize.wrapped_socket,
"R serialization over msgpack over TCP (same host)")
R serialization over msgpack over TCP (same host) | |
---|---|
total.elapsed (s) | 0.39 |
read.cpu (s) | 0.39 |
read.elapsed (s) | 0.39 |
write.cpu (s) | 0.25 |
write.elapsed (s) | 0.39 |
data size (MB) | NA |
Surprisingly, this actually works faster.
. One hypothesis that might account for this is that there are fewer write syscalls if the object is serialized into memory before sending, and fewer read syscalls if the object is prepended with its length before reading.
Now I’ll start collecting benchmarks systematically. Since the methods we will explore vary so widely in performance, we will test them with successively larger datasets until they exceed a timeout. The serialize.spec
data structure annotated below specifies how to test R serialization and how to label the results. See the code file inst/benchmarking.R
for more details.
# A spec is a named list of test factors.
# A test factor is a named list of arguments that will be passed to the
# test function at each factor level.
# (Actually go look at the simpler test spec for msgpack before trying to
# grok this)
serialize.spec <- combine_opts( # options of the same name concatenated
list(
method = list( # the factor "method"
convert = list(method = timeConvert)), # has a level "convert"
# that passes timeConvert
# to the "method"
# argument of the test
# function
encoder = list( # the factor "encoder"
serialize = list( # has a level named "serialize"
writer = serialize, # with these arguments specifying, reader, writer, to, from
from = unserialize,
to = function(data) serialize(data, NULL)))),
# Combined with these common options (that include connection-based
# test methods
buffer_read_options(reader = unserialize, raw = TRUE)
)
# arg_df takes the above spec and takes its outer product, generating
# a data frame with case labels and arguments used
serialize.calls <- arg_df(serialize.spec)
serialize.timings <- run_tests(serialize.calls)
test.plot <- (ggplot(
filter(benchmarks, encoder=="serialize"))
+ aes(x = size, y = total.elapsed, color = method)
+ geom_line()
+ scale_x_continuous(limits = c(0, 1), name = "Fraction of nycflights13 transmitted")
+ scale_y_continuous(limits = c(0, NA), name = "Elapsed time (s)")
)
test.plot + labs(title="R serialization performance")
Here the horizontal axis is the size of the test dataset, and the vertical axis is the time taken to transmit and receive.
R’s implementation of fifo connections seems to be consistently slower than other methods on the test system (OS X). The next slowest, in the case of R serialization, is the case where data is transmitted over a gigabit link (if it matters, the reading computer is an Ubuntu box that is slower than the writing computer).
The dput
and deparse
functions render R data objects to an ASCII connection in R-like syntax. The idea is that the text output “looks like” the code it takes to construct the object, to the extent that the mechanism for reading objects back in is to eval
or source
the text. (Hopefully one does not do this with untrusted data. A better technique may be to evaluate the data in a limited environment that just contains the needed constructors like structure
, list
and c
etc.)
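Here is a rough sketch of that idea (hypothetical; it is not what the benchmark uses): expose only the constructors the deparsed text needs, in an environment whose parent is the empty environment, and evaluate there.
txt <- deparse(list(a = 1:3, b = c("x", "y")))
# Nothing outside these few constructors is reachable from safe_env.
safe_env <- new.env(parent = emptyenv())
for (f in c("list", "c", "structure", ":")) {
  assign(f, get(f, baseenv()), envir = safe_env)
}
eval(parse(text = txt), envir = safe_env)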
## Annoyingly, the behavior of "dump" and "dput" depend on this global option.
options(deparse.max.lines = NULL)
#the output mimics the input
dput(control=c(),
list(1, "2", verb=quote(buckle), my=c("s", "h", "o", "e")))
## list(1, "2", verb = buckle, my = c("s", "h", "o", "e"))
(Note how dput fails to transmit language objects faithfully: if we eval the output above, R will try to evaluate buckle instead of giving us back the name object as.name("buckle").)
Performance-wise, dput
and source
should not be used for large datasets, because they display an O(n^2) characteristic in terms of the data size.
options(deparse.max.lines = NULL)
dput.timings <- run_tests(arg_df(c(
all.common.options,
list(
encoder = list(
dput = list(
from = function(t) eval(parse(text=t)),
to = deparse,
reader = function(c) eval(parse(c)),
writer = function(x, c) dput(x, file=c)))))))
(test.plot + labs(title="dput performance")) %+% filter(benchmarks, encoder == "dput")
It’s interesting that deparse (which is used for the in-memory conversion test method labeled convert) is much slower than the connection-based dput.
The jsonlite
package includes a fromJSON and toJSON implementation. It also supports streaming reads and writes, but only for records consisting of one data frame per message. Data frames are sent row-wise.
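In other words, stream_out writes newline-delimited JSON, with one object per data frame row (a quick illustration):
# Each row of the data frame becomes one JSON object on its own line (NDJSON).
jsonlite::stream_out(head(iris, 2), con = stdout(), verbose = FALSE)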
Since the test dataset consists of several data frames, I will send one after the other over the lifetime of one connection.
jsonlite_reader <- function(conn) {
append <- msgpack:::catenator()
jsonlite::stream_in(conn, verbose = FALSE, handler = function(x) append(list(x)))
append(action="read")
}
jsonlite_writer <- function(l, conn) {
lapply(l, function(x) jsonlite::stream_out(x, verbose=FALSE, conn))
}
jsonlite.spec <- c(
all.common.options,
list(
encoder = list(
jsonlite = list(
to = jsonlite::toJSON,
from = jsonlite::fromJSON,
reader = jsonlite_reader,
writer = jsonlite_writer,
raw = FALSE))))
jsonlite.timings <- run_tests(arg_df(jsonlite.spec))
jsonlite performs reasonably well, but is several times slower than serialization.
benchmarks <- store(jsonlite.timings, benchmarks)
(test.plot + labs(title="jsonlite performance")) %+% filter(benchmarks, encoder == "jsonlite")
msgpack
is the package this vignette is written for.
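For orientation, the basic in-memory round trip uses packMsg and unpackMsg; the connection-oriented readMsg, writeMsg, and msgConnection appear in the test specs below.
bytes <- msgpack::packMsg(list(a = 1:3, b = "hello"))  # returns a raw vector
msgpack::unpackMsg(bytes)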
msgpack_remote.spec <- c(
common.options,
list(method = list(remote = list(method=timeRemoteTransfer)),
encoder = list(
msgpack = list(
reader = msgpack::readMsg,
writer = msgpack::writeMsg,
wrap = msgpack::msgConnection))))
msgpack_remote.timings <- run_tests(arg_df(msgpack_remote.spec))
(test.plot + labs(title="msgpack performance (remote)")) %+% msgpack_remote.timings
msgpack.spec <- c(
all.common.options,
list(
encoder = list(
msgpack = list(
wrap = msgpack::msgConnection,
reader = msgpack::readMsg,
writer = msgpack::writeMsg,
to = msgpack::packMsg,
from = msgpack::unpackMsg))))
msgpack.timings <- run_tests(arg_df(msgpack.spec))
benchmarks <- store(msgpack.timings, benchmarks)
(test.plot + labs(title="msgpack performance")) %+% filter(benchmarks, encoder == "msgpack")
Implementing the streaming-mode callbacks has helped a lot, but there is still a quadratic characteristic going on here; I need to do some profiling of memory allocation.
Interestingly, msgpack is already faster than serialize for the remote use case (modulo some network glitches affecting one or two datapoints.)
There is an older pure-R implementation of msgpack on CRAN. One quirk is that it doesn’t accept NA
in R vectors.
msgpackR::pack(c(1, 2, 3))
## [1] 93 01 02 03
msgpackR::pack(c(1, 2, 3, NA))
## Error in if (floor(data) == data) {: missing value where TRUE/FALSE needed
As a workaround I’ll substitute out all the NA values in the dataset.
dataset_mungenull <- map(dataset, map_dfc,
function(col) ifelse(is.na(col), 9999, col))
msgpackR.spec <- c(
all.common.options %but% list(
dataset = list(nycflights13 = list(data = dataset_mungenull)),
note = list("no NAs" = list())),
list(
encoder = list(
msgpackR = list(
from = msgpackR::unpack,
to = msgpackR::pack,
reader = bufferBytes(msgpackR::unpack),
writer = function(data, conn) writeBin(msgpackR::pack(data), conn)))))
msgpackR.timings <- run_tests(arg_df(msgpackR.spec))
(test.plot + labs(title="msgpackR performance")) %+% filter(benchmarks, encoder == "msgpackR")
Performance-wise, it is quite slow, but a deeper concern is that if I send larger objects and inspect the results, they sometimes look garbled, and there are intermittent errors like:
str(subsample(dataset_mungenull, .000044) %>% (msgpackR::pack) %>% (msgpackR::unpack))
## Error in result[[i]] <- .unpack_data(): attempt to select less than one element in integerOneIndex
rjson.spec <- combine_opts(
list(
method = list(
convert = list(method = timeConvert)),
encoder = list(
rjson = list(
from = rjson::fromJSON,
to = rjson::toJSON,
writer = function(data, con) writeChar(rjson::toJSON(data), con)))),
buffer_read_options(reader = function(con) rjson::fromJSON(file=con),
raw = TRUE,
buffer = bufferRawConn))
rjson.timings <- run_tests(arg_df(rjson.spec))
rjson
is the fastest JSON implementation tested: only 10 times slower than msgpack in memory, and 2.7 times slower across the wire. It does not support streaming reads, so we must byte-buffer when reading from connections. But that turns out to be quite fast as well.
benchmarks <- store(rjson.timings, benchmarks)
(test.plot + labs(title="rjson performance")) %+% filter(benchmarks, encoder == "rjson")
I am getting the following intermittent error from RJSONIO (i.e. this document fails to render once in a while):
Error in RJSONIO::readJSONStream(con) : failed to parse json at 10240
RJSONIO.spec <- c(
all.common.options,
list(
encoder = list(
RJSONIO = list(
from = RJSONIO::fromJSON,
to = RJSONIO::toJSON,
writer = function(x, con) writeBin(RJSONIO::toJSON(x), con),
reader = RJSONIO::readJSONStream,
raw = TRUE))))
RJSONIO.timings <- run_tests(arg_df(RJSONIO.spec))
RJSONIO offers a function for streaming reads from a connection, but it has a lot of overhead compared with in-memory conversion.
(test.plot + labs(title="RJSONIO performance")) %+% filter(benchmarks, encoder == "RJSONIO")
YAML is kind of “JSON, but more like Markdown.” Allegedly easier to read but with a more complex grammar. It’s popular for config files.
cat(yaml::as.yaml(list(compact=TRUE, schema = 0)))
## compact: yes
## schema: 0.0e+00
Unfortunately, the YAML package produces a protection stack overflow when decoding too large a message.
oops <- yaml::yaml.load(yaml::as.yaml(subsample(dataset, 0.15)))
## Error: protect(): protection stack overflow
yaml.timings <- run_tests(arg_df(c(
all.common.options,
list(
note = list("Max 0.14" = list(max = 0.14)),
encoder = list(
yaml = list(
reader = yaml::yaml.load_file,
writer = function(data, conn) writeChar(yaml::as.yaml(data), conn),
to = yaml::as.yaml,
from = yaml::yaml.load,
raw = TRUE))))))
benchmarks <- store(yaml.timings, benchmarks)
(test.plot + labs(title="yaml performance")) %+% filter(benchmarks, encoder == "yaml")
We’re going here, aren’t we? Let’s see if we can send messages with CSV. To send several CSV tables over one connection, I’ll prepend to each message a header saying how many rows follow.
writeCsvs <- function(data, con) {
for (nm in names(data)) {
write.csv(as_tibble(list(name = nm, nrows = nrow(data[[nm]]))), con)
write.csv(data[[nm]], con, row.names = FALSE)
}
}
readCsvs <- function(con) {
  # Read a one-row header (table name, row count), then that many data rows,
  # repeating until the connection is exhausted.
  output <- list()
  tryCatch(
    repeat {
      header <- read.csv(con, nrows = 1, stringsAsFactors = FALSE)
      output[[header$name]] <- read.csv(con, nrows = header$nrows, stringsAsFactors = FALSE)
    },
    error = force)
  output
}
I have just had a sinking feeling that a lot of people actually build web services that talk in CSV.
Unfortunately read.table
can’t cope with non-blocking connections, so I have to buffer the read into memory when reading from a fifo or socket.
csv.timings <- run_tests(
arg_df(c(
list(encoder = list(csv = list(writer = writeCsvs))),
buffer_read_options(reader = readCsvs, raw = TRUE))))
benchmarks <- store(csv.timings, benchmarks)
(test.plot + labs(title="R csv performance")) %+% filter(benchmarks, encoder == "csv")
CSV turns out to be surprisingly fast at writing to files.
The most important scenarios are conversion in memory, writing to a file, writing to another process on the same host, and writing to a remote host.
library(directlabels)
timing.plot <- ( benchmarks
%>% subset(method %in% c("convert", "socket", "remote", "file"))
%>% mutate(method = c(convert="Conversion in memory", socket="TCP (same host)",
remote="TCP (over LAN)", file = "File I/O")[method])
%>% mutate(encoder = factor(encoder, levels=encoder_order))
%>% { (ggplot(.)
+ aes(y = total.elapsed, x = size, color = encoder, group = encoder)
+ facet_wrap(~method)
+ geom_point()
+ geom_line()
+ theme(aspect.ratio=1)
+ scale_x_continuous(name = "Fraction of nycflights13 transmitted")
+ scale_y_continuous(limits = c(0, NA), name = "Elapsed time (s)")
+ labs(title="Elapsed time to encode, write, read, decode, by package")
+ expand_limits(x=1.5, y=30)
+ geom_dl(aes(label=encoder),
debug=FALSE,
method = list(
"last.points",
rot=45,
"bumpup",
dl.trans(x=x+0.2, y=y+0.2)))
+ guides(color=FALSE)
)})
timing.plot
# We didn't have time to test each encoder with the full
# dataset, so we'll extrapolate from the largest data set tested
needed <- (benchmarks
%>% filter(!is.na(bytes))
%>% select(., encoder, method, size, bytes)
%>% group_by(encoder)
%>% filter(row_number() == 1)
%>% arrange(desc(size), .by_group=TRUE)
%>% slice(1)
%>% select(encoder, method)
)
model <- ( needed
%>% inner_join(benchmarks)
%>% select(bytes, encoder, size)
%>% lm(formula = bytes ~ encoder * size))
predicted <- (needed
%>% mutate(size = 1)
%>% augment(model, newdata=.)
%>% rename(bytes = .fitted)
)
datasize.plot <- (predicted
%>% mutate(encoder = factor(encoder, levels=encoder_order))
%>% ggplot
%>% +aes(x = reorder(encoder, bytes), y=bytes/2^20, fill=encoder)
%>% +geom_col()
%>% +labs(y = "MiB", x = NULL,
title = "Space used to encode nycflights13 (shorter is better)")
%>% +geom_dl(aes(label=encoder, group=encoder),
method=c(dl.trans(y=y+0.1), "top.bumpup"))
%>% +guides(fill = FALSE)
%>% +theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()
)
)
datasize.plot
The differences in size between the various JSON and msgpack implementations could use some investigation. They may come down to whitespace, sending data row-wise vs. column-wise, differences in the mapping between R types and each format's types, or bugs.
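As a quick illustration of one of those factors (not part of the benchmark), jsonlite can lay a data frame out row-wise or column-wise, and the two encodings differ in size:
df <- head(mtcars, 3)
nchar(jsonlite::toJSON(df))                          # row-wise (default): keys repeated per row
nchar(jsonlite::toJSON(df, dataframe = "columns"))   # column-wise: each key appears once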
save(benchmarks, file="../inst/benchmarks.RData")