Which Data Format is Fastest Data Format?

It’s good to get data into and out of R! More compatibility, more data, more better. But which is fastest?

Of course, speed might not be your only criterion; different data formats have different capabilities and purposes.

All that said…. Which is fastest? Which is most efficient? Under which set of arbitrarily chosen tests? I’ll give more details below, but first I’ll show the results. In both graphs, lower is better.

load_cache("timings","timing.plot")

On the horizontal axis is the normalized size of the object being transmitted; on the vertical axis, the time taken to transmit the test dataset. There are four scenarios, corresponding to how the dataset is transmitted or stored. All tests use the default options for each respective library.

The following graph shows the amount of storage used to encode the test dataset.

load_cache("datasize","datasize.plot")

Benchmarking Data Export / Import

This document is included as a static vignette, because running benchmark code would not be kind to package builders.

I’ll try picking a reference dataset and timing how long it takes to encode and decode it. I’ll start with the dataset nycflights13, consisting of five data frames, since it’s large enough to take a measurable time to process.

The file inst/benchmarking.R contains code for test generation and timing.

source(system.file("benchmarking.R", package = "msgpack"))

I will test each encoder under the following scenarios: conversion in memory, writing to and reading from a file, a fifo between two processes on the same host, a TCP socket between two processes on the same host, and a TCP socket to a remote host.

Each test will record the OS-reported CPU time, as well as elapsed wall-clock time, for reading, for writing, and in total.
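For reference, a minimal sketch of how such a measurement can be taken in base R (the real harness in inst/benchmarking.R is more elaborate):

# proc.time() reports OS-level CPU usage; "elapsed" is wall-clock time.
before <- proc.time()
encoded <- serialize(dataset, NULL)
delta <- proc.time() - before
delta[["user.self"]] + delta[["sys.self"]]  # write.cpu, roughly
delta[["elapsed"]]                          # write.elapsed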

For fifo, tcp, and remote tests, the reading and writing may happen concurrently, so that total.elapsed < read.elapsed + write.elapsed. On the other hand, some packages cannot cope with reading from asynchronous connections (more on this in the details). In these cases I need to make sure all data is read into a buffer before calling the decoder.

These packages turn out to vary in performance by a couple of orders of magnitude, so the test harness has to dynamically vary the input size to avoid taking forever to run. It starts with a small subsample of the dataset and increases its size until encoding and decoding take 10 seconds (or the entire dataset is transmitted). (See the function testCurve in inst/benchmarking.R.)
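In outline, the sizing loop works something like this (a simplified, hypothetical sketch; the real logic lives in testCurve):

# Double the subsample until one round trip exceeds the 10-second budget
fraction <- 0.001
repeat {
  timing <- timeConvert(subsample(dataset, fraction),
                        unserialize,
                        function(data) serialize(data, NULL))
  if (timing$total.elapsed > 10 || fraction >= 1) break
  fraction <- min(1, fraction * 2)
}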

On to the per-package benchmarks.

R serialization

serialize and unserialize produce faithful replications of R objects, including R-specific structures like closures, environments, and attributes. But generally only R can read the result, so it is most useful for communicating between R processes.
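For example, even a closure survives the round trip with its enclosing environment:

f <- local({ n <- 0; function() n <<- n + 1 })
g <- unserialize(serialize(f, NULL))
g(); g()
environment(g)$n  # 2: the copy carries its own environment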

Timing a round trip in memory:

unserialize.inmem <- timeConvert(dataset,
                                 unserialize,
                                 function(data) serialize(data, NULL))
showTimings(unserialize.inmem, "R serialization (in memory)")
R serialization (in memory)
total.elapsed (s)   0.35
read.cpu (s)        0.35
read.elapsed (s)    0.35
write.cpu (s)       0.23
write.elapsed (s)   0.23
data size (MB)     46.38

This is reasonably speedy.

However, I run into a problem if I try to transfer too large an object over a fifo or socket and read it with unserialize().

unserialize.bad <- timeFifoTransfer(dataset, unserialize, serialize, catch=FALSE)
## Error in writer(data, conn): ignoring SIGPIPE signal

This appears to be because unserialize doesn’t operate concurrently, in the sense that it doesn’t recover from reaching the end of the available input having read only part of a message. Meanwhile, on my machine fifo() and socketConnection() do not seem to block even when blocking = TRUE is set; they always wait for at least one byte, but may return fewer than requested. So unserialize does not work easily with socket connections.

One workaround is to exhaustively read the connection before handing the data off to unserialize. I handle that off screen with bufferBytes in inst/benchmarking.R.
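The idea is roughly this (a hypothetical sketch; the real bufferBytes is more careful):

bufferAll <- function(decode) {
  function(conn) {
    chunks <- list()
    repeat {
      chunk <- readBin(conn, "raw", n = 65536L)
      if (length(chunk) == 0) break  # the writer has closed the connection
      chunks[[length(chunks) + 1L]] <- chunk
    }
    decode(unlist(chunks))  # decode only once all the bytes are in hand
  }
}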

unserialize.socket <- timeSocketTransfer(dataset,
                                         bufferBytes(unserialize),
                                         serialize)
showTimings(unserialize.socket,
            "R serialization (over TCP, same host, blocking read)")
R serialization (over TCP, same host, blocking read)
total.elapsed (s)   1.22
read.cpu (s)        1.15
read.elapsed (s)    1.22
write.cpu (s)       0.27
write.elapsed (s)   0.84
data size (MB)        NA

But this strategy only works for transmitting one object per connection, and for one connection at a time.

Another approach is to wrap R serialization in msgpack::msgConnection. This adds only a few bytes to each message, and msgpack handles assembling complete messages. It also allows you to send several values per connection, or to poll several connections until one of them returns a decoded message.
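For illustration, a sketch mirroring the benchmark below, with a temp file standing in for the socket (this assumes msgConnection will wrap any readable or writable R connection):

tmp <- tempfile()
out <- msgpack::msgConnection(file(tmp, "wb"))
serialize(mtcars, out)   # each serialized value becomes one framed message
serialize(letters, out)
close(out)
inp <- msgpack::msgConnection(file(tmp, "rb"))
identical(unserialize(inp), mtcars)   # TRUE
identical(unserialize(inp), letters)  # TRUE
close(inp)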

unserialize.wrapped_socket <- timeSocketTransfer(dataset,
                                                 unserialize,
                                                 serialize,
                                                 wrap = msgpack::msgConnection)
showTimings(unserialize.wrapped_socket,
            "R serialization over msgpack over TCP (same host)")
R serialization over msgpack over TCP (same host)
total.elapsed (s)   0.39
read.cpu (s)        0.39
read.elapsed (s)    0.39
write.cpu (s)       0.25
write.elapsed (s)   0.39
data size (MB)        NA

Surprisingly, this is actually faster than the buffered read. One hypothesis that might account for this is that there are fewer write syscalls if the object is serialized into memory before sending, and fewer read syscalls if the object is prepended with its length before reading.

Now I’ll start collecting benchmarks systematically. Since the methods we will explore vary so widely in performance, we will test them with successively larger datasets until they exceed a timeout. The serialize.spec data structure annotated below specifies how to test R serialization and how to label the results. See inst/benchmarking.R for more details.

# A spec is a named list of test factors. A test factor is a named
# list of arguments that will be passed to the test function at each
# factor level. (It may help to look at the simpler msgpack spec
# below before trying to grok this one.)
serialize.spec <- combine_opts( # options of the same name are concatenated
  list(
    # the factor "method" has a level "convert" that passes timeConvert
    # as the "method" argument of the test function
    method = list(
      convert = list(method = timeConvert)),
    # the factor "encoder" has a level named "serialize", with arguments
    # specifying the writer, from, and to functions (the reader comes
    # from the common options)
    encoder = list(
      serialize = list(
        writer = serialize,
        from = unserialize,
        to = function(data) serialize(data, NULL)))),
  # combined with these common options (which include the
  # connection-based test methods)
  buffer_read_options(reader = unserialize, raw = TRUE)
)
# arg_df takes the outer product of the spec above, generating a data
# frame with case labels and the arguments used for each case
serialize.calls <- arg_df(serialize.spec)
serialize.timings <- run_tests(serialize.calls)
benchmarks <- store(serialize.timings, benchmarks)
test.plot <- (ggplot(filter(benchmarks, encoder == "serialize"))
  + aes(x = size, y = total.elapsed, color = method)
  + geom_line()
  + scale_x_continuous(limits = c(0, 1),
                       name = "Fraction of nycflights13 transmitted")
  + scale_y_continuous(limits = c(0, NA), name = "Elapsed time (s)")
)
test.plot + labs(title = "R serialization performance")

Here the horizontal axis is the size of the test dataset, and the vertical axis is the time taken to transmit and receive.

R’s implementation of fifo connections seems to be consistently slower than the other methods on the test system (OS X). The next slowest, in the case of R serialization, is the test where data is transmitted over a gigabit link (if it matters, the reading computer is an Ubuntu box that is slower than the writing computer).

Dput/source

The dput and deparse functions render R data objects to an ASCII connection in R-like syntax. The idea is that the text output “looks like” the code it would take to construct the object, to the extent that the mechanism for reading objects back in is to eval or source the text. (Hopefully one does not do this with untrusted data. A better technique may be to evaluate the data in a limited environment that contains only the needed constructors like structure, list, and c.)
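A minimal sketch of that limited-environment idea, assuming the data needs only a handful of constructors:

# eval deparsed data with only whitelisted constructors visible,
# so stray code can't call anything else
safe_env <- new.env(parent = emptyenv())
for (fn in c("structure", "list", "c"))
  assign(fn, get(fn, baseenv()), envir = safe_env)
txt <- deparse(list(a = 1, b = c("x", "y")))
eval(parse(text = txt), envir = safe_env)  # any other call would fail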

## Annoyingly, the behavior of "dump" and "dput" depends on this global option.
options(deparse.max.lines = NULL)
# the output mimics the input
dput(control = c(),
     list(1, "2", verb = quote(buckle), my = c("s", "h", "o", "e")))
## list(1, "2", verb = buckle, my = c("s", "h", "o", "e"))

(Note how dput fails to transmit language objects: if we eval the above, R will try to evaluate buckle instead of producing the name object as.name("buckle").)
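To see the failure concretely:

txt <- deparse(list(verb = quote(buckle)))
txt
## "list(verb = buckle)"
try(eval(parse(text = txt)))
## Error: object 'buckle' not found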

Performance-wise, dput and source should not be used for large datasets, because they display an O(n^2) characteristic in terms of the data size.

options(deparse.max.lines = NULL)
dput.timings <- run_tests(arg_df(c(
  all.common.options,
  list(
    encoder = list(
      dput = list(
        from = function(t) eval(parse(text = t)),
        to = deparse,
        reader = function(c) eval(parse(file = c)),
        writer = function(x, c) dput(x, file = c)))))))
benchmarks <- store(dput.timings, benchmarks)
(test.plot + labs(title="dput performance")) %+% filter(benchmarks, encoder == "dput")

It’s interesting that deparse (which is used for the in-memory conversion test method labeled convert) is much slower than the connection-based dput.

jsonlite

The jsonlite package includes fromJSON and toJSON implementations. It also supports streaming reads and writes, though only of streams of records, one data frame per message; data frames are sent row-wise.
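A quick sketch of the streaming format (assuming nycflights13 is installed): each line of output is one JSON object, that is, one row.

tc <- textConnection("out", "w")
jsonlite::stream_out(head(nycflights13::flights, 2), tc, verbose = FALSE)
close(tc)
out  # two lines of newline-delimited JSON, one object per row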

Since the test dataset consists of several data frames, I will send one after the other over the lifetime of one connection.

jsonlite_reader <- function(conn) {
  append <- msgpack:::catenator()
  jsonlite::stream_in(conn, verbose = FALSE,
                      handler = function(x) append(list(x)))
  append(action = "read")
}
jsonlite_writer <- function(l, conn) {
  lapply(l, function(x) jsonlite::stream_out(x, conn, verbose = FALSE))
}
jsonlite.spec <- c(
  all.common.options,
  list(
    encoder = list(
      jsonlite = list(
        to = jsonlite::toJSON,
        from = jsonlite::fromJSON,
        reader = jsonlite_reader,
        writer = jsonlite_writer,
        raw = FALSE))))
jsonlite.timings <- run_tests(arg_df(jsonlite.spec))

jsonlite performs reasonably well, but is several times slower than serialization.

benchmarks <- store(jsonlite.timings, benchmarks)
(test.plot + labs(title="jsonlite performance")) %+% filter(benchmarks, encoder == "jsonlite")

Msgpack

msgpack is the package this vignette is written for.

msgpack_remote.spec <- c(
  common.options,
  list(
    method = list(remote = list(method = timeRemoteTransfer)),
    encoder = list(
      msgpack = list(
        reader = msgpack::readMsg,
        writer = msgpack::writeMsg,
        wrap = msgpack::msgConnection))))
msgpack_remote.timings <- run_tests(arg_df(msgpack_remote.spec))
(test.plot + labs(title = "msgpack")) %+% msgpack_remote.timings

msgpack.spec <- c(
  all.common.options,
  list(
    encoder = list(
      msgpack = list(
        wrap = msgpack::msgConnection,
        reader = msgpack::readMsg,
        writer = msgpack::writeMsg,
        to = msgpack::packMsg,
        from = msgpack::unpackMsg))))
msgpack.timings <- run_tests(arg_df(msgpack.spec))
benchmarks <- store(msgpack.timings, benchmarks)
(test.plot + labs(title="msgpack performance")) %+% filter(benchmarks, encoder == "msgpack")

Implementing the streaming-mode callbacks has helped a lot, but there is still a quadratic characteristic going on here; I need to do some profiling of memory allocation.

Interestingly, msgpack is already faster than serialize for the remote use case (modulo some network glitches affecting one or two data points).

msgpackR

There is an older pure-R implementation of msgpack on CRAN. One quirk is that it doesn’t accept NA in R vectors.

msgpackR::pack(c(1, 2, 3))
## [1] 93 01 02 03
msgpackR::pack(c(1, 2, 3, NA))
## Error in if (floor(data) == data) {: missing value where TRUE/FALSE needed

As a workaround I’ll substitute out all the NA values in the dataset.

dataset_mungenull <- map(dataset, map_dfc,
                         function(col) ifelse(is.na(col), 9999, col))
msgpackR.spec <- c(
  all.common.options %but% list(
    dataset = list(nycflights13 = list(data = dataset_mungenull)),
    note = list("no NAs" = list())),
  list(
    encoder = list(
      msgpackR = list(
        from = msgpackR::unpack,
        to = msgpackR::pack,
        reader = bufferBytes(msgpackR::unpack),
        writer = function(data, conn) writeBin(msgpackR::pack(data), conn)))))
msgpackR.timings <- run_tests(arg_df(msgpackR.spec))
benchmarks <- store(msgpackR.timings, benchmarks)
(test.plot + labs(title="msgpackR performance")) %+% filter(benchmarks, encoder == "msgpackR")

Performance-wise, it is quite slow, but a deeper concern is that when I send larger objects and inspect the results, the output sometimes looks garbled, and there are intermittent errors like:

str(subsample(dataset_mungenull, .000044) %>% (msgpackR::pack) %>% (msgpackR::unpack))
## Error in result[[i]] <- .unpack_data(): attempt to select less than one element in integerOneIndex

rjson

rjson.spec <- combine_opts(
  list(
    method = list(
      convert = list(method = timeConvert)),
    encoder = list(
      rjson = list(
        from = rjson::fromJSON,
        to = rjson::toJSON,
        writer = function(data, con) writeChar(rjson::toJSON(data), con)))),
  buffer_read_options(reader = function(con) rjson::fromJSON(file = con),
                      raw = TRUE,
                      buffer = bufferRawConn))
rjson.timings <- run_tests(arg_df(rjson.spec))

rjson is the fastest JSON implementation here: only 10 times slower than msgpack in memory, and 2.7 times slower across the wire. It does not support streaming reads, so we must byte-buffer to read from connections, but that turns out to be quite fast as well.
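A small sketch of what byte-buffering implies: rjson only parses complete documents, so the whole text must be in hand, in a buffer or a file, before parsing.

tmp <- tempfile()
writeChar(rjson::toJSON(list(a = 1, b = c("x", "y"))), tmp)
rjson::fromJSON(file = tmp)  # parses the whole document in one go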

benchmarks <- store(rjson.timings, benchmarks)
(test.plot + labs(title="rjson performance")) %+% filter(benchmarks, encoder == "rjson")

RJSONIO

I am getting the following intermittent error from RJSONIO (that is, this document fails to render once in a while):

Error in RJSONIO::readJSONStream(con) : failed to parse json at 10240

RJSONIO.spec <- c(
  all.common.options,
  list(
    encoder = list(
      RJSONIO = list(
        from = RJSONIO::fromJSON,
        to = RJSONIO::toJSON,
        writer = function(x, con) writeBin(RJSONIO::toJSON(x), con),
        reader = RJSONIO::readJSONStream,
        raw = TRUE))))
RJSONIO.timings <- run_tests(arg_df(RJSONIO.spec))
benchmarks <- store(RJSONIO.timings, benchmarks)

RJSONIO offers a function to do streaming reads from a connection, but it has considerable overhead compared with in-memory conversion.

(test.plot + labs(title="RJSONIO performance")) %+% filter(benchmarks, encoder == "RJSONIO")

YAML

YAML is kind of “JSON, but more like Markdown.” Allegedly easier to read but with a more complex grammar. It’s popular for config files.

cat(yaml::as.yaml(list(compact=TRUE, schema = 0)))
## compact: yes
## schema: 0.0e+00

Unfortunately, the yaml package produces a protection stack overflow when decoding too large a message.

oops <- yaml::yaml.load(yaml::as.yaml(subsample(dataset, 0.15)))
## Error: protect(): protection stack overflow
yaml.timings <- run_tests(arg_df(c(
  all.common.options,
  list(
    note = list("Max 0.14" = list(max = 0.14)),
    encoder = list(
      yaml = list(
        reader = yaml::yaml.load_file,
        writer = function(data, conn) writeChar(yaml::as.yaml(data), conn),
        to = yaml::as.yaml,
        from = yaml::yaml.load,
        raw = TRUE))))))
benchmarks <- store(yaml.timings, benchmarks)
(test.plot + labs(title="yaml performance")) %+% filter(benchmarks, encoder == "yaml")

write.csv

We’re going there, aren’t we. Let’s see if we can send messages with CSV. To send several CSV tables over one connection, I’ll prepend to each table a header saying how many rows follow.

writeCsvs <- function(data, con) {
  for (nm in names(data)) {
    # each table is preceded by a one-row header giving its name and length
    write.csv(as_tibble(list(name = nm, nrows = nrow(data[[nm]]))), con)
    write.csv(data[[nm]], con, row.names = FALSE)
  }
}
readCsvs <- function(con) {
  output <- list()
  tryCatch(
    repeat {
      header <- read.csv(con, nrows = 1, stringsAsFactors = FALSE)
      output[[header$name]] <- read.csv(con, nrows = header$nrows,
                                        stringsAsFactors = FALSE)
    },
    error = force)  # end of input terminates the loop
  output
}

I have just had a sinking feeling that a lot of people actually build web services that talk in CSV.

Unfortunately read.table can’t cope with non-blocking connections, so I have to buffer the read into memory when reading from a fifo or socket.

csv.timings <- run_tests(
  arg_df(c(
    list(encoder = list(csv = list(writer = writeCsvs))),
    buffer_read_options(reader = readCsvs, raw = TRUE))))
benchmarks <- store(csv.timings, benchmarks)
(test.plot + labs(title="R csv performance")) %+% filter(benchmarks, encoder == "csv")

It is surprisingly fast at writing to files.

Comparison of Encoder Speed

The most important scenarios are conversion in memory, writing to a file, writing to another process on the same host, and writing to a remote host.

library(directlabels)
timing.plot <- (benchmarks
  %>% subset(method %in% c("convert", "socket", "remote", "file"))
  %>% mutate(method = c(convert = "Conversion in memory",
                        socket = "TCP (same host)",
                        remote = "TCP (over LAN)",
                        file = "File I/O")[method])
  %>% mutate(encoder = factor(encoder, levels = encoder_order))
  %>% { (ggplot(.)
         + aes(y = total.elapsed, x = size, color = encoder, group = encoder)
         + facet_wrap(~method)
         + geom_point()
         + geom_line()
         + theme(aspect.ratio = 1)
         + scale_x_continuous(name = "Fraction of nycflights13 transmitted")
         + scale_y_continuous(limits = c(0, NA), name = "Elapsed time (s)")
         + labs(title = "Elapsed time to encode, write, read, decode, by package")
         + expand_limits(x = 1.5, y = 30)
         + geom_dl(aes(label = encoder),
                   debug = FALSE,
                   method = list("last.points",
                                 rot = 45,
                                 "bumpup",
                                 dl.trans(x = x + 0.2, y = y + 0.2)))
         + guides(color = FALSE)
  )})
timing.plot

Comparison of Encoder Data Usage

# We didn't have time to test each encoder with the full dataset, so
# we'll extrapolate from the largest subsample each encoder handled
needed <- (benchmarks
  %>% filter(!is.na(bytes))
  %>% select(encoder, method, size, bytes)
  %>% group_by(encoder)
  %>% arrange(desc(size), .by_group = TRUE)
  %>% slice(1)
  %>% select(encoder, method)
)
model <- (needed
  %>% inner_join(benchmarks)
  %>% select(bytes, encoder, size)
  %>% lm(formula = bytes ~ encoder * size))
predicted <- (needed
  %>% mutate(size = 1)
  %>% augment(model, newdata = .)
  %>% rename(bytes = .fitted)
)
datasize.plot <- (predicted
  %>% mutate(encoder = factor(encoder, levels = encoder_order))
  %>% ggplot
  %>% +aes(x = reorder(encoder, bytes), y = bytes / 2^20, fill = encoder)
  %>% +geom_col()
  %>% +labs(y = "MiB", x = NULL,
            title = "Space used to encode nycflights13 (shorter is better)")
  %>% +geom_dl(aes(label = encoder, group = encoder),
               method = c(dl.trans(y = y + 0.1), "top.bumpup"))
  %>% +guides(fill = FALSE)
  %>% +theme(axis.title.x = element_blank(),
             axis.text.x = element_blank(),
             axis.ticks.x = element_blank())
)
datasize.plot

The difference in size between the different JSON and msgpack implementations could use some investigation. It may come down to whitespace, sending data row-wise vs. column-wise, differences in the mapping between R types and data-format types, or bugs.
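For instance, the whitespace and row-versus-column hypotheses can be checked directly with jsonlite (a sketch, assuming nycflights13 is installed):

df <- head(nycflights13::flights, 1000)
nchar(jsonlite::toJSON(df, dataframe = "rows"))                 # row-wise
nchar(jsonlite::toJSON(df, dataframe = "columns"))              # column-wise
nchar(jsonlite::toJSON(df, dataframe = "rows", pretty = TRUE))  # + whitespace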

save(benchmarks, file="../inst/benchmarks.RData")