This package contains a flexible framework for extending the pipe into a loop. The basic idea is this: I often run into the problem of wanting to access an unnamed intermediate in a pipe. Why? A basic strategy of working with data frames is to focus on a certain aspect of the data frame, make some changes, and then reincorporate these changes into the original data frame. This work-flow is best understood through illustration.
This tutorial assumes familiarity with Hadley Wickham’s dplyr and magrittr packages. If you don’t know what I’m talking about, go look them up. Your life is about to get a whole lot easier.
Import useful libraries for chaining, knitr for table output, and, of course, loopr.
library(loopr)
library(dplyr)
library(magrittr)
library(knitr)
Define our loop object.
loop = loop$new()
Set up an extremely simple data frame for illustration.
id = c(1, 2, 3, 4)
toFix = c(0, 0, 1, 1)
group = c(1, 1, 1, 0)
example = data_frame(id, toFix, group)
kable(example)
id | toFix | group |
---|---|---|
1 | 0 | 1 |
2 | 0 | 1 |
3 | 1 | 1 |
4 | 1 | 0 |
loopr relies on a stack framework. Let’s initialize one.
stack = stack$new()
We can push data onto the stack like this.
stack$push(1)
## [1] 1
stack$push(2)
## [1] 2
stack$push(3)
## [1] 3
We can peek at the top of the stack:
stack$peek
## [1] 3
or at the whole thing.
stack$stack
## $bottom
## NULL
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 2
##
## [[4]]
## [1] 3
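We can find the height of the stack as well. Judging by the output, the height counts the NULL bottom entry, so three pushed items give a height of 4: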
stack$height
## [1] 4
We can also pop items off the stack:
stack$pop
## [1] 3
stack$pop
## [1] 2
stack$pop
## [1] 1
Now the stack is empty.
stack$stack
## $bottom
## NULL
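For readers who want to see the idea without loopr, here is a minimal base R sketch of a last-in, first-out stack. The name new_stack and its closure-based design are assumptions made for illustration, not loopr's implementation. Unlike the stack above, it has no bottom entry, so its height counts only the pushed items.

```r
# Hypothetical closure-based stack; NOT loopr's actual implementation.
new_stack <- function() {
  items <- list()  # private storage, the last element is the top

  push <- function(x) {
    items <<- c(items, list(x))
    invisible(x)
  }
  peek <- function() {
    stopifnot(length(items) > 0)  # peeking at an empty stack is an error
    items[[length(items)]]
  }
  pop <- function() {
    top <- peek()
    items <<- items[-length(items)]  # drop the top element
    top
  }
  height <- function() length(items)

  list(push = push, peek = peek, pop = pop, height = height)
}

s <- new_stack()
s$push(1)
s$push(2)
s$push(3)
top <- s$peek()          # 3, the stack is unchanged
popped <- s$pop()        # 3, the stack now holds 1 and 2
remaining <- s$height()  # 2
```

The order matters for what follows: whatever was pushed last is the first thing to come back.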
Why is this important? A loop object inherits from stack. The begin method is simply a copy of push. After the loop begins, you can focus on any part of your data while still having access to the original data.
"first" %>%
loop$begin()
## [1] "first"
To end the loop, you need to merge the data at the beginning of the loop with the data at the end. There are two ending methods defined in loopr: end and cross. Ending the loop takes a function, pops the top item off the loop stack and passes it as the first argument to that function, and passes its own first argument (or chained argument) as the second.
"second" %>%
loop$end(paste)
## [1] "first second"
cross is nearly identical, but the order of the arguments is reversed.
"first" %>%
loop$begin()
## [1] "first"
"second" %>%
loop$cross(paste)
## [1] "second first"
This is much easier to explain in code than in words.
end(endData, FUN, ...) = FUN(stack$pop, endData, ...)
cross(crossData, FUN, ...) = FUN(crossData, stack$pop, ...)
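The two definitions above can be made runnable in base R. This is only a sketch: loop_begin, loop_end, loop_cross and the list-backed loop_stack are hypothetical stand-ins for the loop object's methods, not loopr's own code.

```r
# Hypothetical stand-ins for loop$begin, loop$end and loop$cross,
# backed by a plain list; NOT loopr's actual implementation.
loop_stack <- list()

loop_begin <- function(data) {
  loop_stack <<- c(loop_stack, list(data))  # remember the starting data
  data                                      # pass it along unchanged
}

loop_pop <- function() {
  top <- loop_stack[[length(loop_stack)]]
  loop_stack <<- loop_stack[-length(loop_stack)]
  top
}

# end: the stored data comes first, the current data second.
loop_end <- function(endData, FUN, ...) FUN(loop_pop(), endData, ...)

# cross: the current data comes first, the stored data second.
loop_cross <- function(crossData, FUN, ...) FUN(crossData, loop_pop(), ...)

loop_begin("first")
first_then_second <- loop_end("second", paste)    # "first second"

loop_begin("first")
second_then_first <- loop_cross("second", paste)  # "second first"
```

Both calls pop exactly one item, so the stack is empty again after each begin/end pair.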
There are two useful ending functions that are included in this package: insert and amend. Why are special ending functions needed? In general, traditional join functions are not well suited to the focus-modify-restore work-flow. We need insert and amend to prioritize information in modified data over information in the original data.
insert is the slightly simpler case. Let’s use our example data again.
Create a set of data to insert.
insertData =
example %>%
filter(toFix == 0) %>%
mutate(toFix = 1) %>%
select(-group)
Now let’s insert it back into the original data.
insert(example, insertData, by = "id")
## Source: local data frame [4 x 3]
##
## id toFix group
## 1 1 1 NA
## 2 2 1 NA
## 3 3 1 1
## 4 4 1 0
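What happened? Where the by variables matched, insert excised the matching rows from example and inserted the rows from insertData in their place. Because insertData has no group column, group is NA in the inserted rows. At the end, the data was sorted by the by variable.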
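The behaviour just described can be approximated in base R. The function insert_sketch below is a hypothetical illustration that handles a single key column only. It is an assumption about what insert does, reconstructed from the output above, not the package's source.

```r
# Hypothetical base R approximation of insert(); single key column only.
insert_sketch <- function(data, insertData, by) {
  # Drop rows of `data` whose key also appears in `insertData`.
  kept <- data[!(data[[by]] %in% insertData[[by]]), , drop = FALSE]

  # Columns that `insertData` lacks become NA so the rows can be stacked.
  for (col in setdiff(names(data), names(insertData))) {
    insertData[[col]] <- NA
  }
  combined <- rbind(kept, insertData[names(data)])

  # Sort by the key and reset the row names.
  combined <- combined[order(combined[[by]]), , drop = FALSE]
  rownames(combined) <- NULL
  combined
}

example_df <- data.frame(id = c(1, 2, 3, 4),
                         toFix = c(0, 0, 1, 1),
                         group = c(1, 1, 1, 0))
insert_df <- data.frame(id = c(1, 2), toFix = c(1, 1))

inserted <- insert_sketch(example_df, insert_df, by = "id")
```

With the same inputs as the example above, the first two rows come entirely from insert_df, so their group value is NA.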
Let’s take a look at the slightly more complicated ending function: amend. To understand amend, we first need to understand the underlying column update function.
amendColumns updates an old set of columns with all non-NA values from a matching new set of columns.
oldColumn = c(0, 0)
newColumn = c(1, NA)
data_frame(oldColumn, newColumn) %>%
amendColumns("oldColumn", "newColumn")
## Source: local data frame [2 x 1]
##
## oldColumn
## 1 1
## 2 0
There is also a matching function called fillColumns. In this function, NA values in newColumn are replaced with values from oldColumn, but nothing else is changed.
oldColumn = c(0, 0)
newColumn = c(1, NA)
data_frame(oldColumn, newColumn) %>%
fillColumns("newColumn", "oldColumn")
## Source: local data frame [2 x 1]
##
## newColumn
## 1 1
## 2 0
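Judging only by the two outputs above, amendColumns and fillColumns compute the same values and differ in which column name survives. The base R sketch below makes that explicit. The function names are hypothetical and each handles a single pair of columns, which is a simplification.

```r
# Hypothetical single-column versions; NOT the package's implementation.

# Keep the old column's name, overwrite it wherever the new column is non-NA.
amend_columns_sketch <- function(data, old, new) {
  data[[old]] <- ifelse(is.na(data[[new]]), data[[old]], data[[new]])
  data[[new]] <- NULL
  data
}

# Keep the new column's name, fill its NA values from the old column.
fill_columns_sketch <- function(data, new, old) {
  data[[new]] <- ifelse(is.na(data[[new]]), data[[old]], data[[new]])
  data[[old]] <- NULL
  data
}

columns <- data.frame(oldColumn = c(0, 0), newColumn = c(1, NA))

amended <- amend_columns_sketch(columns, "oldColumn", "newColumn")
filled <- fill_columns_sketch(columns, "newColumn", "oldColumn")
```

Here amended has a single column named oldColumn and filled has a single column named newColumn, both holding 1 and 0.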
amend is simply dplyr::full_join followed by amendColumns to overwrite non-key columns from the original dataset with matching-named columns from the new dataset. In this case, toFix from amendData overwrites toFix from example. amendData has no group column, so group is left unchanged.
amendData = insertData
amend(example, amendData, by = "id")
## Amending columns: toFix
## Source: local data frame [4 x 3]
##
## id toFix group
## 1 1 1 1
## 2 2 1 1
## 3 3 1 1
## 4 4 1 0
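The join-then-overwrite idea can be sketched with base R's merge in place of dplyr::full_join. Everything here is an assumption made for illustration: the name amend_sketch, the default suffix ".new", and the restriction to columns without name clashes.

```r
# Hypothetical base R approximation of amend(); NOT the package's code.
amend_sketch <- function(data, amendData, by, suffix = ".new") {
  # Non-key columns present in both data sets are the ones to amend.
  shared <- setdiff(intersect(names(data), names(amendData)), by)

  # Full outer join; columns from amendData get the suffix.
  merged <- merge(data, amendData, by = by, all = TRUE,
                  suffixes = c("", suffix))

  for (col in shared) {
    new_col <- paste0(col, suffix)
    # Non-NA new values win; otherwise keep the original value.
    merged[[col]] <- ifelse(is.na(merged[[new_col]]),
                            merged[[col]], merged[[new_col]])
    merged[[new_col]] <- NULL
  }
  merged
}

example_df <- data.frame(id = c(1, 2, 3, 4),
                         toFix = c(0, 0, 1, 1),
                         group = c(1, 1, 1, 0))
amend_df <- data.frame(id = c(1, 2), toFix = c(1, 1))

amended_df <- amend_sketch(example_df, amend_df, by = "id")
```

Unlike the insert result, group keeps its original values, because only the columns that both data sets share are touched.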
If by is not supplied, it defaults to the grouping variables in data.
amendData = insertData
example %<>% group_by(id)
amend(example, amendData)
## Amending columns: toFix
## Source: local data frame [4 x 3]
##
## id toFix group
## 1 1 1 1
## 2 2 1 1
## 3 3 1 1
## 4 4 1 0
A warning: amend internally uses the suffix "toFix". If this suffix is already used in your data, modify the suffix argument.
Now that we understand how it works, let’s use our loop!
Remind ourselves of what the example data looks like.
kable(example)
id | toFix | group |
---|---|---|
1 | 0 | 1 |
2 | 0 | 1 |
3 | 1 | 1 |
4 | 1 | 0 |
Here, we convert the toFix value in the first row from a 0 to a 1.
example %>%
ungroup %>%
loop$begin() %>%
slice(1) %>%
mutate(toFix = 1) %>%
loop$end(insert, by = "id") %>%
kable
id | toFix | group |
---|---|---|
1 | 1 | 1 |
2 | 0 | 1 |
3 | 1 | 1 |
4 | 1 | 0 |
In general, insert is best suited to filter/slice type operations.
Here, we summarize toFix in each of the two groups, reverse the results, and then reintegrate the summary into the original data.
example %>%
group_by(group) %>%
loop$begin() %>%
summarize(toFix = mean(toFix)) %>%
mutate(group = rev(group)) %>%
loop$end(amend) %>%
kable
## Amending columns: toFix
group | id | toFix |
---|---|---|
0 | 4 | 0.3333333 |
1 | 1 | 1.0000000 |
1 | 2 | 1.0000000 |
1 | 3 | 1.0000000 |
In general, amend is best suited to summarize/do type operations.
This is only the tip of the iceberg. Do not feel limited to using amend and insert as ending functions. A whole host of others could be useful: join functions, merge functions, even setNames. Loops within loops are in fact quite possible, but I would be cautious using them. It can be exhilarating, but make sure to indent each loop carefully.
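To see why nesting works, recall that the stack is last-in, first-out: the innermost loop is always the first to end. The base R sketch below uses hypothetical loop_begin and loop_end helpers as stand-ins for the loop object. It illustrates the ordering and does not use loopr itself.

```r
# Hypothetical stand-ins for loop$begin and loop$end; NOT loopr's code.
loop_stack <- list()

loop_begin <- function(data) {
  loop_stack <<- c(loop_stack, list(data))
  data
}

loop_end <- function(endData, FUN, ...) {
  top <- loop_stack[[length(loop_stack)]]
  loop_stack <<- loop_stack[-length(loop_stack)]
  FUN(top, endData, ...)
}

# Indentation mirrors the nesting depth.
loop_begin("outer")
  loop_begin("inner")
  inner_result <- loop_end("innermost", paste)  # pairs with "inner"
nested <- loop_end(inner_result, paste)         # pairs with "outer"
```

The inner loop_end consumes "inner" and the outer one consumes "outer", so nested is "outer inner innermost". A forgotten ending call would leave a stale item on the stack for the next loop to pick up, which is a good reason for the careful indentation.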