Assigning after dplyr
Hadley Wickham’s dplyr and tidyr packages completely changed the way I do data manipulation/munging in R. These packages make it possible to write shorter, faster, more legible, easier-to-interpret code to accomplish the sorts of manipulations that you have to do with practically any real-world data analysis. The legibility and interpretability benefits come from
- using functions that are simple verbs that do exactly what they say (e.g., filter, summarize, group_by) and
- chaining multiple operations together, through the pipe operator %>% from the magrittr package.
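As a minimal sketch of what chaining buys you, here is the same toy calculation written as nested calls and as a pipe. The vector x is made up for illustration and has nothing to do with the dataset used later.

```r
library(magrittr)

# Toy input, invented for this illustration
x <- c(1, 4, 9)

# Nested calls have to be read from the inside out
nested <- sum(sqrt(x))

# The piped version reads left to right: take x, then sqrt, then sum
piped <- x %>% sqrt() %>% sum()

# Both forms compute the same thing
identical(nested, piped)
```

The two forms are interchangeable; the piped one just puts the steps in the order they happen.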
Chaining is particularly nice because it makes the code read like a story. For example, here’s the code to calculate sample means for the baseline covariates in a little experimental dataset I’ve been working with recently:
library(dplyr)
dat <- read.csv("http://jepusto.com/data/Mineo_2009_data.csv")
dat %>%
  group_by(Condition) %>%
  select(Age, starts_with("Baseline")) %>%
  summarise_each(funs(mean)) ->
  baseline_means
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
Each line of the code is a different action: first group the data by Condition, then select the relevant variables, then summarise each of the variables with its sample mean in each group. The results are stored in a dataset called baseline_means.
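The warning printed above says that funs() is deprecated. In dplyr 1.0.0 and later, one way to write the same kind of grouped summary is with across(). The sketch below is only an illustration: the data frame toy and its values are invented stand-ins, not the experimental data.

```r
library(dplyr)

# Hypothetical stand-in data with made-up values
toy <- data.frame(
  Condition = c("A", "A", "B", "B"),
  Age = c(100, 120, 130, 150),
  Baseline.Gaze = c(90, 94, 100, 104),
  Other = c(1, 2, 3, 4)
)

# Group, then take the mean of Age and every column starting with "Baseline"
toy_means <- toy %>%
  group_by(Condition) %>%
  summarise(across(c(Age, starts_with("Baseline")), mean))

toy_means
```

Here across() does the column selection and the summarising in one step, so no separate select() is needed.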
As I’ve gotten familiar with dplyr, I’ve adopted the style of using the backwards assignment operator (->) to store the results of a chain of manipulations. This is perhaps a little bit odd, since in all the rest of my code I stick with the forward assignment operator (<-) with the object name on the left, but the alternative is to break the “flow” of the story, effectively putting the punchline before the end of the joke. Consider:
baseline_means <- dat %>%
  group_by(Condition) %>%
  select(Age, starts_with("Baseline")) %>%
  summarise_each(funs(mean))
## Adding missing grouping variables: `Condition`
That’s just confusing to me. So backward assignment operator it is.
Assigning as a verb
My only problem with this convention is that, with complicated chains of manipulations, I often find that I need to tweak the order of the verbs in the chain. For example, I might want to summarize all of the variables, and only then select which ones to store:
dat %>%
  group_by(Condition) %>%
  summarise_each(funs(mean)) %>%
  select(Age, starts_with("Baseline")) ->
  baseline_means
## Warning in mean.default(Expressive.Language): argument is not numeric or
## logical: returning NA
## Warning in mean.default(Expressive.Language): argument is not numeric or
## logical: returning NA
## Warning in mean.default(Expressive.Language): argument is not numeric or
## logical: returning NA
In revising the code, it’s necessary to change the symbols at the end of the second and third steps, which is a minor hassle. It’s possible to do it by very carefully cutting-and-pasting the end of the second step through everything but the -> after the third step, but that’s a delicate operation, prone to error if you’re programming after hours or after beer. Wouldn’t it be nice if every step in the chain ended with %>% so that you could move around whole lines of code without worrying about the bit at the end?
Here’s one crude way to end each link in the chain with a pipe:
dat %>%
  group_by(Condition) %>%
  select(Age, starts_with("Baseline")) %>%
  summarise_each(funs(mean)) %>%
  identity() -> baseline_means
## Adding missing grouping variables: `Condition`
But this is still pretty ugly: it’s got an extra function call that’s not a verb, and the name of the resulting object is tucked away in the middle of a line. What I need is a verb to take the results of a chain of operations and assign them to an object. Base R has a suitable candidate here: the assign function. How about the following?
dat %>%
  group_by(Condition) %>%
  select(Age, starts_with("Baseline")) %>%
  summarise_each(funs(mean)) %>%
  assign("baseline_means_new", .)
## Adding missing grouping variables: `Condition`
exists("baseline_means_new")
## [1] FALSE
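This doesn’t work because of some subtlety with the environment into which baseline_means_new is assigned: by default, assign creates the object in the environment it is called from, and inside a chain that is evidently not the environment I’m working in. A brute-force fix would be to specify that the assign should be into the global environment. This will probably work 90%+ of the time, but it’s still not terribly elegant.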
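For concreteness, the brute-force version might look like the sketch below. It uses a toy vector rather than the real data, and the object name piped_total is made up for the example.

```r
library(magrittr)

# Toy pipeline: sum three numbers, then store the result globally.
# Naming the target environment explicitly means the result no longer
# depends on where the pipe happens to evaluate assign().
c(1, 2, 3) %>%
  sum() %>%
  assign("piped_total", ., envir = globalenv())

exists("piped_total", envir = globalenv())
```

The drawback is the one noted above: inside a function, this still writes to the global environment rather than to the function’s own environment.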
Here’s a function that searches the call stack to find the most recent invocation of itself that does not involve non-standard evaluation, then assigns to its parent environment:
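put <- function(x, name, where = NULL) {
  if (is.null(where)) {
    # Find calls to put() on the stack, skipping any whose text contains "put(."
    sys_calls <- sys.calls()
    put_calls <- grepl("\\<put\\(", sys_calls) & !grepl("\\<put\\(\\.", sys_calls)
    # Assign into the frame that precedes the most recent matching call
    where <- sys.frame(max(which(put_calls)) - 1)
  }
  assign(name, value = x, pos = where)
}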
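The two base R functions doing the work here are sys.calls() and sys.frame(). Here is a minimal sketch of how they behave in a simpler setting, using two hypothetical functions (outer_fn and inner_fn) that are not part of put:

```r
inner_fn <- function() {
  # sys.nframe() is this call's position on the stack,
  # so one less is the frame of the function that called us directly
  caller_index <- sys.nframe() - 1
  caller_frame <- sys.frame(caller_index)
  list(
    # Text of the call sitting at that position on the stack
    caller_call = deparse(sys.calls()[[caller_index]]),
    # A variable that exists only inside the caller's frame
    value = get("local_value", envir = caller_frame, inherits = FALSE)
  )
}

outer_fn <- function() {
  local_value <- "set in outer_fn"
  inner_fn()
}

result <- outer_fn()
result
```

put applies the same idea, except that it picks the frame by searching the text of the calls for its own name instead of simply stepping back one level.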
Here are my quick tests that this function is assigning to the right environment:
put(dat, "dat1")
dat %>% put("dat2")
f <- function(dat, name) {
  put(dat, "dat3")
  dat %>% put("dat4")
  put(dat, name)
  c(exists("dat3"), exists("dat4"), exists(name))
}
f(dat, "dat5")
## [1] TRUE TRUE TRUE
grep("dat",ls(), value = TRUE)
## [1] "dat" "dat1" "dat2"
This appears to work even if you’ve got multiple nested calls to put:
put(f(dat, "dat6"), "dat7")
grep("dat",ls(), value = TRUE)
## [1] "dat" "dat1" "dat2" "dat7"
dat7
## [1] TRUE TRUE TRUE
f(dat, "dat8") %>% put("dat9")
grep("dat",ls(), value = TRUE)
## [1] "dat" "dat1" "dat2" "dat7" "dat9"
dat9
## [1] TRUE TRUE TRUE
It works! (I think…)
To be consistent with the style of dplyr, let me also tweak the function to allow name to be the unquoted object name:
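put <- function(x, name, where = NULL) {
  # Capture the unquoted name as a string, e.g. baseline_means -> "baseline_means"
  name_string <- deparse(substitute(name))
  if (is.null(where)) {
    # Find calls to put() on the stack, skipping any whose text contains "put(."
    sys_calls <- sys.calls()
    put_calls <- grepl("\\<put\\(", sys_calls) & !grepl("\\<put\\(\\.", sys_calls)
    # Assign into the frame that precedes the most recent matching call
    where <- sys.frame(max(which(put_calls)) - 1)
  }
  assign(name_string, value = x, pos = where)
}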
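The unquoted-name trick comes from deparse(substitute()). A minimal sketch in isolation, using a hypothetical helper called name_of that exists only for this illustration:

```r
# substitute() captures the expression the caller typed, without evaluating it;
# deparse() turns that expression into a character string
name_of <- function(name) {
  deparse(substitute(name))
}

# No object called baseline_means needs to exist for this to work
name_of(baseline_means)
```

The revised put uses the same trick to turn its name argument into the string that assign expects.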
Returning to my original chain of manipulations, here’s how it looks with the new function:
dat %>%
  group_by(Condition) %>%
  select(Age, starts_with("Baseline")) %>%
  summarise_each(funs(mean)) %>%
  put(baseline_means_new)
## Adding missing grouping variables: `Condition`
print(baseline_means_new)
## # A tibble: 3 x 4
##   Condition   Age Baseline.Gaze Baseline.Vocalizations
##   <fct>     <dbl>         <dbl>                  <dbl>
## 1 OtherVR    122.          91.9                   2.86
## 2 SelfVid    121.         102.                    1.86
## 3 SelfVR     139.          95.5                   1.43
If you’ve been following along, let me know what you think of this. Is it a good idea, or is it dangerous? Are there cases where this will break? Can you think of a better name?