Assigning after dplyr

Hadley Wickham’s dplyr and tidyr packages completely changed the way I do data manipulation/munging in R. These packages make it possible to write shorter, faster, more legible, easier-to-interpret code to accomplish the sorts of manipulations that you have to do with practically any real-world data analysis. The legibility and interpretability benefits come from

  • using functions that are simple verbs that do exactly what they say (e.g., filter, summarize, group_by) and
  • chaining multiple operations together, through the pipe operator %>% from the magrittr package.

Chaining is particularly nice because it makes the code read like a story. For example, here’s the code to calculate sample means for the baseline covariates in a little experimental dataset I’ve been working with recently:

library(dplyr)
dat <- read.csv("http://jepusto.com/data/Mineo_2009_data.csv")

dat %>%
  group_by(Condition) %>%
  select(Age, starts_with("Baseline")) %>%
  summarise_each(funs(mean)) ->
  baseline_means
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.

Each line of the code is a different action: first group the data by Condition, then select the relevant variables, then summarise each of the variables with its sample mean in each group. The results are stored in a dataset called baseline_means.
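The funs() warning above hints that the same chain can be written without the deprecated helpers. Here is a minimal sketch, assuming dplyr 1.0.0 or later (where across() is available). It uses the built-in mtcars data as a stand-in for the experimental dataset: cyl plays the role of Condition, and toy_means is a name made up for illustration.

```r
library(dplyr)

# mtcars stands in for the experimental data; cyl stands in for Condition
mtcars %>%
  group_by(cyl) %>%
  select(mpg, starts_with("d")) %>%          # keeps mpg, disp, drat (plus cyl)
  summarise(across(everything(), mean)) ->   # mean of each non-grouping column
  toy_means

toy_means
```

The structure of the story is unchanged; only the last verb differs from the original chain.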

As I’ve gotten familiar with dplyr, I’ve adopted the style of using the backwards assignment operator (->, which R’s documentation calls rightwards assignment) to store the results of a chain of manipulations. This is perhaps a little bit odd, since in all the rest of my code I stick with the usual assignment operator (<-) with the object name on the left. But the alternative is to break the “flow” of the story, effectively putting the punchline before the end of the joke. Consider:

baseline_means <- dat %>%
  group_by(Condition) %>%
  select(Age, starts_with("Baseline")) %>%
  summarise_each(funs(mean))
## Adding missing grouping variables: `Condition`

That’s just confusing to me. So backward assignment operator it is.

Assigning as a verb

My only problem with this convention is that, with complicated chains of manipulations, I often find that I need to tweak the order of the verbs in the chain. For example, I might want to summarize all of the variables, and only then select which ones to store:

dat %>%
  group_by(Condition) %>%
  summarise_each(funs(mean)) %>%
  select(Age, starts_with("Baseline")) ->
  baseline_means
## Warning in mean.default(Expressive.Language): argument is not numeric or
## logical: returning NA

## Warning in mean.default(Expressive.Language): argument is not numeric or
## logical: returning NA

## Warning in mean.default(Expressive.Language): argument is not numeric or
## logical: returning NA

In revising the code, it’s necessary to change the symbols at the end of the second and third steps, which is a minor hassle. It’s possible to avoid retyping them by very carefully cutting everything from the %>% at the end of the second step through the end of the third step (but not its ->) and pasting it right after group_by(Condition), but that’s a delicate operation, prone to error if you’re programming after hours or after beer. Wouldn’t it be nice if every step in the chain ended with %>% so that you could move around whole lines of code without worrying about the bit at the end?

Here’s one crude way to end each link in the chain with a pipe:

dat %>%
  group_by(Condition) %>%
  select(Age, starts_with("Baseline")) %>%
  summarise_each(funs(mean)) %>%
  identity() -> baseline_means
## Adding missing grouping variables: `Condition`

But this is still pretty ugly—it’s got an extra function call that’s not a verb, and the name of the resulting object is tucked away in the middle of a line. What I need is a verb to take the results of a chain of operations and assign to an object. Base R has a suitable candidate here: the assign function. How about the following?

dat %>%
  group_by(Condition) %>%
  select(Age, starts_with("Baseline")) %>%
  summarise_each(funs(mean)) %>%
  assign("baseline_means_new", .)
## Adding missing grouping variables: `Condition`
exists("baseline_means_new")
## [1] FALSE

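This doesn’t work because of the environment into which baseline_means_new is assigned. By default, assign creates the object in the environment from which it is called, and inside a pipe chain that is evidently a temporary environment set up by %>% rather than my workspace, so the object is gone by the time the chain finishes. A brute-force fix would be to specify that the assign should be into the global environment. This will probably work most of the time (anything run at the top level of a script), but inside a function it would write to the global workspace rather than the function’s own environment, so it’s still not terribly elegant.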
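For concreteness, the brute-force version could look like the sketch below. It assumes only magrittr’s pipe and base R’s assign; the object names demo_mean and demo_local and the function g are made up for illustration. The second half shows the case where it does the wrong thing: inside a function, the result lands in the global workspace instead of the function’s environment.

```r
library(magrittr)

# Explicitly target the global environment instead of assign()'s default
c(1, 2, 3) %>%
  mean() %>%
  assign("demo_mean", ., envir = globalenv())

exists("demo_mean", envir = globalenv())

# Inside a function the same trick bypasses the local environment
g <- function() {
  here <- environment()
  10 %>% assign("demo_local", ., envir = globalenv())
  # look only in g()'s own environment, not its parents
  exists("demo_local", envir = here, inherits = FALSE)
}
g()
```

The first exists() call should report that the object is there, while g() should report that nothing was created locally.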

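Here’s a function that searches the call stack to find the most recent invocation of itself that does not involve non-standard evaluation, then assigns to its parent environment. In practice that means the most recent call whose text contains “put(” but not “put(.”, the latter being the form in which the pipe passes along its placeholder: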

put <- function(x, name, where = NULL) {
  if (is.null(where)) {
    sys_calls <- sys.calls()
    # calls whose text contains "put(" but not "put(."
    put_calls <- grepl("\\<put\\(", sys_calls) & !grepl("\\<put\\(\\.", sys_calls)
    # the frame just below the most recent such call
    where <- sys.frame(max(which(put_calls)) - 1)
  }
  assign(name, value = x, pos = where)
}
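The key ingredient is that sys.frame(n - 1) returns the environment one level below call number n on the stack. Here is a stripped-down sketch of just that step, in base R only. The names caller_frame and h are made up for illustration, and this is not a replacement for put.

```r
# Return the frame directly below this function's own frame on the call stack
caller_frame <- function() {
  n <- sys.nframe()   # position of caller_frame() itself on the stack
  sys.frame(n - 1)    # one level down; sys.frame(0) is the global environment
}

h <- function() {
  here <- environment()     # h()'s own environment
  below <- caller_frame()   # called directly from h(), so one level down is h()
  identical(below, here)
}
h()
```

Frame arithmetic like this is sensitive to anything that adds frames to the stack. Lazy argument evaluation does, and a pipe may as well, which is presumably why put searches the stack by name instead of simply stepping down one frame.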

Here are my quick tests that this function is assigning to the right environment:

put(dat, "dat1")
dat %>% put("dat2")

f <- function(dat, name) {
  put(dat, "dat3")
  dat %>% put("dat4")
  put(dat, name)
  c(exists("dat3"), exists("dat4"), exists(name))
}

f(dat,"dat5")
## [1] TRUE TRUE TRUE
grep("dat",ls(), value = TRUE)
## [1] "dat"  "dat1" "dat2"

This appears to work even if you’ve got multiple nested calls to put:

put(f(dat, "dat6"), "dat7")
grep("dat",ls(), value = TRUE)
## [1] "dat"  "dat1" "dat2" "dat7"
dat7
## [1] TRUE TRUE TRUE
f(dat, "dat8") %>% put("dat9")
grep("dat",ls(), value = TRUE)
## [1] "dat"  "dat1" "dat2" "dat7" "dat9"
dat9
## [1] TRUE TRUE TRUE

It works! (I think…)

To be consistent with the style of dplyr, let me also tweak the function to allow name to be the unquoted object name:

put <- function(x, name, where = NULL) {
  # capture the unquoted argument as a string, e.g. baseline_means -> "baseline_means"
  name_string <- deparse(substitute(name))
  if (is.null(where)) {
    sys_calls <- sys.calls()
    put_calls <- grepl("\\<put\\(", sys_calls) & !grepl("\\<put\\(\\.", sys_calls)
    where <- sys.frame(max(which(put_calls)) - 1)
  }
  assign(name_string, value = x, pos = where)
}
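The deparse(substitute(name)) idiom turns whatever expression was typed for name into a string. Below is a small base-R sketch of what that implies; the helper names name_of and name_of_either are invented for illustration. Unquoted names work, but a quoted string keeps its quotation marks, so the earlier put(dat, "dat1") style would no longer create an object called dat1 with this version.

```r
# What put() now sees for its `name` argument
name_of <- function(name) deparse(substitute(name))

name_of(baseline_means_new)    # unquoted name becomes a plain string
name_of("baseline_means_new")  # quoted string keeps its embedded quotes

# One possible tweak: accept either form by passing string literals through
name_of_either <- function(name) {
  expr <- substitute(name)
  if (is.character(expr)) expr else deparse(expr)
}

identical(name_of_either(dat1), name_of_either("dat1"))
```

Whether to accept both forms is a design choice: dplyr-style unquoted names read better in a chain, but they make it harder to pass a name stored in a variable.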

Returning to my original chain of manipulations, here’s how it looks with the new function:

dat %>%
  group_by(Condition) %>%
  select(Age, starts_with("Baseline")) %>%
  summarise_each(funs(mean)) %>%
  put(baseline_means_new)
## Adding missing grouping variables: `Condition`
print(baseline_means_new)
## # A tibble: 3 x 4
##   Condition   Age Baseline.Gaze Baseline.Vocalizations
##   <fct>     <dbl>         <dbl>                  <dbl>
## 1 OtherVR    122.          91.9                   2.86
## 2 SelfVid    121.         102.                    1.86
## 3 SelfVR     139.          95.5                   1.43

If you’ve been following along, let me know what you think of this. Is it a good idea, or is it dangerous? Are there cases where this will break? Can you think of a better name?
