Running R in parallel on the TACC
UPDATE (4/8/2014): I have learned from Mr. Yaakoub El Khamra that he and the good folks at TACC have made some modifications to TACC’s custom MPI implementation and R build in order to correct bugs in Rmpi and snow that were causing crashes. This post has been updated to reflect the modifications.
I’ve started to use the Texas Advanced Computing Center (TACC) to run statistical simulations in R. It takes a little bit of time to get up and running, but once you do, it is an amazing tool. To get started, you’ll need:
- An account on the TACC and an allocation of computing time.
- An ssh client like PuTTY.
- Some R code that can be adapted to run in parallel.
- A SLURM script that tells the server (called Stampede) how to run the R script.
The R script
I’ve been running my simulations using a combination of several packages that provide very high-level functionality for parallel computing, namely doSNOW and the mdply function in plyr. All of this runs on top of an Rmpi implementation developed by the folks at TACC (more details here).
In an earlier post, I shared code for running a very simple simulation of the Behrens-Fisher problem. Here’s adapted code for running the same simulation on Stampede. The main difference is that there are a few extra lines of code to set up a cluster, seed a random number generator, and pass necessary objects (saved in
source_func) to the nodes of the cluster:
library(Rmpi)
library(snow)
library(foreach)
library(iterators)
library(doSNOW)
library(plyr)

# set up parallel processing
cluster <- getMPIcluster()
registerDoSNOW(cluster)

# export source functions
clusterExport(cluster, source_func)
Once it is all set up, running the code is just a matter of turning on the parallel option in mdply:
BFresults <- mdply(parms, .fun = run_sim, .drop=FALSE, .parallel=TRUE)
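Since Rmpi is only available on Stampede, here is a runnable sketch of the same pattern that substitutes a local SOCK cluster for getMPIcluster(); the run_sim function and parms data frame below are simplified stand-ins for the actual Behrens-Fisher code, not the real simulation:

```r
library(snow)
library(doSNOW)
library(plyr)

# Simplified stand-in for run_sim: estimate the rejection rate of
# Welch's t-test for one combination of sample sizes and SD ratio
run_sim <- function(iterations, n1, n2, sd_ratio) {
  rejections <- replicate(iterations, {
    x <- rnorm(n1)
    y <- rnorm(n2, sd = sd_ratio)
    t.test(x, y)$p.value < 0.05
  })
  data.frame(rejection_rate = mean(rejections))
}

# One row per combination of simulation factors
parms <- expand.grid(iterations = 50, n1 = c(10, 20),
                     n2 = c(10, 20), sd_ratio = c(1, 2))

# A two-worker SOCK cluster stands in for getMPIcluster() on Stampede
cluster <- makeSOCKcluster(2)
registerDoSNOW(cluster)

results <- mdply(parms, .fun = run_sim, .parallel = TRUE)

stopCluster(cluster)  # always release the workers when done
```

The same script would run on Stampede by swapping makeSOCKcluster(2) back to getMPIcluster().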
I fully admit that my method of passing source functions is rather kludgy. One alternative would be to save all of the source functions in a separate file (say, source_functions.R) and source the file at the beginning of the simulation script:
rm(list = ls())
source("source_functions.R")
print(source_func <- ls())
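As a runnable sketch of that pattern (with add_one as a hypothetical stand-in for real source functions), you can check that ls() after sourcing captures exactly the names that clusterExport() would need:

```r
# Write a tiny stand-in source_functions.R to a temporary directory
src_file <- file.path(tempdir(), "source_functions.R")
writeLines("add_one <- function(x) x + 1", con = src_file)

# Source it into a fresh environment (standing in for a clean
# workspace after rm(list = ls())) and collect the object names
env <- new.env()
source(src_file, local = env)
source_func <- ls(env)

# source_func now holds the names to ship to the workers
# via clusterExport(cluster, source_func)
print(source_func)
```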
Another, more elegant alternative would be to put all of your source functions in a little package (say, BehrensFisher), install the package, and then pass the package via the .paropts argument of mdply:
BFresults <- mdply(parms, .fun = run_sim, .drop=FALSE, .parallel=TRUE, .paropts = list(.packages="BehrensFisher"))
Of course, developing a package involves a bit more work on the front end.
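If you go the package route, one low-effort starting point (a sketch, with my_fun as a hypothetical stand-in for real source functions) is base R’s package.skeleton():

```r
# A hypothetical stand-in for the simulation's source functions
my_fun <- function(x) x + 1

# Write a bare-bones package skeleton to a temporary directory;
# point path at a permanent location for real use
package.skeleton(name = "BehrensFisher", list = "my_fun", path = tempdir())
```

The generated skeleton can then be finished, built with R CMD build, and installed with R CMD INSTALL before being passed in .paropts.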
The SLURM script
Suppose that you’ve got your R code saved in a file called
Behrens_Fisher.R. Here’s an example of a SLURM script that runs the R script after configuring an Rmpi cluster:
#!/bin/bash
#SBATCH -J Behrens            # Job name
#SBATCH -o Behrens.o%j        # Name of stdout output file (%j expands to jobId)
#SBATCH -e Behrens.o%j        # Name of stderr output file (%j expands to jobId)
#SBATCH -n 32                 # Total number of mpi tasks requested
#SBATCH -p normal             # Submit to the 'normal' or 'development' queue
#SBATCH -t 0:20:00            # Run time (hh:mm:ss)
#SBATCH -A A-yourproject      # Allocation name to charge job against
#SBATCH --mail-user=firstname.lastname@example.org  # Email address for notifications
#SBATCH --mail-type=begin     # Email when job begins
#SBATCH --mail-type=end       # Email when job ends

# load R module
module load Rstats

# call R code from RMPISNOW
ibrun RMPISNOW < Behrens_Fisher.R
The script should be saved in a plain text file called something like run_BF.slurm. The file has to use ANSI encoding and Unix-style end-of-line encoding; Notepad++ is a text editor that can create files in this format.
Note that for full efficiency, the -n option should be a multiple of 16 because there are 16 cores per compute node. Further details about SBATCH options can be found here.
Running on Stampede
Follow these directions to log in to the Stampede server. Here’s the User Guide for Stampede. The first thing you’ll need to do is ensure that you’ve got the proper version of MVAPICH loaded. To do that, type
module swap intel intel/220.127.116.11
module setdefault
The second line sets this as the default, so you won’t need to do this step again.
Second, you’ll need to install whatever R packages you’ll need to run your code. To do that, type the following at the command prompt:
login4$module load Rstats
login4$R
This will start an interactive R session. From the R prompt, use install.packages to download and install the packages your code needs, e.g. install.packages("plyr").
The packages will be installed in a local library. Now type
q() to quit R.
Next, make a new directory for your project:
login4$mkdir project_name
login4$cd project_name
Upload your files to the directory (using psftp, for instance). Check that your R script is properly configured by viewing it in Vim.
Finally, submit your job by typing
login4$sbatch run_BF.slurm
or whatever your SLURM script is called. To check the status of the submitted job, type
showq -u followed by your TACC user name (more details here).
TACC accounts come with a limited number of computing hours, so you should be careful to write efficient code. Before you even start worrying about running on TACC, you should profile your code and try to find ways to speed up the computations. (Some simple improvements in my Behrens-Fisher code would make it run MUCH faster.) Once you’ve done what you can in terms of efficiency, you should do some small test runs on Stampede. For example, you could try running only a few iterations for each combination of factors, and/or running only some of the combinations rather than the full factorial design. Based on the run-time for these jobs, you’ll then be able to estimate how long the full code would take. If it’s acceptable (and within your allocation), then go ahead and
sbatch the full job. If it’s not, you might reconsider the number of factor levels in your design or the number of iterations you need. I might have more comments about those some other time.
Comments? Suggestions? Corrections? Drop a comment.