Running R in parallel on the TACC
UPDATE (4/8/2014): I have learned from Mr. Yaakoub El Khamra that he and the good folks at TACC have made some modifications to TACC’s custom MPI implementation and R build in order to correct bugs in Rmpi and snow that were causing crashes. This post has been updated to reflect the modifications.
I’ve started to use the Texas Advanced Computing Center (TACC) to run statistical simulations in R. It takes a little bit of time to get up and running, but once you do, it is an amazing tool. To get started, you’ll need:
- An account on the TACC and an allocation of computing time.
- An ssh client like PuTTY.
- Some R code that can be adapted to run in parallel.
- A SLURM script that tells the server (called Stampede) how to run the R script.
The R script
I’ve been running my simulations using a combination of several packages that provide very high-level functionality for parallel computing, namely doSNOW and the mdply function in plyr. All of this runs on top of an Rmpi implementation developed by the folks at TACC (more details here).
In an earlier post, I shared code for running a very simple simulation of the Behrens-Fisher problem. Here’s adapted code for running the same simulation on Stampede. The main difference is that there are a few extra lines of code to set up a cluster, seed a random number generator, and pass necessary objects (saved in
source_func) to the nodes of the cluster:
library(Rmpi)
library(snow)
library(foreach)
library(iterators)
library(doSNOW)
library(plyr)

# set up parallel processing
cluster <- getMPIcluster()
registerDoSNOW(cluster)

# export source functions
clusterExport(cluster, source_func)
Once it is all set up, running the code is just a matter of turning on the parallel option in mdply:
BFresults <- mdply(parms, .fun = run_sim, .drop=FALSE, .parallel=TRUE)
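Since Rmpi is only available on Stampede, here is a runnable sketch of the same pattern that substitutes a local SOCK cluster for getMPIcluster(); the run_sim function and parms data frame below are simplified stand-ins for the actual Behrens-Fisher code, not the real simulation:

```r
library(snow)
library(doSNOW)
library(plyr)

# Simplified stand-in for run_sim: estimate the rejection rate of
# Welch's t-test for one combination of sample sizes and SD ratio
run_sim <- function(iterations, n1, n2, sd_ratio) {
  rejections <- replicate(iterations, {
    x <- rnorm(n1)
    y <- rnorm(n2, sd = sd_ratio)
    t.test(x, y)$p.value < 0.05
  })
  data.frame(rejection_rate = mean(rejections))
}

# One row per combination of simulation factors
parms <- expand.grid(iterations = 50, n1 = c(10, 20),
                     n2 = c(10, 20), sd_ratio = c(1, 2))

# A two-worker SOCK cluster stands in for getMPIcluster() on Stampede
cluster <- makeSOCKcluster(2)
registerDoSNOW(cluster)

results <- mdply(parms, .fun = run_sim, .parallel = TRUE)

stopCluster(cluster)  # always release the workers when done
```

The same script would run on Stampede by swapping makeSOCKcluster(2) back to getMPIcluster().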
I fully admit that my method of passing source functions is rather kludgy. One alternative would be to save all of the source functions in a separate file (say, source_functions.R) and source the file at the beginning of the simulation script:
rm(list = ls())
source("source_functions.R")
print(source_func <- ls())
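As a runnable sketch of that pattern (with add_one as a hypothetical stand-in for real source functions), you can check that ls() after sourcing captures exactly the names that clusterExport() would need:

```r
# Write a tiny stand-in source_functions.R to a temporary directory
src_file <- file.path(tempdir(), "source_functions.R")
writeLines("add_one <- function(x) x + 1", con = src_file)

# Source it into a fresh environment (standing in for a clean
# workspace after rm(list = ls())) and collect the object names
env <- new.env()
source(src_file, local = env)
source_func <- ls(env)

# source_func now holds the names to ship to the workers
# via clusterExport(cluster, source_func)
print(source_func)
```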
Another, more elegant alternative would be to put all of your source functions in a little package (say, BehrensFisher), install the package, and then pass the package via the .paropts argument of mdply:
BFresults <- mdply(parms, .fun = run_sim, .drop=FALSE, .parallel=TRUE, .paropts = list(.packages="BehrensFisher"))
Of course, developing a package involves a bit more work on the front end.
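If you go the package route, one low-effort starting point (a sketch, with my_fun as a hypothetical stand-in for real source functions) is base R’s package.skeleton():

```r
# A hypothetical stand-in for the simulation's source functions
my_fun <- function(x) x + 1

# Write a bare-bones package skeleton to a temporary directory;
# point path at a permanent location for real use
package.skeleton(name = "BehrensFisher", list = "my_fun", path = tempdir())
```

The generated skeleton can then be finished, built with R CMD build, and installed with R CMD INSTALL before being passed in .paropts.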
The SLURM script
Suppose that you’ve got your R code saved in a file called
Behrens_Fisher.R. Here’s an example of a SLURM script that runs the R script after configuring an Rmpi cluster:
#!/bin/bash
#SBATCH -J Behrens            # Job name
#SBATCH -o Behrens.o%j        # Name of stdout output file (%j expands to jobId)
#SBATCH -e Behrens.o%j        # Name of stderr output file (%j expands to jobId)
#SBATCH -n 32                 # Total number of mpi tasks requested
#SBATCH -p normal             # Submit to the 'normal' or 'development' queue
#SBATCH -t 0:20:00            # Run time (hh:mm:ss)
#SBATCH -A A-yourproject      # Allocation name to charge job against
#SBATCH --mail-user=firstname.lastname@example.org  # Email address for notifications
#SBATCH --mail-type=begin     # Email when job begins
#SBATCH --mail-type=end       # Email when job ends

# load R module
module load Rstats

# call R code from RMPISNOW
ibrun RMPISNOW < Behrens_Fisher.R
The script should be saved in a plain text file called something like run_BF.slurm. The file has to use ANSI encoding and Unix-style end-of-line encoding; Notepad++ is a text editor that can create files in this format.
Note that for full efficiency, the -n option should be a multiple of 16 because there are 16 cores per compute node. Further details about SBATCH options can be found here.
Running on Stampede
Follow these directions to log in to the Stampede server. Here’s the User Guide for Stampede. The first thing you’ll need to do is ensure that you’ve got the proper version of MVAPICH loaded. To do that, type
module swap intel intel/220.127.116.11
module setdefault
The second line sets this as the default, so you won’t need to do this step again.
Second, you’ll need to install whatever R packages you’ll need to run your code. To do that, type the following at the command prompt:
login4$module load Rstats
login4$R
This will start an interactive R session. From the R prompt, use install.packages to download and install the packages your code needs, e.g. install.packages("plyr").
The packages will be installed in a local library. Now type
q() to quit R.
Next, make a new directory for your project:
login4$mkdir project_name
login4$cd project_name
Upload your files to the directory (using psftp, for instance). Check that your R script is properly configured by viewing it in Vim.
Finally, submit your job by typing
login4$sbatch run_BF.slurm
or whatever your SLURM script is called. To check the status of the submitted job, type
showq -u followed by your TACC user name (more details here).
TACC accounts come with a limited number of computing hours, so you should be careful to write efficient code. Before you even start worrying about running on TACC, you should profile your code and try to find ways to speed up the computations. (Some simple improvements in my Behrens-Fisher code would make it run MUCH faster.) Once you’ve done what you can in terms of efficiency, you should do some small test runs on Stampede. For example, you could try running only a few iterations for each combination of factors, and/or running only some of the combinations rather than the full factorial design. Based on the run-time for these jobs, you’ll then be able to estimate how long the full code would take. If it’s acceptable (and within your allocation), then go ahead and
sbatch the full job. If it’s not, you might reconsider the number of factor levels in your design or the number of iterations you need. I might have more comments about those some other time.
Comments? Suggestions? Corrections? Drop a comment.