References that I found useful in trying to understand this vast topic:

There’s more information on different ways of using parallel computing on Sheffield’s HPC site (though I don’t understand most of it…)

When to parallelise

A core on the HPC cluster is what you automatically get when you make a job request; a node is a single machine in the cluster network, and each node contains multiple cores. On your laptop or desktop, you may have just one processor, but it'll have multiple cores. Check using the R package parallel:

library(parallel) # comes with base, no need to install
detectCores()

On my Late 2011 MacBook Pro with its 2.4 GHz Intel Core i5 processor, there are 4 cores: 2 physical (when I check the system report on my Mac) and 4 virtual. The 'virtual' (logical) cores come from hyper-threading, where each physical core presents two logical cores to the operating system.
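
If you want to check the two counts separately, detectCores() takes a logical argument (on some platforms only the logical count is available):

detectCores(logical = TRUE)  # logical ('virtual') cores, 4 on my machine
detectCores(logical = FALSE) # physical cores, 2 on my machine (may be NA on some platforms)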

Parallel computing on your own computer

You would basically be splitting the work of an lapply-like function over the multiple cores in your computer. There are two main methods: 1) PSOCK (Parallel Socket Cluster) and 2) FORK.

PSOCK launches a fresh R session on each core and works on all systems, including Windows, but each worker starts with an empty environment, so any variables and packages it needs have to be explicitly shared. It appears to be more time-consuming and more complicated.

FORKing copies the current R session onto each new core, so the workers share its environment (no need to explicitly share variables and libraries again). Apart from the fact that it does not work on Windows, it seems to be the generally preferred method.
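
A tiny sketch to illustrate that shared environment (Mac/Linux only, since it relies on forking):

library(parallel)
x <- 10                           # defined in the current session
mclapply(1:3, function(i) x + i)  # the forked workers can already see x, no export step needed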

Using base package parallel

The base package parallel allows a lot of parallelisation to be done without additional packages installed. For example, using parLapply (PSOCK method) or mclapply (FORK method) instead of lapply.

Using the PSOCK method:

library(parallel)
library(lme4)

# the function we will use as an example
ex <- function(x) {
  lmer(Petal.Width ~ Sepal.Width + Petal.Length + (1 | Species), data=iris)
}
# assume we also need a variable
x <- c(1:5)

# number of cores to use (-1 so as not to freeze the computer)
noCores <- detectCores() - 1
# Initiate a cluster
cl <- makeCluster(noCores)
# 'share' library to each core
clusterEvalQ(cl, library(lme4))
# if there are variables, they also need to be shared (by name)
clusterExport(cl, 'x')
# call the parallel version of lapply to run the function 10 times
parLapply(cl, 1:10, ex)
# stop the cluster to free up resources
stopCluster(cl)
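
One extra step worth knowing about if the function uses random numbers: each PSOCK worker needs its own reproducible random number stream, which parallel can set up with clusterSetRNGStream() just after making the cluster. Something like:

cl <- makeCluster(noCores)
clusterSetRNGStream(cl, iseed = 123) # independent, reproducible RNG streams on each worker
# ... then clusterEvalQ()/clusterExport()/parLapply() as above ...
stopCluster(cl)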

Or using mclapply (the FORK method):

library(parallel)
library(lme4)

# the function we will use as an example
ex <- function(x) {
  lmer(Petal.Width ~ Sepal.Width + Petal.Length + (1 | Species), data=iris)
}

# call the parallel version of lapply to run the function 10 times
mclapply(1:10, ex, mc.cores = detectCores() - 1)
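
By default mclapply() uses getOption('mc.cores', 2L) workers, so without the mc.cores argument above it would only have used 2 cores. Alternatively, the option can be set once per session:

options(mc.cores = detectCores() - 1) # mclapply() picks this up via getOption('mc.cores')
mclapply(1:10, ex)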

Other packages

Alternatively, there are also specific packages you can install, like foreach, which requires the doParallel package too. The doParallel package is a wrapper around parallel.

# install.packages(c('foreach', 'doParallel'))
library(parallel)
library(foreach)
library(doParallel)
library(lme4)

# the function we will use as an example
ex <- function(x) {
  lmer(Petal.Width ~ Sepal.Width + Petal.Length + (1 | Species), data=iris)
}
# assume we also need a variable
x <- c(1:5)

# number of cores to use (-1 so as not to freeze the computer)
noCores <- detectCores() - 1

# start a cluster
cl <- makeCluster(noCores)
registerDoParallel(cl) # this is necessary for %dopar% to use the cluster

# use foreach; 'i' is the iteration variable
foreach(i = 1:10, # the default output is a list; use .combine='c' for a vector, or 'rbind'/'cbind' for a matrix
        .export='x', # .export is the equivalent of clusterExport
        .packages='lme4') %dopar% # .packages is the equivalent of clusterEvalQ
  ex(i)

stopCluster(cl)
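
On Mac/Linux, doParallel can also register a fork-based backend directly with a cores argument, which skips the makeCluster()/stopCluster() steps; a sketch, reusing the ex() function from above:

library(doParallel) # also attaches foreach

registerDoParallel(cores = detectCores() - 1) # fork-based backend, no cluster object to stop
foreach(i = 1:10) %dopar% ex(i)               # packages and variables are inherited from the session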

Parallel computing on the HPC cluster

There are several different ways to do this on the cluster. The simplest, and perhaps the most like how you would do it on your own computer, is a task array, where each task is independent of the others and will be run on different cores, on different nodes, at different times (as determined by the job scheduler SGE). Then there is Shared Memory Parallelism (SMP), where the tasks have to share the same memory and so are run on different cores of the same node. A little more complicated is the Message Passing Interface (MPI), where the tasks may run on cores on different nodes and so cannot share memory directly; instead they communicate data and other information by passing messages. There's also a hybrid SMP/MPI way. Unfortunately, I've not quite got the hang of SMP and/or MPI at the moment, so this guide won't include those.

Setting up a task array job

This is (I think) what one is most likely to use the HPC for when doing parallel computing: it takes multiple input files and runs the same R script on each of them to get the corresponding output files. From my search, it seemed like there may be many different ways to do this and to write the necessary scripts, but this is what works for me on ShARC.

Writing the bash script

As with submitting a batch serial job, this requires a bash script (which I'll call taskArray.sh), but with some modifications.

#!/bin/bash
#$ -l rmem=XXG
#$ -t 1-N
#$ -N jobName
#$ -M UserName@email.address
#$ -m bea
#$ -o /file/path/to/Output.txt
#$ -j y

cd /working/directory

# specify input files in a text file called Index, one file per line
INPUT_FILE=$(sed -n -e "$SGE_TASK_ID p" Index.txt)

# output the input file name and task ID to keep track
echo "The input file is $INPUT_FILE"
echo "Task ID is $SGE_TASK_ID"

module load apps/R/3.6.3/gcc-8.2.0

Rscript myScript.R $INPUT_FILE

The specifications are the same; the only additional argument is -t 1-N, which tells SGE this is a task array job and to run the R script myScript.R N times, once for each of the N input files you have. The script can only refer to your input files via the assigned $SGE_TASK_ID (which runs from 1 to N); I found it easier to keep my input file names in a plain text document (called Index.txt) than to rename my files (e.g. input.1, input.2, … input.N).

My plain text file contains the list of my input files (with the file path relative to the working directory specified in my bash script, if needed) and looks like this:

directory/firstFile.csv
directory/secondFile.csv
directory/thirdFile.csv

The line INPUT_FILE=$(sed -n -e "$SGE_TASK_ID p" Index.txt) pulls out the line of Index.txt corresponding to the task ID and feeds it into the Rscript command; for example, when $SGE_TASK_ID is 2, it picks up directory/secondFile.csv. This bash script has to be made executable as well, and both the bash script taskArray.sh and the text file with the list of input files Index.txt have to be in the same directory.

If you don't have multiple different input files and just want to run the same script repeatedly, you can simply drop the INPUT_FILE lines. Within your R script, though, assuming you want different output for each run, you will likely need to include a line to get the array number: arrayid <- as.numeric(Sys.getenv('SGE_TASK_ID')).
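
For example, in a hypothetical R script where each task should write its own output file, that array number could be used something like this (result here is just a stand-in for whatever the script actually produces):

arrayid <- as.numeric(Sys.getenv('SGE_TASK_ID'))
result  <- runif(10) * arrayid                      # stand-in for the actual analysis
saveRDS(result, paste0('output_', arrayid, '.rds')) # one output file per task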

Writing the R script

The R script has to be modified slightly, since you're not hard-coding the input file into the script itself. This R script should be in the same directory as the taskArray.sh and Index.txt files.

shhh <- suppressPackageStartupMessages # so I don't get a bunch of start up messages in the output file, a tip I encountered while searching through StackOverflow...
shhh(library('packageName'))
shhh(library('packageName'))

## read in dataset
args <- commandArgs(trailingOnly = TRUE) # to get the input files from command line
dat <- read.csv(args[1]) # feed the input file into the read.csv function

Submitting the batch job is basically the same process: connect to the VPN, log on to the HPC system, then run qsub taskArray.sh. I find the commandArgs function quite useful for feeding different input arguments in from a bash script without having to change the R script itself.
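
As a sketch of that, if the Rscript line in the bash script were extended to something like Rscript myScript.R $INPUT_FILE 1000 (the 1000 being a hypothetical extra setting), the R script could pick both up:

args  <- commandArgs(trailingOnly = TRUE)
dat   <- read.csv(args[1])   # first argument: the input file
nIter <- as.numeric(args[2]) # second argument: the hypothetical extra setting, e.g. 1000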