Analysing Data on Armis or Great Lakes using Rmd Files

 
== Software ==

=== On OSX ===

=== On Windows ===

* Install Putty for command line access https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html
* Install Filezilla for file transfers https://filezilla-project.org/
 
== Access ==

At Michigan there are two clusters, one for human data (Armis) and one for non-PHI data (Great Lakes).  Both are connected to a storage service called Turbo.  These are available as part of the UM Research Computing Package.

== Permissions ==

You can access the server either through the command line or through a remote desktop if you prefer a GUI.  There is also some limited Armis2 functionality available through the web portal (see these [https://arc.umich.edu/armis2/user-guide/ instructions]).

== Deidentified Data Request ==

If you submit a deidentified data request via https://datadirect.precisionhealth.umich.edu, your data will be sent as a 7zip file to the Turbo storage in a folder named for your IRB.  You will be prompted to enter a password to protect this file; do not lose this password.

== Server Access ==
  
 
=== Accessing the Server - Command Line ===

On OSX, open a terminal and enter <code>ssh <UNIQNAME>@armis2.arc-ts.umich.edu</code>, replacing <code><UNIQNAME></code> with your uniqname.  Enter your level 1 password and authenticate using Duo, following the prompts.  You will be placed in your home folder.  Your data should be in <code>/nfs/turbo/precision-health/DataDirect/<YOUR-IRB></code>.  I find it convenient to make a symlink for quickly navigating to the data folder, so first list the DataDirect folders to find yours:

<code>ls /nfs/turbo/precision-health/DataDirect/</code>

Locate the name of your folder (<FOLDER_NAME>) and create the symlink:

<code>ln -s /nfs/turbo/precision-health/DataDirect/<FOLDER_NAME> <LINK_NAME></code>

Now you can navigate from your home folder to your data folder by typing

<code>cd <LINK_NAME></code>
 
=== Accessing the Server - Remote Desktop ===

* You will need to unzip newly extracted files using the terminal (see [[#Extracting Data|Extracting Data]]).
* Go to https://armis2.arc-ts.umich.edu/ and authenticate.
* Click on '''Interactive Apps''', then RStudio, then enter your slurm account and set the time and memory needed.  Usually the defaults are fine.  Click Launch to open an RStudio session, wait a moment, then click Launch RStudio (and ignore any update prompt).
* Analyse your data using R or Rmd files in this secure environment.
* Pull processed ('''but never raw''') data from the server using Filezilla.

== Analyzing Data on the Cluster ==

=== Extracting Data ===

If you have a data archive from DataDirect, it will be a password-protected 7zip file.  You will need to first install 7-Zip in a bin folder in your home directory before extracting the archive:

<pre>
mkdir bin
cd bin
wget https://www.7-zip.org/a/7z2301-linux-x64.tar.xz
tar -xf 7z2301-linux-x64.tar.xz
./7zz
</pre>

If this works you should see the 7-Zip usage information.

To extract your archive, move into your working directory and then use the command <code>7zz x '<ARCHIVE>.zip'</code>, entering the password you created in [[#Deidentified Data Request|Deidentified Data Request]].  This should generate a folder containing a series of csv files to be used by your R scripts.
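
This installs <code>7zz</code> into your <code>bin</code> folder; if the command is not found when you run it from your working directory, that folder is probably not on your PATH.  A minimal fix for the current session is shown below (add the same line to your <code>~/.bashrc</code> to make it permanent); alternatively, call the binary by its full path, e.g. <code>~/bin/7zz x '<ARCHIVE>.zip'</code>.

<pre>
# make executables in ~/bin visible to the shell for this session
export PATH="$HOME/bin:$PATH"
</pre>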

=== Submitting Scripts to the Server ===

==== Configuring R ====

Some R packages may need to be installed in your home folder.  To do this, go to your home folder (<code>cd ~/</code>) and enter the following to load the R module, open an R shell, install the packages, and quit.  You only have to do this once for each package your scripts use.

<pre>
module load R
R
install.packages("PACKAGE_NAME")
q()
</pre>

Follow the prompts, agreeing to make a personal library and to use the Michigan mirror (71).

==== Creating a Rmd Script ====

I prefer to generate Rmd scripts on my own computer and then transfer them to the data folder to run (see [[#Transferring Files|Transferring Files]] for how to move scripts onto the server).
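
Before submitting a batch job, it can be worth checking that a script renders from a shell in your data folder.  A minimal sketch, assuming the same R and RStudio modules that the batch script below loads:

<pre>
# quick interactive test render; keep this small and leave large jobs to the scheduler
module load R
module load RStudio
Rscript -e "rmarkdown::render('<FIRST_SCRIPT>.Rmd')"
</pre>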

==== Creating a Batch Script ====

To submit a job to the server you will need to create a batch script.  The Armis2 and Great Lakes clusters use slurm to schedule jobs.  Details of how to configure a slurm file can be found [https://arc.umich.edu/armis2/slurm-user-guide here].  Each script can include multiple Rmd files.  A sample script is below; make sure to replace <USER> with your slurm account, and enter a name for the job and your email.

<pre>
#!/bin/bash
#SBATCH --job-name=<JOB_NAME>
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G
#SBATCH --time=00:15:00
#SBATCH --account=<USER>
#SBATCH --partition=standard
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH --mail-type=END,FAIL
#SBATCH --output=/home/%u/%x-%j.log
#SBATCH --error=/home/%u/error-%x-%j.log

module purge
module load R
module load RStudio

echo "Running from $(pwd)"

Rscript -e "rmarkdown::render('<FIRST_SCRIPT>.Rmd')"
Rscript -e "rmarkdown::render('<SECOND_SCRIPT>.Rmd')"
</pre>

Save this as a <JOB_NAME>.slurm file in your working folder on Turbo.  The <nowiki>mem-per-cpu</nowiki> and <nowiki>time</nowiki> options should be adjusted to match the size of your job (if you request too little memory or time, the job will fail and the email notification will say so).
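
To submit the job, run <code>sbatch</code> from the working folder containing the slurm file and your Rmd scripts, and use <code>squeue</code> to check on it (placeholders as above):

<pre>
# submit the batch script to the scheduler
sbatch <JOB_NAME>.slurm

# list your queued and running jobs
squeue -u <UNIQNAME>
</pre>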
== Transferring Files ==
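
Scripts written on your own computer can be copied to your data folder with Filezilla (see [[#Software|Software]]) or, from a terminal on your own machine, with scp.  A minimal sketch, assuming scp to the login node is permitted and using the symlink and placeholders from above:

<pre>
# run on your own computer: copy a local Rmd script into your data folder on Turbo
scp <FIRST_SCRIPT>.Rmd <UNIQNAME>@armis2.arc-ts.umich.edu:<LINK_NAME>/
</pre>

Processed results ('''but never raw data''') can be pulled back the same way.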
