Compute Canada Quickstart
By William Guimont-Martin • 5 minute read
Here’s a small guide on how to use Compute Canada clusters to train deep learning models.
In this guide, we’ll do the necessary setup to get you up and running on a cluster. This guide is written to work with Narval, but it should be easy to adapt it to other clusters by changing the URLs. You can see available clusters at this URL.
We recommend Narval for deep learning as of now (2021-2022) because it has four NVIDIA A100 GPUs (40 GB of memory each) per node.
This guide was written by William Guimont-Martin, 2021
Also available on Norlab Wiki.
Account creation
First of all, you’ll need a Compute Canada account to access the clusters. To do so, please see Tips for new students. You’ll see the URL to apply for an account and Philippe Giguère’s sponsor code.
You’ll need your username and your password for the following steps.
SSH and ssh-copy-id
To access Compute Canada’s clusters, you’ll need to ssh into them. You can get the login node URL by checking the cluster’s wiki page. For Narval, the URL is narval.computecanada.ca.
To connect to the cluster:
# replace <username> with your actual Compute Canada username
ssh <username>@narval.computecanada.ca
You’ll be asked to enter your password. If everything goes right, you’ll be greeted by the cluster’s welcome message.
To make things easier later, we recommend setting up an SSH key to connect to the cluster.
To do so, you’ll need to generate an SSH key if you don’t already have one you’d like to use.
Please see Generating a new SSH key and adding it to the ssh-agent from GitHub to generate a new SSH key.
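If you just want a quick command, a typical invocation looks like this (the email in the comment is only a placeholder):
# generate an ed25519 key pair under ~/.ssh/
ssh-keygen -t ed25519 -C "your_email@example.com"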
Once you have generated your SSH key and added it to the ssh-agent, copy it to the cluster with:
# replace <username> with your actual Compute Canada username
ssh-copy-id <username>@narval.computecanada.ca
If everything went right, you should now be able to ssh into the cluster without being asked for your password.
Clone your code
Now that you are logged into the cluster, you should have a bash shell. From that shell, you can use most of the Linux commands you're used to, like ls, cd, etc.
We’ll now clone your project on the cluster.
From the cluster, you can create another SSH key and add it to GitHub.
This will allow you to git clone your repository into your home directory.
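As a rough sketch, assuming your repository is hosted on GitHub and you want it in ~/code (the repository URL and destination below are placeholders):
# on the cluster: generate a key and print the public part to paste into GitHub
ssh-keygen -t ed25519 -C "narval"
cat ~/.ssh/id_ed25519.pub
# once the key is added to GitHub, clone your project into your home directory
git clone git@github.com:<your-org>/<your-repo>.git ~/code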
Set up your Python venv
It is recommended to create your Python venv from inside a SLURM job. However, if your installation script is quite complex, it will be easier to set it up once before starting any job.
To create a Python venv:
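A minimal sketch, assuming you create the venv at ~/venv (the Python module version is only an example, pick the one you need):
# load a Python module, then create the virtual environment
module load python/3.10
virtualenv --no-download ~/venv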
Once created, you can activate it using:
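Assuming the venv was created at ~/venv as above:
source ~/venv/bin/activate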
You should now see (venv) at the beginning of your shell prompt.
You can now pip install all of your dependencies.
Please note that when installing mmdetection and its dependencies, you might also need to run:
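The exact command depends on your mmdetection version; one common pattern (an assumption, double-check the mmdetection installation documentation) is to install mmcv through OpenMMLab's mim tool:
# hypothetical example: install mmcv from inside the activated venv
pip install -U openmim
mim install mmcv-full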
If you want to create your venv from inside a job, note the commands you used to set up your environment and copy them into your job script.
Putting your data on the cluster
To put your data on the cluster, we recommend using sftp.
First of all, take the time to familiarize yourself with the different types of storage available on the clusters.
From that, you’ll have to upload your dataset in ~/projects/def-philg/<username>
. Note that this space is shared across the people in the lab. Please keep it clean!
From your local computer:
# replace <username> with your actual Compute Canada username
sftp <username>@narval.computecanada.ca
# use "put file" to upload a file, use "get file" to download a file
# you can use * to wildcard select multiple files
If you are using mmdetection, you might want to add a symlink to your data in code/data. To do so, run the following commands:
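A sketch, assuming your repository was cloned to ~/code and your dataset was uploaded to ~/projects/def-philg/<username>/my_dataset (adjust the paths to your layout):
mkdir -p ~/code/data
ln -s ~/projects/def-philg/<username>/my_dataset ~/code/data/my_dataset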
SLURM jobs
We can now create jobs. To do so, create a .sh file and paste the following inside:
#!/bin/bash
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=8
#SBATCH --mem=40000M
#SBATCH --time=1-00:00
#SBATCH --job-name=01_baseline_pointpillars
#SBATCH --output=%x-%j.out
# Variables
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK  # example: limit thread count to the allocated CPUs
# Modules
module load python cuda  # adjust modules and versions to your needs
# Setup Python
source ~/venv/bin/activate  # path to the venv created earlier
# Start task
# TODO change that part to how you run your python script, this example shows how to do multi GPU training using mmdetection3d
cd ~/code  # assumed path to your cloned repository
CONFIG_FILE=configs/01_baseline_pointpillars.py
WORK_DIR=path/to/work/dir # TODO change this path too
GPUS=4
bash ./tools/dist_train.sh ${CONFIG_FILE} ${GPUS} --work-dir ${WORK_DIR}
# Cleaning up
deactivate  # not strictly needed
This script is organized in five main parts.
Configure job
This part is used to configure the resources needed for your job.
#!/bin/bash
#SBATCH --gres=gpu:4 # Ask for 4 GPUs
#SBATCH --cpus-per-task=8 # Ask for 8 CPUs
#SBATCH --mem=40000M # Ask for 40 000M of RAM
#SBATCH --time=1-00:00 # Run for DD-HH:MM
#SBATCH --job-name=01_baseline_pointpillars # Job name
#SBATCH --output=%x-%j.out # Log will be written to f'{job_name}-{job_id}.out'
Setup installation
This part sets environment variables for multithreading and loads needed modules. You might need to change the loaded versions.
# Variables
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK  # example: limit thread count to the allocated CPUs
# Modules
module load python cuda  # adjust modules and versions to your needs
Loading the venv
This will load the venv we created earlier.
# Setup Python
source ~/venv/bin/activate  # path to the venv created earlier
Running your actual code
You will need to change this part to run what you want to run. This script shows how to do multi-GPU training with mmdetection3d.
# Start task
# TODO change that part to how you run your python script, this example shows how to do multi GPU training using mmdetection3d
cd ~/code  # assumed path to your cloned repository
CONFIG_FILE=configs/01_baseline_pointpillars.py
WORK_DIR=path/to/work/dir # TODO change this path too
GPUS=4
bash ./tools/dist_train.sh ${CONFIG_FILE} ${GPUS} --work-dir ${WORK_DIR}
Cleaning up
Not strictly needed, but it is always a good thing to clean up after yourself.
# Cleaning up
deactivate  # not strictly needed
Run the job
To run a job, you first have to queue it to the job scheduler. To do so, run the following command:
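Assuming the job script above was saved as train.sh (any name works):
sbatch train.sh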
You can then verify the job is queued by running:
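On Compute Canada clusters, sq lists your own queued and running jobs:
sq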
More information is available here.
Useful tools
sbatch: queue a job.
sq: view your queued jobs.
scancel <id>: cancel the job with the given id.
salloc --account=def-philg --gres=gpu:2 --cpus-per-task=4 --mem=32000M --time=5:00:00: start an interactive job, which will allow you to test your scripts before queuing jobs.
sftp: useful tool to transfer data to and from the cluster.
diskusage_report: see used disk space.