SC 24 - Tutorial on Distributed Training of Deep Neural Networks

Join slack

All the code for the hands-on exercises can be found in this repository.

Setup

To request an account on Zaratan, join Slack using the link above and fill out this Google form.

We have pre-built the dependencies required for this tutorial on Zaratan. They are activated automatically when you run the bash scripts.

Model weights and the training dataset have been downloaded to /scratch/zt1/project/sc24/shared/.

Basics of Model Training

Using PyTorch Lightning

CONFIG_FILE=configs/single_gpu.json sbatch --ntasks-per-node=1  train.sh

Mixed Precision

Open configs/single_gpu.json and change precision to bf16-mixed, then run:

CONFIG_FILE=configs/single_gpu.json sbatch --ntasks-per-node=1  train.sh
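
If you prefer to script the config change instead of editing the file by hand, something like the following may work. This is a minimal sketch, assuming precision is a top-level key in the JSON file (the layout of configs/single_gpu.json is not shown in this README). The helper set_config_value is hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def set_config_value(path, key, value):
    """Set an existing top-level key in a JSON config file.

    Assumption: the key already exists at the top level of the file.
    Returns the updated config as a dict.
    """
    config_path = Path(path)
    config = json.loads(config_path.read_text())
    if key not in config:
        # Fail loudly rather than silently adding a key the scripts may ignore.
        raise KeyError(f"{key!r} is not a top-level key in {config_path}")
    config[key] = value
    config_path.write_text(json.dumps(config, indent=4) + "\n")
    return config


# Example call (run from the repository root):
# set_config_value("configs/single_gpu.json", "precision", "bf16-mixed")
```

The helper raises KeyError when the key is missing, so a wrong guess about the file layout shows up immediately rather than as a run that silently stays in full precision.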

Data Parallelism

PyTorch Distributed Data Parallel (DDP)

CONFIG_FILE=configs/ddp.json sbatch --ntasks-per-node=4  train.sh

Fully Sharded Data Parallelism (FSDP)

CONFIG_FILE=configs/fsdp.json sbatch --ntasks-per-node=4  train.sh

Tensor Parallelism

CONFIG_FILE=configs/axonn.json sbatch --ntasks-per-node=4  train.sh

Inference

Optionally, add more prompts to data/inference/prompts.txt. Then run:

CONFIG_FILE=configs/inference_axonn.json sbatch --ntasks-per-node=1  infer.sh

With torch.compile

Open configs/inference_axonn.json and change compile to true. Then run:

CONFIG_FILE=configs/inference_axonn.json sbatch --ntasks-per-node=1  infer.sh

With tensor parallelism

Open configs/inference_axonn.json and change tp_dimensions to [4, 1, 1]. Then run:

CONFIG_FILE=configs/inference_axonn.json sbatch --ntasks-per-node=4  infer.sh
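
The pairing of tp_dimensions [4, 1, 1] with --ntasks-per-node=4 suggests that the product of the tensor-parallel dimensions should equal the number of launched tasks. The sketch below checks that relationship before you submit a job. It assumes one task per GPU and no additional data parallelism, and both function names are hypothetical rather than part of this repository or of AxoNN.

```python
import math


def tp_world_size(tp_dimensions):
    """Return the number of tasks implied by a list of tensor-parallel dimensions."""
    if any(d < 1 for d in tp_dimensions):
        raise ValueError("each tensor-parallel dimension must be a positive integer")
    return math.prod(tp_dimensions)


def check_launch(tp_dimensions, ntasks_per_node, nodes=1):
    """Return True if the launched task count matches the tensor-parallel layout.

    Assumption: one task per GPU and no extra data parallelism.
    """
    return tp_world_size(tp_dimensions) == ntasks_per_node * nodes


# Example: the setting from this section, launched with --ntasks-per-node=4.
print(check_launch([4, 1, 1], ntasks_per_node=4))
```

If the check returns False, either tp_dimensions or --ntasks-per-node probably needs adjusting so the two agree.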