Introduce multi-node training setup #26

Open
wants to merge 5 commits into
base: main
from


sadamov
Collaborator

@sadamov sadamov commented May 4, 2024

Enable multi-node GPU training with SLURM

This PR adds support for multi-node GPU training using the SLURM job scheduler. The changes allow the code to detect if it is running within a SLURM job and automatically configure the number of devices and nodes based on the SLURM environment variables.

Key changes

  • Set use_distributed_sampler to True unless running in evaluation mode, so that training batches are sharded across processes
  • Detect if running within a SLURM job by checking for the SLURM_JOB_ID environment variable
  • If running with SLURM:
    • Set the number of devices per node (devices) based on the SLURM_GPUS_PER_NODE environment variable, falling back to torch.cuda.device_count() if not set
    • Set the total number of nodes (num_nodes) based on the SLURM_JOB_NUM_NODES environment variable, defaulting to 1 if not set
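
The detection logic listed above can be sketched roughly as follows. This is a minimal illustration and not the code in train_model.py: the function name `slurm_trainer_settings` and the helper `_cuda_device_count` are hypothetical, and the handling of a typed value such as `a100:4` in SLURM_GPUS_PER_NODE is an assumption about how some clusters set that variable.

```python
import os


def _cuda_device_count():
    """Fallback device count; hypothetical helper for this sketch."""
    # Imported lazily so the sketch also runs on machines without PyTorch.
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


def slurm_trainer_settings(env=None, count_devices=_cuda_device_count):
    """Return devices/num_nodes when inside a SLURM job, else an empty dict."""
    env = os.environ if env is None else env

    # SLURM sets SLURM_JOB_ID for every job, so its presence marks a SLURM run.
    if "SLURM_JOB_ID" not in env:
        return {}

    gpus_per_node = env.get("SLURM_GPUS_PER_NODE")
    if gpus_per_node:
        # Assumption: the value is either "4" or typed, e.g. "a100:4".
        devices = int(gpus_per_node.rsplit(":", 1)[-1])
    else:
        # Variable not set: fall back to the locally visible GPU count.
        devices = count_devices()

    # Default to a single node when SLURM does not report a node count.
    num_nodes = int(env.get("SLURM_JOB_NUM_NODES", 1))
    return {"devices": devices, "num_nodes": num_nodes}
```

Under these assumptions, the returned dictionary could be unpacked into the trainer constructor alongside use_distributed_sampler, since the Lightning Trainer accepts `devices` and `num_nodes` keyword arguments. Outside of a SLURM job the dictionary is empty and the existing defaults apply.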

Rationale for using SLURM

SLURM (Simple Linux Utility for Resource Management) is a widely used job scheduler and resource manager for high-performance computing (HPC) clusters. It provides a convenient way to allocate and manage resources, including GPUs, across multiple nodes in a cluster.

By leveraging SLURM, we can easily scale our training to utilize multiple GPUs across multiple nodes without the need for manual configuration.

@sadamov requested a review from joeloskarsson on May 4, 2024 13:44
@sadamov added the "enhancement" (New feature or request) label on May 4, 2024
@sadamov requested a review from leifdenby on May 14, 2024 05:32
Collaborator

@joeloskarsson joeloskarsson left a comment

Tested this on multi-GPU without any problems. Will test multi-node on our cluster as soon as I can get my hands on more than one node.

Review comment on train_model.py (resolved)
@leifdenby changed the title from "Introduces multi-node training setup" to "Introduce multi-node training setup" on May 30, 2024
@joeloskarsson
Collaborator

An update on my testing of this: the SLURM variables are also read correctly on our cluster, but I have not yet been able to get multi-node training working. I think this is unrelated to the code in this PR, though, and instead comes down to me not having the correct setup for running multi-node jobs on our cluster. Will ask around to see if I can get it working.

In the meantime, @leifdenby (or anyone at DMI 😄), do you have a slurm setup that you could test this on? I just think it's a good idea to test on multiple different clusters to make sure that this is general enough.

@sadamov
Collaborator Author

sadamov commented Jun 7, 2024

I have implemented the latest feedback, updated the CHANGELOG and added an example SLURM submission script to /docs/examples (is that a good location?) as discussed with @leifdenby. A new small section was added to the README.md.
@joeloskarsson yes, every cluster is different and I also have to adapt my submission scripts after major changes. Do you have ticket support with your HPC provider? They usually know what to do...

@sadamov
Collaborator Author

sadamov commented Dec 16, 2024

As discussed at the dev-meeting just now, here is a SLURM submission script example. I realized I had already put everything, plus some documentation, in this PR: docs/examples/submit_slurm_job.sh

3 participants