Introduce multi-node training setup #26
Tested this on multi-GPU without any problems. Will test multi-node on our cluster as soon as I can get my hands on more than one node.
An update on my testing of this: the SLURM constants are also read correctly on our cluster, but I have not yet been able to get multi-node training working. I think this is unrelated to this code, and instead comes down to me not having the correct setup for running multi-node jobs on our cluster. I will ask around to see if I can get it working. In the meantime, @leifdenby (or anyone at DMI 😄), do you have a SLURM setup that you could test this on? I think it's a good idea to test on several different clusters to make sure that this is general enough.
I have implemented the latest feedback, updated the CHANGELOG and added a SLURM-example submission script to |
As discussed at the dev meeting just now, here is an example SLURM submission script. I realized I had already put everything, plus some documentation, in this PR: docs/examples/submit_slurm_job.sh
Enable multi-node GPU training with SLURM
This PR adds support for multi-node GPU training using the SLURM job scheduler. The changes allow the code to detect if it is running within a SLURM job and automatically configure the number of devices and nodes based on the SLURM environment variables.
Key changes
- Set `use_distributed_sampler` to `True` when not in evaluation mode to enable distributed training
- Detect whether the code is running within a SLURM job by checking for the `SLURM_JOB_ID` environment variable
- Set the number of devices (`devices`) based on the `SLURM_GPUS_PER_NODE` environment variable, falling back to `torch.cuda.device_count()` if not set
- Set the number of nodes (`num_nodes`) based on the `SLURM_JOB_NUM_NODES` environment variable, defaulting to 1 if not set
Rationale for using SLURM
SLURM (Simple Linux Utility for Resource Management) is a widely used job scheduler and resource manager for high-performance computing (HPC) clusters. It provides a convenient way to allocate and manage resources, including GPUs, across multiple nodes in a cluster.
By leveraging SLURM, we can easily scale our training to utilize multiple GPUs across multiple nodes without the need for manual configuration.
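For readers who want to see the detection logic from the "Key changes" list in one place, here is a minimal sketch. It is not the code from this PR: the function name `slurm_trainer_kwargs`, its parameters, and the helper `_cuda_device_count` are hypothetical and chosen for illustration. Two things go beyond what the description states and are assumptions: the handling of a `type:count` value in `SLURM_GPUS_PER_NODE`, and leaving `devices` and `num_nodes` unset outside a SLURM job.

```python
import os
from typing import Callable, Mapping, Optional


def _cuda_device_count() -> int:
    """Return torch.cuda.device_count(), or 0 if torch is not installed."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


def slurm_trainer_kwargs(
    eval_mode: bool = False,
    env: Optional[Mapping[str, str]] = None,
    device_count: Callable[[], int] = _cuda_device_count,
) -> dict:
    """Build trainer keyword arguments from the SLURM environment (hypothetical helper)."""
    env = os.environ if env is None else env

    # Use the distributed sampler only when training, not when evaluating.
    kwargs = {"use_distributed_sampler": not eval_mode}

    # SLURM sets SLURM_JOB_ID inside a job allocation.
    if "SLURM_JOB_ID" in env:
        gpus = env.get("SLURM_GPUS_PER_NODE")
        if gpus:
            # Assumption: the value may be "4" or "a100:4"; keep the count part.
            kwargs["devices"] = int(gpus.split(":")[-1])
        else:
            # Fall back to the number of GPUs visible on this node.
            kwargs["devices"] = device_count()
        # Default to a single node if SLURM does not report a node count.
        kwargs["num_nodes"] = int(env.get("SLURM_JOB_NUM_NODES", "1"))

    # Assumption: outside SLURM, devices and num_nodes are left to the
    # trainer's own defaults.
    return kwargs
```

The returned dictionary uses the argument names from the list above, so it could be unpacked into a PyTorch Lightning `Trainer(**kwargs)` call. Passing `env` and `device_count` explicitly makes the logic easy to check without a cluster or a GPU.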