Executing Albany on Ride or White
These instructions are for running Albany on the Ride or White IBM Power8 GPU clusters at Sandia National Laboratories. Jobs are submitted to the queue manager with batch scripts, and a submitted script runs once the requested resources become available.
As of February 2017, Ride and White are each split into three queues with different numbers of nodes and GPUs:
Ride | Queue Name | Node Names | Number of Nodes | GPU Model | Number of GPUs per Node |
---|---|---|---|---|---|
Firestone nodes (default queue) | rhel7F | ride7 - ride16 | 10 | K80 (12GB) | 4 |
Garrison nodes | rhel7G | ride17 - ride28 | 12 | P100 (16GB) | 4 |
Tuleta nodes | rhel7T | ride2 - ride5 | 4 | K40m (12GB) | 2 |
White | Queue Name | Node Names | Number of Nodes | GPU Model | Number of GPUs per Node |
---|---|---|---|---|---|
Firestone nodes (default queue) | rhel7F | white20 - white27 | 8 | K80 (12GB) | 4 |
Garrison nodes | rhel7G | white28 - white35 | 8 | P100 (16GB) | 4 |
Tuleta nodes | rhel7T | white13 - white19 | 7 | K40m (12GB) | 2 |
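Since the queue determines how many GPUs each node offers, a job script can look that number up rather than hard-code it. The following is a minimal sketch that encodes the tables above; the function name `gpus_per_node` is hypothetical, and the values would need updating if the queue layout changes.

```shell
#!/bin/bash
# Hypothetical helper: GPUs per node for each queue, taken from the tables
# for Ride and White (the per-node GPU counts are the same on both clusters).
gpus_per_node() {
  case "$1" in
    rhel7F) echo 4 ;;  # Firestone nodes, K80
    rhel7G) echo 4 ;;  # Garrison nodes, P100
    rhel7T) echo 2 ;;  # Tuleta nodes, K40m
    *) echo "unknown queue: $1" >&2; return 1 ;;
  esac
}

# Example lookup for the Tuleta queue.
queue=rhel7T
echo "Queue ${queue}: $(gpus_per_node "${queue}") GPUs per node"
```

The result could, for instance, be passed on as the value of `--kokkos-ndevices` when launching Albany.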
Ride and White use LSF as a resource manager and job scheduler. Here is a list of useful commands:
- `bsub -Is bash`: Submit an interactive job to the LSF system
- `bsub < [BatchScriptFile]`: Submit a batch job to the LSF system, where `[BatchScriptFile]` refers to the batch script file being used
- `bkill`: Kill a running job
- `bjobs`: See the status of user jobs in the LSF queue
- `bjobs -u all`: See the status of all jobs in the LSF queue
- `bqueues`: Information about LSF batch queues
- `bqueues -l`: More detailed information about the settings for each queue
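When scripting around these commands, it can help to capture the job ID that `bsub` prints on submission so it can be passed to `bjobs` or `bkill`. The sketch below assumes the usual LSF confirmation line of the form `Job <12345> is submitted to queue <rhel7G>.`; the helper name `parse_job_id` is hypothetical and not part of LSF or Albany.

```shell
#!/bin/bash
# Hypothetical helper: extract the numeric job ID from bsub's confirmation line.
# Assumes output like: Job <12345> is submitted to queue <rhel7G>.
parse_job_id() {
  printf '%s\n' "$1" | sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

# Example with a sample confirmation line (no LSF installation needed to try it).
sample='Job <12345> is submitted to queue <rhel7G>.'
job_id=$(parse_job_id "$sample")
echo "Job ID: ${job_id}"

# On the cluster this might be used as follows (assumption, not run here):
#   job_id=$(parse_job_id "$(bsub < [BatchScriptFile])")
#   bjobs "${job_id}"    # check the status of that job
#   bkill "${job_id}"    # cancel it if needed
```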
A useful reference for LSF commands can be found here.
The following script executes Albany with 8 MPI ranks across 2 nodes (4 ranks per node). Since each GPU pair is connected to a socket, `--map-by ppr:2:socket` is used to set 2 MPI ranks per socket. `--kokkos-ndevices=4` is used to set the number of GPUs used per node.
#!/bin/bash -login
#BSUB -J MPIGPUjob # Job Name
#BSUB -o MPIGPUjob.%J.out # Standard output filename (%J is the job number)
#BSUB -e MPIGPUjob.%J.err # Standard error filename
#BSUB -q rhel7G # Queue Name
#BSUB -m "ride27 ride28" # Node Names
#BSUB -n 8 # Number of processors
#BSUB -R "span[ptile=4]" # Number of processors per node
#BSUB -W 02:00 # Runtime limit [Hours]:[Minutes]
#BSUB -x # No other jobs can run on this node
# Disable core dump files to limit disk usage
ulimit -c 0
# Load modules
source ${HOME}/Albany/doc/ride-white/modules_cuda.sh
# Run MPIGPU job
mpirun -n 8 --map-by ppr:2:socket [AlbanyExecutable] [InputFile] --kokkos-ndevices=4
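The rank counts in the script above follow from the node layout: 2 nodes, 2 sockets per node, and 2 GPUs per socket, with one MPI rank per GPU. As a rough sketch, the arithmetic can be written out so that the `#BSUB` values and `mpirun` flags stay consistent when the node count changes. The function name `albany_layout` and its arguments are hypothetical, and the two-sockets-per-node figure is inferred from the text (4 GPUs per node, one GPU pair per socket) rather than stated for every queue.

```shell
#!/bin/bash
# Hypothetical helper: derive LSF and mpirun settings from the node layout.
# Arguments: number of nodes, sockets per node, GPUs per socket.
# Assumes one MPI rank per GPU, as in the example script.
albany_layout() {
  nodes=$1
  sockets_per_node=$2
  gpus_per_socket=$3
  ranks_per_node=$((sockets_per_node * gpus_per_socket))
  total_ranks=$((nodes * ranks_per_node))
  echo "#BSUB -n ${total_ranks}"
  echo "#BSUB -R \"span[ptile=${ranks_per_node}]\""
  echo "mpirun -n ${total_ranks} --map-by ppr:${gpus_per_socket}:socket [AlbanyExecutable] [InputFile] --kokkos-ndevices=${ranks_per_node}"
}

# Layout from the example: 2 Garrison nodes, 2 sockets each, 2 GPUs per socket.
albany_layout 2 2 2
```

With the arguments `2 2 2` this reproduces the `-n 8`, `ptile=4`, `ppr:2:socket`, and `--kokkos-ndevices=4` values used in the batch script.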