We recommend that you work with your own personal fork of the summerschool Git repository. That way you can easily commit and push your own solutions to the exercises.
Before starting out, synchronize your fork with "Sync fork" in the GitHub web GUI.
We also recommend that you create a separate branch for your own work; see "Using local workstation" or "Using supercomputers" below for details.
Once you have forked the repository, you can sync your fork with the original repository (in case of updates) by running:
```
git pull https://github.com/csc-training/summerschool.git
```
The exercise assignments are provided in the various `README.md` files.
For most of the exercises, skeleton codes are provided both for Fortran and C/C++ in the corresponding subdirectory. Some exercise skeletons have sections marked with "TODO" for completing the exercises. In addition, all of the exercises have complete example codes (that can be compiled and run) in the `solutions` folder. Note that these are seldom the only or even the best way to solve the problem.
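As a made-up illustration of what a TODO-style skeleton might look like (this is not an actual exercise file, just a sketch of the convention):

```cpp
#include <cstdio>
#include <mpi.h>

int main(int argc, char *argv[])
{
    // TODO: initialize MPI

    // TODO: query the rank of this process and the total
    //       number of processes, and print them

    // TODO: finalize MPI
    return 0;
}
```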
If you have a working parallel program development environment on your laptop (Fortran or C/C++ compiler, MPI development library, etc.), you may use it for the exercises. Note, however, that no support for installing an MPI environment can be provided during the course. Otherwise, you can use the CSC supercomputers for carrying out the exercises.
Clone your personal fork into an appropriate directory:
```
git clone git@github.com:<my-github-id>/summerschool.git
```
Create a branch:
```
git checkout -b hpcss23
```
Exercises can be carried out using the LUMI supercomputer.
LUMI can be accessed via SSH using the provided username and SSH key pair:
```
ssh -i <path-to-private-key> <username>@lumi.csc.fi
```
All the exercises on the supercomputers should be carried out in the scratch disk area. The name of the scratch directory can be queried with the command `lumi-workspaces`. As the base directory is shared between members of the project, you should create your own directory:
```
cd /scratch/project_465000536/
mkdir -p $USER
cd $USER
```
In order to push code to your own fork, you need to add your SSH public key in LUMI to your GitHub account. The SSH key can be added via "Settings" -> "SSH and GPG keys" -> "New SSH key", by copy-pasting the output of the following command:
```
cat $HOME/.ssh/id_rsa.pub
```
Once successful, make sure you are in your personal workspace in the scratch area `/scratch/project_465000536/$USER`, clone the repository, and create a branch:
```
git clone git@github.com:<my-github-id>/summerschool.git
cd summerschool
git checkout -b hpcss23
```
If you haven't used Git on LUMI before, you also need to set your identity:
```
git config --global user.email "you@example.com"
git config --global user.name "Your Name"
```
The default editor for commit messages is `vim`; if you prefer something else, you can add, e.g.,
```
export EDITOR=nano
```
to the file `$HOME/.bashrc`.
For editing program source files you can use, e.g., the `nano` editor:
```
nano prog.f90
```
(The `^` in nano's shortcuts refers to the Ctrl key, i.e. in order to save the file and exit the editor press Ctrl+X.)
Other popular editors such as Emacs and Vim are also available.
LUMI has several programming environments. For the summer school, we recommend that you use the special summer school modules:
```
module use /project/project_465000536/modules
```
For CPU programming use:
```
module load hpcss/cpu
```
For GPU programming use:
```
module load hpcss/gpu
```
Compilation of MPI programs can be performed with the `CC`, `cc`, or `ftn` wrapper commands:
```
CC -o my_mpi_exe test.cpp
```
or
```
cc -o my_mpi_exe test.c
```
or
```
ftn -o my_mpi_exe test.f90
```
The wrapper commands automatically include all the flags needed for building MPI programs.
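If you want a quick sanity check of the MPI toolchain, a minimal test program along the following lines (a generic sketch, not part of the course material) can be built with, e.g., `CC -o my_mpi_exe test.cpp`:

```cpp
#include <cstdio>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, ntasks;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // id of this MPI task
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks); // total number of MPI tasks

    printf("Hello from rank %d of %d\n", rank, ntasks);

    MPI_Finalize();
    return 0;
}
```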
Pure OpenMP (as well as serial) programs can also be compiled with the `CC`, `cc`, and `ftn` wrapper commands. OpenMP is enabled with the `-fopenmp` flag:
```
CC -o my_exe test.cpp -fopenmp
```
or
```
cc -o my_exe test.c -fopenmp
```
or
```
ftn -o my_exe test.f90 -fopenmp
```
When the code also uses MPI, the wrapper commands automatically include all the flags needed for building MPI programs.
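Similarly, a minimal OpenMP test program (again just a generic sketch, not an exercise file) can be built with, e.g., `CC -o my_exe test.cpp -fopenmp`:

```cpp
#include <cstdio>
#include <omp.h>

int main()
{
    // Each thread in the parallel region reports its own id
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("Hello from thread %d of %d\n", tid, nthreads);
    }
    return 0;
}
```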
In order to use HDF5 on CSC supercomputers, you need to load the HDF5 module with MPI I/O support. The appropriate module on LUMI is
```
module load cray-hdf5-parallel/1.12.2.1
```
No special flags are needed for compiling and linking; the compiler wrappers take care of them automatically.
Usage on a local workstation may vary.
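As an illustration of what the MPI I/O support is used for, a minimal parallel HDF5 program (a sketch only, with a made-up file name, not an exercise solution) opens a file collectively roughly like this:

```cpp
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    // File access property list requesting MPI-IO
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    // All ranks create/open the file collectively
    hid_t file = H5Fcreate("test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    // ... create datasets and write data here ...

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```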
For OpenMP offloading to GPUs on LUMI, the following modules are required:
```
module load LUMI/22.08
module load partition/G
module load cce/15.0.1
module load rocm/5.3.3
```
To compile your program, use
```
CC -fopenmp <source.cpp>
```
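For reference, a minimal OpenMP offload test (a generic sketch; the actual exercises come with their own skeletons) that could be built with the command above:

```cpp
#include <cstdio>

int main()
{
    const int n = 1000;
    double x[n];

    // Initialize on the host
    for (int i = 0; i < n; i++) x[i] = 1.0;

    // Offload the loop to the GPU; map moves the data to/from the device
    #pragma omp target teams distribute parallel for map(tofrom: x[0:n])
    for (int i = 0; i < n; i++) {
        x[i] = 2.0 * x[i];
    }

    printf("x[0] = %f\n", x[0]);
    return 0;
}
```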
For HIP, use the following modules:
```
module load LUMI/22.08
module load partition/G
module load cce/15.0.1
module load rocm/5.3.3
```
To compile your program, use:
```
CC -xhip <source.cpp>
```
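A minimal HIP test program (a generic sketch, not part of the course material) that can be compiled with `CC -xhip`:

```cpp
#include <cstdio>
#include <hip/hip_runtime.h>

// Kernel doubling every element of the array
__global__ void double_all(double *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0;
}

int main()
{
    const int n = 1000;
    double x[n];
    for (int i = 0; i < n; i++) x[i] = 1.0;

    double *d_x;
    hipMalloc((void **) &d_x, n * sizeof(double));
    hipMemcpy(d_x, x, n * sizeof(double), hipMemcpyHostToDevice);

    double_all<<<(n + 255) / 256, 256>>>(d_x, n);

    hipMemcpy(x, d_x, n * sizeof(double), hipMemcpyDeviceToHost);
    hipFree(d_x);

    printf("x[0] = %f\n", x[0]);
    return 0;
}
```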
For HIPFORT, the following modules are required:
```
module load LUMI/22.08
module load partition/G
module load cce/15.0.1
module load rocm/5.3.3
```
Because the default HIPFORT installation only supports gfortran, we use a custom installation prepared in the summer school project. This package provides Fortran modules compatible with the Cray Fortran compiler, as well as direct use of hipfort with the Cray Fortran compiler wrapper (`ftn`).
The package was installed via:
```
git clone https://github.com/ROCmSoftwarePlatform/hipfort.git
cd hipfort
mkdir build
cd build
cmake -DHIPFORT_INSTALL_DIR=<path-to>/HIPFORT -DHIPFORT_COMPILER_FLAGS="-ffree -eZ" -DHIPFORT_COMPILER=ftn -DHIPFORT_AR=${CRAY_BINUTILS_BIN_X86_64}/ar -DHIPFORT_RANLIB=${CRAY_BINUTILS_BIN_X86_64}/ranlib ..
make -j 64
make install
```
We will use the Cray `ftn` compiler wrapper as you would to compile any Fortran code, plus some additional flags:
```
export HIPFORT_HOME=/project/project_465000536/appl/HIPFORT
ftn -I$HIPFORT_HOME/include/hipfort/amdgcn "-DHIPFORT_ARCH=\"amd\"" -L$HIPFORT_HOME/lib -lhipfort-amdgcn $LIB_FLAGS -c <fortran_code>.f90
CC -xhip -c <hip_kernels>.cpp
ftn -I$HIPFORT_HOME/include/hipfort/amdgcn "-DHIPFORT_ARCH=\"amd\"" -L$HIPFORT_HOME/lib -lhipfort-amdgcn $LIB_FLAGS -o main <fortran_code>.o hip_kernels.o
```
This option gives enough flexibility for calling HIP libraries from Fortran or for a mix of OpenMP/OpenACC offloading to GPUs and HIP kernels/libraries.
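As a rough sketch of what the `<hip_kernels>.cpp` side of such a mixed build might contain (hypothetical names; the Fortran code would call the launcher through an `iso_c_binding` interface):

```cpp
#include <hip/hip_runtime.h>

// Simple HIP kernel scaling an array on the GPU
__global__ void scale_kernel(double *x, double a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// extern "C" exposes a plain C symbol that a Fortran interface
// block with bind(C, name="launch_scale") can call directly
extern "C" void launch_scale(double *x, double a, int n)
{
    dim3 threads(256);
    dim3 blocks((n + threads.x - 1) / threads.x);
    scale_kernel<<<blocks, threads>>>(x, a, n);
    hipDeviceSynchronize();
}
```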
Programs need to be executed via the batch job system. A simple job running with 4 MPI tasks can be submitted with the following batch job script:
```
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --account=project_465000536
#SBATCH --partition=standard
#SBATCH --reservation=summerschool_standard
#SBATCH --time=00:05:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1

srun my_mpi_exe
```
Save the script e.g. as `job.sh` and submit it with `sbatch job.sh`.
The output of the job will be in the file `slurm-xxxxx.out`. You can check the status of your jobs with `squeue -u $USER` and kill possible hanging applications with `scancel JOBID`.
The `summerschool` reservations are available during the course days and are accessible only with the training user accounts.
For pure OpenMP programs one should use only one node and a single MPI task per node, and specify the number of cores reserved for threading with `--cpus-per-task`:
```
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --account=project_465000536
#SBATCH --partition=standard
#SBATCH --reservation=summerschool_standard
#SBATCH --time=00:05:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4

# Set the number of threads based on --cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun my_omp_exe
```
For hybrid MPI+OpenMP programs it is recommended to explicitly specify the number of nodes, the number of MPI tasks per node (pure OpenMP programs being a special case with one node and one task per node), and the number of cores reserved for threading. The number of nodes is specified with `--nodes` (for most of the exercises you should use only a single node), the number of MPI tasks per node with `--ntasks-per-node`, and the number of cores reserved for threading with `--cpus-per-task`. The actual number of threads is specified with the `OMP_NUM_THREADS` environment variable.
A job running with 32 MPI tasks per node on 2 nodes and 4 OpenMP threads per MPI task can be submitted with the following batch job script:
```
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --account=project_465000536
#SBATCH --partition=standard
#SBATCH --reservation=summerschool_standard
#SBATCH --time=00:05:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=4

# Set the number of threads based on --cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun my_exe
```
When running GPU programs, a few changes need to be made to the batch job script. The partition is now different, and one must also explicitly request a given number of GPUs per node with the `--gpus-per-node=8` option. As an example, in order to use a single GPU with a single MPI task and a single thread, use:
```
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --account=project_465000536
#SBATCH --partition=standard-g
#SBATCH --reservation=summerschool_standard-g
#SBATCH --gpus-per-node=8
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:05:00

srun my_gpu_exe
```
In most MPI implementations a parallel program can be started with the `mpiexec` launcher:
```
mpiexec -n 4 ./my_mpi_exe
```
On most workstations, programs built with OpenMP use as many threads as there are CPU cores (note that this may also include "logical" cores with simultaneous multithreading). A pure OpenMP program can normally be started with a specific number of threads with
```
OMP_NUM_THREADS=4 ./my_exe
```
and a hybrid MPI+OpenMP program e.g. with
```
OMP_NUM_THREADS=4 mpiexec -n 2 ./my_exe
```
See the MPI debugging exercise, CSC user guide, and LUMI documentation for possible debugging options.
TAU was installed in the course project roughly as follows (the subsequent `configure` and `make` runs rebuild TAU with ROCm profiler and tracer support):
```
# Create installation directory
mkdir -p .../appl/tau
cd .../appl/tau

# Download TAU
wget https://www.cs.uoregon.edu/research/tau/tau_releases/tau-2.32.tar.gz
tar xvf tau-2.32.tar.gz
mv tau-2.32 2.32

# Go to TAU directory
cd 2.32

./configure -bfd=download -otf=download -unwind=download -dwarf=download -iowrapper -cc=cc -c++=CC -fortran=ftn -pthread -mpi -phiprof -papi=/opt/cray/pe/papi/6.0.0.15/
make -j 64

./configure -bfd=download -otf=download -unwind=download -dwarf=download -iowrapper -cc=cc -c++=CC -fortran=ftn -pthread -mpi -papi=/opt/cray/pe/papi/6.0.0.15/ -rocm=/appl/lumi/SW/LUMI-22.08/G/EB/rocm/5.3.3/ -rocprofiler=/appl/lumi/SW/LUMI-22.08/G/EB/rocm/5.3.3/rocprofiler
make -j 64

./configure -bfd=download -otf=download -unwind=download -dwarf=download -iowrapper -cc=cc -c++=CC -fortran=ftn -pthread -mpi -papi=/opt/cray/pe/papi/6.0.0.15/ -rocm=/appl/lumi/SW/LUMI-22.08/G/EB/rocm/5.3.3/ -roctracer=/appl/lumi/SW/LUMI-22.08/G/EB/rocm/5.3.3/roctracer
make -j 64
```
TAU and Omniperf can be used to do performance analysis.
In order to use TAU, one only has to load the modules needed to run the application and set the paths to the TAU installation folder:
```
export TAU=/project/project_465000536/appl/tau/2.32/craycnl
export PATH=$TAU/bin:$PATH
```
Profiling an MPI code:
```
srun --cpus-per-task=1 --account=project_465000536 --nodes=1 --ntasks-per-node=4 --partition=standard --time=00:05:00 --reservation=summerschool_standard tau_exec -ebs ./mandelbrot
```
In order to see `paraprof` in the browser, use VNC:
```
module load lumi-vnc
start-vnc
```
Visualize:
```
paraprof
```
Tracing:
```
export TAU_TRACE=1
srun --cpus-per-task=1 --account=project_465000536 --nodes=1 --ntasks-per-node=4 --partition=standard --time=00:05:00 --reservation=summerschool_standard tau_exec -ebs ./mandelbrot
tau_treemerge.pl
tau_trace2json tau.trc tau.edf -chrome -ignoreatomic -o app.json
```
Copy `app.json` to your local computer, open ui.perfetto.dev, and then load the `app.json` file.
For the client-side installation of Omniperf, see https://amdresearch.github.io/omniperf/installation.html#client-side-installation
In order to use Omniperf on LUMI, load the following modules:
```
module use /project/project_465000536/Omni/omniperf/modulefiles
module load omniperf
module load cray-python
```
Profile the application and analyze the results, e.g.:
```
srun -p standard-g --gpus 1 -N 1 -n 1 -c 1 --time=00:30:00 --account=project_465000536 omniperf profile -n workload_xy --roof-only --kernel-names -- ./heat_hip
omniperf analyze -p workloads/workload_xy/mi200/ > analyse_xy.txt
```
In addition to this, one has to load the usual modules for running on GPUs. Keep in mind that the above installation was done with `rocm/5.3.3`.
It is useful to add the following flags when compiling the application to be analysed: `-g -gdwarf-4`.
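For example, a hypothetical build line (the actual source files and flags depend on the exercise) could look like:
```
CC -xhip -g -gdwarf-4 -o heat_hip heat_hip.cpp
```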
More information about TAU can be found in the TAU User Documentation, and about Omniperf in the Omniperf User Documentation.