System‐specific tuning and issues
Building DLA-Future on LUMI may end up linking to xpmem (indirectly, e.g. through hwloc). This can have a large detrimental impact on DLA-Future performance (up to 50% slower). Cray MPICH defaults to using xpmem for on-node messages if it is linked into an application. To explicitly opt out of using xpmem, set the environment variable MPICH_SMP_SINGLE_COPY_MODE=CMA.
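A minimal sketch of how this could look in a job script. Only the environment variable comes from the text above; the helper name `links_xpmem` and the executable path are hypothetical and serve as placeholders.

```shell
# Opt out of xpmem for on-node messages (setting described above).
export MPICH_SMP_SINGLE_COPY_MODE=CMA

# Hypothetical helper: report whether an executable pulls in xpmem,
# directly or through a dependency such as hwloc. It relies on ldd,
# so it only sees dynamically linked libraries.
links_xpmem() {
    ldd "$1" 2>/dev/null | grep -q xpmem
}

# Example use; ./my_dlaf_app is a placeholder, not a real DLA-Future binary.
if links_xpmem ./my_dlaf_app; then
    echo "xpmem is linked; MPICH_SMP_SINGLE_COPY_MODE=$MPICH_SMP_SINGLE_COPY_MODE"
fi
```

The check is optional: setting the variable is harmless when xpmem is not linked, so the export alone is enough in most job scripts.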
If using DLAF_WITH_MPI_GPU_SUPPORT=ON on LUMI, the environment variable MPICH_GPU_SUPPORT_ENABLED=1 must be set. In addition, if the application is built without Cray's compiler wrappers, you must ensure that the application links against libmpi_gtl_hsa.so. If this isn't done at link time, you may preload the library with the environment variable LD_PRELOAD=/opt/cray/pe/lib64/libmpi_gtl_hsa.so.0. This assumes the use of the system installation of HIP and MPICH.
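The two settings above can be combined as in the following sketch. The variable name `gtl_lib` is illustrative, and the existence check is an added precaution rather than something the text requires; the preload is only needed when the library was not linked at link time.

```shell
# Required when DLA-Future is built with DLAF_WITH_MPI_GPU_SUPPORT=ON on LUMI.
export MPICH_GPU_SUPPORT_ENABLED=1

# Path taken from the text above; assumes the system installation of HIP and MPICH.
gtl_lib=/opt/cray/pe/lib64/libmpi_gtl_hsa.so.0

# Preload the library only if it is present, keeping any existing LD_PRELOAD entries.
if [ -e "$gtl_lib" ]; then
    export LD_PRELOAD="$gtl_lib${LD_PRELOAD:+:$LD_PRELOAD}"
fi
```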
When using stackinator to build a HIP environment, the required HIP libraries are loaded dynamically from the environment by Cray MPICH. Unless HIP paths are added explicitly to LD_LIBRARY_PATH, GPU-aware MPI is likely to hang, in particular when using multiple nodes. Intra-node communication may work without setting the path. The path that should be added to LD_LIBRARY_PATH is the library directory of the hsa-rocr-dev package, e.g. with export LD_LIBRARY_PATH=$(spack location -i hsa-rocr-dev)/lib:$LD_LIBRARY_PATH. Some versions of HIP place the libraries under lib64.
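Because the library directory may be either lib or lib64, a small helper can add whichever exists. This is a sketch under the assumption that `spack` is on the PATH in the stackinator environment; the function name `add_hsa_libdir` is hypothetical.

```shell
# Hypothetical helper: prepend the existing library directories (lib and/or lib64)
# under the given installation prefix to LD_LIBRARY_PATH.
add_hsa_libdir() {
    prefix="$1"
    # Refuse an empty prefix so that /lib or /lib64 is never added by accident.
    [ -n "$prefix" ] || return 1
    for dir in "$prefix/lib" "$prefix/lib64"; do
        if [ -d "$dir" ]; then
            export LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
        fi
    done
}

# Query spack only when it is available and the package is installed.
if command -v spack >/dev/null 2>&1; then
    if hsa_prefix=$(spack location -i hsa-rocr-dev 2>/dev/null); then
        add_hsa_libdir "$hsa_prefix"
    fi
fi
```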
When using GPU-aware MPI, communication may fail inside a GPU kernel with the error Memory access fault by GPU node-N. According to HPE, this is likely a bug in MPICH. The chance of this failure can be reduced by increasing the initial size of the Umpire memory pools in DLA-Future, e.g. to 16 GiB (the default is 1 GiB). This can be done with:
export DLAF_UMPIRE_HOST_MEMORY_POOL_INITIAL_BYTES=$((1 << 34))
export DLAF_UMPIRE_DEVICE_MEMORY_POOL_INITIAL_BYTES=$((1 << 34))
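The value `1 << 34` is 16 GiB expressed in bytes, since 1 GiB is 2^30 bytes. To try a different pool size, the byte count can be derived from a size in GiB, as in this sketch; the variable names `pool_gib` and `pool_bytes` are illustrative.

```shell
# Desired initial pool size in GiB (16 matches the example above).
pool_gib=16

# 1 GiB = 2^30 bytes, so shifting left by 30 converts GiB to bytes.
pool_bytes=$((pool_gib << 30))

export DLAF_UMPIRE_HOST_MEMORY_POOL_INITIAL_BYTES=$pool_bytes
export DLAF_UMPIRE_DEVICE_MEMORY_POOL_INITIAL_BYTES=$pool_bytes

echo "$pool_bytes"   # prints 17179869184, the same value as $((1 << 34))
```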
Setting export FI_MR_CACHE_MAX_COUNT=0 may avoid hangs during shutdown.
Not setting export FI_MR_CACHE_MAX_COUNT=0 may significantly degrade performance compared to other HIP versions. Setting it restores performance to values similar to those with 5.2.3.
MPICH may deadlock on larger input matrices either without warning or explicitly with the following warning:
PE 31: MPICH WARNING: OFI is failing to make progress on posting a send. MPICH suspects a hang due to rendezvous message resource exhaustion. If running Slingshot 2.1 or later, setting environment variable FI_CXI_DEFAULT_TX_SIZE large enough to handle the maximum number of outstanding rendezvous messages per rank should prevent this scenario. [ If running on a Slingshot release prior to version 2.1, setting environment variable FI_CXI_RDZV_THRESHOLD to a larger value may circumvent this scenario by sending more messages via the eager path.] OFI retry continuing...
In that case, setting export FI_CXI_RDZV_THRESHOLD=131072 or higher may help avoid hangs.
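As a sketch, the threshold can be kept in a variable so that it is easy to raise if hangs persist. The variable name `rdzv_threshold` is illustrative; 131072 equals 128 * 1024.

```shell
# Starting value suggested above: 128 * 1024 = 131072.
rdzv_threshold=$((128 * 1024))

export FI_CXI_RDZV_THRESHOLD=$rdzv_threshold

echo "FI_CXI_RDZV_THRESHOLD=$FI_CXI_RDZV_THRESHOLD"   # prints FI_CXI_RDZV_THRESHOLD=131072
```

If the hang still occurs, increasing the multiplier is one way to try "or higher" values; no specific upper bound is given here.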