Description
The he2hb sub-routine of the heev eigenvalue solver crashes when run with multiple MPI processes. Details are as follows:
[mshuangzd@b13r2n09 test]$ mpirun -n 4 ./tester --dim 1000 --nb 100 --target d he2hb
% SLATE version 2023.11.05, id f1c8490
% input: ./tester --dim 1000 --nb 100 --target d he2hb
% 2024-11-28 15:25:09, 4 MPI ranks, CPU-only MPI, 1 OpenMP threads, 4 GPU devices per MPI rank
type origin target A uplo n nb ib p q pt error time (s) gflop/s status
terminate called after throwing an instance of 'std::out_of_range'
what(): map::at
[b13r2n09:29260] *** Process received signal ***
[b13r2n09:29260] Signal: Aborted (6)
[b13r2n09:29260] Signal code: (-6)
[b13r2n09:29260] [ 0] /usr//lib64/libpthread.so.0(+0xf5d0)[0x2b04433645d0]
[b13r2n09:29260] [ 1] /usr//lib64/libc.so.6(gsignal+0x37)[0x2b047c042207]
[b13r2n09:29260] [ 2] /usr//lib64/libc.so.6(abort+0x148)[0x2b047c0438f8]
[b13r2n09:29260] [ 3] /usr//lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x165)[0x2b047bd657d5]
[b13r2n09:29260] [ 4] /usr//lib64/libstdc++.so.6(+0x5e746)[0x2b047bd63746]
[b13r2n09:29260] [ 5] /usr//lib64/libstdc++.so.6(+0x5e773)[0x2b047bd63773]
[b13r2n09:29260] [ 6] /public/home/mshuangzd/slate-commit/build/libslate.so(+0xe86ab)[0x2b0443d816ab]
[b13r2n09:29260] *** End of error message ***
terminate called after throwing an instance of 'slate::Exception'
what(): Error copying tile(8, 0), rank(3), invalid source -2 -> 1 in tileGet at /public/home/mshuangzd/slate-commit/include/slate/BaseMatrix.hh:2723
[b13r2n09:29262] *** Process received signal ***
[b13r2n09:29262] Signal: Aborted (6)
[b13r2n09:29262] Signal code: (-6)
[b13r2n09:29262] [ 0] /usr//lib64/libpthread.so.0(+0xf5d0)[0x2b1e5d9e75d0]
[b13r2n09:29262] [ 1] /usr//lib64/libc.so.6(gsignal+0x37)[0x2b1e966c5207]
[b13r2n09:29262] [ 2] /usr//lib64/libc.so.6(abort+0x148)[0x2b1e966c68f8]
[b13r2n09:29262] [ 3] /usr//lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x165)[0x2b1e963e87d5]
[b13r2n09:29262] [ 4] /usr//lib64/libstdc++.so.6(+0x5e746)[0x2b1e963e6746]
[b13r2n09:29262] [ 5] /usr//lib64/libstdc++.so.6(+0x5e773)[0x2b1e963e6773]
[b13r2n09:29262] [ 6] /public/home/mshuangzd/slate-commit/build/libslate.so(+0xe86ab)[0x2b1e5e4046ab]
[b13r2n09:29262] *** End of error message ***
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 1 with PID 0 on node b13r2n09 exited on signal 6 (Aborted).
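For readers of the log above: the first abort message (`std::out_of_range`, `what(): map::at`) is what libstdc++ reports when `std::map::at` is called with a key that is not in the map and the exception is never caught. The sketch below shows only that standard-library behavior. `TileIndex`, `TileMap`, and `lookup_throws` are hypothetical names made up for illustration; they are not SLATE's actual data structures, and the sketch makes no claim about where in SLATE the failing lookup happens.

```cpp
#include <map>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for a table keyed by tile indices (i, j).
// This is an assumption for illustration, not SLATE's real tile storage.
using TileIndex = std::pair<int, int>;
using TileMap = std::map<TileIndex, int>;

// Returns true if looking up `key` with map::at throws std::out_of_range,
// i.e. the key is absent. If such an exception is left uncaught, the C++
// runtime calls std::terminate, which aborts the process (signal 6).
bool lookup_throws(const TileMap& tiles, TileIndex key)
{
    try {
        (void) tiles.at(key);  // at() throws when the key is missing
        return false;
    }
    catch (const std::out_of_range&) {
        return true;
    }
}
```

With a map that holds only tile (0, 0), `lookup_throws(tiles, {0, 0})` returns false and `lookup_throws(tiles, {8, 0})` returns true. Without the `try`/`catch`, the second lookup would end the program through the terminate handler, which matches the `__verbose_terminate_handler` frame in the backtrace.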
Steps To Reproduce
Run the he2hb test from SLATE's tester with 4 MPI ranks: mpirun -n 4 ./tester --dim 1000 --nb 100 --target d he2hb
If it can be reproduced with SLATE's testers, that is best. Otherwise, including a minimal reproducer is super helpful.
Environment
The more information that you can provide about your environment, the simpler it is for us to understand and reproduce the issue.
SLATE version / commit ID (e.g., git log --oneline -n 1): SLATE version 2023.11.05, id f1c8490
How installed:
git clone
How compiled:
CMake (include your command line options): cmake -Dgpu_backend=hip -Dblas=mkl -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=hipcc -Dbuild_tests=true ..
Compiler & version (e.g., mpicxx --version): intel mpi2021
BLAS library (e.g., MKL, ESSL, OpenBLAS) & version: intel MKL2021
CUDA / ROCm / oneMKL version (e.g., nvcc --version): intel MKL2021
MPI library & version (MPICH, Open MPI, Intel MPI, IBM Spectrum, Cray MPI, etc. Sometimes mpicxx -v gives info.): intel mpi2021
OS: linux
Hardware (CPUs, GPUs, nodes): MI50 rocm5.7