he2hb Multi-Process Crash Issue #202

Open
2 tasks done
hualiang123 opened this issue Nov 28, 2024 · 0 comments


hualiang123 commented Nov 28, 2024

Description
The he2hb routine, a sub-step of the eigenvalue solver heev, crashes when run with multiple MPI processes. Details are as follows:
[mshuangzd@b13r2n09 test]$ mpirun -n 4 ./tester --dim 1000 --nb 100 --target d he2hb
% SLATE version 2023.11.05, id f1c8490
% input: ./tester --dim 1000 --nb 100 --target d he2hb
% 2024-11-28 15:25:09, 4 MPI ranks, CPU-only MPI, 1 OpenMP threads, 4 GPU devices per MPI rank

type origin target A uplo n nb ib p q pt error time (s) gflop/s status
terminate called after throwing an instance of 'std::out_of_range'
what(): map::at
[b13r2n09:29260] *** Process received signal ***
[b13r2n09:29260] Signal: Aborted (6)
[b13r2n09:29260] Signal code: (-6)
[b13r2n09:29260] [ 0] /usr//lib64/libpthread.so.0(+0xf5d0)[0x2b04433645d0]
[b13r2n09:29260] [ 1] /usr//lib64/libc.so.6(gsignal+0x37)[0x2b047c042207]
[b13r2n09:29260] [ 2] /usr//lib64/libc.so.6(abort+0x148)[0x2b047c0438f8]
[b13r2n09:29260] [ 3] /usr//lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x165)[0x2b047bd657d5]
[b13r2n09:29260] [ 4] /usr//lib64/libstdc++.so.6(+0x5e746)[0x2b047bd63746]
[b13r2n09:29260] [ 5] /usr//lib64/libstdc++.so.6(+0x5e773)[0x2b047bd63773]
[b13r2n09:29260] [ 6] /public/home/mshuangzd/slate-commit/build/libslate.so(+0xe86ab)[0x2b0443d816ab]
[b13r2n09:29260] *** End of error message ***
terminate called after throwing an instance of 'slate::Exception'
what(): Error copying tile(8, 0), rank(3), invalid source -2 -> 1 in tileGet at /public/home/mshuangzd/slate-commit/include/slate/BaseMatrix.hh:2723
[b13r2n09:29262] *** Process received signal ***
[b13r2n09:29262] Signal: Aborted (6)
[b13r2n09:29262] Signal code: (-6)
[b13r2n09:29262] [ 0] /usr//lib64/libpthread.so.0(+0xf5d0)[0x2b1e5d9e75d0]
[b13r2n09:29262] [ 1] /usr//lib64/libc.so.6(gsignal+0x37)[0x2b1e966c5207]
[b13r2n09:29262] [ 2] /usr//lib64/libc.so.6(abort+0x148)[0x2b1e966c68f8]
[b13r2n09:29262] [ 3] /usr//lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x165)[0x2b1e963e87d5]
[b13r2n09:29262] [ 4] /usr//lib64/libstdc++.so.6(+0x5e746)[0x2b1e963e6746]
[b13r2n09:29262] [ 5] /usr//lib64/libstdc++.so.6(+0x5e773)[0x2b1e963e6773]
[b13r2n09:29262] [ 6] /public/home/mshuangzd/slate-commit/build/libslate.so(+0xe86ab)[0x2b1e5e4046ab]
[b13r2n09:29262] *** End of error message ***

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun noticed that process rank 1 with PID 0 on node b13r2n09 exited on signal 6 (Aborted).
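
For context on the first error in the log: `what(): map::at` is the message libstdc++ attaches to the `std::out_of_range` thrown when `std::map::at` is called with a key that is not in the map. When such an exception is not caught, `std::terminate` runs and the process aborts with signal 6, which matches the trace above. The sketch below only illustrates that failure mode. It is not SLATE code, and `TileTable`, `TileIndex`, and `describe_lookup` are hypothetical names. Whether the crash really comes from a tile missing from a per-rank map is an assumption, not something the log confirms.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a per-rank table of tiles keyed by (i, j).
// This is NOT SLATE's data structure; it only illustrates the failure mode.
using TileIndex = std::pair<int, int>;
using TileTable = std::map<TileIndex, int>;

// Looks up tile (i, j) and reports the outcome as a string.
std::string describe_lookup(const TileTable& tiles, int i, int j)
{
    try {
        // std::map::at throws std::out_of_range when the key is absent.
        int value = tiles.at(TileIndex{i, j});
        return "found " + std::to_string(value);
    }
    catch (const std::out_of_range&) {
        // Without this handler the exception would reach std::terminate
        // and the process would abort, as in the trace above.
        return "missing tile (" + std::to_string(i) + ", "
               + std::to_string(j) + ")";
    }
}
```

Calling `describe_lookup` with an index that was never inserted returns the "missing tile" string instead of aborting, which shows where a check or handler would be needed if the assumption above holds.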

Steps To Reproduce

  1. Run the he2hb tester from the test suite: mpirun -n 4 ./tester --dim 1000 --nb 100 --target d he2hb

If it can be reproduced with SLATE's testers, that is best. Otherwise, including a minimal reproducer is super helpful.

Environment
The more information that you can provide about your environment, the simpler it is for us to understand and reproduce the issue.

  • SLATE version / commit ID (e.g., git log --oneline -n 1): SLATE version 2023.11.05, id f1c8490

  • How installed:

    • git clone
  • How compiled:

    • CMake (include your command line options): cmake -Dgpu_backend=hip -Dblas=mkl -DCMAKE_INSTALL_PREFIX=./install -DCMAKE_CXX_COMPILER=hipcc -Dbuild_tests=true ..
  • Compiler & version (e.g., mpicxx --version): Intel MPI 2021

  • BLAS library (e.g., MKL, ESSL, OpenBLAS) & version: Intel MKL 2021

  • CUDA / ROCm / oneMKL version (e.g., nvcc --version): Intel MKL 2021

  • MPI library & version (MPICH, Open MPI, Intel MPI, IBM Spectrum, Cray MPI, etc. Sometimes mpicxx -v gives info.): Intel MPI 2021

  • OS: Linux

  • Hardware (CPUs, GPUs, nodes): MI50, ROCm 5.7
