Description
I have been working with a slightly modified version of the eigenvalues demo script (included below). The code runs successfully (no errors and, as far as I can tell, correct output) on matrices up to 10k x 10k using an 8-node grid with 8 MPI processes per node and 4 CPUs per MPI process. When I scale up to a 100k x 100k matrix on a 10-node grid with 5 MPI processes per node and 4 CPUs per MPI process, I get the error below (printed once for each of the 50 MPI processes):
terminate called after throwing an instance of 'std::domain_error'
what(): Input matrix contains Inf or NaN
[r6401:3333756] *** Process received signal ***
[r6401:3333756] Signal: Aborted (6)
[r6401:3333756] Signal code: (-6)
[r6401:3333758] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f59276f9090]
[r6401:3333758] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f59276f900b]
[r6401:3333758] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f59276d8859]
[r6401:3333758] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa4ee6)[0x7f59279c2ee6]
[r6401:3333758] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb6f8c)[0x7f59279d4f8c]
[r6401:3333758] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb6ff7)[0x7f59279d4ff7]
[r6401:3333758] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb7258)[0x7f59279d5258]
[r6401:3333758] [ 7] /path/to/spack/install/spack/opt/spack/linux-ubuntu20.04-haswell/gcc-9.4.0/slate-2023.11.05-5ysppah/lib/libslate.so(_ZN5slate5stedcIfEEvRSt6vectorIT_SaIS2_EES5_RNS_6MatrixIS2_EERKSt3mapINS_6OptionENS_11OptionValueESt4lessISA_ESaISt4pairIKSA_SB_EEE+0x4e5)[0x7f5929c00935]
[r6403:3333758] [ 9] fast_eigenvals(+0xbecd)[0x563f5b231ecd]
[r6403:3333758] [10] fast_eigenvals(+0x5876)[0x563f5b22b876]
[r6403:3333758] [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f36cef0e083]
[r6403:3333758] [12] fast_eigenvals(+0x59ee)[0x563f5b22b9ee]
[r6403:3333758] *** End of error message ***
At the end, I then see:
--------------------------------------------------------------------------
mpirun noticed that process rank 27 with PID 662523 on node r6406 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
Command exited with non-zero status 134
I trimmed the output here because the same message just repeats for every rank, but I can provide the full log if that would help.
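For reference, a quick finiteness check on A right before the slate::eig call in the code below would look roughly like the following. This is a sketch only (my own addition): it assumes slate::norm with Norm::Max accepts the SymmetricMatrix, and a max norm is not guaranteed to propagate NaN, so treat it as a heuristic rather than a definitive test.

// Sketch only: heuristic check that the generated input is finite.
// Intended to go right before the slate::eig( A, Lambda, X ) call below.
// Assumes slate::norm( Norm::Max, A ) works on slate::SymmetricMatrix<T>;
// requires #include <cmath> for std::isfinite.
double amax = double( slate::norm( slate::Norm::Max, A ) );
if (mpi_rank == 0) {
    printf( "max |A(i,j)| = %e (finite: %d)\n", amax, (int) std::isfinite( amax ) );
}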
Steps To Reproduce
Compile the script, linking BLASpp, LAPACKpp, Intel MKL, and SLATE. I also use the unmodified util.hh file from the SLATE examples tutorial (Tutorial Link).
Run the script as a batch job using OpenMPI (see the SLURM file below).
C++ Code
// Solve Hermitian eigenvalues A = X Lambda X^H
#include <slate/slate.hh>
#include "util.hh"
#include <string>
// Include I/O
#include <iostream>
#include <fstream>
#include <unistd.h>

int mpi_size = 0;
int mpi_rank = 0;

//------------------------------------------------------------------------------
template <typename T>
void hermitian_eig(std::string& output_file)
{
    using real_t = blas::real_type<T>;

    print_func( mpi_rank );

    // Print hostname
    char hostname[256];
    gethostname( hostname, 256 );
    printf( "Rank %d of %d on %s\n", mpi_rank, mpi_size, hostname );

    // TODO: failing if n not divisible by nb?
    int64_t n=100000, nb=500, p=10, q=5;
    // Maybe can remove this
    assert(mpi_size == p*q);

    // Rank 0 prints out the parameters
    if (mpi_rank == 0) {
        printf( "Parameters: n %ld, nb %ld, p %ld, q %ld\n", n, nb, p, q );
    }

    //slate::HermitianMatrix<T> A( slate::Uplo::Lower, n, nb, p, q, MPI_COMM_WORLD );
    slate::SymmetricMatrix<T> A( slate::Uplo::Lower, n, nb, p, q, MPI_COMM_WORLD );
    A.insertLocalTiles();
    random_matrix( A );

    std::vector<real_t> Lambda( n );

    // A = X Lambda X^H, eigenvalues only
    // slate::eig_vals( A, Lambda );  // simplified API
    //slate::Matrix<T> Xempty;
    //slate::heev( slate::Job::NoVec, A, Lambda, Xempty );

    slate::Matrix<T> X( n, n, nb, p, q, MPI_COMM_WORLD );
    X.insertLocalTiles();
    slate::eig( A, Lambda, X );  // simplified API

    // // Write to file as csv
    // if (mpi_rank == 0) {
    //     std::ofstream file(output_file);
    //     for (int i = 0; i < n; i++) {
    //         file << Lambda[i] << "\n";
    //     }
    //     file << std::endl;
    //     file.close();
    // }
}

//------------------------------------------------------------------------------
int main( int argc, char** argv )
{
    int provided = 0;
    int err = MPI_Init_thread( &argc, &argv, MPI_THREAD_MULTIPLE, &provided );
    assert( err == 0 );
    assert( provided == MPI_THREAD_MULTIPLE );

    err = MPI_Comm_size( MPI_COMM_WORLD, &mpi_size );
    assert( err == 0 );
    // if (mpi_size != 4) {
    //     printf( "Usage: mpirun -np 4 %s  # 4 ranks hard coded\n", argv[0] );
    //     return -1;
    // }

    err = MPI_Comm_rank( MPI_COMM_WORLD, &mpi_rank );
    assert( err == 0 );

    // so random_matrix is different on different ranks.
    srand( 100 * mpi_rank );

    std::string output_file = "eigenvals.csv";
    hermitian_eig< float >( output_file );

    err = MPI_Finalize();
    assert( err == 0 );

    // Print finished from rank 0
    if (mpi_rank == 0) {
        printf( "Finished\n" );
    }
}
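For reference, the eigenvalues-only path that is commented out in hermitian_eig above would look roughly like this. This is a sketch only: it assumes the simplified slate::eig_vals overload accepts the SymmetricMatrix A and the existing Lambda vector, as the commented-out demo lines suggest, and it would replace the X allocation and the slate::eig( A, Lambda, X ) call.

// Sketch only: eigenvalues-only variant, replacing the X allocation and
// the slate::eig( A, Lambda, X ) call in hermitian_eig above.
// Assumes the simplified eig_vals API shown in the commented-out demo lines.
slate::eig_vals( A, Lambda );   // values only; no eigenvector matrix X is formed

Running the same 100k case with hermitian_eig< double >( output_file ) instead of float could also help distinguish a genuine Inf/NaN in the input from a single-precision overflow inside the tridiagonal divide-and-conquer step (slate::stedc<float> in the backtrace above).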
Corresponding SLURM File
Makefile
Environment
All software other than SLURM and gcc was installed through spack:
- SLATE version / commit ID (git log --oneline -n 1): [email protected]
- How installed (spack, or make.inc): spack
- Compiler & version (mpicxx --version): [email protected]
- [email protected]
- MPI library & version (mpicxx -v gives info.): [email protected]