ENABLE_PARALLELRESTART functionality produces failures on some UFS regression test platforms #368

Open
grantfirl opened this issue Jan 7, 2025 · 4 comments

Comments

@grantfirl

Describe the bug
This is related to the NOAA-EMC fork and dev/emc branch.

During testing of ufs-community/ufs-weather-model#2529 that included atmos_cubed_sphere PR NOAA-EMC#89, the following error was noted:

The control_restart_p8_intel test is failing with a segmentation fault.

The err log shows the first non-library frame as:
0x0000000002232a28 fv_io_mod_mp_fv_io_read_restart_() /scratch1/BMC/gmtb/Grant.Firl/ufs-weather-model-grantfirl/FV3/atmos_cubed_sphere/tools/fv_io.F90:495

To Reproduce
Turn ENABLE_PARALLELRESTART to ON in CMakeLists.txt and run the control_restart_p8_intel UFS regression test on Hera or Hercules (error was not reproduced on Acorn, according to @dkokron).

Expected behavior
The test completes without error.

System Environment
Hera and Hercules UFS RT environment

Additional context
See NOAA-EMC/fv3atm#896 for some related discussion.

@laurenchilutti
Contributor

I took a look at the issue and want to record the error in more detail for those who do not have access to Hera to look at the logs.
Comments on the fv3atm PR NOAA-EMC/fv3atm#896 point to a location on Hera.
I was able to find a traceback of the segmentation fault in:
/scratch1/BMC/gmtb/Grant.Firl/stmp2/Grant.Firl/FV3_RT/rt_1304460/control_restart_p8_intel/err:
120: forrtl: severe (174): SIGSEGV, segmentation fault occurred
120: Image PC Routine Line Source
120: fv3.exe 00000000041E194A Unknown Unknown Unknown
120: libpthread-2.28.s 000014ACE7D03D10 Unknown Unknown Unknown
120: libmpi.so.12.0.0 000014ACE86D0DB0 Unknown Unknown Unknown
120: libmpi.so.12.0.0 000014ACE84D7C79 Unknown Unknown Unknown
120: libmpi.so.12.0.0 000014ACE879D1DF Unknown Unknown Unknown
120: libmpi.so.12.0.0 000014ACE881603A Unknown Unknown Unknown
120: libmpi.so.12.0.0 000014ACE880F3CD Unknown Unknown Unknown
120: libmpi.so.12.0.0 000014ACE8130F55 Unknown Unknown Unknown
120: libmpi.so.12.0.0 000014ACE8130A56 Unknown Unknown Unknown
120: libmpi.so.12.0.0 000014ACE87A9E8E Unknown Unknown Unknown
120: libmpi.so.12.0.0 000014ACE87A9877 PMPI_Waitall Unknown Unknown
120: libmpi.so.12.0.0 000014ACE7F96920 Unknown Unknown Unknown
120: libmpi.so.12.0.0 000014ACE7F955CD Unknown Unknown Unknown
120: libmpi.so.12.0.0 000014ACE8B39219 Unknown Unknown Unknown
120: libmpi.so.12.0.0 000014ACE8B394D2 MPI_File_read_at_ Unknown Unknown
120: libhdf5.so.310.0. 000014ACE6F12492 Unknown Unknown Unknown
120: libhdf5.so.310.0. 000014ACE6F047CB H5FD_read Unknown Unknown
120: libhdf5.so.310.0. 000014ACE6ED9433 H5F__accum_read Unknown Unknown
120: libhdf5.so.310.0. 000014ACE70305E0 H5PB_read Unknown Unknown
120: libhdf5.so.310.0. 000014ACE6EE919E H5F_shared_block_ Unknown Unknown
120: libhdf5.so.310.0. 000014ACE6E90876 H5D__mpio_select_ Unknown Unknown
120: libhdf5.so.310.0. 000014ACE6E96562 Unknown Unknown Unknown
120: libhdf5.so.310.0. 000014ACE6E9DD95 Unknown Unknown Unknown
120: libhdf5.so.310.0. 000014ACE6E9FE3B H5D__collective_r Unknown Unknown
120: libhdf5.so.310.0. 000014ACE6E8BE1E H5D__read Unknown Unknown
120: libhdf5.so.310.0. 000014ACE715C9F4 H5VL__native_data Unknown Unknown
120: libhdf5.so.310.0. 000014ACE7144E7D H5VL_dataset_read Unknown Unknown
120: libhdf5.so.310.0. 000014ACE6E4D9CA H5Dread Unknown Unknown
120: libnetcdf.so.19.2 000014ACEBBB2C5C NC4_get_vars Unknown Unknown
120: libnetcdf.so.19.2 000014ACEBBB1EA6 NC4_get_vara Unknown Unknown
120: libnetcdf.so.19.2 000014ACEBB1493F NC_get_vara Unknown Unknown
120: libnetcdff.so.7.2 000014ACEB64FA6B nf_get_vara_real_ Unknown Unknown
120: libnetcdff.so.7.2 000014ACEB6BACAB netcdf_mp_nf90_ge Unknown Unknown
120: fv3.exe 0000000002650FF5 netcdf_io_mod_mp_ 483 netcdf_read_data.inc
120: fv3.exe 0000000002624C96 fms_netcdf_domain 373 domain_read.inc
120: fv3.exe 000000000261E60F fms_netcdf_domain 673 fms_netcdf_domain_io.F90
120: fv3.exe 0000000002232A28 fv_io_mod_mp_fv_i 495 fv_io.F90
120: fv3.exe 00000000022B95DA fv_restart_mod_mp 367 fv_restart.F90
120: fv3.exe 0000000001E39C86 atmosphere_mod_mp 460 atmosphere.F90
120: fv3.exe 0000000001CA9901 atmos_model_mod_m 569 atmos_model.F90
120: fv3.exe 0000000001AE8039 module_fcst_grid_ 816 module_fcst_grid_comp.F90
120: fv3.exe 0000000000A97464 Unknown Unknown Unknown
120: fv3.exe 0000000000A9B00F Unknown Unknown Unknown
120: fv3.exe 0000000000942A0A Unknown Unknown Unknown
120: fv3.exe 0000000001213CCF Unknown Unknown Unknown
120: fv3.exe 0000000000A988AA Unknown Unknown Unknown
120: fv3.exe 0000000000965B70 Unknown Unknown Unknown
120: fv3.exe 0000000000C9C181 Unknown Unknown Unknown
120: fv3.exe 0000000001AD8D46 fv3atm_cap_mod_mp 452 fv3_cap.F90
120: fv3.exe 00000000006A0878 Unknown Unknown Unknown
120: fv3.exe 00000000006A07DA Unknown Unknown Unknown
120: fv3.exe 00000000006A0E83 Unknown Unknown Unknown
120: fv3.exe 00000000004711CB Unknown Unknown Unknown
120: fv3.exe 0000000002419006 Unknown Unknown Unknown
120: fv3.exe 0000000000A97464 Unknown Unknown Unknown
120: fv3.exe 0000000000A9B00F Unknown Unknown Unknown
120: fv3.exe 00000000009427FA Unknown Unknown Unknown
120: fv3.exe 0000000001213CCF Unknown Unknown Unknown
120: fv3.exe 0000000000A988AA Unknown Unknown Unknown
120: fv3.exe 0000000000965B70 Unknown Unknown Unknown
120: fv3.exe 0000000000C9C181 Unknown Unknown Unknown
120: fv3.exe 00000000008E3BE0 Unknown Unknown Unknown
120: fv3.exe 000000000090C514 Unknown Unknown Unknown
120: fv3.exe 0000000000915CCB Unknown Unknown Unknown
120: fv3.exe 0000000000A97464 Unknown Unknown Unknown
120: fv3.exe 0000000000A9B00F Unknown Unknown Unknown
120: fv3.exe 00000000009427FA Unknown Unknown Unknown
120: fv3.exe 0000000001213CCF Unknown Unknown Unknown
120: fv3.exe 0000000000A988AA Unknown Unknown Unknown
120: fv3.exe 0000000000965B70 Unknown Unknown Unknown
120: fv3.exe 0000000000C9C181 Unknown Unknown Unknown
120: fv3.exe 000000000042E1FF MAIN__ 392 UFS.F90
120: fv3.exe 000000000042D022 Unknown Unknown Unknown
120: libc-2.28.so 000014ACE773D7E5 __libc_start_main Unknown Unknown
120: fv3.exe 000000000042CF2E Unknown Unknown Unknown
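
For anyone reading a traceback like this without the log at hand, here is a minimal sketch of how the frames with resolved source lines could be pulled out. It assumes each line has the layout `<rank>: image PC routine line source` as in the log above; the names `Frame`, `resolved_frames`, and `SAMPLE` are hypothetical and only for illustration. The sample lines are copied from the traceback in this comment.

```python
from typing import List, NamedTuple


class Frame(NamedTuple):
    image: str
    routine: str
    line: int
    source: str


def resolved_frames(log_text: str) -> List[Frame]:
    """Return the frames whose 'Line' column holds a number, in stack order."""
    frames = []
    for raw in log_text.splitlines():
        parts = raw.split()
        # Assumed layout: "<rank>:" image PC routine line source
        if len(parts) == 6 and parts[4].isdigit():
            frames.append(Frame(parts[1], parts[3], int(parts[4]), parts[5]))
    return frames


# A few lines copied from the traceback above.
SAMPLE = """\
120: forrtl: severe (174): SIGSEGV, segmentation fault occurred
120: Image PC Routine Line Source
120: libmpi.so.12.0.0 000014ACE87A9877 PMPI_Waitall Unknown Unknown
120: libnetcdff.so.7.2 000014ACEB6BACAB netcdf_mp_nf90_ge Unknown Unknown
120: fv3.exe 0000000002650FF5 netcdf_io_mod_mp_ 483 netcdf_read_data.inc
120: fv3.exe 0000000002624C96 fms_netcdf_domain 373 domain_read.inc
120: fv3.exe 000000000261E60F fms_netcdf_domain 673 fms_netcdf_domain_io.F90
120: fv3.exe 0000000002232A28 fv_io_mod_mp_fv_i 495 fv_io.F90
120: fv3.exe 00000000022B95DA fv_restart_mod_mp 367 fv_restart.F90
"""

if __name__ == "__main__":
    for frame in resolved_frames(SAMPLE):
        print(f"{frame.source}:{frame.line} ({frame.routine})")
```

Read this way, the innermost resolved frames appear to be in the FMS I/O sources (netcdf_read_data.inc, domain_read.inc, fms_netcdf_domain_io.F90), and fv_io.F90:495 is the first frame inside atmos_cubed_sphere, which matches the line quoted in the issue description.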

@laurenchilutti
Contributor

@grantfirl has @dkokron been made aware of this issue?

@grantfirl
Author

> @grantfirl has @dkokron been made aware of this issue?

Yes indeed. He tried to reproduce the error on Acorn without success.

@bensonr
Contributor

bensonr commented Jan 10, 2025

@grantfirl (cc @dkokron) - Looking at the error message @laurenchilutti included, I am inclined to believe this could be a resource issue. A SIGSEGV at a point where we are most likely asking for more memory down in the NetCDF/HDF5 layer is indicative of a lack of memory resources. The fact that it works on other machines makes that even more likely, in my mind. Can you double-check that the user environment running these tests sets the shell stack limit to unlimited in the shell startup rc/profile files?
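
One way to confirm what the job actually sees is to query the limit from inside the same environment that launches the test, since a batch job does not necessarily inherit the limits of a login shell. The following is a minimal sketch using Python's standard `resource` module; the function names `stack_limit_report` and `format_limit` are hypothetical, and running it inside the regression-test job script is an assumption about how it would be used.

```python
import resource


def stack_limit_report():
    """Return (soft, hard, soft_is_unlimited) for this process's stack limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return soft, hard, soft == resource.RLIM_INFINITY


def format_limit(value):
    """Render a limit in KiB, or 'unlimited' for RLIM_INFINITY."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} KiB"


if __name__ == "__main__":
    soft, hard, unlimited = stack_limit_report()
    print(f"soft stack limit: {format_limit(soft)}")
    print(f"hard stack limit: {format_limit(hard)}")
    if not unlimited:
        # A finite soft limit would be consistent with the resource
        # hypothesis above, but it does not prove it.
        print("soft stack limit is finite in this environment")
```

The equivalent interactive check is `ulimit -s` in the shell that starts the run. If the soft limit is finite there but unlimited in a login shell, the rc/profile setting is probably not reaching the job.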
