Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Uninitialised ssnow%rtevap_sat and ssnow%rtevap_unsat in worker processes #397

Open
SeanBryan51 opened this issue Sep 11, 2024 · 2 comments · May be fixed by #471
Open

Uninitialised ssnow%rtevap_sat and ssnow%rtevap_unsat in worker processes #397

SeanBryan51 opened this issue Sep 11, 2024 · 2 comments · May be fixed by #471
Assignees

Comments

@SeanBryan51
Copy link
Collaborator

Hacking a temporary fix for #395 and running CABLE-MPI offline (main branch - commit d45a62c) using the crujra_accessN96_1h:main-reduced-landmask configuration through the DDT debugger results in the following floating point exception:

forrtl: error (75): floating point exception
Image              PC                Routine            Line        Source             
cable-mpi          0000000000C6DD14  Unknown               Unknown  Unknown
libpthread-2.28.s  0000155548397D20  Unknown               Unknown  Unknown
cable-mpi          00000000008522E5  cable_canopy_modu         500  cable_canopy.F90
cable-mpi          00000000007F682F  cable_cbm_module_         184  cbl_model_driver_offline.F90
cable-mpi          00000000004949C4  cable_mpiworker_m         664  cable_mpiworker.F90
cable-mpi          000000000040E60D  MAIN__                     56  cable_mpidrv.F90
cable-mpi          000000000040DA22  Unknown               Unknown  Unknown
libc-2.28.so       0000155547C677E5  __libc_start_main     Unknown  Unknown
cable-mpi          000000000040D92E  Unknown               Unknown  Unknown

The error occurs on the following line of the code:

ssnow%isflag, REAL(ssnow%satfrac),ssnow%rtsoil, &

It looks like variables ssnow%rtevap_sat and ssnow%rtevap_unsat are uninitialised in all worker processes.

Steps to reproduce (Gadi)

Note: the reduced land mask input file requires access to /g/data/tm70.

  1. Checkout the CABLE main branch at commit d45a62c.
  2. Apply the following patch to fix the error described in Uninitialised variables used in veg%froot calculation #395 (WARNING - this patch is untested and should not be used for work other than reproducing this issue):
diff --git a/src/offline/cable_parameters.F90 b/src/offline/cable_parameters.F90
index 8dc6d28..2aa25f3 100644
--- a/src/offline/cable_parameters.F90
+++ b/src/offline/cable_parameters.F90
@@ -2937,11 +2937,11 @@ CONTAINS
     totdepth = 0.0
     DO is = 1, ms-1
        totdepth = totdepth + soil_zse(is) * 100.0  ! unit in centimetres
-       veg%froot(:, is) = MIN( 1.0, 1.0-veg%rootbeta(:)**totdepth )
+       veg%froot(ifmp:fmp, is) = MIN( 1.0, 1.0-veg%rootbeta(ifmp:fmp)**totdepth )
     END DO
-    veg%froot(:, ms) = 1.0 - veg%froot(:, ms-1)
+    veg%froot(ifmp:fmp, ms) = 1.0 - veg%froot(ifmp:fmp, ms-1)
     DO is = ms-1, 2, -1
-       veg%froot(:, is) = veg%froot(:, is)-veg%froot(:,is-1)
+    veg%froot(ifmp:fmp, is) = veg%froot(ifmp:fmp, is)-veg%froot(ifmp:fmp,is-1)
     END DO

   END SUBROUTINE init_veg_from_vegin
  1. Build the MPI executable with a debug build.
  2. Clone the crujra_accessN96_1h repository and change branch to main-reduced-landmask.
  3. Set exe in config.yaml to point to the absolute path of the CABLE-MPI executable.
  4. Setup the configuration with payu setup:
# via hh5 modules
module load conda/analysis3-24.07
payu setup
  1. Change directory into work and submit an interactive job with X11 forwarding enabled:
cd work
qsub -IX -l walltime=3600 -l ncpus=4 -l mem=16GB -l wd -j n -l storage=gdata/hh5+gdata/rp23+gdata/tm70
  1. Load modules (e.g. MPI implementation and linaro-forge):
module load intel-mpi/2019.5.281 linaro-forge/24.0.2
  1. Launch CABLE with DDT:
ddt mpirun -n $PBS_NCPUS ./cable-mpi
  1. Enable "Memory Debugging" (balanced settings) when prompted to run the debugging session.
  2. Run the executable and wait for the floating point exception to occur.
@SeanBryan51
Copy link
Collaborator Author

The ssnow%rtevap_sat and ssnow%rtevap_unsat variables seem to initialised in the master process on the following line:

ssnow%rtevap_sat = 0.0
ssnow%rtevap_unsat = 0.0

Stack trace of master process:

mpi_driver (cable_mpidrv.F90:54)
  cable_mpimaster::mpidrv_master (cable_mpimaster.F90:597)
    cable_input_module::load_parameters (cable_input.F90:2674)
      cable_param_module::write_default_params (cable_parameters.F90:1645)

The master_cable_params routine looks like it is responsible for communicating the initialised parameters in the master process to the workers - however it looks like the function has no knowledge of the ssnow%rtevap_unsat or ssnow%rtevap_sat variables. Might be a case where the developer forgot to add these variables to the master_cable_params routine.

@har917
Copy link
Collaborator

har917 commented Sep 11, 2024

@SeanBryan51 @ccarouge It appears that %rtevap_sat and %rtevap_unsat have slipped through rigorous testing of the MPI code in that they are fields in the ssnow TYPE that do not appear in the worker_cable_params section that receives information from the master. In practice this makes little/no difference as the configurations that use these variables (GW_model and or_evap) will initialise them in a routine called from cable_canopy

In serial offline these variables are specifically initialised (to zero) in cable_parameters - so rather than attempting to modify the MPI section it's possibly easiest to simply add

ssnow%rtevap_sat(:) = 0.0
ssnow%rtevap_unsat(:) = 0.0

at/around 532 in cable_mpiworker (which is where the other fields in the met, canopy and ssnow TYPEs that aren't propoerly broadcast are initialised).

As far as I know - other than possibly wishing to use these variables in an output - these can stay on the worker side (so this 'bug' is largely a result of how the MPI was implemented). Hopefully the parallel i/o work can address this.

I think the issue that @bibivking has found is different since that use case requires that members of the rad% TYPE are passed from the master to worker each time step - and this is not possible given the current structure of worker_intype()

@SeanBryan51 SeanBryan51 self-assigned this Nov 11, 2024
@SeanBryan51 SeanBryan51 linked a pull request Nov 12, 2024 that will close this issue
2 tasks
@SeanBryan51 SeanBryan51 linked a pull request Nov 12, 2024 that will close this issue
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants