SYCL failures for fluids example on SunSpot #1603

Open
jrwrigh opened this issue Jun 11, 2024 · 11 comments

@jrwrigh
Collaborator

jrwrigh commented Jun 11, 2024

I've seen some failures on SunSpot with the fluids examples. The general behavior is:

  • /gpu/sycl/ref passes fine every time
  • /gpu/sycl/shared fails about 90% of the time
  • /gpu/sycl/gen fails about 10% of the time

The failures appear on only a few tests. SunSpot is down for maintenance today, so I can't confirm exactly which ones right now, but I'm fairly certain the Gaussian wave tests are among them. The behavior above is pretty consistent, though. This was observed using oneapi/release/2024.04.15.001.

The failure specifically is a non-linear solver divergence:

[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: TSStep has failed due to DIVERGED_NONLINEAR_SOLVE, increase -ts_max_snes_failures or make negative to attempt recovery

Given the relationships between the backends (ref passes, while shared and gen both fail), I'm guessing the error is in code that the shared and gen backends have in common.

Tagging @kris-rowe @uumesh

@jrwrigh
Collaborator Author

jrwrigh commented Sep 28, 2024

Running on Sunspot with the following environment:

Currently Loaded Modules:
  1) spack-pe-gcc/0.7.0-24.086.0   5) gcc/12.2.0                             9) oneapi/eng-compiler/2024.04.15.002  13) fmt/8.1.1-enhyvzg              17) re2/2023-09-01-7s7ikri        21) bear/3.0.20
  2) gmp/6.2.1-pcxzkau             6) mpich/icc-all-pmix-gpu/20231026       10) libfabric/1.15.2.0                  14) abseil-cpp/20230125.3-af6loxb  18) grpc/1.44.0-xoe6dyh           22) tmux/3.3a
  3) mpfr/4.2.0-w7v7yjv            7) mpich-config/collective-tuning/1024   11) cray-pals/1.3.3                     15) c-ares/1.15.0-bvkwg2y          19) nlohmann-json/3.11.2-ejousvp  23) cmake/3.27.7
  4) mpc/1.3.1-dfagrna             8) intel_compute_runtime/release/821.36  12) cray-libpals/1.3.3                  16) protobuf/3.21.12               20) spdlog/1.10.0-g3jfctv

Note: these tests were actually run with HONEE rather than the fluids example, but the tests are nearly identical between the two.

I'm getting errors on the following tests (and their respective backends):

Test: navierstokes Advection 2D, implicit square wave, direct div(F_diff): /gpu/sycl/shared
Test: navierstokes Advection 2D, explicit square wave, indirect div(F_diff): /gpu/sycl/shared
Test: navierstokes Gaussian Wave, IDL and Entropy variables: /gpu/sycl/shared
Test: navierstokes Blasius, SGS DataDriven Sequential Ceed: /gpu/sycl/shared
Test: navierstokes Gaussian Wave, explicit, supg, IDL: /gpu/sycl/shared
Test: navierstokes Advection 2D, rotation, explicit, supg, consistent mass: /gpu/sycl/gen
Test: navierstokes Advection, skew: /gpu/sycl/shared
Test: navierstokes Blasius, bc_slip, Indirect Diffusive Flux Projection: /gpu/sycl/shared
Test: navierstokes Blasius, bc_slip, Direct Diffusive Flux Projection: /gpu/sycl/shared
Test: navierstokes Advection, rotation, cosine, direct div(F_diff): /gpu/sycl/shared
Test: navierstokes Gaussian Wave, using MatShell: /gpu/sycl/shared
Test: navierstokes Blasius, SGS DataDriven Fused: /gpu/sycl/shared
Test: navierstokes Blasius, SGS DataDriven Fused: /gpu/sycl/gen
Test: navierstokes Blasius, Anisotropic Differential Filter: /gpu/sycl/shared
Test: navierstokes Blasius, Anisotropic Differential Filter: /gpu/sycl/gen
Test: navierstokes Blasius, Isotropic Differential Filter: /gpu/sycl/shared
Test: navierstokes Blasius, Isotropic Differential Filter: /gpu/sycl/gen
Test: navierstokes Gaussian Wave, with IDL: /gpu/sycl/shared
Test: navierstokes Spanwise Turbulence Statistics: /gpu/sycl/shared
Test: navierstokes Spanwise Turbulence Statistics: /gpu/sycl/gen
Test: navierstokes Blasius: /gpu/sycl/shared
Test: navierstokes Blasius, STG Inflow: /gpu/sycl/shared
Test: navierstokes Blasius, STG Inflow, Weak Temperature: /gpu/sycl/shared
Test: navierstokes Blasius, Strong STG Inflow: /gpu/sycl/shared
Test: navierstokes Channel: /gpu/sycl/gen
Test: navierstokes Channel, Primitive: /gpu/sycl/gen
Test: navierstokes Density Current, explicit: /gpu/sycl/shared
Test: navierstokes Density Current, implicit, no stabilization: /gpu/sycl/shared
Test: navierstokes Advection, rotation, implicit, SUPG stabilization: /gpu/sycl/shared
Test: navierstokes Advection 2D, rotation, explicit, strong form: /gpu/sycl/gen
Test: navierstokes Euler, explicit: /gpu/sycl/shared
Test: navierstokes Sod Shocktube, explicit, SU stabilization, y-z-beta shock capturing: /gpu/sycl/shared
Test: navierstokes Sod Shocktube, explicit, SU stabilization, y-z-beta shock capturing: /gpu/sycl/gen

The failures are inconsistent. On back-to-back runs, I see the following failure differences:

$ diff junit2_failure_names.log junit_failure_names.log
5a6
> Test: navierstokes Advection 2D, rotation, explicit, supg, consistent mass: /gpu/sycl/gen
13a15,17
> Test: navierstokes Blasius, Anisotropic Differential Filter: /gpu/sycl/gen
> Test: navierstokes Blasius, Isotropic Differential Filter: /gpu/sycl/shared
> Test: navierstokes Blasius, Isotropic Differential Filter: /gpu/sycl/gen
15a20
> Test: navierstokes Spanwise Turbulence Statistics: /gpu/sycl/gen
20c25,26
< Test: navierstokes Channel: /gpu/sycl/shared
---
> Test: navierstokes Channel: /gpu/sycl/gen
> Test: navierstokes Channel, Primitive: /gpu/sycl/gen
21a28
> Test: navierstokes Density Current, implicit, no stabilization: /gpu/sycl/shared
23,24d29
< Test: navierstokes Advection, translation, implicit, SU stabilization: /gpu/sycl/shared
< Test: navierstokes Advection 2D, rotation, explicit, strong form: /gpu/sycl/shared
26d30
< Test: navierstokes Advection 2D, rotation, implicit, SUPG stabilization: /gpu/sycl/shared
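
For comparing runs like this, a set difference over (test, backend) pairs can be easier to read than a line diff. Below is a minimal sketch, assuming each failure is recorded as a `Test: <name>: <backend>` line as in the lists above. The function names are hypothetical and not part of libCEED or HONEE.

```python
from collections import Counter


def parse_failures(text):
    """Return the set of (test name, backend) pairs found in 'Test: ...' lines."""
    failures = set()
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("Test: "):
            continue
        # The backend is whatever follows the last ': ' on the line.
        name, _, backend = line[len("Test: "):].rpartition(": ")
        failures.add((name, backend))
    return failures


def compare_runs(run_a, run_b):
    """Split the failures of two runs into shared and run-specific sets."""
    a, b = parse_failures(run_a), parse_failures(run_b)
    return {"both": a & b, "only_a": a - b, "only_b": b - a}


def count_by_backend(failures):
    """Count failures per backend, e.g. to compare shared against gen."""
    return Counter(backend for _, backend in failures)
```

Feeding it the contents of the two failure-name logs would show which failures reproduce in both runs and which appear in only one.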

I've attached the make junit results here:
junit.log
junit2.log

Most of the failures are:

TSStep has failed due to DIVERGED_NONLINEAR_SOLVE,

Some fail when compared to the reference solution. One of those is of note:

Test: navierstokes Gaussian Wave, explicit, supg, IDL
  $ build/navierstokes -ceed /gpu/sycl/shared -test_type solver -options_file examples/gaussianwave.yaml -compare_final_state_atol 1e-8 -compare_final_state_filename tests/output/fluids-navierstokes-gaussianwave-explicit.bin -dm_plex_box_faces 2,2,1 -ts_max_steps 5 -degree 3 -implicit false -ts_type rk -stab supg -state_var conservative -mass_ksp_type gmres -mass_pc_jacobi_type diagonal -idl_decay_time 2e-3 -idl_length 0.25 -idl_start 0 -idl_pressure 70
FAIL: stdout
Output:
Test failed with error norm 1.7366e+142

I say of note because:

  • The comically large error norm
  • It only fails with /gpu/sycl/shared
  • It fails on both back-to-back runs with the absurdly large error norm

Now the error norm might simply be due to numerical instabilities in the solution (it is explicit after all), but perhaps this problem in particular might illuminate the problems better.
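
To separate the two failure kinds when going through the logs, something like the following could work. It is only a sketch: it assumes the two message strings quoted above appear verbatim in a test's output and assumes nothing else about the layout of junit.log. The function name, the category labels, and the blow-up threshold are arbitrary choices.

```python
import re

# Matches the message quoted above, e.g. "Test failed with error norm 1.7366e+142".
NORM_RE = re.compile(r"Test failed with error norm\s+(\S+)")


def classify_failure(output, blowup_threshold=1e10):
    """Classify one test's output as 'diverged', 'blowup', 'mismatch', or 'other'."""
    if "DIVERGED_NONLINEAR_SOLVE" in output:
        return "diverged"
    match = NORM_RE.search(output)
    if match is None:
        return "other"
    try:
        norm = float(match.group(1))
    except ValueError:
        return "other"
    # NaN compares false against everything, so check for it explicitly.
    if norm != norm or norm > blowup_threshold:
        return "blowup"
    return "mismatch"
```

A "blowup" result that shows up on every run, as with the Gaussian wave case here, would be a good candidate for a reproducer.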

@uumesh
Contributor

uumesh commented Sep 29, 2024

Thanks for the notes. We will revisit the implementation of the kernels in the shared and gen backends. In the meantime, do you know if these tests also fail on the libCEED (fluids) side? It would also be worth checking whether the ex1 and ex2 test cases in libCEED pass with these backends.

@jrwrigh
Collaborator Author

jrwrigh commented Sep 30, 2024

It's the same behavior for the fluids example tests, minus the tests that are in HONEE and not libCEED.

ex1 and ex2 pass fine.

@jeremylt
Member

Note: the SYCL backends also need a ton of updates from the CUDA/HIP backends, so that might be a more worthwhile use of time, since those changes are so extensive.

@jrwrigh
Collaborator Author

jrwrigh commented Oct 2, 2024

Per @nbeams's suggestion, I ran the libCEED tests with export ZE_SERIALIZE=2, and they pass with this environment variable set. TBH, I'm not sure what it does, but I'm guessing it disallows some form of out-of-order execution.
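
Because the failures are intermittent, a single passing run is weak evidence, so a repeated A/B comparison is more convincing. Here is a minimal sketch, assuming the test command signals failure through a nonzero exit code. The helper names are hypothetical, and `cmd` stands for whichever test command line is being checked.

```python
import os
import subprocess


def count_failures(cmd, runs, env):
    """Run cmd `runs` times with the given environment and count nonzero exits."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd, env=env, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures


def compare_serialized(cmd, runs=10):
    """Compare failure counts with ZE_SERIALIZE unset and with ZE_SERIALIZE=2."""
    base_env = dict(os.environ)
    base_env.pop("ZE_SERIALIZE", None)  # make sure the baseline really is unset
    serialized_env = dict(base_env, ZE_SERIALIZE="2")
    return {
        "default": count_failures(cmd, runs, base_env),
        "ZE_SERIALIZE=2": count_failures(cmd, runs, serialized_env),
    }
```

Here `cmd` is an argument list, for example one of the failing navierstokes command lines split into its arguments.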

@nbeams
Contributor

nbeams commented Oct 2, 2024

ZE_SERIALIZE=2 forces all kernel launches to be serialized with respect to the host. Since that seems to fix the problem, that points to a sync issue somewhere being the source of the failures. Unfortunately, it doesn't help us narrow down where it's coming from...

@uumesh
Contributor

uumesh commented Oct 2, 2024

@nbeams - when you say the kernel launches are serialized, does that mean in-order execution of the queue? If so, it might be easier to look for where we might have missed a queue synchronization.

@nbeams
Contributor

nbeams commented Oct 2, 2024

It would make the kernels in-order, but I think it also means the kernel launches are blocking. I've been told it's like doing CUDA_LAUNCH_BLOCKING=1.

@jeremylt
Member

jeremylt commented Dec 3, 2024

@jrwrigh is this issue solved?

@jrwrigh
Collaborator Author

jrwrigh commented Dec 3, 2024

Nope, just a fresh round of debugging to try to nail it down. (Assuming you saw the Slack message. If you didn't, your spidey senses are working well. haha)

@jeremylt
Member

jeremylt commented Dec 3, 2024

Checking on issues before the release 🕸️
