SYCL failures for fluids example on SunSpot #1603
Comments
Running on Sunspot with the following environment:
Note: these tests are actually done with HONEE rather than the fluids example, but the tests are nearly identical between the two. I'm getting errors on the following tests (and their respective backends):
The failures are inconsistent. On back-to-back runs, I see the following failure differences:
I've attached the Most of the failures are:
Some fail when comparing to the reference solution. Of note on those is the following:
I say of note because:
Now, the error norm might simply be due to numerical instabilities in the solution (it is explicit, after all), but perhaps this problem in particular might illuminate the underlying issue better.
Thanks for the notes. We will revisit the implementation of the kernels in the shared and gen backends. In the meantime, do you know if these tests also fail on the libCEED (fluids) side? Also worth checking are the ex1 and ex2 test cases in libCEED, and whether they pass with these backends.
It's the same behavior for the fluids example tests, minus the tests that are in HONEE and not libCEED. ex1 and ex2 pass fine.
Note: the SYCL backends also need a ton of updates from the CUDA/HIP backends, so that might be a more worthwhile use of time, since those are such extensive changes.
Per suggestion of @nbeams, I tried the libCEED tests with
@nbeams - when you say serialize the kernel launches, is that equivalent to in-order execution of the queue? If that is the case, it might be easier to look for where we might have missed a queue synchronization.
It would make the kernels in-order, but I think it also means the kernel launches are blocking. I've been told it's like doing
@jrwrigh is this issue solved?
Nope, just a fresh round of debugging to try and nail it down. (Assuming you saw the slack message. If you didn't, your spidey senses are working well. haha)
Checking on issues before the release 🕸️
I've seen some failures on SunSpot with the fluids examples. The general behavior is:
- /gpu/sycl/ref passes fine every time
- /gpu/sycl/shared fails about 90% of the time
- /gpu/sycl/gen fails about 10% of the time
The failures are only present on a few tests (SunSpot is down for maintenance today, so I can't confirm exactly which ones right now, but I'm fairly certain the Gaussian wave tests are among them), but the above behavior is pretty consistent. This is observed using oneapi/release/2024.04.15.001.
The failure specifically is a non-linear solver divergence:
Given the relationships between the backends, I'm guessing the error is probably in the functions shared between the shared and gen backends.
Tagging @kris-rowe @uumesh
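
For anyone trying to reproduce the intermittent behavior, per-backend failure rates like the ones quoted in this report could be estimated with a small rerun harness. The following is only a sketch: `failure_rate` and `summarize` are hypothetical helper names, and the commands in the demo are placeholders, not real libCEED or HONEE test invocations. Substitute whatever command runs a single test against a given backend.

```python
import subprocess
import sys


def failure_rate(cmd, runs=20):
    """Run `cmd` `runs` times; return (number of non-zero exits, runs)."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures, runs


def summarize(commands, runs=20):
    """Map each label to the fraction of its runs that failed."""
    rates = {}
    for label, cmd in commands.items():
        failures, total = failure_rate(cmd, runs)
        rates[label] = failures / total
    return rates


if __name__ == "__main__":
    # Placeholder commands (assumption): replace each list with the real
    # invocation of one test on one backend.
    demo = {
        "always-pass": [sys.executable, "-c", "raise SystemExit(0)"],
        "always-fail": [sys.executable, "-c", "raise SystemExit(1)"],
    }
    for label, rate in summarize(demo, runs=5).items():
        print(f"{label}: {rate:.0%} of runs failed")
```

A non-zero exit code is treated as a failure here; if the test runner reports failures some other way, the check inside `failure_rate` would need to change accordingly.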