Multithreading not working when using OpenBLAS #107
Update 1: After installing MKL, multithreading seems to be working. Maybe this is related: JuliaLang/LinearAlgebra.jl#1000?
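(For reference, the dense coarse solve ultimately calls into the loaded BLAS/LAPACK backend, so its thread count can be inspected and adjusted from Julia independently of AlgebraicMultigrid. A minimal check, not an AlgebraicMultigrid setting:

```julia
using LinearAlgebra

# Number of threads used by the loaded BLAS backend
# (OpenBLAS by default; MKL if MKL.jl is loaded).
@show BLAS.get_num_threads()

# Optionally raise it to the number of hardware threads.
BLAS.set_num_threads(Sys.CPU_THREADS)
```
)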
On master, nothing in AlgebraicMultigrid.jl is explicitly threaded; only the coarse solver can be expected to be multithreaded. I would like to provide threaded smoothers, RAP, and setup algorithms in the future, but right now I don't have enough time and hands to implement it. However, if you (or someone else reading this) want to go ahead, feel free to grab this and ping me in the PR or here.
Thanks! Quick question: any intuition on which part might be the bottleneck?
In the default setup it is likely the coarse solver (pinv). In general it depends on which smoothers, cycle, and coarse solver you choose.
I'm trying to solve the Laplace equation on a very large system of equations, roughly 1 billion unknowns. I'm willing to spend some time on a PR if there are significant gains to be had. I know it's difficult to tell in advance without properly profiling the code, though.
If you want to stay in a shared-memory environment, I would suggest that you first set up a Laplace problem with fewer dofs and measure the setup, smoother, RAP, and coarse-solve timings with e.g. https://github.com/KristofferC/TimerOutputs.jl. Also, if possible, use a different coarse solver. I think I forgot to add it to the docs, so here is an example of how to swap the coarse solver:

```julia
using AlgebraicMultigrid, LinearSolve

ml = ruge_stuben(A, coarse_solver=AlgebraicMultigrid.LinearSolveWrapper(UMFPACKFactorization()))
```

The amount of speedup you can achieve depends on your specific system at hand (i.e. number of cores, cache sizes, memory bandwidth, ...).
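To make the measurement concrete, here is a minimal sketch of timing the setup and solve phases with TimerOutputs.jl, assuming the exported `ruge_stuben`/`solve` entry points. The small 1D Laplacian built with `spdiagm` is only a stand-in for the real problem, and finer-grained timers for the smoother, RAP, and coarse solve would have to be added inside the package itself:

```julia
using AlgebraicMultigrid, SparseArrays, TimerOutputs

# Stand-in 1D Laplacian with a modest number of dofs.
n = 100_000
A = spdiagm(-1 => fill(-1.0, n - 1), 0 => fill(2.0, n), 1 => fill(-1.0, n - 1))
b = rand(n)

to = TimerOutput()
@timeit to "setup (ruge_stuben)" ml = ruge_stuben(A)
@timeit to "solve" x = solve(ml, b)
print_timer(to)
```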
Speeding up RAP should also be straightforward. We can try something in the direction of https://github.com/BacAmorim/ThreadedSparseCSR.jl/blob/main/src/batch_matmul.jl#L13-L34.
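For reference, the row-parallel pattern in that file boils down to something like the following sketch, written against a plain CSR triplet (`rowptr`, `colval`, `nzval`) rather than AlgebraicMultigrid's internal types:

```julia
using Base.Threads

# Threaded sparse mat-vec over a CSR layout: y = A * x.
# Parallelizing over rows is race-free because each entry of y
# is written by exactly one task.
function threaded_csr_matvec!(y, rowptr, colval, nzval, x)
    @threads for i in 1:length(y)
        acc = zero(eltype(y))
        for k in rowptr[i]:(rowptr[i+1] - 1)
            acc += nzval[k] * x[colval[k]]
        end
        y[i] = acc
    end
    return y
end
```

Doing the same for Julia's native CSC storage would need either a transpose or a column-wise reduction, since threading over the columns of a CSC matrix writes into overlapping rows of `y`.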
cc @tirtho109 @fredrikekre is there an update regarding the distributed memory implementation? |
I haven't done a rigorous profile, but eyeballing the CPU usage, it seems that the `_solve` step is single-threaded. Do you plan on adding multithreading? Thanks!
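(Side note for anyone reproducing this: Julia-level threading, which any future `@threads`-based smoothers or RAP would rely on, is configured separately from BLAS threading. A quick sketch of the check:

```julia
# Start Julia with `julia -t auto` (or set JULIA_NUM_THREADS) first.
@show Threads.nthreads()   # threads available to @threads / @spawn in Julia code
```
)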