
Multithreading not working when using OpenBLAS #107

Open
ma-sadeghi opened this issue Oct 22, 2023 · 9 comments

@ma-sadeghi
I haven't done a rigorous profile, but eyeballing the CPU usage, it seems that the _solve step is single-threaded. Do you plan on adding multithreading? Thanks!

@ma-sadeghi
Author

ma-sadeghi commented Oct 22, 2023

Update 1: After installing MKL, multithreading seems to be working. Maybe this is related: JuliaLang/LinearAlgebra.jl#1000?
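For reference, a minimal sketch of switching the BLAS backend to MKL and checking the thread count (assuming the MKL.jl package is installed; the functions below are from the stock LinearAlgebra.BLAS API):

using LinearAlgebra
using MKL                  # loading MKL.jl swaps the default BLAS backend to MKL

BLAS.get_config()          # shows which BLAS/LAPACK library is currently loaded
BLAS.get_num_threads()     # number of BLAS threads currently in use
# BLAS.set_num_threads(8)  # optionally pin the BLAS thread count explicitly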

@ma-sadeghi ma-sadeghi changed the title Support multithreading? Multithreading not working when using OpenBLAS Oct 22, 2023
@ma-sadeghi
Author

Update 2: On closer inspection, even with MKL, the multithreaded part seems to account for only ~5% of the total run time. Here's a screenshot of CPU usage during the _solve phase:

[Screenshot: CPU usage during the _solve phase]

@termi-official
Contributor

On master, nothing in AlgebraicMultigrid.jl is explicitly threaded; only the coarse solver can be expected to possibly run multithreaded. I intend to provide threaded smoothers, RAP, and setup algorithms in the future, but right now I don't have the time or hands to implement them. However, if you (or anyone else reading this) want to go ahead, feel free to pick this up and ping me in the PR or here.

@ma-sadeghi
Author

Thanks! Quick question: Any intuition on which part might be the bottleneck?

@termi-official
Contributor

In the default setup it is likely the coarse solver (pinv). In general, it depends on which smoothers, cycle, and coarse solver you choose.

@ma-sadeghi
Author

I'm trying to solve the Laplace equation on a very large system of equations, ~1 billion unknowns. I'm willing to spend some time on a PR if there are significant gains to be had, though I know it's difficult to tell in advance without properly profiling the code.

@termi-official
Contributor

If you want to stay in a shared-memory environment, I would suggest that you first set up a Laplace problem with fewer dofs and measure the setup, smoother, RAP, and coarse-solve timings with e.g. https://github.com/KristofferC/TimerOutputs.jl. Also, if possible, use a different coarse solver. I think I forgot to add it to the docs, so here is an example of how to swap the coarse solver:

using AlgebraicMultigrid, LinearSolve
ml = ruge_stuben(A, coarse_solver=AlgebraicMultigrid.LinearSolveWrapper(UMFPACKFactorization()))

The amount of speedup you can achieve depends on the specific system at hand (e.g. number of cores, cache sizes, memory bandwidth, ...).
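As a rough sketch of the suggested measurement (assuming a sparse matrix A and right-hand side b are already assembled; the labels and the call to the internal _solve are illustrative, not prescriptive):

using AlgebraicMultigrid, TimerOutputs

const to = TimerOutput()

@timeit to "setup (ruge_stuben)" ml = ruge_stuben(A)
@timeit to "solve (V-cycles)" x = AlgebraicMultigrid._solve(ml, b)

show(to)   # prints a breakdown of where the time went

Finer-grained timings of the smoother, RAP, and coarse solve would need @timeit annotations inside the package itself.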

@termi-official
Contributor

Speeding up RAP should also be straightforward. We can try something in the direction of https://github.com/BacAmorim/ThreadedSparseCSR.jl/blob/main/src/batch_matmul.jl#L13-L34.
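A rough sketch in the spirit of the linked batch_matmul, threading a sparse matrix-vector product over rows (this is not AlgebraicMultigrid.jl code; the helper name is hypothetical, and row access goes through the transpose because Julia's SparseMatrixCSC is column-major):

using SparseArrays, Base.Threads

# y = A * x, parallelised over the rows of A. `At` must hold transpose(A)
# materialised as a SparseMatrixCSC, so that each of its columns is a row of A.
function threaded_mul!(y::AbstractVector, At::SparseMatrixCSC, x::AbstractVector)
    vals = nonzeros(At)
    idx  = rowvals(At)
    @threads for row in 1:size(At, 2)
        acc = zero(eltype(y))
        for k in nzrange(At, row)
            acc += vals[k] * x[idx[k]]
        end
        y[row] = acc
    end
    return y
end

Usage would be something like At = sparse(transpose(A)); threaded_mul!(y, At, x). A full threaded RAP would additionally need the same treatment for the sparse matrix-matrix products.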

@termi-official
Contributor

cc @tirtho109 @fredrikekre: is there an update regarding the distributed-memory implementation?
