-
Notifications
You must be signed in to change notification settings - Fork 179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add dynamic CUB dispatch for scan to support c.parallel #3398
base: main
Are you sure you want to change the base?
Conversation
@@ -242,19 +276,19 @@ struct policy_hub | |||
struct Policy350 : ChainedPolicy<350, Policy350, Policy350> | |||
{ | |||
// GTX Titan: 29.5B items/s (232.4 GB/s) @ 48M 32-bit T | |||
using ScanPolicyT = | |||
using ScanPolicy = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The macro CUB_DEFINE_SUB_POLICY_GETTER
makes some assumptions about the name of the policy so this rename is necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sooo this is potentially problematic. The macro can be changed, what I'm mildly concerned about here is that this will break users who provide their own policies and spell them ScanPolicyT
, and that stops working with this PR. cc @gevtushenko
We should probably just inline the macro above and make this work with the existing name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, code passing user-defined policy should still work. For instance, this change should brake our scan tuning:
using ScanPolicyT = |
🟨 CI finished in 1h 50m: Pass: 87%/78 | Total: 1d 23h | Avg: 36m 46s | Max: 1h 15m | Hits: 284%/12340
|
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
Thrust | |
CUDA Experimental | |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 78)
# | Runner |
---|---|
53 | linux-amd64-cpu16 |
11 | linux-amd64-gpu-v100-latest-1 |
9 | windows-amd64-cpu16 |
4 | linux-arm64-cpu16 |
1 | linux-amd64-gpu-h100-latest-1-testing |
🟩 CI finished in 1h 56m: Pass: 100%/78 | Total: 2d 03h | Avg: 39m 50s | Max: 1h 12m | Hits: 285%/12340
|
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
Thrust | |
CUDA Experimental | |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 78)
# | Runner |
---|---|
53 | linux-amd64-cpu16 |
11 | linux-amd64-gpu-v100-latest-1 |
9 | windows-amd64-cpu16 |
4 | linux-arm64-cpu16 |
1 | linux-amd64-gpu-h100-latest-1-testing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me other than the renaming of the sub-policy type. Please see the comment thread from earlier for details.
typename OffsetT, | ||
typename AccumT, | ||
bool ForceInclusive> | ||
struct DeviceScanKernelSource |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
important: we probably don't want this struct as part of our public API. Let's wrap it into detail
namespace.
// TODO(ashwin): should this come from the launcher factory instead? | ||
// Get max x-dimension of grid | ||
int max_dim_x; | ||
error = CubDebug(cudaDeviceGetAttribute(&max_dim_x, cudaDevAttrMaxGridDimX, device_ordinal)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: yes, all CUDA Runtime calls should be consolidated in launcher factory
@@ -242,19 +276,19 @@ struct policy_hub | |||
struct Policy350 : ChainedPolicy<350, Policy350, Policy350> | |||
{ | |||
// GTX Titan: 29.5B items/s (232.4 GB/s) @ 48M 32-bit T | |||
using ScanPolicyT = | |||
using ScanPolicy = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, code passing user-defined policy should still work. For instance, this change should brake our scan tuning:
using ScanPolicyT = |
Description
Closes #3397
Analogous to #2591, this PR extends the CUB dispatch layer for
Scan
to support dynamic kernel launching. This will later be used by c.parallel's scan implementation.