Add dynamic CUB dispatch for scan to support c.parallel #3398

Open
wants to merge 5 commits into
base: main

Conversation

shwina
Contributor

@shwina shwina commented Jan 15, 2025

Description

Closes #3397

Analogous to #2591, this PR extends the CUB dispatch layer for Scan to support dynamic kernel launching. This will later be used by c.parallel's scan implementation.

@shwina shwina requested review from a team as code owners January 15, 2025 13:00
@@ -242,19 +276,19 @@ struct policy_hub
struct Policy350 : ChainedPolicy<350, Policy350, Policy350>
{
// GTX Titan: 29.5B items/s (232.4 GB/s) @ 48M 32-bit T
-  using ScanPolicyT =
+  using ScanPolicy =
Contributor Author

The macro CUB_DEFINE_SUB_POLICY_GETTER makes some assumptions about the name of the policy so this rename is necessary.

Collaborator

Sooo this is potentially problematic. The macro can be changed; what I'm mildly concerned about here is that this will break users who provide their own policies and spell them ScanPolicyT, since that spelling stops working with this PR. cc @gevtushenko

We should probably just inline the macro above and make this work with the existing name.
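
To make the concern concrete, here is a minimal sketch of how a getter macro can tie a wrapper to one alias spelling, and how writing the getter by hand keeps the existing name working. Everything in it is an assumption for illustration: `DEFINE_SUB_POLICY_GETTER`, `AgentPolicy`, `UserPolicy`, `RenamedPolicy`, `MacroWrapper` and `ScanPolicyWrapper` are hypothetical names, and the macro body is not the definition of `CUB_DEFINE_SUB_POLICY_GETTER`.

```cpp
#include <cassert>

// Hypothetical stand-in for a getter-generating macro (NOT the real CUB macro).
// It pastes "Policy" onto the getter name, so the wrapped type must expose an
// alias spelled exactly `<name>Policy`.
#define DEFINE_SUB_POLICY_GETTER(name)             \
  constexpr auto name() const                      \
  {                                                \
    return typename StaticPolicyT::name##Policy{}; \
  }

// Illustrative sub-policy carrying two tuning parameters.
template <int BlockThreads, int ItemsPerThread>
struct AgentPolicy
{
  static constexpr int block_threads    = BlockThreads;
  static constexpr int items_per_thread = ItemsPerThread;
};

// A user-provided policy that keeps the pre-existing spelling.
struct UserPolicy
{
  using ScanPolicyT = AgentPolicy<128, 12>;
};

// A policy that adopts the spelling the macro expects.
struct RenamedPolicy
{
  using ScanPolicy = AgentPolicy<256, 8>;
};

// Macro-generated getter: Scan() only compiles for types with `ScanPolicy`.
template <typename StaticPolicyT>
struct MacroWrapper
{
  DEFINE_SUB_POLICY_GETTER(Scan)
};

// Hand-written ("inlined") getter: it can name the existing alias directly,
// so policies spelled `ScanPolicyT` keep working.
template <typename StaticPolicyT>
struct ScanPolicyWrapper
{
  constexpr auto Scan() const
  {
    return typename StaticPolicyT::ScanPolicyT{};
  }
};

static_assert(ScanPolicyWrapper<UserPolicy>{}.Scan().block_threads == 128,
              "hand-written getter sees the ScanPolicyT alias");
```

In this sketch, calling `MacroWrapper<UserPolicy>{}.Scan()` would fail to compile because `UserPolicy` has no `ScanPolicy` alias, which mirrors the breakage described above.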

Collaborator

Agreed, code passing a user-defined policy should still work. For instance, this change should break our scan tuning (we just don't build it as part of CI).

Contributor

🟨 CI finished in 1h 50m: Pass: 87%/78 | Total: 1d 23h | Avg: 36m 46s | Max: 1h 15m | Hits: 284%/12340
  • 🟨 cub: Pass: 86%/38 | Total: 1d 05h | Avg: 46m 07s | Max: 1h 15m | Hits: 369%/3120

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  86%/36  | Total:  1d 03h | Avg: 45m 32s | Max:  1h 15m | Hits: 369%/3120  
      🟩 arm64              Pass: 100%/2   | Total:  1h 52m | Avg: 56m 20s | Max: 58m 03s
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 58m | Avg: 59m 08s | Max:  1h 02m
      🔍 nvcc               Pass:  86%/36  | Total:  1d 03h | Avg: 45m 23s | Max:  1h 15m | Hits: 369%/3120  
    🔍 cxx_family: GCC 🔍
      🟩 Clang              Pass: 100%/14  | Total: 11h 57m | Avg: 51m 14s | Max:  1h 02m
      🔍 GCC                Pass:  72%/18  | Total: 10h 42m | Avg: 35m 42s | Max: 58m 44s
      🟩 MSVC               Pass: 100%/4   | Total:  4h 25m | Avg:  1h 06m | Max:  1h 15m | Hits: 369%/3120  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 05m
    🔍 gpu: v100 🔍
      🟩 h100               Pass: 100%/2   | Total: 42m 25s | Avg: 21m 12s | Max: 26m 06s
      🔍 v100               Pass:  86%/36  | Total:  1d 04h | Avg: 47m 30s | Max:  1h 15m | Hits: 369%/3120  
    🔍 jobs: Build 🔍
      🔍 Build              Pass:  83%/31  | Total:  1d 02h | Avg: 52m 10s | Max:  1h 15m | Hits: 369%/3120  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 19m 58s | Avg: 19m 58s | Max: 19m 58s
      🟩 GraphCapture       Pass: 100%/1   | Total: 14m 57s | Avg: 14m 57s | Max: 14m 57s
      🟩 HostLaunch         Pass: 100%/3   | Total: 54m 47s | Avg: 18m 15s | Max: 19m 18s
      🟩 TestGPU            Pass: 100%/2   | Total: 45m 32s | Avg: 22m 46s | Max: 24m 21s
    🔍 std: 17 🔍
      🔍 17                 Pass:  64%/14  | Total: 11h 38m | Avg: 49m 55s | Max:  1h 05m | Hits: 369%/2340  
      🟩 20                 Pass: 100%/24  | Total: 17h 33m | Avg: 43m 53s | Max:  1h 15m | Hits: 368%/780   
    🟨 ctk
      🟨 12.0               Pass:  60%/5   | Total:  4h 02m | Avg: 48m 35s | Max:  1h 04m | Hits: 369%/780   
      🟩 12.5               Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 05m
      🟨 12.6               Pass:  90%/31  | Total: 23h 02m | Avg: 44m 36s | Max:  1h 15m | Hits: 369%/2340  
    🟨 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 58m | Avg: 59m 08s | Max:  1h 02m
      🟨 nvcc12.0           Pass:  60%/5   | Total:  4h 02m | Avg: 48m 35s | Max:  1h 04m | Hits: 369%/780   
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 05m
      🟨 nvcc12.6           Pass:  89%/29  | Total: 21h 04m | Avg: 43m 36s | Max:  1h 15m | Hits: 369%/2340  
    🟨 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  3h 41m | Avg: 55m 24s | Max: 57m 51s
      🟩 Clang15            Pass: 100%/1   | Total: 59m 08s | Avg: 59m 08s | Max: 59m 08s
      🟩 Clang16            Pass: 100%/1   | Total: 55m 22s | Avg: 55m 22s | Max: 55m 22s
      🟩 Clang17            Pass: 100%/1   | Total: 54m 35s | Avg: 54m 35s | Max: 54m 35s
      🟩 Clang18            Pass: 100%/7   | Total:  5h 26m | Avg: 46m 40s | Max:  1h 02m
      🟥 GCC7               Pass:   0%/2   | Total:  1h 04m | Avg: 32m 29s | Max: 33m 17s
      🟥 GCC8               Pass:   0%/1   | Total: 31m 43s | Avg: 31m 43s | Max: 31m 43s
      🟥 GCC9               Pass:   0%/2   | Total:  1h 04m | Avg: 32m 27s | Max: 33m 17s
      🟩 GCC10              Pass: 100%/1   | Total: 55m 13s | Avg: 55m 13s | Max: 55m 13s
      🟩 GCC11              Pass: 100%/1   | Total: 58m 05s | Avg: 58m 05s | Max: 58m 05s
      🟩 GCC12              Pass: 100%/3   | Total:  1h 41m | Avg: 33m 43s | Max: 58m 44s
      🟩 GCC13              Pass: 100%/8   | Total:  4h 26m | Avg: 33m 20s | Max: 58m 03s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 07m | Avg:  1h 03m | Max:  1h 04m | Hits: 369%/1560  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 18m | Avg:  1h 09m | Max:  1h 15m | Hits: 368%/1560  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 05m
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 42m 25s | Avg: 21m 12s | Max: 26m 06s
      🟩 90a                Pass: 100%/1   | Total: 25m 00s | Avg: 25m 00s | Max: 25m 00s
    
  • 🟨 thrust: Pass: 86%/37 | Total: 18h 01m | Avg: 29m 13s | Max: 1h 06m | Hits: 255%/9220

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  85%/35  | Total: 17h 04m | Avg: 29m 16s | Max:  1h 06m | Hits: 255%/9220  
      🟩 arm64              Pass: 100%/2   | Total: 56m 59s | Avg: 28m 29s | Max: 30m 26s
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total: 53m 55s | Avg: 26m 57s | Max: 28m 11s
      🔍 nvcc               Pass:  85%/35  | Total: 17h 07m | Avg: 29m 21s | Max:  1h 06m | Hits: 255%/9220  
    🔍 cxx_family: GCC 🔍
      🟩 Clang              Pass: 100%/14  | Total:  6h 23m | Avg: 27m 24s | Max: 35m 37s
      🔍 GCC                Pass:  68%/16  | Total:  5h 11m | Avg: 19m 29s | Max: 34m 49s
      🟩 MSVC               Pass: 100%/5   | Total:  4h 33m | Avg: 54m 47s | Max:  1h 06m | Hits: 255%/9220  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 51m | Avg: 55m 53s | Max: 56m 08s
    🔍 jobs: Build 🔍
      🔍 Build              Pass:  83%/31  | Total: 16h 21m | Avg: 31m 39s | Max:  1h 06m | Hits: 228%/7376  
      🟩 TestCPU            Pass: 100%/3   | Total: 51m 05s | Avg: 17m 01s | Max: 36m 11s | Hits: 365%/1844  
      🟩 TestGPU            Pass: 100%/3   | Total: 49m 03s | Avg: 16m 21s | Max: 20m 05s
    🔍 std: 17 🔍
      🔍 17                 Pass:  64%/14  | Total:  6h 48m | Avg: 29m 09s | Max:  1h 01m | Hits: 228%/5532  
      🟩 20                 Pass: 100%/21  | Total: 10h 31m | Avg: 30m 05s | Max:  1h 06m | Hits: 296%/3688  
    🟨 ctk
      🟨 12.0               Pass:  60%/5   | Total:  2h 05m | Avg: 25m 07s | Max: 53m 45s | Hits: 228%/1844  
      🟩 12.5               Pass: 100%/2   | Total:  1h 51m | Avg: 55m 53s | Max: 56m 08s
      🟨 12.6               Pass:  90%/30  | Total: 14h 04m | Avg: 28m 08s | Max:  1h 06m | Hits: 262%/7376  
    🟨 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 53m 55s | Avg: 26m 57s | Max: 28m 11s
      🟨 nvcc12.0           Pass:  60%/5   | Total:  2h 05m | Avg: 25m 07s | Max: 53m 45s | Hits: 228%/1844  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 51m | Avg: 55m 53s | Max: 56m 08s
      🟨 nvcc12.6           Pass:  89%/28  | Total: 13h 10m | Avg: 28m 13s | Max:  1h 06m | Hits: 262%/7376  
    🟨 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  2h 02m | Avg: 30m 38s | Max: 32m 17s
      🟩 Clang15            Pass: 100%/1   | Total: 35m 37s | Avg: 35m 37s | Max: 35m 37s
      🟩 Clang16            Pass: 100%/1   | Total: 28m 27s | Avg: 28m 27s | Max: 28m 27s
      🟩 Clang17            Pass: 100%/1   | Total: 32m 31s | Avg: 32m 31s | Max: 32m 31s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 44m | Avg: 23m 31s | Max: 31m 19s
      🟥 GCC7               Pass:   0%/2   | Total: 11m 04s | Avg:  5m 32s | Max:  5m 34s
      🟥 GCC8               Pass:   0%/1   | Total:  4m 59s | Avg:  4m 59s | Max:  4m 59s
      🟥 GCC9               Pass:   0%/2   | Total: 10m 58s | Avg:  5m 29s | Max:  5m 35s
      🟩 GCC10              Pass: 100%/1   | Total: 32m 41s | Avg: 32m 41s | Max: 32m 41s
      🟩 GCC11              Pass: 100%/1   | Total: 31m 20s | Avg: 31m 20s | Max: 31m 20s
      🟩 GCC12              Pass: 100%/1   | Total: 34m 49s | Avg: 34m 49s | Max: 34m 49s
      🟩 GCC13              Pass: 100%/8   | Total:  3h 06m | Avg: 23m 15s | Max: 34m 07s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 55m | Avg: 57m 50s | Max:  1h 01m | Hits: 228%/3688  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 38m | Avg: 52m 45s | Max:  1h 06m | Hits: 274%/5532  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 51m | Avg: 55m 53s | Max: 56m 08s
    🟨 gpu
      🟨 v100               Pass:  86%/37  | Total: 18h 01m | Avg: 29m 13s | Max:  1h 06m | Hits: 255%/9220  
    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 41m 15s | Avg: 20m 37s | Max: 27m 14s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 19m 05s | Avg: 19m 05s | Max: 19m 05s
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 02s | Avg: 4m 31s | Max: 6m 50s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  9m 02s | Avg:  4m 31s | Max:  6m 50s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  9m 02s | Avg:  4m 31s | Max:  6m 50s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 02s | Avg:  4m 31s | Max:  6m 50s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  9m 02s | Avg:  4m 31s | Max:  6m 50s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  9m 02s | Avg:  4m 31s | Max:  6m 50s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  9m 02s | Avg:  4m 31s | Max:  6m 50s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  9m 02s | Avg:  4m 31s | Max:  6m 50s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 12s | Avg:  2m 12s | Max:  2m 12s
      🟩 Test               Pass: 100%/1   | Total:  6m 50s | Avg:  6m 50s | Max:  6m 50s
    
  • 🟩 python: Pass: 100%/1 | Total: 26m 11s | Avg: 26m 11s | Max: 26m 11s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 26m 11s | Avg: 26m 11s | Max: 26m 11s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 26m 11s | Avg: 26m 11s | Max: 26m 11s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 26m 11s | Avg: 26m 11s | Max: 26m 11s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 26m 11s | Avg: 26m 11s | Max: 26m 11s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 26m 11s | Avg: 26m 11s | Max: 26m 11s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 26m 11s | Avg: 26m 11s | Max: 26m 11s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 26m 11s | Avg: 26m 11s | Max: 26m 11s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 26m 11s | Avg: 26m 11s | Max: 26m 11s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 78)

# Runner
53 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

@shwina shwina requested a review from miscco January 15, 2025 16:33
Contributor

🟩 CI finished in 1h 56m: Pass: 100%/78 | Total: 2d 03h | Avg: 39m 50s | Max: 1h 12m | Hits: 285%/12340
  • 🟩 cub: Pass: 100%/38 | Total: 1d 07h | Avg: 49m 40s | Max: 1h 12m | Hits: 371%/3120

    🟩 cpu
      🟩 amd64              Pass: 100%/36  | Total:  1d 05h | Avg: 49m 08s | Max:  1h 12m | Hits: 371%/3120  
      🟩 arm64              Pass: 100%/2   | Total:  1h 58m | Avg: 59m 22s | Max:  1h 02m
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  4h 50m | Avg: 58m 10s | Max:  1h 08m | Hits: 371%/780   
      🟩 12.5               Pass: 100%/2   | Total:  2h 07m | Avg:  1h 03m | Max:  1h 04m
      🟩 12.6               Pass: 100%/31  | Total:  1d 00h | Avg: 47m 24s | Max:  1h 12m | Hits: 371%/2340  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 59m | Avg: 59m 58s | Max:  1h 01m
      🟩 nvcc12.0           Pass: 100%/5   | Total:  4h 50m | Avg: 58m 10s | Max:  1h 08m | Hits: 371%/780   
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 07m | Avg:  1h 03m | Max:  1h 04m
      🟩 nvcc12.6           Pass: 100%/29  | Total: 22h 29m | Avg: 46m 32s | Max:  1h 12m | Hits: 371%/2340  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 59m | Avg: 59m 58s | Max:  1h 01m
      🟩 nvcc               Pass: 100%/36  | Total:  1d 05h | Avg: 49m 06s | Max:  1h 12m | Hits: 371%/3120  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  3h 39m | Avg: 54m 53s | Max: 58m 05s
      🟩 Clang15            Pass: 100%/1   | Total: 54m 28s | Avg: 54m 28s | Max: 54m 28s
      🟩 Clang16            Pass: 100%/1   | Total: 52m 16s | Avg: 52m 16s | Max: 52m 16s
      🟩 Clang17            Pass: 100%/1   | Total: 57m 23s | Avg: 57m 23s | Max: 57m 23s
      🟩 Clang18            Pass: 100%/7   | Total:  5h 32m | Avg: 47m 31s | Max:  1h 02m
      🟩 GCC7               Pass: 100%/2   | Total:  1h 49m | Avg: 54m 55s | Max: 56m 27s
      🟩 GCC8               Pass: 100%/1   | Total: 53m 17s | Avg: 53m 17s | Max: 53m 17s
      🟩 GCC9               Pass: 100%/2   | Total:  1h 51m | Avg: 55m 37s | Max: 57m 46s
      🟩 GCC10              Pass: 100%/1   | Total: 52m 45s | Avg: 52m 45s | Max: 52m 45s
      🟩 GCC11              Pass: 100%/1   | Total: 59m 10s | Avg: 59m 10s | Max: 59m 10s
      🟩 GCC12              Pass: 100%/3   | Total:  1h 34m | Avg: 31m 30s | Max: 54m 20s
      🟩 GCC13              Pass: 100%/8   | Total:  4h 48m | Avg: 36m 05s | Max: 58m 12s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 12m | Avg:  1h 06m | Max:  1h 08m | Hits: 371%/1560  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 22m | Avg:  1h 11m | Max:  1h 12m | Hits: 371%/1560  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 07m | Avg:  1h 03m | Max:  1h 04m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total: 11h 56m | Avg: 51m 10s | Max:  1h 02m
      🟩 GCC                Pass: 100%/18  | Total: 12h 49m | Avg: 42m 45s | Max: 59m 10s
      🟩 MSVC               Pass: 100%/4   | Total:  4h 34m | Avg:  1h 08m | Max:  1h 12m | Hits: 371%/3120  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 07m | Avg:  1h 03m | Max:  1h 04m
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 40m 12s | Avg: 20m 06s | Max: 24m 06s
      🟩 v100               Pass: 100%/36  | Total:  1d 06h | Avg: 51m 19s | Max:  1h 12m | Hits: 371%/3120  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total:  1d 04h | Avg: 55m 55s | Max:  1h 12m | Hits: 371%/3120  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 24m 11s | Avg: 24m 11s | Max: 24m 11s
      🟩 GraphCapture       Pass: 100%/1   | Total: 21m 22s | Avg: 21m 22s | Max: 21m 22s
      🟩 HostLaunch         Pass: 100%/3   | Total: 57m 26s | Avg: 19m 08s | Max: 23m 33s
      🟩 TestGPU            Pass: 100%/2   | Total: 51m 19s | Avg: 25m 39s | Max: 28m 27s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 40m 12s | Avg: 20m 06s | Max: 24m 06s
      🟩 90a                Pass: 100%/1   | Total: 23m 15s | Avg: 23m 15s | Max: 23m 15s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total: 13h 36m | Avg: 58m 20s | Max:  1h 09m | Hits: 371%/2340  
      🟩 20                 Pass: 100%/24  | Total: 17h 51m | Avg: 44m 37s | Max:  1h 12m | Hits: 370%/780   
    
  • 🟩 thrust: Pass: 100%/37 | Total: 19h 45m | Avg: 32m 03s | Max: 1h 04m | Hits: 255%/9220

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 41m 23s | Avg: 20m 41s | Max: 27m 10s
    🟩 cpu
      🟩 amd64              Pass: 100%/35  | Total: 18h 49m | Avg: 32m 16s | Max:  1h 04m | Hits: 255%/9220  
      🟩 arm64              Pass: 100%/2   | Total: 56m 01s | Avg: 28m 00s | Max: 29m 29s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  2h 54m | Avg: 34m 54s | Max: 54m 24s | Hits: 228%/1844  
      🟩 12.5               Pass: 100%/2   | Total:  1h 47m | Avg: 53m 51s | Max: 56m 12s
      🟩 12.6               Pass: 100%/30  | Total: 15h 03m | Avg: 30m 07s | Max:  1h 04m | Hits: 262%/7376  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 54m 49s | Avg: 27m 24s | Max: 27m 27s
      🟩 nvcc12.0           Pass: 100%/5   | Total:  2h 54m | Avg: 34m 54s | Max: 54m 24s | Hits: 228%/1844  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 47m | Avg: 53m 51s | Max: 56m 12s
      🟩 nvcc12.6           Pass: 100%/28  | Total: 14h 08m | Avg: 30m 19s | Max:  1h 04m | Hits: 262%/7376  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 54m 49s | Avg: 27m 24s | Max: 27m 27s
      🟩 nvcc               Pass: 100%/35  | Total: 18h 51m | Avg: 32m 19s | Max:  1h 04m | Hits: 255%/9220  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  2h 00m | Avg: 30m 04s | Max: 31m 43s
      🟩 Clang15            Pass: 100%/1   | Total: 33m 00s | Avg: 33m 00s | Max: 33m 00s
      🟩 Clang16            Pass: 100%/1   | Total: 30m 07s | Avg: 30m 07s | Max: 30m 07s
      🟩 Clang17            Pass: 100%/1   | Total: 31m 40s | Avg: 31m 40s | Max: 31m 40s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 44m | Avg: 23m 27s | Max: 30m 56s
      🟩 GCC7               Pass: 100%/2   | Total:  1h 03m | Avg: 31m 39s | Max: 33m 07s
      🟩 GCC8               Pass: 100%/1   | Total: 31m 40s | Avg: 31m 40s | Max: 31m 40s
      🟩 GCC9               Pass: 100%/2   | Total:  1h 06m | Avg: 33m 14s | Max: 35m 37s
      🟩 GCC10              Pass: 100%/1   | Total: 32m 33s | Avg: 32m 33s | Max: 32m 33s
      🟩 GCC11              Pass: 100%/1   | Total: 33m 41s | Avg: 33m 41s | Max: 33m 41s
      🟩 GCC12              Pass: 100%/1   | Total: 30m 54s | Avg: 30m 54s | Max: 30m 54s
      🟩 GCC13              Pass: 100%/8   | Total:  2h 56m | Avg: 22m 02s | Max: 34m 12s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 48m | Avg: 54m 22s | Max: 54m 24s | Hits: 228%/3688  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 35m | Avg: 51m 45s | Max:  1h 04m | Hits: 274%/5532  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 47m | Avg: 53m 51s | Max: 56m 12s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  6h 19m | Avg: 27m 05s | Max: 33m 00s
      🟩 GCC                Pass: 100%/16  | Total:  7h 14m | Avg: 27m 11s | Max: 35m 37s
      🟩 MSVC               Pass: 100%/5   | Total:  4h 24m | Avg: 52m 48s | Max:  1h 04m | Hits: 255%/9220  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 47m | Avg: 53m 51s | Max: 56m 12s
    🟩 gpu
      🟩 v100               Pass: 100%/37  | Total: 19h 45m | Avg: 32m 03s | Max:  1h 04m | Hits: 255%/9220  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total: 18h 08m | Avg: 35m 06s | Max:  1h 04m | Hits: 228%/7376  
      🟩 TestCPU            Pass: 100%/3   | Total: 50m 50s | Avg: 16m 56s | Max: 35m 03s | Hits: 365%/1844  
      🟩 TestGPU            Pass: 100%/3   | Total: 46m 51s | Avg: 15m 37s | Max: 16m 37s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 16m 52s | Avg: 16m 52s | Max: 16m 52s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total:  8h 42m | Avg: 37m 18s | Max: 55m 16s | Hits: 228%/5532  
      🟩 20                 Pass: 100%/21  | Total: 10h 22m | Avg: 29m 38s | Max:  1h 04m | Hits: 296%/3688  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 8m 42s | Avg: 4m 21s | Max: 6m 31s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  8m 42s | Avg:  4m 21s | Max:  6m 31s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  8m 42s | Avg:  4m 21s | Max:  6m 31s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  8m 42s | Avg:  4m 21s | Max:  6m 31s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  8m 42s | Avg:  4m 21s | Max:  6m 31s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  8m 42s | Avg:  4m 21s | Max:  6m 31s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  8m 42s | Avg:  4m 21s | Max:  6m 31s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  8m 42s | Avg:  4m 21s | Max:  6m 31s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 11s | Avg:  2m 11s | Max:  2m 11s
      🟩 Test               Pass: 100%/1   | Total:  6m 31s | Avg:  6m 31s | Max:  6m 31s
    
  • 🟩 python: Pass: 100%/1 | Total: 24m 45s | Avg: 24m 45s | Max: 24m 45s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 24m 45s | Avg: 24m 45s | Max: 24m 45s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 24m 45s | Avg: 24m 45s | Max: 24m 45s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 24m 45s | Avg: 24m 45s | Max: 24m 45s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 24m 45s | Avg: 24m 45s | Max: 24m 45s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 24m 45s | Avg: 24m 45s | Max: 24m 45s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 24m 45s | Avg: 24m 45s | Max: 24m 45s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 24m 45s | Avg: 24m 45s | Max: 24m 45s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 24m 45s | Avg: 24m 45s | Max: 24m 45s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 78)

# Runner
53 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

Collaborator

@griwes griwes left a comment

This looks good to me other than the renaming of the sub-policy type. Please see the comment thread from earlier for details.

typename OffsetT,
typename AccumT,
bool ForceInclusive>
struct DeviceScanKernelSource
Collaborator

important: we probably don't want this struct as part of our public API. Let's wrap it in the detail namespace.
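
A rough sketch of what that could look like, assuming a simplified two-parameter kernel source. The namespace `sketch`, the nested `detail::scan` layout, the integer "kernel ids" and the `SelectedScanKernel` helper are all hypothetical and only stand in for the real template parameters and kernel handles:

```cpp
#include <cassert>

namespace sketch
{
namespace detail
{
namespace scan
{
// Internal helper: lives under detail, so it is not part of the public API.
// The integer ids are placeholders for the kernel handles a real source returns.
template <typename MaxPolicyT, typename OffsetT>
struct DeviceScanKernelSource
{
  static constexpr int InitKernelId() { return 1; }
  static constexpr int ScanKernelId() { return 2; }
};
} // namespace scan
} // namespace detail

// The public dispatch type refers to the internal source only through a
// defaulted template parameter, so ordinary callers never have to name it.
template <typename MaxPolicyT,
          typename OffsetT,
          typename KernelSource = detail::scan::DeviceScanKernelSource<MaxPolicyT, OffsetT>>
struct DispatchScan
{
  static constexpr int SelectedScanKernel() { return KernelSource::ScanKernelId(); }
};

// A caller that needs dynamic dispatch can still substitute its own source.
struct CustomKernelSource
{
  static constexpr int InitKernelId() { return 10; }
  static constexpr int ScanKernelId() { return 20; }
};
} // namespace sketch
```

With this layout, `sketch::DispatchScan<Policy, Offset>` picks up the internal source by default, while a third template argument swaps in a different one.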

Comment on lines +346 to 349
// TODO(ashwin): should this come from the launcher factory instead?
// Get max x-dimension of grid
int max_dim_x;
error = CubDebug(cudaDeviceGetAttribute(&max_dim_x, cudaDevAttrMaxGridDimX, device_ordinal));
Collaborator

suggestion: yes, all CUDA Runtime calls should be consolidated in the launcher factory
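
One possible shape for this, sketched without the CUDA Runtime so it runs anywhere: the dispatch code asks the factory for the grid limit instead of calling `cudaDeviceGetAttribute` itself. `Status`, `FakeLauncherFactory`, `MaxGridDimX` and `ComputeScanGridSize` are hypothetical names chosen for illustration, and the clamping step is an assumption about how the limit is used, not the actual dispatch code.

```cpp
#include <cassert>

// Hypothetical status type standing in for cudaError_t.
enum class Status
{
  Success,
  Failure
};

// Sketch of a launcher factory that owns every device query. A CUDA-backed
// factory would call cudaDeviceGetAttribute inside MaxGridDimX; this fake
// returns a stored value so the example needs no GPU.
struct FakeLauncherFactory
{
  int max_grid_dim_x;

  Status MaxGridDimX(int& out) const
  {
    out = max_grid_dim_x;
    return Status::Success;
  }
};

// Dispatch-side code: it only talks to the factory, never to the runtime.
// The grid size is the number of tiles clamped to the device's x-dimension limit.
template <typename LauncherFactory>
Status ComputeScanGridSize(const LauncherFactory& factory, int num_tiles, int& grid_size)
{
  int max_dim_x       = 0;
  const Status status = factory.MaxGridDimX(max_dim_x);
  if (status != Status::Success)
  {
    return status;
  }
  grid_size = num_tiles < max_dim_x ? num_tiles : max_dim_x;
  return Status::Success;
}
```

Keeping the query behind the factory means a dynamic launcher (such as one built for c.parallel) can supply the limit however it obtains it, while the dispatch logic stays unchanged.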

Labels
None yet
Projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

Add dynamic CUB dispatch for scan
4 participants