-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC for histogram CPU implementation #1930
base: main
Are you sure you want to change the base?
Changes from 26 commits
a03577e
ce117f5
ccc001e
d518a14
6e03468
10c4e50
efa7c9b
1ac82fd
02523c4
ac7b654
506fb62
1c6cb47
ceee3e3
0711090
3c5ad12
06a734f
b858a0e
53f4643
d718e0e
bb9e6f9
17e0510
9614209
2964a9e
9287fd2
cdf5092
215c2b7
04d5127
fe1efa2
60ec0e5
54e16b6
77435a3
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,167 @@ | ||
# Host Backends Support for the Histogram APIs | ||
|
||
## Introduction | ||
The oneDPL library added histogram APIs, currently implemented only for device policies with the DPC++ backend. These | ||
APIs are defined in the oneAPI Specification 1.4. Please see the | ||
[oneAPI Specification](https://github.com/uxlfoundation/oneAPI-spec/blob/main/source/elements/oneDPL/source/parallel_api/algorithms.rst#parallel-algorithms) | ||
for the details. The host-side backends (serial, TBB, OpenMP) are not yet supported. This RFC proposes extending | ||
histogram support to these backends. | ||
|
||
The pull request for the proposed implementation exists [here](https://github.com/oneapi-src/oneDPL/pull/1974). | ||
|
||
## Motivations | ||
There are many cases to use a host-side serial or a host-side implementation of histogram. Another motivation for adding | ||
the support is simply to be spec compliant with the oneAPI specification. | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Due to it is not a story telling, I would suggest omitting introductory expressions like "It may make more sense" or "It's natural for a user to expect"... Only short and exact information. For example, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. taken suggestion. Thanks |
||
## Design Considerations | ||
|
||
### Key Requirements | ||
Provide support for the `histogram` APIs with the following policies and backends: | ||
- Policies: `seq`, `unseq`, `par`, `par_unseq` | ||
- Backends: `serial`, `tbb`, `openmp` | ||
|
||
Users have a choice of execution policies when calling oneDPL APIs. They also have a number of options of backends | ||
which they can select from when using oneDPL. It is important that all combinations of these options have support for | ||
the `histogram` APIs. | ||
|
||
### Performance | ||
With little computation, a histogram algorithm is likely a memory-bound algorithm. So, the implementation prioritize | ||
reducing memory accesses and minimizing temporary memory traffic. | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Taking into account my shared thought above, I would propose to re-prahse it keeping the main point shorter: "A histogram algorithm is a memory-bound algorithm. So, the implementation should care of reducing memory accesses and minimizing temporary memory traffic." There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Taken mostly. Thanks |
||
### Memory Footprint | ||
mmichel11 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
There are no guidelines here from the standard library as this is an extension API. Still, we will minimize memory | ||
footprint where possible. | ||
|
||
### Code Reuse | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I guess we can this topic omit at all. It tells nothing about 'histogram', just general wording, which can be applied for any new feature in oneDPL... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've removed some of the general language and added something which is important for histogram in an attempt to answer feedback from @akukanov to clarify where the implementation of the algorithm will live. |
||
We want to minimize adding requirements for parallel backends to implement, and lift as much as possible to the | ||
algorithm implementation level. We should be able to avoid adding a `__parallel_histogram` call in the individual | ||
backends, and instead rely upon `__parallel_for`. | ||
|
||
### SIMD/openMP SIMD Implementation | ||
Currently oneDPL relies upon openMP SIMD to provide its vectorization, which is designed to provide vectorization across | ||
loop iterations. OneDPL does not directly use any intrinsics which may offer more complex functionality than what is | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The second sentence may be omitted. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. applied. |
||
provided by OpenMP. | ||
|
||
There are a few parts of the histogram algorithm to consider. For the calculation to determine which bin to increment | ||
there are two APIs, even and custom range which have significantly different methods to determine the bin to | ||
increment. For the even bin API, the calculations to determine selected bin have some opportunity for vectorization as | ||
each input has the same mathematical operations applied to each. However, for the custom range API, each input element | ||
uses a binary search through a list of bin boundaries to determine the selected bin. This operation will have a | ||
different length and control flow based upon each input element and will be very difficult to vectorize. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. But we can calculate the bin indexes for the input data in SIMD manner. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is applicable only for the even binned case. Without using intrinsic operations, we must do this with omp simd and the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For now I'll ask that we leave it as described in the RFC, which gives some understanding of how this can be improved in the future, but starts without vectorization for this phase. |
||
|
||
Next, lets consider the increment operation itself. This operation increments a data dependent bin location, and may | ||
result in conflicts between elements of the same vector. This increment operation therefore is unvectorizable without | ||
more complex handling. Some hardware does implement SIMD conflict detection via specific intrinsics, but this is not | ||
available via OpenMP SIMD. Alternatively, we can multiply our number of temporary histogram copies by a factor of the | ||
vector width, but it is unclear if it is worth the overhead. OpenMP SIMD provides an `ordered` structured block which | ||
we can use to exempt the increment from SIMD operations as well. However, this often results in vectorization being | ||
refused by the compiler. Initial implementation will avoid vectorization of this main histogram loop. | ||
|
||
Last, for our below proposed implementation there is the task of combining temporary histogram data into the global | ||
output histogram. This is directly vectorizable via our existing brick_walk implementation, and will be vectorized when | ||
a vector policy is used. | ||
|
||
### Serial Backend | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. https://github.com/oneapi-src/oneDPL/pull/1930/files#diff-fb5f6394ad0d350d719b9f31b139fa60c347ec64795c78e56875c4f002aeb0e7R25 Explanation what is "Serial Backend" means as the others backends mean, is a kind of "oneDPL general description" and not related to RFC for histogram feature, IMHO. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. With some recent changes, there is some specifics about the serial implementation I wanted to add here so I've kept the section. |
||
We plan to support a serial backend for histogram APIs in addition to openMP and TBB. This backend will handle all | ||
policies types, but always provide a serial unvectorized implementation. To make this backend compatible with the other | ||
approaches, we will use a single temporary histogram copy, which then is copied to the final global histogram. In | ||
our benchmarking, using a temporary copy performs similarly as compared to initializing and then accumulating directly | ||
into the output global histogram. There seems to be no performance motivated reason to special case the serial | ||
algorithm to use the global histogram directly. | ||
|
||
## Existing APIs / Patterns | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we intent to give some information about OneDPL parallel backend patterns on which histogram can based on, I would notify, there is not "count_if" pattern, there is "reduce"("transform_reduce") pattern. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I clarified the language a little here to make it more clear that copy_if uses reduce internally. I still think it deserves some text describing it as it may not be immediately obvious to everyone that reduce is not well matched. |
||
### count_if | ||
`histogram` is similar to `count_if` in that it conditionally increments a number of counters based upon the data in a | ||
sequence. `count_if` relies upon the `transform_reduce` pattern internally, and returns a scalar-typed value and doesn't | ||
provide any function to modify the variable being incremented. Using `count_if` without significant modification would | ||
require us to loop through the entire sequence for each output bin in the histogram. From a memory bandwidth | ||
perspective, this is untenable. Similarly, using a `histogram` pattern to implement `count_if` is unlikely to provide a well-performing result in the end, as contention should be far higher, and `transform_reduce` is a very well-matched | ||
pattern performance-wise. | ||
|
||
### parallel_for | ||
`parallel_for` is an interesting pattern in that it is very generic and embarrassingly parallel. This is close to what | ||
we need for `histogram`. However, we cannot simply use it without any added infrastructure. If we were to just use | ||
`parallel_for` alone, there would be a race condition between threads when incrementing the values in the output | ||
histogram. We should be able to use `parallel_for` as a building block for our implementation, but it requires some way | ||
to synchronize and accumulate between threads. | ||
|
||
|
||
## Alternative Approaches | ||
|
||
### Atomics | ||
This method uses atomic operations to remove the race conditions during accumulation. With atomic increments of the | ||
output histogram data, we can merely run a `parallel_for` pattern. | ||
|
||
To deal with atomics appropriately, we have some limitations. We must either use standard library atomics, atomics | ||
specific to a backend, or custom atomics specific to a compiler. `C++17` provides `std::atomic<T>`, however, this can | ||
only provide atomicity for data which is created with atomics in mind. This means allocating temporary data and then | ||
copying it to the output data. `C++20` provides `std::atomic_ref<T>` which would allow us to wrap user-provided output | ||
data in an atomic wrapper, but we cannot assume `C++20` for all users. OpenMP provides atomic | ||
operations, but that is only available for the OpenMP backend. The working plan was to implement a macro like | ||
`_ONEDPL_ATOMIC_INCREMENT(var)` which uses an `std::atomic_ref` if available, and alternatively uses compiler builtins | ||
like `InterlockedAdd` or `__atomic_fetch_add_n`. In a proof of concept implementation,this seemed to work, but does | ||
reach more into details than compiler / OS specifics than is desired for implementations prior to `C++20`. | ||
|
||
After experimenting with a proof of concept implementation of this implementation, it seems that the atomic | ||
implementation has very limited applicability to real cases. We explored a spectrum of number of elements combined with | ||
number of bins with both OpenMP and TBB. There was some subset of cases for which the atomics implementation | ||
outperformed the proposed implementation (below). However, this was generally limited to some specific cases where | ||
the number of bins was very large (~1 Million), and even for this subset significant benefit was only found for cases | ||
with a small number for input elements relative to number of bins. This makes sense because the atomic implementation | ||
is able to avoid the overhead of allocating and initializing temporary histogram copies, which is largest when | ||
the number of bins is large compared to the number of input elements. With many bins, contention on atomics is also | ||
limited as compared to the embarassingly parallel proposal which does experience this contention. | ||
|
||
When we examine the real world utility of these cases, we find that they are uncommon and unlikely to be the important | ||
use cases. Histograms generally are used to categorize large images or arrays into a smaller number of bins to | ||
characterize the result. Cases for which there are similar or more bins than input elements are not very practical in | ||
practice. The maintenance and complexity cost associated with supporting and maintaining a second implementation to | ||
serve this subset of cases does not seem to be justified. Therefore, this implementation has been discarded at this | ||
time. | ||
|
||
### Other Unexplored Approaches | ||
* One could consider some sort of locking approach which locks mutexes for subsections of the output histogram prior to | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. BTW, I have a curiosity question. Which approach does NVidia use? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. NVidia has a similar API within CUB but not within Thrust, and therefore does not have a CPU implementation that I am aware of, only one specifically for a GPU device. |
||
modifying them. It's possible such an approach could provide a similar approach to atomics, but with different | ||
overhead trade-offs. It seems quite likely that this would result in more overhead, but it could be worth exploring. | ||
|
||
* Another possible approach could be to do something like the proposed implementation one, but with some sparse | ||
representation of output data. However, I think the general assumptions we can make about the normal case make this | ||
less likely to be beneficial. It is quite likely that `n` is much larger than the output histograms, and that a large | ||
percentage of the output histogram may be occupied, even when considering dividing the input amongst multiple | ||
threads. This could be explored if we find temporary storage is too large for some cases and the atomic approach | ||
does not provide a good fallback. | ||
|
||
## Proposal | ||
After exploring the above implementation for `histogram`, the following proposal better represents the use | ||
cases which are important, and provides reasonable performance for most cases. | ||
|
||
### Embarrassingly Parallel Via Temporary Histograms | ||
This method uses temporary storage and a pair of calls to backend specific `parallel_for` functions to accomplish the | ||
`histogram`. These calls will use the existing infrastructure to provide properly composable parallelism, without extra | ||
histogram-specific patterns in the implementation of a backend. | ||
|
||
This algorithm does however require that each parallel backend will add a `__thread_enumerable_storage<_StoredType>` | ||
struct which provides the following: | ||
* constructor which takes a variadic list of args to pass to the constructor of each thread's object | ||
* `get_for_current_thread()` returns reference to the current thread's stored object | ||
* `get_with_id(int i)` returns reference to the stored object for an index | ||
* `size()` returns number of stored objects | ||
|
||
In the TBB backend, this will use `enumerable_thread_specific` internally. For OpenMP, we implement our own similar | ||
thread local storage which will allocate and initialize the thread local storage at the first usage for each active | ||
thread, similar to TBB. The serial backend will merely create a single copy of the temporary object for use. The serial | ||
backend does not technically need any thread specific storage, but to avoid special casing for this serial backend, we | ||
use a single copy of histogram. In practice, our benchmarking reports little difference in performance between this | ||
implementation and the original, which directly accumulated to the output histogram. | ||
|
||
With this new structure we will use the following algorithm: | ||
|
||
1) Run a `parallel_for` pattern which performs a `histogram` on the input sequence where each thread accumulates into | ||
its own temporary histogram returned by `__thread_enumerable_storage`. The parallelism is divided on the input | ||
element axis, and we rely upon existing `parallel_for` to implement chunksize and thread composability. | ||
2) Run a second `parallel_for` over the `histogram` output sequence which accumulates all temporary copies of the | ||
histogram created within `__thread_enumerable_storage` into the output histogram sequence. The parallelism is divided | ||
on the histogram bin axis, and each chunk loops through all temporary histograms to accumulate into the output | ||
histogram. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let me share my thoughts:
In my understanding RFC is not a book... So, I would preferer to have a short, concise and precise description of what is offered, without frills, like a mathematical theorem. For example:
"The oneDPL library added histogram APIs, currently implemented only for device policies with the DPC++ backend. These APIs are defined in the oneAPI Specification 1.4. Please see the
oneAPI Specification for the details. The host-side backends (serial, TBB, OpenMP) are not yet supported. This RFC proposes extending histogram support to these backends."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I've accepted your language here. Thanks.