From a03577e9e3458bf6603f798d018582cc4a872656 Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Wed, 30 Oct 2024 11:27:20 -0400 Subject: [PATCH 01/31] initial rough commit Signed-off-by: Dan Hoeflinger --- .../proposed/host_backend_histogram/README.md | 63 +++++++++++++++++++ 1 file changed, 63 insertions(+) create mode 100644 rfcs/proposed/host_backend_histogram/README.md diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md new file mode 100644 index 00000000000..cb786fcac04 --- /dev/null +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -0,0 +1,63 @@ +# Host backends support for the histogram APIs + +## Introduction +histogram added spec, gpu. Not supported with serial tbb, openmp backends. Not part of stl. + +Serial implementation is straightforward and is not worth discussing in much length here. We will add it but there is not much to discuss there. + +## Motivations +Users don't always want to use device policies and accelerators to run their code. It may make more sense in many cases to use a serial implementation or a host side arallel imlementation or even a vector implementation of histogram. It's natural for a user to expect that one to deal with support these other back ends for all APIs. +Another motivation for adding the support is simply to be spec compliant with the oneAPI specification. + +## Design considerations +For execution policies when calling 1 dpll apis. They also have a number of options for backends which they can select from when using one dpl. It is important that all of these options have some support for histogram. +We also care about how these perform and how they scale to the number of threads, and we also care about their memory footprint. + +In general I believe we can safely assume that the normal case is for the number of elements to be far greater than the number of bins.Also this is a very low computation api which will likely be limited by memory bandwidth. This means we should + +### Key requirements +seq, unseq, par, par_unseq +serial, tbb, openmp + +### Performance +As with all algorithms in oneDPL, our goal is to make them a performant as possible. + +### Memory Footprint +There are no guidelines here from the standard library as this is an extension API. However, we should always try to minimize memory footprint whenever possible. Minimizing memory footprint may also help us improve performance here because as mentioned above this will be very likely to be a memory bandwidth bound api. + +### Code Reuse +Our goal here is to make something maintainable and to reuse as much as we can which already exists and has been reviewed within oneDPL. With everything else this has to be balanced with performance considerations. + +### Vector +As mentioned above histogram looks to be a memory bandwidth dependent algorithm. This means that we are unlikely to get much benefit from vector instructions as they provide assistance mostly in speeding up computation. Vector operations in this case also compound our issue of race conditions multiplying the number of virtual threads by the vector length. The advantage we get from vectorization of the increment operation or the lookup into the output histogram is unlikely to provide much benefit especially when we account for extra memory footprint required or synchronization required to overcome the race conditions which we add from the additional concurrant streams of execution. 
It may make sense to decline to add vectorized operations within histogram even when they are requested by the user via the execution policy. + +## Existing patterns + +### count_if + +Histogram is similar to count if in that it is conditionally incrementing a number of counters based upon the data in a sequence. Count_if returns a different type which is a scalar, And doesn't provide any function To modify the variable being incremented. Using count_if without modification would require us to loop through the entire sequence for each output bin in the histogram. From a memory bandwidth perspective this seems far from ideal. + +### parallel_for + +Parallel_for is an interesting pattern in that it is very generic and should allow us to do what we want to do for histogram ultimately. However we cannot simply use it without any added infrastructure. If we were to just run it through a parallel_for there would be a race condition between threads when incrementing the values in the output histogram.I believe parallel_for will be a part of our implementation but it requires some way to synchronize and accumulate between threads. + +## Proposal +I propose to add a new pattern specific to histogram which goes as follows: +1) Determine the number of threads that we will use, perhaps add some method to do this generically based on back end. +2) Create temporary data for the number of threads minus one copies of the histogram output sequence. +3) Run a parallel_for pattern which performs a histogram on The input sequence where each thread accumulates into its own copy of the histogram using both the temporary storage and provided output sequence to remove any race conditions. +4) Run a second parallel for over the histogram sequence which accumulates all temporary copies of the histogram into the output histogram sequence. + +New machinery that will be required here is the ability to query how many threads will be used and also the machinery to check what thread the current execution is using. Ideally these can be generic wrappers around the specific back ends which would allow a unified implementation for all host backends. + +## Alternative Option + +One alternative way to provide a parallel histogram which would minimize memory footprint would be to use atomic operations to remove the race conditions during accumulation. The user provides the output sequence It won't be a atomic variable. Open MP does provide wrappers around generic memory to provide atomic operations within an open MP parallel section however I do not know of a way to provide this within the tbb back end. We can Alternatively allocate a copy of the histogram as atomic variables and use them however this would require us to add a copy from the atomic copy of the histogram to the output sequence provided by the user. With large enough histogram bin counts relative to the number of threads, atomics may be an attractive solution because contention on the atomics will be relatively low. It also limits the requirement for extra temporary storage. Especially for open MP it may make sense to explore this option and compare performance. + +## Open Questions +If we had access to std::atomic_ref from C++20, atomics may be a better option for many cases, without the need for extra allocation or copies / accumulation. +Would it be worthwhile to add our own implementation of atomic_ref for C++17? I believe this would require specializations for each of our supported compilers. 
+ +What is the overhead of atomics in general in this case and does the overhead there make them inherently worse than merely having extra copies of the histogram and accumulating? + +Is it worthwhile to have separate implementations for tbb and openMP because they may differ in the best performing implementation? \ No newline at end of file From ce117f524916a9de6e251cc64463723d7c66aed7 Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Wed, 30 Oct 2024 13:17:39 -0400 Subject: [PATCH 02/31] minor improvements Signed-off-by: Dan Hoeflinger --- rfcs/proposed/host_backend_histogram/README.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index cb786fcac04..0f75e2de0d5 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -10,7 +10,7 @@ Users don't always want to use device policies and accelerators to run their cod Another motivation for adding the support is simply to be spec compliant with the oneAPI specification. ## Design considerations -For execution policies when calling 1 dpll apis. They also have a number of options for backends which they can select from when using one dpl. It is important that all of these options have some support for histogram. +For execution policies when calling oneDPL APIs. They also have a number of options for backends which they can select from when using one dpl. It is important that all of these options have some support for histogram. We also care about how these perform and how they scale to the number of threads, and we also care about their memory footprint. In general I believe we can safely assume that the normal case is for the number of elements to be far greater than the number of bins.Also this is a very low computation api which will likely be limited by memory bandwidth. This means we should @@ -28,7 +28,7 @@ There are no guidelines here from the standard library as this is an extension A ### Code Reuse Our goal here is to make something maintainable and to reuse as much as we can which already exists and has been reviewed within oneDPL. With everything else this has to be balanced with performance considerations. -### Vector +### unseq backend As mentioned above histogram looks to be a memory bandwidth dependent algorithm. This means that we are unlikely to get much benefit from vector instructions as they provide assistance mostly in speeding up computation. Vector operations in this case also compound our issue of race conditions multiplying the number of virtual threads by the vector length. The advantage we get from vectorization of the increment operation or the lookup into the output histogram is unlikely to provide much benefit especially when we account for extra memory footprint required or synchronization required to overcome the race conditions which we add from the additional concurrant streams of execution. It may make sense to decline to add vectorized operations within histogram even when they are requested by the user via the execution policy. ## Existing patterns @@ -51,7 +51,6 @@ I propose to add a new pattern specific to histogram which goes as follows: New machinery that will be required here is the ability to query how many threads will be used and also the machinery to check what thread the current execution is using. 
Ideally these can be generic wrappers around the specific back ends which would allow a unified implementation for all host backends. ## Alternative Option - One alternative way to provide a parallel histogram which would minimize memory footprint would be to use atomic operations to remove the race conditions during accumulation. The user provides the output sequence It won't be a atomic variable. Open MP does provide wrappers around generic memory to provide atomic operations within an open MP parallel section however I do not know of a way to provide this within the tbb back end. We can Alternatively allocate a copy of the histogram as atomic variables and use them however this would require us to add a copy from the atomic copy of the histogram to the output sequence provided by the user. With large enough histogram bin counts relative to the number of threads, atomics may be an attractive solution because contention on the atomics will be relatively low. It also limits the requirement for extra temporary storage. Especially for open MP it may make sense to explore this option and compare performance. ## Open Questions From ccc001e25a7c7f11823cb5725ceca236f10dbeb5 Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Fri, 1 Nov 2024 14:48:31 -0400 Subject: [PATCH 03/31] revision Signed-off-by: Dan Hoeflinger --- .../proposed/host_backend_histogram/README.md | 69 +++++++++++-------- 1 file changed, 42 insertions(+), 27 deletions(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index 0f75e2de0d5..afdab5723b7 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -1,62 +1,77 @@ # Host backends support for the histogram APIs ## Introduction -histogram added spec, gpu. Not supported with serial tbb, openmp backends. Not part of stl. - -Serial implementation is straightforward and is not worth discussing in much length here. We will add it but there is not much to discuss there. +In version 2022.6.0 two `histogram` APIs were added to oneDPL, but implementations were only provided for device policies with the dpcpp backend. `Histogram` was added to the oneAPI specification 1.4 provisional release, and should be present in the 1.4 specification. Please see the [oneAPI Specification](https://github.com/uxlfoundation/oneAPI-spec/blob/main/source/elements/oneDPL/source/parallel_api/algorithms.rst#parallel-algorithms) for a full definition of the semantics of the histogram APIs. In short, they take an input sequence and classifies them either evenly distributed or user defined bins via a list of separating values and count the number of values in each bin, writing to a user provided output histogram sequence. +Currently `histogram` is not supported with serial, tbb, or openmp backends in our oneDPL implementation. This RFC aims to propose the implementation of `histogram` for these host-side backends. +The serial implementation is straightforward and is not worth discussing in much length here. We will add it but there is not much to discuss within the RFC, as its implementation will be straighforward. ## Motivations -Users don't always want to use device policies and accelerators to run their code. It may make more sense in many cases to use a serial implementation or a host side arallel imlementation or even a vector implementation of histogram. It's natural for a user to expect that one to deal with support these other back ends for all APIs. 
-Another motivation for adding the support is simply to be spec compliant with the oneAPI specification. +Users don't always want to use device policies and accelerators to run their code. It may make more sense in many cases to use a serial implementation or a host side parallel implementation of `histogram`. It's natural for a user to expect that oneDPL supports these other back ends for all APIs. Another motivation for adding the support is simply to be spec compliant with the oneAPI specification. ## Design considerations -For execution policies when calling oneDPL APIs. They also have a number of options for backends which they can select from when using one dpl. It is important that all of these options have some support for histogram. -We also care about how these perform and how they scale to the number of threads, and we also care about their memory footprint. - -In general I believe we can safely assume that the normal case is for the number of elements to be far greater than the number of bins.Also this is a very low computation api which will likely be limited by memory bandwidth. This means we should ### Key requirements -seq, unseq, par, par_unseq -serial, tbb, openmp +Provide support for the `histogram` APIs with the following policies and backends: +Policies: `seq`, `unseq`, `par`, `par_unseq` +Backends: `serial`, `tbb`, `openmp` + +Users have a choice of execution policies when calling oneDPL APIs. They also have a number of options of backends which they can select from when using oneDPL. It is important that all combinations of these options have support for the `histogram` APIs. ### Performance -As with all algorithms in oneDPL, our goal is to make them a performant as possible. +As with all algorithms in oneDPL, our goal is to make them a performant as possible. By definition, `histogram` is a low computation algorithm which will likely be limited by memory bandwidth, especially for the evenly-divided case. Minimizing and optimizing memory accesses, as well as limiting unnecessary memory traffic of temporaries will likely have a high impact on overall performance. ### Memory Footprint -There are no guidelines here from the standard library as this is an extension API. However, we should always try to minimize memory footprint whenever possible. Minimizing memory footprint may also help us improve performance here because as mentioned above this will be very likely to be a memory bandwidth bound api. +There are no guidelines here from the standard library as this is an extension API. However, we should always try to minimize memory footprint whenever possible. Minimizing memory footprint may also help us improve performance here because as mentioned above this will be very likely to be a memory bandwidth bound API. +In general, the normal case for histogram is for the number of elements in the input sequence to be far greater than the number of output histogram bins. We may be able to use that to our advantage. ### Code Reuse -Our goal here is to make something maintainable and to reuse as much as we can which already exists and has been reviewed within oneDPL. With everything else this has to be balanced with performance considerations. +Our goal here is to make something maintainable and to reuse as much as we can which already exists and has been reviewed within oneDPL. With everything else, this must be balanced with performance considerations. ### unseq backend -As mentioned above histogram looks to be a memory bandwidth dependent algorithm. 
This means that we are unlikely to get much benefit from vector instructions as they provide assistance mostly in speeding up computation. Vector operations in this case also compound our issue of race conditions multiplying the number of virtual threads by the vector length. The advantage we get from vectorization of the increment operation or the lookup into the output histogram is unlikely to provide much benefit especially when we account for extra memory footprint required or synchronization required to overcome the race conditions which we add from the additional concurrant streams of execution. It may make sense to decline to add vectorized operations within histogram even when they are requested by the user via the execution policy. +As mentioned above histogram looks to be a memory bandwidth dependent algorithm. This may limit the benefit achievable from vector instructions as they provide assistance mostly in speeding up computation. Vector operations in this case also compound our issue of race conditions multiplying the number of concurrent lines of execution by the vector length. The advantage we get from vectorization of the increment operation or the lookup into the output histogram is may not provide much benefit especially when we account for extra memory footprint required or synchronization required to overcome the race conditions which we add from the additional concurrant streams of execution. It may make sense to decline to add vectorized operations within histogram depending on the implementation used, and based on performance results. ## Existing patterns ### count_if -Histogram is similar to count if in that it is conditionally incrementing a number of counters based upon the data in a sequence. Count_if returns a different type which is a scalar, And doesn't provide any function To modify the variable being incremented. Using count_if without modification would require us to loop through the entire sequence for each output bin in the histogram. From a memory bandwidth perspective this seems far from ideal. +`histogram` is similar to `count_if` in that it is conditionally incrementing a number of counters based upon the data in a sequence. `count_if` returns a scalar typed value and doesn't provide any function To modify the variable being incremented. Using `count_if` without significant modification would require us to loop through the entire sequence for each output bin in the histogram. From a memory bandwidth perspective this is untenable. Similarly, using a `histogram` pattern to implement `count_if`, is unlikely to provide a well performing result in the end, as contention should be far higher, and `reduce` is a very well matched pattern performance-wise. ### parallel_for -Parallel_for is an interesting pattern in that it is very generic and should allow us to do what we want to do for histogram ultimately. However we cannot simply use it without any added infrastructure. If we were to just run it through a parallel_for there would be a race condition between threads when incrementing the values in the output histogram.I believe parallel_for will be a part of our implementation but it requires some way to synchronize and accumulate between threads. +`parallel_for` is an interesting pattern in that it is very generic and embarassingly parallel. This is close to what we need for `histogram`. However, we cannot simply use it without any added infrastructure. 
If we were to just use `parallel_for` alone, there would be a race condition between threads when incrementing the values in the output histogram. We should be able to use `parallel_for` as a building block for our implementation but it requires some way to synchronize and accumulate between threads. ## Proposal -I propose to add a new pattern specific to histogram which goes as follows: -1) Determine the number of threads that we will use, perhaps add some method to do this generically based on back end. -2) Create temporary data for the number of threads minus one copies of the histogram output sequence. -3) Run a parallel_for pattern which performs a histogram on The input sequence where each thread accumulates into its own copy of the histogram using both the temporary storage and provided output sequence to remove any race conditions. -4) Run a second parallel for over the histogram sequence which accumulates all temporary copies of the histogram into the output histogram sequence. +I believe there are two competing options for `histogram`, which may both have utility in the final implementation depending on the use case. + +### Implementation One (Embarassingly Parallel) +This method uses temporary storage and a pair of embarassingly parallel `parallel_for` loops to accomplish the `histogram`. +1) Determine the number of threads that we will use, perhaps adding some method to do this generically based on back end. +2) Create temporary data for the number of threads minus one copies of the histogram output sequence. Thread zero can use the user provided output data. +3) Run a `parallel_for` pattern which performs a `histogram` on the input sequence where each thread accumulates into its own copy of the ouput sequence using the temporary storage to remove any race conditions. +4) Run a second `parallel_for` over the `histogram` output sequence sequence which accumulates all temporary copies of the histogram into the output histogram sequence. This step is also embarassingly parallel. +5) Deallocate temporary storage + +New machinery that will be required here is the ability to query how many threads will be used and also the machinery to check what thread the current execution is using within a brick. Ideally, these can be generic wrappers around the specific backends which would allow a unified implementation for all host backends. + +### Implementation Two (Atomics) +This method uses atomic operations to remove the race conditions during accumulation. With atomic increments of the output histogram data, we can merely run a `parallel_for` pattern. -New machinery that will be required here is the ability to query how many threads will be used and also the machinery to check what thread the current execution is using. Ideally these can be generic wrappers around the specific back ends which would allow a unified implementation for all host backends. +To deal with atomics appropriately, we have some limitations. We must either use standard library atomics, atomics specific to a backend, or custom atomics specific to a compiler. `C++17` provides `std::atomic`, however this can only provide atomicicity for data which is created with atomics in mind. This means allocating temporary data and then copying to the output data. `C++20` provides `std::atomic_ref` which would allow us to wrap user provided output data in an atomic wrapper, but we cannot assume `C++17` for all users. 
We could look to implement our own `atomic_ref` for C++17, but that would require specialization for individual compilers. OpenMP provides atomic operations, but that is only available for the OpenMP backend. + +It remains to be seen if atomics are worth their overhead and contention from a performance perspective, and may depend on the different approaches available. + + +## Selecting Between Algorithms +It may be the case that multiple aspects may provide an advantage to either algorithm one or two. Which `histogram` API has been called, `n`, number of output bins, and backend / atomic provider may all impact the performance trade-offs between these two approaches. My intention is to experiment with these and be open to a heuristic to choose one or the other based upon the circumstances if that is what the data suggests is best. The larger the number of output bins, the better atomics should do vs redundant copies of the output. -## Alternative Option -One alternative way to provide a parallel histogram which would minimize memory footprint would be to use atomic operations to remove the race conditions during accumulation. The user provides the output sequence It won't be a atomic variable. Open MP does provide wrappers around generic memory to provide atomic operations within an open MP parallel section however I do not know of a way to provide this within the tbb back end. We can Alternatively allocate a copy of the histogram as atomic variables and use them however this would require us to add a copy from the atomic copy of the histogram to the output sequence provided by the user. With large enough histogram bin counts relative to the number of threads, atomics may be an attractive solution because contention on the atomics will be relatively low. It also limits the requirement for extra temporary storage. Especially for open MP it may make sense to explore this option and compare performance. ## Open Questions -If we had access to std::atomic_ref from C++20, atomics may be a better option for many cases, without the need for extra allocation or copies / accumulation. Would it be worthwhile to add our own implementation of atomic_ref for C++17? I believe this would require specializations for each of our supported compilers. What is the overhead of atomics in general in this case and does the overhead there make them inherently worse than merely having extra copies of the histogram and accumulating? -Is it worthwhile to have separate implementations for tbb and openMP because they may differ in the best performing implementation? \ No newline at end of file +Is it worthwhile to have separate implementations for tbb and openMP because they may differ in the best performing implementation? + +How will vectorized bricks perform, in what situations will it be advatageous to use or not use vector instructions? + +What is the best heuristic for selecting between algorithms (if one is not the clear winner)? 
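Regarding the "new machinery" called out in the proposal above — querying how many threads will participate and which thread the current brick is running on — a rough sketch of generic wrappers over the host backends could look like the following. The function names and the `HISTO_USE_*` backend-selection macros are placeholders invented for illustration, not actual oneDPL internals; only the underlying OpenMP and oneTBB calls are real.

```cpp
// Illustrative only: backend-agnostic thread-introspection wrappers.
#include <cstddef>

#if defined(HISTO_USE_OPENMP_BACKEND)
#    include <omp.h>
#elif defined(HISTO_USE_TBB_BACKEND)
#    include <oneapi/tbb/task_arena.h>
#endif

inline std::size_t histo_max_threads()
{
#if defined(HISTO_USE_OPENMP_BACKEND)
    return static_cast<std::size_t>(omp_get_max_threads());
#elif defined(HISTO_USE_TBB_BACKEND)
    return static_cast<std::size_t>(oneapi::tbb::this_task_arena::max_concurrency());
#else
    return 1; // serial backend
#endif
}

inline std::size_t histo_current_thread_index()
{
#if defined(HISTO_USE_OPENMP_BACKEND)
    return static_cast<std::size_t>(omp_get_thread_num());
#elif defined(HISTO_USE_TBB_BACKEND)
    // current_thread_index() is negative if the calling thread is not in an arena yet.
    int idx = oneapi::tbb::this_task_arena::current_thread_index();
    return idx < 0 ? 0 : static_cast<std::size_t>(idx);
#else
    return 0; // serial backend
#endif
}
```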
\ No newline at end of file From d518a14e4985ebc0d9e8e1df2e822e1c500f457b Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Fri, 1 Nov 2024 14:53:33 -0400 Subject: [PATCH 04/31] Formatting, minor Signed-off-by: Dan Hoeflinger --- rfcs/proposed/host_backend_histogram/README.md | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index afdab5723b7..119cf5881b4 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -66,12 +66,10 @@ It may be the case that multiple aspects may provide an advantage to either algo ## Open Questions -Would it be worthwhile to add our own implementation of atomic_ref for C++17? I believe this would require specializations for each of our supported compilers. +* Would it be worthwhile to add our own implementation of atomic_ref for C++17? I believe this would require specializations for each of our supported compilers. -What is the overhead of atomics in general in this case and does the overhead there make them inherently worse than merely having extra copies of the histogram and accumulating? +* What is the overhead of atomics in general in this case and does the overhead there make them inherently worse than merely having extra copies of the histogram and accumulating? -Is it worthwhile to have separate implementations for tbb and openMP because they may differ in the best performing implementation? +* Is it worthwhile to have separate implementations for tbb and openMP because they may differ in the best performing implementation? What is the best heuristic for selecting between algorithms (if one is not the clear winner)? -How will vectorized bricks perform, in what situations will it be advatageous to use or not use vector instructions? - -What is the best heuristic for selecting between algorithms (if one is not the clear winner)? \ No newline at end of file +* How will vectorized bricks perform, in what situations will it be advatageous to use or not use vector instructions? From 6e03468e75700f726224934cedf2afbb1c89b20d Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Fri, 1 Nov 2024 15:04:03 -0400 Subject: [PATCH 05/31] spelling and grammar Signed-off-by: Dan Hoeflinger --- .../proposed/host_backend_histogram/README.md | 48 +++++++++---------- 1 file changed, 23 insertions(+), 25 deletions(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index 119cf5881b4..6106e2b843c 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -1,12 +1,12 @@ # Host backends support for the histogram APIs ## Introduction -In version 2022.6.0 two `histogram` APIs were added to oneDPL, but implementations were only provided for device policies with the dpcpp backend. `Histogram` was added to the oneAPI specification 1.4 provisional release, and should be present in the 1.4 specification. Please see the [oneAPI Specification](https://github.com/uxlfoundation/oneAPI-spec/blob/main/source/elements/oneDPL/source/parallel_api/algorithms.rst#parallel-algorithms) for a full definition of the semantics of the histogram APIs. In short, they take an input sequence and classifies them either evenly distributed or user defined bins via a list of separating values and count the number of values in each bin, writing to a user provided output histogram sequence. 
-Currently `histogram` is not supported with serial, tbb, or openmp backends in our oneDPL implementation. This RFC aims to propose the implementation of `histogram` for these host-side backends. -The serial implementation is straightforward and is not worth discussing in much length here. We will add it but there is not much to discuss within the RFC, as its implementation will be straighforward. +In version 2022.6.0, two `histogram` APIs were added to oneDPL, but implementations were only provided for device policies with the dpcpp backend. `Histogram` was added to the oneAPI specification 1.4 provisional release and should be present in the 1.4 specification. Please see the [oneAPI Specification](https://github.com/uxlfoundation/oneAPI-spec/blob/main/source/elements/oneDPL/source/parallel_api/algorithms.rst#parallel-algorithms) for a full definition of the semantics of the histogram APIs. In short, they take elements from an input sequence and classifies them into either evenly distributed or user-defined bins via a list of separating values and count the number of values in each bin, writing to a user-provided output histogram sequence. +Currently, `histogram` is not supported with serial, tbb, or openmp backends in our oneDPL implementation. This RFC aims to propose the implementation of `histogram` for these host-side backends. +The serial implementation is straightforward and is not worth discussing in much length here. We will add it, but there is not much to discuss within the RFC, as its implementation will be straightforward. ## Motivations -Users don't always want to use device policies and accelerators to run their code. It may make more sense in many cases to use a serial implementation or a host side parallel implementation of `histogram`. It's natural for a user to expect that oneDPL supports these other back ends for all APIs. Another motivation for adding the support is simply to be spec compliant with the oneAPI specification. +Users don't always want to use device policies and accelerators to run their code. It may make more sense in many cases to use a serial implementation or a host-side parallel implementation of `histogram`. It's natural for a user to expect that oneDPL supports these other backends for all APIs. Another motivation for adding the support is simply to be spec compliant with the oneAPI specification. ## Design considerations @@ -18,58 +18,56 @@ Backends: `serial`, `tbb`, `openmp` Users have a choice of execution policies when calling oneDPL APIs. They also have a number of options of backends which they can select from when using oneDPL. It is important that all combinations of these options have support for the `histogram` APIs. ### Performance -As with all algorithms in oneDPL, our goal is to make them a performant as possible. By definition, `histogram` is a low computation algorithm which will likely be limited by memory bandwidth, especially for the evenly-divided case. Minimizing and optimizing memory accesses, as well as limiting unnecessary memory traffic of temporaries will likely have a high impact on overall performance. +As with all algorithms in oneDPL, our goal is to make them as performant as possible. By definition, `histogram` is a low computation algorithm which will likely be limited by memory bandwidth, especially for the evenly-divided case. Minimizing and optimizing memory accesses, as well as limiting unnecessary memory traffic of temporaries, will likely have a high impact on overall performance. 
### Memory Footprint -There are no guidelines here from the standard library as this is an extension API. However, we should always try to minimize memory footprint whenever possible. Minimizing memory footprint may also help us improve performance here because as mentioned above this will be very likely to be a memory bandwidth bound API. -In general, the normal case for histogram is for the number of elements in the input sequence to be far greater than the number of output histogram bins. We may be able to use that to our advantage. +There are no guidelines here from the standard library as this is an extension API. However, we should always try to minimize memory footprint whenever possible. Minimizing memory footprint may also help us improve performance here because, as mentioned above, this will very likely be a memory bandwidth-bound API. +In general, the normal case for histogram is for the number of elements in the input sequence to be far greater than the number of output histogram bins. We may be able to use that to our advantage. ### Code Reuse Our goal here is to make something maintainable and to reuse as much as we can which already exists and has been reviewed within oneDPL. With everything else, this must be balanced with performance considerations. ### unseq backend -As mentioned above histogram looks to be a memory bandwidth dependent algorithm. This may limit the benefit achievable from vector instructions as they provide assistance mostly in speeding up computation. Vector operations in this case also compound our issue of race conditions multiplying the number of concurrent lines of execution by the vector length. The advantage we get from vectorization of the increment operation or the lookup into the output histogram is may not provide much benefit especially when we account for extra memory footprint required or synchronization required to overcome the race conditions which we add from the additional concurrant streams of execution. It may make sense to decline to add vectorized operations within histogram depending on the implementation used, and based on performance results. +As mentioned above, histogram looks to be a memory bandwidth-dependent algorithm. This may limit the benefit achievable from vector instructions as they provide assistance mostly in speeding up computation. Vector operations in this case also compound our issue of race conditions, multiplying the number of concurrent lines of execution by the vector length. The advantage we get from vectorization of the increment operation or the lookup into the output histogram may not provide much benefit, especially when we account for the extra memory footprint required or synchronization required to overcome the race conditions which we add from the additional concurrent streams of execution. It may make sense to decline to add vectorized operations within histogram depending on the implementation used, and based on performance results. ## Existing patterns ### count_if -`histogram` is similar to `count_if` in that it is conditionally incrementing a number of counters based upon the data in a sequence. `count_if` returns a scalar typed value and doesn't provide any function To modify the variable being incremented. Using `count_if` without significant modification would require us to loop through the entire sequence for each output bin in the histogram. From a memory bandwidth perspective this is untenable. 
Similarly, using a `histogram` pattern to implement `count_if`, is unlikely to provide a well performing result in the end, as contention should be far higher, and `reduce` is a very well matched pattern performance-wise. +`histogram` is similar to `count_if` in that it conditionally increments a number of counters based upon the data in a sequence. `count_if` returns a scalar-typed value and doesn't provide any function to modify the variable being incremented. Using `count_if` without significant modification would require us to loop through the entire sequence for each output bin in the histogram. From a memory bandwidth perspective, this is untenable. Similarly, using a `histogram` pattern to implement `count_if` is unlikely to provide a well-performing result in the end, as contention should be far higher, and `reduce` is a very well-matched pattern performance-wise. ### parallel_for -`parallel_for` is an interesting pattern in that it is very generic and embarassingly parallel. This is close to what we need for `histogram`. However, we cannot simply use it without any added infrastructure. If we were to just use `parallel_for` alone, there would be a race condition between threads when incrementing the values in the output histogram. We should be able to use `parallel_for` as a building block for our implementation but it requires some way to synchronize and accumulate between threads. +`parallel_for` is an interesting pattern in that it is very generic and embarrassingly parallel. This is close to what we need for `histogram`. However, we cannot simply use it without any added infrastructure. If we were to just use `parallel_for` alone, there would be a race condition between threads when incrementing the values in the output histogram. We should be able to use `parallel_for` as a building block for our implementation, but it requires some way to synchronize and accumulate between threads. ## Proposal I believe there are two competing options for `histogram`, which may both have utility in the final implementation depending on the use case. -### Implementation One (Embarassingly Parallel) -This method uses temporary storage and a pair of embarassingly parallel `parallel_for` loops to accomplish the `histogram`. -1) Determine the number of threads that we will use, perhaps adding some method to do this generically based on back end. -2) Create temporary data for the number of threads minus one copies of the histogram output sequence. Thread zero can use the user provided output data. -3) Run a `parallel_for` pattern which performs a `histogram` on the input sequence where each thread accumulates into its own copy of the ouput sequence using the temporary storage to remove any race conditions. -4) Run a second `parallel_for` over the `histogram` output sequence sequence which accumulates all temporary copies of the histogram into the output histogram sequence. This step is also embarassingly parallel. -5) Deallocate temporary storage +### Implementation One (Embarrassingly Parallel) +This method uses temporary storage and a pair of embarrassingly parallel `parallel_for` loops to accomplish the `histogram`. +1) Determine the number of threads that we will use, perhaps adding some method to do this generically based on the backend. +2) Create temporary data for the number of threads minus one copy of the histogram output sequence. Thread zero can use the user-provided output data. 
+3) Run a `parallel_for` pattern which performs a `histogram` on the input sequence where each thread accumulates into its own copy of the output sequence using the temporary storage to remove any race conditions. +4) Run a second `parallel_for` over the `histogram` output sequence which accumulates all temporary copies of the histogram into the output histogram sequence. This step is also embarrassingly parallel. +5) Deallocate temporary storage. New machinery that will be required here is the ability to query how many threads will be used and also the machinery to check what thread the current execution is using within a brick. Ideally, these can be generic wrappers around the specific backends which would allow a unified implementation for all host backends. ### Implementation Two (Atomics) This method uses atomic operations to remove the race conditions during accumulation. With atomic increments of the output histogram data, we can merely run a `parallel_for` pattern. -To deal with atomics appropriately, we have some limitations. We must either use standard library atomics, atomics specific to a backend, or custom atomics specific to a compiler. `C++17` provides `std::atomic`, however this can only provide atomicicity for data which is created with atomics in mind. This means allocating temporary data and then copying to the output data. `C++20` provides `std::atomic_ref` which would allow us to wrap user provided output data in an atomic wrapper, but we cannot assume `C++17` for all users. We could look to implement our own `atomic_ref` for C++17, but that would require specialization for individual compilers. OpenMP provides atomic operations, but that is only available for the OpenMP backend. - -It remains to be seen if atomics are worth their overhead and contention from a performance perspective, and may depend on the different approaches available. +To deal with atomics appropriately, we have some limitations. We must either use standard library atomics, atomics specific to a backend, or custom atomics specific to a compiler. `C++17` provides `std::atomic`, however, this can only provide atomicity for data which is created with atomics in mind. This means allocating temporary data and then copying it to the output data. `C++20` provides `std::atomic_ref` which would allow us to wrap user-provided output data in an atomic wrapper, but we cannot assume `C++17` for all users. We could look to implement our own `atomic_ref` for C++17, but that would require specialization for individual compilers. OpenMP provides atomic operations, but that is only available for the OpenMP backend. +It remains to be seen if atomics are worth their overhead and contention from a performance perspective and may depend on the different approaches available. ## Selecting Between Algorithms -It may be the case that multiple aspects may provide an advantage to either algorithm one or two. Which `histogram` API has been called, `n`, number of output bins, and backend / atomic provider may all impact the performance trade-offs between these two approaches. My intention is to experiment with these and be open to a heuristic to choose one or the other based upon the circumstances if that is what the data suggests is best. The larger the number of output bins, the better atomics should do vs redundant copies of the output. - +It may be the case that multiple aspects may provide an advantage to either algorithm one or two. 
Which `histogram` API has been called, `n`, the number of output bins, and backend/atomic provider may all impact the performance trade-offs between these two approaches. My intention is to experiment with these and be open to a heuristic to choose one or the other based upon the circumstances if that is what the data suggests is best. The larger the number of output bins, the better atomics should do vs redundant copies of the output. ## Open Questions -* Would it be worthwhile to add our own implementation of atomic_ref for C++17? I believe this would require specializations for each of our supported compilers. +* Would it be worthwhile to add our own implementation of `atomic_ref` for C++17? I believe this would require specializations for each of our supported compilers. * What is the overhead of atomics in general in this case and does the overhead there make them inherently worse than merely having extra copies of the histogram and accumulating? -* Is it worthwhile to have separate implementations for tbb and openMP because they may differ in the best performing implementation? What is the best heuristic for selecting between algorithms (if one is not the clear winner)? +* Is it worthwhile to have separate implementations for TBB and OpenMP because they may differ in the best-performing implementation? What is the best heuristic for selecting between algorithms (if one is not the clear winner)? -* How will vectorized bricks perform, in what situations will it be advatageous to use or not use vector instructions? +* How will vectorized bricks perform, and in what situations will it be advantageous to use or not use vector instructions? From 10c4e50b915f8f070c99bfa91a48a91aee55df85 Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Fri, 1 Nov 2024 15:11:51 -0400 Subject: [PATCH 06/31] Minor improvements Signed-off-by: Dan Hoeflinger --- .../proposed/host_backend_histogram/README.md | 23 ++++++++----------- 1 file changed, 9 insertions(+), 14 deletions(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index 6106e2b843c..486d3be00c4 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -1,19 +1,17 @@ -# Host backends support for the histogram APIs +# Host Backends Support for the Histogram APIs ## Introduction -In version 2022.6.0, two `histogram` APIs were added to oneDPL, but implementations were only provided for device policies with the dpcpp backend. `Histogram` was added to the oneAPI specification 1.4 provisional release and should be present in the 1.4 specification. Please see the [oneAPI Specification](https://github.com/uxlfoundation/oneAPI-spec/blob/main/source/elements/oneDPL/source/parallel_api/algorithms.rst#parallel-algorithms) for a full definition of the semantics of the histogram APIs. In short, they take elements from an input sequence and classifies them into either evenly distributed or user-defined bins via a list of separating values and count the number of values in each bin, writing to a user-provided output histogram sequence. -Currently, `histogram` is not supported with serial, tbb, or openmp backends in our oneDPL implementation. This RFC aims to propose the implementation of `histogram` for these host-side backends. -The serial implementation is straightforward and is not worth discussing in much length here. We will add it, but there is not much to discuss within the RFC, as its implementation will be straightforward. 
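For reference, here is a small usage sketch of the two `histogram` overloads described above, written against the signatures in the oneAPI specification as I read them, and assuming the proposed host-side support so that the `par` policy can be used. The data values and bin counts are arbitrary.

```cpp
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/execution>
#include <cstdint>
#include <vector>

int main()
{
    std::vector<float> input = {0.5f, 1.5f, 1.7f, 2.3f, 3.4f, 9.9f};

    // Evenly distributed bins: 4 equal-width bins spanning [0, 10).
    std::vector<std::uint32_t> even_bins(4, 0);
    oneapi::dpl::histogram(oneapi::dpl::execution::par, input.begin(), input.end(),
                           4, 0.0f, 10.0f, even_bins.begin());

    // User-defined bins: boundaries {0, 1, 5, 10} define 3 bins of varying width.
    std::vector<float> boundaries = {0.0f, 1.0f, 5.0f, 10.0f};
    std::vector<std::uint32_t> custom_bins(3, 0);
    oneapi::dpl::histogram(oneapi::dpl::execution::par, input.begin(), input.end(),
                           boundaries.begin(), boundaries.end(), custom_bins.begin());
    return 0;
}
```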
+In version 2022.6.0, two `histogram` APIs were added to oneDPL, but implementations were only provided for device policies with the dpcpp backend. `Histogram` was added to the oneAPI specification 1.4 provisional release and should be present in the 1.4 specification. Please see the [oneAPI Specification](https://github.com/uxlfoundation/oneAPI-spec/blob/main/source/elements/oneDPL/source/parallel_api/algorithms.rst#parallel-algorithms) for a full definition of the semantics of the histogram APIs. In short, they take elements from an input sequence and classify them into either evenly distributed or user-defined bins via a list of separating values and count the number of values in each bin, writing to a user-provided output histogram sequence. Currently, `histogram` is not supported with serial, tbb, or openmp backends in our oneDPL implementation. This RFC aims to propose the implementation of `histogram` for these host-side backends. The serial implementation is straightforward and is not worth discussing in much length here. We will add it, but there is not much to discuss within the RFC, as its implementation will be straightforward. ## Motivations Users don't always want to use device policies and accelerators to run their code. It may make more sense in many cases to use a serial implementation or a host-side parallel implementation of `histogram`. It's natural for a user to expect that oneDPL supports these other backends for all APIs. Another motivation for adding the support is simply to be spec compliant with the oneAPI specification. -## Design considerations +## Design Considerations -### Key requirements +### Key Requirements Provide support for the `histogram` APIs with the following policies and backends: -Policies: `seq`, `unseq`, `par`, `par_unseq` -Backends: `serial`, `tbb`, `openmp` +- Policies: `seq`, `unseq`, `par`, `par_unseq` +- Backends: `serial`, `tbb`, `openmp` Users have a choice of execution policies when calling oneDPL APIs. They also have a number of options of backends which they can select from when using oneDPL. It is important that all combinations of these options have support for the `histogram` APIs. @@ -21,23 +19,20 @@ Users have a choice of execution policies when calling oneDPL APIs. They also ha As with all algorithms in oneDPL, our goal is to make them as performant as possible. By definition, `histogram` is a low computation algorithm which will likely be limited by memory bandwidth, especially for the evenly-divided case. Minimizing and optimizing memory accesses, as well as limiting unnecessary memory traffic of temporaries, will likely have a high impact on overall performance. ### Memory Footprint -There are no guidelines here from the standard library as this is an extension API. However, we should always try to minimize memory footprint whenever possible. Minimizing memory footprint may also help us improve performance here because, as mentioned above, this will very likely be a memory bandwidth-bound API. -In general, the normal case for histogram is for the number of elements in the input sequence to be far greater than the number of output histogram bins. We may be able to use that to our advantage. +There are no guidelines here from the standard library as this is an extension API. However, we should always try to minimize memory footprint whenever possible. Minimizing memory footprint may also help us improve performance here because, as mentioned above, this will very likely be a memory bandwidth-bound API. 
In general, the normal case for histogram is for the number of elements in the input sequence to be far greater than the number of output histogram bins. We may be able to use that to our advantage. ### Code Reuse Our goal here is to make something maintainable and to reuse as much as we can which already exists and has been reviewed within oneDPL. With everything else, this must be balanced with performance considerations. -### unseq backend +### unseq Backend As mentioned above, histogram looks to be a memory bandwidth-dependent algorithm. This may limit the benefit achievable from vector instructions as they provide assistance mostly in speeding up computation. Vector operations in this case also compound our issue of race conditions, multiplying the number of concurrent lines of execution by the vector length. The advantage we get from vectorization of the increment operation or the lookup into the output histogram may not provide much benefit, especially when we account for the extra memory footprint required or synchronization required to overcome the race conditions which we add from the additional concurrent streams of execution. It may make sense to decline to add vectorized operations within histogram depending on the implementation used, and based on performance results. -## Existing patterns +## Existing Patterns ### count_if - `histogram` is similar to `count_if` in that it conditionally increments a number of counters based upon the data in a sequence. `count_if` returns a scalar-typed value and doesn't provide any function to modify the variable being incremented. Using `count_if` without significant modification would require us to loop through the entire sequence for each output bin in the histogram. From a memory bandwidth perspective, this is untenable. Similarly, using a `histogram` pattern to implement `count_if` is unlikely to provide a well-performing result in the end, as contention should be far higher, and `reduce` is a very well-matched pattern performance-wise. ### parallel_for - `parallel_for` is an interesting pattern in that it is very generic and embarrassingly parallel. This is close to what we need for `histogram`. However, we cannot simply use it without any added infrastructure. If we were to just use `parallel_for` alone, there would be a race condition between threads when incrementing the values in the output histogram. We should be able to use `parallel_for` as a building block for our implementation, but it requires some way to synchronize and accumulate between threads. ## Proposal From efa7c9b09067b421a0d6705f98cbb25c99cb99ec Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Fri, 1 Nov 2024 15:13:29 -0400 Subject: [PATCH 07/31] subsection Signed-off-by: Dan Hoeflinger --- rfcs/proposed/host_backend_histogram/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index 486d3be00c4..d44d9df2842 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -55,7 +55,7 @@ To deal with atomics appropriately, we have some limitations. We must either use It remains to be seen if atomics are worth their overhead and contention from a performance perspective and may depend on the different approaches available. -## Selecting Between Algorithms +### Selecting Between Algorithms It may be the case that multiple aspects may provide an advantage to either algorithm one or two. 
Which `histogram` API has been called, `n`, the number of output bins, and backend/atomic provider may all impact the performance trade-offs between these two approaches. My intention is to experiment with these and be open to a heuristic to choose one or the other based upon the circumstances if that is what the data suggests is best. The larger the number of output bins, the better atomics should do vs redundant copies of the output. ## Open Questions From 1ac82fd8ae5513ede6f58edafd2fb60b56a24a2f Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Fri, 1 Nov 2024 15:23:00 -0400 Subject: [PATCH 08/31] Adding some alternative approaches Signed-off-by: Dan Hoeflinger --- rfcs/proposed/host_backend_histogram/README.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index d44d9df2842..d8b696c0126 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -58,6 +58,11 @@ It remains to be seen if atomics are worth their overhead and contention from a ### Selecting Between Algorithms It may be the case that multiple aspects may provide an advantage to either algorithm one or two. Which `histogram` API has been called, `n`, the number of output bins, and backend/atomic provider may all impact the performance trade-offs between these two approaches. My intention is to experiment with these and be open to a heuristic to choose one or the other based upon the circumstances if that is what the data suggests is best. The larger the number of output bins, the better atomics should do vs redundant copies of the output. +## Alternative Approaches +* One could consider some sort of locking approach either which locks mutexes for subsections of the output histogram prior to modifying them. Its possible such an approach could provide a similar approach to atomics, but with different overhead tradeoffs. It seems quite likely that this would result in more overhead, but it could be worth exploring. + +* Another possible approach could be to do something like proposed implementation one, but with some sparse representation of output data. However, I think the general assumptions we can make about the normal case make this less likely to be beneficial. It is quite likely that `n` is much larger than the output histograms, and that a large percentage of the output histogram may be occupied, even when considering dividing the input amongst multiple threads. This could be explored if we find temporary storage is too large for some cases and the atomic approach does not provide a good fallback. + ## Open Questions * Would it be worthwhile to add our own implementation of `atomic_ref` for C++17? I believe this would require specializations for each of our supported compilers. 
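To make the atomic flavor discussed above more concrete, here is a minimal sketch of what an atomics-based brick could look like for the evenly-divided case, using OpenMP atomics. This is illustrative only, not the proposed oneDPL code; `histogram_even_atomic` and its interface are invented for the example.

```cpp
// Illustrative sketch of Implementation Two (atomics), evenly-divided case, OpenMP flavor.
#include <algorithm>
#include <cstddef>
#include <cstdint>

void histogram_even_atomic(const float* in, std::size_t n, float min, float max,
                           std::uint32_t* bins, std::size_t num_bins)
{
    std::fill(bins, bins + num_bins, 0u);
    const float scale = static_cast<float>(num_bins) / (max - min);
#pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i)
    {
        if (in[i] >= min && in[i] < max)
        {
            // Clamp guards against floating-point rounding producing bin == num_bins.
            const std::size_t bin =
                std::min(static_cast<std::size_t>((in[i] - min) * scale), num_bins - 1);
#pragma omp atomic update
            ++bins[bin];
        }
    }
}
```

With C++20, the `#pragma omp atomic update` could be replaced by `std::atomic_ref<std::uint32_t>(bins[bin]).fetch_add(1, std::memory_order_relaxed)`, which is what makes `std::atomic_ref` attractive in the open question above.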
From 02523c46c11c0c285286675725fdfc7c27467186 Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Fri, 1 Nov 2024 15:25:41 -0400 Subject: [PATCH 09/31] minor improvements Signed-off-by: Dan Hoeflinger --- rfcs/proposed/host_backend_histogram/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index d8b696c0126..874c61093d9 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -59,9 +59,9 @@ It remains to be seen if atomics are worth their overhead and contention from a It may be the case that multiple aspects may provide an advantage to either algorithm one or two. Which `histogram` API has been called, `n`, the number of output bins, and backend/atomic provider may all impact the performance trade-offs between these two approaches. My intention is to experiment with these and be open to a heuristic to choose one or the other based upon the circumstances if that is what the data suggests is best. The larger the number of output bins, the better atomics should do vs redundant copies of the output. ## Alternative Approaches -* One could consider some sort of locking approach either which locks mutexes for subsections of the output histogram prior to modifying them. Its possible such an approach could provide a similar approach to atomics, but with different overhead tradeoffs. It seems quite likely that this would result in more overhead, but it could be worth exploring. +* One could consider some sort of locking approach which locks mutexes for subsections of the output histogram prior to modifying them. It's possible such an approach could provide a similar approach to atomics, but with different overhead tradeoffs. It seems quite likely that this would result in more overhead, but it could be worth exploring. -* Another possible approach could be to do something like proposed implementation one, but with some sparse representation of output data. However, I think the general assumptions we can make about the normal case make this less likely to be beneficial. It is quite likely that `n` is much larger than the output histograms, and that a large percentage of the output histogram may be occupied, even when considering dividing the input amongst multiple threads. This could be explored if we find temporary storage is too large for some cases and the atomic approach does not provide a good fallback. +* Another possible approach could be to do something like the proposed implementation one, but with some sparse representation of output data. However, I think the general assumptions we can make about the normal case make this less likely to be beneficial. It is quite likely that `n` is much larger than the output histograms, and that a large percentage of the output histogram may be occupied, even when considering dividing the input amongst multiple threads. This could be explored if we find temporary storage is too large for some cases and the atomic approach does not provide a good fallback. ## Open Questions * Would it be worthwhile to add our own implementation of `atomic_ref` for C++17? I believe this would require specializations for each of our supported compilers. 
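As a point of reference for the open question about `atomic_ref`, the following hedged sketch (not oneDPL code) shows what an in-place atomic bin increment could look like: `std::atomic_ref` when C++20 is available, with a GCC/Clang builtin as one possible pre-C++20 fallback. Other compilers would need their own intrinsics, which is exactly the specialization burden raised above.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Illustrative only: increment one bin of a user-provided histogram in place.
inline void atomic_increment_bin(std::uint32_t* bins, std::size_t bin)
{
#if defined(__cpp_lib_atomic_ref)
    // C++20: wrap the user's memory directly, no temporary atomic array or copy needed.
    std::atomic_ref<std::uint32_t>(bins[bin]).fetch_add(1u, std::memory_order_relaxed);
#elif defined(__GNUC__) || defined(__clang__)
    // Pre-C++20 fallback sketch using a compiler builtin; a real implementation
    // would need an equivalent for every supported compiler.
    __atomic_fetch_add(&bins[bin], 1u, __ATOMIC_RELAXED);
#else
#   error "No atomic increment available in this sketch"
#endif
}
```

This avoids extra temporary storage entirely, at the cost of contention when many input elements map to the same bins.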
From ac7b654e994c2016a0b455955c19cd6b426dc07f Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Mon, 4 Nov 2024 10:04:51 -0500 Subject: [PATCH 10/31] line widths Signed-off-by: Dan Hoeflinger --- .../proposed/host_backend_histogram/README.md | 126 ++++++++++++++---- 1 file changed, 98 insertions(+), 28 deletions(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index 874c61093d9..16c79883761 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -1,10 +1,23 @@ # Host Backends Support for the Histogram APIs ## Introduction -In version 2022.6.0, two `histogram` APIs were added to oneDPL, but implementations were only provided for device policies with the dpcpp backend. `Histogram` was added to the oneAPI specification 1.4 provisional release and should be present in the 1.4 specification. Please see the [oneAPI Specification](https://github.com/uxlfoundation/oneAPI-spec/blob/main/source/elements/oneDPL/source/parallel_api/algorithms.rst#parallel-algorithms) for a full definition of the semantics of the histogram APIs. In short, they take elements from an input sequence and classify them into either evenly distributed or user-defined bins via a list of separating values and count the number of values in each bin, writing to a user-provided output histogram sequence. Currently, `histogram` is not supported with serial, tbb, or openmp backends in our oneDPL implementation. This RFC aims to propose the implementation of `histogram` for these host-side backends. The serial implementation is straightforward and is not worth discussing in much length here. We will add it, but there is not much to discuss within the RFC, as its implementation will be straightforward. +In version 2022.6.0, two `histogram` APIs were added to oneDPL, but implementations were only provided for device +policies with the dpcpp backend. `Histogram` was added to the oneAPI specification 1.4 provisional release and should +be present in the 1.4 specification. Please see the +[oneAPI Specification](https://github.com/uxlfoundation/oneAPI-spec/blob/main/source/elements/oneDPL/source/parallel_api/algorithms.rst#parallel-algorithms) +for a full definition of the semantics of the histogram APIs. In short, they take elements from an input sequence and +classify them into either evenly distributed or user-defined bins via a list of separating values and count the number +of values in each bin, writing to a user-provided output histogram sequence. Currently, `histogram` is not supported +with serial, tbb, or openmp backends in our oneDPL implementation. This RFC aims to propose the implementation of +`histogram` for these host-side backends. The serial implementation is straightforward and is not worth discussing in +much length here. We will add it, but there is not much to discuss within the RFC, as its implementation will be +straightforward. ## Motivations -Users don't always want to use device policies and accelerators to run their code. It may make more sense in many cases to use a serial implementation or a host-side parallel implementation of `histogram`. It's natural for a user to expect that oneDPL supports these other backends for all APIs. Another motivation for adding the support is simply to be spec compliant with the oneAPI specification. +Users don't always want to use device policies and accelerators to run their code. 
It may make more sense in many cases +to use a serial implementation or a host-side parallel implementation of `histogram`. It's natural for a user to expect +that oneDPL supports these other backends for all APIs. Another motivation for adding the support is simply to be spec +compliant with the oneAPI specification. ## Design Considerations @@ -13,61 +26,118 @@ Provide support for the `histogram` APIs with the following policies and backend - Policies: `seq`, `unseq`, `par`, `par_unseq` - Backends: `serial`, `tbb`, `openmp` -Users have a choice of execution policies when calling oneDPL APIs. They also have a number of options of backends which they can select from when using oneDPL. It is important that all combinations of these options have support for the `histogram` APIs. +Users have a choice of execution policies when calling oneDPL APIs. They also have a number of options of backends +which they can select from when using oneDPL. It is important that all combinations of these options have support for +the `histogram` APIs. ### Performance -As with all algorithms in oneDPL, our goal is to make them as performant as possible. By definition, `histogram` is a low computation algorithm which will likely be limited by memory bandwidth, especially for the evenly-divided case. Minimizing and optimizing memory accesses, as well as limiting unnecessary memory traffic of temporaries, will likely have a high impact on overall performance. +As with all algorithms in oneDPL, our goal is to make them as performant as possible. By definition, `histogram` is a +low computation algorithm which will likely be limited by memory bandwidth, especially for the evenly-divided case. +Minimizing and optimizing memory accesses, as well as limiting unnecessary memory traffic of temporaries, will likely +have a high impact on overall performance. ### Memory Footprint -There are no guidelines here from the standard library as this is an extension API. However, we should always try to minimize memory footprint whenever possible. Minimizing memory footprint may also help us improve performance here because, as mentioned above, this will very likely be a memory bandwidth-bound API. In general, the normal case for histogram is for the number of elements in the input sequence to be far greater than the number of output histogram bins. We may be able to use that to our advantage. +There are no guidelines here from the standard library as this is an extension API. However, we should always try to +minimize memory footprint whenever possible. Minimizing memory footprint may also help us improve performance here +because, as mentioned above, this will very likely be a memory bandwidth-bound API. In general, the normal case for +histogram is for the number of elements in the input sequence to be far greater than the number of output histogram +bins. We may be able to use that to our advantage. ### Code Reuse -Our goal here is to make something maintainable and to reuse as much as we can which already exists and has been reviewed within oneDPL. With everything else, this must be balanced with performance considerations. +Our goal here is to make something maintainable and to reuse as much as we can which already exists and has been +reviewed within oneDPL. With everything else, this must be balanced with performance considerations. ### unseq Backend -As mentioned above, histogram looks to be a memory bandwidth-dependent algorithm. 
This may limit the benefit achievable from vector instructions as they provide assistance mostly in speeding up computation. Vector operations in this case also compound our issue of race conditions, multiplying the number of concurrent lines of execution by the vector length. The advantage we get from vectorization of the increment operation or the lookup into the output histogram may not provide much benefit, especially when we account for the extra memory footprint required or synchronization required to overcome the race conditions which we add from the additional concurrent streams of execution. It may make sense to decline to add vectorized operations within histogram depending on the implementation used, and based on performance results. +As mentioned above, histogram looks to be a memory bandwidth-dependent algorithm. This may limit the benefit achievable +from vector instructions as they provide assistance mostly in speeding up computation. Vector operations in this case +also compound our issue of race conditions, multiplying the number of concurrent lines of execution by the vector +length. The advantage we get from vectorization of the increment operation or the lookup into the output histogram may +not provide much benefit, especially when we account for the extra memory footprint required or synchronization +required to overcome the race conditions which we add from the additional concurrent streams of execution. It may make +sense to decline to add vectorized operations within histogram depending on the implementation used, and based on +performance results. ## Existing Patterns ### count_if -`histogram` is similar to `count_if` in that it conditionally increments a number of counters based upon the data in a sequence. `count_if` returns a scalar-typed value and doesn't provide any function to modify the variable being incremented. Using `count_if` without significant modification would require us to loop through the entire sequence for each output bin in the histogram. From a memory bandwidth perspective, this is untenable. Similarly, using a `histogram` pattern to implement `count_if` is unlikely to provide a well-performing result in the end, as contention should be far higher, and `reduce` is a very well-matched pattern performance-wise. +`histogram` is similar to `count_if` in that it conditionally increments a number of counters based upon the data in a +sequence. `count_if` returns a scalar-typed value and doesn't provide any function to modify the variable being +incremented. Using `count_if` without significant modification would require us to loop through the entire sequence for +each output bin in the histogram. From a memory bandwidth perspective, this is untenable. Similarly, using a +`histogram` pattern to implement `count_if` is unlikely to provide a well-performing result in the end, as contention +should be far higher, and `reduce` is a very well-matched pattern performance-wise. ### parallel_for -`parallel_for` is an interesting pattern in that it is very generic and embarrassingly parallel. This is close to what we need for `histogram`. However, we cannot simply use it without any added infrastructure. If we were to just use `parallel_for` alone, there would be a race condition between threads when incrementing the values in the output histogram. We should be able to use `parallel_for` as a building block for our implementation, but it requires some way to synchronize and accumulate between threads. 
+`parallel_for` is an interesting pattern in that it is very generic and embarrassingly parallel. This is close to what +we need for `histogram`. However, we cannot simply use it without any added infrastructure. If we were to just use +`parallel_for` alone, there would be a race condition between threads when incrementing the values in the output +histogram. We should be able to use `parallel_for` as a building block for our implementation, but it requires some way +to synchronize and accumulate between threads. ## Proposal -I believe there are two competing options for `histogram`, which may both have utility in the final implementation depending on the use case. +I believe there are two competing options for `histogram`, which may both have utility in the final implementation +depending on the use case. ### Implementation One (Embarrassingly Parallel) -This method uses temporary storage and a pair of embarrassingly parallel `parallel_for` loops to accomplish the `histogram`. -1) Determine the number of threads that we will use, perhaps adding some method to do this generically based on the backend. -2) Create temporary data for the number of threads minus one copy of the histogram output sequence. Thread zero can use the user-provided output data. -3) Run a `parallel_for` pattern which performs a `histogram` on the input sequence where each thread accumulates into its own copy of the output sequence using the temporary storage to remove any race conditions. -4) Run a second `parallel_for` over the `histogram` output sequence which accumulates all temporary copies of the histogram into the output histogram sequence. This step is also embarrassingly parallel. -5) Deallocate temporary storage. - -New machinery that will be required here is the ability to query how many threads will be used and also the machinery to check what thread the current execution is using within a brick. Ideally, these can be generic wrappers around the specific backends which would allow a unified implementation for all host backends. +This method uses temporary storage and a pair of embarrassingly parallel `parallel_for` loops to accomplish the +`histogram`. +1) Determine the number of threads that we will use, perhaps adding some method to do this generically based on the +2) backend. +3) Create temporary data for the number of threads minus one copy of the histogram output sequence. Thread zero can +4) use the user-provided output data. +5) Run a `parallel_for` pattern which performs a `histogram` on the input sequence where each thread accumulates into +6) its own copy of the output sequence using the temporary storage to remove any race conditions. +7) Run a second `parallel_for` over the `histogram` output sequence which accumulates all temporary copies of the +8) histogram into the output histogram sequence. This step is also embarrassingly parallel. +9) Deallocate temporary storage. + +New machinery that will be required here is the ability to query how many threads will be used and also the machinery +to check what thread the current execution is using within a brick. Ideally, these can be generic wrappers around the +specific backends which would allow a unified implementation for all host backends. ### Implementation Two (Atomics) -This method uses atomic operations to remove the race conditions during accumulation. With atomic increments of the output histogram data, we can merely run a `parallel_for` pattern. +This method uses atomic operations to remove the race conditions during accumulation. 
With atomic increments of the +output histogram data, we can merely run a `parallel_for` pattern. -To deal with atomics appropriately, we have some limitations. We must either use standard library atomics, atomics specific to a backend, or custom atomics specific to a compiler. `C++17` provides `std::atomic`, however, this can only provide atomicity for data which is created with atomics in mind. This means allocating temporary data and then copying it to the output data. `C++20` provides `std::atomic_ref` which would allow us to wrap user-provided output data in an atomic wrapper, but we cannot assume `C++17` for all users. We could look to implement our own `atomic_ref` for C++17, but that would require specialization for individual compilers. OpenMP provides atomic operations, but that is only available for the OpenMP backend. +To deal with atomics appropriately, we have some limitations. We must either use standard library atomics, atomics +specific to a backend, or custom atomics specific to a compiler. `C++17` provides `std::atomic`, however, this can +only provide atomicity for data which is created with atomics in mind. This means allocating temporary data and then +copying it to the output data. `C++20` provides `std::atomic_ref` which would allow us to wrap user-provided output +data in an atomic wrapper, but we cannot assume `C++17` for all users. We could look to implement our own +`atomic_ref` for C++17, but that would require specialization for individual compilers. OpenMP provides atomic +operations, but that is only available for the OpenMP backend. -It remains to be seen if atomics are worth their overhead and contention from a performance perspective and may depend on the different approaches available. +It remains to be seen if atomics are worth their overhead and contention from a performance perspective and may depend +on the different approaches available. ### Selecting Between Algorithms -It may be the case that multiple aspects may provide an advantage to either algorithm one or two. Which `histogram` API has been called, `n`, the number of output bins, and backend/atomic provider may all impact the performance trade-offs between these two approaches. My intention is to experiment with these and be open to a heuristic to choose one or the other based upon the circumstances if that is what the data suggests is best. The larger the number of output bins, the better atomics should do vs redundant copies of the output. +It may be the case that multiple aspects may provide an advantage to either algorithm one or two. Which `histogram` API +has been called, `n`, the number of output bins, and backend/atomic provider may all impact the performance trade-offs +between these two approaches. My intention is to experiment with these and be open to a heuristic to choose one or the +other based upon the circumstances if that is what the data suggests is best. The larger the number of output bins, the +better atomics should do vs redundant copies of the output. ## Alternative Approaches -* One could consider some sort of locking approach which locks mutexes for subsections of the output histogram prior to modifying them. It's possible such an approach could provide a similar approach to atomics, but with different overhead tradeoffs. It seems quite likely that this would result in more overhead, but it could be worth exploring. +* One could consider some sort of locking approach which locks mutexes for subsections of the output histogram prior to +* modifying them. 
It's possible such an approach could provide a similar approach to atomics, but with different +* overhead tradeoffs. It seems quite likely that this would result in more overhead, but it could be worth exploring. -* Another possible approach could be to do something like the proposed implementation one, but with some sparse representation of output data. However, I think the general assumptions we can make about the normal case make this less likely to be beneficial. It is quite likely that `n` is much larger than the output histograms, and that a large percentage of the output histogram may be occupied, even when considering dividing the input amongst multiple threads. This could be explored if we find temporary storage is too large for some cases and the atomic approach does not provide a good fallback. +* Another possible approach could be to do something like the proposed implementation one, but with some sparse +* representation of output data. However, I think the general assumptions we can make about the normal case make this +* less likely to be beneficial. It is quite likely that `n` is much larger than the output histograms, and that a large +* percentage of the output histogram may be occupied, even when considering dividing the input amongst multiple +* threads. This could be explored if we find temporary storage is too large for some cases and the atomic approach +* does not provide a good fallback. ## Open Questions -* Would it be worthwhile to add our own implementation of `atomic_ref` for C++17? I believe this would require specializations for each of our supported compilers. +* Would it be worthwhile to add our own implementation of `atomic_ref` for C++17? I believe this would require +* specializations for each of our supported compilers. -* What is the overhead of atomics in general in this case and does the overhead there make them inherently worse than merely having extra copies of the histogram and accumulating? +* What is the overhead of atomics in general in this case and does the overhead there make them inherently worse than +* merely having extra copies of the histogram and accumulating? -* Is it worthwhile to have separate implementations for TBB and OpenMP because they may differ in the best-performing implementation? What is the best heuristic for selecting between algorithms (if one is not the clear winner)? +* Is it worthwhile to have separate implementations for TBB and OpenMP because they may differ in the best-performing +* implementation? What is the best heuristic for selecting between algorithms (if one is not the clear winner)? -* How will vectorized bricks perform, and in what situations will it be advantageous to use or not use vector instructions? +* How will vectorized bricks perform, and in what situations will it be advantageous to use or not use vector +* instructions? From 506fb623947519d4490a0b604d8f57f14700ee72 Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Wed, 6 Nov 2024 10:08:54 -0500 Subject: [PATCH 11/31] fixing numbering. Signed-off-by: Dan Hoeflinger --- rfcs/proposed/host_backend_histogram/README.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index 16c79883761..62c408cb9cd 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -82,14 +82,14 @@ depending on the use case. 
This method uses temporary storage and a pair of embarrassingly parallel `parallel_for` loops to accomplish the `histogram`. 1) Determine the number of threads that we will use, perhaps adding some method to do this generically based on the -2) backend. -3) Create temporary data for the number of threads minus one copy of the histogram output sequence. Thread zero can -4) use the user-provided output data. -5) Run a `parallel_for` pattern which performs a `histogram` on the input sequence where each thread accumulates into -6) its own copy of the output sequence using the temporary storage to remove any race conditions. -7) Run a second `parallel_for` over the `histogram` output sequence which accumulates all temporary copies of the -8) histogram into the output histogram sequence. This step is also embarrassingly parallel. -9) Deallocate temporary storage. + backend. +2) Create temporary data for the number of threads minus one copy of the histogram output sequence. Thread zero can + use the user-provided output data. +3) Run a `parallel_for` pattern which performs a `histogram` on the input sequence where each thread accumulates into + its own copy of the output sequence using the temporary storage to remove any race conditions. +4) Run a second `parallel_for` over the `histogram` output sequence which accumulates all temporary copies of the + histogram into the output histogram sequence. This step is also embarrassingly parallel. +5) Deallocate temporary storage. New machinery that will be required here is the ability to query how many threads will be used and also the machinery to check what thread the current execution is using within a brick. Ideally, these can be generic wrappers around the From 1c6cb47d8ced3c78bf35690cf90ea9212d408850 Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Wed, 6 Nov 2024 10:18:17 -0500 Subject: [PATCH 12/31] putting in specifics for TBB / OpenMP more formatting fixes Signed-off-by: Dan Hoeflinger --- .../proposed/host_backend_histogram/README.md | 39 +++++++++++-------- 1 file changed, 23 insertions(+), 16 deletions(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index 62c408cb9cd..a1bfd28edce 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -81,8 +81,9 @@ depending on the use case. ### Implementation One (Embarrassingly Parallel) This method uses temporary storage and a pair of embarrassingly parallel `parallel_for` loops to accomplish the `histogram`. -1) Determine the number of threads that we will use, perhaps adding some method to do this generically based on the - backend. + +#### OpenMP: +1) Determine the number of threads that we will use locally 2) Create temporary data for the number of threads minus one copy of the histogram output sequence. Thread zero can use the user-provided output data. 3) Run a `parallel_for` pattern which performs a `histogram` on the input sequence where each thread accumulates into @@ -91,9 +92,15 @@ This method uses temporary storage and a pair of embarrassingly parallel `parall histogram into the output histogram sequence. This step is also embarrassingly parallel. 5) Deallocate temporary storage. -New machinery that will be required here is the ability to query how many threads will be used and also the machinery -to check what thread the current execution is using within a brick. 
Ideally, these can be generic wrappers around the -specific backends which would allow a unified implementation for all host backends. +#### TBB +For TBB, we can do something similar, but we can use `enumerable_thread_specific` and its member function, `local()` to +provide a lazy allocation of thread local management, which does not require querying the number of threads or getting +the index. This allows us to operate in a compose-able manner while keeping the same conceptual implementation. +1) Embarassingly parallel accumulation to thread local storage +2) Embarassingly parallel aggregate to output data + +I believe the challenge here may be to properly provide the heuristics to choose between this implementation and the +other implementation. However, we should be able to have some reasonable division. ### Implementation Two (Atomics) This method uses atomic operations to remove the race conditions during accumulation. With atomic increments of the @@ -119,25 +126,25 @@ better atomics should do vs redundant copies of the output. ## Alternative Approaches * One could consider some sort of locking approach which locks mutexes for subsections of the output histogram prior to -* modifying them. It's possible such an approach could provide a similar approach to atomics, but with different -* overhead tradeoffs. It seems quite likely that this would result in more overhead, but it could be worth exploring. + modifying them. It's possible such an approach could provide a similar approach to atomics, but with different + overhead tradeoffs. It seems quite likely that this would result in more overhead, but it could be worth exploring. * Another possible approach could be to do something like the proposed implementation one, but with some sparse -* representation of output data. However, I think the general assumptions we can make about the normal case make this -* less likely to be beneficial. It is quite likely that `n` is much larger than the output histograms, and that a large -* percentage of the output histogram may be occupied, even when considering dividing the input amongst multiple -* threads. This could be explored if we find temporary storage is too large for some cases and the atomic approach -* does not provide a good fallback. + representation of output data. However, I think the general assumptions we can make about the normal case make this + less likely to be beneficial. It is quite likely that `n` is much larger than the output histograms, and that a large + percentage of the output histogram may be occupied, even when considering dividing the input amongst multiple + threads. This could be explored if we find temporary storage is too large for some cases and the atomic approach + does not provide a good fallback. ## Open Questions * Would it be worthwhile to add our own implementation of `atomic_ref` for C++17? I believe this would require -* specializations for each of our supported compilers. + specializations for each of our supported compilers. * What is the overhead of atomics in general in this case and does the overhead there make them inherently worse than -* merely having extra copies of the histogram and accumulating? + merely having extra copies of the histogram and accumulating? * Is it worthwhile to have separate implementations for TBB and OpenMP because they may differ in the best-performing -* implementation? What is the best heuristic for selecting between algorithms (if one is not the clear winner)? + implementation? 
What is the best heuristic for selecting between algorithms (if one is not the clear winner)? * How will vectorized bricks perform, and in what situations will it be advantageous to use or not use vector -* instructions? + instructions? From ceee3e38a864eb06de07073dbb354d18555ef98e Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Tue, 12 Nov 2024 09:13:15 -0500 Subject: [PATCH 13/31] Update Atomic strategy Signed-off-by: Dan Hoeflinger --- rfcs/proposed/host_backend_histogram/README.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index a1bfd28edce..c328097e138 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -110,9 +110,11 @@ To deal with atomics appropriately, we have some limitations. We must either use specific to a backend, or custom atomics specific to a compiler. `C++17` provides `std::atomic`, however, this can only provide atomicity for data which is created with atomics in mind. This means allocating temporary data and then copying it to the output data. `C++20` provides `std::atomic_ref` which would allow us to wrap user-provided output -data in an atomic wrapper, but we cannot assume `C++17` for all users. We could look to implement our own -`atomic_ref` for C++17, but that would require specialization for individual compilers. OpenMP provides atomic -operations, but that is only available for the OpenMP backend. +data in an atomic wrapper, but we cannot assume `C++17` for all users. OpenMP provides atomic +operations, but that is only available for the OpenMP backend. The working plan is to implement a macro like +`_ONEDPL_ATOMIC_INCREMENT(var)` which uses an `std::atomic_ref` if available , and alternatively uses compiler builtins +like `InterlockedAdd` or `__atomic_fetch_add_n`. It needs to be investigated if we need to have any version which +needs to turn off the atomic implementation, due to lack of support by the compiler (I think this is unlikely). It remains to be seen if atomics are worth their overhead and contention from a performance perspective and may depend on the different approaches available. From 0711090d7a18ac6a7cd8ca10b941e02b6963b0c7 Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Tue, 12 Nov 2024 09:19:03 -0500 Subject: [PATCH 14/31] more clarity about serial backend and policy Signed-off-by: Dan Hoeflinger --- rfcs/proposed/host_backend_histogram/README.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index c328097e138..3bce013ecb8 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -9,9 +9,7 @@ for a full definition of the semantics of the histogram APIs. In short, they tak classify them into either evenly distributed or user-defined bins via a list of separating values and count the number of values in each bin, writing to a user-provided output histogram sequence. Currently, `histogram` is not supported with serial, tbb, or openmp backends in our oneDPL implementation. This RFC aims to propose the implementation of -`histogram` for these host-side backends. The serial implementation is straightforward and is not worth discussing in -much length here. We will add it, but there is not much to discuss within the RFC, as its implementation will be -straightforward. 
+`histogram` for these host-side backends. ## Motivations Users don't always want to use device policies and accelerators to run their code. It may make more sense in many cases @@ -57,6 +55,10 @@ required to overcome the race conditions which we add from the additional concur sense to decline to add vectorized operations within histogram depending on the implementation used, and based on performance results. +### Serial Backend +We plan to support a serial backend for histogram APIs in addition to openMP and TBB. This backend will handle all +policies types, but always provide a serial unvectorized implementation. + ## Existing Patterns ### count_if @@ -139,8 +141,6 @@ better atomics should do vs redundant copies of the output. does not provide a good fallback. ## Open Questions -* Would it be worthwhile to add our own implementation of `atomic_ref` for C++17? I believe this would require - specializations for each of our supported compilers. * What is the overhead of atomics in general in this case and does the overhead there make them inherently worse than merely having extra copies of the histogram and accumulating? From 3c5ad1282d37e39f7e4e4c446f24a7247f52d305 Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Tue, 12 Nov 2024 09:24:51 -0500 Subject: [PATCH 15/31] minor corrections Signed-off-by: Dan Hoeflinger --- rfcs/proposed/host_backend_histogram/README.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index 3bce013ecb8..c7b163a6b30 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -97,9 +97,9 @@ This method uses temporary storage and a pair of embarrassingly parallel `parall #### TBB For TBB, we can do something similar, but we can use `enumerable_thread_specific` and its member function, `local()` to provide a lazy allocation of thread local management, which does not require querying the number of threads or getting -the index. This allows us to operate in a compose-able manner while keeping the same conceptual implementation. -1) Embarassingly parallel accumulation to thread local storage -2) Embarassingly parallel aggregate to output data +the index. This allows us to operate in a composable manner while keeping the same conceptual implementation. +1) Embarrassingly parallel accumulation to thread local storage +2) Embarrassingly parallel aggregate to output data I believe the challenge here may be to properly provide the heuristics to choose between this implementation and the other implementation. However, we should be able to have some reasonable division. @@ -114,7 +114,7 @@ only provide atomicity for data which is created with atomics in mind. This mean copying it to the output data. `C++20` provides `std::atomic_ref` which would allow us to wrap user-provided output data in an atomic wrapper, but we cannot assume `C++17` for all users. OpenMP provides atomic operations, but that is only available for the OpenMP backend. The working plan is to implement a macro like -`_ONEDPL_ATOMIC_INCREMENT(var)` which uses an `std::atomic_ref` if available , and alternatively uses compiler builtins +`_ONEDPL_ATOMIC_INCREMENT(var)` which uses an `std::atomic_ref` if available, and alternatively uses compiler builtins like `InterlockedAdd` or `__atomic_fetch_add_n`. 
It needs to be investigated if we need to have any version which needs to turn off the atomic implementation, due to lack of support by the compiler (I think this is unlikely). @@ -131,7 +131,7 @@ better atomics should do vs redundant copies of the output. ## Alternative Approaches * One could consider some sort of locking approach which locks mutexes for subsections of the output histogram prior to modifying them. It's possible such an approach could provide a similar approach to atomics, but with different - overhead tradeoffs. It seems quite likely that this would result in more overhead, but it could be worth exploring. + overhead trade-offs. It seems quite likely that this would result in more overhead, but it could be worth exploring. * Another possible approach could be to do something like the proposed implementation one, but with some sparse representation of output data. However, I think the general assumptions we can make about the normal case make this From 06a734f55f3014fe91ca3762dd0a9e0a50340614 Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Wed, 13 Nov 2024 15:12:37 -0500 Subject: [PATCH 16/31] c++17 -> c++20 fix Signed-off-by: Dan Hoeflinger --- rfcs/proposed/host_backend_histogram/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index c7b163a6b30..30c62019665 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -112,7 +112,7 @@ To deal with atomics appropriately, we have some limitations. We must either use specific to a backend, or custom atomics specific to a compiler. `C++17` provides `std::atomic`, however, this can only provide atomicity for data which is created with atomics in mind. This means allocating temporary data and then copying it to the output data. `C++20` provides `std::atomic_ref` which would allow us to wrap user-provided output -data in an atomic wrapper, but we cannot assume `C++17` for all users. OpenMP provides atomic +data in an atomic wrapper, but we cannot assume `C++20` for all users. OpenMP provides atomic operations, but that is only available for the OpenMP backend. The working plan is to implement a macro like `_ONEDPL_ATOMIC_INCREMENT(var)` which uses an `std::atomic_ref` if available, and alternatively uses compiler builtins like `InterlockedAdd` or `__atomic_fetch_add_n`. It needs to be investigated if we need to have any version which From b858a0ee8a14414380c167c81341e9e0ed56b345 Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Mon, 16 Dec 2024 17:10:29 -0500 Subject: [PATCH 17/31] Updates after some experimentation and thought SIMD + implementation Signed-off-by: Dan Hoeflinger --- .../proposed/host_backend_histogram/README.md | 125 ++++++++++-------- 1 file changed, 69 insertions(+), 56 deletions(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index 30c62019665..13439650ec4 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -46,14 +46,31 @@ Our goal here is to make something maintainable and to reuse as much as we can w reviewed within oneDPL. With everything else, this must be balanced with performance considerations. ### unseq Backend +Currently oneDPL relies upon openMP SIMD to provide its vectorization, which is designed to provide vectorization across +loop iterations. 
OneDPL does not directly use any intrinsics which may offer more complex functionality than what is +provided by OpenMP. + As mentioned above, histogram looks to be a memory bandwidth-dependent algorithm. This may limit the benefit achievable -from vector instructions as they provide assistance mostly in speeding up computation. Vector operations in this case -also compound our issue of race conditions, multiplying the number of concurrent lines of execution by the vector -length. The advantage we get from vectorization of the increment operation or the lookup into the output histogram may -not provide much benefit, especially when we account for the extra memory footprint required or synchronization -required to overcome the race conditions which we add from the additional concurrent streams of execution. It may make -sense to decline to add vectorized operations within histogram depending on the implementation used, and based on -performance results. +from vector instructions as they provide assistance mostly in speeding up computation. + +For histogram, there are a few things to consider. First, lets consider the calculation to determine which bin to +increment. There are two APIs, even and custom range which have significantly different methods to determine the bin to +increment. For the even bin API, the calculations to determine selected bin have some opportunity for vectorization as +each input has the same mathematical operations applied to each. However, for the custom range API, each input element +uses a binary search through a list of bin boundaries to determine the selected bin. This operation will have a +different length and control flow based upon each input element and will be very difficult to vectorize. + +Second, lets consider the increment operation itself. This operation increments a data dependant bin location, and may +result in conflicts between elements of the same vector. This increment operation therefore is unvectorizable without +more complex handling. Some hardware does implement SIMD conflict detection via specific intrinsics, but this is not +generally available, and certainly not available via OpenMP SIMD. Alternatively, we can multiply our number of temporary +histogram copies by a factor of the vector width, but we will need to determine if this is worth the overhead, memory +footprint, and extra accumulation at the end. OpenMP SIMD does provide an `ordered` structured block which we can use to +exempt the increment from SIMD operations as well. It must be determined if SIMD is beneficial in either API variety. It +seems only possible to be beneficial for the even bin API, but more investigation is required. + +Finally, for our below proposed implementation, there is the task of combining temporary histogram data into the global +output histogram. This is directly vectorizable via our existing brick_walk implementation. ### Serial Backend We plan to support a serial backend for histogram APIs in addition to openMP and TBB. This backend will handle all @@ -76,35 +93,10 @@ we need for `histogram`. However, we cannot simply use it without any added infr histogram. We should be able to use `parallel_for` as a building block for our implementation, but it requires some way to synchronize and accumulate between threads. -## Proposal -I believe there are two competing options for `histogram`, which may both have utility in the final implementation -depending on the use case. 
- -### Implementation One (Embarrassingly Parallel) -This method uses temporary storage and a pair of embarrassingly parallel `parallel_for` loops to accomplish the -`histogram`. -#### OpenMP: -1) Determine the number of threads that we will use locally -2) Create temporary data for the number of threads minus one copy of the histogram output sequence. Thread zero can - use the user-provided output data. -3) Run a `parallel_for` pattern which performs a `histogram` on the input sequence where each thread accumulates into - its own copy of the output sequence using the temporary storage to remove any race conditions. -4) Run a second `parallel_for` over the `histogram` output sequence which accumulates all temporary copies of the - histogram into the output histogram sequence. This step is also embarrassingly parallel. -5) Deallocate temporary storage. - -#### TBB -For TBB, we can do something similar, but we can use `enumerable_thread_specific` and its member function, `local()` to -provide a lazy allocation of thread local management, which does not require querying the number of threads or getting -the index. This allows us to operate in a composable manner while keeping the same conceptual implementation. -1) Embarrassingly parallel accumulation to thread local storage -2) Embarrassingly parallel aggregate to output data - -I believe the challenge here may be to properly provide the heuristics to choose between this implementation and the -other implementation. However, we should be able to have some reasonable division. +## Alternative Approaches -### Implementation Two (Atomics) +### Atomics This method uses atomic operations to remove the race conditions during accumulation. With atomic increments of the output histogram data, we can merely run a `parallel_for` pattern. @@ -113,22 +105,29 @@ specific to a backend, or custom atomics specific to a compiler. `C++17` provide only provide atomicity for data which is created with atomics in mind. This means allocating temporary data and then copying it to the output data. `C++20` provides `std::atomic_ref` which would allow us to wrap user-provided output data in an atomic wrapper, but we cannot assume `C++20` for all users. OpenMP provides atomic -operations, but that is only available for the OpenMP backend. The working plan is to implement a macro like +operations, but that is only available for the OpenMP backend. The working plan was to implement a macro like `_ONEDPL_ATOMIC_INCREMENT(var)` which uses an `std::atomic_ref` if available, and alternatively uses compiler builtins -like `InterlockedAdd` or `__atomic_fetch_add_n`. It needs to be investigated if we need to have any version which -needs to turn off the atomic implementation, due to lack of support by the compiler (I think this is unlikely). - -It remains to be seen if atomics are worth their overhead and contention from a performance perspective and may depend -on the different approaches available. - -### Selecting Between Algorithms -It may be the case that multiple aspects may provide an advantage to either algorithm one or two. Which `histogram` API -has been called, `n`, the number of output bins, and backend/atomic provider may all impact the performance trade-offs -between these two approaches. My intention is to experiment with these and be open to a heuristic to choose one or the -other based upon the circumstances if that is what the data suggests is best. The larger the number of output bins, the -better atomics should do vs redundant copies of the output. 
- -## Alternative Approaches +like `InterlockedAdd` or `__atomic_fetch_add_n`. In a proof of concept implementation,this seemed to work, but does +reach more into details than compiler / OS specifics than is desired for implementations prior to `C++20`. + +After experimenting with a proof of concept implementation of this implementation, it seems that the atomic +implementation has very limited applicability to real cases. We explored a spectrum of number of elements combined with +number of bins with both OpenMP and TBB. There was some subset of cases for which the atomics implementation +outperformed the proposed implementation (below). However, this was generally limited to some specific cases where +the number of bins was very large (~1 Million), and even for this subset significant benefit was only found for cases +with a small number for input elements relative to number of bins. This makes sense because the atomic implementation +is able to avoid the overhead of allocating and initializing temporary histogram copies, which is largest when +the number of bins is large compared to the number of input elements. With many bins, contention on atomics is also +limited as compared to the embarassingly parallel proposal which does experience this contention. + +When we examine the real world utility of these cases, we find that they are uncommon and unlikely to be the important +use cases. Histograms generally are used to categorize large images or arrays into a smaller number of bins to +characterize the result. Cases for which there are similar or more bins than input elements are not very practical in +practice. The maintenance and complexity cost associated with supporting and maintaining a second implementation to +serve this subset of cases does not seem to be justified. Therefore, this implementation has been discarded at this +time. + +### Other Unexplored Approaches * One could consider some sort of locking approach which locks mutexes for subsections of the output histogram prior to modifying them. It's possible such an approach could provide a similar approach to atomics, but with different overhead trade-offs. It seems quite likely that this would result in more overhead, but it could be worth exploring. @@ -140,13 +139,27 @@ better atomics should do vs redundant copies of the output. threads. This could be explored if we find temporary storage is too large for some cases and the atomic approach does not provide a good fallback. -## Open Questions +## Proposal +After exploring the above implementation for `histogram`, it seems the following proposal better represents the use +cases which are important, and provides reasonable performance for most cases. -* What is the overhead of atomics in general in this case and does the overhead there make them inherently worse than - merely having extra copies of the histogram and accumulating? +### Embarrassingly Parallel Via Temporary Histograms +This method uses temporary storage and a pair of embarrassingly parallel `parallel_for` loops to accomplish the +`histogram`. -* Is it worthwhile to have separate implementations for TBB and OpenMP because they may differ in the best-performing - implementation? What is the best heuristic for selecting between algorithms (if one is not the clear winner)? +#### OpenMP: +1) Determine the number of threads that we will use locally +2) In parallel, create and initialize temporary data for the number of threads copies of the histogram output sequence. 
+3) Run a `parallel_for` pattern which performs a `histogram` on the input sequence where each thread accumulates into + its own copy of the output sequence using the temporary storage to remove any race conditions. +4) Run a second `parallel_for` over the `histogram` output sequence which accumulates all temporary copies of the + histogram into the output histogram sequence. This step is also embarrassingly parallel. +5) Deallocate temporary storage. + +#### TBB +For TBB, we can do something similar, but we can use `enumerable_thread_specific` and its member function, `local()` to +provide a lazy allocation of thread local management, which does not require querying the number of threads or getting +the index. This allows us to operate in a composable manner while keeping the same conceptual implementation. +1) Embarrassingly parallel accumulation to thread local storage +2) Embarrassingly parallel aggregate to output data -* How will vectorized bricks perform, and in what situations will it be advantageous to use or not use vector - instructions? From 53f4643e490fce45caa3be7ebd9ab4cbe39a8112 Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Fri, 20 Dec 2024 14:24:36 -0500 Subject: [PATCH 18/31] improvements from feedback Signed-off-by: Dan Hoeflinger --- .../proposed/host_backend_histogram/README.md | 70 ++++++++----------- 1 file changed, 31 insertions(+), 39 deletions(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index 13439650ec4..21d7fe92c5c 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -1,21 +1,12 @@ # Host Backends Support for the Histogram APIs ## Introduction -In version 2022.6.0, two `histogram` APIs were added to oneDPL, but implementations were only provided for device -policies with the dpcpp backend. `Histogram` was added to the oneAPI specification 1.4 provisional release and should -be present in the 1.4 specification. Please see the +The oneDPL library added histogram APIs, currently implemented only for device policies with the DPC++ backend. These APIs are defined in the oneAPI Specification 1.4. Please see the [oneAPI Specification](https://github.com/uxlfoundation/oneAPI-spec/blob/main/source/elements/oneDPL/source/parallel_api/algorithms.rst#parallel-algorithms) -for a full definition of the semantics of the histogram APIs. In short, they take elements from an input sequence and -classify them into either evenly distributed or user-defined bins via a list of separating values and count the number -of values in each bin, writing to a user-provided output histogram sequence. Currently, `histogram` is not supported -with serial, tbb, or openmp backends in our oneDPL implementation. This RFC aims to propose the implementation of -`histogram` for these host-side backends. +for the details. The host-side backends (serial, TBB, OpenMP) are not yet supported. This RFC proposes extending histogram support to these backends. ## Motivations -Users don't always want to use device policies and accelerators to run their code. It may make more sense in many cases -to use a serial implementation or a host-side parallel implementation of `histogram`. It's natural for a user to expect -that oneDPL supports these other backends for all APIs. Another motivation for adding the support is simply to be spec -compliant with the oneAPI specification. +There are many cases to use a host-side serial or a host-side implementation of histogram. 
Another motivation for adding the support is simply to be spec compliant with the oneAPI specification.
 
 ## Design Considerations
 
@@ -29,52 +20,47 @@ which they can select from when using oneDPL. It is important that all combinati
 the `histogram` APIs.
 
 ### Performance
-As with all algorithms in oneDPL, our goal is to make them as performant as possible. By definition, `histogram` is a
-low computation algorithm which will likely be limited by memory bandwidth, especially for the evenly-divided case.
-Minimizing and optimizing memory accesses, as well as limiting unnecessary memory traffic of temporaries, will likely
-have a high impact on overall performance.
+With little computation, a histogram algorithm is likely a memory-bound algorithm. So, the implementation prioritizes
+reducing memory accesses and minimizing temporary memory traffic.
 
 ### Memory Footprint
-There are no guidelines here from the standard library as this is an extension API. However, we should always try to
-minimize memory footprint whenever possible. Minimizing memory footprint may also help us improve performance here
-because, as mentioned above, this will very likely be a memory bandwidth-bound API. In general, the normal case for
-histogram is for the number of elements in the input sequence to be far greater than the number of output histogram
-bins. We may be able to use that to our advantage.
+There are no guidelines here from the standard library as this is an extension API. Still, we will minimize memory
+footprint where possible.
 
 ### Code Reuse
-Our goal here is to make something maintainable and to reuse as much as we can which already exists and has been
-reviewed within oneDPL. With everything else, this must be balanced with performance considerations.
+It is a priority to reuse as much as we can which already exists and has been reviewed within oneDPL. We want to
+minimize adding requirements for parallel backends to implement, and lift as much as possible to the algorithm
+implementation level. We should be able to avoid adding a `__parallel_histogram` call in the individual backends, and
+instead rely upon `__parallel_for`.
 
 ### unseq Backend
 Currently oneDPL relies upon openMP SIMD to provide its vectorization, which is designed to provide vectorization across
 loop iterations. OneDPL does not directly use any intrinsics which may offer more complex functionality than what is
 provided by OpenMP.
 
-As mentioned above, histogram looks to be a memory bandwidth-dependent algorithm. This may limit the benefit achievable
-from vector instructions as they provide assistance mostly in speeding up computation.
-
-For histogram, there are a few things to consider. First, lets consider the calculation to determine which bin to
-increment. There are two APIs, even and custom range which have significantly different methods to determine the bin to
+There are a few parts of the histogram algorithm to consider. For the calculation to determine which bin to increment
+there are two APIs, even and custom range which have significantly different methods to determine the bin to
 increment. For the even bin API, the calculations to determine selected bin have some opportunity for vectorization as
 each input has the same mathematical operations applied to each. However, for the custom range API, each input element
 uses a binary search through a list of bin boundaries to determine the selected bin.
This operation will have a different length and control flow based upon each input element and will be very difficult to vectorize. -Second, lets consider the increment operation itself. This operation increments a data dependant bin location, and may +Next, lets consider the increment operation itself. This operation increments a data dependant bin location, and may result in conflicts between elements of the same vector. This increment operation therefore is unvectorizable without more complex handling. Some hardware does implement SIMD conflict detection via specific intrinsics, but this is not -generally available, and certainly not available via OpenMP SIMD. Alternatively, we can multiply our number of temporary -histogram copies by a factor of the vector width, but we will need to determine if this is worth the overhead, memory -footprint, and extra accumulation at the end. OpenMP SIMD does provide an `ordered` structured block which we can use to -exempt the increment from SIMD operations as well. It must be determined if SIMD is beneficial in either API variety. It -seems only possible to be beneficial for the even bin API, but more investigation is required. +available via OpenMP SIMD. Alternatively, we can multiply our number of temporary histogram copies by a factor of the +vector width, but it is unclear if it is worth the overhead. OpenMP SIMD provides an `ordered` structured block which +we can use to exempt the increment from SIMD operations as well. However, this often results in vectorization being +refused by the compiler. Initial implementation will avoid vectorization of this main histogram loop. -Finally, for our below proposed implementation, there is the task of combining temporary histogram data into the global -output histogram. This is directly vectorizable via our existing brick_walk implementation. +Last, for our below proposed implementation there is the task of combining temporary histogram data into the global +output histogram. This is directly vectorizable via our existing brick_walk implementation, and will be vectorized when +a vector policy is used. ### Serial Backend We plan to support a serial backend for histogram APIs in addition to openMP and TBB. This backend will handle all -policies types, but always provide a serial unvectorized implementation. +policies types, but always provide a serial unvectorized implementation. To make this backend compatible with the other +approaches, we will use a single temporary histogram copy, which then is copied to the final global histogram. ## Existing Patterns @@ -140,14 +126,20 @@ time. does not provide a good fallback. ## Proposal -After exploring the above implementation for `histogram`, it seems the following proposal better represents the use +After exploring the above implementation for `histogram`, the following proposal better represents the use cases which are important, and provides reasonable performance for most cases. ### Embarrassingly Parallel Via Temporary Histograms This method uses temporary storage and a pair of embarrassingly parallel `parallel_for` loops to accomplish the `histogram`. 
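As a rough illustration of the two-pass structure, the sketch below uses raw OpenMP and one temporary histogram per thread for the even-bin case. The function and variable names are hypothetical, and the actual proposal routes both passes through the existing `parallel_for` infrastructure rather than hand-written pragmas.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <omp.h>
#include <vector>

// Standalone sketch of the two-pass idea for the even-bin case; names are hypothetical and
// raw OpenMP is used only for illustration (the proposal reuses the existing parallel_for bricks).
void
histogram_even_sketch(const float* in, std::size_t n, std::uint64_t* out, std::size_t num_bins,
                      float min, float max)
{
    const int num_threads = omp_get_max_threads();
    // One temporary histogram per thread: O(num_bins) storage per thread used.
    std::vector<std::vector<std::uint64_t>> tmp(num_threads, std::vector<std::uint64_t>(num_bins, 0));

    // Pass 1: embarrassingly parallel over input elements; each thread writes only its own copy.
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i)
    {
        std::vector<std::uint64_t>& local = tmp[omp_get_thread_num()];
        if (in[i] >= min && in[i] < max)
        {
            std::size_t bin = static_cast<std::size_t>((in[i] - min) / (max - min) * num_bins);
            ++local[std::min(bin, num_bins - 1)]; // guard against rounding at the upper edge
        }
    }

    // Pass 2: embarrassingly parallel over bins; each bin sums its per-thread partial counts.
    #pragma omp parallel for
    for (std::size_t b = 0; b < num_bins; ++b)
    {
        std::uint64_t sum = 0;
        for (int t = 0; t < num_threads; ++t)
            sum += tmp[t][b];
        out[b] = sum;
    }
}
```

The first loop is race-free because each thread only touches its own temporary copy; the second loop is race-free because each bin is written by exactly one iteration.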
-#### OpenMP: +Create a generic `__thread_enumerable_storage` struct which will be defined by all parallel backends, which provides +the following: +* constructor which specifies the storage to be held per thread and a method to initialize it +* `get()` returns an iterator to the beginning of the current thread's temporary vector +* `get_with_id(int i)` returns an iterator to the beginning of temporary vector with index provided +* `size()` returns number of temporary arrays + 1) Determine the number of threads that we will use locally 2) In parallel, create and initialize temporary data for the number of threads copies of the histogram output sequence. 3) Run a `parallel_for` pattern which performs a `histogram` on the input sequence where each thread accumulates into From d718e0e861b75da9b3ae79363b4e39c40fb28d94 Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Fri, 20 Dec 2024 15:00:04 -0500 Subject: [PATCH 19/31] thread enumerable storage + address feedback Signed-off-by: Dan Hoeflinger --- .../proposed/host_backend_histogram/README.md | 37 +++++++++---------- 1 file changed, 17 insertions(+), 20 deletions(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index 21d7fe92c5c..c97e54025ab 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -133,25 +133,22 @@ cases which are important, and provides reasonable performance for most cases. This method uses temporary storage and a pair of embarrassingly parallel `parallel_for` loops to accomplish the `histogram`. -Create a generic `__thread_enumerable_storage` struct which will be defined by all parallel backends, which provides +For this algorithm, each parallel backend will add a `__thread_enumerable_storage<_StoredType>` struct which provides the following: -* constructor which specifies the storage to be held per thread and a method to initialize it -* `get()` returns an iterator to the beginning of the current thread's temporary vector -* `get_with_id(int i)` returns an iterator to the beginning of temporary vector with index provided -* `size()` returns number of temporary arrays - -1) Determine the number of threads that we will use locally -2) In parallel, create and initialize temporary data for the number of threads copies of the histogram output sequence. -3) Run a `parallel_for` pattern which performs a `histogram` on the input sequence where each thread accumulates into - its own copy of the output sequence using the temporary storage to remove any race conditions. -4) Run a second `parallel_for` over the `histogram` output sequence which accumulates all temporary copies of the - histogram into the output histogram sequence. This step is also embarrassingly parallel. -5) Deallocate temporary storage. - -#### TBB -For TBB, we can do something similar, but we can use `enumerable_thread_specific` and its member function, `local()` to -provide a lazy allocation of thread local management, which does not require querying the number of threads or getting -the index. This allows us to operate in a composable manner while keeping the same conceptual implementation. 
-1) Embarrassingly parallel accumulation to thread local storage -2) Embarrassingly parallel aggregate to output data +* constructor which takes a variadic list of args to pass to the constructor of each thread's object +* `get()` returns reference to the current threads stored object +* `get_with_id(int i)` returns reference to the stored object for an index +* `size()` returns number of stored objects + +In the TBB backend, this will use `enumerable_thread_specific` internally. For OpenMP, this will either pre-allocate +and initialize an object for each possible thread in parallel, or build functionality similar to +`enumerable_thread_specific` which will create storage on demand upon first use within a thread. This will be determined +within the histogram PR. The serial backend will merely create a single copy of the temporary object for use. + +With this new structure we will use the following algorithm: + +1) Run a `parallel_for` pattern which performs a `histogram` on the input sequence where each thread accumulates into + its own temporary histogram returned by `__thread_enumerable_storage`. +2) Run a second `parallel_for` over the `histogram` output sequence which accumulates all temporary copies of the + histogram created within `__thread_enumerable_storage` into the output histogram sequence. From bb9e6f9c0c70bd212bfbaa653abda93eccbb5cc9 Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Fri, 20 Dec 2024 15:04:16 -0500 Subject: [PATCH 20/31] remove general language keep specifics to histogram Signed-off-by: Dan Hoeflinger --- rfcs/proposed/host_backend_histogram/README.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index c97e54025ab..9bb93ef26cc 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -28,10 +28,9 @@ There are no guidelines here from the standard library as this is an extension A footprint where possible. ### Code Reuse -It is a priority to reuse as much as we can which already exists and has been reviewed within oneDPL. We want to -minimize adding requirements for parallel backends to implement, and lift as much as possible to the algorithm -implementation level. We should be able to avoid adding a `__parallel_histogram` call in the individual backends, and -instead rely upon `__parallel_for`. +We want to minimize adding requirements for parallel backends to implement, and lift as much as possible to the +algorithm implementation level. We should be able to avoid adding a `__parallel_histogram` call in the individual +backends, and instead rely upon `__parallel_for`. ### unseq Backend Currently oneDPL relies upon openMP SIMD to provide its vectorization, which is designed to provide vectorization across From 17e0510eac1eaa3784591faabb39ae2ce08e8bfa Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Fri, 20 Dec 2024 15:06:17 -0500 Subject: [PATCH 21/31] SIMD naming Signed-off-by: Dan Hoeflinger --- rfcs/proposed/host_backend_histogram/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index 9bb93ef26cc..b47e75bb447 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -32,7 +32,7 @@ We want to minimize adding requirements for parallel backends to implement, and algorithm implementation level. 
We should be able to avoid adding a `__parallel_histogram` call in the individual backends, and instead rely upon `__parallel_for`. -### unseq Backend +### SIMD/openMP SIMD Implementation Currently oneDPL relies upon openMP SIMD to provide its vectorization, which is designed to provide vectorization across loop iterations. OneDPL does not directly use any intrinsics which may offer more complex functionality than what is provided by OpenMP. From 961420982ad160336e511c0bb8754bc10548bf88 Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Fri, 20 Dec 2024 15:08:58 -0500 Subject: [PATCH 22/31] spelling Signed-off-by: Dan Hoeflinger --- rfcs/proposed/host_backend_histogram/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index b47e75bb447..bd7008366fa 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -44,7 +44,7 @@ each input has the same mathematical operations applied to each. However, for th uses a binary search through a list of bin boundaries to determine the selected bin. This operation will have a different length and control flow based upon each input element and will be very difficult to vectorize. -Next, lets consider the increment operation itself. This operation increments a data dependant bin location, and may +Next, lets consider the increment operation itself. This operation increments a data dependent bin location, and may result in conflicts between elements of the same vector. This increment operation therefore is unvectorizable without more complex handling. Some hardware does implement SIMD conflict detection via specific intrinsics, but this is not available via OpenMP SIMD. Alternatively, we can multiply our number of temporary histogram copies by a factor of the From 2964a9eaa2a67ee60ff8b505dfb0afe2688e8dfe Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Fri, 20 Dec 2024 15:33:12 -0500 Subject: [PATCH 23/31] clarifying thread enumerable storage Signed-off-by: Dan Hoeflinger --- rfcs/proposed/host_backend_histogram/README.md | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index bd7008366fa..77cd59655a1 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -1,12 +1,15 @@ # Host Backends Support for the Histogram APIs ## Introduction -The oneDPL library added histogram APIs, currently implemented only for device policies with the DPC++ backend. These APIs are defined in the oneAPI Specification 1.4. Please see the +The oneDPL library added histogram APIs, currently implemented only for device policies with the DPC++ backend. These +APIs are defined in the oneAPI Specification 1.4. Please see the [oneAPI Specification](https://github.com/uxlfoundation/oneAPI-spec/blob/main/source/elements/oneDPL/source/parallel_api/algorithms.rst#parallel-algorithms) -for the details. The host-side backends (serial, TBB, OpenMP) are not yet supported. This RFC proposes extending histogram support to these backends. +for the details. The host-side backends (serial, TBB, OpenMP) are not yet supported. This RFC proposes extending +histogram support to these backends. ## Motivations -There are many cases to use a host-side serial or a host-side implementation of histogram. 
Another motivation for adding the support is simply to be spec compliant with the oneAPI specification. +There are many cases to use a host-side serial or a host-side implementation of histogram. Another motivation for adding +the support is simply to be spec compliant with the oneAPI specification. ## Design Considerations @@ -139,10 +142,9 @@ the following: * `get_with_id(int i)` returns reference to the stored object for an index * `size()` returns number of stored objects -In the TBB backend, this will use `enumerable_thread_specific` internally. For OpenMP, this will either pre-allocate -and initialize an object for each possible thread in parallel, or build functionality similar to -`enumerable_thread_specific` which will create storage on demand upon first use within a thread. This will be determined -within the histogram PR. The serial backend will merely create a single copy of the temporary object for use. +In the TBB backend, this will use `enumerable_thread_specific` internally. For OpenMP, this will pre-allocate +and initialize an object for each possible thread in parallel. The serial backend will merely create a single copy of +the temporary object for use. With this new structure we will use the following algorithm: From 9287fd218c0aba9f7a42e8f2a7f60be1168e0f46 Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Mon, 30 Dec 2024 15:11:50 -0500 Subject: [PATCH 24/31] minor improvements Signed-off-by: Dan Hoeflinger --- .../proposed/host_backend_histogram/README.md | 44 ++++++++++++------- 1 file changed, 27 insertions(+), 17 deletions(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index 77cd59655a1..f4cec26131c 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -62,17 +62,20 @@ a vector policy is used. ### Serial Backend We plan to support a serial backend for histogram APIs in addition to openMP and TBB. This backend will handle all policies types, but always provide a serial unvectorized implementation. To make this backend compatible with the other -approaches, we will use a single temporary histogram copy, which then is copied to the final global histogram. +approaches, we will use a single temporary histogram copy, which then is copied to the final global histogram. In +our benchmarking, using a temporary copy performs similarly as compared to initializing and then accumulating directly +into the output global histogram. There seems to be no performance motivated reason to special case the serial +algorithm to use the global histogram directly. -## Existing Patterns +## Existing APIs / Patterns ### count_if `histogram` is similar to `count_if` in that it conditionally increments a number of counters based upon the data in a -sequence. `count_if` returns a scalar-typed value and doesn't provide any function to modify the variable being -incremented. Using `count_if` without significant modification would require us to loop through the entire sequence for -each output bin in the histogram. From a memory bandwidth perspective, this is untenable. Similarly, using a -`histogram` pattern to implement `count_if` is unlikely to provide a well-performing result in the end, as contention -should be far higher, and `reduce` is a very well-matched pattern performance-wise. +sequence. 
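A direct reuse of `count_if` would look something like the sketch below, written with a hypothetical `bin_of` helper; it makes `num_bins` complete passes over the input, which already hints at the memory-traffic problem discussed in the rest of this section.

```cpp
#include <algorithm>
#include <cstddef>
#include <execution>

// Naive reuse of count_if: one full pass over the input per output bin.
// bin_of is a hypothetical callable mapping an element to its bin index.
template <typename _InIt, typename _HistIt, typename _BinOf>
void
histogram_via_count_if(_InIt first, _InIt last, _HistIt hist, std::size_t num_bins, _BinOf bin_of)
{
    for (std::size_t b = 0; b < num_bins; ++b)
        hist[b] = std::count_if(std::execution::par, first, last,
                                [&](const auto& x) { return bin_of(x) == b; });
}
```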
`count_if` relies upon the `transform_reduce` pattern internally, and returns a scalar-typed value and doesn't +provide any function to modify the variable being incremented. Using `count_if` without significant modification would +require us to loop through the entire sequence for each output bin in the histogram. From a memory bandwidth +perspective, this is untenable. Similarly, using a `histogram` pattern to implement `count_if` is unlikely to provide a well-performing result in the end, as contention should be far higher, and `transform_reduce` is a very well-matched +pattern performance-wise. ### parallel_for `parallel_for` is an interesting pattern in that it is very generic and embarrassingly parallel. This is close to what @@ -132,24 +135,31 @@ After exploring the above implementation for `histogram`, the following proposal cases which are important, and provides reasonable performance for most cases. ### Embarrassingly Parallel Via Temporary Histograms -This method uses temporary storage and a pair of embarrassingly parallel `parallel_for` loops to accomplish the -`histogram`. +This method uses temporary storage and a pair of calls to backend specific `parallel_for` functions to accomplish the +`histogram`. These calls will use the existing infrastructure to provide properly composable parallelism, without extra +histogram-specific patterns in the implementation of a backend. -For this algorithm, each parallel backend will add a `__thread_enumerable_storage<_StoredType>` struct which provides -the following: +This algorithm does however require that each parallel backend will add a `__thread_enumerable_storage<_StoredType>` +struct which provides the following: * constructor which takes a variadic list of args to pass to the constructor of each thread's object -* `get()` returns reference to the current threads stored object +* `get_for_current_thread()` returns reference to the current thread's stored object * `get_with_id(int i)` returns reference to the stored object for an index * `size()` returns number of stored objects -In the TBB backend, this will use `enumerable_thread_specific` internally. For OpenMP, this will pre-allocate -and initialize an object for each possible thread in parallel. The serial backend will merely create a single copy of -the temporary object for use. +In the TBB backend, this will use `enumerable_thread_specific` internally. For OpenMP, we implement our own similar +thread local storage which will allocate and initialize the thread local storage at the first usage for each active +thread, similar to TBB. The serial backend will merely create a single copy of the temporary object for use. The serial +backend does not technically need any thread specific storage, but to avoid special casing for this serial backend, we +use a single copy of histogram. In practice, our benchmarking reports little difference in performance between this +implementation and the original, which directly accumulated to the output histogram. With this new structure we will use the following algorithm: 1) Run a `parallel_for` pattern which performs a `histogram` on the input sequence where each thread accumulates into - its own temporary histogram returned by `__thread_enumerable_storage`. + its own temporary histogram returned by `__thread_enumerable_storage`. The parallelism is divided on the input + element axis, and we rely upon existing `parallel_for` to implement chunksize and thread composibility. 
2) Run a second `parallel_for` over the `histogram` output sequence which accumulates all temporary copies of the - histogram created within `__thread_enumerable_storage` into the output histogram sequence. + histogram created within `__thread_enumerable_storage` into the output histogram sequence. The parallelism is divided + on the histogram bin axis, and each chunk loops through all temporary histograms to accumulate into the output + histogram. From cdf50929da2e3c3fe9042ec72796676086d0c6a5 Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Mon, 30 Dec 2024 15:14:40 -0500 Subject: [PATCH 25/31] spelling Signed-off-by: Dan Hoeflinger --- rfcs/proposed/host_backend_histogram/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index f4cec26131c..8fda6effa44 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -157,7 +157,7 @@ With this new structure we will use the following algorithm: 1) Run a `parallel_for` pattern which performs a `histogram` on the input sequence where each thread accumulates into its own temporary histogram returned by `__thread_enumerable_storage`. The parallelism is divided on the input - element axis, and we rely upon existing `parallel_for` to implement chunksize and thread composibility. + element axis, and we rely upon existing `parallel_for` to implement chunksize and thread composability. 2) Run a second `parallel_for` over the `histogram` output sequence which accumulates all temporary copies of the histogram created within `__thread_enumerable_storage` into the output histogram sequence. The parallelism is divided on the histogram bin axis, and each chunk loops through all temporary histograms to accumulate into the output From 215c2b791acdca554e19d122501308695e8c0a2c Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Mon, 30 Dec 2024 15:26:46 -0500 Subject: [PATCH 26/31] adding link to implementation Signed-off-by: Dan Hoeflinger --- rfcs/proposed/host_backend_histogram/README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index 8fda6effa44..21c7982ac40 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -7,6 +7,8 @@ APIs are defined in the oneAPI Specification 1.4. Please see the for the details. The host-side backends (serial, TBB, OpenMP) are not yet supported. This RFC proposes extending histogram support to these backends. +The pull request for the proposed implementation exists [here](https://github.com/oneapi-src/oneDPL/pull/1974). + ## Motivations There are many cases to use a host-side serial or a host-side implementation of histogram. Another motivation for adding the support is simply to be spec compliant with the oneAPI specification. 
From 04d5127a7ab82fd183b35953bb5c7a2d8f37cebd Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Wed, 15 Jan 2025 08:49:44 -0500 Subject: [PATCH 27/31] rename to __enumerable_thread_local_storage Signed-off-by: Dan Hoeflinger --- rfcs/proposed/host_backend_histogram/README.md | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index 21c7982ac40..9aa1b2f5d22 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -76,7 +76,8 @@ algorithm to use the global histogram directly. sequence. `count_if` relies upon the `transform_reduce` pattern internally, and returns a scalar-typed value and doesn't provide any function to modify the variable being incremented. Using `count_if` without significant modification would require us to loop through the entire sequence for each output bin in the histogram. From a memory bandwidth -perspective, this is untenable. Similarly, using a `histogram` pattern to implement `count_if` is unlikely to provide a well-performing result in the end, as contention should be far higher, and `transform_reduce` is a very well-matched +perspective, this is untenable. Similarly, using a `histogram` pattern to implement `count_if` is unlikely to provide a +well-performing result in the end, as contention should be far higher, and `transform_reduce` is a very well-matched pattern performance-wise. ### parallel_for @@ -141,8 +142,8 @@ This method uses temporary storage and a pair of calls to backend specific `para `histogram`. These calls will use the existing infrastructure to provide properly composable parallelism, without extra histogram-specific patterns in the implementation of a backend. -This algorithm does however require that each parallel backend will add a `__thread_enumerable_storage<_StoredType>` -struct which provides the following: +This algorithm does however require that each parallel backend will add a +`__enumerable_thread_local_storage<_StoredType>` struct which provides the following: * constructor which takes a variadic list of args to pass to the constructor of each thread's object * `get_for_current_thread()` returns reference to the current thread's stored object * `get_with_id(int i)` returns reference to the stored object for an index @@ -158,10 +159,10 @@ implementation and the original, which directly accumulated to the output histog With this new structure we will use the following algorithm: 1) Run a `parallel_for` pattern which performs a `histogram` on the input sequence where each thread accumulates into - its own temporary histogram returned by `__thread_enumerable_storage`. The parallelism is divided on the input + its own temporary histogram returned by `__enumerable_thread_local_storage`. The parallelism is divided on the input element axis, and we rely upon existing `parallel_for` to implement chunksize and thread composability. 2) Run a second `parallel_for` over the `histogram` output sequence which accumulates all temporary copies of the - histogram created within `__thread_enumerable_storage` into the output histogram sequence. The parallelism is divided - on the histogram bin axis, and each chunk loops through all temporary histograms to accumulate into the output - histogram. + histogram created within `__enumerable_thread_local_storage` into the output histogram sequence. 
The parallelism is + divided on the histogram bin axis, and each chunk loops through all temporary histograms to accumulate into the + output histogram. From fe1efa20cf05a20b7cebc26bcab53b47714d096b Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Wed, 15 Jan 2025 09:03:05 -0500 Subject: [PATCH 28/31] Added sections on complexity Signed-off-by: Dan Hoeflinger --- rfcs/proposed/host_backend_histogram/README.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index 9aa1b2f5d22..f193f42febe 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -166,3 +166,16 @@ With this new structure we will use the following algorithm: divided on the histogram bin axis, and each chunk loops through all temporary histograms to accumulate into the output histogram. +### Temporary Memory Requirements +Both algorithms should have temporary memory complexity of `O(num_bins)`, and specifically will allocate `num_bins` +output histogram typed elements for each thread used. Depending on the number of input elements, all available threads +may not be used. + +### Computational Complexity +#### Even Bin API +The proposed algorithm should have `O(N) + O(num_bins)` operations where `N` is the number of input elements, and +`num_bins` is the number of histogram bins. + +#### Custom Range Bin API +The proposed algorithm should have `O(N * log(num_bins)) + O(num_bins)` operations where `N` is the number of input +elements, and `num_bins` is the number of histogram bins. \ No newline at end of file From 60ec0e5165ee877643c6bf1a76775220eb0405af Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Wed, 15 Jan 2025 09:15:20 -0500 Subject: [PATCH 29/31] spelling Signed-off-by: Dan Hoeflinger --- rfcs/proposed/host_backend_histogram/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index f193f42febe..001fe130afe 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -112,7 +112,7 @@ the number of bins was very large (~1 Million), and even for this subset signifi with a small number for input elements relative to number of bins. This makes sense because the atomic implementation is able to avoid the overhead of allocating and initializing temporary histogram copies, which is largest when the number of bins is large compared to the number of input elements. With many bins, contention on atomics is also -limited as compared to the embarassingly parallel proposal which does experience this contention. +limited as compared to the embarrassingly parallel proposal which does experience this contention. When we examine the real world utility of these cases, we find that they are uncommon and unlikely to be the important use cases. 
Histograms generally are used to categorize large images or arrays into a smaller number of bins to From 54e16b6d9c3bad092d3f79c90204eb00fa94d27c Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Wed, 15 Jan 2025 16:10:12 -0500 Subject: [PATCH 30/31] wording adjustments Signed-off-by: Dan Hoeflinger --- rfcs/proposed/host_backend_histogram/README.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index 001fe130afe..4ebf1235bbb 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -39,8 +39,7 @@ backends, and instead rely upon `__parallel_for`. ### SIMD/openMP SIMD Implementation Currently oneDPL relies upon openMP SIMD to provide its vectorization, which is designed to provide vectorization across -loop iterations. OneDPL does not directly use any intrinsics which may offer more complex functionality than what is -provided by OpenMP. +loop iterations, oneDPL does not directly use any intrinsics. There are a few parts of the histogram algorithm to consider. For the calculation to determine which bin to increment there are two APIs, even and custom range which have significantly different methods to determine the bin to From 77435a313467c41a992d4c00ecdd53dc885cf82b Mon Sep 17 00:00:00 2001 From: Dan Hoeflinger Date: Wed, 15 Jan 2025 16:28:31 -0500 Subject: [PATCH 31/31] minor formatting Signed-off-by: Dan Hoeflinger --- .../proposed/host_backend_histogram/README.md | 89 +++++++++---------- 1 file changed, 44 insertions(+), 45 deletions(-) diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md index 4ebf1235bbb..00036d8c2f1 100644 --- a/rfcs/proposed/host_backend_histogram/README.md +++ b/rfcs/proposed/host_backend_histogram/README.md @@ -4,8 +4,8 @@ The oneDPL library added histogram APIs, currently implemented only for device policies with the DPC++ backend. These APIs are defined in the oneAPI Specification 1.4. Please see the [oneAPI Specification](https://github.com/uxlfoundation/oneAPI-spec/blob/main/source/elements/oneDPL/source/parallel_api/algorithms.rst#parallel-algorithms) -for the details. The host-side backends (serial, TBB, OpenMP) are not yet supported. This RFC proposes extending -histogram support to these backends. +for details. The host-side backends (serial, TBB, OpenMP) are not yet supported. This RFC proposes extending histogram +support to these backends. The pull request for the proposed implementation exists [here](https://github.com/oneapi-src/oneDPL/pull/1974). @@ -20,12 +20,12 @@ Provide support for the `histogram` APIs with the following policies and backend - Policies: `seq`, `unseq`, `par`, `par_unseq` - Backends: `serial`, `tbb`, `openmp` -Users have a choice of execution policies when calling oneDPL APIs. They also have a number of options of backends -which they can select from when using oneDPL. It is important that all combinations of these options have support for -the `histogram` APIs. +Users have a choice of execution policies when calling oneDPL APIs. They also have a number of options of backends which +they can select from when using oneDPL. It is important that all combinations of these options have support for the +`histogram` APIs. ### Performance -With little computation, a histogram algorithm is likely a memory-bound algorithm. 
So, the implementation prioritize +With little computation, a histogram algorithm is likely a memory-bound algorithm. So, the implementation prioritizes reducing memory accesses and minimizing temporary memory traffic. ### Memory Footprint @@ -42,18 +42,18 @@ Currently oneDPL relies upon openMP SIMD to provide its vectorization, which is loop iterations, oneDPL does not directly use any intrinsics. There are a few parts of the histogram algorithm to consider. For the calculation to determine which bin to increment -there are two APIs, even and custom range which have significantly different methods to determine the bin to -increment. For the even bin API, the calculations to determine selected bin have some opportunity for vectorization as -each input has the same mathematical operations applied to each. However, for the custom range API, each input element -uses a binary search through a list of bin boundaries to determine the selected bin. This operation will have a -different length and control flow based upon each input element and will be very difficult to vectorize. +there are two APIs, even and custom range which have significantly different methods to determine the bin to increment. +For the even bin API, the calculations to determine selected bin have some opportunity for vectorization as each input +has the same mathematical operations applied to each. However, for the custom range API, each input element uses a +binary search through a list of bin boundaries to determine the selected bin. This operation will have a different +length and control flow based upon each input element and will be very difficult to vectorize. -Next, lets consider the increment operation itself. This operation increments a data dependent bin location, and may +Next, let's consider the increment operation itself. This operation increments a data dependent bin location, and may result in conflicts between elements of the same vector. This increment operation therefore is unvectorizable without more complex handling. Some hardware does implement SIMD conflict detection via specific intrinsics, but this is not available via OpenMP SIMD. Alternatively, we can multiply our number of temporary histogram copies by a factor of the vector width, but it is unclear if it is worth the overhead. OpenMP SIMD provides an `ordered` structured block which -we can use to exempt the increment from SIMD operations as well. However, this often results in vectorization being +we can use to exempt the increment from SIMD operations as well. However, this often results in vectorization being refused by the compiler. Initial implementation will avoid vectorization of this main histogram loop. Last, for our below proposed implementation there is the task of combining temporary histogram data into the global @@ -63,10 +63,10 @@ a vector policy is used. ### Serial Backend We plan to support a serial backend for histogram APIs in addition to openMP and TBB. This backend will handle all policies types, but always provide a serial unvectorized implementation. To make this backend compatible with the other -approaches, we will use a single temporary histogram copy, which then is copied to the final global histogram. In -our benchmarking, using a temporary copy performs similarly as compared to initializing and then accumulating directly -into the output global histogram. There seems to be no performance motivated reason to special case the serial -algorithm to use the global histogram directly. 
+approaches, we will use a single temporary histogram copy, which then is copied to the final global histogram. In our +benchmarking, using a temporary copy performs similarly as compared to initializing and then accumulating directly into +the output global histogram. There seems to be no performance motivated reason to special case the serial algorithm to +use the global histogram directly. ## Existing APIs / Patterns @@ -86,7 +86,6 @@ we need for `histogram`. However, we cannot simply use it without any added infr histogram. We should be able to use `parallel_for` as a building block for our implementation, but it requires some way to synchronize and accumulate between threads. - ## Alternative Approaches ### Atomics @@ -97,21 +96,21 @@ To deal with atomics appropriately, we have some limitations. We must either use specific to a backend, or custom atomics specific to a compiler. `C++17` provides `std::atomic`, however, this can only provide atomicity for data which is created with atomics in mind. This means allocating temporary data and then copying it to the output data. `C++20` provides `std::atomic_ref` which would allow us to wrap user-provided output -data in an atomic wrapper, but we cannot assume `C++20` for all users. OpenMP provides atomic -operations, but that is only available for the OpenMP backend. The working plan was to implement a macro like -`_ONEDPL_ATOMIC_INCREMENT(var)` which uses an `std::atomic_ref` if available, and alternatively uses compiler builtins -like `InterlockedAdd` or `__atomic_fetch_add_n`. In a proof of concept implementation,this seemed to work, but does -reach more into details than compiler / OS specifics than is desired for implementations prior to `C++20`. +data in an atomic wrapper, but we cannot assume `C++20` for all users. OpenMP provides atomic operations, but that is +only available for the OpenMP backend. The working plan was to implement a macro like `_ONEDPL_ATOMIC_INCREMENT(var)` +which uses an `std::atomic_ref` if available, and alternatively uses compiler builtins like `InterlockedAdd` or +`__atomic_fetch_add_n`. In a proof of concept implementation, this seemed to work, but does reach more into details than +compiler / OS specifics than is desired for implementations prior to `C++20`. After experimenting with a proof of concept implementation of this implementation, it seems that the atomic implementation has very limited applicability to real cases. We explored a spectrum of number of elements combined with number of bins with both OpenMP and TBB. There was some subset of cases for which the atomics implementation -outperformed the proposed implementation (below). However, this was generally limited to some specific cases where -the number of bins was very large (~1 Million), and even for this subset significant benefit was only found for cases -with a small number for input elements relative to number of bins. This makes sense because the atomic implementation -is able to avoid the overhead of allocating and initializing temporary histogram copies, which is largest when -the number of bins is large compared to the number of input elements. With many bins, contention on atomics is also -limited as compared to the embarrassingly parallel proposal which does experience this contention. +outperformed the proposed implementation (below). 
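For reference, the kind of increment wrapper described above might look like the following sketch. The macro name comes from the text, but the feature-test guard and the specific fallbacks chosen here are assumptions rather than the actual proof-of-concept code.

```cpp
#include <atomic>

// Sketch of the dispatch described above; the guard and fallbacks are assumptions.
// Assumes the counter is an integral type that is suitably sized and aligned.
#if defined(__cpp_lib_atomic_ref)
#    define _ONEDPL_ATOMIC_INCREMENT(var) (std::atomic_ref(var).fetch_add(1))
#elif defined(_MSC_VER)
#    include <intrin.h>
#    define _ONEDPL_ATOMIC_INCREMENT(var) (_InterlockedIncrement64(reinterpret_cast<__int64*>(&(var))))
#else // GCC / Clang style builtins
#    define _ONEDPL_ATOMIC_INCREMENT(var) (__atomic_fetch_add(&(var), 1, __ATOMIC_RELAXED))
#endif
```

With such a wrapper, a single `parallel_for` pass could increment bins in place, trading temporary storage for the contention effects discussed in the measurements that follow.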
However, this was generally limited to some specific cases where the +number of bins was very large (~1 Million), and even for this subset significant benefit was only found for cases with a +small number for input elements relative to number of bins. This makes sense because the atomic implementation is able +to avoid the overhead of allocating and initializing temporary histogram copies, which is largest when the number of +bins is large compared to the number of input elements. With many bins, contention on atomics is also limited as +compared to the embarrassingly parallel proposal which does experience this contention. When we examine the real world utility of these cases, we find that they are uncommon and unlikely to be the important use cases. Histograms generally are used to categorize large images or arrays into a smaller number of bins to @@ -122,19 +121,19 @@ time. ### Other Unexplored Approaches * One could consider some sort of locking approach which locks mutexes for subsections of the output histogram prior to - modifying them. It's possible such an approach could provide a similar approach to atomics, but with different - overhead trade-offs. It seems quite likely that this would result in more overhead, but it could be worth exploring. +modifying them. It's possible such an approach could provide a similar approach to atomics, but with different +overhead trade-offs. It seems quite likely that this would result in more overhead, but it could be worth exploring. * Another possible approach could be to do something like the proposed implementation one, but with some sparse - representation of output data. However, I think the general assumptions we can make about the normal case make this - less likely to be beneficial. It is quite likely that `n` is much larger than the output histograms, and that a large - percentage of the output histogram may be occupied, even when considering dividing the input amongst multiple - threads. This could be explored if we find temporary storage is too large for some cases and the atomic approach - does not provide a good fallback. +representation of output data. However, I think the general assumptions we can make about the normal case make this +less likely to be beneficial. It is quite likely that `n` is much larger than the output histograms, and that a large +percentage of the output histogram may be occupied, even when considering dividing the input amongst multiple threads. +This could be explored if we find temporary storage is too large for some cases and the atomic approach does not +provide a good fallback. ## Proposal -After exploring the above implementation for `histogram`, the following proposal better represents the use -cases which are important, and provides reasonable performance for most cases. +After exploring the above implementation for `histogram`, the following proposal better represents the use cases which +are important, and provides reasonable performance for most cases. ### Embarrassingly Parallel Via Temporary Histograms This method uses temporary storage and a pair of calls to backend specific `parallel_for` functions to accomplish the @@ -148,7 +147,7 @@ This algorithm does however require that each parallel backend will add a * `get_with_id(int i)` returns reference to the stored object for an index * `size()` returns number of stored objects -In the TBB backend, this will use `enumerable_thread_specific` internally. 
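A sketch of how such a wrapper might be layered over `tbb::enumerable_thread_specific` is shown below; the class and member names mirror the interface listed above, but the body is an assumption and not the reviewed oneDPL code.

```cpp
#include <cstddef>
#include <iterator>
#include <tbb/enumerable_thread_specific.h>

// Hedged sketch of a TBB-backed wrapper exposing the interface listed above.
template <typename _StoredType>
class __enumerable_thread_local_storage
{
  public:
    // Arguments are captured and forwarded to the constructor of each thread's object.
    template <typename... _Args>
    __enumerable_thread_local_storage(_Args... __args)
        : __storage([=]() { return _StoredType(__args...); })
    {
    }

    // Reference to the calling thread's object, lazily created on first use by local().
    _StoredType&
    get_for_current_thread()
    {
        return __storage.local();
    }

    // Reference to the object at a given index, used by the bin-wise accumulation pass.
    _StoredType&
    get_with_id(std::size_t __i)
    {
        return *std::next(__storage.begin(), __i);
    }

    // Number of thread-local objects that have actually been created.
    std::size_t
    size() const
    {
        return __storage.size();
    }

  private:
    tbb::enumerable_thread_specific<_StoredType> __storage;
};
```

Because `local()` creates storage on demand, only the threads that actually participate in the first pass pay for a temporary histogram, which is what keeps this approach composable.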
For OpenMP, we implement our own similar +In the TBB backend, this will use `enumerable_thread_specific` internally. For OpenMP, we implement our own similar thread local storage which will allocate and initialize the thread local storage at the first usage for each active thread, similar to TBB. The serial backend will merely create a single copy of the temporary object for use. The serial backend does not technically need any thread specific storage, but to avoid special casing for this serial backend, we @@ -158,12 +157,12 @@ implementation and the original, which directly accumulated to the output histog With this new structure we will use the following algorithm: 1) Run a `parallel_for` pattern which performs a `histogram` on the input sequence where each thread accumulates into - its own temporary histogram returned by `__enumerable_thread_local_storage`. The parallelism is divided on the input - element axis, and we rely upon existing `parallel_for` to implement chunksize and thread composability. +its own temporary histogram returned by `__enumerable_thread_local_storage`. The parallelism is divided on the input +element axis, and we rely upon existing `parallel_for` to implement chunksize and thread composability. 2) Run a second `parallel_for` over the `histogram` output sequence which accumulates all temporary copies of the - histogram created within `__enumerable_thread_local_storage` into the output histogram sequence. The parallelism is - divided on the histogram bin axis, and each chunk loops through all temporary histograms to accumulate into the - output histogram. +histogram created within `__enumerable_thread_local_storage` into the output histogram sequence. The parallelism is +divided on the histogram bin axis, and each chunk loops through all temporary histograms to accumulate into the +output histogram. ### Temporary Memory Requirements Both algorithms should have temporary memory complexity of `O(num_bins)`, and specifically will allocate `num_bins` @@ -177,4 +176,4 @@ The proposed algorithm should have `O(N) + O(num_bins)` operations where `N` is #### Custom Range Bin API The proposed algorithm should have `O(N * log(num_bins)) + O(num_bins)` operations where `N` is the number of input -elements, and `num_bins` is the number of histogram bins. \ No newline at end of file +elements, and `num_bins` is the number of histogram bins.
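Finally, for context, the following sketch shows how a user would call the even and custom-range histogram APIs with a host execution policy once this support lands. The call shapes follow the oneAPI specification linked above; the exact header and namespace spellings should be verified against the oneDPL release.

```cpp
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/execution>

#include <cstdint>
#include <vector>

int
main()
{
    std::vector<float> input = {0.1f, 0.7f, 3.4f, 5.9f, 7.2f, 9.8f, 4.4f, 4.5f};

    // Even-bin API: 4 equal-width bins covering [0, 10).
    std::vector<std::uint64_t> even_bins(4, 0);
    oneapi::dpl::histogram(oneapi::dpl::execution::par, input.begin(), input.end(),
                           4, 0.0f, 10.0f, even_bins.begin());

    // Custom-range API: boundaries {0, 1, 5, 10} define 3 bins.
    std::vector<float> boundaries = {0.0f, 1.0f, 5.0f, 10.0f};
    std::vector<std::uint64_t> custom_bins(boundaries.size() - 1, 0);
    oneapi::dpl::histogram(oneapi::dpl::execution::par, input.begin(), input.end(),
                           boundaries.begin(), boundaries.end(), custom_bins.begin());

    return 0;
}
```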