Skip to content

Commit

Permalink
Merge pull request #117 from wmotion/Episode02
Browse files Browse the repository at this point in the history
Episode02: fix variable names and shallow editing elsewhere
  • Loading branch information
isazi authored Nov 27, 2024
2 parents b28cc41 + 3dba7df commit b0a7d3b
Showing 1 changed file with 28 additions and 27 deletions.
55 changes: 28 additions & 27 deletions episodes/cupy.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -28,8 +28,8 @@ From now on we can also use the word *host* to refer to the CPU on the laptop, d
We start by generating an image using Python and NumPy code.
We want to compute a convolution on this input image once on the host and once on the device, and then compare both the execution times and the results.

In an iPython shell or a Jupyter notebook, we can write and execute the following code on the host.
The pixel values will be zero everywhere except for a regular grid of single pixels having value one, very much like a Dirac's delta function; hence the input image is named `deltas`.
We can write and execute the following code on the host in an iPython shell or a Jupyter notebook.
The pixel values will be zero everywhere except for a regular grid of single pixels having value one, very much like a Dirac delta function; hence the input image is named `diracs`.

~~~python
import numpy as np
Expand All @@ -39,7 +39,7 @@ diracs = np.zeros((2048, 2048))
diracs[8::16,8::16] = 1
~~~

We can display the top-left corner of the input image to get a feeling of how it looks like, as follows:
We can display the top-left corner of the input image to get a feel for how it looks like, as follows:

~~~python
import pylab as pyl
Expand All @@ -54,21 +54,21 @@ pyl.show()

and you should obtain the following image:

![Grid of delta functions](./fig/diracs.png){alt='Grid of delta functions.'}
![Grid of Dirac delta functions](./fig/diracs.png){alt='Grid of Dirac delta functions.'}

## Gaussian convolutions

The illustration below shows an example of convolution (courtesy of Michael Plotke, CC BY-SA 3.0, via Wikimedia Commons).
Looking at the terminology in the illustration, be forewarned that the word *kernel* happens to have different meanings that, inconveniently, apply to both mathematical convolution and coding on a GPU device.
To know more about convolutions, we encourage you to check out [this GitHub repository](https://github.com/vdumoulin/conv_arithmetic) by Vincent Dumoulin and Francesco Visin with some great animations.
Looking at the terminology in the illustration, be forewarned that, inconveniently, the meaning of the word *kernel* is different when talking of mathematical convolutions and of codes for programming a GPU device.
To know more about convolutions, we encourage you to check out [this GitHub repository](https://github.com/vdumoulin/conv_arithmetic) by Vincent Dumoulin and Francesco Visin, who show great animations.

![Example of animated convolution.](./fig/2D_Convolution_Animation.gif){alt="Example of animated convolution"}

In this course section, we will convolve our image with a 2D Gaussian function, having the general form:

$$G(x,y) = \frac{1}{2\pi \sigma^2} \exp\left(-\frac{x^2 + y^2}{2 \sigma^2}\right)$$

where $x$ and $y$ are distances from the origin, and $\sigma$ controls the width of the Gaussian curve.
where $x$ and $y$ are distances from the origin, and $\sigma$ controls the width of the curve.
Since we can think of an image as a matrix of color values, the convolution of that image with a kernel generates a new matrix with different color values.
In particular, convolving images with a 2D Gaussian kernel changes the value of each pixel into a weighted average of the neighboring pixels, thereby smoothing out the features in the input image.

Expand Down Expand Up @@ -105,7 +105,7 @@ pyl.imshow(gauss)
pyl.show()
~~~

The code above produces this image of a symmetrical two-dimensional Gaussian:
The code above produces this image of a symmetrical two-dimensional Gaussian surface:

![Two-dimensional Gaussian.](./fig/gauss.png){alt="Two-dimensional Gaussian"}

Expand All @@ -129,9 +129,9 @@ We expect that to be in the region of a couple of seconds, as shown in the timin
2.4 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
~~~

Displaying just a corner of the image shows that the Gaussian has so much blurred the original pattern of ones surrounded by zeros that we end up with a regular pattern of Gaussians.
Displaying just a corner of the image shows that the Gaussian filter has so much blurred the original pattern of ones surrounded by zeros that we end up with a regular pattern of Gaussians.

![Grid of Gaussians in the convoluted image.](./fig/convolved_image.png){alt="Grid of Gaussians in the convoluted image"}
![Grid of Gaussian surfaces in the convoluted image.](./fig/convolved_image.png){alt="Grid of Gaussians in the convoluted image"}

## Convolution on the GPU Using CuPy

Expand All @@ -142,7 +142,7 @@ This picture depicts the different components of CPU and GPU and how they are co
![CPU and GPU are separate entities with an own memory.](./fig/CPU_and_GPU_separated.png){alt="CPU and GPU are separate entities with an own memory"}

This means that the array created with NumPy is physically stored in a memory of the host's and, therefore, is only available to the CPU.
Since our input image and convolution filter are not yet present in the device memory, we need to copy both data to the GPU before executing any code on it.
Since our input image and convolution filter are not present in the device memory yet, we need to copy them to the GPU before executing any code on it.
In practice, we use CuPy to copy the arrays `diracs` and `gauss` from the host's Random Access Memory (RAM) to the GPU memory as follows:

~~~python
Expand All @@ -154,10 +154,11 @@ gauss_gpu = cp.asarray(gauss)

Now it is time to compute the convolution on our GPU.
Inconveniently, SciPy does not offer methods running on GPUs.
Hence, we import the convolution function from a CuPy package aliased as `cupyx`, whose sub-package [`cupyx.scipy`](https://docs.cupy.dev/en/stable/reference/scipy.html) performs a selection of the SciPy operations.
We will soon verify that the GPU convolution function of `cupyx` works out the same calculations as the CPU convolution function of SciPy.
In general, CuPy proper and NumPy are so similar as are the `cupyx` methods and SciPy; this is intended to invite programmers already familiar with NumPy and SciPy to use the GPU for computing.
For now, let's again record the execution time on the device for the same convolution as the host, and can compare the respective performances.
Hence, we import the convolution function from a CuPy package aliased as `cupyx`.
Its sub-package [`cupyx.scipy`](https://docs.cupy.dev/en/stable/reference/scipy.html) performs a selection of the SciPy operations.
We will soon verify that the convolution function of `cupyx` works out the same calculations on the GPU as the convolution function of SciPy on the CPU.
In general, CuPy proper and NumPy are so similar one to another as are the `cupyx` methods and SciPy; this is intended to invite programmers already familiar with NumPy and SciPy to use the GPU for computing.
For now, let's again record the execution time on the device for the same convolution as the host, and compare the respective performances.

~~~python
from cupyx.scipy.signal import convolve2d as convolve2d_gpu
Expand All @@ -166,8 +167,8 @@ convolved_image_gpu = convolve2d_gpu(diracs_gpu, gauss_gpu)
%timeit -n 7 -r 1 convolved_image_gpu = convolve2d_gpu(diracs_gpu, gauss_gpu)
~~~

Also the execution time of the GPU convolution will depend very much on the hardware used, as seen for the host.
The timing using a NVIDIA Tesla T4 on Google Colab was:
The execution time of the GPU convolution will depend very much on the hardware used, just as seen for the host.
The timing using a NVIDIA Tesla T4 at Google Colab was:

~~~output
98.2 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 7 loops each)
Expand All @@ -178,20 +179,20 @@ Impressive, but is that true?

## Measuring performance

So far we used `timeit` to measure the performance of our Python code, no matter whether it was running on the CPU or was GPU-accelerated.
However, the execution on the GPU is *asynchronous*: the Python interpreter takes back control of the program execution immediately, while the GPU is still executing the task.
So far we used `timeit` to measure the performance of our Python code, regardless of whether it was running on the CPU or was GPU-accelerated.
However, the execution on the GPU is *asynchronous*: the Python interpreter immediately takes back control of the program execution, while the GPU is still executing the task.
Therefore, the timing of `timeit` is not reliable.

Conveniently, `cupyx` provides the function `benchmark()` that measures the actual execution time in the GPU.
The following code executes `convolve2d_gpu()` with the appropriate arguments ten times, and stores inside the `.gpu_times` attribute of the variable `execution_gpu` the execution time of each run in seconds.
The following code executes `convolve2d_gpu()` with the appropriate arguments ten times, and stores inside the `.gpu_times` attribute of the variable `benchmark_gpu` the execution time of each run in seconds.

~~~python
from cupyx.profiler import benchmark

benchmark_gpu = benchmark(convolve2d_gpu, (diracs_gpu, gauss_gpu), n_repeat=10)
~~~

These measurements are also more stable and representative, because `benchmark()` disregards the compile time and the repetitions warm up the GPU.
These measurements are also more stable and representative, because `benchmark()` disregards the compile time and because the repetitions warm up the GPU.
We can then average the execution times, as follows:

~~~python
Expand All @@ -215,7 +216,7 @@ If this works, it will save us the time and effort of transferring the arrays `d

::::::::::::::::::::::::::::::::::::: solution

We can call the GPU convolution function `convolve2d_gpu()` directly with `deltas` and `gauss` as argument:
We can call the GPU convolution function `convolve2d_gpu()` directly with `diracs` and `gauss` as argument:

~~~python
convolve2d_gpu(diracs, gauss)
Expand Down Expand Up @@ -254,7 +255,7 @@ Hint: use the `cp.asnumpy()` method to copy a CuPy array back to the host.

:::::::::::::::::::::::::::::::::::::: solution

A convenient strategy is to time the execution of a single Python function that groups the transfers to and from the GPU and the convolution, as follows:
An effective strategy is to time the execution of a single Python function that groups the transfers to and from the GPU and the convolution, as follows:

~~~python
def push_compute_pull():
Expand Down Expand Up @@ -331,7 +332,7 @@ print(f"{gpu_execution_avg:.6f} s")

You may be surprised that these commands do not throw any error.
Contrary to SciPy, NumPy routines accept CuPy arrays as input, even though the latter exist only in GPU memory.
Indeed, can you recall when we validated our codes using a NumPy and a CuPy array as input of `np.allclose()`?
Indeed, can you recall that we validated our codes using a NumPy and a CuPy array as input of `np.allclose()`?
That worked for the same reason.
[The CuPy documentation](https://docs.cupy.dev/en/stable/user_guide/interoperability.html#numpy) explains why NumPy routines can handle CuPy arrays.

Expand Down Expand Up @@ -568,7 +569,7 @@ The number of sources in the image at the 5σ level is 185.
Fastest CPU CCL time = 2.546e+01 ms.
~~~

Let's not just accept the answer, but also do a sanity check.
Let's not just accept the answer, and let's do a sanity check.
What are the values in the labeled image?

~~~python
Expand Down Expand Up @@ -633,7 +634,7 @@ all_integrated_fluxes = sl_cpu(data, labeled_image,
range(1, number_of_sources_in_image+1))
~~~

Which yields, on my machine:
which yields, on my machine:

~~~output
797 ms ± 9.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Expand All @@ -649,7 +650,7 @@ fastest_source_measurements_CPU = timing_source_measurements_CPU.best
print(f"Fastest CPU set of source measurements = {1000 * fastest_source_measurements_CPU:.3e} ms.")
~~~

Which yields
which yields

~~~output
Fastest CPU set of source measurements = 7.838e+02 ms.
Expand Down

0 comments on commit b0a7d3b

Please sign in to comment.