
Commit

Add numba episode.
isazi committed May 26, 2023
1 parent 0e7e39f commit 6eac9ce
Showing 2 changed files with 54 additions and 69 deletions.
2 changes: 1 addition & 1 deletion config.yaml
@@ -61,7 +61,7 @@ contact: '[email protected]'
episodes:
- introduction.Rmd
- cupy.Rmd
# - numba.Rmd
- numba.Rmd
- gpu_introduction.Rmd
- first_program.Rmd
- global_local_memory.Rmd
121 changes: 53 additions & 68 deletions episodes/numba.Rmd
@@ -1,21 +1,17 @@
---
title: "Accelerate your Python code with Numba"

teaching: 45

exercises: 15
---

questions:
:::::::::::::::::::::::::::::::::::::: questions
- "How can I run my own Python functions on the GPU?"
::::::::::::::::::::::::::::::::::::::

objectives:
:::::::::::::::::::::::::::::::::::::: objectives
- "Learn how to use Numba decorators to improve the performance of your Python code."
- "Run your first application on the GPU."

keypoints:
- "Numba can be used to run your own Python functions on the GPU."
- "Functions may need to be changed to run correctly on a GPU."
---
::::::::::::::::::::::::::::::::::::::

# Using Numba to execute Python code on the GPU

@@ -24,7 +20,7 @@ keypoints:
We want to compute all [prime numbers](https://en.wikipedia.org/wiki/Prime_number) - i.e. numbers that are divisible only by 1 and themselves - between 1 and 10000 on the CPU, and see if we can speed it up by deploying a similar algorithm on a GPU.
You can find this code on many websites. Small variations are possible, but it will look something like this:

~~~
~~~python
def find_all_primes_cpu(upper):
    all_prime_numbers = []
    for num in range(0, upper):
@@ -37,25 +33,22 @@ def find_all_primes_cpu(upper):
            all_prime_numbers.append(num)
    return all_prime_numbers
~~~
{: .language-python}

Calling `find_all_primes_cpu(10_000)` will return all prime numbers between 1 and 10000 as a list. Let us time it:

~~~
~~~python
%timeit -n 10 -r 1 find_all_primes_cpu(10_000)
~~~
{: .language-python}

You will probably find that `find_all_primes_cpu` takes several hundreds of milliseconds to complete:

~~~
~~~output
176 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
~~~
{: .output}
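To sanity-check the results before optimizing anything, a minimal pure-Python trial-division reference can be used. This is a sketch, not the lesson's exact implementation; the name `is_prime` is ours:

~~~python
# Minimal pure-Python reference for trial-division primality testing.
# A sketch for sanity-checking results; not the lesson's exact code.
def is_prime(num):
    if num < 2:
        return False
    for i in range(2, (num // 2) + 1):
        if num % i == 0:
            return False
    return True

primes = [n for n in range(0, 30) if is_prime(n)]
print(primes)  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
~~~

A useful check: there are exactly 1229 primes below 10000, so any variant of `find_all_primes_cpu(10_000)` should return a list of that length.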

As a quick sidestep, add Numba's just-in-time (JIT) compilation decorator to the `find_all_primes_cpu` function. You can apply it either at the function definition or at the call site, so either in this way:

~~~
~~~python
from numba import jit

@jit(nopython=True)
@@ -73,24 +66,21 @@ def find_all_primes_cpu(upper):

%timeit -n 10 -r 1 find_all_primes_cpu(10_000)
~~~
{: .language-python}

or in this way:

~~~
~~~python
from numba import jit

upper = 10_000
%timeit -n 10 -r 1 jit(nopython=True)(find_all_primes_cpu)(upper)
~~~
{: .language-python}

which can give you a timing result similar to this:

~~~
~~~output
69.5 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
~~~
{: .output}

So, more than twice as fast by just adding a decorator. The speedup is much larger for `upper = 100_000`, but that takes a little too much waiting time for this course.
Despite the `jit(nopython=True)` decorator, the computation is still performed on the CPU.
@@ -99,7 +89,7 @@ There are a number of ways to achieve this, one of them is the usage of the `jit`
Let us write our first GPU kernel, which checks whether a number is a prime, using the `cuda.jit` decorator; note that this is different from the `jit` decorator used for CPU computations.
The kernel is essentially the inner loop of `find_all_primes_cpu`:

~~~
~~~python
from numba import cuda

@cuda.jit
@@ -110,11 +100,10 @@ def check_prime_gpu_kernel(num, result):
            result[0] = 0
            break
~~~
{: .language-python}

A number of things are worth noting. CUDA kernels do not return anything, so you have to supply an array that the kernel modifies in place. All arguments have to be arrays; if you work with scalars, make them arrays of length one. This is the case here, because we check whether a single number is a prime. Let us see if this works:

~~~
~~~python
import numpy as np

result = np.zeros((1), np.int32)
@@ -123,63 +112,59 @@ print(result[0])
check_prime_gpu_kernel[1, 1](12, result)
print(result[0])
~~~
{: .language-python}

If we have not made any mistakes, the first call should print `11`, because 11 is a prime number, while the second call should print `0`, because 12 is not a prime:

~~~
~~~output
11
0
~~~
{: .output}

Note the extra arguments in square brackets - `[1, 1]` - that are added to the call of `check_prime_gpu_kernel`: these specify the number of threads we want to run on the GPU.
While this is an important argument, we will explain it later; for now we can keep using `1`.
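If you do not have a GPU at hand, the kernel's behaviour can be mimicked with a plain Python function. This is a sketch that assumes the kernel first sets `result[0] = num` before the divisor loop, which is consistent with the output shown above; the name `check_prime_cpu` is ours:

~~~python
import numpy as np

# Pure-Python mimic of check_prime_gpu_kernel: result[0] keeps the
# number if it is prime and becomes 0 otherwise. A sketch assuming the
# kernel initializes result[0] to num before the divisor loop.
def check_prime_cpu(num, result):
    result[0] = num
    for i in range(2, (num // 2) + 1):
        if (num % i) == 0:
            result[0] = 0
            break

result = np.zeros((1), np.int32)
check_prime_cpu(11, result)
print(result[0])  # 11
check_prime_cpu(12, result)
print(result[0])  # 0
~~~

Such a reference function is handy for comparing GPU results against a known-good CPU answer.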

> ## Challenge: compute prime numbers
>
> Write a new function `find_all_primes_cpu_and_gpu` that uses `check_prime_gpu_kernel` instead of the inner loop of `find_all_primes_cpu`.
> How long does this new function take to find all primes up to 10000?
>
> > ## Solution
> >
> > One possible implementation of this function is the following one.
> >
> > ~~~
> > def find_all_primes_cpu_and_gpu(upper):
> >     all_prime_numbers = []
> >     for num in range(0, upper):
> >         result = np.zeros((1), np.int32)
> >         check_prime_gpu_kernel[1, 1](num, result)
> >         if result[0] > 0:
> >             all_prime_numbers.append(num)
> >     return all_prime_numbers
> >
> > %timeit -n 10 -r 1 find_all_primes_cpu_and_gpu(10_000)
> > ~~~
> > {: .language-python}
> > ~~~
> > 6.21 s ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
> > ~~~
> > {: .output}
> >
> > As you may have noticed, `find_all_primes_cpu_and_gpu` is much slower than the original `find_all_primes_cpu`.
> > The reason is that the overhead of calling the GPU, and transferring data to and from it, for each number in the sequence is too large.
> > To be efficient, the GPU needs enough work to keep all of its cores busy.
> {: .solution}
{: .challenge}
:::::::::::::::::::::::::::::::::::::: challenge
## Challenge: compute prime numbers

Write a new function `find_all_primes_cpu_and_gpu` that uses `check_prime_gpu_kernel` instead of the inner loop of `find_all_primes_cpu`.
How long does this new function take to find all primes up to 10000?

::::::::::::::::::::::::::::::::::::: solution

One possible implementation of this function is the following one.

~~~python
def find_all_primes_cpu_and_gpu(upper):
    all_prime_numbers = []
    for num in range(0, upper):
        result = np.zeros((1), np.int32)
        check_prime_gpu_kernel[1, 1](num, result)
        if result[0] > 0:
            all_prime_numbers.append(num)
    return all_prime_numbers

%timeit -n 10 -r 1 find_all_primes_cpu_and_gpu(10_000)
~~~

~~~output
6.21 s ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
~~~

As you may have noticed, `find_all_primes_cpu_and_gpu` is much slower than the original `find_all_primes_cpu`.
The reason is that the overhead of calling the GPU, and of transferring data to and from it, for each number in the sequence is too large.
To be efficient, the GPU needs enough work to keep all of its cores busy.
:::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::
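The slowdown can be understood with a back-of-the-envelope cost model: every kernel launch pays a fixed overhead, so launching once per number costs thousands of times more than a single batched launch. The two constants below are illustrative assumptions, not measured values:

~~~python
# Back-of-the-envelope cost model for per-number vs. batched GPU calls.
# Both constants are illustrative assumptions, not measured values.
overhead_per_call = 500e-6  # assumed launch + transfer overhead (s)
work_per_number = 10e-9     # assumed GPU compute per number (s)
n = 10_000

per_number_calls = n * (overhead_per_call + work_per_number)  # n launches
one_batched_call = overhead_per_call + n * work_per_number    # 1 launch

print(f"{per_number_calls:.2f} s vs {one_batched_call * 1e3:.2f} ms")
~~~

With these assumed numbers, the overhead term alone dominates the per-number version, which mirrors the 6.21 s vs. milliseconds gap observed in practice.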

Let us give the GPU a workload large enough to compensate for the overhead of data transfers to and from the GPU. For this example of computing primes, we can best use the `vectorize` decorator to create a new `check_prime_gpu` function that takes an array of numbers as input, instead of the single integer `upper`, in order to increase the workload:

~~~python
numbers = np.arange(0, 10_000, dtype=np.int32)
~~~

So the input to the new `check_prime_gpu` function is simply the array of numbers we need to check for primes. `check_prime_gpu` looks similar to `check_prime_gpu_kernel`, but it is not a kernel, so it can return values:

~~~
~~~python
import numba as nb

@nb.vectorize(['int32(int32)'], target='cuda')
@@ -189,24 +174,24 @@ def check_prime_gpu(num):
            return 0
    return num
~~~
{: .language-python}

where we have added the `vectorize` decorator from Numba. The argument of `check_prime_gpu` seems to be defined as a scalar (a single integer in this case), but the `vectorize` decorator will allow us to use an array as input. That array should consist of 32-bit (4-byte) integers, indicated by `(int32)`. The returned array will also consist of 32-bit integers, with zeros for the non-primes; the nonzero values are the primes.
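Once `check_prime_gpu` returns such an array, extracting the primes is a single NumPy filtering step. This is sketched here with a hand-made result array, so no GPU is needed to follow along:

~~~python
import numpy as np

# check_prime_gpu-style output: zeros for non-primes, the number itself
# for primes. Boolean indexing keeps only the nonzero entries.
gpu_style_result = np.array([0, 0, 2, 3, 0, 5, 0, 7], dtype=np.int32)
primes = gpu_style_result[gpu_style_result != 0]
print(primes)  # [2 3 5 7]
~~~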

Let us run it and record the elapsed time:

~~~
~~~python
%timeit -n 10 -r 1 check_prime_gpu(numbers)
~~~
{: .language-python}

which should show you a significant speedup:

~~~
~~~output
5.9 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
~~~
{: .output}

This amounts to a speedup of our code by a factor of 11, compared to the `jit(nopython=True)`-decorated code on the CPU.

:::::::::::::::::::::::::::::::::::::: keypoints
- "Numba can be used to run your own Python functions on the GPU."
- "Functions may need to be changed to run correctly on a GPU."
::::::::::::::::::::::::::::::::::::::
