Pure Python sucks at parallel computing, due to the existence of the Global Interpreter Lock (aka GIL). The GIL prevents code from different threads from accessing or manipulating the interpreter at the same time. This mechanism alleviates the risk of race conditions, but it also makes multi-threaded programs effectively "single-threaded". Sadly, there's no way to release the lock from pure Python.
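
Before diving into Cython, a quick pure-Python sketch (the numbers are arbitrary) illustrates the problem: running a CPU-bound function in two threads takes about as long as running it twice in a row, since only one thread can hold the GIL at any moment.

import threading
import time

def busy(n: int = 10_000_000) -> int:
    # CPU-bound loop that holds the GIL the whole time
    total = 0
    for i in range(n):
        total += i
    return total

start = time.perf_counter()
threads = [threading.Thread(target=busy) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Roughly 2x the duration of a single busy() call, despite using two threads.
print(f"2 threads: {time.perf_counter() - start:.2f} s")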

OK. So what about not using pure Python? Shall we write an extension to bypass the mechanism? The answer is yes, and that's what most scientific libraries do.

When it comes to writing extensions, Cython is a good choice: it is less verbose and syntactically much closer to Python than plain C. In Cython, one can temporarily release the GIL for a block of code using the with nogil: syntax. Will it make better use of multi-core CPUs? Let's give it a try.

We will use a toy example, a naive matrix multiplication, for benchmarking. Let's start with a C-only version:

# cython: boundscheck=False
# cython: wraparound=False
# cython: nonecheck=False
# cython: cdivision=True
# cython: language_level=3

import numpy as np
cimport numpy as np

cdef void _matmul(
    np.float_t[:, :] Av,
    np.float_t[:, :] Bv,
    np.float_t[:, :] Cv,
    int M, int N, int P,
) nogil:
    cdef int i, j, k

    for i in range(M):
        for j in range(P):
            for k in range(N):
                Cv[i, j] += Av[i, k] * Bv[k, j]

The function above is straightforward. We then create a wrapper for it so that it can be called from Python code:

cpdef matmul(
    np.ndarray[dtype=np.float_t, ndim=2] A,
    np.ndarray[dtype=np.float_t, ndim=2] B,
    object use_gil,
):
    cdef int M = A.shape[0]
    cdef int N = A.shape[1]
    cdef int P = B.shape[1]

    C = np.zeros((M, P))
    cdef np.float_t[:, :] Av = A, Bv = B, Cv = C

    if use_gil:
        _matmul(Av, Bv, Cv, M, N, P)
    else:
        with nogil:
            _matmul(Av, Bv, Cv, M, N, P)
    return C
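
Once compiled, the wrapper can be sanity-checked against NumPy. A minimal sketch, assuming the module is built as matmul:

import numpy as np
import matmul

A = np.random.rand(4, 5)
B = np.random.rand(5, 3)
C = matmul.matmul(A, B, use_gil=False)
assert np.allclose(C, A @ B)  # should agree with NumPy's matmul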

Now the Cython part is ready. We then create a script for benchmarking:

import timeit
import threading
import itertools

import pyximport
import numpy as np

pyximport.install(setup_args={"include_dirs": np.get_include()}, inplace=True, language_level=3)
import matmul

N = 1200
A = np.random.rand(N, N)
B = np.random.rand(N, N)

def runner(nthreads: int, use_gil: bool) -> None:
    args = (A, B, use_gil)

    threads = []
    for _ in range(nthreads):
        threads.append(threading.Thread(target=matmul.matmul, args=args))
        threads[-1].start()

    for thread in threads:
        thread.join()

def make_grid(**kwargs):
    space = [[(name, v) for v in lst] for name, lst in kwargs.items()]
    return map(dict, itertools.product(*space))

for kw in make_grid(
    nthreads=[1, 2],
    use_gil=[True, False],
):
    print(kw)
    print(timeit.timeit("runner(**kw)", globals=dict(runner=runner, kw=kw), number=1))

The two input matrices each have a rather large size of 1200 x 1200, and matmul is tested under four settings. The results are listed below:

nthreads    GIL    time (s)
1           N      3.47
1           Y      3.51
2           N      3.78
2           Y      6.96

The first two rows show that, with a single thread, matmul performs comparably whether or not the GIL is released. This is expected, since the GIL should not degrade performance in a single-threaded scenario. But things change with multi-threading: if the GIL is not released, the wall-clock time doubles when two computing threads run "in parallel", whereas with the GIL released, the time remains essentially unchanged, meaning the two threads really do run concurrently.

We can go a step further and investigate the behavior of prange. prange is a facility provided by Cython for more convenient parallel computing, with the well-known OpenMP as its backend. We can write a prange version of _matmul with only minor modifications (note that prange has to be cimported from cython.parallel):

from cython.parallel cimport prange

cdef void _matmul_p(
    np.float_t[:, :] Av,
    np.float_t[:, :] Bv,
    np.float_t[:, :] Cv,
    int M, int N, int P,
) nogil:
    cdef int i, j, k, ij

    cdef int MP = M * P
    for ij in prange(MP, schedule='guided'):
        i = ij // P
        j = ij % P
        for k in range(N):
            Cv[i, j] += Av[i, k] * Bv[k, j]

The two outer loops are flattened into a single index ij (with i = ij // P and j = ij % P) so that prange can distribute the whole i-j iteration space across threads. The wrapper matmul gains a parallel switch accordingly:

cpdef matmul(
    np.ndarray[dtype=np.float_t, ndim=2] A,
    np.ndarray[dtype=np.float_t, ndim=2] B,
    object use_gil,
    object parallel,
):
    cdef int M = A.shape[0]
    cdef int N = A.shape[1]
    cdef int P = B.shape[1]

    C = np.zeros((M, P))
    cdef np.float_t[:, :] Av = A, Bv = B, Cv = C

    if use_gil:
        if parallel:
            _matmul_p(Av, Bv, Cv, M, N, P)
        else:
            _matmul(Av, Bv, Cv, M, N, P)
    else:
        if parallel:
            _matmul_p(Av, Bv, Cv, M, N, P)
        else:
            with nogil:
                _matmul(Av, Bv, Cv, M, N, P)
    return C

and the benchmark script as well:

def runner(nthreads: int, use_gil: bool, parallel: bool) -> None:
    args = (A, B, use_gil, parallel)
    if nthreads == 0:
        matmul.matmul(*args)
        return

    threads = []
    for _ in range(nthreads):
        threads.append(threading.Thread(target=matmul.matmul, args=args))
        threads[-1].start()

    for thread in threads:
        thread.join()

for kw in make_grid(
    nthreads=[0, 1, 2],
    use_gil=[True, False],
    parallel=[True, False],
):
    print(kw)
    print(timeit.timeit("runner(**kw)", globals=dict(runner=runner, kw=kw), number=1))

OpenMP requires extra compilation flags, so a .pyxbld file is needed to tell pyximport how to build the module:

# matmul.pyxbld
from setuptools import Extension
from Cython.Build import cythonize


def make_ext(modname, pyxfilename):
    ext = Extension(
        modname,
        [pyxfilename],
        extra_compile_args=['-fopenmp'],
        extra_link_args=['-fopenmp'],
    )
    return cythonize(
        [ext],
        language_level=3,
        annotate=True,
    )[0]
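
If you prefer a regular build over pyximport, a roughly equivalent setup.py sketch (file names assumed) might look like this; build it with python setup.py build_ext --inplace:

# setup.py -- rough equivalent of the .pyxbld above
import numpy as np
from setuptools import Extension, setup
from Cython.Build import cythonize

ext = Extension(
    'matmul',
    ['matmul.pyx'],
    include_dirs=[np.get_include()],   # needed for 'cimport numpy'
    extra_compile_args=['-fopenmp'],
    extra_link_args=['-fopenmp'],
)
setup(ext_modules=cythonize([ext], language_level=3))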

The new results are listed below:

nthreads    GIL    time w/o par. (s)    time w/ par. (s)
1           N      3.47                 0.82
1           Y      3.51                 0.89
2           N      3.78                 1.79
2           Y      6.96                 1.81

prange brings an impressive performance boost: _matmul_p is roughly 3-4x faster in the single-threaded settings. The exact number may vary on your machine, depending on how many CPU cores you have. With two threads, the running time doubles, which means a single prange call already eats up all available CPU resources, so two concurrent calls simply compete for them.

We can also notice that the use_gil switch seemingly does not affect prange. The reason is that prange requires the GIL to be released, and it does release it by default:

Cython supports native parallelism through the cython.parallel module. To use this kind of parallelism, the GIL must be released (see Releasing the GIL). It currently supports OpenMP, but later on more backends might be supported. – Using Parallelism
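
For completeness, here is a minimal sketch (a hypothetical function, not part of the benchmark) of prange releasing the GIL explicitly via nogil=True when it is used outside a nogil function:

from cython.parallel cimport prange

cpdef double sum_of_squares(double[:] x):
    # prange(..., nogil=True) releases the GIL for the duration of the loop;
    # 'total' is treated as a reduction variable because of the in-place +=.
    cdef Py_ssize_t i
    cdef double total = 0.0
    for i in prange(x.shape[0], nogil=True):
        total += x[i] * x[i]
    return total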

Conclusion

  • If there’s no need to acquire GIL, just release it. This happens when you are manipulating some C data structures, and not attempting to interfere with the interpreter.
  • If there’s massive looping in your Cython code, feel free to accelerate with prange. prange will effeciently schedule the computation on all CPU resources.
  • If there’s some macro tasks which are not easy to parallelize in Cython, schedule them via threading module. threading sucks for most of the time, but if the tasks not always acquiring GIL, it should work well just like threads in other languages.