Pure Python sucks at parallel computing, due to the existence of the Global Interpreter Lock (aka GIL). The GIL prevents different threads from accessing or manipulating the interpreter concurrently. The mechanism alleviates the risk of race conditions, but it serializes any multi-threaded program 1 as well. Sadly, there's no way to release the lock from pure Python.
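
To see the effect concretely, here is a minimal pure-Python sketch (separate from the benchmark below): two CPU-bound threads take roughly as long as running the same work twice sequentially, because only one thread can execute bytecode at a time:

import threading
import time

def burn() -> None:
    # A CPU-bound loop that never releases the GIL.
    s = 0
    for i in range(10_000_000):
        s += i

start = time.perf_counter()
threads = [threading.Thread(target=burn) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Expect roughly 2x the single-thread time, despite using two threads.
print(f"2 threads: {time.perf_counter() - start:.2f}s")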

Alright. So what about going beyond pure Python? Can we bypass the mechanism within an extension? The answer is yes, and that's what most scientific computing libraries do.

Cython is a good choice for writing extensions: it is less verbose than raw C and syntactically much closer to Python. In Cython, one can temporarily release the GIL for a code block using the with nogil: syntax. Will it unleash the true power of multi-core CPUs? Let's give it a try.
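
As a quick illustration of the syntax (a minimal sketch, separate from the benchmark below; the names are made up), a Python-callable function can hand a long C-level computation over to a nogil block:

cdef long _busy_sum(long n) nogil:
    # Pure C arithmetic: no Python objects are touched, so no GIL is needed.
    cdef long i, acc = 0
    for i in range(n):
        acc += i
    return acc

def busy_sum(long n):
    cdef long result
    with nogil:
        result = _busy_sum(n)
    return result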

We adopt a toy example for benchmarking: a naive matrix multiplication. Let's start with a C-only version:

# cython: boundscheck=False
# cython: wraparound=False
# cython: nonecheck=False
# cython: cdivision=True
# cython: language_level=3

import numpy as np
cimport numpy as np

cdef void _matmul(
    np.float_t[:, :] Av,
    np.float_t[:, :] Bv,
    np.float_t[:, :] Cv,
    int M, int N, int P,
) nogil:
    cdef int i, j, k

    for i in range(M):
        for j in range(P):
            for k in range(N):
                Cv[i, j] += Av[i, k] * Bv[k, j]

The function above is straightforward. We then create a wrapper for it, so that it can be called from Python code:

cpdef matmul(
    np.ndarray[dtype=np.float_t, ndim=2] A,
    np.ndarray[dtype=np.float_t, ndim=2] B,
    object use_gil,
):
    cdef int M = A.shape[0]
    cdef int N = A.shape[1]
    cdef int P = B.shape[1]

    C = np.zeros((M, P))
    cdef np.float_t[:, :] Av = A, Bv = B, Cv = C

    if use_gil:
        _matmul(Av, Bv, Cv, M, N, P)
    else:
        with nogil:
            _matmul(Av, Bv, Cv, M, N, P)
    return C

Now the Cython part is ready. Below is a script for benchmarking:

import timeit
import threading
import itertools

import pyximport
import numpy as np

pyximport.install(setup_args={"include_dirs": np.get_include()}, inplace=True, language_level=3)
import matmul

N = 1200
A = np.random.rand(N, N)
B = np.random.rand(N, N)

def runner(nthreads: int, use_gil: bool) -> None:
    args = (A, B, use_gil)

    threads = []
    for _ in range(nthreads):
        threads.append(threading.Thread(target=matmul.matmul, args=args))
        threads[-1].start()

    for thread in threads:
        thread.join()

def make_grid(**kwargs):
    space = [[(name, v) for v in lst] for name, lst in kwargs.items()]
    return map(dict, itertools.product(*space))

for kw in make_grid(
    nthreads=[1, 2],
    use_gil=[True, False],
):
    print(kw)
    print(timeit.timeit("runner(**kw)", globals=dict(runner=runner, kw=kw), number=1))

Two matrices with a rather large size of 1200 x 1200 are supplied as input, and we test matmul against four settings. The results are listed below:

nthreads  GIL  time (s)
1         N    3.47
1         Y    3.51
2         N    3.78
2         Y    6.96

The first two rows show that, with a single thread, matmul has comparable performance whether the GIL is released or not. This is the desired behavior, since releasing the GIL should not cause any performance degradation in a single-threaded scenario. But things change when it comes to multi-threading. With two computing threads, the total time doubles if the GIL is held, whilst with the GIL released it stays almost the same as with one thread, i.e. the two threads really do run in parallel.

We may step further and investigate the behavior of prange. prange is provided by Cython for more convenient parallel computing, using the well-known OpenMP as its backend. Writing a prange version of _matmul takes only minor modification:

from cython.parallel import prange

cdef void _matmul_p(
    np.float_t[:, :] Av,
    np.float_t[:, :] Bv,
    np.float_t[:, :] Cv,
    int M, int N, int P,
) nogil:
    cdef int i, j, k, ij

    cdef int MP = M * P
    # Flatten the (i, j) loops into one index so prange can distribute
    # M * P independent units of work across threads.
    for ij in prange(MP, schedule='guided'):
        i = ij // P
        j = ij % P
        for k in range(N):
            Cv[i, j] += Av[i, k] * Bv[k, j]
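
An alternative (a sketch only, not benchmarked in this post; the function name is mine) is to parallelize just the outer row loop instead of flattening the indices:

cdef void _matmul_p_rows(
    np.float_t[:, :] Av,
    np.float_t[:, :] Bv,
    np.float_t[:, :] Cv,
    int M, int N, int P,
) nogil:
    cdef int i, j, k

    # Each OpenMP thread owns whole rows of C, so no two threads
    # ever write to the same element.
    for i in prange(M, schedule='guided'):
        for j in range(P):
            for k in range(N):
                Cv[i, j] += Av[i, k] * Bv[k, j]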

The wrapper matmul also needs a small extension:

cpdef matmul(
    np.ndarray[dtype=np.float_t, ndim=2] A,
    np.ndarray[dtype=np.float_t, ndim=2] B,
    object use_gil,
    # hl: begin
    object parallel,
    # hl: end
):
    cdef int M = A.shape[0]
    cdef int N = A.shape[1]
    cdef int P = B.shape[1]

    C = np.zeros((M, P))
    cdef np.float_t[:, :] Av = A, Bv = B, Cv = C

    if use_gil:
        # hl: begin
        if parallel:
            _matmul_p(Av, Bv, Cv, M, N, P)
        else:
            _matmul(Av, Bv, Cv, M, N, P)
        # hl: end
    else:
        # hl: begin
        if parallel:
            _matmul_p(Av, Bv, Cv, M, N, P)
        else:
            with nogil:
                _matmul(Av, Bv, Cv, M, N, P)
        # hl: end
    return C

and also the benchmark script:

# hl: begin
def runner(nthreads: int, use_gil: bool, parallel: bool) -> None:
    args = (A, B, use_gil, parallel)
    # hl: end
    if nthreads == 0:
        matmul.matmul(*args)
        return

    threads = []
    for _ in range(nthreads):
        threads.append(threading.Thread(target=matmul.matmul, args=args))
        threads[-1].start()

    for thread in threads:
        thread.join()

for kw in make_grid(
    nthreads=[0, 1, 2],
    use_gil=[True, False],
    # hl: begin
    parallel=[True, False],
    # hl: end
):
    print(kw)
    print(timeit.timeit("runner(**kw)", globals=dict(runner=runner, kw=kw), number=1))

OpenMP requires extra compilation flags, so a .pyxbld file is needed to tell pyximport how to build the extension:

# matmul.pyxbld
from setuptools import Extension
from Cython.Build import cythonize


def make_ext(modname, pyxfilename):
    ext = Extension(
        modname,
        [pyxfilename],
        extra_compile_args=['-fopenmp'],
        extra_link_args=['-fopenmp'],
    )
    return cythonize(
        [ext],
        language_level=3,
        annotate=True,
    )[0]
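
The same flags apply if you build the extension with a regular setup.py instead of pyximport; a minimal sketch (the file and module names here are assumptions matching the example above):

# setup.py
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    'matmul',
    ['matmul.pyx'],
    include_dirs=[np.get_include()],
    extra_compile_args=['-fopenmp'],
    extra_link_args=['-fopenmp'],
)

setup(ext_modules=cythonize([ext], language_level=3))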

The results now read:

nthreads  GIL  time w/o par. (s)  time w/ par. (s)
1         N    3.47               0.82
1         Y    3.51               0.89
2         N    3.78               1.79
2         Y    6.96               1.81

We can see that prange brings an amazing boost in performance! _matmul_p is 3~4x faster in the single-threaded setting. The exact number may vary across hardware, depending on the number of CPU cores. With two threads, the running time doubles, which indicates that a single prange call already saturates all available CPU cores, so two concurrent calls simply compete for them.

We can also notice that whether the GIL is released or not seemingly does not affect prange 2. The reason is that prange requires the GIL to be released, and it releases the GIL automatically by default:

Cython supports native parallelism through the cython.parallel module. To use this kind of parallelism, the GIL must be released (see Releasing the GIL). It currently supports OpenMP, but later on more backends might be supported. – Using Parallelism
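
As a side note, the degree of parallelism is configurable: prange accepts a num_threads argument, and OpenMP also honours the standard OMP_NUM_THREADS environment variable. A sketch of the change inside _matmul_p (the value 4 is an arbitrary example):

    # Inside _matmul_p: cap prange at 4 OpenMP threads instead of all cores.
    for ij in prange(MP, schedule='guided', num_threads=4):
        i = ij // P
        j = ij % P
        for k in range(N):
            Cv[i, j] += Av[i, k] * Bv[k, j]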

Conclusion

If there’s no need to hold GIL, just release it. This happens when you are manipulating some C data structures, and not attempting to disturb the interpreter.

If there’s massive looping in your Cython code, feel free to accelerate it with prange. prange will effeciently schedule the computation onto all CPU resources.

If there’s some macro tasks 3 which could not be easily parallelized in Cython, schedule them via threading module. threading sucks for most of the time, but if the tasks not always acquiring GIL, it should be fine just like threads in other languages.

  1. so that it behaves just like a single-threaded version
  2. 0.82s vs 0.89s
  3. routines consisting of large pieces of logic

Author: hsfzxjy.
Link: .
License: CC BY-NC-ND 4.0.
All rights reserved by the author.
Commercial use of this post in any form is NOT permitted.
Non-commercial use of this post should be attributed with this block of text.