Pure Python sucks at parallel computing, due to the existence of the Global Interpreter Lock (aka the GIL). The GIL prevents threads from accessing or manipulating the interpreter concurrently. The mechanism alleviates the risk of race conditions, but it also makes multi-threaded programs effectively "single-threaded". Sadly, there's no way to release the lock from pure Python.
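To see the effect concretely, here is a small (hypothetical) demonstration: a CPU-bound pure-Python loop gains nothing from running in two threads, because only one thread can execute bytecode at a time.

```python
import threading
import time

def count(n):
    # CPU-bound pure-Python loop; it holds the GIL the whole time
    while n > 0:
        n -= 1

N = 5_000_000

# Run twice serially
start = time.perf_counter()
count(N)
count(N)
serial = time.perf_counter() - start

# Run the same total work in two threads
start = time.perf_counter()
threads = [threading.Thread(target=count, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

# With the GIL, the threaded run is no faster than the serial one
print(f"serial: {serial:.2f}s  threaded: {threaded:.2f}s")
```

On a standard CPython build the two timings come out roughly equal, despite two cores being available.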
OK. So what about not using pure Python? Shall we write an extension to bypass the mechanism? The answer is yes, and that's what most scientific libraries do.
As for writing extensions, Cython is a good choice: less verbose, and syntactically closer to Python. In Cython, one can temporarily release the GIL for a block using the `with nogil:` syntax. Will it make better use of multi-core CPUs? Let's have a try.
We will use a toy example, say, a naive matrix multiplication, for benchmarking. Start with a C-only version:
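The original listing is not reproduced here; below is a minimal sketch of what such a C-only kernel could look like. The name `_matmul` appears later in the post; the typed-memoryview signature and the `nogil` qualifier are assumptions.

```cython
# matmul.pyx -- a sketch, not the original listing
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef void _matmul(double[:, :] a, double[:, :] b, double[:, :] out) nogil:
    # Naive triple-loop matrix multiplication, pure C operations only,
    # so the whole function can be declared nogil.
    cdef Py_ssize_t i, j, k
    cdef double acc
    for i in range(a.shape[0]):
        for j in range(b.shape[1]):
            acc = 0.0
            for k in range(a.shape[1]):
                acc += a[i, k] * b[k, j]
            out[i, j] = acc
```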
The function above is straightforward. We then create a wrapper for it, so that it can be called from Python code:
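A sketch of such a wrapper follows. The `release_gil` switch is an assumption, introduced to match the "GIL released / not released" settings used in the benchmark below.

```cython
# matmul.pyx (continued) -- sketch of the Python-facing wrapper
import numpy as np

def matmul(double[:, :] a, double[:, :] b, bint release_gil=False):
    out = np.zeros((a.shape[0], b.shape[1]))
    cdef double[:, :] out_view = out
    if release_gil:
        # Other threads may run Python code while we compute
        with nogil:
            _matmul(a, b, out_view)
    else:
        _matmul(a, b, out_view)
    return out
```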
Now the Cython part is ready. We then create a script for benchmarking:
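The benchmark script is not shown in full; here is a sketch of how it might look. It times `matmul` under each of the four settings (1 or 2 threads, GIL held or released); the exact structure of the original script is an assumption.

```python
# bench.py -- sketch of the benchmark script
import threading
import time

import numpy as np
import pyximport

pyximport.install(language_level=3)
from matmul import matmul  # the compiled Cython module above

N = 1200
a = np.random.rand(N, N)
b = np.random.rand(N, N)

def bench(nthreads, release_gil):
    # Each thread performs one full multiplication
    threads = [
        threading.Thread(target=matmul, args=(a, b, release_gil))
        for _ in range(nthreads)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

for nthreads in (1, 2):
    for release_gil in (False, True):
        print(f"nthreads={nthreads} GIL released={release_gil}: "
              f"{bench(nthreads, release_gil):.2f}s")
```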
The two input matrices each have a rather large size of 1200 x 1200, and `matmul` is tested under four settings. The results are listed below:
The first two rows show that, with a single thread, `matmul` has comparable performance whether or not the GIL is released. This is desired, since releasing the GIL should not degrade performance in the single-threaded scenario. But things change when it comes to multi-threading. Without releasing the GIL, the time doubles when two computing threads run in parallel, whilst with the GIL released, the running time remains unchanged.
We may step further and investigate the behavior of `prange`, a facility provided by Cython for more convenient parallel computing, with the famous OpenMP as its backend. We can write a parallel version of `_matmul` with only minor modifications:
cdef void _matmul_p(
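The listing breaks off after the signature above; below is a hedged sketch of what the full kernel might look like, with the outer loop handed to `prange`. Argument names, types, and the `schedule` choice are assumptions.

```cython
# matmul_p.pyx -- sketch of the prange-based kernel
from cython.parallel cimport prange
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef void _matmul_p(double[:, :] a, double[:, :] b, double[:, :] out) nogil:
    cdef Py_ssize_t i, j, k
    cdef double acc
    # prange distributes iterations of the outer loop across OpenMP
    # threads; variables assigned inside the loop body (acc, j, k)
    # become thread-private automatically.
    for i in prange(a.shape[0], schedule='static'):
        for j in range(b.shape[1]):
            acc = 0.0
            for k in range(a.shape[1]):
                acc = acc + a[i, k] * b[k, j]
            out[i, j] = acc
```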
and the wrapper:
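A sketch of the wrapper, mirroring the earlier one; the `release_gil` switch is again an assumption, kept so the benchmark can toggle it.

```cython
# matmul_p.pyx (continued) -- sketch of the wrapper
import numpy as np

def matmul_p(double[:, :] a, double[:, :] b, bint release_gil=True):
    out = np.zeros((a.shape[0], b.shape[1]))
    cdef double[:, :] out_view = out
    if release_gil:
        with nogil:
            _matmul_p(a, b, out_view)
    else:
        # prange releases the GIL by itself, so this branch
        # behaves the same in practice
        _matmul_p(a, b, out_view)
    return out
```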
and also, the benchmark script:
for kw in make_grid(
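The script is truncated above; here is a sketch of how it might continue. The implementation of `make_grid` is not shown in the post, so the version below, which yields every combination of the given parameter values as keyword dicts, is an assumption.

```python
# bench_p.py -- sketch of the second benchmark script
import itertools
import threading
import time

import numpy as np
import pyximport

pyximport.install(language_level=3)
from matmul_p import matmul_p

def make_grid(**params):
    # Assumed helper: yield the Cartesian product of parameter values
    keys = list(params)
    for values in itertools.product(*params.values()):
        yield dict(zip(keys, values))

N = 1200
a = np.random.rand(N, N)
b = np.random.rand(N, N)

def bench(nthreads, release_gil):
    threads = [
        threading.Thread(target=matmul_p, args=(a, b, release_gil))
        for _ in range(nthreads)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

for kw in make_grid(nthreads=[1, 2], release_gil=[False, True]):
    print(kw, f"{bench(**kw):.2f}s")
```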
OpenMP requires extra compilation flags, so a `.pyxbld` file is needed:
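The original file is not shown; below is a minimal sketch following the pyximport convention, where a `modname.pyxbld` file next to the `.pyx` defines `make_ext`. The flags shown are for GCC/Clang.

```python
# matmul_p.pyxbld -- sketch of the pyximport build config
def make_ext(modname, pyxfilename):
    from setuptools.extension import Extension
    # Pass OpenMP flags to both the compiler and the linker
    return Extension(
        modname,
        sources=[pyxfilename],
        extra_compile_args=['-fopenmp'],
        extra_link_args=['-fopenmp'],
    )
```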
| nthreads | GIL | time w/o par. (s) | time w/ par. (s) |
| --- | --- | --- | --- |
`prange` comes with an amazing boost in performance! `_matmul_p` is 3~4x faster in the single-threaded settings. The number may vary on your own computer, depending on how many CPU cores you have. In the two-thread settings, the running time doubles, which means `prange` does efficiently eat up all CPU resources.
We can also notice that the GIL switch seemingly does not affect `prange`. The reason is that `prange` requires the GIL to be released, and it actually releases it by default:
> Cython supports native parallelism through the cython.parallel module. To use this kind of parallelism, the GIL must be released (see Releasing the GIL). It currently supports OpenMP, but later on more backends might be supported. – Using Parallelism
- If there’s no need to hold the GIL, just release it. This happens when you are manipulating some C data structures and not attempting to interact with the interpreter.
- If there’s massive looping in your Cython code, feel free to accelerate it with `prange`, which will efficiently schedule the computation across all CPU cores.
- If there are some macro tasks which are not easy to parallelize in Cython, schedule them with `threading`. The module sucks most of the time, but as long as the tasks do not always hold the GIL, it should work well just like threads in other languages.
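As an illustration of that last point (not from the original post): some CPU-bound calls in the standard library, such as `zlib.compress`, release the GIL internally while they work, so plain `threading` can genuinely run them in parallel.

```python
import threading
import zlib

# zlib.compress releases the GIL during compression, so these
# threads can run on multiple cores despite being CPU-bound.
data = b"some highly repetitive payload " * 50_000
results = {}

def compress(name, payload):
    results[name] = zlib.compress(payload, 9)

threads = [
    threading.Thread(target=compress, args=(f"job{i}", data))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every compressed blob round-trips back to the original data
assert all(zlib.decompress(c) == data for c in results.values())
```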