Python GIL: a case study.

Saswata Chakravarty
4 min read · May 16, 2021

Python is infamous for its GIL, the global interpreter lock. The GIL restricts the Python interpreter to executing only one thread at a time. On modern multi-core CPUs this is a problem, because a program cannot utilize more than one core. Yet despite this limitation, Python has emerged as a top language for everything from web app backends to AI/ML and scientific computing.

The GIL limitation is not a serious constraint for most backend web apps, which are generally I/O bound. Much of the time in these apps is spent waiting for input to arrive from users, databases or downstream services. It is enough for the system to be concurrent, but not necessarily parallel. The Python interpreter releases the GIL when performing an I/O operation, so while one thread waits for I/O to complete, another thread gets a chance to acquire the GIL and execute.

The GIL limitation does not affect most compute-bound AI/ML and scientific computing workloads either, because the cores of popular frameworks like numpy, tensorflow and pytorch are actually implemented in C++ and expose only a Python API. The bulk of the computation can happen without holding the GIL, and the underlying C/C++ kernel libraries these frameworks use, such as OpenBLAS or Intel MKL, can utilize multiple cores without being subject to it.

What happens when we have a mix of I/O and compute tasks?

Compute task using pure Python

Specifically, let us consider these two simple tasks —
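A minimal sketch of the two tasks; the names io_task and count_py match the text, while the loop details are illustrative:

import time

def io_task(n_iters=4):
    # Simulated I/O-bound task: sleep for a second, then report
    # how long we actually slept.
    for _ in range(n_iters):
        start = time.time()
        time.sleep(1)
        print("woke after:", time.time() - start)

def count_py(n):
    # Compute-bound task: count up to n in pure Python.
    i = 0
    while i < n:
        i += 1
    return i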

Here we simulate an I/O-bound task by sleeping for a second, waking up, printing how long it slept, and then sleeping again. count_py is a compute-bound task that simply counts up to a number n. What happens if we run these two tasks together, as sketched below?
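A sketch of the threaded driver, assuming the task definitions above; io_task runs on a separate daemon thread while the compute task runs on the main thread:

import threading
import time

io_thread = threading.Thread(target=io_task, daemon=True)
io_thread.start()

start = time.time()
count_py(1_000_000)  # count from the article's run; timings vary by machine
print("compute time:", time.time() - start)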

We get the following output -

woke after: 1.0063529014587402
woke after: 1.009704828262329
woke after: 1.0069530010223389
woke after: 1.0066332817077637
compute time: 4.311860084533691

It takes about 4.3s for count_py to count to 1 million. But io_task runs at the same time unaffected, waking up after approximately 1 second as expected. Although the compute task takes 4.3 seconds, the Python interpreter pre-emptively releases the GIL from the main thread running the compute task, giving the io_thread a chance to acquire the GIL and run.

Compute task using numpy

Let's now implement the count function in numpy and run the same experiment as before, this time counting to 10 million, since the numpy implementation is more efficient.
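One plausible sketch of the numpy version, assuming the count is expressed as a vectorized operation over a large array; substitute count_np for count_py in the driver above:

import numpy as np

def count_np(n):
    # Vectorized stand-in for counting to n; the heavy lifting happens
    # in numpy's C code, which releases the GIL while it runs.
    return int(np.ones(n, dtype=np.int64).sum())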

We see the following output —

woke after: 1.0001161098480225
woke after: 1.0008511543273926
woke after: 1.0004539489746094
woke after: 1.1320469379425049
compute time: 4.1334803104400635

This shows similar results to the last experiment. In this case though, it is not the Python interpreter pre-emptively releasing the GIL, but numpy itself voluntarily releasing it.

Does this mean that it is always safe to run an I/O task alongside a compute task in separate threads?

Compute task using a custom C++ extension

Let us now implement the count function as a Python C++ extension.
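A minimal sketch of such an extension; the module name count_ext and the loop details are illustrative:

#include <Python.h>

static PyObject* count(PyObject* self, PyObject* args) {
    long long n;
    if (!PyArg_ParseTuple(args, "L", &n))
        return NULL;
    // volatile keeps the compiler from optimizing the loop away
    volatile long long i = 0;
    while (i < n)
        i++;
    // Note: the GIL is held for the entire duration of this loop.
    return PyLong_FromLongLong(i);
}

static PyMethodDef CountMethods[] = {
    {"count", count, METH_VARARGS, "Count up to n."},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef count_module = {
    PyModuleDef_HEAD_INIT, "count_ext", NULL, -1, CountMethods
};

PyMODINIT_FUNC PyInit_count_ext(void) {
    return PyModule_Create(&count_module);
}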

The extension can be built by running python setup.py build with the following setup.py —
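A sketch of the setup.py, matching the illustrative module name above:

from setuptools import setup, Extension

setup(
    name="count_ext",
    version="0.1",
    ext_modules=[Extension("count_ext", sources=["count_ext.cpp"])],
)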

Then we run our experiment using the count function implemented as a custom extension —
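The driver is the same as before, with the extension's count swapped in; the argument here is illustrative, chosen so the loop runs for a few seconds:

import threading
import time

import count_ext  # the extension built above

# io_task as defined in the first sketch
io_thread = threading.Thread(target=io_task, daemon=True)
io_thread.start()

start = time.time()
count_ext.count(2_000_000_000)  # large enough to run for a few seconds
print("compute time:", time.time() - start)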

We get the following -

woke after: 4.414866924285889
compute time: 4.414893865585327

In this case, the compute task is holding the GIL and preventing the I/O thread from running.

The Python interpreter can pre-emptively release the GIL only between two Python bytecode instructions; in the case of extensions, it is up to the extension implementation to voluntarily release the GIL.

In this case, we are doing a trivial computation without touching any Python objects, so we can release the GIL in the C++ count function using the macros Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS —
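A sketch of the modified count function, again with illustrative loop details; the macros save the thread state and release the GIL on entry, then reacquire it on exit:

static PyObject* count(PyObject* self, PyObject* args) {
    long long n;
    if (!PyArg_ParseTuple(args, "L", &n))
        return NULL;
    volatile long long i = 0;
    Py_BEGIN_ALLOW_THREADS
    // No Python objects are touched inside this block,
    // so it is safe to run without holding the GIL.
    while (i < n)
        i++;
    Py_END_ALLOW_THREADS
    return PyLong_FromLongLong(i);
}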

With this implementation, when we re-run the experiment, we get —

woke after: 1.0026037693023682
woke after: 1.003467082977295
woke after: 1.0028629302978516
woke after: 1.1772480010986328
compute time: 4.186192035675049

Conclusion

When working with Python, it is good to be aware of the GIL. In the most common scenarios, you will likely not run into its limitations. But if you are using third-party Python packages that wrap C/C++ libraries (beyond the well-known ones like numpy, scipy, tensorflow or pytorch), you may run into issues, especially when heavy computation is involved. When developing custom extensions, it is always a good idea to release the GIL before doing any heavy computation, so that other Python threads have a chance to run.

To learn more about the behavior of the Python GIL, refer to the excellent slides by David Beazley — https://www.dabeaz.com/python/NewGIL.pdf
