Threads and processes
Last updated on 2023-01-06 | Edit this page
Estimated time: 90 minutes
Overview
Questions
- What is the Global Interpreter Lock (GIL)?
- How do I use multiple threads in Python?
Objectives
- Understand the GIL.
- Understand the difference between the python
threadingandmultiprocessinglibrary
Threading
Another possibility for parallelization is to use the
threading module. This module is built into Python. In this
section, we’ll use it to estimate pi once again.
Using threading to speed up your code:
PYTHON
%%time
n = 10**7
t1 = Thread(target=calc_pi, args=(n,))
t2 = Thread(target=calc_pi, args=(n,))
t1.start()
t2.start()
t1.join()
t2.join()
Discussion: where’s the speed-up?
While mileage may vary, parallelizing calc_pi,
calc_pi_numpy and calc_pi_numba this way will
not give the expected speed-up. calc_pi_numba should give
some speed-up, but nowhere near the ideal scaling over the
number of cores. This is because Python only allows one thread to access
the interperter at any given time, a feature also known as the Global
Interpreter Lock.
A few words about the Global Interpreter Lock
The Global Interpreter Lock (GIL) is an infamous feature of the Python interpreter. It both guarantees inner thread sanity, making programming in Python safer, and prevents us from using multiple cores from a single Python instance. When we want to perform parallel computations, this becomes an obvious problem. There are roughly two classes of solutions to circumvent/lift the GIL:
- Run multiple Python instances:
multiprocessing - Have important code outside Python: OS operations, C++ extensions, cython, numba
The downside of running multiple Python instances is that we need to
share program state between different processes. To this end, you need
to serialize objects. Serialization entails converting a Python object
into a stream of bytes, that can then be sent to the other process, or
e.g. stored to disk. This is typically done using pickle,
json, or similar, and creates a large overhead. The
alternative is to bring parts of our code outside Python. Numpy has many
routines that are largely situated outside of the GIL. The only way to
know for sure is trying out and profiling your application.
To write your own routines that do not live under the GIL there are
several options: fortunately numba makes this very
easy.
We can force the GIL off in Numba code by setting
nogil=True in the numba.jit decorator.
PYTHON
@numba.jit(nopython=True, nogil=True)
def calc_pi_nogil(N):
M = 0
for i in range(N):
x = random.uniform(-1, 1)
y = random.uniform(-1, 1)
if x**2 + y**2 < 1:
M += 1
return 4 * M / N
The nopython argument forces Numba to compile the code
without referencing any Python objects, while the nogil
argument enables lifting the GIL during the execution of the
function.
Use nopython=True or
@numba.njit
It’s generally a good idea to use nopython=True with
@numba.jit to make sure the entire function is running
without referencing Python objects, because that will dramatically slow
down most Numba code. There’s even a decorator that has
nopython=True by default: @numba.njit
Now we can run the benchmark again, using calc_pi_nogil
instead of calc_pi.
Exercise: try threading on a Numpy function
Many Numpy functions unlock the GIL. Try to sort two randomly
generated arrays using numpy.sort in parallel.
Multiprocessing
Python also allows for using multiple processes for parallelisation
via the multiprocessing module. It implements an API that
is superficially similar to threading:
PYTHON
from multiprocessing import Process
def calc_pi(N):
...
if __name__ == '__main__':
n = 10**7
p1 = Process(target=calc_pi, args=(n,))
p2 = Process(target=calc_pi, args=(n,))
p1.start()
p2.start()
p1.join()
p2.join()
However under the hood processes are very different from threads. A new process is created by creating a fresh “copy” of the python interpreter, that includes all the resources associated to the parent. There are three different ways of doing this (spawn, fork, and forkserver), which depends on the platform. We will use spawn as it is available on all platforms, you can read more about the others in the Python documentation. As creating a process is resource intensive, multiprocessing is beneficial under limited circumstances - namely, when the resource utilisation (or runtime) of a function is measureably larger than the overhead of creating a new process.
The non-intrusive and safe way of starting a new process is acquire a
context, and working within the context. This ensures your
application does not interfere with any other processes that might be in
use.
PYTHON
import multiprocessing as mp
def calc_pi(N):
...
if __name__ == '__main__':
# mp.set_start_method("spawn") # if not using a context
ctx = mp.get_context("spawn")
...
Passing objects and sharing state
We can pass objects between processes by using Queues
and Pipes. Multiprocessing queues behave similarly to
regular queues: - FIFO: first in, first out -
queue_instance.put(<obj>) to add -
queue_instance.get() to retrieve
Exercise: reimplement calc_pi to
use a queue to return the result
PYTHON
import multiprocessing as mp
import random
def calc_pi(N, que):
M = 0
for i in range(N):
# Simulate impact coordinates
x = random.uniform(-1, 1)
y = random.uniform(-1, 1)
# True if impact happens inside the circle
if x**2 + y**2 < 1.0:
M += 1
que.put((4 * M / N, N)) # result, iterations
if __name__ == "__main__":
ctx = mp.get_context("spawn")
que = ctx.Queue()
n = 10**7
p1 = ctx.Process(target=calc_pi, args=(n, que))
p2 = ctx.Process(target=calc_pi, args=(n, que))
p1.start()
p2.start()
for i in range(2):
print(que.get())
p1.join()
p2.join()
Process pool
The Pool API provides a pool of worker processes that
can execute tasks. Methods of the pool object offer various convenient
ways to implement data parallelism in your program. The most convenient
way to create a pool object is with a context manager, either using the
toplevel function multiprocessing.Pool, or by calling the
.Pool() method on the context. With the pool object, tasks
can be submitted by calling methods like .apply(),
.map(), .starmap(), or their
.*_async() versions.
Exercise: adapt the original exercise to submit tasks to a pool
- Use the original
calc_pifunction (without the queue) - Submit batches of different sample size (different values of
N). - As mentioned earlier, creating a new process has overhead. Try a wide range of sample sizes and check if runtime scaling supports that claim.
PYTHON
from itertools import repeat
import multiprocessing as mp
import random
from timeit import timeit
def calc_pi(N):
M = 0
for i in range(N):
# Simulate impact coordinates
x = random.uniform(-1, 1)
y = random.uniform(-1, 1)
# True if impact happens inside the circle
if x**2 + y**2 < 1.0:
M += 1
return (4 * M / N, N) # result, iterations
def submit(ctx, N):
with ctx.Pool() as pool:
pool.starmap(calc_pi, repeat((N,), 4))
if __name__ == "__main__":
ctx = mp.get_context("spawn")
for i in (1_000, 100_000, 10_000_000):
res = timeit(lambda: submit(ctx, i), number=5)
print(i, res)
Key Points
- If we want the most efficient parallelism on a single machine, we need to circumvent the GIL.
- If your code releases the GIL, threading will be more efficient than multiprocessing.
- If your code does not release the GIL, some of your code is still in Python, and you’re wasting precious compute time!