Multithreading¶
In Python, there are two built-in libraries for doing tasks in parallel: threading and multiprocessing. Threading, however, does not make CPU-bound work faster: because of Python's Global Interpreter Lock, only one thread executes Python code at a time, so if you give it 4 tasks, each task will effectively get 25% of a single core. Multiprocessing copies most of threading's functionality, but allows the user to use the full capacity of all the cores a machine has. This article intends to give you a basic idea of how multiprocessing works. For a full overview, see the documentation pages.
Simple example¶
Let's say we have a function that generates a file with one million randomly generated words.
import random

def generate_pseudowordfile(filename):
    # write one million randomly generated pseudowords to a file, one per line
    with open(filename, 'a') as f:
        for i in range(1000000):
            word = random.choice(['b','d','v','l']) + random.choice(['a','o','i']) + random.choice(['p','k','l'])
            f.write(word + '\n')
If we want to run this function 32 times, we could of course do this:
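# sequential version: do the 32 tasks one after the other
for i in range(32):
    generate_pseudowordfile('output'+str(i))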
This works, but will be relatively slow, because we are doing the 32 tasks sequentially. A solution to do them in parallel, and thus make use of all the cores, could be to put our function in a .py script and run that script 32 times in the background with an ampersand (&). However, if your function is part of a larger script where a lot of other things are happening, that is not always easy, let alone elegant. This is where the multiprocessing module comes in. The parallel version of the code above:
import multiprocessing

for i in range(32):
    p = multiprocessing.Process(target=generate_pseudowordfile, args=['output'+str(i)])
    p.start()
Because this uses 32 cores at the same time, this code can be up to 32 times faster (provided 32 cores are available).
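Note that start() only launches a process and returns immediately; the main script simply continues. If you want to wait until all workers have finished before moving on, you can keep the Process objects in a list and call join() on each of them. A minimal sketch:

processes = []
for i in range(32):
    p = multiprocessing.Process(target=generate_pseudowordfile, args=['output'+str(i)])
    p.start()
    processes.append(p)

for p in processes:
    p.join()  # blocks until this worker has finished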
Being nice¶
To make sure your processes do not slow down the processes of others, you can adjust their nice value. A high value means the process has a low priority, which means your process will not disturb others. In Linux, you can change the niceness by adding nice before your command, but you can also do it from your Python script with this code:
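import os

os.nice(19)  # increase the niceness of the current process by 19 (the lowest priority)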
To adjust the nice value of individual processes, simply add this code to the target function of your process.
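For example, a niced variant of the file-writing function from earlier could look like this (generate_pseudowordfile_nice is just an illustrative name):

import os
import random

def generate_pseudowordfile_nice(filename):
    os.nice(19)  # lower the priority of this worker process
    with open(filename, 'a') as f:
        for i in range(1000000):
            word = random.choice(['b','d','v','l']) + random.choice(['a','o','i']) + random.choice(['p','k','l'])
            f.write(word + '\n')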
Collecting the output¶
Because of the parallel processing we cannot use return and collect the output somewhere. While it is possible to write all your output to files, like we did above, and read the results back in, in a lot of cases this is very inconvenient; what if your function produces multiple large dictionaries? The solution is multiprocessing.Queue, an object that all processes can communicate through. In our case, we can use it as a place where the worker processes can put their output, and where the main process can get that output. This can go on until the main process has enough results:
import multiprocessing
import random

def generate_pseudowords(q):
    # each worker process puts one million pseudowords in the shared queue
    for i in range(1000000):
        word = random.choice(['b','d','v','l']) + random.choice(['a','o','i']) + random.choice(['p','k','l'])
        q.put(word)

q = multiprocessing.Queue()
for i in range(32):
    p = multiprocessing.Process(target=generate_pseudowords, args=[q])
    p.start()

results = []
while len(results) < 32000000:
    results.append(q.get())
Here, we run the function generate_pseudowords() 32 times in parallel. Each of these 32 processes will put a million pseudowords in the queue. In the main process, we empty this queue at the same time, and save the results in a list named results. When there are 32 x a million pseudowords in this list, we know all processes are done.
Pools¶
In a lot of cases, you will find that you want the server to do more than 32 tasks. Have a look at another version of our pseudoword generation function, for instance:
def generate_pseudowords(starting_letter):
    words = []
    for i in range(1000000):
        words.append(starting_letter + random.choice(['a','o','i']) + random.choice(['p','k','l']))
    return words
This function returns a million pseudowords with a predefined starting letter. Now imagine you want to do this for every letter in this string:
string = 'thisisaverylongstringthatcontainsalotofwordsandwewantamillionrandomwordsgeneratedforeachletter'
If we want to do this with the concepts we already know, we would have to cut the string into 32 parts and give each part to a process. The multiprocessing module, however, also provides a higher level solution that takes a lot of this kind of 'management work' out of your hands: the Pool object. It works like this:
p = multiprocessing.Pool(32)  # start a pool of 32 worker processes
result = p.map(generate_pseudowords, string)
This code starts a Pool with 32 processes, distributes the work, and collects the results automatically. The first argument of map() is the function you want to run many times in parallel, the second argument is a list (or other iterable) of inputs for this function. So if the iterable in the second argument contains 1337 items, the function from the first argument will be executed 1337 times.
In this example, the Pool will iterate over the letters, give each letter to one of the 32 processes, and tell it to use it as the input for generate_pseudowords(). When all processes are busy, it will wait until one is done. The results will be appended to a list, which will be returned by map() when everything is done. Because our input string contains 94 letters, this list will contain 94 items. Each item will be a list of a million pseudowords starting with the corresponding letter.
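As a quick sanity check on the structure of result, something like the following sketch (reusing the variables from the example above) could be used:

print(len(result))      # one item per letter in the input string
print(len(result[0]))   # each item is a list of one million pseudowords
print(result[0][:3])    # e.g. ['tip', 'tak', 'top'] for starting letter 't'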