A Parallel Mandelbrot Set
The Mandelbrot Set is usually an early example in any course on parallel programming. It is embarrassingly parallel, in that every pixel is calculated independently of every other pixel. However, parallelism in python is not entirely straightforward.
The trivial: numba and prange
If using numba, there is one trivial approach. Given that each column of the set is calculated independently of the others, replacing range by prange in the mandel_calc function works quite well. One also needs to change the @jit line to

@jit(parallel=True)

and add prange to the import:

from numba import jit,prange

As only those three lines need modifying, the modified code is not given here, merely a link to the serial Mandelbrot Set code.
On UNIX systems the diff command exists to show differences between text files, with the < and > lines in the output highlighting the changes, < showing the contents of the left-hand argument and > of the right-hand argument. Each difference is prefixed by its line-number. Line numbers tend to be approximate, as many trivial changes to source files disrupt them.
$ diff mandel_tk_prange.py mandel_tk.py
35c35
< from numba import jit,prange
---
> from numba import jit
45c45
< @jit(nogil=True,parallel=True)
---
> @jit(nogil=True)
49c49
<     for j in prange(0,size):
---
>     for j in range(0,size):
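For concreteness, a self-contained sketch of such a prange loop is given below. It is not the linked code: the function name and the normalisation are illustrative, and the try/except shim merely lets the sketch fall back to serial pure python when numba is not installed.

```python
try:
    from numba import jit, prange
except ImportError:
    # Fallback so the sketch still runs without numba: prange becomes
    # plain range, and jit a do-nothing decorator.
    prange = range
    def jit(**kwargs):
        def wrap(f):
            return f
        return wrap

import numpy as np

@jit(nogil=True, parallel=True)
def mandel_block(xmin, xmax, ymin, ymax, size, maxiter):
    M = np.zeros((size, size))
    xstep = (xmax - xmin) / size
    ystep = (ymax - ymin) / size
    # Each column j is independent of the others, so numba is free to
    # run the prange iterations on separate threads.
    for j in prange(0, size):
        x = xmin + j * xstep
        for i in range(0, size):
            z = 0j
            c = complex(x, ymin + i * ystep)
            n = 0
            while abs(z) <= 2.0 and n < maxiter:
                z = z * z + c
                n += 1
            M[i, j] = n / maxiter
    return M
```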
Harder: multiprocessing and shared memory
Python's multiprocessing module is a little trickier to use. It ends up effectively running multiple copies of the same program, but allows some arrays to be shared between them. The implementation is quite different between UNIX systems on the one hand, and Windows and MacOS on the other (MacOS unusually being "not UNIX" in this respect). UNIX systems use fork, and the others use full process spawning. Using fork is faster, and leads to slightly different sharing of variables. One has to take care to code in such a fashion that one's code works properly whichever method python uses.
The first implication of this is that the main program (i.e. the part outside of function definitions) needs to be protected so that it is run on the controlling process only, and not on multiple processes. This is done by enclosing it in an if statement reading
if __name__ == '__main__':
Once that is done, from the programming perspective the parallelism offered is that of calling a function repeatedly for many different values of a single parameter. This fits with what is needed here, for that function can be one which calculates a whole column of the Mandelbrot Set. We could choose to let the function return something, which could be automatically gathered into a large array of somethings. But instead we choose to have the function write its column directly into the array in which the whole 2D set is stored, and we do that by setting up some shared memory for that array.
The code for the function which calculates a column of the set, and writes it into a shared array, is
def mandel_write_column(i,x,xstep,ymin,ymax,size,maxiter):
    global MS_data
    MS=np.frombuffer(MS_data,dtype=np.float64).reshape(size,size)
    x+=i*xstep
    MS[:,i]=mandel_column(x,ymin,ymax,size,maxiter)
The RawArray function sets up a shared memory buffer, and this then gets converted to a numpy array using numpy's frombuffer function. The numpy array continues to use the original buffer for storing its data, with its metadata, which records its size, shape, type, and where its data are to be found, stored elsewhere. Python is fussy about how these shared arrays are defined, and one acceptable solution is to have them as global variables which are set in the initialiser of the function parallelised. That is what is done here, with the initialiser being simply:
def init_mandel_write_column(MS_store):
    global MS_data
    MS_data=MS_store
Note that there are no issues with locking here, for each element of this array is written just once by just one process.
We really want this to be a function of a single integer, to make it easy to call with the Pool.map() function, and python's partial function makes this easy, for it reduces the number of arguments to a function, presetting the others to fixed values.
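As a hedged illustration of partial (with a made-up function, not one from this code), fixing all but the first argument leaves exactly the single-integer function Pool.map wants:

```python
from functools import partial

def scale(i, factor, offset):
    return i * factor + offset

# Preset factor and offset, leaving a function of the single argument i.
scaled = partial(scale, factor=10, offset=1)
```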
We want to calculate the columns in chunks, so that after each chunk we can update the display, and a fixed chunk size of ten might not be appropriate, depending on how many processes the chunk is then divided amongst; the chunk size is therefore adjusted to be a multiple of the number of processes. So the mandel_calc function is now a little more complicated.
def mandel_calc():
    global M,calculating,q,maxiter,cmap,abort_calc,nproc
    xstep=(xmax-xmin)/size
    chunk=nproc*int(10/nproc)
    if (chunk==0): chunk=nproc
    if (chunk>size): chunk=size
    t=time()
    ppm_col_header=b"P6\n%i %i\n255\n"%(chunk,size)
    with Pool(processes=nproc, initializer=init_mandel_write_column,
              initargs=(M.data,)) as pool:
        for i in range(0,size-chunk+1,chunk):
            if abort_calc: break
            res=pool.map(partial(mandel_write_column,x=xmin,xstep=xstep,
                                 ymin=ymin,ymax=ymax,size=size,
                                 maxiter=maxiter),range(i,i+chunk),1)
            if (i>0):
                sblock=np.array(255*(cmap(M[:,i+1-chunk:i+1])[:,:,0:3]),
                                dtype=np.uint8)
                ppm_col_bytes=ppm_col_header+sblock.tobytes()
                tim=time()-t
                q.put(lambda i=i, bytes=ppm_col_bytes, tim=tim:
                      col_draw(i,bytes,tim))
        remainder=size%chunk
        if ((remainder>0)and(not abort_calc)):
            i+=chunk
            pool.map(partial(mandel_write_column,x=xmin,xstep=xstep,
                             ymin=ymin,ymax=ymax,size=size,
                             maxiter=maxiter),range(i,i+remainder))
    t=time()-t
    q.put(lambda t=t,maxiter=maxiter, size=size: finished_calc(t,maxiter,size))
The final result is mandel_tk_par.py.
There are large overheads to starting a parallel pool (with Pool), where "large" may mean several tenths of a second, and certainly will be many tenths of a second if numba is involved. We cannot afford to do this once per chunk. Fortunately, within a pool the overheads involved in starting and stopping the workers with Pool.map are much more modest. So here we start the pool once per image, but may then use a hundred or so maps if the image is broken into a hundred or so chunks.
The speedup may be negligible (or a slow-down!) on the default 600x600 pixel image with maxiter set to 255. On one computer I tested, the serial version was over three times faster than the parallelised version running with two workers on this test. However, increase maxiter to 5000 and the resolution to 900x900 and the parallel version running on two cores is over 65% faster than the serial version. This is a common phenomenon. Parallel codes like big problems, not trivially small ones.
Note too that the parallel version running on a single process is likely to be slower than the serial version. The code above allows one to change the number of processes via the GUI. One can even try more processes than one's computer has cores.