A Parallel Mandelbrot Set

The Mandelbrot Set is usually an early example in any course on parallel programming. It is embarrassingly parallel, in that every pixel is calculated independently of every other pixel. However, parallelism in Python is not entirely straightforward.

The trivial: numba and prange

If using numba, there is one trivial approach. Given that each pixel in a column is calculated independently, replacing range by prange in the mandel_calc function works quite well. One also needs to change the @jit line to

@jit(nogil=True,parallel=True)

and add prange to the import

from numba import jit,prange

As only those three lines need modifying, the modified code is not given here, merely a link to the serial Mandelbrot Set code.

On UNIX systems the diff command exists to show differences between text files, with the < and > lines in the output highlighting the changes, < showing the contents of the left-hand argument and > of the right-hand argument. Each difference is prefixed by its line-number. Line numbers tend to be approximate, as many trivial changes to source files disrupt them.

$ diff mandel_tk_prange.py mandel_tk.py
35c35
< from numba import jit,prange
---
> from numba import jit
45c45
< @jit(nogil=True,parallel=True)
---
> @jit(nogil=True)
49c49
<     for j in prange(0,size):
---
>     for j in range(0,size):
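To make the pattern concrete, here is a minimal, self-contained sketch of a prange-parallelised Mandelbrot calculation. The function name (mandel_prange) and the import fallback are illustrative, not from the original code; the try/except shim simply lets the sketch run (slowly, in pure Python) even where numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback shim so the sketch still runs without numba (illustrative only)
    njit = lambda **kw: (lambda f: f)
    prange = range

@njit(parallel=True)
def mandel_prange(xmin, xmax, ymin, ymax, size, maxiter):
    MS = np.zeros((size, size))
    xstep = (xmax - xmin) / size
    ystep = (ymax - ymin) / size
    for j in prange(size):          # columns are independent, so prange is safe
        x = xmin + j * xstep
        for i in range(size):
            y = ymin + i * ystep
            c = complex(x, y)
            z = 0j
            n = 0
            while abs(z) <= 2.0 and n < maxiter:
                z = z * z + c
                n += 1
            MS[i, j] = n            # each element is written by exactly one iteration
    return MS
```

Because no two iterations of the prange loop write to the same elements of MS, no locking is needed, and numba can distribute the columns across threads freely.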

Harder: multiprocessing and shared memory

Python's multiprocessing module is a little trickier to use. It ends up effectively running multiple copies of the same program, but allows some arrays to be shared between them. The implementation is quite different between UNIX systems on the one hand, and Windows and macOS on the other (macOS unusually being "not UNIX" in this respect). UNIX systems use fork, and the others use full process spawning. Using fork is faster, and leads to slightly different sharing of variables. One has to take care to code in such a fashion that one's code works properly whichever method Python uses.

The first implication of this is that the main program (i.e. the part outside of function definitions) needs to be protected so that it is run on the controlling process only, and not on multiple processes. This is done by enclosing it in an if statement reading

if __name__ == '__main__':
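A minimal sketch of the guard pattern, with illustrative names (the square function and the pool size are not from the original code):

```python
from multiprocessing import Pool

def square(n):
    return n * n

if __name__ == '__main__':
    # Only the controlling process runs this block; worker processes
    # re-import the module without executing the guarded code.
    with Pool(processes=2) as pool:
        print(pool.map(square, range(4)))   # prints [0, 1, 4, 9]
```

Under the spawn method, omitting the guard causes each worker to re-run the main program and attempt to start its own pool, which raises an error; under fork it may appear to work, which is why code must be tested with the guard in place.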

Once that is done, from the programming perspective the parallelism offered is that of calling a function repeatedly for many different values of a single parameter. This fits with what is needed here, for that function can be one which calculates a whole column of the Mandelbrot Set. We could choose to let the function return something, which could be automatically gathered into a large array of somethings. But instead we choose to have the function write its column directly into the array in which the whole 2D set is stored, and we do that by setting up some shared memory for that array.

The code for the function which calculates a column of the set, and writes it into a shared array, is

def mandel_write_column(i,x,xstep,ymin,ymax,size,maxiter):
    global MS_data
    MS=np.frombuffer(MS_data,dtype=np.float64).reshape(size,size)
    x+=i*xstep
    MS[:,i]=mandel_column(x,ymin,ymax,size,maxiter)

The RawArray function sets up a shared memory buffer, which is then converted to a numpy array using numpy's frombuffer function. The numpy array continues to use the original buffer for storing its data; its metadata, recording the size, shape, type, and the location of the data, is stored separately. Python is fussy about how these shared arrays are defined, and one acceptable solution is to have them as global variables which are set in the initialiser of the function parallelised. That is what is done here, with the initialiser being simply

def init_mandel_write_column(MS_store):
    global MS_data
    MS_data=MS_store
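The shared-memory setup itself can be sketched in a few lines. The variable names here are illustrative; the typecode 'd' denotes a C double, matching numpy's float64.

```python
from multiprocessing import RawArray
import numpy as np

size = 4
MS_data = RawArray('d', size * size)    # shared buffer of size*size C doubles
# Wrap the buffer as a 2D numpy array; no data are copied, so writes to MS
# go straight into the shared buffer and are visible to all processes.
MS = np.frombuffer(MS_data, dtype=np.float64).reshape(size, size)
MS[:, 0] = 1.0                          # e.g. fill the first column
```

Passing MS_data to the pool's initializer (via initargs) then gives every worker access to the same underlying buffer, which each worker re-wraps with frombuffer exactly as mandel_write_column does.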

Note that there are no issues with locking here, for each element of this array is written just once by just one process.

We really want this to be a function of a single integer, to make it easy to call with the Pool.map() function, and python's partial function makes this easy, for it reduces the number of arguments to a function, setting others to default values.
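How partial collapses a many-argument function to a function of one argument can be shown with a toy example (the function and values here are illustrative, not from the original code):

```python
from functools import partial

def scale(i, factor, offset):
    return i * factor + offset

# Fix factor and offset, leaving a function of i alone, as Pool.map requires
f = partial(scale, factor=10, offset=1)
print([f(i) for i in range(3)])   # prints [1, 11, 21]
```

In the Mandelbrot code the same trick fixes x, xstep, ymin, ymax, size and maxiter, so that pool.map need only supply the column index.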

We want to calculate the columns in chunks, so that after each chunk we can update the display, and a fixed chunk size of ten might not be appropriate, depending on how many processes this chunk is then divided amongst. So the mandel_calc function is now a little more complicated.

def mandel_calc():
    global M,calculating,q,maxiter,cmap,abort_calc,nproc
    xstep=(xmax-xmin)/size
    chunk=nproc*int(10/nproc)
    if (chunk==0): chunk=nproc
    if (chunk>size): chunk=size
    t=time()

    ppm_col_header=b"P6\n%i %i\n255\n"%(chunk,size)
    
    with Pool(processes=nproc, initializer=init_mandel_write_column,initargs=(M.data,)) as pool:
        for i in range (0,size-chunk+1,chunk):
            if abort_calc: break
            res=pool.map(partial(mandel_write_column,x=xmin,xstep=xstep,ymin=ymin,ymax=ymax,size=size,maxiter=maxiter),range(i,i+chunk),1)
            if (i>0):
                sblock=np.array(255*(cmap(M[:,i+1-chunk:i+1])[:,:,0:3]),
                                dtype=np.uint8)
                ppm_col_bytes=ppm_col_header+sblock.tobytes()
                tim=time()-t
                q.put(lambda i=i, bytes=ppm_col_bytes, tim=tim:
                      col_draw(i,bytes,tim))

        remainder=size%chunk
        if ((remainder>0)and(not abort_calc)):
            i+=chunk
            pool.map(partial(mandel_write_column,x=xmin,xstep=xstep,ymin=ymin,ymax=ymax,size=size,maxiter=maxiter),range(i,i+remainder))
        
            
    t=time()-t
    q.put(lambda t=t,maxiter=maxiter, size=size:
                 finished_calc(t,maxiter,size))

The final result is mandel_tk_par.py.

There are large overheads to starting a parallel pool (with Pool), where "large" may mean several tenths of a second, and certainly will be many tenths of a second if numba is involved. We cannot afford to do this once per chunk. Fortunately within a pool the overheads involved in starting and stopping the workers with Pool.map are much more modest. So here we start the pool once per image, but may then use a hundred or so maps if the image is broken into a hundred or so chunks.

The speedup may be negligible (or a slow-down!) on the default 600x600 pixel image with maxiter set to 255. On one computer I tested, the serial version was over three times faster than the parallelised version running with two workers on this test. However, increase maxiter to 5000 and the resolution to 900x900 and the parallel version running on two cores is over 65% faster than the serial version. This is a common phenomenon. Parallel codes like big problems, not trivially small ones.

Note too that the parallel version running on a single process is likely to be slower than the serial version. The code above allows one to change the number of processes via the GUI. One can even try more processes than one's computer has cores.