A Better Mandelbrot Set
How well optimised is the Mandelbrot Set code? What is meant by "well optimised"?
Too often one thinks solely of speed when considering optimisation. But perhaps there are other issues which should be considered too:
- Human interface
- Accuracy
- Algorithmic efficiency
- Efficiency of implementation of the algorithm, including parallelisation
For the above code, the human interface is poor (one needs to edit the source and re-run to change any parameters), the accuracy is possibly not an issue, the algorithmic efficiency is rather basic, and so is the efficiency of its implementation.
Is the code a complete mess? Not quite: it could be claimed that it was optimised for brevity and readability. And the human interface is not a complete disaster, for the mouse cursor is tracked with co-ordinates, and there is an option to save the image. Some Mandelbrot set generators simply produce an image file and expect one to find another program to view this image, and this code is a step above that, particularly as it updates the on-screen image whilst calculating.
But it could be better, and, as a small step in the direction of "better", the following is offered.
(Any error on running about colorama being insufficiently recent can be ignored; it arises from an inconsistency between the packages in Raspbian.)
What is better? There are three areas of improvement, and one of regression.
- The user interface now allows one to select a new region to calculate by dragging the mouse, and it has text entry boxes for adjusting the resolution and maximum number of iterations. It also now times itself. So the user interface is better. This is the only change in the MacOS version.
- The user interface is still based on matplotlib, but it is called in a slightly different way which produces a small speed improvement.
- The core function to perform the maths of the Mandelbrot set calculation is now compiled, rather than interpreted. More on this below.
- The code has grown by about a factor of three in length. This cannot be considered an improvement.
How much faster is this new code, and why? We can separate the time spent computing the set from the time spent updating the display by comparing times for the normal code with a version in which the if statement controlling the screen updates is changed to a condition which is never true, so that no screen update occurs until the whole set is calculated. The original code runs in about 38s on a Pi 4, or 31s with the updates removed, suggesting that the calculation takes around 31s, and the graphics around 7s. Having the helpful updates in the output adds almost a quarter to the run-time, which is not ideal, but just about acceptable.
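The idea can be sketched as follows (the function and its update interval are hypothetical, standing in for the real drawing code, not taken from the program itself):

```python
def render(height, update_every=None):
    """Compute 'height' rows, redrawing every 'update_every' rows.

    Passing None makes the update condition never true, so a timed run
    then measures the calculation alone.  Returns the number of redraws.
    """
    updates = 0
    for row in range(height):
        pass  # ...the Mandelbrot arithmetic for one row would go here...
        if update_every is not None and row % update_every == 0:
            updates += 1  # stands in for redrawing the screen
    return updates

print(render(480, 20))   # periodic redraws while calculating
print(render(480))       # condition never true: no redraws until the end
```

Timing the two variants then attributes the difference in run-time to the graphics.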
The improved code runs in 5.6s with the screen updates, or 0.35s without them. So the speed of the graphics has been improved marginally from around 7s to around 5.3s, but the calculation time has been reduced from 31s to 0.35s. How?
Python is an interpreted language. C, C++ and Fortran are compiled languages. Compiled languages are converted into the language of the CPU which is to execute them before they are run, and then the CPU runs this native code. The same piece of C++ might be translated into 32 bit ARM code for running on a Raspberry Pi, or 64 bit x86_64 code for running on a PC. Either way, the CPU is presented with its native language, and runs fast.
Python is not like that. Although it is translated from the human-readable source to a more computer-friendly "byte code", the same byte code is executed on a Pi, a PC, or any other computer. The CPU has to run an interpreter to execute these foreign (to it) instructions in some virtual python machine. If the individual instructions involve a lot of processing, this extra translation step is not too costly. As an extreme example, on the Linpack page there is a single line of python which takes seconds to execute just once. The extra overhead of presenting that as byte code rather than as native instructions will be tiny, as the code behind linalg.solve is native. The core of the Mandelbrot algorithm is the opposite.
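To illustrate the Linpack point above (a sketch, not the actual line from that page): a single numpy call can trigger a large amount of native LAPACK work, beside which the byte-code overhead of the one Python statement is negligible.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
A = rng.random((n, n))        # a 500x500 system: roughly n**3/3 floating point operations
b = rng.random(n)
x = np.linalg.solve(A, b)     # one line of Python; all the work happens in native LAPACK
print(np.allclose(A @ x, b))  # the solution satisfies the system
```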
    while (i<maxiter)and(z.real**2+z.imag**2<4):
        z=z*z+z0
        i=i+1
Each iteration of that loop requires much more effort to interpret than the computational effort required to perform the operations it represents. In a compiled language, the compiler translates to native code before the program is first run: it compiles that loop once, and the loop then executes hundreds of thousands of times as native code. In an interpreted language, each iteration sees the overhead of interpretation. The amount of computation in those three lines is quite small: six floating point multiplies, four floating point adds, one integer add, one floating point comparison, and one integer comparison.
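For completeness, here is that loop with its initialisation, as a plain-Python per-point function (the function name is hypothetical; the program's own function will differ in detail):

```python
def point_iterations(z0, maxiter):
    """Plain-Python Mandelbrot iteration count for a single point z0.

    Each pass of the loop is only around a dozen floating point
    operations, so the interpreter's dispatch cost dominates it.
    """
    z = 0j
    i = 0
    while (i < maxiter) and (z.real**2 + z.imag**2 < 4):
        z = z*z + z0
        i = i + 1
    return i

print(point_iterations(0j, 50))    # the origin never escapes: runs to maxiter
print(point_iterations(2+0j, 50))  # escapes immediately
```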
Python has a trick to address this. The numba module is able to compile some functions, and it will attempt to compile those "decorated" by adding @jit on the preceding line. However, numba is not in python.org's distribution, so Mac users have to face a more complicated installation process, not described here, before they can use it. Those using a conda environment on MacOS may well have this package.
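A minimal sketch of this decoration (the try/except fallback is an addition so that the snippet still runs where numba is not installed; with numba present, the function is compiled to native code on its first call):

```python
try:
    from numba import jit
except ImportError:            # numba absent: fall back to a do-nothing decorator
    def jit(*args, **kwargs):
        return lambda f: f

@jit(nopython=True)            # numba compiles this function on its first call
def mandel_iters(z0, maxiter):
    """Mandelbrot iteration count for a single point z0."""
    z = 0j
    i = 0
    while (i < maxiter) and (z.real**2 + z.imag**2 < 4):
        z = z*z + z0
        i = i + 1
    return i

print(mandel_iters(0j, 100))   # the origin never escapes: runs to maxiter
```

The function body is unchanged; only the decorator is added, which is what makes numba attractive for loops like this one.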
This leaves a code where the overhead of the graphics is not around 22%, but around 1500%. There is no point in trying to optimise the calculation part further unless the graphics can be speeded up considerably, and one might well wish to do so.
(The compilation performed by numba occurs the first time a function is called. It seems that if one runs this program, and then presses recalculate to recalculate the same image once it has finished, the recalculated image appears about 0.7s faster. Remove the @jit line, and this time difference between the first and subsequent calculations disappears. So one might conclude that numba took around 0.7s to process the one function it compiles.)
Yet Faster Graphics
I failed to find a matplotlib solution which was significantly faster than this, and which updated the screen whilst it calculated. So eventually I abandoned matplotlib, and rewrote the code to use GTK3 directly for its graphics. The aim of making it faster was certainly achieved: the 5.6s benchmark above dropped in execution time to about 0.65s.