256 bit Fixed Point Mandelbrot Set

This code currently requires a 64 bit version of gcc or clang for the non-standard __int128 type. So it won't work on Pis running a 32 bit OS.

So far we have seen an arbitrary precision fixed point Mandelbrot set using GMP, and one using 128 bit ints. The 128 bit int version has the disadvantages of requiring a non-standard C extension, and of being limited to shifts of just 61 bits, which is only a little more than double precision floating point at around 53. The GMP version requires one to have the development version of a slightly unusual library, and it is slow: about ten times slower than the 128 bit integer version.

Can we produce something a bit faster than the GMP version, but with a bit more precision than our current 128 bit integer version?

Our 128 bit integer version used 64 bit integers most of the time, but 128 bit integers to hold the product of two 64 bit numbers so as to guarantee that no overflow occurred. The GMP version used the slow, complicated GMP integer type for everything. Assuming that we have a fast 128 bit integer type, can we use that most of the time, and use something slower just for the result of products, noting that a multiplication is always followed by a shift which reduces the result back down to 128 bits?

This is fairly easy to do, by splitting our 128 bit numbers into two halves, the high bits and the low bits.

  a = a_high * 2^64 + a_low
    = a_high << 64 + a_low

where ^ means exponentiation, and << means a leftward shift of the bits. We can now write

  a * b = (a_high<<64 + a_low) * (b_high<<64 + b_low)
        = (a_high*b_high)<<128 + (a_high*b_low+b_high*a_low)<<64 + a_low*b_low

but the important thing to note is that none of the individual products will overflow a 128 bit integer, as the multiplicands are only 64 bits. We have generated the full 256 bit product, which we cannot store in a 128 bit integer, but, as we wish to shift it immediately, this does not matter.

  (a * b)>>bshift = (a_high*b_high)<<(128-bshift)
                    + (a_high*b_low+b_high*a_low)<<(64-bshift)
                    + (a_low*b_low)>>bshift

Note that when bshift exceeds 64, the shift left by (64-bshift) in the middle term is really a shift right by (bshift-64). If the product after shifting still does not fit in a 128 bit integer, then we have exhausted what we can do using 128 bit integers to store the real and imaginary components.
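
Putting this together, here is a minimal sketch of the multiply-and-shift step (not necessarily the exact routine in the downloadable code), assuming gcc's or clang's __int128, a shift strictly between 64 and 128, and operands well within range. It works on magnitudes, so that each truncating shift rounds towards zero, as described under "Short cuts" below.

  #include <stdint.h>

  typedef __int128 int128;
  typedef unsigned __int128 uint128;

  /* (a * b) >> bshift, computed without ever forming the full
     256 bit product. Assumes 64 < bshift < 128, and operands small
     enough that no partial sum overflows 128 bits. */
  static int128 mul_shift(int128 a, int128 b, int bshift) {
      int negative = (a < 0) != (b < 0);
      uint128 ua = a < 0 ? -(uint128)a : (uint128)a;
      uint128 ub = b < 0 ? -(uint128)b : (uint128)b;

      uint128 a_high = ua >> 64, a_low = (uint64_t)ua;
      uint128 b_high = ub >> 64, b_low = (uint64_t)ub;

      uint128 r = ((a_high * b_high) << (128 - bshift))
                + ((a_high * b_low + b_high * a_low) >> (bshift - 64))
                + ((a_low * b_low) >> bshift);
      return negative ? -(int128)r : (int128)r;
  }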

Again, to compile one has to download the three files above and then type

python3 setup.py build_ext --inplace

to build the module, before running the program as

python3 mandel_tk_int256_par_col.py

GCC emulates 128 bit integers using a pair of 64 bit integers. This is because most processors which have 128 bit integer registers do not actually support basic arithmetic operations such as addition on them. A 128 bit addition thus becomes two 64 bit additions, and, in assembler, the complications of a potential carry from one half to the other are dealt with without increasing the instruction count, whereas in a high-level language such as C they would be more painful.
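
To see why the carry is more painful in C, here is a sketch of a 128 bit addition built from a hypothetical pair-of-64-bit-integers type; in assembler the second addition would simply become an add-with-carry instruction.

  #include <stdint.h>

  typedef struct { uint64_t low, high; } u128_t;  /* hypothetical pair type */

  static u128_t add128(u128_t a, u128_t b) {
      u128_t r;
      r.low  = a.low + b.low;
      /* unsigned overflow wraps, so a carry out of the low half is
         detected by the sum being smaller than either operand */
      r.high = a.high + b.high + (r.low < a.low);
      return r;
  }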

Although simply writing a routine to multiply two 128 bit numbers would have sufficed, we have written a separate routine for squaring them, as various simple optimisations can be made in this case, and, of the three multiplies in the code, two are actually squarings. When squaring we know that the answer will not be negative, we don't have to pass two arguments, and the two cross products are equal (a_high*b_low = b_high*a_low when a = b), so we can avoid one multiplication.
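
A sketch of such a squaring routine, reusing the typedefs and assumptions of the multiplication sketch above: the sign handling disappears, only one argument is passed, and the two equal cross products become a single multiply, doubled by shifting one bit less.

  /* (a * a) >> bshift: a square is non-negative and may use all
     128 bits, with no sign bit, so it is returned unsigned */
  static uint128 sqr_shift(int128 a, int bshift) {
      uint128 ua = a < 0 ? -(uint128)a : (uint128)a;
      uint128 high = ua >> 64, low = (uint64_t)ua;

      return ((high * high) << (128 - bshift))
           + ((high * low) >> (bshift - 65))  /* cross product, doubled */
           + ((low * low) >> bshift);
  }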

In the earlier 128 bit code, x and y were stored as 64 bit integers, and their products as 128 bit integers. In that case the largest bit-shift which worked was 61, leaving two bits for the integer part of x and y, and one bit for their signs. Now we are storing x and y as 128 bit integers, calculating their products as 256 bit integers, but storing the products as just 128 bits after shifting. The largest bit-shift which seems to work is 122, which leaves five bits for the integer part of x and y and their products, and one bit for their signs, but six bits for their squares, which lack the sign bit.

What is needed? Consider a starting point on the real axis just less than two. On step zero the square of its modulus will be calculated as just less than four, and so it passes to the next iteration. On the first iteration its value becomes just under six, and the square of its modulus just under 36. So really we need three bits for the integer part of x and y, as 6 = 110 in binary, and six for the integer part holding squares and products, as 36 = 100100 in binary. But provided a six-bit square is correctly represented, the loop will exit before the signed product is attempted.

So the maximum shift of 122 for the 256 bit code is constrained by the calculations of the products, and a shift of 123 mostly works, being most obviously wrong in a region around z=+2 which is of little interest. For the 128 bit code the limit of 61 is a slight surprise, as one might guess it should be just 60. The constraint comes from the storage of x and y, and in practice the error when these are just one bit too large does not matter. If x should be in the range 4 to 6, it erroneously maps to the range -4 to -2. But all that matters is that x^2 is calculated to be greater than four, so that the algorithm exits. Whether x is stored correctly as, say, 5, or incorrectly as -3, this will be the case.
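
A small demonstration of this wrap-around, assuming gcc's two's complement behaviour for the out-of-range conversion:

  #include <stdio.h>
  #include <stdint.h>

  int main(void) {
      const int shift = 61;
      /* 5.0 needs three integer bits, one more than a shift of 61
         leaves, and so wraps round to -3.0 */
      int64_t x = (int64_t)((uint64_t)5 << shift);
      __int128 x2 = ((__int128)x * x) >> shift;
      printf("x   = %.1f\n", (double)x  / (double)(1ULL << shift));
      printf("x^2 = %.1f\n", (double)x2 / (double)(1ULL << shift));
      /* prints x = -3.0 and x^2 = 9.0, still greater than four, so
         the bailout test fires just as it would have done for 25.0 */
      return 0;
  }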

Results

Times for default 600x600 image, single core, maximum iterations 2500, bit shift=50.

Double precision floating point: 2.1s
128 bit fixed point: 6.2s
256 bit fixed point: 17.5s
GMP fixed point: 57.5s

Times for default 600x600 image, single core, maximum iterations 2500, bit shift=120.

Double precision floating point: N/A
128 bit fixed point: N/A
256 bit fixed point: 17.5s
GMP fixed point: 96.0s

(The timings used a Pi 4 running a 64 bit version of Raspberry Pi OS.)

The results are as expected. The double-precision version is fastest, but its maximum bit shift is effectively about 53. The version using 128 bit integer products gives a little more precision, around 61 bits, but is about three times slower. Depending on quite where one is looking in the set, it might be possible to zoom in by an extra factor of 100 after paying this three-fold speed penalty.

The version using 256 bit integer products is almost ten times slower than the double precision version, but it supports bit shifts up to about 122. So not just 100x as much zoom, but around 2^70 times as much zoom, that is around 1 000 000 000 000 000 000 000x as much zoom.

Finally the GMP version is the slowest, and, although it can zoom without limit, it does get increasingly slow the greater the bit-shift. The 256 bit experiment was worthwhile, for it can beat GMP by a factor of between three and five, provided that one's compiler supports the 128 bit integer type.

A really clever code would contain all four algorithms, and switch between them as appropriate, including automatically adjusting the bit shift depending on the zoom level requested.

Short cuts

Our bit-shifting code does contain a slight short-cut. It always rounds towards zero, rather than to the nearest integer, when simulating a division by a shift. The 256 bit code is worse, because it does three shifts, each rounding towards zero, in the code for a single multiplication. We could improve on this by adding a suitable constant before the shift: one needs to add the number which would be a half after the shift, i.e. 1<<(shift-1). But this would slow the code for little gain.
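
A sketch of what such a rounding shift would look like:

  /* round to nearest rather than truncating: add one half, in fixed
     point, before shifting (halves round upwards, given gcc's
     arithmetic right shift) */
  static __int128 shift_round(__int128 v, int shift) {
      return (v + ((__int128)1 << (shift - 1))) >> shift;
  }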

Because we round towards zero, rather than perhaps always rounding down, we retain perfect symmetry between +y and -y. If we rounded 2.7 to 2, but -2.7 to -3, we would lose symmetry.