256 bit Fixed Point Mandelbrot Set
This code currently requires a 64 bit version of gcc or clang for the non-standard __int128 type, so it won't work on Pis running a 32 bit OS.
So far we have seen an arbitrary precision fixed point Mandelbrot set using GMP, and one using 128 bit ints which is limited to a bit shift of 61. The 128 bit int version has the disadvantages of requiring a non-standard C extension, and of a maximum shift of 61 bits being only a little more than double precision floating point offers, at around 53. The GMP version requires one to have the development version of a slightly unusual library, and it is slow, about ten times slower than the 128 bit integer version.
Can we produce something a bit faster than the GMP version, but with a bit more precision than our current 128 bit integer version?
Our 128 bit integer version used 64 bit integers most of the time, but 128 bit integers to hold the product of two 64 bit numbers, so as to guarantee that no overflow occurred. The GMP version used the slow, complicated GMP integer type for everything. Assuming that we have a fast 128 bit integer type, can we use that most of the time, and use something slow only for the result of products, noting that a multiplication is always followed by a shift which reduces the result back down to 128 bits?
This is fairly easy to do, by splitting our 128 bit numbers into two halves, the high bits and the low bits.
a = a_high * 2^64 + a_low = a_high << 64 + a_low
where ^ means exponentiation, and << means a leftward shift of the bits. We can now write
a * b = (a_high<<64 + a_low) * (b_high<<64 + b_low) = (a_high*b_high)<<128 + (a_high*b_low+b_high*a_low)<<64 + a_low*b_low
but the important thing to note is that none of the individual products will overflow a 128 bit integer, as the multiplicands are only 64 bits. We have generated the full 256 bit product, which we cannot store in a 128 bit integer, but, as we wish to shift it immediately, this does not matter.
(a * b)>>bshift = (a_high*b_high)<<(128-bshift) + (a_high*b_low+b_high*a_low)<<(64-bshift) + (a_low*b_low)>>bshift
(Note that when bshift exceeds 64 the middle term's shift count, 64-bshift, becomes negative, and a left shift by a negative count is to be read as a right shift by bshift-64.) If the product after shifting still does not fit in a 128 bit integer, then we have exhausted what we can do using 128 bit integers to store the real and imaginary components.
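The splitting is readily expressed in C. The following is a minimal sketch rather than the module's actual code: the names mul_shift and shl are invented here, it assumes a compiler providing __int128, and it assumes operands in the range the Mandelbrot iteration produces, so that no partial sum overflows.

#include <stdint.h>

typedef __int128 int128;
typedef unsigned __int128 uint128;

/* Shift left by s bits, or right when s is negative, as happens to
   the middle term once bshift exceeds 64. */
static int128 shl(int128 v, int s)
{
    return s >= 0 ? v << s : v >> -s;
}

/* (a*b)>>bshift without ever materialising the 256 bit product:
   each partial product of two 64 bit halves fits in 128 bits. */
static int128 mul_shift(int128 a, int128 b, int bshift)
{
    int128  a_high = a >> 64;        /* signed top half      */
    uint128 a_low  = (uint64_t)a;    /* unsigned bottom half */
    int128  b_high = b >> 64;
    uint128 b_low  = (uint64_t)b;

    int128  hh    = a_high * b_high;                                 /* weight 2^128 */
    int128  cross = a_high * (int128)b_low + b_high * (int128)a_low; /* weight 2^64  */
    uint128 ll    = a_low * b_low;                                   /* weight 2^0   */

    return shl(hh, 128 - bshift) + shl(cross, 64 - bshift)
         + (int128)(ll >> bshift);
}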
Again to compile one has to download the three files above and then type
python3 setup.py build_ext --inplace
to build the module, before running the program as
python3 mandel_tk_int256_par_col.py
GCC emulates 128 bit integers using a pair of 64 bit integers. This is because most processors which have 128 bit integer registers do not actually support basic arithmetic operations such as addition on them. A 128 bit addition thus becomes two 64 bit additions, and, in assembler the complications of a potential carry from one half to the other are dealt with without increasing the instruction count, whereas in a high-level language such as C they would be more painful.
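To see roughly what the compiler has to generate, here is a 128 bit addition written out over a hypothetical pair of 64 bit halves. In C the carry out of the low half must be recovered by an extra comparison; in assembler it comes for free from the carry flag, so the result is just an add followed by an adc on x86-64, or an adds followed by an adc on the 64 bit ARM cores of a Pi.

#include <stdint.h>

/* a 128 bit value as two 64 bit halves; illustrative only */
typedef struct { uint64_t low; int64_t high; } i128_pair;

static i128_pair add128(i128_pair a, i128_pair b)
{
    i128_pair r;
    r.low  = a.low + b.low;                       /* may wrap around        */
    r.high = a.high + b.high + (r.low < a.low);   /* carry in if it wrapped */
    return r;
}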
Although simply writing a routine to multiply two 128 bit numbers would have sufficed, we have written a separate routine for squaring them, as various simple optimisations can be made in this case, and, of the three multiplies in the code, two are actually squarings. When squaring we know that the answer will not be negative, we don't have to pass two arguments, and a_high*b_low=b_high*a_low, so we can avoid one operation.
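A sketch of the corresponding squaring routine, using the same invented names as the multiplication sketch above, shows the saving: only one cross product is formed, and its doubling can be folded into its shift.

/* (a*a)>>bshift; shl() and the typedefs as in the sketch above */
static int128 sqr_shift(int128 a, int bshift)
{
    int128  a_high = a >> 64;
    uint128 a_low  = (uint64_t)a;

    int128  hh    = a_high * a_high;         /* weight 2^128     */
    int128  cross = a_high * (int128)a_low;  /* formed only once */
    uint128 ll    = a_low * a_low;           /* weight 2^0       */

    /* the factor of two on the cross term becomes +1 on its shift */
    return shl(hh, 128 - bshift) + shl(cross, 65 - bshift)
         + (int128)(ll >> bshift);
}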
In the earlier 128 bit code, x and y were stored as 64 bit integers, and their products as 128 bit integers. In that case the largest bit-shift which worked was 61, leaving two bits for the integer part of x and y, and one bit for their signs. Now we are storing x and y as 128 bit integers, calculating their products as 256 bit integers, but storing the products as just 128 bits after shifting. The largest bit-shift which seems to work is 122, which leaves five bits for the integer part of x and y and their products, and one bit for their signs, but six bits for their squares, which lack the sign bit.
What is needed? Consider a starting point on the real axis just less than two. On step zero its modulus squared will be calculated as just less than four, and so it passes to the next iteration. On the first iteration its value becomes just under six, and its modulus squared just under 36. So really we need three bits for the integer part of x and y, as 6=110 in binary, and six for the integer part holding squares and products, as 36=100100 in binary. But provided a six-bit square is correctly represented, the loop will exit before the signed product is attempted.
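Expressed as a bit budget at the maximum shift of 122, this is:

x, y and signed products:  1 sign bit + 5 integer bits + 122 fraction bits = 128 bits
squares:                   6 integer bits, no sign bit needed + 122 fraction bits = 128 bits

The five integer bits of the signed quantities are one too few to hold a product as large as 36, but, as noted above, the loop will already have exited on the square.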
So the maximum shift of 122 for the 256 bit code is constrained by the calculations of the products, and a shift of 123 mostly works, being most obviously wrong in a region around z=+2 which is of little interest. For the 128 bit code the limit of 61 is a slight surprise, as one might guess it should be just 60. The constraint comes from the storage of x and y, and the error when these are just one bit too large does not matter in practice. If x should be in the range 4 to 6, it erroneously maps to the range -4 to -2. But all that matters is that x² is calculated to be greater than four, so that the algorithm exits. Whether x is stored correctly as, say, 5, or incorrectly as -3, this will be the case.
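This is easy to demonstrate. The stand-alone snippet below, an illustration rather than part of the Mandelbrot code, stores 5 at a bit shift of 61, watches it wrap to -3, and checks that the escape test still fires because the square is formed exactly in 128 bits.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const int bshift = 61;
    /* 5 in fixed point would be 5<<61, which overflows a 64 bit
       integer and wraps around to -3<<61 */
    int64_t x = (int64_t)((uint64_t)5 << bshift);
    printf("x reads as %lld\n", (long long)(x >> bshift));   /* -3 */

    /* but the escape test sees the exact 128 bit square, 9<<122,
       which is still greater than four, so the loop exits anyway */
    __int128 x2   = (__int128)x * x;
    __int128 four = (__int128)4 << (2 * bshift);
    printf("x*x > 4 is %s\n", x2 > four ? "true" : "false");  /* true */
    return 0;
}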
Results
Times for default 600x600 image, single core, maximum iterations 2500, bit shift=50.
Double precision floating point:   2.1s
128 bit fixed point:               6.2s
256 bit fixed point:              17.5s
GMP fixed point:                  57.5s
Times for default 600x600 image, single core, maximum iterations 2500, bit shift=120.
Double precision floating point:    N/A
128 bit fixed point:                N/A
256 bit fixed point:              17.5s
GMP fixed point:                  96.0s
(The timings used a Pi 4 running a 64 bit version of Raspberry Pi OS.)
The results are as expected. The double-precision version is fastest, but its maximum bit shift is effectively about 53. The version using 128 bit integer products gives a little more precision, around 61 bits, but is about three times slower. Depending on quite where one is looking in the set, it might be possible to zoom in by an extra factor of 100 after paying this threefold speed penalty.
The version using 256 bit integer products is almost ten times slower than the double precision version, but it supports bit shifts up to about 122. So not just 100x as much zoom, but around 2^70 times as much zoom, that is around 1 000 000 000 000 000 000 000x as much zoom.
Finally the GMP version is the slowest, and, although it can zoom without limit, it does get increasingly slow the greater the bit-shift. The 256 bit experiment was worthwhile, for it can beat GMP by a factor of between three and five, provided that one's compiler supports the 128 bit integer type.
A really clever code would contain all four algorithms, and switch between them as appropriate, including automatically adjusting the bit shift depending on the zoom level requested.
Short cuts
Our bit-shifting code does contain a slight short-cut: it always rounds towards zero, rather than to the nearest integer, when simulating a division by a shift. The 256 bit code is worse, because it does three shifts, each rounding towards zero, in the code for a single multiplication. We could improve on this by adding a suitable constant before the shift: one needs to add the number which would be a half after the shift, i.e. 1<<(shift-1). But this would slow the code for little gain.
Because we round towards zero, rather than perhaps always rounding down, we retain perfect symmetry between +y and -y. If we rounded 2.7 to 2, but -2.7 to -3, we would lose symmetry.
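For completeness, here is what the two flavours of shift might look like; these helpers are illustrative, not the module's actual code. Note that in C an arithmetic right shift of a negative number rounds towards minus infinity, so rounding towards zero needs a small adjustment first.

/* round towards zero: bias negative values before the shift */
static __int128 shift_trunc(__int128 v, int s)
{
    if (v < 0)
        v += (((__int128)1 << s) - 1);
    return v >> s;
}

/* round to nearest (halves round upwards): add what would be
   a half after the shift */
static __int128 shift_nearest(__int128 v, int s)
{
    return (v + ((__int128)1 << (s - 1))) >> s;
}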