# Single Mersenne Primes: Cray vs Pi

The Lucas-Lehmer code given on the faster Mersenne Primes page can easily be modified to test just a single Mersenne exponent.

```#!/usr/bin/python3
from math import sqrt
from sys import exit

p=int(input())

# See if p is prime

if (p%2==0):
print("%d is not prime"%(p))
exit(1)

for f in range (3,int(sqrt(p))+1,2):
if (p%f==0):
print("%d is not prime"%(p))
exit(1)
Mp=(1<<p)-1
s=4
for i in range (1,p-1):
s=(s*s-2)
if (s<0): s+=Mp
s=(s&Mp)+(s>>p)
s=(s&Mp)+(s>>p)
if (s==Mp): s=0
if (s==0):
print("2^%d-1 is prime!"%(p))
else:
print("2^%d-1 is not prime!"%(p))

```

This can then be used to re-test some historical finds.

The two Mersenne primes found using a Cray 1 had exponents 44497 and 86243. The first of these takes about 135 seconds to confirm with the above code, the second 730 seconds. The Cray 1 took about 20 minutes to show 244497-1 prime, having spent over 300 hours eliminating other candidates (Computerworld, 2nd July 1979).

In the early 1980s when those primes were first discovered, python had not been invented. Python first appeared in 1990, but did not become that popular until after the appearance of Python 2 in 2000. Most numerical programming in the early 1980s was in Fortran, and not the modern Fortran 90 style, but the older and less friendly Fortran 77 (or, worse, Fortran 66).

Fortran does not offer arbitrary precision arithmetic, so one has to code one's own. Python switches between a multiplication algorithm which takes order n2 time to multiply two n-digit numbers to one which is only order n1.59. But for very large numbers it is worth switching to yet another algorithm which is only order n log(n) (but with a very large pre-factor which makes other algorithms faster for moderate sizes).

As I already had some Fortran 77 code which performs Mersenne testing, and as it is in the style that would have been used in the 1980s, it seems reasonable to present it here.

mersenne_single.f

This code does not include code for the FFT it needs for its multiplication step, so to compile it not only does one need a Fortran compiler, but also the development version of the FFTW3 library.

```  \$ sudo apt-get install gfortran libfftw3-dev
```

should install all the required software on a Pi or a Linux machine. (Mac users are out of luck at this point.)

```  \$ curl -O http://www.sci-pi.org.uk/maths/mersenne_single.f
\$ gfortran -O3 -ffast-math mersenne_single.f -lfftw3
\$ ./a.out
```

This code takes 14 seconds to confirm 44497, and 61 seconds for 86243. So not only is is signficantly faster than the python version at 44497, but whereas the python version took over five times longer to move up to 86243, the FFT-based version takes just under five times longer.

The next largest exponent, 110503, is more interesting, for there is a good record that it took an NEC SX/2 686 seconds to test in 1988. Our python code will not beat that on a Pi. The Fortran code does: 84 seconds. This Mersenne prime was found out of sequence, in that two larger ones had been found earlier. It had taken over three hours on a Cray X-MP in 1985 to prove 2^216,091-1 prime. The Pi 4 takes only six minutes.

There is a better historical record for the exponent 756,839. It was found in 1992 on a Cray 2 in Harwell. It took 19 hours to test. The test was then repeated on the new Cray C90 in the US. That took just over three hours to repeat the result (ref). The Pi 4 can achieve it in just over one hour and forty-five minutes.

The exponent 859,433 takes just under two hours to test on a Pi 4, but is recorded as 7.2 hours on a Cray C90. In a letter dated 1994 and quoted in The New Book of Prime Number Records (by Paulo Ribenboim) David Slowinski writes "M859433 was tested in a total of 25,924 cpu seconds. With all sixteen processors working together I can test M859433 in under a half-hour. To my knowledge, this is still the fastest implementation of the Lucas-Lehmer test in the world even though there are massively parallel computers with peak megaflops an order of magnitude higher than our 1991 C916." Our code on the Pi 4 makes no attempt to use multiple cores (and using multiple CPUs is rather easier on the cacheless C90).

The exponent 1,257,787 gives the last Mersenne prime to be found by a Cray. A test running on a single CPU of a Cray T90 took about six hours to prove primality in 1996. The above code on a Pi 4 takes seven and a half hours. So the Pi is beaten at last. But not by very much. If as much time had been spent optimising the above code for the Pi as was spent on the code used on the Cray, and if the Pi were switched to running in 64 bit mode, it might catch up.

## Cray vs Pi

What is one to make of the above comparisons? The code used for the Cray was more carefully tuned and optimised (mostly by David Slowinski) than that I have written for the Pi. In some ways the C90 and the Pi 4 are similar. Both have a maximum memory capacity of 8GB (or 1024MW in Cray terminology, where a "word" is eight bytes). The C90 has sixteen CPUs running at 244MHz each capable of two adds and two multiplies per clock cycle, for a peak performance of 16x0.244x4=15.6GFLOPS. The Pi  has just four cores, but each is also capable of four floating point operations per clock cycle, and runs at 1.5GHz, so, in terms of peak performance, it is a little ahead at 24GFLOPS. However, a single core of the Pi 4 would be expected to be slower than all sixteen CPUs of a C90, and that seems to be the case, at least for testing M859433.

However, in one area the C90 wins easily: memory bandwidth. Its Stream result is just over 100GB/s with 16 CPUs, whereas the Pi 4 manages about 5.5GB/s. Even using just a single CPU the C90 can achieve around 9.5GB/s.

The other areas in which the C90 wins are not ones to be very proud of. It weighed six tons, consumed around 500kW of electricity, and had a list price of around \$30m. The cost, at today's prices in most of the world, of the electricity it consumed in proving 2^756839-1 prime would buy a Pi 4!

It seems reasonable to conclude that almost any calculation a scientist could perform on a Cray C90 someone can now perform on a Raspberry Pi 4. This is particularly true as a C90 is likely to be shared with hundreds of others, whereas one is likely to be able to access a larger fraction of a Pi.

The Cray T90 series was in many ways very similar to the C90, save that the clock speed (usually quoted as a period in ns, rather than a frequency in MHz) was 2.2ns rather than 4.1ns.

Finally one should be aware of the impact of caches on these comparisons. The Crays had none, whereas the Pi 4 has a 1MB cache. This means that the Pi may become significantly slower when working on data sets larger than 1MB in size, whereas the Crays will not. The larger tests above are starting to enter the danger zone for the Pi, as the code uses five double-precision arrays (eight bytes per element), and their length for testing 1,257,787 is 65,536 elements. So each individual array is below 1MB, but their sum is not. The smaller exponent of 859,433 required arrays of length 49,152 elements.

## Cray vs PC

The last prime found by a Cray (or any supercomputer) was that with the exponent 1,257,787 found in 1996 after a six hour test on a single CPU of a Cray T94. Since then the largest primes have all been found by PCs, starting with a 90MHz Pentium which took 88 hours to test 1,398,269, with many of the others taking months to test.

Running the Lucas-Lehmer test is much cheaper on a PC than on a Cray, and some major changes in the floating point performance of PCs facilitated this. A 90MHz Pentium from 1994 might sound only about three times faster than a 33MHz i386 with floating point coprocessor from 1985. However, the i386, even with its coprocessor, will fail to sustain 1 MFLOPS. The interface between the processor and the coprocessor was so poor that almost no floating point instruction completed in under 20 clock cycles, including even simple register to register copies. Multiplication took nearly 40 clock cycles. The Pentium had something much closer to a current floating point unit design. It could issue one floating point instruction every cycle, even if those instructions took three or four cycles to complete. So the floating point performance of a 90MHz Pentium was almost a hundred times that of a 33MHz i386 with coprocessor, and the performance gap with a Cray CPU was closing.

## Cray vs 64 bit Pis

At the time of writing, the official 64 bit verison of Raspberry Pi OS is only beta testing, and still a little rough at the edges. It is expected to be faster when using large integers in python, as the internal representation of those numbers will use half as many digits. And indeed it is faster. Testing 44,497 drops from 135s to 57s, and 86,243 drops from 730s to 314s. So it is worth seeing if 110,503 can be tested in under the 686s of the NEC SX/2. It can, now taking 612s.

It is more surprising that the Fortran code also runs about twice as fast in 64 bit mode. Testing 110,503 drops from 84s to 39s. The exponent 216,091 is dispatched in 3 minutes 8 seconds.

I claim that the peak performance of a C90's processor is 0.976 GFLOPS, whereas for a core of the Pi 4 it is 6 GFLOPS. So one might expect about a factor of six difference in the runtime for similar code. The C90 took 432 minutes to test 859,433 on a single CPU. The Pi 4 in 64 bit mode takes 72 minutes. That the ratio of run-times is quite so close to the expected figure is co-incidence.

It might seem that the "about six hours" that the T90 took to test 1,257,787 will easily be beaten in 64 bit mode if it took seven and a half hours in 32 bit mode. So far, 64 bit mode appears to be about twice as fast. Whilst the Pi 4 does complete the test in under six hours, it is less than ten minutes under six hours, and not the under four hours one might have anticipated.

Testing 110,503 was 2.1x faster in 64 bit mode, testing 859,433 was only about 1.6x faster, and testing 1,257,787 is about 1.3x faster. There could be many reasons for this, including the larger numbers being more constrained by memory performance rather than CPU performance. In this last test, a 6 GFLOPS Pi 4 core achieves parity with a 1.8 GFLOPS Cray T90 CPU.

## Pi vs PC

I have access to a 3.4GHz 64 bit Pentium 4, released in 2005. The clock speed advantage over the 1.5GHz Pi 4 is significant, but it completes 44497 in 63s to the Pi 4's 57s in 64 bit mode.

The FFT-based Fortran version also favours the Pi 4, with the Pentium 4 taking 60s over 110,503, whereas the Pi 4 took 39s.

The slightly more recent Core2 does better, despite a lower clock speed. The python code for 44497 completes in 50s, and the Fortran code for 110,503 takes 37s, for a CPU running at just 2.13GHz.

One of the fastest machines for serial code I currently have access to is an AMD Ryzen 5 3600 (Zen2-based). The python test is dispatched in under 17s, and the Fortran test in 12s. Its clock speed can reach 4.2GHz, and its performance advantage over the other processors here is more than the clock speed increase alone would suggest.