A Supercomputer?

The widely accepted ranking of supercomputers is that given in the Top500 list. This list is published twice each year, and first appeared in June 1993. It was based on an older list published by Dongarra in 1987.

So what is the Top500 list? It simply ranks computers by performance on a single benchmark, Linpack. And what is Linpack? It is a benchmark which solves a set of simultaneous linear equations, such as

   x +  y + 2z = 6
  2x - 3y -  z = 2
  -x + 2y - 3z = 4

The vendor is free to choose the size of the problem. The original 1987 list fixed it at a set of a thousand equations and hence a thousand variables (rather than the three in the above example). The use of double precision arithmetic is expected.

So that the result can be checked easily, the coefficients on the left-hand side are generated randomly, but the constants on the right-hand side are then calculated with all the unknowns set to one. So the correct answer is that all the unknowns, expressed as a vector x[0] to x[n-1], are one.
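The self-checking construction described above can be sketched in a few lines of pure Python. This is only an illustration: the function names make_system and solve are invented here, and the solver is a textbook Gaussian elimination with partial pivoting rather than the optimised library routine a real benchmark run would use.

```python
import random


def make_system(n, seed=0):
    """Random coefficients; right-hand side chosen so every unknown is 1."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [sum(row) for row in a]  # A times a vector of ones is the row sums
    return a, b


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution, last unknown first.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


a, b = make_system(50)
x = solve(a, b)
worst = max(abs(xi - 1.0) for xi in x)
print("largest deviation from one:", worst)
```

The same solve function applied to the three-equation example above, as solve([[1, 1, 2], [2, -3, -1], [-1, 2, -3]], [6, 2, 4]), gives x = 5, y = 3, z = -1 up to rounding.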

Linpack on Pi

Can we run Linpack on a Raspberry Pi? Yes! Linpack is available in many languages, and here we will run it in Python.

  ~$ mkdir linpack
  ~$ cd linpack
  ~/linpack$ curl -O http://www.sci-pi.org.uk/bench/linpack.py
  ~/linpack$ chmod +x linpack.py
  ~/linpack$ ./linpack.py
  Running Linpack  2000 x 2000 

Residual is  4.256150987203e-12
Normalised residual is  19.171851934146027
Machine epsilon is  2.22e-16
x[0]-1 is  1.0795808691455022e-12
x[n-1]-1 is  -7.106537580625627e-13
Time is  6.576431035995483
MFLOPS:  812.1933164201133

So above we made a new directory, changed directory to work within it, downloaded a copy of the Linpack benchmark, marked it as an executable file, and executed it. We achieved, on a standard Pi 4, 812 MFLOPS.
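The MFLOPS figure is simply an operation count divided by the elapsed time. The sketch below assumes that linpack.py uses the customary Linpack count of 2n^3/3 + 2n^2 operations for n equations; the script itself is not reproduced here, and the function names are illustrative. The assumption is at least consistent with the run above, as it reproduces the 812 MFLOPS reported.

```python
def linpack_flops(n):
    """Customary operation count for solving n equations by LU factorisation."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2


def mflops(n, seconds):
    """Millions of floating point operations per second."""
    return linpack_flops(n) / seconds / 1.0e6


# Problem size and time taken from the run shown above.
print(round(mflops(2000, 6.576431035995483), 1))
```

Because the count grows as the cube of n, doubling the problem size makes the run take roughly eight times longer at the same MFLOPS rate.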

If we look at the 1987 table, the top two entries are an NEC SX-2 at 885 MFLOPS, and a Cray X-MP/4 at 713 MFLOPS (the /4 denotes the number of processors). The closest thing to a desktop computer in the 1987 list is a Sun 3/260 with optional Motorola 68881 maths co-processor. It took over an hour to run the benchmark, and achieved 0.17 MFLOPS.

If we look at the 1993 table, 0.8 GFLOPS would place the Pi 4 at about number 330 in the world.

In terms of Intel CPUs, this performance is similar to the fastest of the Pentium III generation (first released in 1999), but would easily be beaten by a Pentium 4 (first released late 2000).

Did we do well?

Is 812 MFLOPS a reasonable score for a Pi 4 released in 2019, when a single core 1.5GHz Pentium 4 released in 2000 scores about 2 GFLOPS? The only area where the Pi is winning is power consumption, for the Pentium 4 would have used around 50W for the CPU alone, whereas the Pi is managing under 15W for everything.

The Linpack benchmark we are using is actually a single call to a standard maths library called Lapack, which is supplied as part of the Raspbian OS. So the question is perhaps: is the default version of the Lapack library well optimised?

One indication comes from running top. Open a second window on the Pi, and in it type top. Then, in the first window, re-run linpack, this time as

  ./linpack.py 3000

so that it takes a little longer (about twenty seconds). Note that top shows linpack.py consuming 100% of a CPU, and also that the third line of the screen starts:

%Cpu(s): 25.0 us,  0.2 sy,  0.0 ni, 74.7 id

The 100% of a CPU figure actually means 100% of a CPU core, and the 74.7% idle figure refers to the fact that three of the four cores in the Pi are, indeed, idle. (One is presumably running top, but that will not take much time.)

(To quit top, type a q in its window.)

So 800 MFLOPS has been achieved using a single core. Still perhaps not brilliant, for the core runs at 1.5GHz, so it is taking almost two clock cycles for each floating point operation. Are there better versions of the Lapack library?

Upgrading Lapack

There are. One is OpenBLAS, which can be installed by typing

  $ sudo apt-get install libopenblas-dev

and type 'y' to confirm the installation of a couple of extra packages.

Now:

~/test $ ./linpack.py 
Running Linpack  2000 x 2000 

Residual is  1.2789769243681803e-12
Normalised residual is  5.761161493020757
Machine epsilon is  2.22e-16
x[0]-1 is  1.2856382625159313e-12
x[n-1]-1 is  2.1316282072803006e-14
Time is  1.699167013168335
MFLOPS:  3143.501075490907

And, to see what is happening by running top at the same time, it is reasonable to run a larger problem, as ./linpack.py 5000.

Now one should see linpack.py showing almost 400 in "%CPU" as it fully occupies all four cores. One should also find that the larger size of the benchmark produces a faster score, possibly around 6,500 MFLOPS. Well, it did in 2020. With the recent release of a non-beta 64-bit version of Raspberry Pi OS, one can do a lot better. The Pi 4 gives 9,950 MFLOPS, and the Pi 400 would surely exceed 10,000 MFLOPS.

New position

A score of 6,500 MFLOPS is sufficient to beat any single-core Pentium 4, for a single-core 3.4GHz Pentium 4 scores around 5,400 MFLOPS. It would improve the Pi's position on the 1993 list of the world's fastest computers from around 330 to around thirty-three. It now beats an eight-processor Cray Y-MP/8 by a factor of three, and would be fast enough to be on the Top500 list of November 1996.

Of course, we have not proved that this new Lapack library is as fast as it could possibly be on a Pi 4. It probably isn't. But it can be almost eight times faster than the default version.

One can experiment further, and run the new library on just a single core by using:

  $ OMP_NUM_THREADS=1 ./linpack.py 5000

Or one can choose to use two or three cores by substituting "2" or "3" for the "1". (One can check how many cores are in use with top, but note that the first part of the benchmark, setting up the array of coefficients, always runs on just a single core. This part is not part of the timed section.) Results may be something like 2,450 MFLOPS on one core, 4,200 MFLOPS on two cores, 5,800 MFLOPS on three cores, and 6,550 MFLOPS on all four. Not perfect parallelism, and, indeed, rather little gain in moving from three cores to four.
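How far these figures fall short of perfect parallelism can be put into numbers by computing the speedup over one core and the parallel efficiency (speedup divided by core count). The following is a small sketch using only the approximate figures quoted above; the function name scaling is illustrative.

```python
# MFLOPS for one to four cores, as quoted above.
results = {1: 2450.0, 2: 4200.0, 3: 5800.0, 4: 6550.0}


def scaling(results):
    """Return {cores: (speedup, efficiency)} relative to the one-core result."""
    base = results[1]
    return {cores: (rate / base, rate / base / cores)
            for cores, rate in results.items()}


for cores, (speedup, efficiency) in sorted(scaling(results).items()):
    # An efficiency of 100% would mean perfect scaling.
    print(f"{cores} cores: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

With these figures the efficiency falls from about 86% on two cores to about 67% on four, and the fourth core adds only 750 MFLOPS where the second added 1,750.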

Theory

Consideration of the floating point execution units in the A72 core suggests that the single core performance cannot exceed 6 GFLOPS when running at 1.5GHz, and therefore that all four cores cannot exceed 24 GFLOPS. For single precision, the theoretical performance is doubled. So far the best double precision scores I have managed are 4.3 GFLOPS on a single core, and 11 GFLOPS on all four, which would place the Pi 4 at number 22 on the 1993 Top500 list, and it would not drop out of the Top500 list until June 1998.

Other Results

A contemporary Intel CPU, such as a 3GHz quad core Kaby Lake, can achieve about 165 GFLOPS on Linpack, and 44 GFLOPS using a single core. Its clock speed is twice as high, its vector registers hold four double precision numbers not two, and whilst, like the Pi, it has two independent floating point execution units, those in the Kaby Lake can sustain one fused multiply-add operation per clock cycle on a four element vector. Those in the Pi 4 can sustain one fused multiply-add operation every other clock cycle on a two element vector. So the theoretical performance is eight times higher.
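The theoretical peaks for both processors follow from multiplying together the quantities just described, counting each fused multiply-add as two floating point operations. A minimal sketch of that arithmetic, with illustrative function and parameter names, is:

```python
def peak_gflops(clock_ghz, fp_units, vector_lanes, fma_per_cycle):
    """Peak double precision GFLOPS for one core; an FMA counts as two operations."""
    return clock_ghz * fp_units * vector_lanes * fma_per_cycle * 2


# Pi 4: two units, two-element vectors, one FMA every other cycle, 1.5GHz.
pi4_core = peak_gflops(1.5, fp_units=2, vector_lanes=2, fma_per_cycle=0.5)
# Kaby Lake: two units, four-element vectors, one FMA every cycle, 3GHz.
kaby_core = peak_gflops(3.0, fp_units=2, vector_lanes=4, fma_per_cycle=1.0)

print("Pi 4:     ", pi4_core, "GFLOPS per core,", 4 * pi4_core, "on four cores")
print("Kaby Lake:", kaby_core, "GFLOPS per core,", 4 * kaby_core, "on four cores")
print("ratio:    ", kaby_core / pi4_core)
```

On these figures the measured single core results are about 72% of peak for the Pi 4 (4.3 out of 6 GFLOPS) and about 92% of peak for the Kaby Lake (44 out of 48 GFLOPS).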

With a power rating of over 60W for the Intel CPU, its performance per watt is not better than the Pi's, given that the Pi's 15W includes the onboard peripherals and some extra power for USB devices too, as well as the basic CPU.

Reverting

It is hard to see why one would wish to revert to the default version of the BLAS and Lapack libraries, but for completeness the unmemorable commands are:

sudo update-alternatives --config liblapack.so.3-arm-linux-gnueabihf
sudo update-alternatives --config libblas.so.3-arm-linux-gnueabihf

It is possible that not setting both to the same version will cause issues.

(Those running Linux on x86_64 PCs will wish to replace -arm-linux-gnueabihf with -x86_64-linux-gnu, and those running a 64-bit Linux on ARM will wish to use the suffix -aarch64-linux-gnu.)