Whereas Linpack is a venerable benchmark measuring floating point performance, a similarly old benchmark which measures memory performance is Stream by Dr John McCalpin. It has results dating back to 1991.

Stream measures how rapidly large arrays can be streamed from main memory, through the floating point unit, and back to memory. It runs four different kernels:

  COPY:       a(i) = b(i)
  SCALE:      a(i) = q*b(i)
  SUM:        a(i) = b(i)+c(i)
  TRIAD:      a(i) = b(i)+q*c(i)   

where q is a scalar constant, the arrays are of double-precision floating point numbers, and the arrays are chosen to be large enough not to fit in any cache. (The SUM kernel is labelled "Add" in the program's output.)

The code is available in both Fortran and C, and here the C version will be used, to save having to install a Fortran compiler.

$ curl -O
$ gcc -O3 -o stream stream.c
$ ./stream
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            5527.3     0.028977     0.028947     0.029019
Scale:           5561.9     0.028814     0.028767     0.028869
Add:             4855.2     0.049469     0.049432     0.049523
Triad:           4844.1     0.049590     0.049545     0.049665

One usually quotes the fastest of the four kernels (and which is the fastest will differ on different computers), so here we would claim to have achieved 5560MB/s.

Shared Video Memory

Your results might not be as good. Why? I used a Pi with no screen attached, logging in remotely from another computer. If we look at a rough hardware diagram for the Pi we find the GPU has no dedicated memory at all, and uses the main memory for everything, including its frame buffer (the part of the video memory which simply stores the pixels visible on the screen, rather than textures and GL code).

In order to create a 60Hz video signal with a 1920x1080 resolution display, and with 24 bits (3 bytes) per pixel, one needs a memory bandwidth of 60*1920*1080*3 bytes/s, or 370MB/s. When using a Pi which was also producing a video signal my best Stream result dropped to 5160MB/s, a drop of 400MB/s. The Pi 4 can drive two 4K displays, but this will surely impact its performance.

This effect is not novel. I have seen it in some ASUS PCs from around 2006, and some early home computers had more serious problems mixing video output and computation. Here we have a performance drop of under 10% on a benchmark designed to be almost maximally sensitive to the issue. Many benchmarks will show no performance drop at all.

(If you are not able to log in remotely, it is possible to turn off the video output with

  $ xset dpms force off

It will resume on any mouse movement or keypress. I found the above a little unreliable, but was successful with this benchmark by launching it as:

  $ xset dpms force off; sleep 5; xset dpms force off; sleep 1; ./stream

and then waiting twenty seconds before tapping the mouse to resume the video output. Note that modern mice are very sensitive, so the slightest vibration may disturb this experiment by re-enabling the video output.)

Memory or Floating Point?

How do we know that we have really measured the memory performance, and not the floating point performance? The "scale" operation, which achieved the best result here, has one floating point operation per loop iteration, along with one memory read, and one memory write. Given that a double precision number is eight bytes long, that is sixteen bytes of memory traffic per floating point operation. So we achieved 5560MB/s and 350 MFLOPS using one core.

The Pi 4 has a 32 bit data bus to its memory, which is LPDDR4/3200. This means that its theoretical memory bandwidth is 3200 Mega Transfers per second times four bytes per transfer, so 12,800MB/s. It has a clock speed of 1.5GHz and is theoretically capable of sustaining two multiplications per clock cycle per core, so 3 GFLOPS for pure multiplications. So this benchmark has achieved about 43% of the theoretical memory bandwidth and just under 12% of the theoretical floating point performance. It is more likely that the memory is the constraint than the floating point performance. And, indeed, removing the floating point operation to result in a pure copy produced no increase in speed.

Why so slow?

The bus from the memory to the SoC is 32 bits and 3200MT/s. What of the bus from the memory controller on the SoC back to the L2 cache on the SoC? I have not seen any clear description of this. Could it be 128 bits and 500MT/s? This would be 8GB/s, which is sufficient to produce the Stream result of 5.56GB/s, and it would also be consistent with on-line comments that overclocking the RAM on a Pi 4 is pointless.

More cores?

With one core we have already reached about 43% of the theoretical maximum bandwidth for the whole CPU. No matter how we use the four cores, we are not going to be able to make it four times faster.

The version of the benchmark used can use multiple threads, and thus multiple cores. It just needs to be recompiled.

$ gcc -fopenmp -O3 -o stream_mp stream.c
$ ./stream_mp
Number of Threads requested = 4
Number of Threads counted = 4
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            3720.7     0.044148     0.043003     0.045270
Scale:           3773.6     0.043703     0.042400     0.044798
Add:             3266.5     0.074496     0.073473     0.075900
Triad:           3350.2     0.072127     0.071637     0.072924

Oh. The extra cores simply get in each other's way, and the result is significantly slower.

$ OMP_NUM_THREADS=2 ./stream_mp 
Number of Threads requested = 2
Number of Threads counted = 2
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            4570.4     0.036084     0.035008     0.038689
Scale:           3665.6     0.044036     0.043649     0.044909
Add:             3976.7     0.060383     0.060351     0.060438
Triad:           4069.4     0.059031     0.058976     0.059130

So two cores are better than four, but not as good as one. And one can try the single core again with

$ OMP_NUM_THREADS=1 ./stream_mp 

Other Results

If we consider early supercomputers, the Stream result is not as good as the Linpack one. A Cray Y-MP/8, using all eight of its processors, scored around 2.1 GFLOPS on Linpack, something which the Pi 4 can beat using just one core, and beat by at least a factor of three when using all four. But the Stream score of the Y-MP/8 was 26,800MB/s, almost five times faster than the Pi 4.

If we consider Intel-based computers, the Stream result is in a similar category to the Linpack one. A Pentium III achieved around 400MB/s on Stream, so is easily beaten. A Pentium 4 could achieve up to about 4500MB/s in its later revisions, so is quite similar to the Pi 4.

The Pentium 4's successor, the Core2, was often considered to be a little disappointing on memory performance. I still have access to a quad core 2.4GHz Core 2 (a Q6600), and its scores are:

Number of Threads counted = 1
Function    Best Rate MB/s
Copy:            5281.4
Scale:           3958.4
Add:             4772.1
Triad:           4753.6

Number of Threads counted = 4
Function    Best Rate MB/s
Copy:            4980.2
Scale:           4356.6
Add:             4829.2
Triad:           4871.3  

so a slight win for the Pi 4 for single-core performance, and a slight loss for quad-core performance, compared to this CPU launched in 2007.

A more contemporary Intel desktop, a Kaby Lake, which uses DDR4/2400 memory, manages a score of 32,000MB/s. Whereas the Pi 4 has a 32 bit data bus to its memory, the Kaby Lake has a pair of 64 bit buses, making its memory interface four times wider. Allowing for its slower 2400MT/s transfer rate, its theoretical bandwidth of 38,400MB/s is three times that of the Pi 4.

Some high-end Intel desktops use four 64 bit memory buses and can achieve over 60,000MB/s.