Memory Latency

Memory latency, the time taken to respond to a request for a random address, is notoriously difficult to measure. In the first level cache it is likely to be so small that it is hard to separate from the other overheads of the measurement. In main memory, one needs to decide whether TLB misses are to be included in the measurement or not.

That said, if one runs the same code on multiple computers, comparisons are probably valid.

The results need not be particularly closely related to the bandwidth results from programs such as Stream. Bandwidth measures a very long series of consecutive memory accesses, not random accesses.

The code I have used here is quite unpublishable, but it does tend to confirm that the Pi is a little compromised by its low power design.
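For readers who want to try something similar, what follows is not the code behind the figures on this page, merely a minimal sketch of one common technique: a pointer chase. The array is filled with a single random cycle, and each load supplies the index for the next, so the loads cannot overlap and the random order should defeat hardware prefetching. The names make_cycle, chase and measure_latency_ns are invented for this illustration, clock_gettime assumes a POSIX system, and for large arrays the result will include whatever TLB misses occur.

```c
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime (assumes POSIX) */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Small xorshift64 generator, so the sketch does not depend on RAND_MAX. */
static uint64_t rng_state = 88172645463325252ULL;

static uint64_t xorshift64(void)
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

/* Fill next[] with one random cycle through 0..n-1 (Sattolo's algorithm),
   so that following next[] visits every element before returning to the start. */
void make_cycle(size_t *next, size_t n)
{
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    if (n < 2)
        return;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(xorshift64() % i);   /* 0 <= j < i */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for the given number of steps. Each load depends on
   the result of the previous one, so the loads are serialised. */
size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t p = start;
    for (size_t i = 0; i < steps; i++)
        p = next[p];
    return p;
}

/* Average time per dependent load, in ns, for a block of the given size.
   Returns a negative value if the block is too small or allocation fails. */
double measure_latency_ns(size_t bytes, size_t steps)
{
    size_t n = bytes / sizeof(size_t);
    if (n < 2 || steps == 0)
        return -1.0;
    size_t *next = malloc(n * sizeof *next);
    if (next == NULL)
        return -1.0;
    make_cycle(next, n);

    /* Warm up: one full trip round the cycle touches every element. */
    volatile size_t sink = chase(next, 0, n);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    sink = chase(next, sink, steps);   /* volatile sink keeps the loop alive */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(next);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)steps;
}
```

Calling measure_latency_ns(16 * 1024, 100000000) would time a block well within a 32KB level one cache, and repeating the call with larger sizes sweeps out through the level 2 cache to main memory. The loop overhead is included in the result, which is precisely the difficulty with the smallest sizes mentioned above.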

Within the level one cache the code reports 3.33ns, compared to 1.79ns on a 2.4GHz Core2 from 2007. Not too bad considering that the Pi's clock speed is just 1.5GHz. Both have 32KB level one data caches.
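To see why the clock speed matters to this comparison, the times can be converted to clock cycles. The helper below is purely illustrative (the name ns_to_cycles is invented here), and it assumes each core really is running at its nominal clock speed during the measurement.

```c
/* Convert a latency in nanoseconds to core clock cycles.
   A clock of f GHz ticks f times per nanosecond. */
double ns_to_cycles(double ns, double clock_ghz)
{
    return ns * clock_ghz;
}
```

With the figures above, ns_to_cycles(3.33, 1.5) is about 5.0 cycles for the Pi, and ns_to_cycles(1.79, 2.4) is about 4.3 cycles for the Core2, so measured in cycles the two level one caches are much closer than the raw times suggest.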

The shared level 2 cache is a different matter. The Pi has 1MB shared between all cores, and the Core2 (a Q6600) 4MB per pair of cores. The Pi has a latency of around 18ns, the Core2 around 9ns. So the Core2's cache is at least four times as big, and twice as fast.

For random access to main memory, the Pi 4 manages around 170ns for a 64MB block, rising to 270ns for a 1GB block. The figures for the Core2 are around 100ns and 140ns respectively.

So whereas the Pi 4 is quite similar to the Core2 Q6600 on main memory bandwidth, it is losing on latency.

Note that in both cases the latency seen when jumping randomly around a 1GB block of memory is roughly eighty times higher than when jumping randomly around a 16KB block of memory (i.e. well within the first level cache): taking the figures above, 270ns against 3.33ns on the Pi, and 140ns against 1.79ns on the Core2. This sort of slowdown would not be seen on cacheless computers like the vector Crays, and is one reason why one has to be careful when comparing the performance of different computers.

Also note that if a benchmark is particularly sensitive to caches, then the results of any comparison can be very sensitive to the data size chosen. If one were to quote solely a figure for random accesses to a 2MB array, this is within the level 2 cache on the Core2, so it would report about 9ns. For the Pi, it is not within the level 2 cache, and it reports 90ns. (This figure might be understood by assuming that, for random access to a 2MB array, about half the time the data will be in the 1MB already cached, so one expects something like the average of the cache latency (18ns) and the main memory latency (around 160ns).) For most other sizes the performance difference between the Pi and the Core2 is around a factor of two, not ten.
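The rough estimate in the parenthesis above can be written as a small model. This is only a sketch under a simplifying assumption: accesses are uniformly random, and the fraction that hit the cache is simply the cache size divided by the array size. The function name expected_latency_ns and its parameters are invented for this illustration.

```c
/* Expected latency for uniformly random access to an array, assuming
   the fraction of accesses hitting the cache is cache size / array size
   (or all of them, if the array fits in the cache). */
double expected_latency_ns(double array_bytes, double cache_bytes,
                           double cache_ns, double memory_ns)
{
    double hit = cache_bytes >= array_bytes ? 1.0
                                            : cache_bytes / array_bytes;
    return hit * cache_ns + (1.0 - hit) * memory_ns;
}
```

For the Pi, a 2MB array with a 1MB cache at 18ns and main memory at 160ns gives 0.5 x 18 + 0.5 x 160 = 89ns, in line with the 90ns reported. For the Core2, the 2MB array fits within its 4MB cache, so the model simply returns the cache latency of 9ns.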