# The Pi's Core

Following on from the rough overview of the SoC, herewith a rough overview of the cores themselves.

Again this page is only an approximation to the truth, but it shows some of the key points.

### A53 / A72 / A76 Core

Whilst there are many differences between the A53 core of the Pi 3 (and Pi 2 V1.2), the A72 of the Pi 4, and the A76 of the Pi 5, at this level of approximation they are almost indistinguishable.

In red are highlighted the "functional units". These are parts of the core which are each dedicated to a few simple operations. One might say they are where the real work gets done.

Note that the integer and floating point / vector parts of the core are so distinct that one could imagine a core without any of the floating point / vector parts. Indeed such cores exist, and are used in some embedded systems and low-end phones. The A7 used in the original Pi 2 is a step towards this position, for it has just a single FP & vector functional unit. It also lacks the 64 bit extensions of the A53 and A72.

The arrows show the direction of data movement, and are bidirectional on the load unit as that unit can also move data between the integer and FP/vector registers.

The above diagram shows two important points. The functional units are independent, so that they can operate simultaneously. And they obtain their operands from registers, and write their results to registers. Only the load and store units access memory, so to add two numbers currently in memory and put the result back into memory takes four instructions: two loads of the operands from memory into two registers, one add, which will use the integer or floating-point functional unit as appropriate, and then one store.
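The four-instruction sequence described above can be sketched as a toy load/store machine. This is only an illustration: the register names (`r0` to `r2`), the memory labels (`x`, `y`, `z`) and the values are invented for the example, and the Python functions stand in for instructions rather than reproducing real ARM assembly.

```python
# Toy model of a load/store architecture (illustrative only).
# Memory labels, register names and values are hypothetical.
memory = {"x": 3, "y": 4, "z": 0}
registers = {}


def load(reg, addr):
    """Load unit: copy a value from memory into a register."""
    registers[reg] = memory[addr]


def add(dst, a, b):
    """Integer unit: operands come from registers, result goes to a register."""
    registers[dst] = registers[a] + registers[b]


def store(reg, addr):
    """Store unit: copy a value from a register back to memory."""
    memory[addr] = registers[reg]


# z = x + y, where x, y and z all live in memory.
program = [
    (load, ("r0", "x")),
    (load, ("r1", "y")),
    (add, ("r2", "r0", "r1")),
    (store, ("r2", "z")),
]

for instruction, operands in program:
    instruction(*operands)

instruction_count = len(program)
print(memory["z"], instruction_count)
```

Only `load` and `store` ever touch `memory`; `add` sees registers alone, which is why the sequence cannot be shortened below four instructions in this model.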

For most instructions, each functional unit can accept a new instruction every clock cycle. The main exceptions are integer and floating point division, and floating point square root: after starting one of those, it can be ten to twenty clock cycles before the functional unit can accept another instruction. However, even when a unit can start a new instruction every cycle, it may take several clock cycles before the answer appears.

The simplest instructions, such as integer addition, do execute in a single clock cycle in the functional unit. They are said to have a latency of one cycle. A floating point add or multiply, however, takes around four cycles. So if one has a set of additions or multiplications which are independent, that is, the operands (inputs) of each do not depend on the results of the others, the functional unit can start one on every clock cycle. If the operands do depend on previous results (c=a+b, e=c+d), then the second must wait for the first to finish completely. A floating point adder might be said to have a latency of four cycles, but a throughput of one cycle (as it can start a new addition every cycle).
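The difference between independent and dependent operations can be put into numbers with a very simple model. The sketch below assumes a single, ideally pipelined functional unit and ignores everything else in the core; the function names and the `interval` parameter (cycles between successive starts) are introduced here purely for illustration.

```python
def cycles_independent(n, latency, interval=1):
    """Cycles for n independent operations on one pipelined unit.

    A new operation starts every `interval` cycles, and the last one
    finishes `latency` cycles after it starts. Assumes n >= 1.
    """
    return (n - 1) * interval + latency


def cycles_dependent(n, latency):
    """Cycles for a chain of n operations, each needing the previous result.

    Each operation must wait for the one before it to finish completely.
    """
    return n * latency


FP_ADD_LATENCY = 4  # "around four cycles", as quoted in the text

independent = cycles_independent(8, FP_ADD_LATENCY)
dependent = cycles_dependent(8, FP_ADD_LATENCY)
print(independent, dependent)
```

In this model eight independent additions finish in 11 cycles, whereas a chain of eight dependent additions needs 32, even though the same functional unit does the same amount of arithmetic.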

Why are some functional units duplicated? The core can issue multiple instructions on a single clock cycle, provided that they target different functional units, and their operands are independent. The A53 can issue instructions to two functional units at every clock cycle, the A72 up to five, and the A76 up to eight (a branch unit is not shown on the above diagram).

And whilst the functional units' throughputs on the A53 and A72 are mostly identical, the latencies on the A72 tend to be lower, which produces performance improvements if there are not large numbers of independent operations to perform. The latencies on the A76 are lower still.

### Peak Floating Point Performance

Like many modern CPUs, these ARM CPUs support a "fused multiply-add" (FMA) instruction. One of its forms is the core of a vector dot-product: a=a+b*c as a single instruction. The A72 can start two FMA instructions on each clock cycle, one in each of its FP/vector units. Each will ultimately execute two floating point operations, one add and one multiply, so the peak performance is four floating point operations per clock cycle (or 6 GFLOPS per core at 1.5GHz).
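The arithmetic behind that figure is short enough to write out. The constants below are simply the numbers quoted in the paragraph above; the variable names are chosen for this example.

```python
# Peak scalar FMA rate for one A72 core, using the figures from the text.
FMA_UNITS = 2        # FP/vector units, each starting one FMA per cycle
FLOPS_PER_FMA = 2    # one multiply plus one add
CLOCK_GHZ = 1.5      # Pi 4 clock speed quoted above

flops_per_cycle = FMA_UNITS * FLOPS_PER_FMA
peak_gflops = flops_per_cycle * CLOCK_GHZ
print(flops_per_cycle, peak_gflops)
```

This is a theoretical ceiling: it assumes both units start an FMA on every single cycle, which in turn requires a steady supply of independent operations.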

In the case of both the A53 and A72 the FP functional units can work on 64 bits of floating point data at once. The vector registers contain 128 bits (two double precision numbers, or four single precision) each. If a register is only half-full, then the FP functional units can accept a new instruction every clock cycle. If it is completely full, then a new instruction can be accepted only every other cycle: the throughput, in the sense used earlier, is two cycles.

So with double precision data in the vectors, there are two ways of reaching the peak. If the vectors are half-full, each of the two functional units performs two operations (one FMA) per cycle. If the vectors are full, each of the two functional units performs four operations (one FMA on a two-element vector) every two cycles. Either way, the answer is four double precision floating point operations per clock cycle. But, if single precision is used, the answer is eight.
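The half-full and full cases can be checked with one small formula. The sketch below is a simplified model, not a description of the hardware: `lanes` is the number of elements in the vector, `issue_interval` is the number of cycles between instructions a unit can accept, and both names are invented for this example.

```python
def flops_per_cycle(lanes, issue_interval, units=2, flops_per_fma=2):
    """Peak FMA rate in floating point operations per clock cycle.

    lanes:          elements per vector instruction
    issue_interval: cycles between instructions accepted by one unit
    units:          number of FP/vector functional units
    flops_per_fma:  one multiply plus one add per element
    """
    return units * flops_per_fma * lanes / issue_interval


# Double precision: one 64-bit value (half-full) or two (full register).
half_full_double = flops_per_cycle(lanes=1, issue_interval=1)
full_double = flops_per_cycle(lanes=2, issue_interval=2)

# Single precision: two 32-bit values (half-full) or four (full register).
half_full_single = flops_per_cycle(lanes=2, issue_interval=1)
full_single = flops_per_cycle(lanes=4, issue_interval=2)

print(half_full_double, full_double, half_full_single, full_single)
```

Filling the register doubles the work per instruction but halves the rate at which instructions are accepted, so the two effects cancel; only the change of element size from 64 to 32 bits alters the result.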

On the A72, a very small number of vector instructions can be accepted every cycle when operating on a vector containing two double precision values, most importantly the load instruction. On the later A76 of the Pi 5, the floating point add, multiply and FMA operations can all be accepted every cycle on a vector containing two doubles. So its theoretical performance is eight floating point operations per clock cycle (or 19.2 GFLOPS per core at 2.4GHz).

### The Vector Unit

The vector unit is quite sophisticated, being capable of processing both vectors of integers, and vectors of floating point data. It can operate on vectors formed of:

- Eight or sixteen eight-bit integers (bytes)
- Four or eight sixteen-bit integers ("words")
- Two or four 32-bit integers ("double words")
- One or two 64-bit integers ("quad words")
- Two or four single precision (32-bit) floating point values
- One or two double precision (64-bit) floating point values (aarch64 only)

And it can also operate on floating point scalars, single or double precision.
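The element counts in the list above all follow from dividing the register width in use (64 or 128 bits) by the element size. A short sketch, with names chosen for this example, reproduces them:

```python
REGISTER_WIDTHS = (64, 128)    # half-full and full vector register, in bits
ELEMENT_BITS = (8, 16, 32, 64)


def lane_counts(element_bits):
    """Number of elements that fit in a half-full and a full register."""
    return tuple(width // element_bits for width in REGISTER_WIDTHS)


table = {bits: lane_counts(bits) for bits in ELEMENT_BITS}

for bits, (half, full) in table.items():
    print(f"{bits:2d}-bit elements: {half} or {full} per vector")
```

The same counts apply to floating point data: 32-bit single precision gives two or four elements, and 64-bit double precision one or two.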

The integer vector operations available include add, multiply, shift, min, max, logical and compare (but not divide), and the floating point vector operations include add, multiply, multiply-add, divide (aarch64 only), min, max and compare.

Floating point scalar operations include divide and square root, even in aarch32 mode.

The two FP & vector units are not quite identical: only one can do integer multiplication and floating point division, and only the other can do floating point (scalar) square roots.

### AArch32 or AArch64

If the CPU is operating in its 32 bit mode, then sixteen 32-bit integer registers, and sixteen 128-bit vector registers, are available. The functional units referred to above as "FP & vector" support operations on vectors of 64 or 128 bits formed of 8, 16, 32 or 64 bit integers, or of 32-bit (single precision) floating point values. They also support single- and double-precision floating point operations on scalars.

If the CPU is operating in its 64 bit mode, then it is not limited to the features which would have been available on the older generation of CPUs, but has access to thirty-two 64-bit registers, and thirty-two 128-bit vector registers. The "FP" functional units now also support operations on vectors of 64-bit double precision values.

ARM calls these two modes AArch32 and AArch64.
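To give a feel for the difference, the register counts quoted above can be turned into total register storage per mode. This is just arithmetic on the figures in this section; the dictionary layout and function name are invented for the example.

```python
# Register counts and widths as quoted in the text above.
MODES = {
    "AArch32": {"int_regs": 16, "int_bits": 32, "vec_regs": 16, "vec_bits": 128},
    "AArch64": {"int_regs": 32, "int_bits": 64, "vec_regs": 32, "vec_bits": 128},
}


def register_bytes(mode):
    """Return (integer, vector) register storage in bytes for a mode."""
    m = MODES[mode]
    integer = m["int_regs"] * m["int_bits"] // 8
    vector = m["vec_regs"] * m["vec_bits"] // 8
    return integer, vector


for name in MODES:
    integer, vector = register_bytes(name)
    print(f"{name}: {integer} bytes integer, {vector} bytes vector")
```

On these figures, AArch64 offers four times the integer register storage and twice the vector register storage of AArch32, which gives more room to keep operands in registers rather than in memory.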