Secret sauce of ARM’s big.LITTLE multi-processor architecture
Modern mobile devices must support ever-increasing performance requirements within a relatively flat power budget. CPU designers have responded with a series of innovations to enable the required efficiency improvements: dynamic voltage and frequency scaling (DVFS), dynamic clock gating, domain power gating, state retention idle modes, and, more recently, heterogeneous compute.
ARM’s big.LITTLE technology is essentially a heterogeneous compute architecture that tightly combines two different but fully compatible CPU core types: a ‘big’ CPU tuned for maximum performance at good efficiency, and a ‘LITTLE’ CPU tuned for good performance at maximum efficiency.
Together, they form a highly efficient CPU subsystem that is capable of market-leading peak performance at lower average power consumption than other high-end mobile CPUs. Software is key to enabling this combination of performance and efficiency.
The current generation of ARM big.LITTLE technology integrates a Cortex-A15 CPU cluster and a Cortex-A7 CPU cluster using a cache coherent interconnect (CCI-400) and generic interrupt controller (GIC-400).
The two processors are architecturally identical, implementing the ARMv7-A architecture. Thus, an application binary compiled for a given processor will execute in an architecturally consistent way on the other processor. The key difference is in the micro-architecture.
The ‘big’ processor in the current generation is Cortex-A15 – it features out-of-order execution, sustained triple-issue, and a 15- to 24-stage pipeline.
The ‘LITTLE’ processor in the current generation is Cortex-A7 – it features in-order execution, partial dual-issue, and an 8- to 10-stage pipeline.
Since per-instruction energy consumption is a function of the pipeline length, the Cortex-A7 processor is significantly more energy efficient. Cortex-A7 consumes less than one-fifth the power of Cortex-A15, yet quad-core Cortex-A7 SoCs now shipping in production have been shown to deliver very good performance for entry-level smartphones and tablets.
The first big.LITTLE SoCs are now in the market. Several ARM partners have samples and development boards, and Samsung is shipping the first production SoC featuring big.LITTLE technology – the Exynos 5 Octa. The various SoCs differ in topology, but share the same instruction set architecture and look the same to user space code.
Under the hood, silicon vendors optimise big.LITTLE processors toward specific target markets through their choice of the number of big and LITTLE cores, the interconnect fabric, cache sizes, and back-end implementation. There are also differences in the software running on these various SoCs.
The software that decides whether to run threads and workloads on a big or LITTLE processor is relatively straightforward and completely invisible to the user, much the same way DVFS software and state transitions are invisible to users and mobile app developers now.
There are three primary architectures for the software:
• Cluster Migration
• CPU Migration
• Global Task Scheduling
CPU/Cluster Migration overview
The migration modes of big.LITTLE software extend existing DVFS mechanisms: execution moves to the high-performance cores when requirements exceed the maximum operating point of the LITTLE CPUs, and back down to the LITTLE cores when requirements fall below the lowest operating point of the big CPUs. In effect, the big cores add further entries to the top of the DVFS table.
In every production smartphone today, a DVFS driver controls voltage and frequency dynamically in response to processing demands. DVFS drivers like Linux’s cpufreq sample OS performance at regular and frequent intervals (50 milliseconds or less), and decide whether to shift to a higher or lower operating point or remain at the current operating point.
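The sample-and-decide loop described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the cpufreq implementation: the operating points, the two thresholds, and the function names are all assumptions made for the example.

```python
# Hypothetical operating points in MHz; real tables are SoC-specific.
OPERATING_POINTS_MHZ = [200, 400, 600, 800, 1000]

UP_THRESHOLD = 0.80    # assumed: step up above 80% utilisation
DOWN_THRESHOLD = 0.30  # assumed: step down below 30% utilisation


def next_operating_point(index, utilisation):
    """Return the table index to use for the next sampling interval."""
    if utilisation > UP_THRESHOLD and index < len(OPERATING_POINTS_MHZ) - 1:
        return index + 1  # shift to a higher operating point
    if utilisation < DOWN_THRESHOLD and index > 0:
        return index - 1  # shift to a lower operating point
    return index          # remain at the current operating point


def run_governor(samples, start_index=0):
    """Apply one decision per utilisation sample; return the frequencies chosen."""
    index = start_index
    chosen = []
    for utilisation in samples:
        index = next_operating_point(index, utilisation)
        chosen.append(OPERATING_POINTS_MHZ[index])
    return chosen
```

Starting from the lowest operating point, `run_governor([0.9, 0.9, 0.5, 0.1])` steps up twice, holds, and then steps down once, giving `[400, 600, 600, 400]`.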
A DVFS state transition takes about 100 microseconds in the best case. A typical big.LITTLE state transition takes 30 microseconds and is therefore around 3x faster than a DVFS transition, and much faster than the OS evaluation period. The migration solution has the advantage of simplicity: no kernel scheduler modifications are required. It works in the same way as DVFS does in current mobile devices.
However, it does limit the system topology to an equal number of big and LITTLE cores, and it can only run up to half of the processors at a given time.
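One way to picture the migration model is as a single table in which the big-core operating points sit above the LITTLE ones. The sketch below is a simplified illustration under that assumption; the frequencies, the `OperatingPoint` type, and the `step` function are hypothetical and do not come from any shipping driver.

```python
from typing import NamedTuple


class OperatingPoint(NamedTuple):
    core_type: str  # "LITTLE" or "big"
    freq_mhz: int


# Hypothetical extended table: LITTLE entries first, big entries on top.
EXTENDED_TABLE = [
    OperatingPoint("LITTLE", 400),
    OperatingPoint("LITTLE", 700),
    OperatingPoint("LITTLE", 1000),
    OperatingPoint("big", 800),
    OperatingPoint("big", 1200),
    OperatingPoint("big", 1600),
]


def step(index, direction):
    """Move one entry up (+1) or down (-1), clamped to the table.

    Returns (new_index, migrate), where migrate is True when the new entry
    belongs to the other core type, so execution has to move between cores.
    """
    new_index = max(0, min(len(EXTENDED_TABLE) - 1, index + direction))
    migrate = EXTENDED_TABLE[new_index].core_type != EXTENDED_TABLE[index].core_type
    return new_index, migrate
```

Stepping up from the highest LITTLE entry, `step(2, +1)`, returns `(3, True)`: the governor makes an ordinary "one step up" decision, and the migration to a big core is simply a side effect of crossing the boundary in the table. Every other step within one core type is a plain DVFS transition.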
Towards heterogeneous multiprocessing (Global Task Scheduling)
If CPU/Cluster migration is an extension to DVFS, Global Task Scheduling is an extension to symmetric multiprocessing (SMP).
In an SMP system, the operating system sees all of the cores in the system that are powered up, and schedules individual threads of execution to cores based on policies that seek to optimize the distribution of work for power or performance.
Global Task Scheduling is similar to SMP, with the exception that the CPUs have different capacities, so the scheduler needs to match high-performance threads to high-performance cores, again in a way that is completely transparent to the user. As with SMP, in Global Task Scheduling the OS power management will idle unused cores through a mechanism like CPU hotplug or cpuidle. The heterogeneous model provides a number of advantages:
- The ability to scale execution across all processors in the big.LITTLE system simultaneously.
- The ability to work with asymmetric big.LITTLE clusters – where the number of processors in each cluster differs.
- Fine-grained compute resource control by the operating system – use only as many cores as the workload justifies and only the level of computing resource needed.
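The advantages listed above can be illustrated with a toy placement routine. It is a sketch only, not the big.LITTLE MP scheduler: the threshold value, the round-robin policy, and the function names are assumptions, and both clusters are assumed to contain at least one core.

```python
UP_MIGRATION_THRESHOLD = 0.7  # assumed: tracked load at or above this goes to a big core


def place_threads(thread_loads, big_cores, little_cores):
    """Map each thread to a core based on its tracked load (0.0 to 1.0).

    Heavy threads are spread round-robin over the big cores and light
    threads over the LITTLE cores. Returns {core_name: [thread_name, ...]}.
    """
    placement = {}
    big_count = 0
    little_count = 0
    # Visit the heaviest threads first so the result is deterministic.
    ordered = sorted(thread_loads.items(), key=lambda item: (-item[1], item[0]))
    for name, load in ordered:
        if load >= UP_MIGRATION_THRESHOLD:
            core = big_cores[big_count % len(big_cores)]
            big_count += 1
        else:
            core = little_cores[little_count % len(little_cores)]
            little_count += 1
        placement.setdefault(core, []).append(name)
    return placement


def idle_cores(placement, big_cores, little_cores):
    """Cores with no threads assigned, which power management could idle."""
    return [core for core in big_cores + little_cores if core not in placement]


# Asymmetric example topology: two big cores and four LITTLE cores.
BIG = ["big0", "big1"]
LITTLE = ["little0", "little1", "little2", "little3"]
LOADS = {"game": 0.9, "render": 0.8, "ui": 0.75, "audio": 0.2, "sync": 0.1}

PLACEMENT = place_threads(LOADS, BIG, LITTLE)
IDLE = idle_cores(PLACEMENT, BIG, LITTLE)
```

In this example the three heavy threads share the two big cores, the two light threads each get a LITTLE core, and `IDLE` is `["little2", "little3"]`. Both clusters are in use at once, the clusters differ in size, and only as many cores are active as the workload needs.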
The CoreLink CCI-400 block extends the coherency domain across the two clusters, so they appear as one from the viewpoint of software. The ARM implementation of Global Task Scheduling is called big.LITTLE MP; several of its components, such as the per-thread load tracker, are already in the upstream code base.
Global Task Scheduling has been demonstrated using the ARM big.LITTLE MP code base. It is in active development, with the aim of a production release in devices and upstream submission in the second half of 2013.