How Does It Work?
The genius of big.LITTLE is that it's
transparent to the OS (to be fair, so is Nvidia's 4-PLUS-1). No software
modification is required. It works through DVFS (not a furniture store but
dynamic voltage and frequency scaling), and support for that is built into
every modern OS already. On Windows, Linux, and most other PC platforms, DVFS
is handled through ACPI (the Advanced Configuration and Power Interface). Say
your CPU runs as standard at 3.5GHz and 1.3V. If you're just playing
Minesweeper, ACPI will tell the CPU to lower its frequency. The CPU has a
number of ‘p-states’ (hard-wired steps of frequency and voltage) and here it
might switch to the lowest (800MHz at 0.8V). In Intel CPUs, the DVFS
implementation is called SpeedStep. In AMD CPUs, it’s called Cool ’n’ Quiet
(desktop models) or PowerNow! (laptop models). We’re all familiar with those!
Likewise, if your PC is struggling (perhaps
you’ve gone from Minesweeper to Black Medal of Faraway Duty vs Half a Fallout
on Biodoom Island: Episode 4: Chapter 16), ACPI will tell the CPU that its
frequency should be raised. It’ll switch to a higher p-state (assuming it’s not
already in p0, the highest) – perhaps 3GHz at 1.2V.
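To make the p-state idea concrete, here is a minimal Python sketch of how a governor might choose a step. It is an illustration only, not how ACPI or any real OS does it. The 3.5GHz/1.3V, 3GHz/1.2V and 800MHz/0.8V points come from the example above; the 2GHz/1.0V step, the 25% headroom and the function names are assumptions. The power estimate uses the usual rule of thumb that dynamic power scales with voltage squared times frequency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PState:
    name: str
    freq_mhz: int
    volts: float


# Illustrative table, fastest first. P2 is an invented intermediate step.
P_STATES = [
    PState("P0", 3500, 1.3),
    PState("P1", 3000, 1.2),
    PState("P2", 2000, 1.0),  # assumption, not from the article
    PState("P3", 800, 0.8),
]


def pick_p_state(load, current_mhz, headroom=1.25):
    """Return the slowest p-state that still covers the demanded work.

    load is the fraction of time the CPU was busy (0.0 to 1.0) while
    running at current_mhz. headroom is an arbitrary safety margin.
    """
    demand_mhz = load * current_mhz * headroom
    for state in reversed(P_STATES):  # try the slowest step first
        if state.freq_mhz >= demand_mhz:
            return state
    return P_STATES[0]  # nothing is fast enough, so run flat out


def relative_dynamic_power(state, reference=P_STATES[0]):
    """Dynamic power relative to the reference state, using P ~ V^2 * f."""
    return (state.volts ** 2 * state.freq_mhz) / (
        reference.volts ** 2 * reference.freq_mhz
    )
```

Under these assumptions, a Minesweeper-grade load of 10% at 3.5GHz lands on the 800MHz step, which by the same rule of thumb draws under a tenth of the dynamic power of the top step.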
[Figure: ARM big.LITTLE Processing]
On Android, DVFS is supported through
PowerManager. The Exynos 5 Octa, of course, effectively has two CPUs, each with
its own p-states. When the Cortex-A7 cluster is running full-tilt and
PowerManager still demands more, big.LITTLE migrates the workload to the
Cortex-A15 cluster. PowerManager's requests then apply to that. When
performance requirements drop, the same happens in reverse. The beauty is that
Android is oblivious: it thinks the Octa has just one CPU cluster and one set
of p-states.
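One way to picture this is as a single ladder of performance steps with the cluster boundary hidden in the middle. The Python sketch below is a toy model under stated assumptions, not ARM's or Samsung's switcher code: the class name, the frequency tables (other than the A7's 1.2GHz ceiling) and the idea that the slowest A15 step outperforms the fastest A7 step are all hypothetical.

```python
# Hypothetical operating points in MHz; real tables are SoC-specific.
A7_STEPS = [200, 400, 600, 800, 1000, 1200]
A15_STEPS = [800, 1000, 1200, 1400, 1600]

# The OS sees one ladder: every A7 step, then every A15 step.
# Assumes the slowest A15 step still beats the fastest A7 step.
VIRTUAL_LADDER = [("A7", f) for f in A7_STEPS] + [("A15", f) for f in A15_STEPS]


class ClusterSwitcher:
    """Toy model of cluster migration hidden behind one set of p-states."""

    def __init__(self):
        self.index = 0       # start on the slowest A7 step
        self.migrations = 0  # how many times the workload changed cluster

    @property
    def active(self):
        return VIRTUAL_LADDER[self.index]

    def request(self, step_delta):
        """Move up or down the ladder as the OS asks.

        The caller never learns which cluster is in use; a migration is
        simply counted whenever the move crosses the cluster boundary.
        """
        old_cluster = self.active[0]
        top = len(VIRTUAL_LADDER) - 1
        self.index = max(0, min(top, self.index + step_delta))
        if self.active[0] != old_cluster:
            self.migrations += 1
        return self.active
```

Asking for one more step when the A7 cluster is already at 1.2GHz returns the first A15 entry, and stepping back down reverses the move, which is the behaviour described above.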
It’s All About Timing
Although their designs are poles apart, the
A7 and A15 employ the same ARMv7 instruction set and the same extensions –
Advanced SIMD (‘NEON’), virtualization, Thumb-2 and
so on. To software, they're identical. Migration would otherwise be near
impossible. The A7 and A15 clusters are ‘coherent’ – aware of each other. This
is facilitated via two interconnects: the GIC-400 and the CCI-400. The former
migrates active interrupts; the latter migrates instructions from pipelines and
also data from registers and caches. Together, they allow the entire state of
the ‘outbound’ cluster (the one being closed down) to be copied to the
‘inbound’ cluster (the one being fired up). This occurs directly, bypassing
slow system RAM.
Now, while the A7 cluster's 512KB of L2
cache will easily fit into the A15 cluster's 2MB, clearly the reverse isn't
true. Caches give a CPU rapid access to frequently used data, so if 75% of the
A15’s cache is lost during migration to the A7, is there a performance penalty?
Well, a neat feature of big.LITTLE is that the outbound cluster's L2 cache can
remain temporarily active even after migration is complete, allowing the
inbound cluster to ‘snoop’. Of course, to ensure power-efficiency, eventually
it's shut down. In truth, big.LITTLE isn't perfect (see Contact Your MP, a good
way to round off your reading), but it's unquestionably clever and elegant and
fast! The Exynos 5 Octa can complete a migration in fewer than 2,000
instructions, or about 20,000 clock cycles. Taking the A7 cluster's maximum
frequency of 1.2GHz, that translates to roughly 17 microseconds – one 60,000th
of a second, give or take. That's longer than a p-state transition, but not so
you'd notice. You're not going to find you've been eaten alive by a swarm of
zombies because of dropped frames!
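The sums are easy to check. The short Python sketch below redoes the arithmetic using the 20,000-cycle and 1.2GHz figures quoted above; the 60fps frame rate and the function names are assumptions added purely to put the number in a gaming context.

```python
MIGRATION_CYCLES = 20_000  # figure quoted in the text
A7_MAX_HZ = 1.2e9          # A7 cluster's maximum clock, from the text
ASSUMED_FPS = 60           # assumption: a typical display refresh rate


def migration_seconds(cycles=MIGRATION_CYCLES, clock_hz=A7_MAX_HZ):
    """Time taken by a migration: cycles divided by cycles per second."""
    return cycles / clock_hz


def share_of_frame(fps=ASSUMED_FPS):
    """Fraction of a single frame's time budget used by one migration."""
    return migration_seconds() * fps


if __name__ == "__main__":
    t = migration_seconds()
    print(f"migration time: {t * 1e6:.1f} microseconds")
    print(f"as a fraction of a second: 1/{round(1 / t):,}")
    print(f"share of one {ASSUMED_FPS}fps frame: {share_of_frame():.2%}")
```

At 60fps each frame has a budget of about 16.7 milliseconds, so on these figures a single migration eats roughly a thousandth of one frame.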
A Long Story
The history of Intel's Pentium 4 has
entered folklore (see Geeks’ Peeks). The pipeline in the Pentium III had ten
stages, yet in the final variant of the Pentium 4 it had a colossal 31 stages.
The rise in power-draw and drop in IPC killed the NetBurst architecture stone
dead. Tellingly, the Core architecture, used in Core 2 CPUs, reverted to a pipeline
of just 14 stages.
[Figure: Intel Inside Pentium 4]
Pipelines can only be lengthened so far
before performance benefits are lost or even reversed, and it's quite possible
that the 15-stage Cortex-A15 pipeline is the ARM architecture’s
realistic limit. Interesting times lie ahead!
Hybrid Theory
One obvious downside to big.LITTLE is the
amount of silicon duplication. For example, the A7 and A15 clusters each have
their own L2 cache. In future designs, wouldn't it make sense to implement a
single, shared L2 cache, perhaps split into blocks that could be enabled or
disabled as required?
Taking the idea further, could it be
possible to design “hybrid” cores, with shared integer and floating-point units
but differing pipeline paths? The same core could have a low-energy pipeline
(like the Cortex-A7's) and also a high-performance pipeline (like the
Cortex-A15's), with the active one being determined by workload.
[Figure: Cortex-A7 and Cortex-A15 DVFS Curves]
The resultant SoC could have a smaller die,
be cheaper to manufacture and exhibit even greater energy-efficiency. This
concept of ‘composite’ cores has been proposed in a detailed paper by a group
of engineers at the University of Michigan (see Geeks’ Peeks). Who knows what
we'll see in the future? The big.LITTLE I've explored here is merely ARM's
first stab. Around the corner is the pairing of Cortex-A53 and Cortex-A57,
which are ARMv8 parts targeting the x86-dominated server market.
Contact Your MP
The Exynos 5 Octa’s CPU is what’s termed
heterogeneous: it comprises cores of differing architectures. However, in
operation, as with any regular CPU, it’s homogeneous: only cores of one
architecture are active at once.
Let's say there's an app whose performance
requirements can't quite be met by four A7 cores. Perhaps three A7 cores and
one A15 core would tip the balance. But Samsung's Octa can't mix and match like
that – its whole A15 cluster would get fired up. That's not especially
energy-efficient.
[Figure: big.LITTLE MP]
Of course, the Octa can ‘power-gate’, in
common with most other modern ARM SoCs and most modern x86 silicon. That means
that, within the active cluster, it can switch cores on and off. Even in a
scenario where the A15 cluster is in use, power-draw doesn’t necessarily run wild:
up to three of the four cores could be in ‘deep sleep’.
ARM aims to cater for such scenarios with
big.LITTLE MP (multi-processing). This allows any core to be active
simultaneously with any other, even if they're from different clusters. It also
allows all cores to be active. With the Octa, we'd be in proper laptop
territory – the performance harnessed by eight active cores would be immense.
So would the energy-draw, of course – the CPU would suck up power like a
third-world dictator.
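The difference between the two schemes can be sketched in a few lines of Python. This is a toy comparison, not a real scheduler: the capacity figures (an A15 core doing twice the work of an A7 core), the "fewest big cores first" policy and the function names are all assumptions made for illustration. Both functions assume idle cores are power-gated.

```python
from math import ceil

A7_CAPACITY = 1.0    # hypothetical work units per A7 core
A15_CAPACITY = 2.0   # hypothetical: assumes an A15 core does twice the work
CORES_PER_CLUSTER = 4


def cluster_migration(demand):
    """Cores woken when only one cluster may be active at a time."""
    if demand <= CORES_PER_CLUSTER * A7_CAPACITY:
        return {"A7": ceil(demand / A7_CAPACITY), "A15": 0}
    big = min(CORES_PER_CLUSTER, ceil(demand / A15_CAPACITY))
    return {"A7": 0, "A15": big}


def multi_processing(demand):
    """Cores woken when any mix is allowed, as in big.LITTLE MP.

    Policy (an assumption): use as few A15 cores as possible, then as
    few A7 cores as possible, while still meeting the demand.
    """
    for big in range(CORES_PER_CLUSTER + 1):
        for little in range(CORES_PER_CLUSTER + 1):
            if big * A15_CAPACITY + little * A7_CAPACITY >= demand:
                return {"A7": little, "A15": big}
    # Demand exceeds the whole chip: switch everything on.
    return {"A7": CORES_PER_CLUSTER, "A15": CORES_PER_CLUSTER}
```

With a demand of 4.5 units, just beyond what four A7 cores can manage, the cluster-migration model wakes three A15 cores, while the MP model settles on three A7 cores plus a single A15: the mix-and-match case described above.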
The problem with big.LITTLE MP is that the
host OS needs to be heterogeneity-aware. Presently, Android (and any other
ARM-compatible OS) expects all CPU cores to be the same. Thankfully, ARM is
busy writing code that should eventually feature in future Linux kernels
(Android is Linux-based), so within the next 12 months we may well see
big.LITTLE MP in action either from Samsung or some other SoC maker.