ARM’s Big Little At Large (Part 2)

8/14/2013 6:44:01 PM

How Does It Work?

The genius of big.LITTLE is that it's transparent to the OS (to be fair, so is Nvidia's 4-PLUS-1). No software modification is required. It works through DVFS (not a furniture store but dynamic voltage and frequency scaling), and support for that is built into every modern OS already. On Windows, Linux, and most other PC platforms, DVFS is handled through ACPI (the Advanced Configuration and Power Interface). Say your CPU runs as standard at 3.5GHz and 1.3V. If you're just playing Minesweeper, ACPI will tell the CPU to lower its frequency. The CPU has a number of ‘p-states’ (hard-wired steps of frequency and voltage) and here it might switch to the lowest (800MHz at 0.8V). In Intel CPUs, the DVFS implementation is called SpeedStep. In AMD CPUs, it’s called Cool ’n’ Quiet (desktop models) or PowerNow! (laptop models). We’re all familiar with those!

Likewise, if your PC is struggling (perhaps you’ve gone from Minesweeper to Black Medal of Faraway Duty vs Half a Fallout on Biodoom Island: Episode 4: Chapter 16), ACPI will tell the CPU that its frequency should be raised. It’ll switch to a higher p-state (assuming it’s not already in p0, the highest) – perhaps 3GHz at 1.2V.
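To get a feel for why dropping to a lower p-state saves so much, here's a rough sketch. It assumes the textbook approximation that dynamic power scales with voltage squared times frequency, and it ignores leakage entirely. The p-state figures are the illustrative ones used above, not measurements from any real CPU, and the state names are made up for the example.

```python
# Rough sketch: relative dynamic power of the example p-states above.
# Assumption: dynamic power is proportional to V^2 * f (switched capacitance
# held constant, leakage ignored). The state names are hypothetical labels.

P_STATES = {
    "standard": (3.5e9, 1.3),      # (frequency in Hz, voltage in V)
    "intermediate": (3.0e9, 1.2),
    "lowest": (0.8e9, 0.8),
}


def relative_dynamic_power(state, baseline="standard"):
    """Return the dynamic power of `state` as a fraction of `baseline`."""
    f, v = P_STATES[state]
    f0, v0 = P_STATES[baseline]
    return (v * v * f) / (v0 * v0 * f0)


if __name__ == "__main__":
    for name in P_STATES:
        print(f"{name:>12}: {relative_dynamic_power(name):.1%} of standard")
```

Under that assumption, the lowest p-state draws under a tenth of the dynamic power of the standard one while still running at a little under a quarter of the clock speed.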

Big.LITTLE Processing ARM

On Android, DVFS is supported through PowerManager. The Exynos 5 Octa, of course, effectively has two CPUs, each with its own p-states. When the Cortex-A7 cluster is running full-tilt and PowerManager still demands more, big.LITTLE migrates the workload to the Cortex-A15 cluster. PowerManager's requests then apply to that. When performance requirements drop, the same happens in reverse. The beauty is that Android is oblivious: it thinks the Octa has just one CPU cluster and one set of p-states.
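The idea of one OS-visible set of p-states sitting on top of two clusters can be sketched as a simple lookup. This is only a toy model: the frequency steps are hypothetical placeholders, the class and method names are invented for illustration, and the real switching logic is far more involved than a table.

```python
# Toy model of cluster migration as seen from the OS side.
# Assumption: every frequency step below is a hypothetical placeholder.

A7_FREQS_MHZ = [200, 400, 600, 800, 1000, 1200]   # LITTLE cluster steps
A15_FREQS_MHZ = [800, 1000, 1200, 1400, 1600]     # big cluster steps

# The OS sees one flat list of performance levels, lowest first.
VIRTUAL_LEVELS = (
    [("A7", f) for f in A7_FREQS_MHZ] + [("A15", f) for f in A15_FREQS_MHZ]
)


class ClusterSwitcher:
    """Maps an OS-visible performance level onto a (cluster, MHz) pair."""

    def __init__(self):
        self.level = 0
        self.migrations = 0

    @property
    def active(self):
        return VIRTUAL_LEVELS[self.level]

    def request(self, level):
        """Apply a DVFS request, migrating clusters if the level crosses over."""
        level = max(0, min(level, len(VIRTUAL_LEVELS) - 1))
        old_cluster = self.active[0]
        self.level = level
        if self.active[0] != old_cluster:
            # In hardware, state would be copied to the inbound cluster here.
            self.migrations += 1
        return self.active


if __name__ == "__main__":
    switcher = ClusterSwitcher()
    for wanted in (5, 6, 10, 2):
        print(wanted, switcher.request(wanted), "migrations:", switcher.migrations)
```

The caller only ever asks for a level number. Whether that level lands on the A7 or the A15 cluster is hidden behind `request`, which is the sense in which the OS stays oblivious.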

It’s All About Timing

Although their designs are poles apart, the A7 and A15 employ the same ARMv7 instruction set and the same extensions – Advanced SIMD (‘NEON’), virtualization, Thumb-2 and so on. To software, they're identical. Migration would otherwise be near impossible. The A7 and A15 clusters are ‘coherent’ – aware of each other. This is facilitated via two interconnects: the GIC-400 and the CCI-400. The former migrates active interrupts; the latter migrates instructions from pipelines and also data from registers and caches. Together, they allow the entire state of the ‘outbound’ cluster (the one being closed down) to be copied to the ‘inbound’ cluster (the one being fired up). This occurs directly, bypassing slow system RAM.

New Samsung Cortex A-based

Now, while the A7 cluster's 512KB of L2 cache will easily fit into the A15 cluster's 2MB, clearly the reverse isn't true. Caches give a CPU rapid access to frequently used data, so if 75% of the A15’s cache is lost during migration to the A7, is there a performance penalty? Well, a neat feature of big.LITTLE is that the outbound cluster's L2 cache can remain temporarily active even after migration is complete, allowing the inbound cluster to ‘snoop’. Of course, to ensure power-efficiency, eventually it's shut down. In truth, big.LITTLE isn't perfect (see Contact Your MP, a good way to round off your reading), but it's unquestionably clever and elegant and fast! The Exynos 5 Octa can complete a migration in fewer than 2,000 instructions, or about 20,000 clock cycles. Taking the A7 cluster's maximum frequency of 1.2GHz, that translates to roughly 17 microseconds - one 60,000th of a second, give or take. That's longer than a p-state transition, but not so you'd notice. You're not going to find you've been eaten alive by a swarm of zombies because of dropped frames!
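The migration-time arithmetic is easy to check for yourself. The sketch below uses only the figures quoted above (about 20,000 cycles at 1.2GHz). The 60 frames per second used for the frame comparison is an assumption added for illustration.

```python
# Worked arithmetic for the migration time quoted above.
# Inputs are the figures from the text; nothing here is measured.

MIGRATION_CYCLES = 20_000   # approximate clock cycles per migration
A7_MAX_HZ = 1.2e9           # Cortex-A7 cluster's maximum frequency
FRAME_RATE = 60             # assumed frame rate, for comparison only


def migration_time_seconds(cycles=MIGRATION_CYCLES, clock_hz=A7_MAX_HZ):
    """Time needed to run `cycles` clock cycles at `clock_hz`."""
    return cycles / clock_hz


if __name__ == "__main__":
    t = migration_time_seconds()
    print(f"{t * 1e6:.1f} microseconds")          # about 16.7
    print(f"one {1 / t:,.0f}th of a second")      # about 60,000
    print(f"{t * FRAME_RATE:.2%} of one frame at {FRAME_RATE}fps")
```

On those numbers a migration takes roughly a thousandth of a single frame at 60 frames per second, which is why the zombies don't get a look-in.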

A Long Story

The history of Intel's Pentium 4 has entered folklore (see Geeks' Peeks). The pipeline in the Pentium III had ten stages, yet in the final variant of the Pentium 4 it had a colossal 31 stages. The rise in power-draw and drop in IPC killed the NetBurst architecture stone dead. Tellingly, the Core architecture, used in Core 2 CPUs, reverted to a pipeline of just 14 stages.

Intel Inside Pentium 4

Pipelines can only be lengthened so far before performance benefits are lost or even reversed, and it's quite possible that the Cortex-A15's 15-stage pipeline is the ARM architecture’s realistic limit. Interesting times lie ahead!

Hybrid Theory

One obvious downside to big.LITTLE is the amount of silicon duplication. For example, the A7 and A15 clusters each have their own L2 cache. In future designs, wouldn't it make sense to implement a single, shared L2 cache, perhaps split into blocks that could be enabled or disabled as required?

Taking the idea further, could it be possible to design “hybrid” cores, with shared integer and floating-point units but differing pipeline paths? The same core could have a low-energy pipeline (like the Cortex-A7's) and also a high-performance pipeline (like the Cortex-A15's), with the active one being determined by workload.

Cortex-A7 and Cortex-A15 DVFS Curves

The resultant SoC could have a smaller die, be cheaper to manufacture and exhibit even greater energy-efficiency. This concept of ‘composite’ cores has been proposed in a detailed paper by a group of engineers at the University of Michigan (see Geeks' Peeks). Who knows what we'll see in the future? The big.LITTLE I've explored here is merely ARM's first stab. Around the corner is the pairing of Cortex-A53 and Cortex-A57, which are ARMv8 parts targeting the x86-dominated server market.

Contact Your MP

The Exynos 5 Octa’s CPU is what’s termed heterogeneous: it comprises cores of differing architectures. However, in operation, as with any regular CPU, it’s homogeneous: only cores of one architecture are active at once.

Let's say there's an app whose performance requirements can't quite be met by four A7 cores. Perhaps three A7 cores and one A15 core would tip the balance. But Samsung's Octa can't mix and match like that - its whole A15 cluster would get fired up. That's not especially energy-efficient.



Of course, the Octa can ‘power-gate’, in common with most other modern ARM SoCs and most modern x86 silicon. That means that, within the active cluster, it can switch cores on and off. Even in a scenario where the A15 cluster is in use, power-draw doesn’t necessarily run wild: up to three of the four cores could be in ‘deep sleep’.

ARM aims to cater for such scenarios with big.LITTLE MP (multi-processing). This allows any core to be active simultaneously with any other, even if they're from different clusters. It also allows all cores to be active. With the Octa, we'd be in proper laptop territory - the performance harnessed by eight active cores would be immense. So would the energy-draw, of course – the CPU would suck up power like a third-world dictator.
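The difference between the Octa's cluster migration and big.LITTLE MP can be sketched in a few lines. This is a deliberately crude model: the core counts match the Octa, but the function names and the idea of asking for cores by count are hypothetical, and it assumes that under cluster migration every busy thread ends up on an A15 core once any A15 is needed.

```python
# Crude model contrasting cluster migration with big.LITTLE MP.
# Assumption: requesting cores by count is a hypothetical simplification.

CORES_PER_CLUSTER = 4   # four A7 cores and four A15 cores, as in the Octa


def cluster_migration(a7_wanted, a15_wanted):
    """Only one cluster is active: any A15 demand moves everything to the A15s."""
    if a15_wanted > 0:
        busy = min(a7_wanted + a15_wanted, CORES_PER_CLUSTER)
        return {"A7": 0, "A15": busy}   # idle A15 cores can be power-gated
    return {"A7": min(a7_wanted, CORES_PER_CLUSTER), "A15": 0}


def big_little_mp(a7_wanted, a15_wanted):
    """Any mix of cores may be active, up to all eight at once."""
    return {
        "A7": min(a7_wanted, CORES_PER_CLUSTER),
        "A15": min(a15_wanted, CORES_PER_CLUSTER),
    }


if __name__ == "__main__":
    # The example from the text: three A7 cores plus one A15 would suffice.
    print("cluster migration:", cluster_migration(3, 1))
    print("big.LITTLE MP:    ", big_little_mp(3, 1))
```

For the three-plus-one case, the first function ends up with four A15 cores busy, while the second keeps three A7 cores and wakes just one A15.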

The problem with big.LITTLE MP is that the host OS needs to be heterogeneity-aware. Presently, Android (and any other ARM-compatible OS) expects all CPU cores to be the same. Thankfully, ARM is busy writing code that should eventually feature in future Linux kernels (Android is Linux-based), so within the next 12 months we may well see big.LITTLE MP in action either from Samsung or some other SoC maker.
