
ARM has surprised us. It is telling engineers brought up to maximise hardware resource use that it can be cost-effective to add a co-processor to a system chip then leave it, or the original processor, idle 100% of the time.
How? The occasion was this month's launch of the Cambridge-based processor firm's Cortex-A7 processor, and the reasoning has everything to do with power consumption in phones.
“The gap between high-end apps and the low end is getting bigger,” Peter Greenhalgh, architect of the A7, told Electronics Weekly.
He cites ever-more complex games and increasingly content-rich web sites at the top end, while email continues to need almost no processing power at all.
One way to deal with this is to have a large, powerful processor which can be throttled back for lighter tasks.
But the processing differential is now so great that the huge deep-pipeline multi-issue processor needed to run a game is still eating considerable power even at low clock speeds.

Another way to do it would be to run a couple of identical processors, say Cortex-A9s, together for games and with one shut down for email.
This is a valid way of doing things, but only if the customer can find a way, and can be bothered, to write their heavy apps with partitioning for multi-processors.
So, thought ARM, we have a huge powerful processor for smart phones - the 3.5DMIPS/MHz Cortex-A15 - so why don’t we make an instruction-identical smaller cousin that can do email while the A15 sleeps, and add a code-transparent way to switch between the two?
And that is where Greenhalgh and his A7 come in.
“The A7 and A15 are 100% instruction set architecture identical,” he said. “Both processors implement the new virtualisation extensions, and both implement the larger physical address extensions to 40bit [1Tbyte].”
In between them is hardware that can read the complete state of the active core (A7 or A15), called the out-bound core, and deposit that state into the in-bound core (A15 or A7 respectively).
“To switch, we pull all of the architectural state - all the registers, anything to do with the operating system and anything that records the state of the processor - and we restore it to the other processor,” said Greenhalgh. “We can switch in 20µs.”
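The save-and-restore hand-off Greenhalgh describes can be pictured with a toy model. This is only a sketch: the `Core` class, its fields and the `switch` function are hypothetical names invented for illustration, and the real mechanism lives in ARM's hardware and system software, not in Python.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Toy stand-in for a processor core (hypothetical, for illustration only)."""
    name: str
    powered: bool = False
    # Architectural state: general registers, system registers and so on.
    state: dict = field(default_factory=dict)


def switch(outbound: Core, inbound: Core) -> None:
    """Move all architectural state from the out-bound to the in-bound core."""
    if not outbound.powered:
        raise ValueError("out-bound core must be the active one")
    inbound.powered = True                # wake the in-bound core
    inbound.state = dict(outbound.state)  # restore the saved state onto it
    outbound.state.clear()                # out-bound core no longer holds state
    outbound.powered = False              # and can now be powered down


# Example register contents are made up.
a15 = Core("Cortex-A15", powered=True, state={"r0": 42, "pc": 0x8000})
a7 = Core("Cortex-A7")
switch(outbound=a15, inbound=a7)
```

Because the two cores are instruction-set identical, software in this model would resume on the A7 with exactly the register contents it had on the A15.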
The switch hardware is bound up in ARM’s AMBA connection bus and a new interconnect called CCI-400.
Cortex-A15 was the first ARM core to have the ACE version of AMBA 4, which supports cache coherency across multiple processors, and A7 is the second.
CCI-400 is a bus-level high-speed cross-bar interconnect using 128bit wide busses.
“One of the other cool things about CCI-400 is that it has other paths, so you can attach a graphics processor [GPU] and do things like GPU off-load more efficiently,” said Greenhalgh.

Cache coherency is important because both the A7 and A15 have their own private L1 and L2 caches, and the idle core and its caches can be powered down completely once processing is switched away.
“We have full coherency via CCI-400,” said Greenhalgh. “Cache warming time is reduced because you can snoop the content of the other cache. When A7 does a data request, it can snoop the caches of A15; when A15 does a data request, it can snoop the caches of A7.”
Interrupt control is also common across the processors.
At the micro-architecture level, the processors differ significantly in pipeline length and the number of instructions that can be issued at the same time.
“With the A15, the aim is high-performance for the best energy. With A7 it is always best possible energy,” said Greenhalgh. “A7 has an eight stage pipeline, limited dual instruction issue, it's not symmetric and it has in-order execution. A15 has a 15 stage out-of-order, multi-issue pipeline, and can sustain over three instructions per clock cycle.”
If the aim was low power consumption, why does the A7 have complex features like a dual-issue pipeline?
Complexity to save power “is an interesting thing. We put a huge amount of effort into branch prediction because every time you flush the pipeline you waste power”, said Greenhalgh. “With better branch prediction you flush less, so you get better performance and better energy. You can build a monstrous amount of branch-prediction hardware for the same energy as a pipeline flush.”
Another example is dual-issue.
“We looked through a lot of code for instructions we could dual issue, we looked out for instructions that would give you an overall energy decrease,” said Greenhalgh. “You decrease execution time when you dual issue so you can improve energy efficiency if the amount of work you have to do to dual issue is less.”
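The trade-off in that quote follows from energy being power multiplied by time. A rough sketch, using made-up numbers rather than any ARM figures: if dual issue adds a fraction `power_overhead` to power draw and speeds execution up by a factor `speedup`, energy per task changes by (1 + power_overhead) / speedup. The function name and both parameters are assumptions introduced for illustration.

```python
def energy_ratio(power_overhead: float, speedup: float) -> float:
    """Energy with dual issue relative to single issue (below 1 is a saving).

    energy = power * time, so the ratio is (1 + power_overhead) / speedup.
    Both parameters are hypothetical inputs, not measured ARM data.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (1.0 + power_overhead) / speedup


# Illustrative values only: 10% more power for a 1.2x speed-up saves energy,
# while 30% more power for a 1.1x speed-up costs energy.
saving = energy_ratio(power_overhead=0.10, speedup=1.20)
loss = energy_ratio(power_overhead=0.30, speedup=1.10)
```

In this simple model, an instruction pair is worth dual issuing only when the ratio comes out below 1.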
The result of all this micro-architectural engineering is a 1.9DMIPS/MHz A7 and a 3.5DMIPS/MHz A15. These compare with a score of 2DMIPS/MHz for the existing Cortex-A8.
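As a quick check on those scores, and assuming they are per-MHz Dhrystone figures (the usual way ARM quotes them), the ratios at equal clock speed work out as follows. The dictionary and function names are illustrative only.

```python
# Dhrystone scores quoted in the article, assumed to be DMIPS per MHz.
DMIPS_PER_MHZ = {"Cortex-A7": 1.9, "Cortex-A8": 2.0, "Cortex-A15": 3.5}


def relative_throughput(core: str, baseline: str) -> float:
    """Dhrystone throughput of `core` relative to `baseline` at the same clock."""
    return DMIPS_PER_MHZ[core] / DMIPS_PER_MHZ[baseline]


a15_vs_a7 = relative_throughput("Cortex-A15", "Cortex-A7")  # roughly 1.84
a7_vs_a8 = relative_throughput("Cortex-A7", "Cortex-A8")    # roughly 0.95
```

On Dhrystone alone, then, the A15 is a little under twice the A7 clock for clock, and the A7 sits just below the A8.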
ARM does not favour DMIPS to compare phone processors. “Dhrystone does measure throughput, but everything is in L1 caches so it doesn’t really exercise memory,” said Greenhalgh. “A7 is 20% above A8 on our own web-browsing benchmark.”
And how does A7 compare with A15 for browsing?
Greenhalgh is not saying “for marketing reasons”, nor is ARM revealing any power metrics like mW/MHz for either A7 or A15, yet.
Part of the reason that a chip maker can have a whole processor idle on a chip is the large number of gates that are available at 28nm, which is the process that the A7 is aimed at, followed by 20nm.
“We are heading for a situation where there are more transistors than we know what to do with,” said Greenhalgh. “We are not there yet, and if we can make the die smaller, we still would.”
The argument is that gates are increasingly cheap and there are so many other blocks on a system-on-chip that the 0.45mm² needed for an A7 at 28nm is insignificant amongst the graphics processors, video coders, and memory.
“It is not expensive in silicon, yield or test. It is a fairly easy decision to make,” said Greenhalgh.
There are two other use models for the A7 aside from swapping applications with an A15, which ARM has branded big.LITTLE operation. One is to increase processing by using the A7 at the same time as the A15, dubbed big.LITTLE MP. The other is to use the A7 alone.
“For phones on the shelf in 2013, the A7 will still be a capable smartphone processor. It has more performance than the Cortex-A8 and is a lot smaller,” said Greenhalgh.
ARM is predicting sub-$100 entry-level smartphones in 2013 using the A7 for performance equivalent to a 2011 $500 phone.
Broadcom, HiSilicon, Freescale, Fujitsu, LG, Samsung, ST-Ericsson, and Texas Instruments are amongst the chip firms that have signed up for the A7. Most of them are intending to use it with an A15, although some will use it alone, or use it both ways, said Greenhalgh.
First products are expected in 2012.