ARM is entering the digital signal controller market with the Cortex-M4, a 32-bit core with built-in integer DSP instructions and an optional floating point unit. The processor is aimed at applications in audio, motor control, industrial automation and automotive.
Its instruction set is a superset of the Cortex-M3's, which is itself a superset of the M0's.
"There is straight-through binary compatibility from the M0 to the M4," ARM product manager Shyam Sadasivan told EW.
When not executing DSP or floating point instructions, the M4 performs much like the M3. Differences start to show once the DSP instructions are invoked.
"If you have a 120MHz flash device, a Cortex-M3 will do MP3 decode for 20-25MHz. An M4 can do it in less than half the MHz," said Sadasivan. "An M4 could do 5.1 Dolby digital AC3 headset decode at 50MHz."
Power consumption is predicted to be less than 40µW/MHz, with MP3 decode consuming 0.5mW.
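The two power figures can be cross-checked against each other. At the 40µW/MHz ceiling, a 0.5mW budget corresponds to about 12.5MHz of processor load, which is roughly in line with Sadasivan's "less than half" of the M3's 20-25MHz. The article does not say how the 0.5mW figure was derived, so the following is only a back-of-envelope sketch in C; the function names are illustrative and do not come from ARM.

```c
/* Back-of-envelope check of the quoted power figures.
   Units are microwatts (uW) and MHz. */

/* Clock rate in MHz implied by a power budget at a given uW/MHz figure. */
double implied_mhz(double budget_uw, double uw_per_mhz)
{
    return budget_uw / uw_per_mhz;
}

/* Power in uW drawn when running at a given clock rate. */
double power_uw(double mhz, double uw_per_mhz)
{
    return mhz * uw_per_mhz;
}
```

Calling implied_mhz(500.0, 40.0) gives 12.5. Because 40µW/MHz is quoted as an upper bound, a lower real figure would leave room for more MHz within the same 0.5mW.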
Adding DSP instructions into the architecture has pushed gate count up to 65,000, compared with 45,000 to 50,000 for the M3, said Sadasivan. The optional floating point unit adds a further 25,000.
"That puts an M4 with floating point well below 100,000 gates and we expect to have typically 150MHz operation in silicon by the end of the year," he said. "Potentially, we could see even faster; it depends on the flash from the partner."
As with the Cortex-M3 and M0, NXP is a lead licensee.
"We are currently working hard on first silicon implementation, and hope to show functional silicon to lead customers very soon. We're planning for high-volume ramp-up very early next year [2011]," NXP general manager of MCUs, Geoff Lees, told EW.
With its floating point unit, isn't this starting to look like the Cortex-R4?
"The difference between the R4 and the M4 is quite big in terms of features, performance and size," said Sadasivan. "The R4 is a much larger processor with an eight to 10-stage pipeline, tightly coupled memory, and caches. The M4 is for flash, with a three-stage pipeline."
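Sadasivan does not want to compare the M4 to pure DSP architectures; he points out that it is still fundamentally a load-store core with DSP instructions. However, the DSP instructions are single cycle, including the multiply-accumulate (MAC), which multiplies two 32-bit values and adds a 64-bit accumulator to give a 64-bit result (32x32+64->64).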
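For readers unfamiliar with these operations, the sketch below models in portable C what a 32x32+64->64 MAC computes, along with the saturating and packed SIMD arithmetic that ARM also lists for the core. It is an assumption-laden illustration of the arithmetic only: the function names are invented for this example, they are not ARM intrinsics, and nothing here reflects how the M4 executes the operations in a single cycle.

```c
#include <stdint.h>

/* 32x32+64 -> 64 multiply-accumulate: acc + a*b with a 64-bit accumulator.
   The addition is done in unsigned arithmetic so that overflow wraps
   instead of being undefined behaviour in C. */
int64_t mac_32x32_64(int64_t acc, int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * (int64_t)b;
    return (int64_t)((uint64_t)acc + (uint64_t)product);
}

/* Saturating 32-bit add: clamps to the int32_t range instead of wrapping. */
int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* 2x16-bit SIMD add: two independent 16-bit lanes packed in one word.
   A carry out of the low lane must not spill into the high lane. */
uint32_t add_2x16(uint32_t x, uint32_t y)
{
    uint32_t lo = (x + y) & 0x0000FFFFu;
    uint32_t hi = ((x >> 16) + (y >> 16)) << 16;
    return hi | lo;
}

/* 4x8-bit SIMD add: four independent 8-bit lanes packed in one word. */
uint32_t add_4x8(uint32_t x, uint32_t y)
{
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t lane = ((x >> (8 * i)) + (y >> (8 * i))) & 0xFFu;
        r |= lane << (8 * i);
    }
    return r;
}
```

In the 2x16 case, adding 0x0001FFFF and 0x00010001 lane by lane gives 0x00020000: the low lane wraps to zero without disturbing the high lane, which is the behaviour that lets one 32-bit register carry two audio samples at once.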
"There is a single cycle MAC, a bunch of saturating maths, and SIMD instructions for 4x8bit and 2x16-bit arithmetic," he said.
The floating point unit is not all single cycle. Instructions take one, three, or 14 cycles, with the 14-cycle figure applying to square root and divide.
"It is IEEE754 single precision only, with double precision through software. The R4 has double precision in hardware," said Sadasivan.
Because the floating point unit is decoupled from the pipeline, it can be left to its own devices while the core continues to execute.
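Single precision in hardware has a practical consequence for C code. Any expression with a double operand, including an unsuffixed literal such as 0.5, is evaluated in double precision, which on an M4 would generally mean the software routines Sadasivan mentions. The sketch below keeps every value as a float so the arithmetic can stay in a single-precision unit. It assumes a simple smoothing filter of the kind used in motor control; the function names are illustrative and come from neither ARM nor NXP.

```c
#include <stddef.h>

/* One step of a first-order low-pass filter: y += alpha * (x - y).
   All operands are float, so no double-precision arithmetic is implied. */
float lowpass_step(float y, float x, float alpha)
{
    return y + alpha * (x - y);
}

/* Filter a buffer in place, starting from state y0.
   Returns the final filter state (y0 unchanged if n is 0). */
float lowpass_run(float *buf, size_t n, float y0, float alpha)
{
    float y = y0;
    for (size_t i = 0; i < n; i++) {
        y = lowpass_step(y, buf[i], alpha);
        buf[i] = y;
    }
    return y;
}
```

A caller would write the coefficient as 0.5f rather than 0.5, since the f suffix is what keeps the constant in single precision.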
Sadasivan sees the floating point unit coming into its own for industrial automation.
"Interest is coming from those using meta languages like The MathWorks' Matlab and NI's Labview for highly abstracted design," he said. "Floating point is very attractive to them as these tools are intended for floating point."
In cars, ARM is touting the M4 as a lower-cost alternative to the Cortex-R4.