New MCUs not always best for embedded graphics

Refresh your embedded design with a new processor and get better performance – right?

Wrong – or it could be if your application has a screen.

The issue is 2D graphics acceleration, highlighted by benchmarking at Birmingham consultancy ByteSnap Design.

“These days you expect the CPU vendors to put in a GPU and have all the graphics processing on board. You would also expect that any new CPU would blow away the 2D performance from six years ago. Well, that’s what I thought,” said ByteSnap co-founder Graeme Wintle.

Wintle tested a number of ARM processors, running 2D graphics tasks under Windows CE6 and CE7 using the Windows CE Test Kit (CETK) and the Graphics Device Interface Performance (GDIP) test.

“The bottom line is, a chip might look great on paper, but you have to see what the software supports. Sometimes, you only get an improvement if you re-write your application in OpenGL, and the software support is there to leverage the hardware assistance,” he told Electronics Weekly.

OpenGL
OpenGL is a graphics application programming interface (API) – a standard way for software to request actions on a screen.

Standard operating system graphics operations use functions from the GDI (Graphics Device Interface) API. These may be implemented:
In software, using a series of CPU instructions provided by the OS (slow)
In hardware, using a hardware accelerator internal to the CPU (fast)
Somewhere in between – using software, sometimes provided by the chip manufacturer, whose special instructions drive general-purpose SIMD (single instruction, multiple data) hardware – Intel’s ‘MMX’ and ARM’s ‘Neon’ are examples.

A modern fast application processor may well have a powerful built-in 2D/3D graphics accelerator, but that accelerator may only be exposed through an OpenGL API, and this is not necessarily used by your application if it is written to use the GDI interface.
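The three options above can be modelled in a few lines. This is an illustrative sketch only, not code from any board support package: the names `RenderPath` and `gdi_render_path` are hypothetical, and the model assumes a driver simply takes the fastest path that GDI calls can reach.

```python
from enum import Enum


class RenderPath(Enum):
    """The three ways a GDI call can end up being executed."""
    HARDWARE_2D = "hardware 2D engine"
    SIMD_SOFTWARE = "SIMD-optimised software"
    PLAIN_SOFTWARE = "plain CPU software"


def gdi_render_path(gdi_2d_engine: bool, simd_routines: bool) -> RenderPath:
    """Pick the fastest path that GDI calls can actually reach.

    An OpenGL-only GPU is deliberately not a parameter: in this model
    it does nothing for an application that draws through GDI.
    """
    if gdi_2d_engine:
        return RenderPath.HARDWARE_2D
    if simd_routines:
        return RenderPath.SIMD_SOFTWARE
    return RenderPath.PLAIN_SOFTWARE


if __name__ == "__main__":
    # Hypothetical parts, described only by what their GDI drivers can use.
    parts = {
        "older chip with a GDI 2D engine": (True, False),
        "newer chip, OpenGL-only GPU plus SIMD routines": (False, True),
        "chip with an LCD controller and nothing else": (False, False),
    }
    for name, (engine, simd) in parts.items():
        print(f"{name}: {gdi_render_path(engine, simd).value}")
```

The point of leaving the OpenGL GPU out of the function signature is that, for a GDI application, its presence changes nothing about which path is taken.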

“OpenGL acceleration is not used by the Windows CE graphical shell or basic windows application out of the box,” said Wintle, pointing out that to use it, users have to write their application to interface directly to OpenGL, or use an application framework such as Microsoft Silverlight to unlock the OpenGL acceleration.

Screen objects, from clickable virtual ‘stop’ and ‘start’ buttons to windows, message boxes and text scrolling, are all created through 2D operations such as line draw, block copy and pattern fill, typically implemented by GDI API calls on Windows systems. On Linux systems there are many frameworks, such as SDL or Qt, but the integration with the hardware accelerator “needs to be implemented or a software library could be being used by-passing the potential power of your processor”, said Wintle.

Oldies
Older processors had no on-chip graphics capability and had to be paired with a third-party graphics chip, which was likely to include hardware acceleration for GDI API calls.

“A number of years ago processors based on SH3, SH4, MIPS, ARM and StrongARM would not drive the screen directly or would do so in a very basic way,” said Wintle. “People would typically use a separate graphics chip in the same way as in a desktop machine. Silicon Image, MediaQ and others made graphics processors targeted at the embedded space. Even NVidia made a low power embedded GPU.”

Occasionally, GDI hardware even got built onto the processor: the Marvell PXA320 for example (see below).

With OpenGL embedded into the application processor, said Wintle, the vendor is unlikely to also include hardware-assisted GDI functions. Instead, the vendor will use a software implementation that runs on the main CPU or, with slightly faster results, on SIMD hardware such as ARM’s Neon, expecting applications to be written in an OpenGL-aware way. That is not always the case when older applications are ported.

And, he added, today’s processor vendors may not be able to allocate as much effort to hand-crafting GDI API software as the third party graphics chip companies used to.

“TI for instance has board support packages for Windows CE which have assembler routines for basic GDI API calls, but these are software accelerated through Neon,” said Wintle. “They have hardware OpenGL 2D/3D acceleration, but an operating system like Windows CE isn’t using that, it is not aware of OpenGL for basic tasks, so it uses Neon.”
 
“In the MediaQ days,” he continued, “scrolling was one function that hardware assistance would just blow away software implementations, screens would scroll smoothly as if you were using a desktop machine. If you do it on a chip without 2D acceleration at the operating system level, then all operations go through Neon – or worse still through standard ARM operations bottlenecking the processor. If there are 800×600 pixels, moving the screen up one pixel is a lot of instructions to perform.”
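To put a rough number on that last point, here is a minimal sketch of a one-row software scroll. It assumes a linear framebuffer at 16 bits per pixel with no padding between rows; the article does not state the pixel format, and the function name `scroll_up_one_row` is made up for illustration. Under those assumptions, moving an 800×600 screen up by one pixel means copying 958,400 bytes, every time the screen moves.

```python
WIDTH, HEIGHT = 800, 600   # screen size quoted by Wintle
BYTES_PER_PIXEL = 2        # assumption: 16-bit (RGB565-style) framebuffer


def scroll_up_one_row(framebuffer: bytearray, width: int, height: int,
                      bytes_per_pixel: int) -> int:
    """Scroll a linear framebuffer up by one pixel row in software.

    Returns the number of bytes that had to be copied.
    """
    stride = width * bytes_per_pixel      # bytes in one row
    moved = stride * (height - 1)         # every row except the top one
    # Rows 1..height-1 slide up into rows 0..height-2.
    framebuffer[0:moved] = framebuffer[stride:stride + moved]
    # Blank the newly exposed bottom row.
    framebuffer[moved:moved + stride] = bytes(stride)
    return moved


if __name__ == "__main__":
    fb = bytearray(WIDTH * HEIGHT * BYTES_PER_PIXEL)
    copied = scroll_up_one_row(fb, WIDTH, HEIGHT, BYTES_PER_PIXEL)
    print(f"bytes copied per one-pixel scroll: {copied}")
```

A hardware 2D engine performs the same block move without the CPU touching each pixel, which is why scrolling is where the difference shows most clearly.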

ByteSnap’s test results: 

Processor  Clock    BitBlt  PatBlt  DrawText  PolyLine
iMX25      400MHz   3563    1141    479       93
iMX28      450MHz   3079    984     429       93
PXA320     624MHz   1751    179     142       69
AM3517     600MHz   3196    1182    240       122
DM3730     1000MHz  1782    618     129       60
iMX53      800MHz   2223    683     80        60

Times are in milliseconds, so higher numbers mean a slower response. For each platform, all the mean values from the GDIP test have been averaged into the four categories shown.

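The older PXA320 ARM-based processor is the quickest in the BitBlt and PatBlt block operations, and is beaten on text and line drawing only by the two highest-clocked parts, even though it is not the fastest-clocked processor itself.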

“The main reason is that the PXA320 has optimised assembly routines for line operations, and calls its internal 2D engine to accelerate line and BitBlt [fill and copy] operations,” said Wintle. “The others either don’t have a GPU [iMX25/28] or they only use it for OpenGL operations, which we aren’t testing here, so it comes down to a clock-for-clock comparison.”
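One crude way to look at the table clock-for-clock is to multiply each time by the clock speed: milliseconds times megahertz gives thousands of CPU cycles. The sketch below does that with the published figures. It is not ByteSnap’s method, the function names are invented for illustration, and the measure ignores differences in core design and memory bandwidth.

```python
# Figures from ByteSnap's results table: clock in MHz, mean times in ms.
RESULTS = {
    "iMX25":  (400,  {"BitBlt": 3563, "PatBlt": 1141, "DrawText": 479, "PolyLine": 93}),
    "iMX28":  (450,  {"BitBlt": 3079, "PatBlt": 984,  "DrawText": 429, "PolyLine": 93}),
    "PXA320": (624,  {"BitBlt": 1751, "PatBlt": 179,  "DrawText": 142, "PolyLine": 69}),
    "AM3517": (600,  {"BitBlt": 3196, "PatBlt": 1182, "DrawText": 240, "PolyLine": 122}),
    "DM3730": (1000, {"BitBlt": 1782, "PatBlt": 618,  "DrawText": 129, "PolyLine": 60}),
    "iMX53":  (800,  {"BitBlt": 2223, "PatBlt": 683,  "DrawText": 80,  "PolyLine": 60}),
}


def kilocycles(results):
    """Scale each time by the clock: ms x MHz = thousands of CPU cycles."""
    return {
        chip: {test: ms * mhz for test, ms in times.items()}
        for chip, (mhz, times) in results.items()
    }


def fewest_cycles(results):
    """For each test category, name the chip that needs the fewest cycles."""
    scaled = kilocycles(results)
    tests = next(iter(scaled.values())).keys()
    return {
        test: min(scaled, key=lambda chip: scaled[chip][test])
        for test in tests
    }


if __name__ == "__main__":
    for test, chip in fewest_cycles(RESULTS).items():
        print(f"{test}: {chip}")
```

On these figures the PXA320 needs the fewest cycles for both BitBlt and PatBlt, which is consistent with Wintle’s explanation that its 2D engine, rather than raw clock speed, is doing the work.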

Freescale’s iMX25 has only LCD driver hardware, with no graphics acceleration at all, and an ARM core with no SIMD (Neon) extensions. The iMX28 is the same as the iMX25, except for a slightly faster clock and a wider data bus for faster RAM access.

Marvell’s PXA320 has basic 2D line acceleration in its ARM core, and the silicon vendor supplies Windows CE drivers that use this hardware.

Texas Instruments’ DM3730 and AM3517 “have loads of graphics acceleration on-board, but it cannot be used for this GDI test. PatBlt [patterned block operation] is slower even though it includes optimised Neon instructions. We did not even use Neon in the PXA320 runs”, said Wintle.

The Freescale iMX53 graphics drivers can be compiled with GDI acceleration, which helps on operations such as text and line drawing, “but looks to lack the full implementation on the block operations to match the older PXA320”, said Wintle.

So what is his advice?
“Be careful what your application uses. A CPU with a GPU is something that you think you need, but really aren’t using to its full potential unless you are running it in the correct way. An older 2D accelerated chipset, maybe the one you are already using, could be quicker,” said Wintle.
