Benchmark data and implementation conditions matter when choosing a high-performance microprocessor core, but true comparisons are often difficult to achieve.
Consumers want higher-quality music, video and gaming in multimedia products, and businesses want converged products that deliver multiple office applications. To make all of this possible, embedded designs need high-performance microprocessor cores.
The trouble is that specifying and choosing a core involves more design and project constraints than might first be expected. Cost, power and time to market matter as much as performance, especially in consumer markets where being first to market with a competitively priced product is critical to success.
Is benchmark data all it seems?
Designers typically look at published benchmark figures when choosing cores. Ideally, the data should provide firm technical information on which to base both a rational selection and an evaluation of system performance.
Two of the most established processor benchmarks are Dhrystone and EEMBC. Dhrystone, which has been in use for over 20 years, consists of a small amount of code and operates on a small data set.
Despite just about every vendor quoting performance in Dhrystone MIPS or ‘DMIPS’, the benchmark does not actually reflect the performance of a processor dealing with real-world operations: it only measures processing within the core, rather than the system performance, and places a disproportionate emphasis on string operations.
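For reference, a DMIPS figure is the raw Dhrystone score divided by 1757, the Dhrystones-per-second result of the VAX 11/780 that serves as the nominal 1 MIPS machine. The sketch below shows that arithmetic, plus the DMIPS/MHz normalisation vendors commonly quote. The function names and the sample score and clock rate are hypothetical, chosen only to make the numbers easy to follow.

```python
# Dhrystones per second scored by the VAX 11/780, the nominal "1 MIPS" reference.
VAX_DHRYSTONES_PER_SECOND = 1757


def dmips(dhrystones_per_second: float) -> float:
    """Convert a raw Dhrystone score into Dhrystone MIPS."""
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND


def dmips_per_mhz(dhrystones_per_second: float, clock_mhz: float) -> float:
    """Normalise a DMIPS figure by clock frequency."""
    if clock_mhz <= 0:
        raise ValueError("clock_mhz must be positive")
    return dmips(dhrystones_per_second) / clock_mhz


# Hypothetical core: 878,500 Dhrystones per second at 400 MHz.
score = dmips(878_500)                    # 500 DMIPS
efficiency = dmips_per_mhz(878_500, 400)  # 1.25 DMIPS/MHz
print(f"{score:.1f} DMIPS, {efficiency:.2f} DMIPS/MHz")
```

Neither number says anything about the memory system or the workload, which is why the same DMIPS/MHz figure can correspond to very different application performance.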
EEMBC, and other benchmarks such as BDTi, Mediabench and Spec, more closely reflect real-world workloads and can give a more accurate indication of how a processor will actually perform.
The exact benchmark conditions play a major part in determining the quoted figures. Getting an ‘apples for apples’ comparison is critical, so designers should ask the following questions:
● Was the benchmark measured on the first loop, or after running through several times? After the first loop, caches will have been loaded and branch prediction set up, which will let the processor take advantage of these features.
● Has the compiler been optimised for the processor? A compiler that doesn’t take advantage of specific micro-architectural features in the processor can lower performance by 10 to 20%, giving an unfairly poor result for the processor.
● Has the compiler been specifically tuned for a benchmark? Compilers that have been specially optimised for particular benchmarks can easily double the performance of the processor for that benchmark.
While this makes the processor look good, it is misleading for designers who are trying to interpret the data either to choose between cores or understand whether the processor will meet their application’s needs.
● Which flags have been set for compilation? Has the compiler been set to optimise for code density or performance? Instructing the compiler to produce dense code can mean a smaller code footprint, but will decrease performance.
Also, using a specific, unusual set of flags to get a good result for a benchmark means that the result will not be relevant to the real workload the designer has in mind for the product. If the benchmark conditions are unpublished, not transparent or impossible to reproduce, it is very difficult to make like-for-like comparisons.
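The first question in the list, whether a figure comes from the first loop or from later ones, is easy to make explicit in a measurement harness. The following is a minimal sketch that reports the first run separately from the median of the later runs. It runs on a host machine in Python, so it only illustrates the reporting discipline, not the cache or branch-predictor behaviour of any particular core. `time_runs` and `example_workload` are hypothetical names.

```python
import time
from statistics import median


def time_runs(workload, runs=10):
    """Time `workload` repeatedly.

    Returns (first_run_seconds, median_of_later_runs_seconds) so that
    the cold first pass is never silently averaged into the warm figure.
    """
    if runs < 2:
        raise ValueError("need at least two runs to separate first from later runs")
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return timings[0], median(timings[1:])


def example_workload():
    # Stand-in for a benchmark kernel; replace with the real workload.
    total = 0
    for i in range(50_000):
        total += i * i
    return total


first, warm = time_runs(example_workload)
print(f"first run: {first:.6f} s, median of later runs: {warm:.6f} s")
```

Publishing both numbers, together with the compiler version and flags used, lets a reader see which of the two a quoted figure corresponds to.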
Maximising performance
Improving general purpose computation in the core is one way to enhance performance, but there are other things designers can do within the system to speed it up.
Selecting the right memory architecture is key to maximising system performance and optimising cost. Decisions about how to use SRAM and ROM, and about the size and partitioning of the cache or tightly coupled memories (TCM), are fundamental to the system design.
These choices depend upon the real-time constraints of the application and the flexibility of the processor.
If level-2 cache memory is used, it is possible to bring more data closer to the CPU, which helps resolve the performance-limiting bandwidth and latency constraints associated with accessing off-chip memory.
Power savings
Using level-2 cache can dramatically improve the performance of systems with large code or data structures, long memory latencies, or busy system buses. At the same time, considerable power savings are possible since many off-chip transactions are eliminated.
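The effect of a level-2 cache on latency and on off-chip traffic can be estimated with the standard average memory access time (AMAT) model. The sketch below is a first-order illustration only: every latency and miss rate in it is an assumed, illustrative value rather than a figure for any real core, and the function names are hypothetical.

```python
def amat_without_l2(l1_hit_cycles, l1_miss_rate, memory_cycles):
    """Average memory access time when every L1 miss goes off-chip."""
    return l1_hit_cycles + l1_miss_rate * memory_cycles


def amat_with_l2(l1_hit_cycles, l1_miss_rate, l2_hit_cycles,
                 l2_local_miss_rate, memory_cycles):
    """Average memory access time when L1 misses are first looked up in L2.

    l2_local_miss_rate is the fraction of L2 lookups that miss.
    """
    l1_miss_penalty = l2_hit_cycles + l2_local_miss_rate * memory_cycles
    return l1_hit_cycles + l1_miss_rate * l1_miss_penalty


def offchip_accesses_per_1000(l1_miss_rate, l2_local_miss_rate=1.0):
    """Off-chip transactions per 1000 memory accesses (1.0 means no L2)."""
    return 1000 * l1_miss_rate * l2_local_miss_rate


# Assumed values: 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit,
# 20% L2 local miss rate, 100-cycle off-chip access.
before = amat_without_l2(1, 0.05, 100)        # about 6 cycles
after = amat_with_l2(1, 0.05, 10, 0.2, 100)   # about 2.5 cycles
print(f"AMAT: {before:.2f} -> {after:.2f} cycles")
print(f"off-chip accesses per 1000: {offchip_accesses_per_1000(0.05):.0f} "
      f"-> {offchip_accesses_per_1000(0.05, 0.2):.0f}")
```

Under these assumptions, off-chip transactions fall from about 50 to about 10 per 1000 accesses, which is where the power saving comes from. The model ignores bus contention and write traffic.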
The system’s bus architecture also influences system performance, and determines how backplane and peripheral components will perform with the core.
For example, the AMBA 3 AXI bus system is appropriate for high-frequency, low-latency designs that maximise the use of interconnect resources, which enables very high data throughput.
Technologies and features that accelerate specific tasks such as executing Java or media algorithms further enhance processor performance. These hardware ‘extensions’ can perform specific types of tasks more efficiently than the core processor itself.
The ability to combine more technology into the processor, with a single coherent toolchain and the ability to leverage a wide software base, can be a powerful motivation for designers focused on minimising cost and risk.
Because execution environments are tightly coupled to the processor, the ability to accelerate languages such as Java with a reasonable code size is an important consideration.
High-speed implementation
When it comes to implementation, it’s important to understand how quoted performance data relates to the physical chip implementation, and what designers can do to achieve the optimal performance from the core.
Here are some of the questions designers should ask:
● Validity – Are the power, performance and area figures validated post-layout and after floor planning? Do the figures take into account power rails, signal integrity, IR-drop and scan?
● Frequency – What are the process, technology, library and PVT conditions? Is the RAM compiled or custom designed, and has its speed been taken into account?
● Area – Has the design been optimised for speed or area? Has the area overhead for MMU, RAM, power rails and signal integrity been taken into account? What kind of gate utilisation can designers achieve ‘out of the box’?
● Power – Does the measurement use any low-power libraries or additional low-power techniques? What are the measurement conditions?
● Implementation – Was the design implemented through a standard ‘out-of-the-box’ Asic methodology, or was custom optimisation work performed?
Custom implementation is labour-intensive and can delay time to market, but in general it can achieve higher performance than a standard Asic implementation.
Improving performance
When quoting performance data, ARM uses realistic de-rating margins wherever possible. For example, using performance data without allowing for OCV (on-chip variation) can give an overly optimistic view of a core’s performance.
Judicious use of some low threshold voltage (LVt) transistors allows for a significant increase in frequency. Traditionally, ARM publishes benchmarks for the slow-slow (SS) process corner, but designing for typical (TT) silicon can raise performance in low-power 65nm and 45nm technologies.
Overdriving the voltage also improves clock frequency but, like many performance trade-offs, at the expense of higher power consumption.
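Two pieces of arithmetic sit behind these points. De-rating reduces a quoted frequency by a margin, and the first-order CMOS relation says dynamic power scales with voltage squared times frequency. The sketch below illustrates both. The 10% margin and the voltage and frequency ratios are assumptions picked for illustration, not ARM figures, and the model ignores leakage power.

```python
def derated_frequency(quoted_mhz, derate_fraction):
    """Apply a de-rating margin (for example, for OCV) to a quoted frequency."""
    if not 0.0 <= derate_fraction < 1.0:
        raise ValueError("derate_fraction must be in [0, 1)")
    return quoted_mhz * (1.0 - derate_fraction)


def relative_dynamic_power(voltage_ratio, frequency_ratio):
    """First-order dynamic power scaling: P is proportional to V^2 * f.

    Both arguments are ratios against the nominal operating point.
    """
    return voltage_ratio ** 2 * frequency_ratio


# Assumed example: a 500 MHz quoted figure with a 10% de-rating margin.
realistic_mhz = derated_frequency(500.0, 0.10)   # about 450 MHz

# Assumed example: 10% overdrive that buys 10% more clock frequency.
power_ratio = relative_dynamic_power(1.10, 1.10)  # about 1.33x dynamic power
print(f"{realistic_mhz:.0f} MHz after de-rating, "
      f"{power_ratio:.2f}x dynamic power when overdriven")
```

In this illustration a 10% frequency gain from overdriving costs roughly a third more dynamic power, which is the trade-off the text describes.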
Choosing a high-performance processor requires a careful approach to interpreting benchmark data. Developers need to consider whether the benchmark conditions enable a fair comparison between different cores.
When it comes to predicting performance accurately, the use of realistic de-rating information is essential. Combining performance-focused implementation enhancements can give a significant boost to clock frequencies.
Karl Whealton is a product manager at ARM