Among the most notable changes in the development of intelligent connected devices over the past few years has been the emergence of multicore devices, which incorporate multiple processing cores in more power-efficient packages.
Unfortunately for software developers, these devices increase design complexity by moving the programming paradigm away from a primarily sequential model toward one that incorporates true parallel, or more properly, concurrent program execution. This offers the potential to produce higher-performance, smaller and more power-efficient designs, but requires designers and developers to "think differently" if the full benefit of multicore performance is to be realised. In this article I outline five tips for anyone looking at multicore silicon for their next project. The article is primarily aimed at those doing the following:
• Working in the embedded or device software domain
• Initiating their first multicore-based development or undertaking their first migration of an existing project from unicore to multicore silicon
• Moving from a traditional unicore real-time operating system to a symmetric multiprocessing real-time operating system (such as Wind River's VxWorks 6.6 with SMP) where a single instance of an operating system manages multiple, identical processing cores and application tasks are able to run on any available core
Some of the advice is also relevant for developers contemplating using an asymmetric multiprocessing architecture (i.e., separate operating systems running in each core), although data synchronisation in an AMP environment does hold its own particular challenges that are not covered in this article.
Tip 1: Be Aware of the Hardware Beneath You
It is well-understood that device software developers, especially those working directly at the operating system API level, need to be acutely aware of underlying hardware resources such as registers, caches and memory pages. However, there is another layer beneath the software interface that remains largely hidden from software developers. It is sometimes referred to as hardware optimisation and uses advanced silicon optimisation techniques (e.g., pipelining and branch prediction) to optimise the way that code is executed within limited hardware resources.
When operating in a true parallel execution environment such as a multicore processor it is necessary for device software developers to be aware of this hardware optimisation layer and to understand how it may affect the way their code is running.
For example, let us consider the effect of out-of-order execution (or OoOE), a hardware optimisation technique that is used to make better use of processor cycles. It enables the processing core to continue to run at times when it would normally be stalled waiting for unavailable data by completing execution of alternate (out-of-order) instructions and placing resultant write operations in a (possibly out-of-order) "committal" queue. Within the context of a single core, OoOE is unseen at the software level because the core includes logic that ensures data coherency. However, the asynchronous nature of multiple cores means that one core may observe another core's writes in a different order from the one in which the program issued them, so there is always a potential for inconsistent data.
To guard against potential problems caused by OoOE, programmers should use the specific functions that SMP operating systems provide for this purpose. These are normally referred to as "memory barrier" functions and are compiled down to low-level machine instructions (in the case of the Freescale 8572, the assembly language instruction is called mbar) that enforce the correct ordering of memory operations. This ensures that the developer can correctly control arbitration between multiple cores accessing global variables.
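As an illustration of the idea, here is a minimal sketch of a barrier-protected hand-off between two cores. It assumes a C11 compiler and uses the standard <stdatomic.h> fences as a portable stand-in for whatever memory barrier functions your SMP operating system provides; the names shared_payload, payload_ready, publish and try_consume are hypothetical and chosen only for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Shared between a producer on one core and a consumer on another. */
static int shared_payload;         /* ordinary, non-atomic data          */
static atomic_bool payload_ready;  /* flag that announces the data; a    */
                                   /* static object starts out as false  */

/* Producer: write the data first, then raise the flag. */
void publish(int value)
{
    shared_payload = value;
    /* Release fence: the payload write above may not be reordered
       after the flag store below. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&payload_ready, true, memory_order_relaxed);
}

/* Consumer: returns true and fills *out once the data has been published. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&payload_ready, memory_order_relaxed))
        return false;
    /* Acquire fence: the payload read below may not be reordered
       before the flag load above. */
    atomic_thread_fence(memory_order_acquire);
    *out = shared_payload;
    return true;
}
```

Without the two fences, a consumer on another core could in principle see the flag set while still reading a stale payload, which is exactly the kind of inconsistency described above.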
Tip 2: Avoid Assumptions Based on Single Core Hardware
A preemptive priority-based scheduler (of the type normally used in VxWorks) running on a traditional single core device will schedule the highest priority ready-to-run task and allow it execution time. This simple fact has been a constant since the development of the first RTOS implementations and has spawned a series of assumptions about how to develop code in such environments, including the following:
• While running in the context of a relatively high priority task it is safe to spawn other lower priority tasks and know that they will not begin running until the current high priority task completes or pends.
• While running in the context of a relatively high priority task, performing a semGive may cause preemption by higher priority tasks that are pending on that semaphore. However, lower priority tasks pended on the same semaphore will continue in the pended state.
• Whilst in interrupt state, all tasks (of any priority) will cease to run.
Assumptions of this kind lead to programming methods referred to as implicit synchronisation, meaning that a synchronisation event is implied in the code but is not explicitly defined. Such techniques work in a unicore environment because there is only one execution resource running only the single highest priority task or interrupt at any one time (i.e., there is only one thread of execution).
However, an SMP operating system (again, such as VxWorks 6.6 with SMP) implements a modified scheduling algorithm in which the highest N priority tasks are given processor time (where N is equivalent to the number of cores available). In this environment, priority-based implicit synchronisation techniques cease to be valid because, depending on the state of the system at any point in time, multiple tasks or interrupts (of any priority) could run concurrently.
In essence, priority cannot be relied upon as a method through which to lock access to any specific resource in the system (whether it is CPU time, memory or peripherals).
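One way to replace the first assumption above with explicit synchronisation is a "start gate": spawned tasks wait on the gate, and the parent opens it only when shared state is ready. The sketch below assumes POSIX threads as a portable stand-in for the semaphore or event primitives of your RTOS; start_gate_t and the gate_* functions are hypothetical names invented for this example.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* A gate that spawned tasks must pass before touching shared state. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  opened;
    bool            is_open;
} start_gate_t;

void gate_init(start_gate_t *g)
{
    pthread_mutex_init(&g->lock, NULL);
    pthread_cond_init(&g->opened, NULL);
    g->is_open = false;
}

/* Called by the parent once shared state is fully initialised. */
void gate_open(start_gate_t *g)
{
    pthread_mutex_lock(&g->lock);
    g->is_open = true;
    pthread_cond_broadcast(&g->opened);  /* wake every waiting task */
    pthread_mutex_unlock(&g->lock);
}

/* Called first by each spawned task; blocks until the gate is open. */
void gate_wait(start_gate_t *g)
{
    pthread_mutex_lock(&g->lock);
    while (!g->is_open)                  /* loop guards against spurious wakeups */
        pthread_cond_wait(&g->opened, &g->lock);
    pthread_mutex_unlock(&g->lock);
}

/* Reports the gate state under the lock. */
bool gate_is_open(start_gate_t *g)
{
    pthread_mutex_lock(&g->lock);
    bool open = g->is_open;
    pthread_mutex_unlock(&g->lock);
    return open;
}
```

With this in place the ordering no longer depends on task priority: a spawned task that happens to start immediately on another core simply blocks in gate_wait until the parent calls gate_open.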
Tip 3: Parallelise to Optimise
A fundamental step in the development of any truly optimised SMP-based system is the parallelisation of the application. This requirement is described by Amdahl's law, which carries a relatively simple message: the speedup gained from adding cores is limited by the portion of the algorithm that must still run serially, so algorithms with a higher proportion of parallel work will benefit far more from multicore hardware than algorithms that are largely serial in nature.
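To make the message concrete, Amdahl's law is usually written as speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of cores. The small function below is only a sketch for experimenting with the numbers; the name amdahl_speedup is hypothetical.

```c
/* Amdahl's law: theoretical speedup of a workload on `cores` cores
   when `parallel_fraction` (0.0 to 1.0) of it can run in parallel.
   The remaining (1 - parallel_fraction) always runs serially. */
double amdahl_speedup(double parallel_fraction, int cores)
{
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores);
}
```

For example, an application that is 50 percent parallel gains a speedup of only about 1.33 on two cores, whereas one that is 90 percent parallel gains about 3.08 on four cores. Even then, the 10 percent serial portion caps the speedup at 10 no matter how many cores are added.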
The development of a parallelised application requires the developer to think in terms of parallel threads of execution rather than single tasks or processes. Threads, in this context, are most easily thought of as duplicated tasks that run simultaneously on multiple cores and share the application load.
It is important to note, however, that threads are not encapsulated within their own address space but exist within the same address space as their duplicate brethren and therefore, whilst having access to their own local variables and stack, may require access to the same global memory or physical resources. So in the same way that task duplication provides the route to optimised multicore applications, it is effective synchronisation that lies at the heart of task duplication.
Indeed the importance of a good understanding of effective synchronisation techniques should not be understated because they lie at the heart of successful parallelisation but are also a rich source of bugs on multicore systems.
Most SMP operating systems include a number of synchronisation functions such as spinlocks, CPU-specific mutual exclusion, memory barriers and atomic operators. These should be thoroughly researched to decide the best approach for synchronisation in any specific situation.
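The choice between these mechanisms often depends on how much data has to be protected. The following sketch contrasts two of them, assuming C11 <stdatomic.h> as a portable stand-in for the operating system's own spinlock and atomic operator APIs; all names here (table_lock, table_shift_in, count_event and so on) are hypothetical.

```c
#include <stdatomic.h>

static atomic_flag table_lock = ATOMIC_FLAG_INIT;
static int table[4];             /* multi-word data: needs a lock        */
static atomic_long event_count;  /* single word: an atomic op is enough  */

/* Minimal spinlock: busy-wait until the flag was previously clear. */
void spin_lock(atomic_flag *lock)
{
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        ;  /* spin; suitable only for very short critical sections */
}

void spin_unlock(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}

/* Updating several words must appear indivisible, so take the lock. */
void table_shift_in(int value)
{
    spin_lock(&table_lock);
    for (int i = 3; i > 0; --i)
        table[i] = table[i - 1];
    table[0] = value;
    spin_unlock(&table_lock);
}

int table_read(int index)
{
    spin_lock(&table_lock);
    int value = table[index];
    spin_unlock(&table_lock);
    return value;
}

/* A lone counter can be bumped with one atomic read-modify-write.
   Returns the new count. */
long count_event(void)
{
    return atomic_fetch_add(&event_count, 1) + 1;
}
```

As a rule of thumb, an atomic operator is the cheaper option when a single word is shared, while a lock is needed as soon as several values must change together.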
It is important for developers to be aware of the number of cores available to them and to balance the application with those resources. For example, spawning three parallel, identical threads on a two-core device provides no more performance than spawning two threads in the same situation (and, depending on the application, may actually reduce performance due to the cycles required to spawn and schedule the additional thread). Most SMP operating systems include a function that allows the developer to find how many cores are available; functions of this type should be used to create parallelised code that is balanced with the hardware resources available.
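A minimal sketch of this balancing step is shown below. It assumes a POSIX-like environment in which sysconf(_SC_NPROCESSORS_ONLN) reports the number of online cores (a widely available extension rather than a guaranteed part of every system; your RTOS will have its own equivalent call). The helper names available_cores, balanced_worker_count and worker_range are hypothetical.

```c
#include <unistd.h>

/* Number of cores currently online; falls back to 1 if unknown. */
long available_cores(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? n : 1;
}

/* Never spawn more identical workers than there are cores to run them. */
long balanced_worker_count(long requested)
{
    long cores = available_cores();
    if (requested < 1)
        return 1;
    return requested < cores ? requested : cores;
}

/* Half-open range [*start, *end) of items handled by worker `index`
   out of `workers`, spreading any remainder over the first workers. */
void worker_range(long total, long workers, long index,
                  long *start, long *end)
{
    long base  = total / workers;
    long extra = total % workers;
    *start = index * base + (index < extra ? index : extra);
    *end   = *start + base + (index < extra ? 1 : 0);
}
```

With 10 items and 3 workers, for instance, worker_range assigns the ranges [0, 4), [4, 7) and [7, 10), so no worker carries more than one extra item.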
Tip 4: Be Prepared for More Complex Debug Problems
Firstly, the good news: Software developers have many of the same development and debug tools available whether they are working with unicore or multicore devices.
The bad news is that the debug challenges they face can be significantly more complex in the multicore world. This complexity arises from the following:
• The inherently asynchronous nature of the system
• The sharing of resources
• The fact that the majority of existing code and programmer skills are focused on unicore operation
• The inherently sequential nature of current programming techniques and languages
These factors lead to debug issues that can be among the most time-consuming and perplexing that a developer is likely to meet, including timing problems (e.g., race conditions between cores) and inconsistent data (perhaps caused by incorrect locking of shared memory). These complex debug issues call for a new type of debug tool focusing on the control and visualisation of the individual resources in the system.
Multicore-capable hardware debug tools are, like those for unicore devices, based on JTAG or BDM technology and focus on the control of individual hardware blocks at a low level. They are useful in the early hardware debug stage because they allow low-level control of devices using a hardware interface through which they can quite literally freeze a processor core or other silicon device. However, multicore devices add a level of complexity because freezing a single processor core on a multicore device is no longer enough. The best multicore-capable hardware tools therefore have the ability to freeze multiple CPUs whilst avoiding delay problems referred to as "skid," which can cause data overrun or underrun conditions that obscure the initial problem and make debugging more difficult.
During the code debug phase, developers have many of the same tools available to them in the multicore environment as they do in the unicore environment. However, the additional complexity of multicore debug makes tools to enable visualisation of the running system particularly important. In the case of VxWorks 6.6 with SMP, Wind River System Viewer allows visualisation of the threads running on individual cores and allows the developer to visualise potential issues arising from synchronisation problems.
Tip 5: Use the Three-Step Port, Test, Optimise Paradigm
Possibly the most important step in migrating from a unicore to a multicore/SMP environment comes before coding even begins: gaining a thorough understanding of the task at hand and carefully planning the migration effort. Once migration work begins, the following three-step approach will provide a framework for the developer to move code that was originally written for a unicore device into a multicore environment:
- Port your code to the latest version of whichever OS you are using. This preparatory step is essential to ensure that outdated or inappropriate API calls will not cause problems at a later time. You should also, where possible, begin using API calls that are "SMP friendly" (such as replacing semaphores with spinlocks or intLock with intCpuLock). Ensure that your code continues to run correctly in a unicore environment by testing it either on a unicore device or on the targeted multicore device with only one core enabled.
- Test your newly ported code on a true multicore system. This step is likely to expose unicore assumptions in your code and you may find that you spend a good deal of time debugging code that worked in a unicore environment but is simply unsuitable for use in a multicore system. If possible use advanced software debug tools aimed at multicore (such as System Viewer for VxWorks 6.6 with SMP) to visualise and identify run-time errors.
- Parallelise your application to fully realise the benefits of a multicore environment. Think in terms of duplicated threads of execution and consider modifying the algorithms in use at the core of the application. Use advanced debug tools to analyse where further optimisation may be possible (e.g., to find where resources are being underused, possibly leading to idle core time).
New programming practices
Multicore silicon, being new and unfamiliar to most designers and developers, has often been seen as a complex and high-risk choice of technology. However, it is now becoming increasingly commonplace, and device software developers are now more likely than ever to find themselves working in a multicore environment.
This transition will force developers to learn new programming practices if a smooth development or migration effort is to be assured. Gaining a full understanding of the true nature of multicore computing, avoiding assumptions learned from unicore programming and following a well planned migration strategy will ensure that developers transition smoothly into this new and exciting technology.
Paul Tingey is a system architect with Wind River, supporting alliance companies across a number of vertical markets.