
Recent years have seen a rapid increase in the use of
multicore devices in the PC market. This trend is now being
replicated in the device software environment with dual-core and
many-core devices being released by a wide range of silicon
vendors.
While these multicore devices clear the way for increased
performance, this can only be achieved if software developers adapt
the way they write code so that it fully uses the newly realised
hardware resources they have available. This article outlines the
multicore programming "problem" and lists five top tips to help
software developers prepare for the new "parallel landscape" in
front of them.
The tips are for anyone looking at multicore silicon for their next
project, and are primarily aimed at those doing the following:
• Working in the embedded or device software domain
• Initiating their first multicore-based development or undertaking
their first migration of an existing project from unicore to
multicore silicon
• Moving from a traditional unicore real-time operating system to a
symmetric multiprocessing real-time operating system (such as Wind
River's VxWorks 6.6 with SMP) where a single instance of an
operating system manages multiple, identical processing cores and
application tasks are able to run on any available core
Some of the advice is also relevant for developers contemplating
using an asymmetric multiprocessing architecture (i.e., separate
operating systems running in each core), although data
synchronisation in an AMP environment does hold its own particular
challenges that are not covered in this article.
Tip 1: Be Aware of the Hardware Beneath You
It is well-understood that device software developers,
especially those working directly at the operating system API
level, need to be acutely aware of underlying hardware resources
such as registers, caches and memory pages. However, there is
another layer beneath the software interface that remains largely
hidden from software developers. It is sometimes referred to as
hardware optimisation and uses advanced silicon optimisation
techniques (e.g., pipelining and branch prediction) to optimise the
way that code is executed within limited hardware resources.
When operating in a true parallel execution environment such as
a multicore processor it is necessary for device software
developers to be aware of this hardware optimisation layer and to
understand how it may affect the way their code is running.
For example, let us consider the effect of out-of-order
execution (or OoOE), a hardware optimisation technique that makes
better use of the processor's cycles. It enables the
processing core to continue to run at times when it would normally
be stalled waiting for unavailable data by completing execution of
alternate (out-of-order) instructions and placing resultant write
operations in a (possibly out-of-order) "committal" queue. Within
the context of a single core, OoOE is unseen at the software level
because the core includes logic that ensures data coherency.
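However, this logic is local to each core. Because multiple cores
run asynchronously, one core may observe another core's writes in a
different order from the one the program specified, so there is
always a potential for inconsistent data.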
To guard against potential problems caused by OoOE, programmers
should use the specific functions that SMP operating systems provide.
These are normally referred to as "memory barrier" functions and
are compiled down to low-level machine instructions (in the case of
the Freescale 8572, the assembly language
instruction is called mbar) that enforce the correct ordering of
memory operations. This ensures that the developer can correctly
control arbitration between multiple cores accessing global
variables.
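The barrier pattern described above can be sketched in portable C11. This is an illustration only and not the VxWorks API: it assumes a C11 toolchain with <stdatomic.h> and POSIX threads, it uses atomic_thread_fence as a stand-in for an operating system's memory barrier function, and the names payload, ready and run_barrier_demo are invented for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;        /* ordinary global data written by the producer */
static atomic_int ready;   /* flag that tells the consumer the data is valid */

/* Producer: write the data, issue a barrier, then raise the flag. */
static void *producer(void *arg)
{
    (void)arg;
    payload = 42;
    /* Barrier: the payload write must be visible before the flag write. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
    return NULL;
}

/* Consumer: wait for the flag, issue a barrier, then read the data. */
static void *consumer(void *arg)
{
    int *seen = arg;
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;   /* busy-wait until the producer raises the flag */
    /* Barrier: the payload read must not be performed before the flag read. */
    atomic_thread_fence(memory_order_acquire);
    *seen = payload;
    return NULL;
}

/* Runs one producer and one consumer and returns the value the consumer saw. */
int run_barrier_demo(void)
{
    pthread_t prod, cons;
    int seen = -1;
    int rc;

    payload = 0;
    atomic_store(&ready, 0);

    rc = pthread_create(&cons, NULL, consumer, &seen);
    assert(rc == 0);
    rc = pthread_create(&prod, NULL, producer, NULL);
    assert(rc == 0);
    (void)rc;

    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    return seen;
}
```

Without the two fences, a core with a weakly ordered memory model would be free to make the flag visible before the payload, and the consumer could read stale data even though the source code looks correctly ordered.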
Tip 2: Avoid Assumptions Based on Single Core Hardware
A preemptive priority-based scheduler (of the type normally used
in VxWorks) running on a traditional single core device will
schedule the highest priority ready-to-run task and allow it
execution time. This simple fact has been a constant since the
development of the first RTOS implementations and has spawned a
series of assumptions about how to develop code in such
environments, including the following:
• While running in the context of a relatively high priority task
it is safe to spawn other lower priority tasks and know that they
will not begin running until the current high priority task
completes or pends.
• While running in the context of a relatively high priority task,
performing a semGive may cause preemption by higher priority tasks
that are pending on that semaphore. However, lower priority tasks
pended on the same semaphore will continue in the pended
state.
• Whilst in interrupt state, all tasks (of any priority) will cease
to run.
Assumptions of this kind lead to programming methods referred to as
implicit synchronisation, meaning that a synchronisation event is
implied in the code but is not explicitly defined. Such techniques
work in a unicore environment because there is only one execution
resource running only the single highest priority task or interrupt
at any one time (i.e., there is only one thread of execution).
However, an SMP operating system (again, such as VxWorks 6.6 with
SMP) implements a modified scheduling algorithm in which the N
highest-priority ready tasks are given processor time (where N is
equal to the number of cores available). In this
environment, priority-based implicit synchronisation techniques
cease to be valid because, depending on the state of the system at
any point in time, multiple tasks or interrupts (of any priority)
could run concurrently.
In essence, priority cannot be relied upon as a method through
which to lock access to any specific resource in the system
(whether it is CPU time, memory or peripherals).
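One way to replace implicit, priority-based synchronisation is to make the lock explicit. The sketch below is a minimal illustration under stated assumptions: it uses a POSIX mutex as a stand-in for whatever mutual exclusion primitive the target RTOS provides, and the names shared_counter, counter_lock and run_workers are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_WORKERS 16

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter;   /* global resource shared by every worker */

/* Each worker updates the shared counter under an explicit lock,
 * so correctness does not depend on task priority or core count. */
static void *worker(void *arg)
{
    long iterations = *(long *)arg;
    for (long i = 0; i < iterations; i++) {
        pthread_mutex_lock(&counter_lock);
        shared_counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Starts n_workers threads and returns the final counter value. */
long run_workers(int n_workers, long iterations)
{
    pthread_t tid[MAX_WORKERS];

    assert(n_workers >= 1 && n_workers <= MAX_WORKERS);
    shared_counter = 0;

    for (int i = 0; i < n_workers; i++) {
        int rc = pthread_create(&tid[i], NULL, worker, &iterations);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < n_workers; i++)
        pthread_join(tid[i], NULL);

    return shared_counter;
}
```

Calling run_workers(4, 10000) yields 40000 regardless of how many cores run the workers, because the lock, not the scheduler, serialises access to the counter.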
Tip 3: Parallelise to Optimise
A fundamental step in the development of any truly optimised
SMP-based system is the need for parallelisation of the
application. This requirement is described by a model called
Amdahl's law, which relates the achievable speed-up to the fraction
of the work that can run in parallel. Its message is relatively
simple: the larger the proportion of an algorithm that can be
parallelised, the more it benefits from additional cores, while the
serial remainder limits the overall gain.
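Amdahl's law is commonly written as speed-up = 1 / ((1 - P) + P / N), where P is the fraction of the run time that can be parallelised and N is the number of cores. The short sketch below evaluates that formula; the function name amdahl_speedup is chosen for this example only.

```c
#include <assert.h>

/* Theoretical speed-up predicted by Amdahl's law.
 * parallel_fraction: share of the run time that can be parallelised (0..1)
 * cores:             number of cores sharing the parallel part (>= 1) */
double amdahl_speedup(double parallel_fraction, int cores)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(cores >= 1);

    double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / cores);
}
```

For a task that is half serial, amdahl_speedup(0.5, 2) gives roughly 1.33 rather than 2, and no number of cores can push the result beyond 2. This is why reducing the serial portion matters as much as adding cores.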
The development of a parallelised application requires the
developer to think in terms of parallel threads of execution rather
than single tasks or processes. Threads, in this context, are most
easily thought of as duplicated tasks that run simultaneously on
multiple cores and share the application load.
It is important to note, however, that threads are not
encapsulated within their own address space but exist within the
same address space as their duplicate brethren and therefore,
whilst having access to their own local variables and stack, may
require access to the same global memory or physical resources. So
in the same way that task duplication provides the route to
optimised multicore applications, it is effective synchronisation
that lies at the heart of task duplication.
Indeed, the importance of a good understanding of synchronisation
techniques should not be understated: applied well they make
parallelisation succeed, and applied badly they are a rich source of
bugs on multicore systems.
Most SMP operating systems include a number of synchronisation
functions such as spinlocks, CPU-specific mutual exclusion, memory
barriers and atomic operators. These should be thoroughly
researched to decide the best approach for synchronisation in any
specific situation.
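To make two of these primitives concrete, the sketch below builds a simple spinlock from a C11 atomic_flag and compares it with a lock-free atomic add. It is an illustration under assumptions, not any operating system's spinlock API: the type demo_spinlock and the functions demo_spin_take, demo_spin_give and sync_demo are hypothetical, and POSIX threads stand in for RTOS tasks.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_THREADS 8

typedef struct {
    atomic_flag held;   /* set while some thread owns the lock */
} demo_spinlock;

static demo_spinlock total_lock = { ATOMIC_FLAG_INIT };
static long locked_total;          /* plain variable guarded by total_lock */
static atomic_long atomic_total;   /* updated with an atomic operator, no lock */

/* Spin until the flag was previously clear, i.e. until we own the lock. */
static void demo_spin_take(demo_spinlock *lock)
{
    while (atomic_flag_test_and_set_explicit(&lock->held, memory_order_acquire))
        ;   /* busy-wait: suitable only for very short critical sections */
}

/* Release the lock so that another spinning thread can take it. */
static void demo_spin_give(demo_spinlock *lock)
{
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}

static void *add_worker(void *arg)
{
    long iterations = *(long *)arg;
    for (long i = 0; i < iterations; i++) {
        demo_spin_take(&total_lock);
        locked_total++;                       /* protected by the spinlock */
        demo_spin_give(&total_lock);
        atomic_fetch_add(&atomic_total, 1);   /* protected by atomicity */
    }
    return NULL;
}

/* Runs n_threads workers and reports both totals through the out pointers. */
void sync_demo(int n_threads, long iterations, long *locked_out, long *atomic_out)
{
    pthread_t tid[MAX_THREADS];

    assert(n_threads >= 1 && n_threads <= MAX_THREADS);
    locked_total = 0;
    atomic_store(&atomic_total, 0);

    for (int i = 0; i < n_threads; i++) {
        int rc = pthread_create(&tid[i], NULL, add_worker, &iterations);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < n_threads; i++)
        pthread_join(tid[i], NULL);

    *locked_out = locked_total;
    *atomic_out = atomic_load(&atomic_total);
}
```

Both totals come out the same. The atomic operator is the lighter choice for a single counter, whereas a lock is needed once several variables must change together.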
It is important for developers to be aware of the number of
cores available to them and to balance their application with those
resources. For example, spawning three parallel, identical threads
on a two-core device provides no more performance than spawning two
threads in the same situation (and depending on the application may
actually reduce performance due to the cycles required to spawn the
additional thread). Most SMP operating systems include a function
that allows the developer to find how many cores are available;
functions of this type should be used to create parallelised code
that is balanced with the hardware resources available.
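As a rough sketch of this balancing step, the code below queries the number of online cores and caps the number of worker threads accordingly. It assumes a Unix-like system where sysconf accepts _SC_NPROCESSORS_ONLN (a widely available extension rather than a guaranteed POSIX name); an RTOS will offer its own call for the same purpose. The function names online_cores and balanced_thread_count are hypothetical.

```c
#include <assert.h>
#include <unistd.h>

/* Number of cores currently online, or 1 if the query is not supported. */
int online_cores(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? (int)n : 1;
}

/* Caps the requested number of parallel threads at the core count,
 * so that no thread is created without a core to run it on. */
int balanced_thread_count(int wanted)
{
    assert(wanted >= 1);
    int cores = online_cores();
    return wanted < cores ? wanted : cores;
}
```

On a two-core device, balanced_thread_count(3) returns 2, which matches the example above: the third thread would add spawning cost without adding throughput.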
Tip 4: Be Prepared for More Complex Debug Problems
Firstly, the good news: Software developers have many of the
same development and debug tools available whether they are working
with unicore or multicore devices.
The bad news is that the debug challenges they face can be
significantly more complex in the multicore world. This complexity
arises from the following:
• The inherently asynchronous nature of the system
• The sharing of resources
• The fact that the majority of existing code and programmer skills
are focused on unicore operation
• The inherently sequential nature of current programming
techniques and languages
These factors lead to debug issues that can be among the most
time-consuming and perplexing that a developer is likely to meet
including timing problems (e.g., race conditions between cores) and
inconsistent data (maybe caused by incorrect locking of shared
memory). These complex debug issues call for a new type of debug
tool focusing on the control and visualisation of the individual
resources in the system.
Multicore-capable hardware debug tools are, like those for
unicore devices, based on JTAG or BDM technology and focus on the
control of individual hardware blocks at a low level. They are
useful in the early hardware debug stage because they allow
low-level control of devices using a hardware interface through
which they can quite literally freeze a processor core or other
silicon device. However, multicore devices add a further level
of complexity because freezing a single processor core is no longer
enough while the remaining cores continue to run. Therefore the best multicore
capable hardware tools now have the ability to freeze multiple CPUs
whilst avoiding delay problems referred to as "skid," which can
cause data overrun or underrun conditions that obscure the initial
problem and make debugging more difficult.
During the code debug phase, developers have many of the same
tools available to them in the multicore environment as they do in
the unicore environment. However, the additional complexity of
multicore debug makes tools to enable visualisation of the running
system particularly important. In the case of VxWorks 6.6 with SMP,
Wind River System Viewer allows visualisation of the threads
running on individual cores and allows the developer to visualise
potential issues arising from synchronisation problems.
Tip 5: Use the Three-Step Port, Test, Optimise Paradigm
Possibly the most important step in migrating from a unicore to
a multicore/SMP environment comes before coding even begins: gaining
a thorough understanding of the task at hand and carefully planning
the migration effort. Once migration work begins, the following
three-step approach will provide a framework for the developer to
move code that was originally written for a unicore device into a
multicore environment:
- Port your code to the latest version of whichever OS you are
using. This preparatory step is essential to ensure that outdated
or inappropriate API calls will not cause problems at a later time.
You should also, where possible, begin using API calls that are
"SMP friendly" (such as replacing semaphores with spinlocks or
intLock with intCpuLock). Ensure that your code
continues to run correctly in a unicore environment by testing it
either on a unicore device or on the targeted multicore device with
only one core enabled.
- Test your newly ported code on a true multicore system. This
step is likely to expose unicore assumptions in your code and you
may find that you spend a good deal of time debugging code that
worked in a unicore environment but is simply unsuitable for use in
a multicore system. If possible use advanced software debug tools
aimed at multicore (such as System Viewer for VxWorks 6.6 with SMP)
to visualise and identify run-time errors.
- Parallelise your application to fully realise the benefits of a
multicore environment. Think in terms of duplicated threads of
execution and consider modifying the algorithms in use at the core
of the application. Use advanced debug tools to analyse where
further optimisation may be possible (e.g., to find where resources
are being underused, possibly leading to idle core time).
New programming practices
Multicore silicon, being new and unfamiliar to most designers
and developers, has often been seen as a complex and high-risk
choice of technology. However, it is now becoming increasingly
commonplace, and device software developers are now more likely
than ever to find themselves working in a multicore
environment.
This transition will force developers to learn new programming
practices if a smooth development or migration effort is to be
assured. Gaining a full understanding of the true nature of
multicore computing, avoiding assumptions learned from unicore
programming and following a well planned migration strategy will
ensure that developers transition smoothly into this new and
exciting technology.
Paul Tingey is a system architect with Wind River,
supporting alliance companies across a number of vertical
markets.