Ashish Dixit from Tensilica looks at multiple processor SoC design, and considers how the processor has become the basic building block
The arrival of affordable multiprocessing in embedded systems provides engineering teams with much more flexibility than they have previously had. It is often easier to allocate tasks to different processors than try to schedule all operations on just one fast CPU.
The move toward multiple processor system-on-chip (SoC) designs is very real. Multiple processors are used in consumer devices ranging from low-cost ink-jet printers to mobile phones. Most of the newest network processors are based on multiple processor designs. The CRS-1, a router designed by Cisco Systems, employs 188 processors on a single chip, and multiple chips within the system.
A task-based analysis can show how multiple processors can be used efficiently in a system. Tasks that are mostly independent can be allocated to different processors, with intertask communication handled using message passing and shared-memory data structures (Figure 1). Each individual task that runs on a particular processor can be accelerated through the use of custom instructions that are dedicated to the most common operations needed by the task.
Figure 1: Simple heterogeneous system partitioning
If more performance is needed, the task can be decomposed into a set of parallel tasks (Figure 2) running on a set of optimised, inter-communicating processors. Conversely, multiple low-bandwidth tasks can be run on one processor by time-slicing them. This approach degrades parallelism, but may improve SoC cost and efficiency if the processor has enough available computation cycles.
Figure 2: Parallel task system partitioning
The first stage is to determine the performance required of the system. If the tasks are represented as algorithms in a programming language such as C, early system modelling can verify the functionality and measure the data transfers between tasks. At this stage, tasks have not been allocated to processors, and communication among tasks is still expressed abstractly.
An early abstract system simulation model serves as the basis for sizing the computational demands of each task. This information is not exact, but can yield important insights into both computational and communication hot spots.
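A minimal sketch of such an abstract model is given here in Python for brevity, although the tasks themselves would more likely be written in C. The three tasks (capture, filter, pack), the `Channel` class and the `run_model` function are all hypothetical names chosen for illustration. The point is only that an unallocated model can already count the bytes crossing each link between tasks.

```python
from collections import defaultdict, deque


class Channel:
    """Abstract link between two tasks that logs the bytes crossing it."""

    def __init__(self, producer, consumer, traffic):
        self.key = (producer, consumer)
        self.traffic = traffic  # shared byte counter keyed by (producer, consumer)
        self.fifo = deque()

    def send(self, payload):
        self.fifo.append(payload)
        self.traffic[self.key] += len(payload)

    def receive(self):
        return self.fifo.popleft()


def run_model(frames):
    """Push frames through capture -> filter -> pack and report traffic per link."""
    traffic = defaultdict(int)
    to_filter = Channel("capture", "filter", traffic)
    to_pack = Channel("filter", "pack", traffic)
    packed_sizes = []
    for frame in frames:
        # Capture task: hands the raw frame to the filter task.
        to_filter.send(frame)
        # Filter task (illustrative only): drops zero-valued bytes.
        filtered = bytes(b for b in to_filter.receive() if b)
        to_pack.send(filtered)
        # Pack task: here it simply records the size of what it received.
        packed_sizes.append(len(to_pack.receive()))
    return dict(traffic), packed_sizes
```

Running the model over representative input shows which links carry the most data, which is the kind of communication hot spot the early model is meant to expose.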
Using system simulation throughout the design process has two advantages. First, an early start to simulation provides insight into bottlenecks. Second, the model’s role as a performance predictor gradually evolves into a role as a verification test bench. To test a subsystem, a designer replaces the subsystem’s high-level model with a lower-level implementation model.
There are two guidelines for mapping tasks to processors. The first is that the processor must have sufficient computational capacity to handle the task. The second guideline is that tasks with similar requirements should be allocated to the same processor as long as the processor has the computational capacity to accommodate all of the tasks.
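The two guidelines can be sketched as a simple first-fit allocation. This is an illustrative assumption rather than a prescribed algorithm: the task "kind" labels, the load figures (taken here to be millions of cycles per second) and the single per-processor capacity are all hypothetical.

```python
def map_tasks(tasks, capacity):
    """Group tasks of the same kind onto processors without exceeding capacity.

    tasks: list of (name, kind, load) tuples; load and capacity share one unit.
    Returns a list of dicts, one per processor, with keys kind, load and tasks.
    """
    processors = []
    # Place the heaviest tasks first so that large tasks are not stranded.
    for name, kind, load in sorted(tasks, key=lambda t: t[2], reverse=True):
        if load > capacity:
            # Guideline 1 fails: the task needs decomposing or a faster processor.
            raise ValueError(f"task {name!r} exceeds processor capacity")
        for proc in processors:
            # Guideline 2: share a processor with similar tasks if cycles remain.
            if proc["kind"] == kind and proc["load"] + load <= capacity:
                proc["tasks"].append(name)
                proc["load"] += load
                break
        else:
            processors.append({"kind": kind, "load": load, "tasks": [name]})
    return processors
```

For example, with a capacity of 200, two numeric tasks of 180 and 90 need a processor each, while control tasks of 50 and 40 can be time-sliced on a third.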
The choice of processor type is important. A control task needs substantially more cycles if it is running on a simple DSP rather than a Risc processor. A numerical task usually needs more cycles running on a Risc CPU than a DSP. The combination of Risc processors and DSPs calls for the use of multiple software tools, which complicates development.
Developers would prefer to use multiple instances of the same general-purpose processor. But many standard, general-purpose 32-bit Risc processors are not fast enough to handle critical parts of some applications. The standard way to gain performance is to partition the application between software running on a processor and a hardware accelerator block, but this approach has serious limitations.
The methods for designing and verifying large hardware blocks are labour-intensive, error-prone and slow. If the requirements for the portion of the application running on the accelerator change late in the design or after the SoC is built, a new silicon design may be needed, further adding to cost. There may even be a performance hit.
Moving data back and forth among the processor, accelerator, and memory may slow total application throughput, offsetting much or all of the benefit derived from hardware acceleration.
Ironically, the promise of concurrency between the processor and the accelerator is also often unrealised because the application, by nature of the way it is written, may force the processor to sit idle while the accelerator performs necessary work. In addition, the accelerator will be idle during application phases that cannot exploit it.
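A rough, back-of-envelope model shows how these effects erode the gain. It assumes the worst case described above, in which the processor waits while the accelerator works, so nothing overlaps. The function name and parameters are hypothetical and the numbers in the example are invented for illustration.

```python
def application_speedup(total_cycles, offloaded_fraction, accel_factor, transfer_cycles):
    """Estimate whole-application speed-up from a hardware accelerator.

    total_cycles:       cycles for the all-software version
    offloaded_fraction: share of those cycles moved to the accelerator (0..1)
    accel_factor:       how many times faster the accelerator runs that share
    transfer_cycles:    extra cycles spent moving data to and from the accelerator
    Assumes no overlap between processor, accelerator and data movement.
    """
    remaining = total_cycles * (1 - offloaded_fraction)
    accelerated = total_cycles * offloaded_fraction / accel_factor
    return total_cycles / (remaining + accelerated + transfer_cycles)
```

With half of a 1,000,000-cycle application offloaded to a block that is ten times faster, the estimate is about 1.8 times overall if transfers are free, and it falls to 1.0 (no gain at all) once transfers cost 450,000 cycles.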
Configurable and extensible processors offer several advantages for accelerator design. First, the configurable approach incorporates the accelerator function into the processor, eliminating the processor-accelerator communication overhead. It also makes the accelerator functions far more programmable and significantly simplifies integration and testing of the total application. Finally, it allows the acceleration hardware to have intimate access to all of the processor’s resources.
Further, converting the accelerator to a separate processor configured for application acceleration allows the second task to run in parallel with the general-purpose processor, receiving commands through registers or through shared data memory.
Once the rough number and types of processors are known and tasks are tentatively assigned to the processors, basic communication structure design starts. The goal is to discover the least expensive communications structure that satisfies the bandwidth and latency requirements of the tasks.
When low cost and good flexibility are the most important considerations, a shared-bus architecture, in which all resources are connected to one bus, may be the most appropriate. The liability of the shared bus is long and unpredictable latency, particularly when a number of bus masters contend for access to different shared resources. A parallel communications network provides high throughput with flexibility. The most common example is a crossbar connection with a two-level hierarchy of buses (Figure 3).
Figure 3: General-purpose parallel communications style: on-chip mesh network
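A first-pass bandwidth check can help decide whether a shared bus is enough or a parallel structure is needed. The sketch below is an assumption-laden illustration: the flow list, the bandwidth figures (in Mbyte/s) and both function names are hypothetical, and it checks sustained bandwidth only, not the latency that contention adds.

```python
from collections import defaultdict


def shared_bus_ok(flows, bus_bandwidth):
    """On a shared bus every flow competes for the same bandwidth."""
    return sum(bandwidth for _, _, bandwidth in flows) <= bus_bandwidth


def crossbar_ok(flows, port_bandwidth):
    """On a crossbar only flows aimed at the same target port compete."""
    per_target = defaultdict(int)
    for _, target, bandwidth in flows:
        per_target[target] += bandwidth
    return all(total <= port_bandwidth for total in per_target.values())


# Each flow is (bus master, target resource, required bandwidth in Mbyte/s).
EXAMPLE_FLOWS = [
    ("cpu0", "mem0", 300),
    ("cpu1", "mem1", 300),
    ("dsp", "mem0", 150),
]
```

In this invented example the three flows total 750Mbyte/s, which overloads a 600Mbyte/s shared bus, while a crossbar with 600Mbyte/s ports copes because the busiest target sees only 450Mbyte/s.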
Traditional processor cores provide only the block-oriented, general-bus interface. Configurable and extensible processors allow faster, more flexible communications using direct processor-to-processor connections to reduce cost and latency (Figure 4).
Figure 4: Optimised direct parallel communications
The memory used by a system introduces a further set of trade-offs. Off-chip RAM is much cheaper than on-chip RAM, at least for large memories. The designer needs to look at the memory-transfer requirements of each task to ensure that the memory used can handle the traffic. When on-chip RAM requirements are uncertain, caches can improve performance and shared memories aid inter-processor communication. But it is important to watch for contention latency in memory access. Widening the memory or increasing the number of memories that can be active at the same time can relieve contention bottlenecks.
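The effect of memory width and of the number of active memories can be estimated with simple arithmetic. The sketch below assumes an idealised memory that completes one access per clock cycle per bank, which real memories may not achieve. The function names and figures are hypothetical.

```python
def peak_bandwidth(width_bits, clock_mhz, active_banks=1):
    """Peak memory bandwidth in Mbyte/s, assuming one access per cycle per bank."""
    return width_bits / 8 * clock_mhz * active_banks


def memory_can_sustain(task_demands, width_bits, clock_mhz, active_banks=1):
    """Check whether the summed task traffic (Mbyte/s) fits within the peak."""
    return sum(task_demands) <= peak_bandwidth(width_bits, clock_mhz, active_banks)
```

For instance, a single 32-bit memory at 200MHz peaks at 800Mbyte/s under this assumption, so tasks needing 1,200Mbyte/s in total would not fit. Doubling the width to 64 bits, or splitting the data across two banks that can be active together, raises the peak to 1,600Mbyte/s.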
Even though processors make a potent alternative to hardwired logic blocks, often RTL blocks have already been designed and verified, so it is important to re-use them if appropriate. Two interface mechanisms to RTL blocks are generally used. The first mechanism maps hardware registers into local memory space. This makes the hardware block look much like an I/O device, and makes the controlling software look much like a standard device driver.
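The register-mapped mechanism can be illustrated with a software stand-in for the hardware. Everything in this sketch is hypothetical: the register offsets, the control and status bits, and the imaginary block itself, which counts the set bits in a 32-bit word. On a real SoC the driver would be written in C against the addresses given in the block's specification.

```python
import struct

# Hypothetical register map for an imaginary bit-counting RTL block.
REG_CTRL, REG_STATUS, REG_DATA, REG_RESULT = 0x00, 0x04, 0x08, 0x0C
CTRL_START = 0x1
STATUS_DONE = 0x1


class FakeBitCountBlock:
    """Software stand-in for an RTL block whose registers sit in memory space."""

    def __init__(self):
        self.space = bytearray(16)  # four 32-bit little-endian registers

    def read32(self, offset):
        return struct.unpack_from("<I", self.space, offset)[0]

    def write32(self, offset, value):
        struct.pack_into("<I", self.space, offset, value & 0xFFFFFFFF)
        if offset == REG_CTRL and value & CTRL_START:
            # Simulated hardware behaviour: compute the result, then flag done.
            ones = bin(self.read32(REG_DATA)).count("1")
            struct.pack_into("<I", self.space, REG_RESULT, ones)
            struct.pack_into("<I", self.space, REG_STATUS, STATUS_DONE)


def count_bits(block, value):
    """Driver-style routine: load operand, start the block, poll, read result."""
    block.write32(REG_DATA, value)
    block.write32(REG_CTRL, CTRL_START)
    while not block.read32(REG_STATUS) & STATUS_DONE:
        pass  # real hardware would take some cycles to finish
    return block.read32(REG_RESULT)
```

The `count_bits` routine has the shape of a conventional device driver: every interaction with the block is a read or write at a fixed offset, which is the I/O overhead that an instruction-set extension would avoid.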
The second mechanism extends the instruction set to drive hardware functions directly. With configurable processors, the designer can specify new processor instructions that take hardware block outputs as instruction-source operands and use hardware block inputs as instruction-result destinations. This avoids the use of intermediate registers and greatly accelerates the task by eliminating I/O overhead.
As designers get comfortable with a processor-based approach, processors have the potential to become even more powerful building blocks for next-generation SoC designs, and SoC designers will turn to a processor-centric design methodology that has the potential to solve the ever-increasing hardware/software integration dilemma.
Ashish Dixit is vice-president of hardware engineering at Tensilica