The news last month that PicoChip has secured $20.5m in third round funding, with Intel featuring amongst the investors, suggests that processor array-based devices are beginning to make their mark.
ClearSpeed also had a successful June with its first public demonstration of the CSX600 floating point array processor. And earlier this year MathStar announced that Honeywell had chosen its field programmable object array (FPOA) technology for use in satellites in military space systems.
These firms are targeting different applications but all are demonstrating the power of using many small processors in parallel. “Putting more processors on one chip is now becoming almost the de facto way of getting a lot more performance. Even Intel and AMD are putting two cores on their desktop chips,” comments Simon McIntosh-Smith, ClearSpeed’s director of architecture and applications.
The applications which ClearSpeed and others are targeting have inherent parallelism ready to be exploited. PicoChip is processing radio algorithms while ClearSpeed is tackling large maths problems that can be chopped into small bits and run on many processors at once.
[Image: ClearSpeed CSX600 board]
ClearSpeed’s chip, which came out in October 2004, is an array of 96 processing elements designed for 64-bit double precision floating point maths. Each element has an integer ALU, a dual issue 64-bit FPU, 6kbytes of local memory, and a MAC. The array is clocked at 250MHz and delivers a sustained 25Gflops at less than 10W.
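As a rough check on those figures, the short sketch below works out what the quoted numbers imply per processing element and per watt. It uses only the values given above; the function names are illustrative, and because the power is quoted as "less than 10W" the efficiency result is a lower bound.

```python
# Figures quoted in the article for the CSX600.
ELEMENTS = 96            # processing elements in the array
CLOCK_HZ = 250e6         # array clock, 250MHz
SUSTAINED_FLOPS = 25e9   # quoted sustained throughput, 25Gflops
POWER_W = 10.0           # quoted upper bound on power ("less than 10W")


def flops_per_element_per_cycle(flops, elements, clock_hz):
    """Average floating point operations each element completes per clock."""
    return flops / (elements * clock_hz)


def flops_per_watt(flops, watts):
    """Throughput per watt; a lower bound when watts is an upper bound."""
    return flops / watts


per_cycle = flops_per_element_per_cycle(SUSTAINED_FLOPS, ELEMENTS, CLOCK_HZ)
efficiency = flops_per_watt(SUSTAINED_FLOPS, POWER_W)

print(f"{per_cycle:.2f} flops per element per cycle")      # 1.04
print(f"at least {efficiency / 1e9:.1f} Gflops per watt")  # 2.5
```

In other words, the sustained figure corresponds to each element completing slightly more than one floating point operation per clock, at a minimum of 2.5Gflops per watt.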
PicoChip’s architecture is tuned for WiMAX and HSDPA applications. The PC102 device has more than 300 processors which run at 160MHz and deliver 200Gips and 40GMACs between them. Unlike the ClearSpeed device, which relies on every processor doing the same thing at once (single instruction, multiple data, or SIMD), the PC102 is multiple instruction, multiple data (MIMD), so every processor can be doing something different.
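The difference between the two execution models contrasted above, one instruction shared by all processors versus a different instruction on each, can be illustrated with a small sketch. This is a sequential Python model for illustration only: the function names are hypothetical, nothing here runs in parallel, and it does not represent either vendor's tools.

```python
def simd_step(operation, data):
    """One instruction is broadcast; every element applies it to its own data."""
    return [operation(x) for x in data]


def mimd_step(programs, data):
    """Each processor applies its own instruction to its own data."""
    return [program(x) for program, x in zip(programs, data)]


data = [1, 2, 3, 4]

# Single instruction, multiple data: every element doubles its value.
simd_result = simd_step(lambda x: x * 2, data)

# Multiple instruction, multiple data: four processors, four different jobs.
mimd_result = mimd_step(
    [lambda x: x + 10, lambda x: x * 2, lambda x: -x, lambda x: x ** 2],
    data,
)

print(simd_result)  # [2, 4, 6, 8]
print(mimd_result)  # [11, 4, -3, 16]
```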
The key to the PC102, says Rupert Baines, v-p marketing for PicoChip, is the 3300Gbit/s PicoBus interconnection system, which provides deterministic performance and scalability. “It means that 300 processors are 300 times better than one,” he summarises. The biggest system PicoChip has built uses 16 chips that together deliver three tera-operations per second.
MathStar’s FPOA is a finer grained architecture, with an array of 400 objects comprising 256 ALUs, 64 MACs and 80 register files. It has a faster clock frequency too, running at up to 1GHz.
The common theme is how these devices are replacing FPGAs. The reason is simple. An FPGA’s flexibility comes with a silicon and power overhead, and it makes sense for those who do not need that flexibility to use something more efficient.
“One of our WiMAX customers had a system that used five TigerSharcs and two large FPGAs. They replaced this with two of our chips, which cost a quarter of the price and used a quarter of the power,” comments Baines.
[Image: Sean Riley]
MathStar’s Sean Riley has a similar story: “What we hear all the time is FPGA users wanting to run something at, say, 200MHz and only managing to get 150MHz. We don’t have the issue of physical routing and physical timing closure. If you buy our part and want to run it at 1GHz, it will run at 1GHz.”
The big question, however, is whether moving to one of these new architectures is now sufficiently straightforward to make re-coding worthwhile. The code for MathStar’s FPOA is written in SystemC and, according to Riley, anyone who has designed hardware will find an FPOA easy to program.
Baines says that part of the reason PicoChip is having such success is that its development route is simple. The chips are programmed in C in an object-oriented way. “Each little processor is an object and it talks to the outside world with inputs and outputs. You write your program for each processor. So, you might have one processor that acquires an input, the next processor runs a filter, the next one runs an FFT and so on,” he explains.
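The programming style Baines describes, one small program per processor with data passed from stage to stage, can be sketched as a chain of three stages. This is only an illustration under stated assumptions: it is written in Python rather than the C used on the chips, all names are hypothetical and not PicoChip's API, the stages run one after another rather than on separate processors, and a naive DFT stands in for the FFT.

```python
import cmath


def acquire(samples):
    """First 'processor': passes input samples on one at a time."""
    for s in samples:
        yield s


def moving_average(stream, taps=2):
    """Second 'processor': a simple FIR filter (moving average over `taps`)."""
    window = []
    for s in stream:
        window.append(s)
        if len(window) > taps:
            window.pop(0)
        # Divides by taps even while the window is filling (start-up transient).
        yield sum(window) / taps


def dft_blocks(stream, size=4):
    """Third 'processor': collects blocks and emits a naive DFT of each."""
    block = []
    for s in stream:
        block.append(s)
        if len(block) == size:
            yield [
                sum(block[n] * cmath.exp(-2j * cmath.pi * k * n / size)
                    for n in range(size))
                for k in range(size)
            ]
            block = []


# Wire the stages together: each one's output is the next one's input.
samples = [1.0] * 8
spectra = list(dft_blocks(moving_average(acquire(samples))))
print(len(spectra), "blocks transformed")
```

Each stage knows only about its own input and output, which is what would allow the stages to be placed on different processors in an array.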
For the WiMAX customer mentioned earlier, Baines says: “For them, it was worth the pain of redeveloping the code because in a space of six months, they had radically reduced the cost and power.”
ClearSpeed has a different focus and does not expect many users to program its device directly. “We provide pre-optimised libraries that you can call from within a C, C++ or Fortran program. We have libraries for all kinds of vector and matrix arithmetic and for FFTs and so on. In the past, this model couldn’t work because there weren’t standard maths libraries that were prevalent enough,” says McIntosh-Smith.
The power efficiency of all these devices suggests that mobile equipment is the next stop.
IMEC is looking at this in its M4 (multi-mode multi-media) programme, developing chips for mobile terminals with partners including Samsung, Freescale, Infineon and Xilinx. Prof. Hugo de Man, IMEC’s co-founder, is famous for saying that the Von Neumann architecture makes poor use of scaling, as all the energy goes on communicating between processor and memory. “It’s better to use 20 microprocessors running at 100MHz than one at 2GHz,” is his view.
M4 revolves around using multiples of IMEC’s Adres processor, a rather larger processor element than those discussed, which is also highly parallel in its internal architecture. “The architecture we envisage has network-on-chip style of communication that services the different processors and regulates the data transfers between the processors and memory,” says Serge Vernalde, technical business director at IMEC. IMEC hopes to deploy this in the context of M4 next year.
While PicoChip’s PC102 is for infrastructure applications, Baines says the basic architecture is well suited to multi-mode mobiles: “Different algorithms can run on different processors or on many processors. Those you aren’t using just doze so you only burn the amount of power you need.”