It has been branded xcore.ai and has:
- 16 real-time logical cores, with support for scalar/float/vector instructions
- IO ports with nanosecond latency for real-time response
- Support for binarised (1-bit, see below), 8-bit, 16-bit and 32-bit neural network inference
- Multi-modal data capture and processing “enabling concurrent on-device application across classification, audio interfaces, presence detection, voice interfaces, comms and control, actuation”, according to XMOS.
- Instruction set for digital signal processing, machine learning and cryptographic functions
- On-device inference of TensorFlow Lite for Microcontrollers models
“Traditionally this type of capability would be deployed either through a powerful applications processor or a microcontroller with additional components to accelerate key capabilities,” said the firm. “However, the xcore.ai crossover processor is architected to deliver real-time inferencing and decisioning at the edge, as well as signal processing, control and communications.”
XMOS is claiming an inferencing performance improvement compared with an Arm Cortex-M7 (in STM32M7).
It is offering execution-time figures for a quantised 8-bit model from the CIFAR-10 test set:
- STM32M7, whole chip running at 600MHz = 35.676ms
- xcore.ai, running on 1 xcore logical core at 160MHz = 5.236ms
“If we scale that onto five logical cores, each running at 160MHz, entirely reasonable assumption considering we have 16 cores on the device, we expect to get a reduction of close to x5 execution time, which would be better than x32 [over the Arm processor],” a company spokesman told Electronics Weekly.
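The arithmetic behind that claim can be checked directly from the execution times quoted above (the five-core figure assumes ideal linear scaling, as the spokesman does):

```python
# Execution times quoted above for the quantised 8-bit CIFAR-10 model.
stm32_ms = 35.676   # Arm Cortex-M7 at 600MHz
xcore_ms = 5.236    # one xcore.ai logical core at 160MHz

single_core_speedup = stm32_ms / xcore_ms
print(f"single core: {single_core_speedup:.2f}x")        # ~6.81x

# Dividing the work across five logical cores would ideally cut the
# execution time by a factor of five.
five_core_speedup = single_core_speedup * 5
print(f"five cores (ideal): {five_core_speedup:.1f}x")   # ~34.1x, i.e. better than x32
```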
It is programmable in C, with DSP and machine-learning features accessible through C libraries. The FreeRTOS real-time operating system is supported, along with open-source library components, and there is a TensorFlow Lite to xcore.ai converter.
Up to 128 pins of IO are available, and hardware USB 2.0 PHY and MIPI interfaces are included.
Where will it be used?
“Imagine the humble smoke detector,” said XMOS. “With xcore.ai embedded, a smoke detector could use radar and imaging to identify whether there are people in the affected building and, if so, determine how many and where they are located. Using voice interfaces, the detector could communicate with those inside, while vital sign detection could identify whether they are breathing. Put together, this builds an intelligent picture of the environment that can be fed straight to the emergency services, enabling an informed rescue operation, improving accuracy and speed of response.”
Product demos are scheduled to be available from June 2020.
Developing the technology was part funded by the European Union Horizon 2020 research and innovation programme under Grant Agreement No 849469.
Binarised neural networks
To simplify hardware, there is much interest in cutting down the data length of neural networks, and optimisation techniques have improved to the point where some inferencing can be implemented in ‘binarised neural networks’ (BNNs, aka ‘XNOR’ networks) – networks cut down so far that weights and activations each take only two values, rather than a multi-bit representation.
“In deep networks, it is typically the inner layers where most of the multiplications occur and can benefit the most from 8bit arithmetic,” according to XMOS. “This saves memory and increases multiplication throughput compared to 16bit or 32bit operations. Higher accuracy layers can be supported through 32bit and 16bit arithmetic where necessary. For even higher multiplication throughput, the inner layers of a network can be designed to use single bit operations – values +1 and -1 – so-called binarised neural networks. In this mode, a device running at 800MHz can reach a total of just over 300 billion operations/s. By using binary neural networks, xcore.ai delivers 2.6x to 4x more efficiency than its 8bit counterpart.”
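With +1/-1 values packed one per bit, the binary ‘multiply’ reduces to a bitwise XNOR followed by a population count. A minimal Python sketch of a binary dot product illustrates the idea (this is the general BNN technique, not xcore.ai's actual vector instructions):

```python
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two n-element vectors whose elements are +1/-1,
    packed one bit per element (bit set = +1, bit clear = -1)."""
    mask = (1 << n) - 1
    matches = (~(a_bits ^ b_bits)) & mask   # XNOR: 1 where elements agree
    pop = bin(matches).count("1")           # popcount of matching positions
    return 2 * pop - n                      # agreements minus disagreements

# Example, packed LSB-first:
# a = [+1, -1, +1, -1] -> 0b0101,  b = [+1, +1, -1, -1] -> 0b0011
print(binary_dot(0b0101, 0b0011, 4))        # 0 (two matches, two mismatches)
```

One XNOR plus one popcount replaces n individual multiplies, which is where the large throughput gain over 8-bit arithmetic comes from.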
The actual data types handled by xcore.ai are:
- 8-bit vector arithmetic, 32 parallel multiply accumulates
- 16-bit vector arithmetic, 16 parallel multiply accumulates
- 32-bit vector arithmetic, 8 parallel multiply accumulates
- 1-bit vector arithmetic (XNOR), 256 parallel ‘multiply’ accumulates
- 32-bit complex vector arithmetic, 2 parallel multiply accumulates
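The lane counts above are consistent with a 256-bit vector unit: each format packs as many elements as fit in 256 bits, and a 32-bit complex multiply-accumulate consumes four real multiplies, hence the lower complex figure. The 256-bit width is an inference from the numbers quoted, not an XMOS statement:

```python
VECTOR_BITS = 256  # inferred from the 1-bit lane count above

# Real element widths and the parallel MAC counts quoted in the list.
for bits, lanes in [(8, 32), (16, 16), (32, 8), (1, 256)]:
    assert VECTOR_BITS // bits == lanes
    print(f"{bits:>2}-bit elements: {VECTOR_BITS // bits} parallel MACs")

# A 32-bit complex multiply needs four 32-bit real multiplies, so the
# 8 real 32-bit lanes support 8 // 4 = 2 complex MACs.
assert (VECTOR_BITS // 32) // 4 == 2
```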
Electronics Weekly