Deployment of ESL (Electronic System Level) design methodologies often degrades the achievable results. Compared with hand-crafted VHDL or Verilog, a higher abstraction level usually has a negative impact on both device utilization and achievable clock frequency.
However, this sacrifice is normally accepted because ESL design methodologies allow dramatic shrinking of development time.
Similar steps towards a higher abstraction level have been taken in the past: in hardware, the use of RTL (VHDL/Verilog) instead of schematics; in software, the use of C instead of assembler. These design methodologies are no longer questioned, even though the shorter time-to-market that ensures product success is bought at the cost of device efficiency.
Application Example: Image Processing
Image processing applications need encoders and decoders in order to handle enormous amounts of digital data. The amount of data has to be reduced for two reasons: to store the data on a commercially available storage medium, and to transfer it over a reduced transmission bandwidth.
The encoder is used to compress the data; the decoder is required for decompression. Key components of compression and decompression algorithms are the "Two-Dimensional Discrete Cosine Transform (2D-DCT)" and the "Two-Dimensional Inverse Discrete Cosine Transform (2D-IDCT)". Both components (2D-DCT and 2D-IDCT) are needed to describe the encoder; only the 2D-IDCT is required to implement the decoder.
A software implementation of both components is suitable for low picture rates. However, demanding applications such as HDTV make it necessary to choose a hardware implementation in order to accelerate the arithmetic calculations.
The use of FPGAs for this kind of compute-intensive application is common practice, partly because FPGAs contain dedicated multipliers and other blocks that allow an efficient implementation of algorithms. For example, FPGA-embedded DSP blocks contain not only multipliers but also registers, adders, and saturation/rounding circuits.
Algorithm Implementation
The 2D-DCT/IDCT transforms are separable, meaning that they can be computed in two separate steps: first transforming the rows one by one, then transforming the columns of the intermediate result one by one.
Implementations usually exploit this separability by cascading a one-dimensional DCT/IDCT processing unit twice.
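The separability described above can be checked numerically. The following is a minimal software sketch, not the hardware implementation discussed in this article: it assumes the orthonormal DCT-II definition, and the function names (`dct_1d`, `dct_2d_separable`, `dct_2d_direct`) and the sample block are illustrative only.

```python
import math

def dct_1d(x):
    """n-point orthonormal DCT-II evaluated directly from its definition."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        acc = sum(x[i] * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                  for i in range(n))
        out.append(scale * acc)
    return out

def transpose(m):
    return [list(col) for col in zip(*m)]

def dct_2d_separable(block):
    """Row pass, transpose, column pass, transpose back."""
    row_pass = [dct_1d(row) for row in block]
    col_pass = [dct_1d(col) for col in transpose(row_pass)]
    return transpose(col_pass)

def dct_2d_direct(block):
    """Reference: the full double sum of the 2D definition."""
    n = len(block)
    def scale(k):
        return math.sqrt((1.0 if k == 0 else 2.0) / n)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            acc = 0.0
            for x in range(n):
                for y in range(n):
                    acc += (block[x][y]
                            * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                            * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = scale(u) * scale(v) * acc
    return out

# Arbitrary 8x8 sample data (hypothetical, for illustration only).
block = [[(3 * r + 5 * c) % 17 for c in range(8)] for r in range(8)]
sep = dct_2d_separable(block)
ref = dct_2d_direct(block)
max_diff = max(abs(sep[u][v] - ref[u][v]) for u in range(8) for v in range(8))
print("separable result matches direct 2D sum:", max_diff < 1e-9)
```

Both routes produce the same coefficients, but the separable route only ever needs a one-dimensional transform unit, which is what makes the cascaded hardware structure possible.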
For a 2D-DCT/IDCT targeted at an FPGA, an efficient algorithmic description is extremely important because the number of available hardware multipliers may be limited. One efficient algorithm for implementing the IDCT function is the Loeffler algorithm, with which an eight-point 1D-DCT/IDCT involves only 11 multiplications. The original Loeffler algorithm can also be modified to achieve higher clock speeds at the cost of slightly more resources, specifically multipliers.
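To put the figure of 11 multiplications in perspective, the short calculation below compares it with a direct matrix-vector evaluation of an eight-point transform, which needs one multiplication per matrix entry. The direct method is used here only as a baseline for comparison; the variable names are illustrative.

```python
POINTS = 8                 # eight-point 1D transform, as in the text
LOEFFLER_MULTS = 11        # figure quoted in the text for the Loeffler algorithm

# A direct matrix-vector product multiplies every input by every basis entry.
direct_mults = POINTS * POINTS

# A separable 2D transform of an 8x8 block runs 8 row and 8 column transforms.
transforms_per_block = 2 * POINTS

direct_per_block = direct_mults * transforms_per_block
loeffler_per_block = LOEFFLER_MULTS * transforms_per_block

print("multiplications per 1D transform:", direct_mults, "vs", LOEFFLER_MULTS)
print("multiplications per 8x8 block:   ", direct_per_block, "vs", loeffler_per_block)
```

These are operation counts per transform, not multiplier counts: how many physical multipliers are instantiated depends on how far the design is parallelised.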
Expanding the 1D-IDCT to a 2D-IDCT
The 1D-IDCT block can be expanded to a 2D-IDCT function by cascading the one-dimensional function. Two transpose operations have to be added in order to arrange the data in the proper order. A plain serial implementation of two 1D-IDCT blocks would double the resources used in an FPGA or ASIC. However, instead of simply cascading the 1D-IDCT, a more efficient implementation can be achieved by applying a "resource sharing" optimisation technique: just one 1D-IDCT is implemented, and its output data stream is fed back to its own input. This kind of time-domain multiplexing is possible because the 2D-IDCT is built from two identical 1D-IDCT functions.
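The feedback structure can be modelled in software as a single 1D-IDCT unit that is visited twice, with a transpose after each pass. This is a behavioural sketch under assumptions, not the authors' design: it uses the orthonormal inverse DCT definition rather than the Loeffler data flow, and `Idct1DUnit` and `idct_2d_shared` are hypothetical names.

```python
import math

class Idct1DUnit:
    """Stands in for the single 1D-IDCT block; counts how often it is used."""

    def __init__(self):
        self.uses = 0

    def __call__(self, coeffs):
        self.uses += 1
        n = len(coeffs)
        out = []
        for i in range(n):
            acc = 0.0
            for k in range(n):
                scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
                acc += scale * coeffs[k] * math.cos(
                    (2 * i + 1) * k * math.pi / (2 * n))
            out.append(acc)
        return out

def transpose(m):
    return [list(col) for col in zip(*m)]

def idct_2d_shared(coeffs, unit):
    """Two passes through the same unit, each followed by a transpose."""
    data = coeffs
    for _ in range(2):
        # The second iteration models the output being fed back to the input.
        data = transpose([unit(vec) for vec in data])
    return data

# A block containing only a DC coefficient must decode to a constant image.
dc_only = [[0.0] * 8 for _ in range(8)]
dc_only[0][0] = 8.0

unit = Idct1DUnit()
pixels = idct_2d_shared(dc_only, unit)
print("constant output:", all(abs(p - 1.0) < 1e-9 for row in pixels for p in row))
print("uses of the single 1D unit per block:", unit.uses)
```

The counter shows the price of sharing: one unit handles all 16 one-dimensional transforms of an 8x8 block, where the cascaded version would spread them across two units.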
Of course, because the block is used twice, it has to be clocked at twice the frequency to maintain the same throughput. This technique makes a very resource-efficient implementation possible: the complete 2D-IDCT requires only 11, 12, or 14 multiplications, depending on the algorithm used.
Taken to extremes, the full functionality can be implemented in even the smallest available low-cost FPGAs, thus competing head-on with low-cost DSP processors. ESL allows DSP developers to explore the use of hard-coded algorithms in FPGAs and ASICs whilst maintaining device efficiency.
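The trade-off between the cascaded and the shared structure can be summarised with a small calculation. The sketch below assumes one hardware multiplier per multiplication in the 1D block and uses a hypothetical 75 MHz sample clock purely for illustration; neither assumption comes from the article.

```python
def compare_architectures(mults_per_1d, sample_clock_mhz):
    """Return (cascaded, shared) resource and clock estimates.

    Assumes one multiplier per multiplication in each 1D-IDCT block.
    """
    cascaded = {
        "multipliers": 2 * mults_per_1d,      # two separate 1D blocks
        "clock_mhz": sample_clock_mhz,        # each block runs at the sample rate
    }
    shared = {
        "multipliers": mults_per_1d,          # one block, used twice
        "clock_mhz": 2 * sample_clock_mhz,    # doubled to keep the throughput
    }
    return cascaded, shared

SAMPLE_CLOCK_MHZ = 75.0  # hypothetical value, for illustration only

# 11, 12 and 14 are the multiplication counts quoted in the text.
for mults in (11, 12, 14):
    cascaded, shared = compare_architectures(mults, SAMPLE_CLOCK_MHZ)
    print(mults, "mults per 1D block -> cascaded:", cascaded, "shared:", shared)
```

Whether the doubled clock is attainable depends on the target device and on the timing the synthesised 1D block achieves, which is where the modified, higher-speed algorithm variants become relevant.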
Pierluigi Lo Muzio is a DSP specialist and Philipp Jacobsohn is a senior field applications engineer at Synplicity.