Deployment of ESL (Electronic System Level) design methodologies
often leads to a degradation of achievable results. Usually, a
higher abstraction level has a negative impact on both device
utilization and achievable clock frequency, compared to
hand-crafted VHDL or Verilog.
However, this sacrifice is normally accepted because ESL design
methodologies dramatically shorten development time.
Similar steps towards a higher abstraction level were adopted in
the past: in hardware, the use of RTL (VHDL/Verilog) instead of
schematics; in software, the use of the C language instead of
assembler. These design methodologies are no longer questioned,
even though they trade device efficiency for a shorter
time-to-market in order to ensure product success.
Application Example: Image Processing
Image processing applications need encoders and decoders in order
to handle enormous amounts of digital data. The amount of data has
to be reduced for two reasons: so that it can be stored on a
commercially available storage medium, and so that it can be
transferred over a reduced transmission bandwidth.
The encoder compresses the data; the decoder decompresses it. Key
components of the compression and decompression algorithms are the
"Two-Dimensional Discrete Cosine Transform (2D-DCT)" and the
"Two-Dimensional Inverse Discrete Cosine Transform (2D-IDCT)". The
encoder needs both components (2D-DCT and 2D-IDCT), whereas the
decoder needs only the 2D-IDCT.
A software implementation of both components is suitable for low
picture rates. However, demanding applications such as HDTV call
for a hardware implementation in order to accelerate the
arithmetic calculations.
The use of FPGAs for this kind of compute-intensive application is
common practice, partly because FPGAs contain dedicated multipliers
and other blocks which allow an efficient implementation of
algorithms. For example, FPGA-embedded DSP blocks contain not only
multipliers but also registers, adders, and saturation/rounding
circuits.
Algorithm Implementation
The 2D-DCT/IDCT transforms are separable, meaning that they can
be computed in two separate steps: first transforming the rows one
by one, then transforming the results column by column.
Implementations usually adopt this separable approach, cascading
two one-dimensional DCT/IDCT processing units.
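The row-column decomposition can be checked with a small software model. The following Python sketch is illustrative only: it assumes 8x8 blocks and the orthonormal DCT scaling, neither of which is stated above, and all function names are hypothetical. It computes the 2D-IDCT once as two 1D passes and once as the full double sum, so that the two results can be compared.

```python
import math

N = 8  # block size; 8x8 is an assumption, typical of image codecs


def scale(k):
    """Orthonormal DCT scaling factor for frequency index k (assumed convention)."""
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


def idct_1d(coeffs):
    """Direct (unoptimised) N-point 1D-IDCT of one row or column."""
    return [
        sum(scale(k) * coeffs[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
            for k in range(N))
        for n in range(N)
    ]


def idct_2d_separable(block):
    """2D-IDCT as two 1D passes: all rows first, then all columns."""
    rows_done = [idct_1d(row) for row in block]
    columns = [list(col) for col in zip(*rows_done)]
    cols_done = [idct_1d(col) for col in columns]
    return [list(row) for row in zip(*cols_done)]  # back to row-major order


def idct_2d_direct(block):
    """Reference: the full double sum, without exploiting separability."""
    out = [[0.0] * N for _ in range(N)]
    for m in range(N):
        for n in range(N):
            acc = 0.0
            for u in range(N):
                for v in range(N):
                    acc += (scale(u) * scale(v) * block[u][v]
                            * math.cos((2 * m + 1) * u * math.pi / (2 * N))
                            * math.cos((2 * n + 1) * v * math.pi / (2 * N)))
            out[m][n] = acc
    return out
```

For any 8x8 coefficient block the two functions agree to within floating-point rounding. The separable form needs 16 eight-point transforms per block, whereas the direct form evaluates a 64-term sum for each of the 64 outputs.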
For a 2D-DCT/IDCT targeted at an FPGA it is extremely important
to use an efficient algorithmic description because the number of
available hardware multipliers may be limited. One efficient
algorithm for implementing the IDCT function is known as the
Loeffler algorithm. With it, an eight-point 1D-DCT/IDCT involves
only 11 multiplications. It is also possible to modify the
original Loeffler algorithm in order to achieve higher clock
speeds at the cost of slightly more resources, specifically
multipliers.
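The trade-off between multiplier count and clock speed can be illustrated with the plane rotation, a building block of Loeffler-style flow graphs. As the algorithm is commonly described, it consists of butterflies, three rotations, and two scalings by the square root of two; a rotation written with three multiplications then gives 3 x 3 + 2 = 11. The Python sketch below is not the full algorithm and its function names are hypothetical. It only shows that the three-multiplier and four-multiplier forms of a rotation produce the same result. That the modified variants mentioned above swap one form for the other is an assumption, not something the text confirms.

```python
import math


def rotate_4mul(x0, x1, angle):
    """Plane rotation written directly: 4 multiplications, 2 additions."""
    c, s = math.cos(angle), math.sin(angle)
    y0 = c * x0 + s * x1
    y1 = -s * x0 + c * x1
    return y0, y1


def rotate_3mul(x0, x1, angle):
    """Same rotation with 3 multiplications and 3 additions.

    (s - c) and (c + s) are constants, so in hardware they would be
    precomputed and cost no additional multipliers.
    """
    c, s = math.cos(angle), math.sin(angle)
    k_diff = s - c
    k_sum = c + s
    t = c * (x0 + x1)        # shared product
    y0 = t + k_diff * x1     # = c*x0 + s*x1
    y1 = t - k_sum * x0      # = c*x1 - s*x0
    return y0, y1
```

The three-multiplier form places an adder in front of the multiplier and another one behind it, which lengthens the combinational path. The four-multiplier form spends one more multiplier per rotation but has a shorter path, which is one plausible way to buy clock speed with extra multipliers.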
Expanding the 1D-IDCT to a 2D-IDCT
It is possible to expand the 1D-IDCT block to a 2D-IDCT function
by cascading the one-dimensional function. Two transpose operations
have to be added in order to arrange the data in the proper order.
A plain serial implementation of two 1D-IDCT blocks would double
the resources needed in an FPGA or ASIC. However, instead of just
cascading the 1D-IDCT it is also possible to achieve a more
efficient implementation by applying a "resource sharing"
optimisation technique. This is achieved by implementing just one
1D-IDCT and feeding the output data stream back to the input of the
1D-IDCT block. This kind of time-division multiplexing is possible
because the 2D-IDCT is built from two identical 1D-IDCT functions.
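The feedback arrangement can be modelled in software as a single 1D-IDCT object that is used for both passes, with a transpose after each pass. The Python sketch below is a behavioural illustration only, not a hardware description. The class and function names are hypothetical, and the 8x8 block size and orthonormal scaling are assumptions.

```python
import math

N = 8  # block size (assumed)


class SharedIdct1D:
    """Hypothetical model of the one physical 1D-IDCT unit; counts its uses."""

    def __init__(self):
        self.calls = 0

    def run(self, coeffs):
        """Direct N-point 1D-IDCT with orthonormal scaling (assumed)."""
        self.calls += 1
        out = []
        for n in range(N):
            acc = 0.0
            for k in range(N):
                ck = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
                acc += ck * coeffs[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
            out.append(acc)
        return out


def transpose(block):
    """Swap rows and columns, as a transpose buffer would."""
    return [list(col) for col in zip(*block)]


def idct_2d_shared(block, unit):
    """Run the same unit twice: rows, transpose, columns, transpose."""
    data = block
    for _pass in range(2):
        # The output of the first pass is fed back as input to the second.
        data = transpose([unit.run(vector) for vector in data])
    return data
```

After one 8x8 block, `unit.calls` is 16: the single unit performs all eight row transforms and all eight column transforms, where a cascade would give eight to each of two units.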
Of course, as the block is used twice it has to be clocked at
twice the frequency for the same throughput. Using this technique
it is possible to achieve a very resource-efficient implementation.
The complete 2D-IDCT then requires only 11, 12, or 14 multipliers
(depending on the algorithm used).
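The factor of two in clock frequency is simple arithmetic, shown here for a hypothetical workload. The frame size, the frame rate, and the assumption that the 1D unit accepts one sample per clock cycle are illustrative choices, not figures from this article. Blanking intervals and chroma samples are ignored.

```python
def required_clock_hz(samples_per_second, samples_per_cycle, passes):
    """Clock needed when every sample goes through the 1D unit `passes` times."""
    return samples_per_second * passes / samples_per_cycle


# Hypothetical workload: 1920 x 1080 luma samples at 30 frames per second.
sample_rate = 1920 * 1080 * 30

# Two cascaded units: each sample passes through each unit once.
cascaded_hz = required_clock_hz(sample_rate, samples_per_cycle=1, passes=1)
# One shared unit: each sample passes through the same unit twice.
shared_hz = required_clock_hz(sample_rate, samples_per_cycle=1, passes=2)

print(f"cascaded: {cascaded_hz / 1e6:.3f} MHz, shared: {shared_hz / 1e6:.3f} MHz")
```

Under these assumptions the shared unit has to run at about 124 MHz where each unit of a cascade would run at about 62 MHz.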
Taken to extremes, this makes it possible to implement the full
functionality in even the smallest available low-cost FPGAs, thus
competing head-on with low-cost DSP processors. ESL allows DSP
developers to explore the use of hard-coded algorithms in FPGAs and
ASICs whilst maintaining device efficiency.
Pierluigi Lo Muzio is a DSP specialist and Philipp
Jacobsohn is a senior field applications engineer at
Synplicity.