EE Times-Asia

The basics of implementing DSP functions on low-cost FPGAs

Posted: 01 Sep 2005

Keywords: fpga, xilinx

By Suhel Dhanani, Xilinx
Programmable Logic DesignLine

In the past, low-end FPGAs were cost-reduced architectural derivatives of high-end FPGA platforms, manufactured on trailing-edge process geometries. As the penetration of FPGAs into consumer systems continues, the use model is moving from being glue/control logic-centric to core data processing-centric, changing the dynamics of the low-cost model. As a result, today's low-cost FPGAs have a unique architecture and feature set optimized to reduce the cost of implementing widely used functions such as image and video compression.

Today's low-cost FPGAs are leading the process curve, with dominant architectures implemented on 90nm technology. The relentless march down the process curve, coupled with increasing yield on ever larger wafer sizes, has allowed a dramatic decrease in FPGA costs. The high-volume price of a 50,000-gate programmable device has dropped from $20 in 1998 to $2.95 today. Measured in terms of cost per 1,000 logic cells, cost has fallen by a factor of nearly 30 in just seven years. This reduction is due purely to smaller die sizes enabled by ever-shrinking process technology.

At the 90nm node, the FPGAs incorporate platform features for the integration of high-speed DSP, memory control, complex clocking and processing functions with minimal overhead.

The benefits of these low price points can be measured in terms of both the unit cost of the device and the cost of an implemented function. The "cost of implemented function" is a measure of the device's efficiency in implementing that function. The platform features included in leading low-cost FPGA architectures make them particularly suited to implementing high-performance DSP functions in a cost-effective manner.

Low-Cost FPGAs with embedded DSP functionality
Platform features included in currently available low-cost FPGA architectures are well suited to implementing DSP functions. Features such as embedded multipliers, distributed memory, shift registers and block memories allow FPGAs to efficiently implement the high-performance DSP functions required in consumer devices. FPGAs bring two key advantages to the digital signal processing requirements of consumer products. First, FPGA architectures have the capacity to provide highly parallel implementations of a DSP function, affording very high performance. Second, user programmability allows designers to trade off device area against performance by selecting the appropriate level of parallelism to implement functions.

By programming the FPGA to use more on-chip resources, designers achieve higher performance. Conversely, by using fewer resources and accepting correspondingly lower performance, designers can optimize the design for low cost. For a real-life example of the value of user programmability, consider an N-tap FIR filter. By using resources available in the FPGA fabric, the designer can create a highly parallel implementation and achieve higher performance (see Figure 1). A conventional DSP approach, in which a limited multiplier and accumulator resource must be used in a time-multiplexed manner, may not meet required system performance. By using several multipliers in parallel, very high DSP performance can be achieved with an FPGA.

Figure 1: FPGA's parallel approach to DSP utilizing the embedded multipliers enables high computational throughput
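The contrast between the two approaches can be sketched in software. The following is a minimal behavioral model in Python, not hardware code and not taken from the article: the function names and the step counters are illustrative assumptions. Both functions compute the same N-tap FIR output; they differ only in how many sequential multiply steps each output sample needs.

```python
def fir_serial(samples, coeffs):
    """One shared multiplier/accumulator: N multiply steps per output sample."""
    n = len(coeffs)
    history = [0] * n              # tap delay line, newest sample first
    outputs, mult_steps = [], 0
    for x in samples:
        history = [x] + history[:-1]
        acc = 0
        for k in range(n):         # time-multiplexed MAC loop
            acc += coeffs[k] * history[k]
            mult_steps += 1
        outputs.append(acc)
    return outputs, mult_steps


def fir_parallel(samples, coeffs):
    """N multipliers side by side: one multiply step per output sample."""
    n = len(coeffs)
    history = [0] * n
    outputs, steps = [], 0
    for x in samples:
        history = [x] + history[:-1]
        # All tap products are formed in the same step, then summed
        # (an adder tree in hardware).
        products = [c * h for c, h in zip(coeffs, history)]
        outputs.append(sum(products))
        steps += 1
    return outputs, steps
```

For a 4-tap filter the serial model needs four multiply steps per sample where the parallel model needs one, which is the throughput gap the figure illustrates.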

Given that FPGAs are completely hardware configurable, designers have the flexibility to use only the resources that the algorithm demands. Put another way, the designer can optimize the hardware architecture to suit the ideal algorithm.

There are different ways to implement four multiply-accumulate (MAC) functions (see Figure 2). By using four embedded multipliers within the FPGA fabric, this can be done at maximum performance. Alternatively, designers can opt to conserve area and implement the same function at a lower performance by using only one multiplier, one accumulator, and a register. A semi-parallel approach can also be used to balance performance and costs.

Figure 2: The FPGA can be used to implement DSP functions in a variety of ways to meet the price-performance characteristics of the system
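The area/performance trade-off for the four MAC operations can be illustrated with a small Python model. This is a sketch under simple assumptions (one cycle per group of products, and a hypothetical function name); it does not model any particular device.

```python
def multiply_accumulate(a, b, num_multipliers):
    """Sum of a[i] * b[i], using num_multipliers products per cycle.

    Returns (result, cycles). With 4 multipliers the four products are
    formed in one cycle (fully parallel); with 1 multiplier they take
    four cycles (serial); with 2 they take two cycles (semi-parallel).
    """
    acc = 0
    cycles = 0
    for start in range(0, len(a), num_multipliers):
        stop = start + num_multipliers
        # Products computed in the same cycle by the available multipliers.
        group = [x * y for x, y in zip(a[start:stop], b[start:stop])]
        acc += sum(group)
        cycles += 1
    return acc, cycles


a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
for m in (4, 2, 1):
    result, cycles = multiply_accumulate(a, b, m)
    print(f"{m} multiplier(s): result={result}, cycles={cycles}")
```

All three configurations produce the same result; only the cycle count and the number of multipliers consumed change.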

The platform features promote the design of high-performance DSP functions in a small fraction of the total device. This leaves the rest of the device free to implement system logic functions for lower costs and higher system integration.

Dedicated 18 x 18 multipliers can significantly speed up DSP functions in an area- and power-efficient manner. Often, multiplier blocks share routing resources with block memory for increased efficiency in many applications. Functions such as signed-signed, signed-unsigned, and unsigned-unsigned multiplication; logical, arithmetic, and barrel shifters; and two's-complement and magnitude return can be easily implemented using these structures.
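To see why a multiplier can double as a shifter or a negator, consider the arithmetic identities involved: a left shift by n is a multiply by 2**n, and two's-complement negation is a multiply by -1. The Python sketch below models these at 18-bit width. The helper names are hypothetical, and the code illustrates the arithmetic only, not any vendor's multiplier primitive.

```python
WIDTH = 18                      # operand width of the dedicated multipliers
MASK = (1 << WIDTH) - 1


def to_signed(value, bits=WIDTH):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def multiply(a, b, a_signed=True, b_signed=True):
    """18 x 18 multiply with each operand treated as signed or unsigned."""
    a = to_signed(a) if a_signed else a & MASK
    b = to_signed(b) if b_signed else b & MASK
    return a * b                # full-precision product


def barrel_shift_left(x, n):
    """Left shift by n, done as a multiply by 2**n, truncated to 18 bits."""
    return multiply(x, 1 << n, a_signed=False, b_signed=False) & MASK


def negate(x):
    """Two's-complement negation as a multiply by -1."""
    return multiply(x, -1)


def magnitude(x):
    """Magnitude return: multiply by -1 only when the operand is negative."""
    return multiply(x, -1 if to_signed(x) < 0 else 1)
```

The same 18-bit pattern 0x3FFFF is read as -1 when signed and as 262143 when unsigned, so `multiply(0x3FFFF, 2)` and `multiply(0x3FFFF, 2, a_signed=False, b_signed=False)` give different products from identical input bits.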

Shift register logic enables efficient designs that require delay or latency compensation. Shift registers are also useful in synchronous FIFO and Content Addressable Memory (CAM) designs.
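As a software analogy for delay compensation, the sketch below models a fixed-depth shift register in Python. The class name and interface are assumptions made for illustration; each call to `clock` represents one clock edge, and a value written in reappears at the output `depth` clocks later.

```python
from collections import deque


class ShiftRegister:
    """Fixed-depth delay line: output lags input by `depth` clock cycles."""

    def __init__(self, depth):
        # A full deque with maxlen drops the oldest (rightmost) element
        # whenever a new one is pushed on the left.
        self.cells = deque([0] * depth, maxlen=depth)

    def clock(self, value):
        """Shift in `value`; return the value leaving the far end."""
        out = self.cells[-1]
        self.cells.appendleft(value)
        return out


delay = ShiftRegister(depth=3)
delayed = [delay.clock(x) for x in [1, 2, 3, 4, 5, 6]]
print(delayed)
```

A designer would pick the depth to match the latency of a parallel datapath so that both paths arrive at a downstream block in the same cycle.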

Distributed RAM is crucial in implementing scratch-pad memory and small FIFOs. Applications that require large, on-chip memories will benefit from block RAMs that can be used to implement RAM, ROM, FIFOs, large look-up tables, data width converters, circular buffers and shift registers.

A handheld system design example
A simple version of a video router/mixer system compresses several high-bandwidth video streams from different sources. Then, using high-speed, pixel-rate, pipelined math, the video streams are manipulated and merged. The system uses a low-cost FPGA family to implement high-performance, cost-effective DSP functions, and interfaces with popular handheld devices to send and receive images as well as GPS data. It forms a complete wireless video system based on proprietary algorithms and the Discrete Wavelet Transform.

Various image compression algorithms were considered. The Discrete Cosine Transform (DCT) is the basis of standards such as JPEG, MPEG, H.261 and H.263. When used as a compression method, it separates the image into spatial-frequency components and so identifies the frequencies that are not detectable by the human eye. These components can be eliminated without adversely affecting visual quality.
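The idea of discarding frequency components can be sketched with a one-dimensional, orthonormal DCT-II in Python. This is an illustrative model only: the standards named above operate on two-dimensional 8 x 8 blocks and quantize coefficients rather than simply zeroing them, and the function names here are assumptions.

```python
import math


def _scale(k, n):
    """Orthonormal DCT-II scale factor for coefficient k of n."""
    return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)


def dct(block):
    """Forward 1-D DCT-II of a list of samples."""
    n = len(block)
    return [
        _scale(k, n) * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                           for i, x in enumerate(block))
        for k in range(n)
    ]


def idct(coeffs):
    """Inverse of dct(): rebuild samples from coefficients."""
    n = len(coeffs)
    return [
        sum(_scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs))
        for i in range(n)
    ]


def compress(block, keep):
    """Keep the `keep` lowest-frequency coefficients and zero the rest."""
    coeffs = dct(block)
    return coeffs[:keep] + [0.0] * (len(coeffs) - keep)


block = [52, 55, 61, 66, 70, 61, 64, 73]
approx = idct(compress(block, 3))   # rebuilt from 3 of 8 coefficients
print([round(v, 1) for v in approx])
```

Each coefficient is a sum of products, which is the kind of multiply-accumulate workload the embedded multipliers are suited to.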

Vector Quantization exploits visual similarities between small blocks of still images. Each similar block is replaced with a candidate block that is stored in a lookup table. An image is then composed of a sequence of such blocks and coded using a variable-length, Huffman-style encoder.
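The lookup-table step can be sketched as a nearest-neighbor search. In the Python example below, the codebook contents, block size, and function names are all made up for illustration, and the variable-length (Huffman-style) coding of the resulting indices is omitted.

```python
def nearest(block, codebook):
    """Index of the codebook entry with the smallest squared error."""
    def squared_error(entry):
        return sum((a - b) ** 2 for a, b in zip(block, entry))
    return min(range(len(codebook)), key=lambda i: squared_error(codebook[i]))


def encode(blocks, codebook):
    """Replace each image block with the index of its closest candidate."""
    return [nearest(block, codebook) for block in blocks]


def decode(indices, codebook):
    """Rebuild the (approximate) image blocks from codebook indices."""
    return [codebook[i] for i in indices]


# Hypothetical codebook of flattened 2 x 2 pixel blocks.
codebook = [
    (0, 0, 0, 0),          # dark block
    (255, 255, 255, 255),  # bright block
    (0, 255, 0, 255),      # vertical edge
]
blocks = [(3, 1, 0, 2), (250, 255, 251, 254), (10, 240, 5, 250)]
indices = encode(blocks, codebook)
print(indices, decode(indices, codebook))
```

Only the indices need to be stored or transmitted, so each four-pixel block in this toy example shrinks to a single small integer.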

Standards such as JPEG 2000 and MPEG-4 use a different type of transformation called the Discrete Wavelet Transform. This transformation filters the image through a set of low-pass/high-pass filter combinations, generating images at multiple resolutions. Each of the filtered images can be downsampled and compressed using techniques such as vector quantization.
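A one-dimensional sketch of the filter-and-downsample step is shown below, using the Haar wavelet because it is the simplest filter pair. This is an assumption made for illustration: JPEG 2000 specifies different filters, and the filters used in the system described here are proprietary and not given in the article. The function names are likewise hypothetical, and inputs are assumed to have even length at every level.

```python
def haar_step(signal):
    """One analysis level: low-pass and high-pass outputs, downsampled by 2."""
    pairs = range(0, len(signal) - 1, 2)
    low = [(signal[i] + signal[i + 1]) / 2 for i in pairs]    # averages
    high = [(signal[i] - signal[i + 1]) / 2 for i in pairs]   # differences
    return low, high


def haar_inverse(low, high):
    """Rebuild the original samples from one level of coefficients."""
    out = []
    for a, d in zip(low, high):
        out += [a + d, a - d]
    return out


def haar_multilevel(signal, levels):
    """Apply haar_step repeatedly to the low-pass branch."""
    approx, details = list(signal), []
    for _ in range(levels):
        approx, high = haar_step(approx)
        details.append(high)
    return approx, details


row = [4, 6, 10, 12, 8, 8, 0, 2]
approx, details = haar_multilevel(row, 2)
print(approx, details)
```

Each level halves the resolution of the low-pass branch, which is how the multiple-resolution images mentioned above arise; the small high-pass values are the ones that compress well.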

Motion estimation methods used in MPEG and related standards identify and quantify the movement of the foreground relative to the background. Given a reference video frame, motion estimation can encode another captured frame in the sequence in terms of the affine transformations to be done on the reference frame. Affine transformations yield very high compression ratios.

An example of a simple motion type is "translation," as happens when a camera pans. Other motion types include scaling, skewing, and rotation. Estimation techniques employ block-based or object-based comparison to identify correspondence between objects in two frames. A video stream is a sequence of captured and predicted frames, with the predicted frames yielding a high degree of compression.
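Block-based comparison for the translation case can be sketched as an exhaustive search that minimizes the sum of absolute differences (SAD). The Python example below is a minimal illustration with hypothetical function names; it handles translation only, not the scaling, skewing, or rotation mentioned above, and frames are plain lists of pixel rows.

```python
def sad(ref, cur, rx, ry, cx, cy, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(abs(ref[ry + j][rx + i] - cur[cy + j][cx + i])
               for j in range(size) for i in range(size))


def find_motion(ref, cur, cx, cy, size, search):
    """Find where the block at (cx, cy) in `cur` came from in `ref`.

    Tries every offset within +/- `search` pixels and returns
    (dx, dy, cost) for the best match, where (cx + dx, cy + dy) is the
    matching block position in the reference frame.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            if 0 <= rx <= width - size and 0 <= ry <= height - size:
                cost = sad(ref, cur, rx, ry, cx, cy, size)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best[1], best[2], best[0]
```

When the match is good, a predicted frame needs to carry only the offset and a small residual for each block instead of the pixels themselves. Every candidate offset is independent of the others, so the SAD computations are a natural fit for parallel hardware.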

Because both the quality of video compression and the resources it requires depend on efficiently identifying structural, spatial, visual and temporal similarity, there is room for significant innovation in developing new compression parameters and more efficient ways of computing similarity. Newer and better ways to compress video are continually sought, and the system shown in Figure 3 is a prime example.

Using a proprietary implementation of wavelet compression technology and a patent-pending proprietary progressive encoding scheme, the still image codec used in the system renders full-color 4:1:1 QCIF and SQCIF format image clips at file sizes significantly lower than those of other image codec implementations, with high image quality (measured as peak signal-to-noise ratio).
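For readers unfamiliar with the metric, peak signal-to-noise ratio compares a decoded image against the original on a logarithmic scale, with higher values indicating a closer match. The short Python sketch below uses the standard definition and assumes 8-bit samples (peak value 255); the function name is illustrative, and no figures for the codec described here are implied.

```python
import math


def psnr(original, decoded, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-length sample lists."""
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf          # identical images
    return 10 * math.log10(peak ** 2 / mse)
```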

The video codec used in the system is designed to deliver compressed video in various frame-size formats and can support the multimedia component of a variety of services and applications, such as peer-to-peer messaging, video download, content distribution, video-conferencing, streaming and MMS.

Figure 3: The Qwikard system enables wireless transfer of image, video and GPS location, using popular handheld PDAs

This video compression technology delivered an efficient low bit-rate video encoder/decoder that uses the embedded multipliers in the FPGA fabric. Using embedded multipliers in parallel to process DSP functions allows the system to reach the performance required for real-time video processing.

There were various challenges associated with the system design. The motion-estimated compressed video stream generated bursty, variable-bit-rate data traffic with certain delay requirements, and adapting this traffic to the normal TCP/IP protocol was a challenge. Another challenge was to produce real-time compressed video suitable for digital storage as well as for real-time viewing over networks of varying link capacity. In addition, the geographical information obtained through the GPS module had to be embedded in the image file.

Using the embedded features in the FPGA, such as multipliers and distributed memory, helped overcome these challenges. The FPGA fabric was used to implement both the image compression algorithms and the system logic functions, affording a high level of integration. In addition, the system designers were able to implement the memory interfaces and the GPS logic in the same FPGA device (see Figure 4).

Figure 4: The system uses the FPGA to implement not only control logic, but also compression algorithms

Higher levels of customization and lower run rates increasingly characterize consumer electronics systems, as product life cycles shorten and target markets fragment. For these products, low-cost FPGAs enable integration of system-level functions providing the ideal platform for implementing not only control logic and interconnect, but also controllers, DSP and other core data processing.

By integrating most of the system functionality on a programmable platform, companies not only get their product to market faster, but also mitigate the risks associated with future design changes and feature enhancements.

For cutting-edge designs where design cycles are tight, requirements are fluid, and expected volumes are uncertain, system companies implement most of the core functionality in low-cost FPGAs, not only to ensure flexibility in case of change but also to meet required system performance at the lowest possible system cost. Such systems tend to use FPGAs for the bulk of their core processing and DSP requirements. Today these systems are only a small percentage of the consumer systems in existence, but trends within the FPGA and consumer electronics industries indicate that their numbers will grow.

About the author
Suhel Dhanani is a senior marketing manager for High Volume Products at Xilinx. Suhel has almost 10 years of marketing experience in the semiconductor industry. He has completed graduate work in Management Science from Stanford and also holds M.S.E.E. and M.B.A. degrees from Arizona State University.
