
# FPGA accelerates real-time app performance

**Keywords: FPGA, ASIC, FFT, DSP, Hanning window**

*Rodger Hosking and Richard Kuenzler of Pentek Inc. talk about the pros and cons of FPGAs and ASICs, and why FPGAs are better for real-time apps.*

Field programmable logic has been the circuitry of choice for connecting high-speed peripherals like wideband ADCs/DACs, digital receivers and communication links to programmable processors in embedded real-time systems. FPGAs are well suited to handle the clocking, synchronization and other diverse timing circuitry needed to tame some specialized devices. FPGAs are also useful in data-formatting tasks like serial-to-parallel conversion, data packing, time stamping, multiplexing and packet formation.

However, with recent advances in chip technology, with the silicon becoming much faster and denser, the role of the FPGA is changing dramatically. Virtually all of the latest device offerings from major FPGA vendors include ready-to-use building blocks for DSP applications. These include dedicated multiplier engines and flexible memory structures that can be tailored into block memory, dual-port RAM, FIFO memory and shift registers.

Indeed, DSP capability has become one of the most significant product strategies for FPGAs, as evidenced by sharp increases in engineering and marketing investments over the last few years.

**FPGA vs. ASIC**

After jumping on the DSP bandwagon, FPGA vendors are prone to predictions of the demise of the general-purpose programmable DSP chip. In practice, however, programmable DSPs will continue to be a better choice for many kinds of tasks found in real-time embedded systems, primarily because it is still easier to program a characterized, well-supported, standard processor than a one-of-a-kind combination of signal-processing elements in an FPGA. Tellingly, the latest generation of FPGAs is now available with on-chip standard RISC processor cores to complement the logic blocks and configurable memory resources.

Nevertheless, FPGAs can be extremely effective for certain DSP functions, especially for well-defined algorithms that can take advantage of parallel operations. To help accelerate this practice, FPGAs are firmly entrenched in the product designs of embedded systems and are often located directly in the signal flow path. Therefore, it can be a short technical leap to enlist newly acquired DSP resources of the FPGA to handle some DSP functions formerly assigned to the programmable processor.

**FPGAs in software radio**

An onboard FPGA accepts real outputs from both ADCs, as well as complex baseband outputs from both of the digital downconverters. The FPGA implements the Velocity Interface Mezzanine to deliver data directly into each DSP or PowerPC on the processor board, where FIFO buffers support DMA block data transfers at rates up to 400MBps.

With an eye toward adding DSP capability, a natural choice for the FPGA on this product was the Xilinx Virtex-II family. With 96 dedicated 18x18 multiplier blocks and over 200KB of block RAM, the XC2V3000 offers a mix of signal-processing resources, even for some of the more substantial applications.

In the basic factory configuration of the module, the FPGA still performs the traditional tasks of timing, formatting and glue logic for the various devices on board. Since these functions are relatively simple, they consume only 6 percent of the programmable logic. This leaves 94 percent of the logic blocks, all 96 multipliers and virtually the entire block RAM available for adding DSP algorithms.

To demonstrate the power of these untapped resources, an engineering project was launched to implement a high-performance FFT engine. Since communications, radar and signal intelligence systems all use FFTs for tracking, tuning and image-processing operations, the FFT remains one of the most popular algorithms for benchmarking processor performance.

In a nutshell, the FFT accepts a block of input time-domain samples and converts them into a block of output frequency-domain samples. Because the calculation is rather complex, it consumes a significant share of DSP resources and becomes a prime candidate for FPGA implementation.
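
The time-to-frequency conversion described above can be illustrated with a naive pure-Python DFT (a toy stand-in for the FPGA's pipelined FFT, used here only to show the input/output relationship; the block size of 64 is an arbitrary choice for the example):

```python
import cmath

def dft(x):
    """Naive DFT: converts N time-domain samples into N frequency bins."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

N = 64
# A unit-amplitude complex tone at bin 5: all its energy should land in bin 5.
x = [cmath.exp(2j * cmath.pi * 5 * n / N) for n in range(N)]
X = dft(x)
peak = max(range(N), key=lambda k: abs(X[k]))
print(peak)  # 5
```

A real FFT reaches the same result with far fewer multiplies, which is exactly the structure the butterfly decomposition below exploits.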

**Constructing FFT**

One of the most efficient methods of performing the FFT calculation is an iteration of the radix-4 "butterfly" algorithm. Inside each butterfly, four input data points are multiplied by coefficients from a sine table and then combined to produce four output points. This butterfly operation is repeated until all input points are processed, four at a time, representing a single "stage." To implement a 4,096-point FFT, six stages of butterflies are required.

One of the benefits of using an FPGA over a conventional programmable processor for computing FFTs is the large number of multipliers available for simultaneous calculation. In the 4,096-point example above, a total of 60 multipliers are needed to implement all six FFT butterfly stages in parallel. Since the XC2V3000 has 96 multipliers available, it becomes obvious why FPGAs can often outperform a standard DSP having only two or four hardware multipliers, especially for algorithms like the FFT.

Since the FFT is inherently a block-oriented algorithm, it operates most efficiently when a freely addressable RAM supports quick access to all I/O samples. However, this model of random data availability is contrary to the sequential input data samples streaming from the ADC. Fortunately, the configurable block RAM resources of the FPGA can be retooled to form a memory structure that feeds the appropriate samples into four input data memory ports of the butterfly engines in parallel, thus solving the data availability problem. This proprietary memory architecture allows subsequent input blocks to be processed in a continuous, systolic manner so that all of the multipliers in all six stages can be productively engaged all the time.

For every FPGA clock cycle, each radix-4 butterfly processes four input samples. Therefore, when the FPGA processing clock is equal to the A/D clock, the architecture above is capable of running four times faster than real-time. With suitable hardware multiplexing schemes, this same FFT engine can be used to handle four streams of input data instead of just one.

In our example, with two ADCs and the FPGA all clocking at 100MHz, the FPGA is only working at half capacity. But with little extra effort, the engine can be set to handle 50 percent input overlap processing of both channels to fully utilize the hardware. In this case, the pipelined execution time is 10.24µs for each FFT. This is four times faster than the time it takes to collect the 4,096 input points at a 100MHz sampling rate, consistent with performing four FFTs in real-time.
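
The timing claims above are easy to verify with the numbers given in the article (100MHz sample rate, 4,096-point blocks, 10.24µs pipelined execution):

```python
fs = 100e6          # A/D sample rate, Hz
n = 4096            # FFT block size
t_collect = n / fs  # time to gather one input block of samples
t_fft = 10.24e-6    # pipelined FFT execution time, from the article

print(t_collect * 1e6)    # 40.96 -> 40.96 µs to collect 4,096 samples
print(t_collect / t_fft)  # 4.0   -> the engine runs 4x faster than real time
```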

**FFT enhancements**

Since only 60 of the 96 multipliers were used for the FFT algorithm, additional features were incorporated. At each of the four complex input streams, an optional Hanning window can be applied, requiring eight extra multipliers. Since coefficients for the FFT and for the Hanning window use separate FPGA table memories, alternate input windowing functions can be substituted for the Hanning window.
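
The Hanning window stage amounts to one multiply per input sample by a precomputed coefficient table, which is why it costs extra hardware multipliers. A minimal sketch of the coefficient table and the windowing step (the window length of 8 is chosen only to keep the printout short; the FPGA would use a 4,096-entry table):

```python
import math

def hanning(n):
    """Hanning (Hann) window coefficients, one per input sample."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def apply_window(samples, window):
    # One multiply per sample -- in the FPGA these run in dedicated
    # multiplier blocks ahead of the first butterfly stage.
    return [s * w for s, w in zip(samples, window)]

w = hanning(8)
print([round(c, 3) for c in w])  # tapers from 0 up to 1 and back to 0
```

Because the coefficients live in their own table memory, swapping in a different window (Hamming, Blackman, etc.) only means reloading that table, as the article notes.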

Eight more multipliers are used to perform an optional power calculation at the FFT output, in which the real and imaginary components of each of the four outputs are squared and then added together. Finally, an averager stage adds the two outputs of the 50 percent input overlap FFTs to improve signal-to-noise characteristics. At the output of the FPGA, a multiplexer allows the results of each signal-processing stage to be directed to the processor interface.
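
A toy sketch of these two output stages follows: the power calculation (re² + im² per bin, which is where the eight extra multipliers go) and the combining of the two 50-percent-overlap spectra (shown here as a simple average; the exact scaling applied in the FPGA's averager is an assumption):

```python
def power(spectrum):
    """Power per bin: re^2 + im^2 for each complex FFT output."""
    return [z.real ** 2 + z.imag ** 2 for z in spectrum]

def combine_overlapped(p_a, p_b):
    # Combine the power spectra of the two 50-percent-overlapped FFTs
    # to improve the signal-to-noise characteristics of the estimate.
    return [(a + b) / 2 for a, b in zip(p_a, p_b)]

pa = power([3 + 4j, 1 + 0j])  # [25.0, 1.0]
pb = power([1 + 2j, 0 + 1j])  # [5.0, 1.0]
print(combine_overlapped(pa, pb))  # [15.0, 1.0]
```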

**FPGAs are here to stay**

At an execution speed of 10.24µs for a 4,096-point complex FFT, this FPGA engine outperforms benchmarks for an optimized FFT algorithm running on a 400MHz G4 PowerPC by a factor of ten.

To achieve a calculation dynamic range of better than 90dB, several techniques were used to reduce the rounding and truncation errors inherent in FPGA integer arithmetic. After optimization for execution speed by deploying the available FPGA resources, the entire design used 76 of the 96 multipliers, 99 percent of the logic slices and 97 percent of the block RAM of the XC2V3000 device.

Although this particular FPGA component is still expensive because of its recent introduction, two concentric subsets of the BGA footprint pattern accommodate two smaller devices in the same family to save costs for less demanding applications.

- **Rodger Hosking**, *Vice President*
- **Richard Kuenzler**, *Sr. Design Engineer*

Pentek Inc.
