EE Times-Asia

DSPs vs. FPGAs for multiprocessing

Posted: 10 Mar 2008

Keywords: multiprocessor signal processing, DSP vs FPGA, OFDM Access, parallel processing

By Edward Young and Paul Moakes

Life used to be easy. If you were working on a multiprocessor signal processing application, you would write down the requirements, check the specs of the devices on offer from the major DSP vendors, and just pick the chip that suited best.

Times have changed, and today's engineers are blessed with far more choice. The big FPGA vendors have stepped up their offerings for signal processing, and choosing the best solution can seem complex.

For distributed applications, the choice of interconnect technology also has a crucial effect on the overall solution. Crunching the data is all very well, but your system needs the right interfaces to move it between the different processors and to off-load the results. What do the DSP and FPGA vendors have to offer in this area?

This article will look at what's available for multiprocessor systems (which inevitably tends to mean the high-performance end of the market), and how you can make the best choice between DSP, FPGA or a hybrid mixture of the two. We'll look fairly briefly at issues involved in the two types of chip, but concentrate more on system-level factors.

For high-performance signal processing applications, of course there are other options beyond DSPs and FPGAs. Massively parallel processors from vendors like picoChip are one alternative, but unfortunately often require the use of the vendor's proprietary toolset. ASICs and ASSPs are also well-suited to certain signal processing tasks, but their high up-front costs rule them out except in high volume applications.

DSPs evaluated
Pretty much since their invention in the 1980s, DSPs have provided excellent performance at reasonable power and cost levels. A large community of experienced DSP engineers has also grown alongside the technology, developing a substantial base of off-the-shelf, field-proven code to run on the DSP cores. There is also well-established support from third-party vendors for debug and optimization tools.

High-performance DSPs continue to develop, with faster clock speeds and multicore solutions. Very long instruction word (VLIW) DSPs provide high clock rates and independent execution units to get the maximum speed.

DSP development cost is relatively low, and as a mature technology it can be argued that it has a lower risk and faster time-to-market than FPGAs and other signal processing technologies.

DSPs can be attractive for the many applications that are based on emerging standards, which often change frequently and rapidly. As DSP algorithms can be readily implemented in an accessible language such as C, it is easier to update the code to reflect changes in the standards as they occur. In addition, the complex nature of many of the signal processing algorithms in applications such as the latest wireless standards often makes them more suitable to implement using a DSP: it is much easier for a DSP device to change the processing algorithm on-the-fly by calling a different software routine. While modern FPGAs can be reconfigured quickly, doing so dynamically while continuing to process data is a complex and challenging task.
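Switching algorithms by calling a different routine can be sketched as a simple dispatch table. The example below is illustrative only: the function names, the two modulation modes and the bit-mapping convention are assumptions, not taken from any particular wireless standard or vendor library. It is shown in Python for brevity; on a DSP the same pattern would typically be a table of C function pointers.

```python
def bpsk_demap(symbols):
    """Hard-decision BPSK: one bit per symbol (assumed mapping: negative real -> 1)."""
    return [1 if s.real < 0 else 0 for s in symbols]


def qpsk_demap(symbols):
    """Hard-decision QPSK: two bits per symbol, taken from the real and imaginary signs."""
    bits = []
    for s in symbols:
        bits.append(1 if s.real < 0 else 0)
        bits.append(1 if s.imag < 0 else 0)
    return bits


# Hypothetical dispatch table: the routine is chosen per burst at run time.
DEMAPPERS = {"bpsk": bpsk_demap, "qpsk": qpsk_demap}


def process_burst(mode, symbols):
    """Demap one burst using whichever routine the current mode selects."""
    return DEMAPPERS[mode](symbols)


print(process_burst("bpsk", [1 + 0j, -1 + 0j]))
print(process_burst("qpsk", [complex(1, 1), complex(-1, 1)]))
```

Changing the algorithm between bursts costs only a table lookup, which is the kind of flexibility that is harder to match with a fixed FPGA configuration.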

DSPs are also becoming more power-efficient. Led by the demands of the hand-held market, some next-generation high-performance DSPs are incorporating power management techniques from their little brothers. This allows overall system power dissipation to be reduced during times of low traffic, or to prevent over-temperature. A power and temperature-aware FPGA configuration could, of course, manage its clock domains in a similar way, but at the cost of greater development effort.

However, the DSP is not particularly well suited to parallel processing tasks: multiple devices can be required for tasks that fit easily into a single FPGA. For example, in wireless baseband applications, for the processing of WiMAX OFDM Access (OFDMA) channels, a pure DSP solution cannot match an FPGA in the bandwidth and number of channels it can process. Consequently, the DSP solution may have an unacceptable cost and power per channel.

To improve DSP performance in specific algorithms, vendors have introduced hardware cores to handle some processing traditionally off-loaded to FPGAs. For example, TI's TCI6482 DSP includes Viterbi and turbo decoder co-processors for 3GPP and 3GPP2, while the multi-core TCI6487 DSP also includes a direct Common Public Radio Interface (CPRI) / Open Base Station Architecture Initiative (OBSAI) interface, which can be chained between DSPs.

The FPGA alternative
FPGAs have one big advantage over DSPs: their efficiency in concurrent applications, achieved by using multiple parallel processing blocks. Coupled with their flexibility to allow the embedded systems designer to tailor the device to match their application's demands as closely as possible, FPGAs can achieve the highest possible throughput with low cost per channel.

The FPGAs' flexibility has traditionally come with an additional cost in power due to the increased gate count and silicon area of non-optimized solutions in comparison to hardwired architectures. However, 65nm technologies and the use of equivalent ASIC technology for volume manufacture mean that FPGAs can be low-power in the lab, and power-reduced further in volume.

The per-channel power of an FPGA may now well be below that of a DSP, even though the chip-level power dissipation is higher. DSPs typically consume 3-4W and FPGAs 7-10W, but FPGAs can handle 10x the channel density.
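The per-channel arithmetic behind this claim can be sketched in a few lines of Python. The wattage ranges and the 10x density ratio come from the text; the DSP baseline of 10 channels is a hypothetical figure chosen only to make the numbers concrete.

```python
def power_per_channel(chip_watts, channels):
    """Return the power dissipated per processed channel, in watts."""
    if channels <= 0:
        raise ValueError("channels must be positive")
    return chip_watts / channels


# Assumption: one DSP handles 10 channels (illustrative baseline, not from the article).
DSP_CHANNELS = 10
FPGA_CHANNELS = 10 * DSP_CHANNELS  # 10x the channel density, as stated in the text

# Worst case for the FPGA: its highest chip power against the DSP's lowest.
dsp_best = power_per_channel(3.0, DSP_CHANNELS)
fpga_worst = power_per_channel(10.0, FPGA_CHANNELS)

print(f"DSP  (3 W, {DSP_CHANNELS} channels):  {dsp_best:.2f} W per channel")
print(f"FPGA (10 W, {FPGA_CHANNELS} channels): {fpga_worst:.2f} W per channel")
```

Because the density ratio (10x) is larger than the worst-case chip power ratio (10W against 3W), the FPGA comes out ahead per channel whatever baseline channel count is assumed.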

Recognition of the advantages of DSPs has led in recent years to FPGAs incorporating DSP technology, for example the Xilinx Virtex-5 SXT devices. This enables the FPGA to incorporate DSP algorithmic processing for tasks that are not naturally parallel. Such "DSP-enabled" FPGAs have shown huge throughput advantages for certain types of signal processing, which has been reflected in their success in the high-end processing market. However, FPGAs are in general ill-suited to processing sequential, conditional data flow.

Figure 1. Block diagram of CommAgility DSP/FPGA module (AMC-D4F1).

Programming FPGAs remains difficult, usually requiring a hardware-oriented language such as Verilog or VHDL. FPGA solutions can take an order of magnitude longer to code than DSP solutions, which increases both development costs and time to market.

C-based synthesis tools have yet to deliver the ease of use and performance of C-coded processor solutions. High-level representations such as Simulink block diagram synthesis are not currently widely adopted and old FPGA synthesis methods still persist, especially where maximum performance is required.

Hybrid multi-processor systems
For design engineers, this neck-and-neck development of FPGA and DSP technology is opening up new and better solutions for signal processing applications. There is no simple answer as to whether FPGAs or DSPs are superior, and for many applications the best approach is a hybrid system, including both technologies to provide a solution that is superior to the sum of its parts.

Figure 1 shows a typical blade-level subsystem, which includes four Texas Instruments DSPs and one Xilinx FPGA. In addition to external memory interface (EMIF) connections from the DSPs to the FPGA to allow co-processing with minimal overhead, it has a full Serial RapidIO architecture, allowing its use for radio data distribution and as a low-latency direct memory access path between devices, both on and off-card.

The scalability of the Advanced Mezzanine Card (AMC) form factor extends across the whole chassis, especially when systems are built using Serial RapidIO as the primary data transport. In either the Advanced Telecom Computing Architecture (ATCA) or MicroTCA chassis system, integrators have the option of mixing and matching DSP-centric and FPGA-centric blades to get the right balance of technology.

To develop efficient hybrid systems, protocols such as Serial RapidIO and standards such as AMC enable designers and system integrators to manage the balance both at the blade and system level. Figure 2 illustrates a typical system.

At CommAgility, we have aimed to keep designers' options open by providing a scalable range of AdvancedMCs that include varying numbers of FPGAs and DSPs. This includes the AMC-D4F1 (with four TI TMS320C6455 DSPs and one Xilinx Virtex-4 FX series FPGA), and the planned AMC-D1F3, which provides one DSP and three FPGAs. This allows developers to vary the technology used depending on their overall processing requirements, stage of application development and optimization, and experience with the existing code base for DSPs and FPGAs.

Using Serial RapidIO both on card and in the chassis allows the various elements to be brought together; the AMC-D4F1 provides two high-speed 10Gbit/s links off the card, achieved using two 4x Serial RapidIO interfaces.
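The 10Gbit/s figure for a 4x port is consistent with four lanes running at 3.125Gbaud with 8b/10b line coding. The lane rate and coding are not stated in the article, so treat them as assumptions in this minimal sketch; packet and protocol overhead are ignored.

```python
def srio_data_rate_gbps(lanes, lane_gbaud):
    """Data rate of a serial link after 8b/10b coding (8 data bits per 10 line bits)."""
    return lanes * lane_gbaud * 8 / 10


# Assumption: each lane runs at 3.125 Gbaud with 8b/10b coding.
LANE_GBAUD = 3.125

per_port = srio_data_rate_gbps(lanes=4, lane_gbaud=LANE_GBAUD)
off_card_total = 2 * per_port  # two 4x ports, as on the AMC-D4F1

print(f"One 4x port:  {per_port:.1f} Gbit/s")
print(f"Two 4x ports: {off_card_total:.1f} Gbit/s")
```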

In MicroTCA, the imminent ratification of the AMC.4 specification for Serial RapidIO distribution from an AMC backplane will be the final piece in the system jigsaw. This has not prevented a Serial RapidIO AMC ecosystem from flourishing in preparation, with multiple vendors now providing Serial RapidIO support for both MicroTCA Carrier Hubs and control and signal processing AMC cards.

The importance of interconnects
Let's return to our earlier example of wireless baseband processing. A typical WiMAX baseband system today can have between 24 and 48 antenna streams per base station, with data rates upwards of 123Mbit/s per stream. This gives an overall total of 3-6Gbit/s of antenna data.
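The aggregate figure follows directly from the stream count and the per-stream rate, as this short Python check shows. It uses only the numbers quoted in the text; since 123Mbit/s is given as a lower bound, real totals may be higher.

```python
STREAM_RATE_MBPS = 123  # per-stream rate quoted in the text (a lower bound)


def aggregate_gbps(streams, rate_mbps=STREAM_RATE_MBPS):
    """Total antenna data rate in Gbit/s for a given number of streams."""
    return streams * rate_mbps / 1000.0


low = aggregate_gbps(24)
high = aggregate_gbps(48)

print(f"24 streams: {low:.3f} Gbit/s")
print(f"48 streams: {high:.3f} Gbit/s")
```

The results, roughly 2.95Gbit/s and 5.90Gbit/s, round to the 3-6Gbit/s range quoted above.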

When supporting multiple-input multiple-output (MIMO) systems with channels encoded using spread-spectrum techniques such as CDMA, data from all radio antennas has to be available to all baseband processing blocks. To achieve good performance, the key is an efficient low-latency interconnect.

Figure 2. Software view of a Serial RapidIO system using the RapidFET system management and analysis software.

Serial RapidIO is becoming increasingly popular in this kind of multi-antenna system, as it has a lower protocol overhead than Ethernet and, unlike PCIe, supports multiple masters. Serial RapidIO's multicast feature is also very important in distributed systems for this kind of application.

Serial RapidIO is also well-suited to the needs of other high-performance signal processing applications, including radar, imaging and signals intelligence. Here also multicasting can be a useful feature, for example in video processing applications such as IPTV servers, where data is sent to multiple DSPs.

FPGA solutions can suffer when accommodating external interfaces. The number of logic elements taken to implement a Serial RapidIO interface today runs to several thousand gates, which comes at a premium in comparison with the DSP's hardwired interfaces. This point is not lost on the FPGA vendors: for example, the Xilinx Virtex-5 introduces a hard-core PCIe interface. An elegant way to avoid this cost is to use an FPGA as a co-processor to a DSP, connected via the DSP's external memory interface bus, which allows data to be DMA'd to and from the FPGA at little cost in logic elements or DSP overhead.

Wireless baseband processing
To understand the implications for designers, we can look at a practical solution for the WiMAX case discussed above, and how it could be implemented on a DSP/FPGA multi-processor board. 3-6Gbit/s of antenna data is far too much to be processed on a DSP such as the C6455, so the antenna data processing needs to be handled by an ASIC or FPGA.

If we take the example of CommAgility's AMC-D4F1 (which includes four C6455 DSPs and one FPGA), the Xilinx FPGA takes on this antenna data processing role. The AMC-D4F1's Serial RapidIO connection between the on-card FPGA and the AdvancedMC fabric is ideally suited for transporting the antenna data from a radio card in a MicroTCA chassis to the AMC-D4F1 acting as the baseband processing card.

WiMAX user data, on the other hand, is approximately 19Mbit/s per user channel, and the C6455 DSPs can easily process multiple user channels. On the AMC-D4F1, three DSPs have a 32bit, 125MHz external memory interface connection to the FPGA, while one DSP has a 64bit interface. That's at least a 4Gbit/s interface, which is sufficient for each DSP to process over 100 user channels.
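The interface figure can be checked with a short calculation. This is a sketch of peak theoretical throughput only: it assumes one full-width transfer per clock cycle and ignores bus turnaround, arbitration and DMA setup overhead. The 64bit interface is assumed to run at the same 125MHz clock, which the text does not state.

```python
def emif_peak_gbps(width_bits, clock_mhz):
    """Peak throughput in Gbit/s, assuming one full-width transfer per clock."""
    return width_bits * clock_mhz / 1000.0


def max_user_channels(link_gbps, channel_mbps=19):
    """Whole user channels that fit in the link at the quoted per-channel rate."""
    return int(link_gbps * 1000 // channel_mbps)


narrow = emif_peak_gbps(32, 125)  # the three DSPs with a 32bit interface
wide = emif_peak_gbps(64, 125)    # assumption: the 64bit interface also runs at 125MHz

print(f"32bit EMIF: {narrow:.1f} Gbit/s, up to {max_user_channels(narrow)} channels")
print(f"64bit EMIF: {wide:.1f} Gbit/s, up to {max_user_channels(wide)} channels")
```

Even the narrower interface leaves roughly a factor of two of headroom over the 100 channels mentioned, so the DSP's processing capacity rather than the interface is likely to be the limit.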

The backplane Serial RapidIO connectivity of the AMC-D4F1 allows system integrators to deploy multiple cards to scale to the size of base station required. It also allows vendors to implement a pay-as-you-grow approach to base station deployment. This is an important consideration to minimize capital expenditure until users, and thence revenues, materialize.

Recent developments in FPGA technology redress many of the long-held preconceptions about their use, and have met many engineers' concerns about power, cost and complexity. Developing signal processing applications on FPGAs still requires significantly more effort than for DSPs, even with the high-level development tools and libraries available from FPGA vendors. Finding the right engineers with DSP and system-level experience to develop applications on FPGAs can also be tricky.

Independent benchmarks can be a valuable help in choosing the best device. For example, in 2007, BDTI published an analysis of FPGAs in DSP applications. The BDTI tests looked at cost/performance in a typical multi-channel communications application. The results are clear-cut, with the FPGA delivering a cost-per-channel figure more than 20x better than that of the DSP. This does not mean that FPGAs are necessarily best for high-performance signal processing applications, but it certainly demonstrates they can have clear performance advantages over DSPs in some circumstances.

Another important factor is the IP cores and software libraries geared at particular target applications, which are often provided by vendors. These can alleviate some of the reliance on in-house development of complex algorithms using the vendor tools and further reduce time-to-market.

The key advantages of the DSP are reduced development time for new and complex algorithms, and flexibility to run many different algorithms. For the FPGA, its number one benefit is efficiency gains from parallel processing. In many applications, such as image processing and wireless baseband processing, there is a mixture of these repetitive, simple processing tasks that are best suited to an FPGA and more complex and less predictable tasks that are perhaps better handled with a DSP. Additionally, as parallel processing blocks implemented in FPGAs become mainstream, they are increasingly likely to be integrated into the DSP vendors' silicon.

Overall, this means that a hybrid system containing DSPs and FPGAs can often provide the best solution for high-performance multi-processing applications, allowing each device to play to its strengths. The key to this particular debate is to look at FPGAs and DSPs as complementary technologies, rather than competition for each other.

About the authors
Edward Young is managing director and co-founder of CommAgility, a UK developer of embedded products for high-performance signal processing applications. Paul Moakes is technical director and co-founder at CommAgility, with responsibility for software design and development.
