EE Times-Asia

Peak floating-point performance calculations

Posted: 10 Nov 2014

Keywords: DSPs, GPUs, FPGAs, fast Fourier transforms, GFLOPs

DSPs, GPUs, and FPGAs act as accelerators for many CPUs, providing both performance and power-efficiency benefits. Given the variety of computing architectures available, designers need a uniform method to compare performance and power efficiency. The accepted method is to measure floating-point operations per second (FLOPs), where a FLOP is defined as either an addition or a multiplication of single-precision (32 bit) or double-precision (64 bit) numbers in conformance with the IEEE 754 standard. All higher-order functions, such as division, square root, and trigonometric operators, can be constructed from adders and multipliers. Since these operators, as well as other common functions such as fast Fourier transforms (FFTs) and matrix operations, require both adders and multipliers, there is commonly a 1:1 ratio of adders to multipliers in all of these architectures.

Let's look at how to compare the performance of the DSP, GPU, and FPGA architectures based on their peak FLOPs ratings. The peak FLOPs rating is determined by multiplying the total number of adders and multipliers by the maximum operating frequency. This represents a theoretical limit that can never be achieved in practice, since it is generally not possible to implement a useful algorithm that keeps every computational unit occupied on every clock cycle. It does, however, provide a useful comparison metric.

First, we consider DSP GFLOPs performance. As an example device, take Texas Instruments' TMS320C667x DSP. This DSP contains eight cores, each with two processing sub-systems, and each sub-system contains four single-precision floating-point adders and four single-precision floating-point multipliers, for a total of 64 adders and 64 multipliers. The fastest version available runs at 1.25GHz, providing a peak of 160 GigaFLOPs (GFLOPs).
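The DSP arithmetic above can be verified with a quick sketch; the core, sub-system, and unit counts are taken directly from the text:

```python
# Peak GFLOPs for the TMS320C667x, using the counts given in the text.
CORES = 8
SUBSYSTEMS_PER_CORE = 2
ADDERS_PER_SUBSYSTEM = 4
MULTIPLIERS_PER_SUBSYSTEM = 4
CLOCK_GHZ = 1.25  # fastest available speed grade

# Each adder or multiplier contributes one FLOP per clock cycle.
fp_units = CORES * SUBSYSTEMS_PER_CORE * (ADDERS_PER_SUBSYSTEM + MULTIPLIERS_PER_SUBSYSTEM)
peak_gflops = fp_units * CLOCK_GHZ

print(fp_units, peak_gflops)  # 128 units -> 160.0 GFLOPs
```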

Figure 1: TMS320C667x DSP architecture.

GPUs have become very popular devices, particularly for image-processing applications. One of the most powerful GPUs is the Nvidia Tesla K20. This GPU is built from CUDA cores, each with a single floating-point multiply-adder unit that can execute one fused operation per clock cycle in single-precision configuration. There are 192 CUDA cores in each Streaming Multiprocessor (SMX) processing engine. The K20 actually contains 15 SMX engines, although only 13 are enabled (due to process yield issues, for example). This gives a total of 2,496 available CUDA cores, each delivering two FLOPs per clock cycle, running at a maximum of 706MHz. This provides a peak single-precision floating-point performance of 3,520 GFLOPs.
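The same calculation for the K20, again using the counts from the text, yields about 3,524 GFLOPs; the 3,520 figure quoted above is the rounded vendor specification:

```python
# Peak single-precision GFLOPs for the Tesla K20, per the counts in the text.
SMX_ENABLED = 13
CUDA_CORES_PER_SMX = 192
FLOPS_PER_CORE_PER_CYCLE = 2  # a fused multiply-add counts as two FLOPs
CLOCK_GHZ = 0.706

cuda_cores = SMX_ENABLED * CUDA_CORES_PER_SMX
peak_gflops = cuda_cores * FLOPS_PER_CORE_PER_CYCLE * CLOCK_GHZ

print(cuda_cores, peak_gflops)  # 2496 cores -> ~3524 GFLOPs (quoted as 3,520)
```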

Figure 2: GP-GPU architecture.

FPGA vendors such as Altera now offer hardened floating-point engines in their FPGAs. A single-precision floating-point multiplier and adder have been incorporated into the hard DSP blocks embedded throughout the programmable logic structures. A medium-sized FPGA in Altera's midrange Arria 10 family is the 10AX066. This device has 1,688 DSP blocks, each of which can perform two FLOPs per clock cycle, resulting in 3,376 FLOPs per clock cycle. At a rated speed of 450MHz for floating point (the fixed-point modes are rated higher), this provides 1,520 GFLOPs. Computed in a similar fashion, Altera states that 10,000 GFLOPs, or 10 TeraFLOPs, of single-precision performance will be available in the high-end Stratix 10 FPGAs, achieved through a combination of clock-rate increases and larger devices with much more DSP computing resource.
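The FPGA figure follows the same pattern; the exact product is about 1,519 GFLOPs, which the text rounds to 1,520 (block count and clock rate are taken from the text):

```python
# Peak single-precision GFLOPs for the Arria 10 10AX066, per the text.
DSP_BLOCKS = 1688
FLOPS_PER_BLOCK_PER_CYCLE = 2  # one hardened multiplier plus one adder per block
CLOCK_GHZ = 0.450  # floating-point rated speed; fixed-point modes run faster

flops_per_cycle = DSP_BLOCKS * FLOPS_PER_BLOCK_PER_CYCLE
peak_gflops = flops_per_cycle * CLOCK_GHZ

print(flops_per_cycle, peak_gflops)  # 3376 FLOPs/cycle -> ~1519 GFLOPs (quoted as 1,520)
```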
