EE Times-Asia

Peak floating-point performance calculations

Posted: 10 Nov 2014

Keywords: DSPs, GPUs, FPGAs, fast Fourier transforms, GFLOPs

Floating-point routines have always been available in FPGAs through the programmable logic. Furthermore, programmable logic-based floating point allows an arbitrary precision level to be implemented, one that is not restricted to industry-standard single and double precision. For example, Altera offers seven different levels of floating-point precision. However, determining the peak floating-point performance of a given FPGA using a programmable logic implementation is not at all straightforward.

Therefore, the peak floating-point rating of any Altera FPGA is based solely on the capability of the hardened floating-point engines, and assumes that the programmable logic is not used for floating point, but rather for the other parts of a design, such as the data control and scheduling circuits, I/O interfaces, internal and external memory interfaces, and other required functionality.

Several factors make it very difficult to calculate floating-point performance for a programmable logic implementation. The amount of logic needed to build one single-precision floating-point multiplier and adder can be determined by consulting the FPGA vendor's floating-point intellectual property (IP) user guide. However, one vital piece of information is not reported in the user guide: the routing resources required. Implementing floating point requires large barrel shifters, which consume tremendous amounts of the programmable routing (the interconnect between the programmable logic elements). All FPGAs have a fixed amount of interconnect to support the logic, sized for what a typical fixed-point FPGA design will use.

Unfortunately, floating point requires far more of this interconnect than most fixed-point designs do. When a single instance of a floating-point function is created, it can draw upon routing resources in the general region of the logic elements used.
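To illustrate why a floating-point adder needs a barrel shifter, the following minimal Python sketch models only the mantissa alignment step of an addition. It is an illustration, not Altera's implementation: the function name `align_mantissas` and the integer mantissa representation are assumptions made for this example, and rounding, signs, and the normalization shift that follows the addition are left out.

```python
def align_mantissas(exp_a, man_a, exp_b, man_b):
    """Align two (exponent, integer mantissa) operands before addition.

    The operand with the smaller exponent is shifted right by the exponent
    difference. The shift distance depends on the data, so hardware cannot
    use fixed wiring and needs a barrel shifter instead.
    Returns (common exponent, aligned mantissa a, aligned mantissa b, shift).
    """
    if exp_a >= exp_b:
        shift = exp_a - exp_b
        return exp_a, man_a, man_b >> shift, shift
    shift = exp_b - exp_a
    return exp_b, man_a >> shift, man_b, shift


# Single precision carries a 24-bit significand (23 stored bits plus the
# hidden leading one), so 1.0 is represented here as 1 << 23.
ONE = 1 << 23

# Add 1.0 * 2**5 (32.0) and 1.0 * 2**0 (1.0): the second operand must be
# shifted right by 5 bit positions before the mantissas can be summed.
exp, man_a, man_b, shift = align_mantissas(5, ONE, 0, ONE)
total = man_a + man_b
value = total / ONE * 2 ** exp  # back to a plain number for checking
print(exp, shift, value)
```

Because the shift distance can be anywhere from zero to the full mantissa width, every mantissa bit must be routable to many other bit positions, which is where the heavy interconnect demand comes from.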

Figure 3: Floating-point DSP block architecture in FPGAs.

However, when large numbers of floating-point operators are packed together, the result is routing congestion. This congestion sharply reduces the achievable design clock rate and drives logic usage much higher than in a comparable fixed-point FPGA design. Altera has a proprietary synthesis technique known as "fused datapath," which mitigates this to some extent, allowing very large floating-point designs to be implemented in the logic fabric while leveraging fixed-point 27x27 multipliers for single precision, or 54x54 for double precision.

In addition, the FPGA logic cannot be fully used. As a design takes up a larger percentage of the available logic resources, the clock rate (fMAX) at which timing closure can be achieved falls, and eventually timing closure cannot be achieved at all. Typically, 70% to 90% of the logic can actually be used, and with dense floating-point designs the figure tends to be at the lower end of this range.
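A rough sketch shows how sensitive a paper estimate for logic-based floating point is to these factors. Every number below is hypothetical and chosen only for illustration: the device size, the logic cost of one multiplier-adder pair, and the clock rates are not taken from any vendor user guide. The sketch also assumes each multiplier-adder pair completes two floating-point operations per clock, and it does not model routing at all, which is the very quantity the user guide does not report.

```python
def logic_fp_estimate_gflops(total_logic, logic_per_mult_add,
                             usable_percent, fmax_mhz):
    """Naive GFLOPs estimate for floating point built from programmable logic.

    total_logic        -- logic elements in the device (hypothetical)
    logic_per_mult_add -- logic elements for one multiplier plus one adder
    usable_percent     -- share of logic that can be used (70 to 90 per the text)
    fmax_mhz           -- clock rate assumed to close timing
    """
    usable_logic = total_logic * usable_percent // 100
    pairs = usable_logic // logic_per_mult_add
    # Assumption: one multiply and one add per pair per clock = 2 FLOPs.
    return pairs * 2 * fmax_mhz / 1000.0


# Optimistic case: 90% of the logic used, clock rate unaffected by congestion.
optimistic = logic_fp_estimate_gflops(500_000, 1_000, 90, 400)

# Pessimistic case: 70% of the logic used, clock rate reduced by congestion.
pessimistic = logic_fp_estimate_gflops(500_000, 1_000, 70, 250)

print(optimistic, pessimistic)
```

With these made-up inputs the two cases differ by roughly a factor of two, and neither leaves any logic for the rest of the design, which suggests why a calculated figure is a poor substitute for a benchmark that has been through timing closure.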

For all of the above reasons, it is nearly impossible to calculate the floating-point capacity of an FPGA when implemented in programmable logic. Instead, the best method is to build benchmark floating-point designs, which include the timing closure process. Alternatively, the FPGA vendor can supply such designs, which would greatly aid in estimating what is possible in a given FPGA.

For example, Altera provides benchmark designs on 28nm FPGAs, which cover basic as well as complex floating-point designs. The published results show that with 28nm FPGAs, several hundred GFLOPs can be achieved for simpler algorithms such as FFTs, and just over 100 GFLOPs for complex algorithms such as QR and Cholesky decomposition.
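As a sketch of how a benchmark GFLOPs figure for an FFT can be derived from measured throughput, the snippet below uses a common benchmarking convention that credits a complex N-point FFT with 5 N log2(N) floating-point operations. This convention is an assumption here; the article does not say how the published results were counted, and the throughput figure in the example is hypothetical.

```python
import math


def fft_gflops(n_points, transforms_per_second):
    """Sustained GFLOPs of an FFT benchmark under the 5*N*log2(N) convention.

    n_points              -- transform length N (complex points)
    transforms_per_second -- measured throughput of the implemented design
    """
    flops_per_fft = 5 * n_points * math.log2(n_points)
    return flops_per_fft * transforms_per_second / 1e9


# Hypothetical design: one million 4096-point FFTs per second.
sustained = fft_gflops(4096, 1_000_000)
print(sustained)
```

Since the operation count is nominal, such figures are best compared only between designs that use the same convention.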

FPGAs with hardened floating-point DSP blocks are now available, and provide single-precision performance from 160 to 1,500 GFLOPs in midrange devices, and up to 10,000 GFLOPs in high-end devices such as Altera's Stratix devices. These peak GFLOPs metrics are computed based on the same transparent methodology used on CPUs, GPUs, and DSPs.
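A peak calculation of this kind usually takes the form of a simple product: the number of floating-point units, times the operations each unit completes per clock, times the clock rate. The sketch below assumes that form and assumes each hardened DSP block performs one multiply and one add per clock. The block count and clock rate are hypothetical and do not describe any specific device.

```python
def peak_gflops(num_fp_units, flops_per_unit_per_cycle, clock_mhz):
    """Peak GFLOPs = units x FLOPs per unit per clock x clock rate.

    This is a theoretical ceiling: it assumes every unit is busy on every
    clock cycle, which real designs rarely sustain.
    """
    return num_fp_units * flops_per_unit_per_cycle * clock_mhz / 1000.0


# Hypothetical FPGA: 1,500 hardened DSP blocks, each doing one multiply and
# one add per clock (2 FLOPs), running at 450 MHz.
fpga_peak = peak_gflops(1_500, 2, 450)
print(fpga_peak)
```

The same function applies to a CPU, GPU, or DSP by substituting its number of floating-point units and operations per cycle, which is what makes the baseline comparison across architectures possible.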

This methodology provides designers a reliable technique for the baseline comparison of the peak floating-point computing capabilities of devices with very different architectures. The next level of comparison should be based on representative benchmark designs implemented on the platform of interest. For FPGAs lacking hard floating-point circuits, using the vendor-calculated theoretical GFLOPs numbers is quite unreliable. Any FPGA floating-point claims based on a logic implementation at over 500 GFLOPs should be viewed with a high level of scepticism. In this case, a representative benchmark design implementation is essential to make a comparative judgement. The FPGA compilation report showing logic, memory, and other resources, along with the achieved clock rate, should also be provided.

About the author
Michael Parker is Principal DSP Product Planning Manager at Altera Corp.
