EE Times-Asia

How to optimise floating point calculations on FPGAs

Posted: 15 Jan 2015

Keywords: floating point, GPUs, FPGAs, OpenCL, GFLOPs

High-performance floating point processing has long been associated with high-performance CPUs. In the last few years, GPUs have also become powerful floating point processing platforms, moving beyond graphics to become known as General Purpose Graphics Processing Units (GP-GPUs). A more recent development is FPGA-based floating point processing in demanding applications. This paper focuses on FPGAs, their floating point performance and design flows, and the use of OpenCL, a leading programming language for high-performance floating point calculations.

GFLOP/s ratings of various processing platforms have been increasing, and the term TFLOP/s is now commonly used. However, the peak GFLOP/s or TFLOP/s rating often provides little information on the performance of a given device in a particular platform. Rather, it indicates the theoretical total number of floating point additions or multiplications that can be performed per second. By this measure, FPGAs can exceed 1 TFLOP/s [1] of single precision floating point processing.

A common algorithm of moderate complexity is the FFT. A 4096-point FFT has been implemented using single precision floating point; it can input and output four complex samples per clock cycle. Each FFT core can run at over 80 GFLOP/s, and a large FPGA has the resources to implement seven such cores.
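To make the per-core figure concrete, the following minimal sketch estimates the throughput of a streaming FFT core. It assumes the widely used 5·N·log2(N) convention for counting the floating point operations in a complex FFT; the article does not state which convention was used, and the 350 MHz clock rate is a hypothetical value chosen purely for illustration.

```python
import math

def fft_flops(n: int) -> float:
    """Nominal FLOPs for one complex n-point FFT.

    Assumption: the common 5 * n * log2(n) counting convention.
    """
    return 5 * n * math.log2(n)

def core_gflops(n: int, samples_per_clock: int, fmax_mhz: float) -> float:
    """Sustained GFLOP/s of a streaming FFT core.

    The core accepts `samples_per_clock` complex samples every cycle,
    so one n-point transform occupies n / samples_per_clock cycles.
    """
    clocks_per_fft = n / samples_per_clock
    ffts_per_second = fmax_mhz * 1e6 / clocks_per_fft
    return fft_flops(n) * ffts_per_second / 1e9

N = 4096
SAMPLES_PER_CLOCK = 4            # from the text
flops_per_clock = fft_flops(N) / (N / SAMPLES_PER_CLOCK)

# Hypothetical clock rate, not taken from the article.
example_gflops = core_gflops(N, SAMPLES_PER_CLOCK, fmax_mhz=350.0)

# Clock rate at which this counting convention yields 80 GFLOP/s.
min_fmax_mhz = 80e9 / flops_per_clock / 1e6

print(f"FLOPs per clock: {flops_per_clock:.0f}")
print(f"GFLOP/s at 350 MHz: {example_gflops:.1f}")
print(f"Fmax for 80 GFLOP/s: {min_fmax_mhz:.1f} MHz")
```

Under this convention a four-sample-per-clock, 4096-point core performs 240 nominal FLOPs per cycle, so a rating above 80 GFLOP/s would correspond to a clock somewhat above 333 MHz. A different counting convention would shift these numbers.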

However, as figure 1 indicates, the seven-core FFT design on this FPGA achieves nearly 400 GFLOP/s rather than seven times the single-core figure, because the seven-core design runs at a lower Fmax than the single-core design. This is a "push button" OpenCL compile result, with no FPGA expertise required. Using LogicLock and DSE optimisations, the seven-core design can approach the Fmax of the single-core design, boosting performance to over 500 GFLOP/s, at over 10 GFLOP/s per Watt.
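The relationship between these figures can be checked with simple arithmetic. The sketch below takes the rounded numbers quoted in the text (80, 400, 500 GFLOP/s and 10 GFLOP/s per Watt) at face value; since the article gives them only as "over" or "nearly", the derived ratios and the power figure are rough indications under that assumption, not measured results.

```python
CORES = 7
PER_CORE_GFLOPS = 80.0       # "over 80 GFLOP/s" per core, from the text
PUSH_BUTTON_GFLOPS = 400.0   # "nearly 400", push-button OpenCL compile
OPTIMISED_GFLOPS = 500.0     # "over 500", after floorplanning and seed sweeps
GFLOPS_PER_WATT = 10.0       # "over 10 GFLOP/s per Watt"

def scaling_efficiency(achieved: float, cores: int, per_core: float) -> float:
    """Fraction of ideal linear scaling that the multi-core design reaches."""
    return achieved / (cores * per_core)

def implied_power_watts(gflops: float, gflops_per_watt: float) -> float:
    """Power implied by a throughput and an efficiency figure."""
    return gflops / gflops_per_watt

ideal = CORES * PER_CORE_GFLOPS
push_button_eff = scaling_efficiency(PUSH_BUTTON_GFLOPS, CORES, PER_CORE_GFLOPS)
optimised_eff = scaling_efficiency(OPTIMISED_GFLOPS, CORES, PER_CORE_GFLOPS)
power = implied_power_watts(OPTIMISED_GFLOPS, GFLOPS_PER_WATT)

print(f"Ideal seven-core throughput: {ideal:.0f} GFLOP/s")
print(f"Push-button scaling efficiency: {push_button_eff:.0%}")
print(f"Optimised scaling efficiency: {optimised_eff:.0%}")
print(f"Implied power at the quoted floors: about {power:.0f} W")
```

Because each core processes a fixed number of samples per clock, throughput scales directly with clock rate, so the scaling efficiency here is also a rough proxy for the ratio of multi-core Fmax to single-core Fmax.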

This GFLOP/s per Watt figure is much higher than the power efficiency achievable with a CPU or GPU. As for GPU comparisons, the GPU is not efficient at FFTs of this length, so no benchmarks are presented. The GPU becomes efficient at FFT lengths of several hundred thousand points, at which point it can provide useful acceleration to a CPU.

Figure 1: Floating point FFT performance of an Altera Stratix V 5SGSD8 FPGA.

To summarise, the useful GFLOP/s is often a fraction of the peak or theoretical GFLOP/s. For this reason, a more useful approach is to compare performance on an algorithm that reasonably represents the characteristics of typical applications. The more complex the algorithm, the more representative the benchmark is of a typical real application.

Third-party benchmarking
Rather than rely upon vendors' peak GFLOP/s ratings to drive processing technology decisions, an alternative is to rely upon third-party evaluations using examples of representative complexity. An ideal algorithm for benchmarking high-performance computing is the Cholesky decomposition.

This algorithm is commonly used in linear algebra to solve systems of multiple equations efficiently, and can be used to perform matrix inversion. It has high complexity, and almost always requires floating point numerical representation for reasonable results. The computations required are proportional to N^3, where N is the matrix dimension, so the processing requirements are often demanding. The actual GFLOP/s will depend on both the matrix size and the required matrix processing throughput.
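For readers unfamiliar with the algorithm, the following is a minimal pure-Python sketch of a textbook Cholesky decomposition (the row-by-row Cholesky-Banachiewicz form). It is for illustration only and is not the implementation benchmarked in this article; the N^3/3 operation estimate is the usual leading-order approximation, and the 3x3 test matrix is a standard textbook example.

```python
import math

def cholesky(a):
    """Return lower-triangular L such that L * L^T equals a.

    `a` must be a symmetric positive-definite matrix given as a
    list of rows. The three nested loops (i, j and the sum over k)
    are the source of the cubic operation count.
    """
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Dot product of the already-computed parts of rows i and j.
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                lower[i][j] = math.sqrt(d)
            else:
                lower[i][j] = (a[i][j] - s) / lower[j][j]
    return lower

def cholesky_flops(n):
    """Leading-order FLOP estimate, n^3 / 3 (lower-order terms ignored)."""
    return n ** 3 / 3

A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]

L = cholesky(A)
for row in L:
    print(row)

# Doubling the matrix dimension multiplies the work by about eight.
print(cholesky_flops(1024) / cholesky_flops(512))
```

The square root and division in the inner steps are one reason the algorithm is sensitive to numerical range, which is consistent with the need for floating point noted above.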

Results based on an Nvidia GPU rated at 1.35 TFLOP/s, using various libraries, are shown in table 1, together with results for a Xilinx Virtex6 XC6VSX475T, an FPGA optimised for DSP processing with a density of 475K LCs. This is a similar density to that of the Altera FPGA used for the Cholesky benchmarks.

LAPACK and MAGMA are commercially supplied libraries, while the GPU GFLOP/s figure refers to the OpenCL implementation developed at the University of Tennessee. The latter is clearly more optimised at smaller matrix sizes.

Table 1: GPU and Xilinx FPGA Cholesky Benchmarks from Univ. of Tennessee [2]
