How to optimise floating point calculations on FPGAs
Keywords: floating point, GPUs, FPGAs, OpenCL, GFLOPs
GFLOP/s ratings of processing platforms have been rising steadily, and the term TFLOP/s is now in common use. However, a peak GFLOP/s or TFLOP/s figure often says little about how a given device performs in a particular application; it simply states the theoretical number of floating point additions or multiplications that can be performed per second. On that basis, FPGAs can exceed 1 TFLOP/s [1] of single precision floating point processing.
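To make the peak number concrete, the sketch below shows how such a rating is typically derived: the number of hardware floating point operators multiplied by the clock frequency, with each add or multiply counted as one FLOP. The operator count and clock rate used here are illustrative assumptions, not figures from the cited analysis.

```python
# Illustrative only: how a theoretical peak GFLOP/s rating is derived.
# The operator count and clock frequency below are assumed values,
# not vendor-published figures.

def peak_gflops(num_fp_operators: int, clock_mhz: float) -> float:
    """Peak rating: every operator completes one FLOP (add or multiply) per cycle."""
    return num_fp_operators * clock_mhz * 1e6 / 1e9

# e.g. 4000 single precision adders/multipliers running at 250 MHz
print(peak_gflops(4000, 250.0))  # 1000.0 GFLOP/s, i.e. 1 TFLOP/s
```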
A common algorithm of moderate complexity is the FFT. A 4096-point, single precision floating point FFT was implemented that inputs and outputs four complex samples per clock cycle. Each FFT core sustains over 80 GFLOP/s, and a large FPGA has the resources to implement seven such cores.
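As a rough cross-check of the per-core figure, the sketch below estimates sustained FFT throughput assuming the conventional 5·N·log2(N) FLOP count for an N-point complex FFT and a clock of about 340 MHz; both are assumptions for illustration and are not stated in the article.

```python
import math

# Rough estimate of sustained FFT throughput for the core described above.
# Assumptions (not from the article): the conventional 5*N*log2(N) FLOP count
# for an N-point complex FFT, and a clock frequency of about 340 MHz.

def fft_gflops(n_points: int, samples_per_cycle: int, clock_mhz: float) -> float:
    flops_per_fft = 5 * n_points * math.log2(n_points)  # ~5*N*log2(N) FLOPs
    cycles_per_fft = n_points / samples_per_cycle        # 4 complex samples in/out per cycle
    ffts_per_second = clock_mhz * 1e6 / cycles_per_fft
    return flops_per_fft * ffts_per_second / 1e9

# 4096-point FFT, four complex samples per clock cycle
print(fft_gflops(4096, 4, 340.0))  # ~81.6, consistent with "over 80 GFLOP/s" per core
```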
However, as figure 1 shows, the aggregate throughput of the FFT algorithm on this FPGA is nearly 400 GFLOP/s. This is a "push button" OpenCL compile result, with no FPGA expertise required. Using LogicLock and Design Space Explorer (DSE) optimisations, the seven-core design can approach the Fmax of the single-core design, boosting throughput to over 500 GFLOP/s, at over 10 GFLOP/s per Watt.
This GFLOP/s per Watt is far higher than any achievable CPU or GPU power efficiency. No GPU benchmarks are presented here because GPUs are not efficient at FFTs of this length; they only become efficient at FFT lengths of several hundred thousand points, at which point they can provide useful acceleration to a CPU.
Figure 1: Altera Stratix V 5SGSD8 FPGA floating point FFT performance.
To summarise, the useful GFLOP/s is often a fraction of the peak or theoretical GFLOP/s. For this reason, a more useful approach is to compare performance on an algorithm that can reasonably represent the characteristics of typical applications. The more complex the algorithm, the more representative of a typical real application the benchmark is.
Third-party benchmarking
Rather than relying on vendors' peak GFLOP/s ratings to drive processing technology decisions, an alternative is to rely on third-party evaluations using examples of representative complexity. An ideal benchmark algorithm for high-performance computing is the Cholesky decomposition.
This algorithm is commonly used in linear algebra to solve systems of equations efficiently, and can also be used for matrix inversion. It is computationally complex and almost always requires floating point representation for reasonable results. The computation required is proportional to N³, where N is the matrix dimension, so the processing demands are often substantial. The GFLOP/s actually achieved depends on both the matrix size and the required matrix processing throughput.
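For reference, a minimal plain-Python sketch of the Cholesky factorisation is shown below. It is not the benchmarked GPU or FPGA implementation, but it illustrates why floating point is effectively required (square roots and divides) and where the order-N³ operation count comes from.

```python
import math

def cholesky(A):
    """Return lower-triangular L with A = L * L^T for a symmetric positive-definite A.
    Reference sketch only: roughly N^3/3 multiply-accumulates, plus N square roots
    and N*(N-1)/2 divides, which is why floating point is essentially mandatory."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)   # diagonal terms need a square root
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]  # off-diagonal terms need a divide
    return L

# Small example: factor a 2x2 symmetric positive-definite matrix
print(cholesky([[4.0, 2.0], [2.0, 3.0]]))  # [[2.0, 0.0], [1.0, ~1.414]]
```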
Table 1 shows results based on an Nvidia GPU rated at 1.35 TFLOP/s, using various libraries, as well as a Xilinx Virtex-6 XC6VSX475T, an FPGA optimised for DSP processing with a density of 475K logic cells (LCs). This is a similar density to the Altera FPGA used for the Cholesky benchmarks.
LAPACK and MAGMA are commercially supplied libraries, while the GPU GFLOP/s column refers to the OpenCL implementation developed at the University of Tennessee. The latter is clearly better optimised for smaller matrix sizes.
Table 1: GPU and Xilinx FPGA Cholesky benchmarks from the University of Tennessee [2].