How to optimise floating point calculations on FPGAs
Keywords: floating point, GPUs, FPGAs, OpenCL, GFLOPs
A mid-size Altera Stratix V FPGA (460 kLE) was also benchmarked, using the Cholesky algorithm in single precision floating point. As seen in Table 2, the Stratix V FPGA performance on the Cholesky algorithm is much higher than the Xilinx results.
It should be noted that the matrix sizes are not the same. The University of Tennessee results start at matrix sizes of [512x512]. The BDTI benchmarks go up to [360x360] matrix sizes. The reason for this is that GPUs are very inefficient at smaller matrix sizes, so there is little incentive to use them to accelerate a CPU in these cases. FPGAs can operate efficiently with much smaller matrices.
Table 2: Altera FPGA Cholesky and QR Benchmarks from BDTI [3]
Secondly, the BDTI benchmarks are per Cholesky core. Each parameterizable Cholesky core allows selection of matrix size, vector size, and channel count. The vector size roughly determines the FPGA resources used. The larger [360x360] matrix size uses a larger vector size, so only a single core fits in this FPGA, delivering 91 GFLOP/s. The smaller [60x60] matrix uses fewer resources, so two cores can be implemented, for a total of 2x39 = 78 GFLOP/s. The smallest [30x30] matrix size permits three cores, for a total of 3x26 = 78 GFLOP/s.
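The per-core arithmetic above can be tallied in a few lines. The following is a minimal Python sketch, not part of the BDTI benchmark: the dictionary layout and the name `aggregate_gflops` are illustrative assumptions, and the numbers are simply those quoted in the text.

```python
# Per-core Cholesky figures quoted in the text for the Stratix V benchmark:
# matrix size -> (cores that fit in the device, GFLOP/s per core).
CHOLESKY_CORES = {
    360: (1, 91),
    60: (2, 39),
    30: (3, 26),
}


def aggregate_gflops(cores: int, gflops_per_core: float) -> float:
    """Total throughput, assuming the cores run independently in parallel."""
    return cores * gflops_per_core


if __name__ == "__main__":
    for n, (cores, per_core) in CHOLESKY_CORES.items():
        total = aggregate_gflops(cores, per_core)
        print(f"[{n}x{n}]: {cores} core(s) x {per_core} = {total} GFLOP/s")
```

The sketch assumes that multiple cores do not contend for shared resources, which is the same assumption behind the simple multiplication in the text.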
FPGAs seem to be much better suited to problems with smaller data sizes. One explanation is that the computational load increases as N^3 while data I/O increases only as N^2, so the I/O bottlenecks of the GPU become less important as the dataset grows. The other consideration is throughput. As matrix sizes increase, throughput in matrices per second drops dramatically because of the increased processing per matrix. At some point, the throughput becomes so low as to be unusable in many applications. In many cases, large matrices can be tiled and the smaller sub-matrices processed individually, to address the throughput limitations caused by the sheer processing load.
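The scaling argument can be made concrete with a rough model. The sketch below assumes the textbook operation count of about N^3/3 floating point operations for a Cholesky factorisation and N^2 matrix elements transferred. Both constants and all function names are illustrative assumptions, not figures taken from the benchmarks.

```python
def cholesky_flops(n: int) -> float:
    """Approximate operation count for an n x n Cholesky factorisation (~n^3 / 3)."""
    return n ** 3 / 3


def matrix_elements(n: int) -> int:
    """Elements moved when a full n x n matrix is transferred (assumption)."""
    return n * n


def flops_per_element(n: int) -> float:
    """Work done per element transferred; grows linearly with n."""
    return cholesky_flops(n) / matrix_elements(n)


def matrices_per_second(n: int, sustained_gflops: float) -> float:
    """Matrices factorised per second at a given sustained compute rate."""
    return sustained_gflops * 1e9 / cholesky_flops(n)


if __name__ == "__main__":
    for n in (30, 60, 360, 512, 4096):
        print(f"N={n:5d}  flops per element = {flops_per_element(n):8.1f}")
```

In this model, doubling N doubles the work per element transferred, which is why I/O matters less at large sizes, but it also cuts matrices per second by a factor of eight at a fixed sustained rate.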
For FFTs, the computational load increases as N log2 N, whereas the data I/O increases as N. Again, at very large data sizes, the GPU becomes an efficient computational engine. By contrast, the FPGA is an efficient computational engine at much smaller data sizes. FPGAs are therefore better suited to the many applications where FFT sizes are in the thousands, and GPUs to those where FFT sizes are in the hundreds of thousands.
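The same kind of rough model applies to the FFT case. This Python sketch takes the N log2 N operation count and N samples of I/O from the text and ignores constant factors; the function names are illustrative assumptions.

```python
import math


def fft_ops(n: int) -> float:
    """Operation count model for an n-point FFT, up to a constant factor."""
    return n * math.log2(n)


def fft_ops_per_sample(n: int) -> float:
    """Work per sample transferred; grows only as log2(n)."""
    return fft_ops(n) / n


if __name__ == "__main__":
    for n in (1024, 4096, 131072, 1048576):
        print(f"N={n:8d}  ops per sample = {fft_ops_per_sample(n):5.1f}")
```

Under this model the work per sample grows slowly: a 1,024-point FFT does 10 units of work per sample and a 131,072-point FFT does 17. That is consistent with FFT sizes having to become very large before computation outweighs data I/O.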
GPU and FPGA design methodology
GPUs are programmed using either Nvidia's proprietary CUDA language or the open-standard OpenCL language. These languages are very similar in capability, with the biggest difference being that CUDA can only be used on Nvidia GPUs.
FPGAs are typically programmed using the hardware description languages (HDLs) Verilog or VHDL. Neither language is well suited to floating point designs, although the latest versions do support the definition, though not necessarily the synthesis, of floating point numbers. For example, in SystemVerilog, a shortreal variable is analogous to an IEEE single (float), and a real to an IEEE double.
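To illustrate the difference between the IEEE single and double formats mentioned above, the following Python sketch rounds a value to single precision and back using the standard `struct` module. It assumes, as is the case on mainstream platforms, that a Python float is an IEEE double; the helper name `to_single` is hypothetical.

```python
import struct


def to_single(x: float) -> float:
    """Round a Python float (IEEE double) to IEEE single precision and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


if __name__ == "__main__":
    # Storage sizes in bytes for the two formats.
    print("single:", struct.calcsize("<f"), "bytes")
    print("double:", struct.calcsize("<d"), "bytes")

    # 0.5 is exactly representable in both formats; 0.1 is not, and the
    # single precision rounding error is visible once widened back to double.
    for value in (0.5, 0.1):
        print(value, "->", repr(to_single(value)))
```

The single format stores a value in 32 bits and the double format in 64 bits, so a datapath built for single precision trades accuracy for fewer resources.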
Synthesis of floating point datapaths into an FPGA is very inefficient using traditional methods. The low performance of the Xilinx FPGAs on the Cholesky algorithm, implemented using Xilinx Floating Point Core Gen functions, substantiates this. However, Altera offers two alternatives. The first is a MathWorks-based design entry tool, known as DSPBuilder Advanced Blockset. This tool supports both fixed and floating point numbers. It supports seven different floating point precisions, including IEEE half, single and double precision. It also supports vectorisation, which is needed for efficient implementation of linear algebra. The key, however, is its ability to map floating point circuits efficiently onto today's fixed point FPGA architectures, as demonstrated by the benchmarks showing close to 100 GFLOP/s on the Cholesky algorithm in a mid-size 28 nm FPGA [3]. By comparison, the Cholesky implementation on a similar-density Xilinx FPGA without this synthesis capability shows only 20 GFLOP/s on the same algorithm [2].
OpenCL for FPGAs
OpenCL is familiar to GPU programmers. An OpenCL Compiler [5] for FPGAs means that OpenCL code written for AMD or Nvidia GPUs can be compiled onto an FPGA. The OpenCL Compiler from Altera therefore enables GPU programs to use FPGAs without the programmer having to develop the typical FPGA design skill set.