How to optimise floating point calculations on FPGAs
Keywords: floating point, GPUs, FPGAs, OpenCL, GFLOPs
Using OpenCL with FPGAs offers several key advantages over GPUs. First, GPUs tend to be I/O limited. All input and output data must be passed through the host CPU via the PCIe interface. The resulting delays can stall the GPU processing engines, resulting in lower performance.
FPGAs are well known for their wide variety of high bandwidth I/O capabilities. This can allow data to stream in and out of the FPGA over Gigabit Ethernet, SRIO, or directly from ADCs and DACs. Altera has defined a vendor-specific extension of the OpenCL standard to support streaming operation.
FPGAs can also offer much lower processing latency than a GPU, even independent of I/O bottlenecks. It is well known that GPUs must operate on many thousands of threads to perform efficiently. This is due to the extremely long latencies to and from memory and even between the many processing cores of the GPU. In effect, the GPU must operate on many, many tasks so that it can keep the processing cores from stalling as they await data, which results in very long latency for any given task.
The FPGA uses a "coarse-grained parallelism" architecture instead. It creates multiple parallel, optimised datapaths, and each datapath usually outputs one result per clock cycle. The number of datapath instances depends upon the FPGA resources, but is typically far smaller than the number of GPU cores. However, each datapath instance has much higher throughput than a GPU core. A primary benefit of this approach is low latency, and minimising latency is a critical performance advantage in many applications.
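As a rough illustration of this throughput model, a datapath producing one result per clock cycle gives a total throughput of instances multiplied by clock rate. The figures below are hypothetical, chosen only to make the arithmetic concrete; they are not from the article:

```python
# Hypothetical figures for illustration only (not from the article):
# each FPGA datapath instance delivers one result per clock cycle.
fpga_instances = 4            # parallel datapath copies that fit the device
fpga_clock_hz = 300e6         # assumed datapath clock rate (Hz)
fpga_results_per_sec = fpga_instances * fpga_clock_hz

print(fpga_results_per_sec)   # 1.2e9: over a billion results/s from 4 datapaths
```

Even a handful of datapath instances can sustain billions of results per second, which is why far fewer "cores" are needed than on a GPU.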
Another advantage of FPGAs is their much lower power consumption, which translates into dramatically higher GFLOP/s per Watt. As measured by BDTI, FPGAs achieve 5-6 GFLOP/s per Watt on complex floating point algorithms such as Cholesky decomposition [4]. GPU energy efficiency measurements are much harder to find, but taking the GPU performance of 50 GFLOP/s for Cholesky and a typical power consumption of 200 W gives 0.25 GFLOP/s per Watt, roughly twenty times more power consumed per useful FLOP/s.
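The comparison works out as follows, using only the figures quoted above (the FPGA value takes the lower end of the BDTI-measured 5-6 GFLOP/s per Watt range):

```python
# Energy-efficiency comparison using the figures quoted in the text.
gpu_gflops = 50.0          # GPU Cholesky throughput (GFLOP/s)
gpu_power_w = 200.0        # typical GPU power draw (W)
gpu_eff = gpu_gflops / gpu_power_w      # 0.25 GFLOP/s per Watt

fpga_eff = 5.0             # lower end of the BDTI-measured 5-6 GFLOP/s/W
ratio = fpga_eff / gpu_eff

print(gpu_eff, ratio)      # 0.25 20.0: the GPU burns ~20x more power per FLOP/s
```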
Both OpenCL and DSPBuilder rely on a technique known as "Fused Datapath" (figure 2), where floating point processing is implemented in such a fashion as to dramatically reduce the number of barrel shifting circuits required, which in turn allows for large scale and high performance floating point designs to be built using FPGAs.
Figure 2: Fused Datapath Implementation of Floating Point.
To reduce the frequency of barrel shifting, the synthesis process looks for opportunities where larger mantissa widths can offset the need for frequent normalisation and de-normalisation. The availability of 27x27 and 36x36 hard multipliers allows mantissas significantly wider than the 23 bits required by single precision, and 54x54 and 72x72 multipliers can be constructed for mantissas wider than the 52 bits required by double precision. The FPGA logic is already optimised for implementing large, fixed point adder circuits, with built-in carry look-ahead circuits.
Where normalisation and de-normalisation is required, an alternative implementation which avoids low performance and excessive routing is to use multipliers. For a 24-bit single-precision mantissa (including the implied leading bit), the 24x24 multiplier shifts the input by multiplying by 2^n. Again, the availability of hardened 27x27 and 36x36 multipliers allows for extended mantissa sizes in single precision, and these can be combined to construct the multiplier sizes needed for double precision.
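The identity this technique relies on, that a left shift by n bits is the same as multiplying by 2^n, can be sketched in a few lines (the function name is illustrative, not from any vendor library):

```python
def shift_via_multiply(mantissa: int, n: int) -> int:
    """Shift `mantissa` left by n bits using only a multiplication,
    the way a hardened multiplier can stand in for a barrel shifter."""
    return mantissa * (1 << n)   # (1 << n) == 2**n drives one multiplier input

# The multiply by a power of two reproduces the barrel shifter's result.
m = 0b1011011
assert shift_via_multiply(m, 5) == m << 5
```

In hardware the power of two is simply applied to one input of the hard multiplier block, so the shift costs a multiplier rather than a wide mux tree.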
A vector dot product (figure 3) is the underlying operation consuming the bulk of the FLOP/s in many linear algebra algorithms. A single precision implementation of a length-64 vector dot product would require 64 floating point multipliers, followed by an adder tree made up of 63 floating point adders. This requires implementing many barrel shifting circuits.
Figure 3: Vector dot product optimisation.
Instead, the outputs of the 64 multipliers can be de-normalised to a common exponent, namely the largest of the 64 exponents. These 64 outputs can then be summed using a fixed point adder circuit, with a single final normalisation performed at the end of the adder tree. This localised block floating point dispenses with all the interim normalisation and de-normalisation required at each individual adder, and is shown in figure 3. Even with IEEE 754 floating point, the number with the largest exponent largely determines the exponent of the final result, so this approach merely moves the exponent adjustment to an earlier point in the calculation.
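A software model of this localised block floating point scheme might look like the following. It is a sketch of the idea under stated assumptions, not Altera's implementation: products are aligned to the largest exponent, accumulated in wide fixed point, and normalised once at the end.

```python
import math

def block_fp_dot(xs, ys, frac_bits=36):
    """Dot product with a single de-normalisation to a shared exponent,
    a fixed point adder tree, and one final normalisation. A sketch of
    localised block floating point, not a bit-accurate hardware model."""
    # math.frexp gives x*y == m * 2**e with 0.5 <= |m| < 1.
    prods = [math.frexp(x * y) for x, y in zip(xs, ys)]
    max_e = max(e for _, e in prods)          # the common (largest) exponent
    # De-normalise each mantissa to max_e and widen it to frac_bits of
    # fraction, standing in for the extended mantissa in the FPGA datapath.
    acc = sum(int(round(m * 2.0 ** (e - max_e) * (1 << frac_bits)))
              for m, e in prods)              # fixed point accumulation
    # One normalisation at the end of the adder tree.
    return (acc / (1 << frac_bits)) * 2.0 ** max_e

xs = [1.5, -2.25, 0.375, 4.0]
ys = [2.0, 0.5, -8.0, 1.25]
assert abs(block_fp_dot(xs, ys) - 3.875) < 1e-9   # matches the exact dot product
```

Only one normalisation circuit is needed regardless of the vector length, at the cost of the early alignment discussed next.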
However, when performing signal processing, the best results are obtained by carrying as much precision as possible and truncating results only at the end of the calculation. The approach here compensates for the sub-optimal early de-normalisation by carrying extra mantissa bits over and above the width required by single precision floating point, usually extending from 27 to 36 bits. The mantissa extension is done within the floating point multipliers, to eliminate the need to normalise the product at each step.