Intel® FPGA SDK for OpenCL™ Pro Edition: Best Practices Guide

ID 683521
Date 12/19/2022

4.3. Optimizing Floating-Point Operations

For floating-point operations, you can manually direct the Intel® FPGA SDK for OpenCL™ Offline Compiler to perform optimizations that create more efficient pipeline structures in hardware and reduce the overall hardware usage. These optimizations can cause small differences in floating-point results.
Tip: For oneAPI DPC++-specific details, refer to the Optimize Floating-point Operation topic in the FPGA Optimization Guide for Intel® oneAPI Toolkits.

Tree Balancing

Order-of-operations rules apply in the OpenCL™ language. In the following example, the offline compiler performs multiplications and additions in a strict order, beginning with operations within the innermost parentheses:

result = (((A * B) + C) + (D * E)) + (F * G);

By default, the offline compiler creates an implementation that resembles a long vine for such computations:

Figure 69. Default Floating-Point Implementation


Long, unbalanced operations lead to more expensive hardware. A more efficient hardware implementation is a balanced tree, as shown below:

Figure 70. Balanced Tree Floating-Point Implementation


In a balanced tree implementation, the offline compiler converts the long vine of floating-point adders into a tree pipeline structure. The offline compiler does not perform tree balancing of floating-point operations automatically because floating-point addition is not associative: regrouping the operations can change where rounding occurs, so the outcomes might differ. As a result, this optimization is inconsistent with IEEE Standard 754-2008.
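The effect of regrouping can be reproduced on a host CPU. The following is a minimal sketch in plain C, not kernel code and not the hardware the offline compiler generates. The function names vine_sum and balanced_sum are hypothetical, and the balanced grouping shown is only one possible tree, assumed here for illustration.

```c
/* Vine-style evaluation: each addition depends on the previous one,
   matching the grouping (((A * B) + C) + (D * E)) + (F * G). */
float vine_sum(float a, float b, float c, float d, float e, float f, float g)
{
    float t = (a * b) + c;
    t = t + (d * e);
    t = t + (f * g);
    return t;
}

/* One possible balanced grouping (an assumption for illustration):
   the left and right partial sums do not depend on each other, so a
   pipeline could compute them side by side before the final addition. */
float balanced_sum(float a, float b, float c, float d, float e, float f, float g)
{
    float left  = (a * b) + c;
    float right = (d * e) + (f * g);
    return left + right;
}
```

On a platform that evaluates float arithmetic in IEEE 754 single precision, calling both functions with A = B = 1.0e4f, C = 1.0f, D = -1.0e4f, E = 1.0e4f, and F = G = 1.0f gives 1.0f from vine_sum and 0.0f from balanced_sum. The exact result is 2, so both groupings lose information to rounding, but they lose it at different points.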

If you want the offline compiler to optimize floating-point operations using balanced trees and your program can tolerate small differences in floating-point results, include the -fp-relaxed option in the aoc command, as shown below:

aoc -fp-relaxed <your_kernel_filename>.cl

Rounding Operations

The balanced tree implementation of a floating-point operation includes multiple rounding operations. These rounding operations can require a significant amount of hardware resources in some applications. The offline compiler does not reduce the number of rounding operations automatically because doing so produces results that can differ from those required by IEEE Standard 754-2008.

You can reduce the amount of hardware necessary to implement floating-point operations with the -fpc option of the aoc command. If your program can tolerate small differences in floating-point results, invoke the following command:

aoc -fpc <your_kernel_filename>.cl

The -fpc option directs the offline compiler to perform the following tasks:

  • Remove floating-point rounding operations and conversions whenever possible.

    If possible, the -fpc option directs the offline compiler to round a floating-point operation only once, at the end of the tree of floating-point operations.

  • Carry additional mantissa bits to maintain precision.

    The offline compiler carries additional precision bits through the floating-point calculations, and removes these precision bits at the end of the tree of floating-point operations.

This type of optimization results in hardware that performs a fused floating-point operation, a feature of many modern hardware processing systems. Fusing multiple floating-point operations minimizes the number of rounding steps, which leads to more accurate results. An example of this optimization is the fused multiply-accumulate (FMAC) instruction available in newer processor architectures. The offline compiler can provide fused floating-point mathematical capabilities for many combinations of floating-point operators in your kernel.
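The difference between rounding after every operation and rounding at the end of a chain can be illustrated with a multiply-accumulate on a host CPU. The following is a minimal sketch in plain C, not a description of the hardware the offline compiler generates. The function names are hypothetical, and double is used here only as a stand-in for the additional mantissa bits; the width of the intermediate format in the generated hardware is not stated in this guide.

```c
/* Round after every operation: the product is rounded to float
   before the addition. The volatile store keeps a host compiler from
   contracting the two operations into a single fused instruction. */
float mac_round_each_step(float a, float b, float c)
{
    volatile float product = a * b;   /* first rounding */
    return product + c;               /* second rounding */
}

/* Carry extra precision through the chain and narrow at the end.
   The product of two floats is exact in double, so only the sum and
   the final narrowing to float can round. */
float mac_round_once(float a, float b, float c)
{
    double wide = (double)a * (double)b + (double)c;
    return (float)wide;               /* extra bits removed here */
}
```

With a = b = 0x1.001p0f (1 + 2^-12) and c = -0x1.002p0f (-(1 + 2^-11)), the exact value of a * b + c is 2^-24. On an IEEE 754 platform, mac_round_each_step returns 0.0f because the low-order bit of the product is rounded away before the addition, while mac_round_once returns 0x1p-24f, the exact result.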