Intel® High Level Synthesis Compiler Pro Edition: Best Practices Guide

ID 683152
Date 4/01/2024
Public

5. Loop Best Practices

The Intel® High Level Synthesis Compiler pipelines your loops to enhance throughput. Review these best practices to learn techniques for optimizing your loops and improving the performance of your component.

The Intel® HLS Compiler Pro Edition lets you know if there are any dependencies that prevent it from optimizing your loops. Try to eliminate these dependencies in your code for optimal component performance. You can also provide additional guidance to the compiler by using the available loop pragmas.
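As a minimal sketch of the kind of guidance a loop pragma can provide, the following function uses #pragma ivdep on a loop whose memory accesses the compiler cannot prove independent. The function name scatter_add and its parameters are hypothetical, and the code is written as plain C++ (no component keyword or HLS headers) so that it also builds with an ordinary compiler, which ignores the pragma. The pragma is only safe under the stated assumption that idx contains no repeated values.

```cpp
// Hypothetical example: accumulate val[i] into out[idx[i]].
// Because idx[i] is only known at run time, a compiler must assume that two
// iterations might touch the same element of out, which is a loop-carried
// memory dependency.
void scatter_add(int *out, const int *idx, const int *val, int n) {
  // Assumption: the caller guarantees that idx holds no duplicate values,
  // so no two iterations access the same element of out. The pragma passes
  // that guarantee to the compiler; it is incorrect if duplicates can occur.
  #pragma ivdep
  for (int i = 0; i < n; ++i) {
    out[idx[i]] += val[i];
  }
}
```

For example, calling scatter_add with idx = {2, 0, 3, 1} and val = {10, 20, 30, 40} on a zeroed four-element array leaves it holding {20, 40, 10, 30}.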

As a start, try the following techniques:
  • Manually fuse adjacent loop bodies when the instructions in those loop bodies can be performed in parallel. These fused loops can be pipelined instead of being executed sequentially. Pipelining reduces the latency of your component and can reduce the FPGA area your component uses.
  • Use the #pragma loop_coalesce directive to have the compiler attempt to collapse nested loops. Coalescing loops reduces the latency of your component and can reduce the FPGA area overhead needed for nested loops.
  • If you have two loops that can execute in parallel, consider using a system of tasks. For details, see System of Tasks Best Practices.
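The first two techniques above can be sketched as follows. This is an illustration under assumptions, not code from the tutorials: the function names, array sizes, and arithmetic are hypothetical, and the functions are plain C++ without the component keyword so that an ordinary compiler can build them (it ignores the loop_coalesce pragma).

```cpp
constexpr int kLen = 16;   // hypothetical array length
constexpr int kRows = 4;   // hypothetical matrix dimensions
constexpr int kCols = 8;

// Before fusion: two adjacent loops whose bodies do not depend on each other.
// The second loop starts only after the first loop finishes.
void scale_and_offset_unfused(const int (&a)[kLen], int (&x)[kLen], int (&y)[kLen]) {
  for (int i = 0; i < kLen; ++i) {
    x[i] = a[i] * 3;
  }
  for (int i = 0; i < kLen; ++i) {
    y[i] = a[i] + 7;
  }
}

// After manual fusion: one loop body performs both operations, so a single
// pipelined loop replaces two loops that run back to back.
void scale_and_offset_fused(const int (&a)[kLen], int (&x)[kLen], int (&y)[kLen]) {
  for (int i = 0; i < kLen; ++i) {
    x[i] = a[i] * 3;
    y[i] = a[i] + 7;
  }
}

// Nested loops marked for coalescing: the pragma asks the compiler to try to
// collapse the loop nest into a single loop.
int sum_matrix(const int (&m)[kRows][kCols]) {
  int sum = 0;
  #pragma loop_coalesce
  for (int r = 0; r < kRows; ++r) {
    for (int c = 0; c < kCols; ++c) {
      sum += m[r][c];
    }
  }
  return sum;
}
```

The fused and unfused functions produce identical outputs. Fusion is only valid here because neither loop body reads data that the other writes.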

Tutorials Demonstrating Loop Best Practices

The Intel® HLS Compiler Pro Edition comes with a number of tutorials that illustrate important Intel® HLS Compiler concepts and demonstrate good coding practices.

You can find these tutorials in the following location on your Quartus® Prime system:
<quartus_installdir>/hls/examples/tutorials

Review the following tutorials to learn about loop best practices that might apply to your design:

Tutorial | Description
best_practices/divergent_loops | Demonstrates a source-level optimization for designs with divergent loops.
best_practices/loop_coalesce | Demonstrates the performance and resource utilization improvements of using the loop_coalesce pragma on nested loops.
best_practices/loop_fusion | Demonstrates the latency and resource utilization improvements of loop fusion.
best_practices/loop_memory_dependency | Demonstrates breaking loop-carried dependencies using the ivdep pragma.
loop_controls/max_interleaving | Demonstrates a method to reduce the area utilization of a loop that meets the following conditions:
  • The loop has an II > 1
  • The loop is contained in a pipelined loop
  • The loop execution is serialized across the invocations of the pipelined loop
best_practices/optimize_ii_using_hls_register | Demonstrates how to use the hls_register attribute to reduce loop II and how to use hls_max_concurrency to improve component throughput.
best_practices/parallelize_array_operation | Demonstrates how to improve fMAX by correcting a bottleneck that arises when performing operations on an array in a loop.
best_practices/relax_reduction_dependency | Demonstrates a method to reduce the II of a loop that includes a floating point accumulator, or other reduction operation that cannot be computed at high speed in a single clock cycle.
best_practices/remove_loop_carried_dependency | Demonstrates how to improve loop performance by removing accesses to the same variable across nested loops.
best_practices/resource_sharing_filter | Demonstrates the following versions of a 32-tap finite impulse response (FIR) filter design:
  • optimized-for-throughput variant
  • optimized-for-area variant
best_practices/speculated_iterations | Demonstrates how to use #pragma speculated_iterations to control when speculated iterations are used.
best_practices/triangular_loop | Demonstrates a method for describing triangular loop patterns with dependencies.
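
To make the reduction-dependency idea concrete, here is a minimal sketch of one common way to relax a floating-point accumulation: keep several partial sums in a small shift register so that each addition depends on a value produced several iterations earlier rather than on the previous iteration. This is not taken from the relax_reduction_dependency tutorial itself. The function name, the depth kDepth = 8, and the use of #pragma unroll on the small inner loops are assumptions, and the right depth depends on the adder latency in your design.

```cpp
constexpr int kDepth = 8;  // hypothetical number of partial sums

// Sums n floats using kDepth interleaved partial sums.
float accumulate_relaxed(const float *x, int n) {
  // One extra slot holds the newly computed value before the shift.
  float shift_reg[kDepth + 1] = {0.0f};

  for (int i = 0; i < n; ++i) {
    // Add to the oldest partial sum. The value read here was written
    // kDepth iterations ago, so the loop-carried dependency spans kDepth
    // iterations instead of one.
    shift_reg[kDepth] = shift_reg[0] + x[i];

    // Shift every partial sum down by one slot.
    #pragma unroll
    for (int j = 0; j < kDepth; ++j) {
      shift_reg[j] = shift_reg[j + 1];
    }
  }

  // Combine the kDepth partial sums. Slot kDepth is a stale copy of
  // slot kDepth - 1 after the final shift, so it is excluded.
  float total = 0.0f;
  #pragma unroll
  for (int j = 0; j < kDepth; ++j) {
    total += shift_reg[j];
  }
  return total;
}
```

Because floating-point addition is not associative, the result can differ in the last bits from a strictly sequential sum. For inputs whose sums are exactly representable, such as the integers 1 through 20, both approaches return the same value.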