3.5.2.1 Loop Pipelining

Loop pipelining is a performance optimization in high-level synthesis (HLS), which extracts loop-level parallelism by executing multiple loop iterations concurrently using the same hardware. The key performance metric when loop pipelining is the time interval between starting successive loop iterations, called the initiation interval (II). Ideally, the initiation interval should be one, meaning that we can start a new loop iteration every clock cycle. This is the highest throughput that a pipelined loop can achieve without unrolling. Although SmartHLS will always try to achieve an II of 1, sometimes this is not possible, due to resource constraints, or due to cross-iteration dependencies (recurrences) in a loop.

Consider the following example:
int sum = 0;
for (i = 0; i < N; i++) {
      #pragma HLS loop pipeline
      for (j = 0; j < N; j++) {
          sum += A[i][j] * B[i][j];
      }
}

This example shows a nested loop, which performs an element-wise multiplication of two 2-dimensional arrays and accumulates the sum. The inner loop is specified with the 3.6.1.2 Pipeline Loop pragma. SmartHLS will show message as below when compiling this code:

Info: Pipelining the loop on line 61 of loop.c with label "loop".
Info: Done pipelining the loop on line 61 of loop.c with label "loop".
    Pipeline Initiation Interval (II) = 1.

These info messages let us know that SmartHLS successfully pipelined the inner loop with an II of 1. Even though an II of 1 has been achieved, the hardware may not meet our desired performance requirements. In this case, we can choose to pipeline the outer loop by moving the pragma to the outer loop:

int sum = 0;
#pragma HLS loop pipeline
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        sum += A[i][j] * B[i][j];
    }
}

When an outer loop is specified to be pipelined, SmartHLS will automatically unroll all of the inner loops. This can provide higher performance at the expense of higher circuit area. In this example, N is 25, and when the inner loop is unrolled, SmartHLS will create 25 multipliers and adder units working in parallel. However, this does not mean that the performance will be improved by 25x due to the resource constraints on memories A and B.

When SmartHLS runs, we will see the following messages:

Info: Unrolling the entire loop nest on line 61 of loop.c.
    This loop nest is inside a parent loop labelled 'loop', which is specified to be
    pipelined.
Info: Pipelining the loop on line 60 of loop.c with label "loop".
Info: Resource constraint limits initiation interval to 13
      Resource 'A_local_memory_port' has 25 uses per cycle but only 2 ports available.
      +--------------------------------+-------------------+-----------+
      | Operation                      | Location          | # of Uses |
      +--------------------------------+-------------------+-----------+
      | 'load' operation for array 'A' | line 60 of loop.c | 25        |
      +--------------------------------+-------------------+-----------+
      |                                | Total # of Uses   | 25        |
      +--------------------------------+-------------------+-----------+

Info: Resource constraint limits initiation interval to 13
      Resource 'B_local_memory_port' has 25 uses per cycle but only 2 ports available.
      +--------------------------------+-------------------+-----------+
      | Operation                      | Location          | # of Uses |
      +--------------------------------+-------------------+-----------+
      | 'load' operation for array 'B' | line 60 of loop.c | 25        |
      +--------------------------------+-------------------+-----------+
      |                                | Total # of Uses   | 25        |
      +--------------------------------+-------------------+-----------+

Info: Done pipelining the loop on line 60 of loop.c with label "loop".
    Pipeline Initiation Interval (II) = 13.

The first info message indicates that the inner loop is being unrolled, since the outer loop is specified to be pipelined. Next, the info messages tell us there are 25 load operations that need to occur to both memory A and B every clock cycle if II is 1, but there are only two ports (which allows 2 loads per cycle) available for each memory. Local_memory_port indicates that this resource is a memory port of a local memory, which is described in Hardware Architecture. Due to the limited available memory ports, SmartHLS must increase the loop pipeline II until we can meet the constraint of having 25 load operations to each memory. When the II is 13, meaning that each successive loop iteration is started every 13 cycles, we have enough time to allow 26 load operations, hence the constraint is met (each memory has 2 ports by default, which allows 2 memory accesses per cycle. In 13 cycles, we can perform 26 memory accesses in total).

For this particular example, when the outer loop is pipelined, the performance is about 2x higher than when the inner loop is pipelined. However, the area has also increased by about 25x, due to having 25 multipliers and adders. Therefore, we must use care when pipelining outer loops due to the unrolling of its inner loops. In general, we recommend pipelining the innermost loop first, and if the performance requirement is not met, then try pipelining the outer loops.

Important:

If the loop specified to be pipelined contains any function calls (in the loop or in any of the inner loops), the function calls will be inlined into the loop. Any descendants of the called functions will also be inlined, and all of their loops will also be unrolled automatically. If there are many descendant functions and loops, this can increase the area significantly (also described in 3.5.2.3 Function Pipelining). We recommend the user to examine the program for such cases before pipelining a loop.

Lets look at an image filtering example:

for (y = 1; y < HEIGHT-1; y++) {
    #pragma HLS loop pipeline
    for (x = 1; x < WIDTH-1; x++) {
        out[y][x] =  in[y-1][x-1]*k[0][0] + in[y-1][x]*k[0][1] + in[y-1][x+1]*k[0][2]
                   + in[y  ][x-1]*k[1][0] + in[y  ][x]*k[1][1] + in[y  ][x+1]*k[1][2]
                   + in[y+1][x-1]*k[2][0] + in[y+1][x]*k[2][1] + in[y+1][x+1]*k[2][2];
    }
}

This example applies a 3 x 3 image kernel filter, array k, to an input image, array in, producing an output image, array out. When we turn on loop pipelining and run SmartHLS, we can see the following messages:

Info: Pipelining the loop on line 22 of kernel.c with label "loop".
Info: Assigning new label to the loop on line 22 of kernel.c with label "loop"
Info: Resource constraint limits initiation interval to 5.
      Resource 'in_local_memory_port' has 9 uses per cycle but only 2 units available.
      +---------------------------------+---------------------+-----------+
      | Operation                       | Location            | # of Uses |
      +---------------------------------+---------------------+-----------+
      | 'load' operation for array 'in' | line 23 of kernel.c | 9         |
      +---------------------------------+---------------------+-----------+
      |                                 | Total # of Uses     | 9         |
      +---------------------------------+---------------------+-----------+

Info: Done pipelining the loop on line 22 of kernel.c with label "loop".
      Pipeline Initiation Interval (II) = 5.

The pipeline initiation interval is limited by the memory accesses to the input image (array in). There are 9 loads but only two memory ports, which forces the loop II to be 5, allowing up to 10 loads per iteration from array in. For loops where the II is constrained by memory accesses to an array, you can improve the II by manually splitting the array into several smaller C arrays. Each array can be accessed independently, which reduces resource contention. In this case, we can split the image into rows of pixels, where each row is stored in a separate array (in_0, in_1, and in_2).

for (y = 1; y < HEIGHT-1; y++) {
    #pragma HLS loop pipeline
    for (x = 1; x < WIDTH-1; x++) {
        out[y][x] =  in_0[x-1]*k[0][0] + in_0[x]*k[0][1] + in_0[x+1]*k[0][2]
                   + in_1[x-1]*k[1][0] + in_1[x]*k[1][1] + in_1[x+1]*k[1][2]
                   + in_2[x-1]*k[2][0] + in_2[x]*k[2][1] + in_2[x+1]*k[2][2];
    }
}

Now when we run SmartHLS we will see:

Info: Pipelining the loop on line 22 of kernel.c with label "loop".
Info: Resource constraint limits initiation interval to 2.
      Resource 'in_0_local_memory_port' has 3 uses per cycle but only 2 units available.
      +-----------------------------------+---------------------+-----------+
      | Operation                         | Location            | # of Uses |
      +-----------------------------------+---------------------+-----------+
      | 'load' operation for array 'in_0' | line 33 of kernel.c | 3         |
      +-----------------------------------+---------------------+-----------+
      |                                   | Total # of Uses     | 3         |
      +-----------------------------------+---------------------+-----------+

Info: Resource constraint limits initiation interval to 2.
      Resource 'in_1_local_memory_port' has 3 uses per cycle but only 2 units available.
      +-----------------------------------+---------------------+-----------+
      | Operation                         | Location            | # of Uses |
      +-----------------------------------+---------------------+-----------+
      | 'load' operation for array 'in_1' | line 33 of kernel.c | 3         |
      +-----------------------------------+---------------------+-----------+
      |                                   | Total # of Uses     | 3         |
      +-----------------------------------+---------------------+-----------+

Info: Resource constraint limits initiation interval to 2.
      Resource 'in_2_local_memory_port' has 3 uses per cycle but only 2 units available.
      +-----------------------------------+---------------------+-----------+
      | Operation                         | Location            | # of Uses |
      +-----------------------------------+---------------------+-----------+
      | 'load' operation for array 'in_2' | line 33 of kernel.c | 3         |
      +-----------------------------------+---------------------+-----------+
      |                                   | Total # of Uses     | 3         |
      +-----------------------------------+---------------------+-----------+
Info: Done pipelining the loop on line 22 of kernel.c with label "loop".
      Pipeline Initiation Interval (II) = 2.

Now the initiation interval has improved from 5 to 2, which is a more than a 2x performance improvement just by manually partitioning the arrays.

Consider another example below:

int A = 1;

#pragma HLS loop pipeline
for (i = 0; i < N; i++) {
    A = A * B[i];
}

We have a loop where the value of A in the current iteration is dependent on the previous iteration. This is called a cross-iteration dependency or loop recurrence. In order to achieve an II of 1, the value of A is required every clock cycle. This means that the multiplication of A and B[i] has to complete every clock cycle. Now, let's consider a case where we would like to pipeline the multiplier more in order to get a higher Fmax (the maximum frequency of the circuit). This can be done by changing the multiplier latency to 2 (using the 3.6.2.2.6 set_operation_latency constraint in SmartHLS ) from the default latency of 1.

WhenSmartHLS runs, we will see the following messages:

Info: Pipelining the loop on line 10 of loop_recurrence.c with label "loop".
Info: Cross-iteration dependency does not allow initiation interval of 1.
    Dependency (distance = 1) from 'mul' operation (at line 11 of loop_recurrence.c) to
    'phi' operation (at line 11 of loop_recurrence.c)
    Recurrence path:
    +-----------------+------------------------------+---------------+
    | Operation       | Location                     | Cycle Latency |
    +-----------------+------------------------------+---------------+
    | 'phi' operation | line 11 of loop_recurrence.c | 0             |
    | 'mul' operation | line 11 of loop_recurrence.c | 2             |
    +-----------------+------------------------------+---------------+
    |                 | Total Required Latency       | 2             |
    +-----------------+------------------------------+---------------+

    Total required latency = 2. Maximum allowed latency = distance x II = 1 x 1 = 1.
    Total required latency > Maximum allowed latency, we must increase II
Info: Done pipelining the loop on line 10 of loop_recurrence.c with label "loop".
    Pipeline Initiation Interval (II) = 2.

The messages tell us that the II cannot be 1 due to a cross-iteration dependency, which is from the multiply operation of the current loop iteration to the phi operation of the next iteration. You can think of a phi as a select operation, which is need to represent the program's intermediate representation in static single assignment form. In this particular case, the phi selects the value of A between the initial value of 1 (in the first iteration of the loop), and the computed value of A from within the loop (in the iterations after the first). The dependency distance of 1 means that the multiply value is used by the phi operation 1 loop iteration later. Or alternatively, the phi operation is dependent on the multiply value from one loop iteration ago. The Recurrence path table shows that the phi operation takes 0 cycles, but the multiply operation takes 2 cycles, hence the total required latency is 2 for the path. However, the maximum allowed latency, if the II were to be 1, is 1 (distance x II = 1 x 1 = 1). In this case, the next loop iteration should be starting after 1 clock cycle (II = 1) but we still have not finished calculating the result of the multiply which is needed by the phi operation in the next iteration. Since the total required latency is greater than the maximum allowed latency, the II has to be increased to 2. With the II being 2, the maximum allowed latency becomes 2, which satisfies the total required latency. In this case, the first iteration will start, we will wait two clock cycles for the multiply to finish (II = 2), then start the next loop iteration.

In general, for pipelining, achieving a lower II is the highest priority for achieving the highest performance, even at the expense of slightly lower Fmax. For example, if we can reduce the II from 2 to 1 then we cut the clock cycles taken by the loop in half, but we are unlikely to double the Fmax by inserting one more pipeline stage (which changes an II of 1 to 2 in this case).

If we use the 3.6.2.2.6 set_operation_latency constraint to reduce the multiplier latency from 2 to 1 and run SmartHLS, we will see the following messages, where we have achieved an II of 1:

Info: Pipelining the loop on line 10 of loop_recurrence.c with label "loop".
Info: Done pipelining the loop on line 10 of loop_recurrence.c with label "loop".
    Pipeline Initiation Interval (II) = 1.

The above example illustrated a case of II being increased due to the latency of operations in the presence of a loop recurrence. The II can also be increased due to the delay of operations in a loop with cross-iteration dependencies.

Consider the following example:

#pragma HLS loop pipeline
for (iter = 0; iter < MAX_ITER; iter++) {
    long long squared = x * x + y * y;
    xtmp = x * x - y * y + x_0;
    y = 2 * x * y + y_0;
    x = xtmp;

    filter += squared <= 4;
}

The code shows the main computations for the mandelbrot set, the algorithm details are not important. When we run SmartHLS on this code, we see the following messages:

Info: Pipelining the loop on line 39 of mandelbrot.c with label "loop".
Info: Cross-iteration dependency does not allow initiation interval (II) of 1.
    Dependency (distance = 1) from 'trunc' operation (at line 42 of mandelbrot.c) to
    'phi' operation (at line 42 of mandelbrot.c)
    Recurrence path:
    +-------------------+-------------------------+------------+
    | Operation         | Location                | Delay [ns] |
    +-------------------+-------------------------+------------+
    | 'phi' operation   | line 42 of mandelbrot.c | 0.00       |
    | 'sext' operation  | line 40 of mandelbrot.c | 0.00       |
    | 'mul' operation   | line 42 of mandelbrot.c | 8.00       |
    | 'ashr' operation  | line 42 of mandelbrot.c | 0.00       |
    | 'shl' operation   | line 42 of mandelbrot.c | 0.00       |
    | 'add' operation   | line 42 of mandelbrot.c | 6.40       |
    | 'trunc' operation | line 42 of mandelbrot.c | 0.00       |
    +-------------------+-------------------------+------------+
    |                   | Total Required Delay    | 14.40      |
    +-------------------+-------------------------+------------+

    Total required delay = 14.40 ns.
    Maximum allowed latency = distance x II = 1.
    Maximum allowed delay = Maximum allowed latency x clock period
                          = 1 x 8.00 ns = 8.00 ns
    Total required delay > Maximum allowed delay, we must increase II.
    Tip: Increase the clock period to be greater than the total required delay
         to improve II.

Info: Done pipelining the loop on line 39 of mandelbrot.c with label "loop".
    Pipeline Initiation Interval (II) = 2.

The messages indicate that there is a cross-iteration dependency from the truncate operation to the phi operation, where the total required delay for the operation is 14.40 ns. On the other hand, the maximum allowed latency, if the II were to be 1, is 1, and the maximum allowed delay, based on the given clock period constraint (8 ns) and the maximum allowed latency (1), is 8 ns. Since the required delay of 14.4 ns for the path is greater than the maximum allowed delay of 8 ns, SmartHLS must increase the II to 2 to satisfy the required delay. If the II is 2, the maximum allowed latency (distance x II = 1 x 2) becomes 2, hence the maximum allowed delay becomes 16 ns (maximum allowed latency x clock period = 2 x 8 ns), and the required delay can be met.

As mentioned above, keeping the II low (and ideally 1) should generally be the priority for achieving the maximum performance. Another way to meet the required delay shown above, based on the equations shown as well as the described Tip, is to increase the clock period rather than increasing the II. With an II of 1 (hence the maximum allowed latency of 1), if the clock period is bigger than 14.4, the maximum allowed delay should be greater than the total required delay.

Let's set the clock period to 15 (with the 3.6.2.2.1 CLOCK_PERIOD constraint), and re-run SmartHLS:

Info: Generating pipeline for loop on line 39 of mandelbrot.c with label "loop".
    Pipeline initiation interval = 1.

You can see that SmartHLS was now able to generate a circuit with an II of 1.

Loop pipelining is a great technique for achieving high performance, and in order to achieve the maximum performance, users should be mindful of the circuit resource constraints and the recurrences that exist in the loop.