3.4 Appendix: Loop Pipelining in Part 1 vs. Part 2

Part 2 of the tutorial noted that the nested loops from Part 1 were manually flattened into a single for loop, a transformation called “loop flattening”, as shown in sobel.cpp from Part 2 below:

#pragma HLS loop pipeline
    for (int i = 0; i < (HEIGHT - 2) * (WIDTH - 2); i++) {
        // increment row when column reaches end of row
        y = (x == WIDTH - 2) ? y + 1 : y;
        // increment column until end of row
        x = (x == WIDTH - 2) ? 1 : x + 1;
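
To see why the flattened loop visits the same pixels as a nested loop over the interior of the image, the index updates can be checked in isolation. The following is a minimal sketch, not code from the tutorial: the function names are hypothetical, and the starting values x = 0 and y = 1 are an assumption, since the excerpt does not show how x and y are initialized.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Visit order of a nested loop over the interior pixels
// (rows 1..height-2, columns 1..width-2).
std::vector<std::pair<int, int>> nested_order(int height, int width) {
    std::vector<std::pair<int, int>> order;
    for (int y = 1; y <= height - 2; y++) {
        for (int x = 1; x <= width - 2; x++) {
            order.push_back({y, x});
        }
    }
    return order;
}

// Visit order of the flattened loop, using the same x/y update
// expressions as the Part 2 excerpt.
// ASSUMPTION: x starts at 0 and y starts at 1 (not shown in the excerpt).
std::vector<std::pair<int, int>> flattened_order(int height, int width) {
    assert(height >= 3 && width >= 3);
    std::vector<std::pair<int, int>> order;
    int x = 0, y = 1;
    for (int i = 0; i < (height - 2) * (width - 2); i++) {
        // increment row when column reaches end of row
        y = (x == width - 2) ? y + 1 : y;
        // increment column until end of row
        x = (x == width - 2) ? 1 : x + 1;
        order.push_back({y, x});
    }
    return order;
}
```

Under these assumptions, flattened_order(h, w) and nested_order(h, w) return the same sequence of (row, column) pairs, so the single loop needs no inner loop for the pipeline to wait on.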

The nested loops were flattened because SmartHLS does not support pipelining nested loops unless the inner loops are unrolled. For example, suppose you open sobel.cpp from Part 1 and add a pragma to pipeline the outer loop of the loop nest:

#pragma HLS loop pipeline
    for (int i = 0; i < HEIGHT; i++) {
        for (int j = 0; j < WIDTH; j++) {
            // Set output to 0 if the 3x3 receptive field is out of bound.
            if ((i < 1) | (i > HEIGHT - 2) | (j < 1) | (j > WIDTH - 2)) {
                out[i][j] = 0;
                continue;
            }

SmartHLS then tries to fully unroll the innermost loop (index j). Because that loop has many iterations, the unrolling fails and SmartHLS prints a warning in the Console output:

Warning: Failed to unroll the entire loop nest on line 19 of sobel.c.

Because the innermost loop has not been unrolled, the loop nest cannot be pipelined:

Warning: SmartHLS cannot pipeline nested loops.

See the following screenshot from the SmartHLS IDE console.

Figure 3-28. Screenshot from the SmartHLS IDE Console

If you pipeline only the innermost loop, the hardware is less efficient than when the nested loops are flattened into a single loop, because each outer loop iteration must stop and wait for the innermost loop pipeline to finish. If the nested loops are flattened and the resulting loop is pipelined, all iterations can overlap and the hardware never needs to wait.
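
The cost of waiting for the inner pipeline can be estimated with a simple cycle model. The sketch below assumes that a pipelined loop takes II × (iterations − 1) + pipeline length cycles, an estimate that is consistent with the Latency column of the pipeline reports in this section. The function names are hypothetical, and the model ignores any cycles spent outside the pipelined loop.

```cpp
#include <cassert>

// Cycles for one pipelined loop: a new iteration starts every `ii` cycles,
// and the last iteration needs `depth` cycles to leave the pipeline.
// ASSUMPTION: simplified model, no stalls and no loop entry/exit overhead.
long pipelined_cycles(long iterations, long ii, long depth) {
    assert(iterations > 0 && ii > 0 && depth > 0);
    return ii * (iterations - 1) + depth;
}

// Inner loop pipelined: the pipeline fills and drains once per outer iteration.
long inner_pipelined_cycles(long outer, long inner, long ii, long depth) {
    return outer * pipelined_cycles(inner, ii, depth);
}

// Flattened loop: one pipeline covers all outer * inner iterations.
long flattened_cycles(long outer, long inner, long ii, long depth) {
    return pipelined_cycles(outer * inner, ii, depth);
}
```

For example, with 4 outer iterations, 8 inner iterations, an initiation interval of 1 and a pipeline length of 5, the model gives 48 cycles for the inner-pipelined nest and 36 cycles for the flattened loop. In general, for equal II and pipeline length, the model predicts a saving of (outer − 1) × (depth − II) cycles, so the benefit grows with deeper pipelines and more outer iterations.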

For example, if you open sobel.cpp from Part 1 and pipeline the innermost loop by adding the loop pipeline pragma:

    for (int i = 0; i < HEIGHT; i++) {
#pragma HLS loop pipeline
        for (int j = 0; j < WIDTH; j++) {
            // Set output to 0 if the 3x3 receptive field is out of bound.
            if ((i < 1) | (i > HEIGHT - 2) | (j < 1) | (j > WIDTH - 2)) {
                out[i][j] = 0;
                continue;
            }

When you re-run Compile Software to Hardware by clicking its icon, the initiation interval of the innermost loop is 4, as shown in the following summary report:

====== 3. Pipeline Result ======
+-------------------------+--------------+-------------+-------------------------+---------------------+-----------------+-----------------+---------+
| Label                   | Function     | Basic Block | Location in Source Code | Initiation Interval | Pipeline Length | Iteration Count | Latency |
+-------------------------+--------------+-------------+-------------------------+---------------------+-----------------+-----------------+---------+
| for_loop_sobel_cpp_20_6 | sobel_filter | %for.body3  | line 20 of sobel.cpp    | 4                   | 10              | 512             | 2054    |
+-------------------------+--------------+-------------+-------------------------+---------------------+-----------------+-----------------+---------+

Run co-simulation and check the Console output, as shown below:

Number of calls: 1
Cycle latency: 1,052,677
SW/HW co-simulation: PASS
make[1]: Leaving directory '.../workspace/sobel_part1'
15:51:35 Build Finished (took 1m:51s.138ms)

This cycle latency roughly corresponds to 512 (outer loop iterations) × 2054 (latency of the innermost loop pipeline) = 1,051,648 cycles. The remaining 1,029 cycles are spent in the hardware that runs before and after the pipelined loop.

Compare this latency with that of sobel.cpp in Part 2, where the flattened loop is pipelined:

#pragma HLS loop pipeline
    for (int i = 0; i < (HEIGHT - 2) * (WIDTH - 2); i++) {
        // increment row when column reaches end of row
        y = (x == WIDTH - 2) ? y + 1 : y;
        // increment column until end of row
        x = (x == WIDTH - 2) ? 1 : x + 1;

When you run Compile Software to Hardware, the summary report shows the following:

====== 3. Pipeline Result ======
+-------------------------+--------------+-------------+-------------------------+---------------------+-----------------+-----------------+---------+
| Label                   | Function     | Basic Block | Location in Source Code | Initiation Interval | Pipeline Length | Iteration Count | Latency |
+-------------------------+--------------+-------------+-------------------------+---------------------+-----------------+-----------------+---------+
| for_loop_sobel_cpp_20_5 | sobel_filter | %for.body   | line 20 of sobel.cpp    | 4                   | 11              | 260100          | 1040407 |
+-------------------------+--------------+-------------+-------------------------+---------------------+-----------------+-----------------+---------+

The initiation interval of the flattened loop is still 4, but the pipeline length (depth) is now one cycle longer: 11 cycles instead of 10.

When you run co-simulation, you see the Console output shown below:

Number of calls: 1
Cycle latency: 1,040,413
SW/HW co-simulation: PASS
make[1]: Leaving directory '.../workspace/sobel_part2'
15:59:45 Build Finished (took 1m:40s.823ms)

This cycle latency roughly corresponds to the 1,040,407 cycles reported in the last column (“Latency”) of the pipeline summary report. Flattening the loop improves the cycle latency from 1,052,677 to 1,040,413 cycles, an improvement of about 1.2%. In this case loop flattening yields little improvement, but depending on the loop nest the impact can be much larger.
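
As a quick arithmetic check, the figures in this appendix can be reproduced in a few lines of C++. This is only a sketch: the constant and function names are illustrative and not part of the tutorial sources, and the relation II × (iterations − 1) + pipeline length is an assumed latency model that happens to match both report rows.

```cpp
#include <cassert>

// Values copied from the two pipeline reports and co-simulation logs above.
constexpr long kII = 4;
constexpr long kInnerDepth = 10, kInnerIterations = 512;
constexpr long kFlatDepth = 11, kFlatIterations = 260100;
constexpr long kCosimInnerPipelined = 1052677;  // Part 1, inner loop pipelined
constexpr long kCosimFlattened = 1040413;       // Part 2, flattened loop

// ASSUMPTION: latency = II * (iterations - 1) + pipeline length.
static_assert(kII * (kInnerIterations - 1) + kInnerDepth == 2054,
              "matches the Latency column for the inner loop");
static_assert(kII * (kFlatIterations - 1) + kFlatDepth == 1040407,
              "matches the Latency column for the flattened loop");
// 260100 iterations correspond to (512 - 2) * (512 - 2) interior pixels.
static_assert((512 - 2) * (512 - 2) == kFlatIterations,
              "iteration count of the flattened loop");

// Cycles saved by the second design relative to the first.
long cycles_saved(long before, long after) {
    return before - after;
}

// Relative improvement in percent, measured against the `before` latency.
double improvement_percent(long before, long after) {
    assert(before > 0);
    return 100.0 * static_cast<double>(before - after) /
           static_cast<double>(before);
}
```

Calling cycles_saved(kCosimInnerPipelined, kCosimFlattened) gives 12,264 cycles, and improvement_percent with the same arguments gives a value between 1.16 and 1.17, which is the improvement of roughly 1% quoted above.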

v2024.2
