3.5.1.13.1 Further Throughput Enhancement with Loop Pipelining

In this example, the throughput of the streaming circuit will be limited by how frequently the functions can start processing new data (for example, how frequently the new loop iterations can be started). For instance, if the slowest function among the three functions can only start a new loop iteration every 4 cycles, then the throughput of the entire streaming circuit will be limited to processing one piece of data every 4 cycles. Therefore, as you may have guessed, we can further improve the circuit throughput by pipelining the loops in the three functions. If you run SmartHLS synthesis for the example (Compile Software to Hardware), you should see in the Pipeline Result section of our report file, summary.hls.<top_level>.rpt, that all loops can be pipelined with an initiation interval of 1. That means all functions can start a new iteration every clock cycle, and hence the entire streaming circuit can process one piece of data every clock cycle. Now run the simulation (Simulate Hardware) to confirm our expected throughput. The reported cycle latency should be just slightly more than the number of data samples to be processed (INPUTSIZE is set to 128; the extra cycles are spent on activating the parallel accelerators, flushing out the pipelines, and verifying the results).