3.5.2.9 Optimizing AXI4 Initiator Performance

An AXI4 Initiator interface lets an HLS module access external memory while it is running, which can save time compared to methods that require the data to be fully transferred before the module can start. SmartHLS allows AXI4 Initiator transactions to be optimized using pragmas. For more information on AXI4 Initiator interface arguments, see AXI4 Initiator Interface. For more information on the pragmas used in this section, see HLS Pragmas Manual.

An example of an unoptimized HLS module that uses an AXI4 Initiator interface is shown below. The module simply uses a loop to copy elements from one external memory to another. The interface pragmas for in_array and out_array declare that memory transfers to these external memories should be implemented using an AXI4 Initiator interface. The interface pragma for default declares that the HLS module control (see Module Control Interface), as well as the pointer addresses, are implemented with an AXI4 Target interface.
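// Unoptimized version of the core (function name assumed; the signature
// matches the optimized version shown later in this section).
void copy_array(unsigned *in_array, unsigned *out_array, unsigned num_elements) {
#pragma HLS function top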
#pragma HLS interface default type(axi_target)
#pragma HLS interface argument(in_array) type(axi_initiator)                   \
    num_elements(NUM_ELEMENTS)
#pragma HLS interface argument(out_array) type(axi_initiator)                  \
    num_elements(NUM_ELEMENTS)
    for (unsigned idx = 0; idx < num_elements; ++idx) {
        out_array[idx] = in_array[idx];
    }
}
The HLS module described above is functional, but several optimizations can improve the performance of its AXI4 Initiator transactions. An optimized version of the same module is shown below:
// An optimized version of the same core.
void copy_array_optimized(unsigned *in_array, unsigned *out_array, unsigned num_elements) {
#pragma HLS function top
#pragma HLS interface default type(axi_target)
// (2) Specify maximum burst length in the interface pragma
// (3) Specify maximum number of outstanding transactions via interface pragma
#pragma HLS interface argument(in_array) type(axi_initiator)                   \
    num_elements(NUM_ELEMENTS) max_burst_len(256) max_outstanding_reads(2)
#pragma HLS interface argument(out_array) type(axi_initiator)                  \
    num_elements(NUM_ELEMENTS) max_burst_len(256) max_outstanding_writes(2)
// (1) Pipeline the loop to infer an AXI burst transfer
#pragma HLS loop pipeline
    for (unsigned idx = 0; idx < num_elements; ++idx) {
        out_array[idx] = in_array[idx];
    }
}

In the optimized version above, the following optimizations have been made:

  1. Pipelining the loop that contains the AXI4 Initiator accesses allows SmartHLS to infer an AXI burst. Specific criteria must be met to infer a burst transaction (see AXI4 Initiator Interface). In a burst, the address is sent only once for multiple data beats, which reduces the total traffic on the bus and the time spent waiting for responses from the external memory. Note that when SmartHLS infers an AXI4 Initiator burst transfer, it also infers FIFOs on the AXI4 RDATA and WDATA signals (of size equal to the maximum burst length). These FIFOs ensure that the HLS module and the external memory do not stall each other mid-burst while processing data.
  2. For AXI4 Initiator burst transactions, the maximum burst length can be configured using the max_burst_len option of the interface pragma. Setting a larger value (maximum 256, default 16) can improve burst performance, at the cost of holding the AXI channel for longer per burst. The maximum burst length also determines the size of the RDATA/WDATA FIFOs inside the HLS module, so a larger burst length uses more memory.
  3. For AXI4 Initiator burst transactions, the maximum number of outstanding transactions can be configured using the max_outstanding_reads and max_outstanding_writes options of the interface pragma. Setting a larger value (maximum 8, default 1) can improve performance by issuing burst requests earlier, which may reduce the cases where the HLS module stalls waiting for data, at the cost of larger RDATA/WDATA FIFOs.
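
To make the effect of the maximum burst length in item 2 concrete, the following is a minimal sketch of the arithmetic only. It is not SmartHLS code: burst_count is a hypothetical helper, and it assumes that every burst except possibly the last is full-length and that alignment effects (such as the AXI4 rule that a burst must not cross a 4 KB boundary) can be ignored.

```cpp
// Hypothetical helper, not part of SmartHLS.
// Returns the number of AXI bursts (and therefore address phases) needed to
// move `num_beats` data beats when each burst carries at most
// `max_burst_len` beats. Assumes all bursts except possibly the last are
// full-length and ignores alignment restrictions.
constexpr unsigned burst_count(unsigned num_beats, unsigned max_burst_len) {
    return (num_beats + max_burst_len - 1) / max_burst_len; // ceiling division
}

// Illustrative transfer of 4096 beats (one array element per beat).
static_assert(burst_count(4096, 16) == 256, "default burst length of 16");
static_assert(burst_count(4096, 256) == 16, "maximum burst length of 256");
static_assert(burst_count(1000, 256) == 4, "last burst is partial");
```

Under these assumptions, raising max_burst_len from 16 to 256 reduces the number of address phases for the same transfer by a factor of 16. The actual gain depends on the interconnect and the memory behind it.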
Additionally, for SmartHLS modules that use their AXI4 Initiator interface to transfer data to/from DDR on an MSS (see SoC Features), the DDR memory region where the data resides can have a large impact on performance. For projects using the PolarFire SoC Icicle Kit and reference design, such as those generated by SmartHLS's Reference SoC Generation, the memory region can be selected in the allocation call for the software running on the MSS as shown below. For more information on using the hls_alloc library, see Memory Allocation Library.
int main() {
    // Allocate/initialize cpu-side memory
    // (4) Choose a cpu-side allocation region that best suits the transfer
    unsigned *memory_in = (unsigned *)hls_malloc(
        sizeof(unsigned) * NUM_ELEMENTS, HLS_ALLOC_CACHED);
    unsigned *memory_out = (unsigned *)hls_malloc(
        sizeof(unsigned) * NUM_ELEMENTS, HLS_ALLOC_CACHED);
    for (unsigned idx = 0; idx < NUM_ELEMENTS; idx++) {
        memory_in[idx] = idx+1;
    }

    // Run the accelerator
    copy_array_optimized(memory_in, memory_out, NUM_ELEMENTS);
}
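
The main function above runs the accelerator but does not inspect memory_out. A minimal sketch of a software-side check is given below. count_mismatches is a hypothetical helper and not part of the SmartHLS libraries; it only assumes that both buffers are readable from the CPU once the accelerator call has completed.

```cpp
#include <cstddef>

// Hypothetical helper, not part of SmartHLS.
// Counts how many of the first `n` elements of `actual` differ from the
// corresponding elements of `expected`.
std::size_t count_mismatches(const unsigned *expected, const unsigned *actual,
                             std::size_t n) {
    std::size_t mismatches = 0;
    for (std::size_t idx = 0; idx < n; ++idx) {
        if (expected[idx] != actual[idx]) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

In the example above, calling count_mismatches(memory_in, memory_out, NUM_ELEMENTS) after copy_array_optimized returns would be expected to yield zero for a correct copy.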

The best AXI4 Initiator settings and memory region choice can vary greatly depending on the system that contains the HLS module. For SoC projects using the PolarFire SoC Icicle Kit, we recommend the following settings for best AXI4 Initiator performance:

  • Use AXI4 Initiator bursting whenever the code meets the criteria for burst inference.
  • Use a larger max_burst_len; the best performance is obtained at 256.
  • Set max_outstanding_reads to 2 and max_outstanding_writes to 1.
  • For moving data between MSS DDR and a SmartHLS module, it is recommended to use the AXI4 Initiator interface, and to store the data in the HLS_ALLOC_NONCACHED memory region.
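
As a sketch of how the first three recommendations could be applied to the copy example from this section, the listing below uses only pragma options that already appear above. The function name copy_array_icicle and the value of NUM_ELEMENTS are assumptions made for illustration. Note that it sets max_outstanding_writes to 1, which differs from the optimized example earlier in this section. The memory region recommendation applies to the software-side allocation call and is not shown here.

```cpp
// Assumed transfer size for illustration; the examples in this section do
// not show how NUM_ELEMENTS is defined.
#define NUM_ELEMENTS 1024

// Hypothetical variant of the copy core that applies the recommended
// Icicle Kit settings listed above. A regular C++ compiler ignores the
// HLS pragmas, so the function also runs as plain software.
void copy_array_icicle(unsigned *in_array, unsigned *out_array,
                       unsigned num_elements) {
#pragma HLS function top
#pragma HLS interface default type(axi_target)
#pragma HLS interface argument(in_array) type(axi_initiator)                   \
    num_elements(NUM_ELEMENTS) max_burst_len(256) max_outstanding_reads(2)
#pragma HLS interface argument(out_array) type(axi_initiator)                  \
    num_elements(NUM_ELEMENTS) max_burst_len(256) max_outstanding_writes(1)
// Pipeline the loop so that a burst transfer can be inferred.
#pragma HLS loop pipeline
    for (unsigned idx = 0; idx < num_elements; ++idx) {
        out_array[idx] = in_array[idx];
    }
}
```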