3.5.2.9 Optimizing AXI4 Initiator Performance
An AXI4 Initiator interface can be used to allow an HLS module to access an external memory while it is running, potentially saving time compared to other methods where data needs to be fully transferred first before the module can start. SmartHLS™ allows for optimization of the AXI4 Initiator transactions using pragmas. For more information on AXI4 Initiator interface arguments, see AXI4 Initiator Interface. For more information on the pragmas used in this section, see HLS Pragmas Manual.
In the example below, the interface pragmas for in_array and out_array declare that memory transfers to these external memories should be implemented using an AXI4 Initiator interface. The interface pragma for default declares that the HLS module control (see Module Control Interface), as well as the pointer addresses, are implemented with an AXI4 Target interface.
// An unoptimized version of the core (the function name copy_array is assumed here).
void copy_array(unsigned *in_array, unsigned *out_array, unsigned num_elements) {
#pragma HLS function top
#pragma HLS interface default type(axi_target)
#pragma HLS interface argument(in_array) type(axi_initiator) \
    num_elements(NUM_ELEMENTS)
#pragma HLS interface argument(out_array) type(axi_initiator) \
    num_elements(NUM_ELEMENTS)
    for (unsigned idx = 0; idx < num_elements; ++idx) {
        out_array[idx] = in_array[idx];
    }
}
// An optimized version of the same core.
void copy_array_optimized(unsigned *in_array, unsigned *out_array, unsigned num_elements) {
#pragma HLS function top
#pragma HLS interface default type(axi_target)
// (2) Specify maximum burst length in the interface pragma
// (3) Specify maximum number of outstanding transactions via interface pragma
#pragma HLS interface argument(in_array) type(axi_initiator) \
    num_elements(NUM_ELEMENTS) max_burst_len(256) max_outstanding_reads(2)
#pragma HLS interface argument(out_array) type(axi_initiator) \
    num_elements(NUM_ELEMENTS) max_burst_len(256) max_outstanding_writes(2)
    // (1) Pipeline the loop to infer an AXI burst transfer
#pragma HLS loop pipeline
    for (unsigned idx = 0; idx < num_elements; ++idx) {
        out_array[idx] = in_array[idx];
    }
}
In the optimized version above, the following optimizations have been made:
- By pipelining the loop that the AXI4 Initiator transaction resides in, SmartHLS can infer an AXI burst. There are specific criteria that must be met to infer a burst transaction (see AXI4 Initiator Interface). An AXI burst has the benefit of only needing to send the address once for multiple data beats, so it can improve performance by sending less data overall, and spending less time waiting for a response from the external memory. It is important to note that when SmartHLS infers an AXI4 Initiator burst transfer, it will also infer FIFOs on the AXI4 RDATA and WDATA signals (of size equal to the maximum burst length). This is to ensure that the HLS module and the external memory will not stall each other mid-burst while processing data.
- For AXI4 Initiator burst transactions, the maximum burst length can be configured using the interface pragma. Setting this to a larger number (max 256, default is 16) can improve the burst performance at the cost of holding the AXI channel for a longer time per burst. The maximum burst length also determines the size of the RDATA/WDATA FIFOs inside the HLS module, so a larger burst length will use more memory.
- For AXI4 Initiator burst transactions, the maximum number of outstanding transactions can be configured using the interface pragma. Setting this to a larger number (maximum 8, default is 1) can improve performance by sending burst requests earlier and potentially reducing the cases where the HLS module stalls waiting for data, at the cost of larger RDATA/WDATA FIFOs.
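To make the burst-length trade-off in the list above concrete, the following is a minimal sketch of the arithmetic rather than anything provided by SmartHLS. The helper name bursts_needed is hypothetical. The model assumes every burst except possibly the last is full length, and it ignores address alignment and the AXI rule that a burst must not cross a 4 KB boundary.

```cpp
#include <cassert>

// Hypothetical helper (not a SmartHLS API): number of AXI bursts, and
// therefore address phases, needed to move num_beats data beats when each
// burst carries at most max_burst_len beats.
// Assumption: all bursts are full length except possibly the last one.
static unsigned bursts_needed(unsigned num_beats, unsigned max_burst_len) {
    // The interface pragma accepts burst lengths up to 256.
    assert(max_burst_len >= 1 && max_burst_len <= 256);
    unsigned full_bursts = num_beats / max_burst_len;
    unsigned has_partial = (num_beats % max_burst_len != 0) ? 1u : 0u;
    return full_bursts + has_partial;
}
```

Under these assumptions, copying 1024 elements with the default max_burst_len of 16 takes 64 address phases per direction, while max_burst_len(256) takes 4. The larger setting buys fewer address phases at the cost of deeper RDATA/WDATA FIFOs.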
On the CPU side, the buffers passed to the HLS module are allocated with the hls_alloc library, as shown in the example below. For more information on the hls_alloc library, see Memory Allocation Library.
int main() {
    // Allocate/initialize cpu-side memory
    // (4) Choose a cpu-side allocation region that best suits the transfer
    unsigned *memory_in = (unsigned *)hls_malloc(
        sizeof(unsigned) * NUM_ELEMENTS, HLS_ALLOC_CACHED);
    unsigned *memory_out = (unsigned *)hls_malloc(
        sizeof(unsigned) * NUM_ELEMENTS, HLS_ALLOC_CACHED);
    for (unsigned idx = 0; idx < NUM_ELEMENTS; idx++) {
        memory_in[idx] = idx + 1;
    }
    // Run the accelerator
    copy_array_optimized(memory_in, memory_out, NUM_ELEMENTS);
}
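The example above runs the accelerator but does not inspect the result. As a sketch of how the CPU side might verify the transfer, the helper below counts output elements that differ from the idx + 1 pattern written into memory_in. The name count_mismatches is hypothetical and is not part of SmartHLS or the hls_alloc library.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns how many elements of memory_out differ from
// the pattern the example wrote into memory_in (element idx holds idx + 1).
static unsigned count_mismatches(const unsigned *memory_out,
                                 unsigned num_elements) {
    assert(memory_out != NULL);
    unsigned mismatches = 0;
    for (unsigned idx = 0; idx < num_elements; idx++) {
        if (memory_out[idx] != idx + 1) {
            mismatches++;
        }
    }
    return mismatches;
}
```

A call such as count_mismatches(memory_out, NUM_ELEMENTS) placed after copy_array_optimized would return 0 for a correct copy, and main could return a nonzero status otherwise.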
The recommended AXI4 Initiator configuration and memory region choice can vary greatly depending on the system in which the HLS module is included. For SoC projects using the PolarFire SoC Icicle Kit, we recommend the following settings for best AXI4 Initiator performance:
- AXI4 Initiator bursting is recommended when the code is in a format that is supported.
- A larger max_burst_len is recommended for best performance (best performance at 256).
- The recommended max_outstanding_reads is 2, and the recommended max_outstanding_writes is 1.
- For moving data between MSS DDR and a SmartHLS module, it is recommended to use the AXI4 Initiator interface, and to store the data in the HLS_ALLOC_NONCACHED memory region.
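As an illustration, the recommended Icicle Kit settings can be applied to the copy core from earlier in this section roughly as follows. This is a sketch only: the function name copy_array_icicle and the NUM_ELEMENTS value are assumptions made for the example, and the pragma options are the ones already shown above. A standard C++ compiler ignores the HLS pragmas, so the function can also be run as plain software.

```cpp
// Assumed buffer size, for illustration only.
#define NUM_ELEMENTS 1024

// Hypothetical variant of the copy core using the settings recommended
// for the PolarFire SoC Icicle Kit.
void copy_array_icicle(unsigned *in_array, unsigned *out_array,
                       unsigned num_elements) {
#pragma HLS function top
#pragma HLS interface default type(axi_target)
// Largest burst length, 2 outstanding reads, 1 outstanding write.
#pragma HLS interface argument(in_array) type(axi_initiator) \
    num_elements(NUM_ELEMENTS) max_burst_len(256) max_outstanding_reads(2)
#pragma HLS interface argument(out_array) type(axi_initiator) \
    num_elements(NUM_ELEMENTS) max_burst_len(256) max_outstanding_writes(1)
    // Pipeline the loop so that an AXI burst transfer can be inferred.
#pragma HLS loop pipeline
    for (unsigned idx = 0; idx < num_elements; ++idx) {
        out_array[idx] = in_array[idx];
    }
}
```

On the CPU side, the matching buffers would be allocated with hls_malloc using HLS_ALLOC_NONCACHED instead of HLS_ALLOC_CACHED, in line with the last recommendation.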