3.5.1.19.13 Error Correction Code (ECC) Library

The Error Correction Code (ECC) library offers API functions for exposing ECC signals from the hardware memory. To use the ECC library, include the header file:

#include "hls/ecc.hpp"

Accessing ECC Signals

ECC signals can be accessed through the API function read_ecc. read_ecc takes a pointer to an array element or a FIFO as its argument and returns a data wrapper that contains three elements:
data
The data read from the address of the array element. When a single-bit error is detected, the data output is automatically corrected.
sb_correct
True if a single-bit error was detected and corrected.
db_detect
True if a double-bit error was detected.

In the RTL simulation waveform, both the db_detect and sb_correct flags are asserted when a multi-bit error occurs, and only the sb_correct flag is asserted for a single-bit error (refer to the RTG4 FPGA Fabric User Guide and the PolarFire Family Fabric User Guide for more details). To simplify the C++ design, SmartHLS handles this pattern automatically. The user only needs to check sb_correct=1 for single-bit errors and db_detect=1 for multi-bit errors.
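The flag decoding described above can be pictured with a small software model. This is only a sketch of one plausible reading of the text, not the SmartHLS implementation: the type EccFlags and the helper decode_raw_flags are hypothetical names, and raw_sb / raw_db stand for the flags as they appear in the RTL waveform.

```cpp
// Hypothetical type: the flags as the C++ design sees them after decoding.
struct EccFlags {
    bool sb_correct;  // single-bit error detected and corrected
    bool db_detect;   // multi-bit error detected
};

// Hypothetical helper modelling the waveform pattern described in the text:
//   raw sb_correct only          -> single-bit error
//   raw sb_correct and db_detect -> multi-bit error
EccFlags decode_raw_flags(bool raw_sb, bool raw_db) {
    EccFlags flags;
    flags.db_detect = raw_db;
    // Report a single-bit error only when no multi-bit error is flagged.
    flags.sb_correct = raw_sb && !raw_db;
    return flags;
}
```

With this decoding, a design can test the two flags independently without counting a multi-bit error as a single-bit error as well.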

Important: ECC memory in Microchip devices does not automatically write back corrected data when a single-bit error is detected.
The following is an example of how the API is used.
#include <stdio.h>
#include "hls/ecc.hpp"
#define SIZE 100

using namespace hls;
 
#pragma HLS memory impl variable(x) ecc(true)
int x[SIZE];
 
int f(int c) {
#pragma HLS function top
    int sb_count = 0;
    int db_count = 0;
 
    // Initialize the memory
    for (int i = 0; i < SIZE; i++)
        x[i] = c;
 
    int sum = 0;
    for (int i = 0; i < SIZE; i++) {
        auto ecc_out = read_ecc(&x[i]);
 
        // Check for single-bit error (corrected)
        if(ecc_out.sb_correct) {
            // Write back to correct the contents of the RAM
            x[i] = ecc_out.data;
            sb_count++;
        }
        // Check for double-bit error (detected)
        if(ecc_out.db_detect)
            db_count++;
 
        sum += ecc_out.data;
    }
 
    printf("sb_count: %d\n", sb_count);
    printf("db_count: %d\n", db_count);
    return sum;
}

In the example, variable x has ECC enabled by setting ecc(true) in the memory pragma. In the top-level function f, the elements of x are initialized using the argument c. After the initialization, x is read element-by-element using read_ecc(&x[i]) instead of directly using x[i]. The call to read_ecc returns a data wrapper ecc_out that contains the data read from memory and the ECC signals. The ECC signals are used to increment the corresponding counters sb_count and db_count, and the read data is added to sum.

Note that when a single-bit error is detected (sb_correct = true), the read data output is corrected but the data in the RAM location is not updated. Thus, in the example, the corrected data is manually written back to the RAM for single-bit errors.

Inject ECC Error for Error Simulation

Important: ECC error simulation support is an alpha feature in SmartHLS and is under active development. The ECC error injection feature is not supported for external memories, FIFOs, and Line Buffers in this release.
SmartHLS provides the API function inject_ecc_error for simulating ECC errors. inject_ecc_error takes two parameters:
address
Pointer to the array element
mask
  • Masking bit that will be applied to read data output at the given address
  • Read data bit is flipped where masking bit is 1
    • single-bit error if mask has only one non-zero bit
    • multi-bit error if mask has more than one non-zero bit
Important: address and mask must be compile-time constants. Input of address and mask as variables is not supported. For example, inject_ecc_error(&x[i], mask); is not supported because i and mask are variables.
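The effect of the mask can be illustrated with a plain C++ model that does not use the SmartHLS API. The names InjectedError, count_set_bits, classify_mask and observed_data are hypothetical, and the model assumes 32-bit data: a mask with one set bit produces a single-bit error (which ECC corrects on read), while a mask with several set bits produces a multi-bit error whose flipped bits remain visible in the read data.

```cpp
#include <cstdint>

// Hypothetical classification of an error-injection mask.
enum class InjectedError { None, SingleBit, MultiBit };

// Count the set bits in the mask (portable, no compiler builtins).
int count_set_bits(std::uint64_t mask) {
    int n = 0;
    while (mask != 0) {
        mask &= mask - 1;  // clear the lowest set bit
        ++n;
    }
    return n;
}

// One set bit -> single-bit error, more than one -> multi-bit error.
InjectedError classify_mask(std::uint64_t mask) {
    const int bits = count_set_bits(mask);
    if (bits == 0)
        return InjectedError::None;
    return bits == 1 ? InjectedError::SingleBit : InjectedError::MultiBit;
}

// Data seen by the design in this model: a single-bit flip is corrected,
// a multi-bit flip is passed through with the masked bits inverted.
std::uint32_t observed_data(std::uint32_t stored, std::uint64_t mask) {
    if (classify_mask(mask) == InjectedError::MultiBit)
        return stored ^ static_cast<std::uint32_t>(mask);
    return stored;
}
```

For instance, in this model a stored value of 4 with mask 3 reads back as 7, while a stored value of 3 with mask 1 still reads back as 3 because the single-bit error is corrected.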

The generated Verilog file that contains all the error injection tasks is located in hls_output/simulation/generated_include_file.v.

The following example shows how ECC error simulation works.

#include <stdio.h>
#include "hls/ecc.hpp"
#define SIZE 100
 
using namespace hls;
 
int fct(int &sb_count, int &db_count) {
#pragma HLS function top
 
    #pragma HLS memory impl variable(x) ecc(true)
    int x[SIZE];
    for (int i = 0; i < SIZE; i++) {
        x[i] = i;
    }
 
    inject_ecc_error(&x[3], /*mask 'b1*/ 1);
    inject_ecc_error(&x[4], /*mask 'b11*/ 3);
    inject_ecc_error(&x[99], /*mask 'b10001*/ 17);
 
    int sum = 0;
    for (int i = 0; i < SIZE; i++) {
        auto ecc_info = read_ecc(&x[i]);
        if (ecc_info.sb_correct) {
            sb_count++;
            printf("sb_correct at x[%d], data = %d\n", i, ecc_info.data);
        }
        else if (ecc_info.db_detect) {
            db_count++;
            printf("db_detect  at x[%d], data = %d\n", i, ecc_info.data);
        }
 
        sum += ecc_info.data;
    }
    return sum;
}

In this example, x is a local memory with ECC enabled and initialized with incremental data. In the top-level function fct, inject_ecc_error is called three times to set error masks for x[3], x[4] and x[99]. Single-bit or multi-bit errors are injected at the given addresses based on the mask value (see code comments). In the next for loop, x is read element-by-element using read_ecc(&x[i]). A message is printed when a single-bit or double-bit error is detected, and the read data is added to sum.

The output of the example is shown below. Note that the read data for x[4] and x[99] has been changed based on the mask set in inject_ecc_error, while the single-bit error at x[3] has been corrected.
# sb_correct at x[  3], data =           3
# db_detect  at x[  4], data =           7
# db_detect  at x[ 99], data =         114

Working with ECC Enabled Structs

SmartHLS provides memory optimizations for structs, such as bit-packing and partitioning by struct fields. These optimizations require care when combined with error injection.

If a struct is partitioned by struct fields, the recommended approach is to inject errors per struct field. The following example shows the case when struct ST is automatically partitioned by struct fields (see Access-Based Memory Partitioning for more details). inject_ecc_error is applied on each struct field to match the RTL behavior.

On the other hand, when enabling bit-packing on ECC structs, all struct fields will be merged into one RAM block. It is recommended to inject error per struct to match the RTL behavior. In the example below, mask 0x300000003ULL is applied on the entire struct.
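To see why a whole-struct mask such as 0x300000003ULL touches every field of a packed struct, consider the following plain C++ sketch. It is an illustration under stated assumptions, not the SmartHLS packing scheme: it assumes a struct with two 32-bit fields where the first field occupies the low 32 bits of the packed word, and the names FieldMasks and split_packed_mask are hypothetical.

```cpp
#include <cstdint>

// Hypothetical view of a 64-bit struct-wide mask as two per-field masks.
struct FieldMasks {
    std::uint32_t low;   // mask bits landing on the field in bits [31:0]
    std::uint32_t high;  // mask bits landing on the field in bits [63:32]
};

// Split a struct-wide mask into the parts that affect each 32-bit field.
// Assumption: two 32-bit fields, first field in the low half of the word.
FieldMasks split_packed_mask(std::uint64_t mask) {
    FieldMasks m;
    m.low = static_cast<std::uint32_t>(mask & 0xFFFFFFFFULL);
    m.high = static_cast<std::uint32_t>(mask >> 32);
    return m;
}
```

Under these assumptions, 0x300000003ULL flips the two lowest bits of each field, which is why a single injection on the packed struct corresponds to one error event covering both fields.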

Limitations

The limitations of error injection are as follows:
  • On the software side, the error injection mask is only applied to read data when the RAM is accessed with read_ecc. Access with operator[] returns the original value without any error. To ensure that SW/HW Co-Simulation behaves correctly, it is recommended to use ECC_RAM or to always access the RAM with read_ecc, which avoids mismatches between software and hardware results.
  • Currently, error injection calls are only supported in the hardware top-level function and all of its descendant functions. Error injection calls in the software testbench are ignored. See Specifying the Top-level Function for details on the software testbench and the top-level function.
  • ECC error injection calls may affect the number of simulation cycles, depending on the memory access pattern. Each inject_ecc_error call is treated as a write operation.
  • address and mask must be compile-time constants, and the maximum supported size for the error injection mask is 64 bits. Passing address and mask as variables is not supported. For example, inject_ecc_error(&x[i], mask); is not supported because i and mask are variables.
  • If a memory is partitioned into individual elements (registers), the ECC error injection logic is optimized away. This may result in a mismatch between software and hardware read data.

ECC RAM Wrapper

The ECC library also offers a C++ wrapper, ECC_RAM, that encapsulates some of the common error-handling functionality. ECC_RAM is a pure C++ implementation using read_ecc that showcases how access to the low-level ECC signals can be abstracted and used seamlessly in the design.
ECC_RAM<data_type, depth, SB_WRITE_BACK, DB_OVERRIDE, DB_DEFAULT> ecc_ram
ECC_RAM uses template parameters to configure the memory and error handling:
Data type
the element type of the memory.
Depth
the number of elements in the memory.
SB_WRITE_BACK
if true, when a single-bit error is detected, the corrected value is immediately written back to correct the corrupted data in the memory.
CAUTION: Enabling immediate write-back using SB_WRITE_BACK can affect performance, since the load operation can invoke an immediate store to the same address. This should be taken into consideration if latency is critical (e.g. in a loop pipeline).
DB_OVERRIDE
if true, a default value is returned instead of the corrupted data when a double-bit error is detected.
DB_DEFAULT
the default value used when a double-bit error is detected.
The example below illustrates error handling using ECC_RAM.
#include <hls/ecc.hpp>
#include <stdio.h>

using namespace hls;

#define __REPORT_ECC__

#define N 1000
#pragma HLS memory impl variable(ecc_ram) ecc(true)
ECC_RAM<int,  // data type
        N,    // depth
        true, // SB_WRITE_BACK
        true, // DB_OVERRIDE
        -1    // DB_DEFAULT
        >
    ecc_ram;

// ----- Top function: Read i, j and write to k
int f(int i, int j, int k, int val) {
#pragma HLS function top

    // ----- Reading and handling errors implicitly
    int d_i = ecc_ram[i];

    // ----- Reading and handling errors explicitly
    int d_j = 0;
    if (!ecc_ram.read(j, d_j))
        d_j = -2;

    int sum = d_i + d_j;

    // ----- Writing
    ecc_ram[k] = val;

    // ----- Reporting
    auto sb_count = ecc_ram.sb_count();

    // ----- Scrubbing after a certain number of SB errors
    if (ecc_ram.sb_count() > N / 2)
        ecc_ram.scrub();

    return sum;
}

In the example, ecc_ram uses ECC_RAM to instantiate an int array with 1000 elements.

ECC_RAM can be accessed like a C++ array using []. However, the implementation uses read_ecc to access the data and the ECC signals, and handles errors based on the template configuration. With the configuration in the example:

  • ecc_ram[i] will read the data and implicitly handle errors:
    • If a single-bit error is detected, write back the corrected data to the RAM and return the corrected data.
    • If a double-bit error is detected, discard the read data and return the DB_DEFAULT value -1.
  • ecc_ram.read(j, d_j) will read the data at j and return the RAM data in d_j (correct or erroneous) to the caller:
    • If a single-bit error is detected, write back the corrected data to the RAM and set d_j to the read data.
    • If a double-bit error is detected, set d_j to the erroneous read data.
    • Return true if the data is correct, false if a double-bit error is detected.
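The two read paths listed above can be summarized with a small software model. This is a sketch of the documented behavior only and does not reproduce the ECC_RAM source: ReadOutcome, model_subscript_read and model_explicit_read are hypothetical names, and the data, sb_correct and db_detect arguments stand for the values that read_ecc would return for the accessed element.

```cpp
// Hypothetical summary of what one read does in this model.
struct ReadOutcome {
    int value;        // value handed to the caller
    bool ok;          // false when a double-bit error was detected
    bool write_back;  // true when the corrected data is stored back to the RAM
};

// Models ecc_ram[i]: errors are handled implicitly based on the template flags.
template <bool SB_WRITE_BACK, bool DB_OVERRIDE, int DB_DEFAULT>
ReadOutcome model_subscript_read(int data, bool sb_correct, bool db_detect) {
    ReadOutcome out;
    out.write_back = SB_WRITE_BACK && sb_correct;
    out.ok = !db_detect;
    // On a double-bit error the read data is replaced by the default value.
    out.value = (db_detect && DB_OVERRIDE) ? DB_DEFAULT : data;
    return out;
}

// Models ecc_ram.read(i, d): the RAM data is always passed to the caller,
// and the boolean result tells whether it can be trusted.
template <bool SB_WRITE_BACK>
ReadOutcome model_explicit_read(int data, bool sb_correct, bool db_detect) {
    ReadOutcome out;
    out.write_back = SB_WRITE_BACK && sb_correct;
    out.ok = !db_detect;
    out.value = data;  // correct or erroneous, left to the caller to handle
    return out;
}
```

With the configuration from the example (SB_WRITE_BACK = true, DB_OVERRIDE = true, DB_DEFAULT = -1), the subscript model returns -1 on a double-bit error, whereas the explicit model returns the erroneous data together with ok = false so the caller can choose its own fallback.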

ECC_RAM has internal counters that count the number of single-bit and double-bit errors across all read operations. The counters can be accessed using ecc_ram.sb_count() and ecc_ram.db_count(). The counters can be reset using ecc_ram.reset_counters().

ecc_ram.scrub() scrubs the memory by reading it element-by-element and writing each element back. This can be useful when there are many errors, to refresh the entries with single-bit errors and avoid further corruption.

ECC_RAM can report the error handling process to the standard output when __REPORT_ECC__ is defined.

-- DB error detected at 0       Overriding with default value -1
- SB error corrected at 2       Writing back corrected value

Warning: ECC_RAM is a wrapper around the array that represents the actual memory. In this release, it is not possible to apply memory optimizations (e.g. partitioning) to the underlying data.

Here is a summary of all the API functions, listed as class method followed by description:
ECC_RAM<data_type, depth, SB_WRITE_BACK, DB_OVERRIDE, DB_DEFAULT>()
Create a new ECC RAM with the specified parameters.
operator[i]
Read/write data at index i and handle errors based on the configuration.
bool read(i, d_i)
Read data at index i and save the read data from memory into d_i. Return true if the data is correct, false otherwise.
int sb_count()
Return the number of single-bit errors detected and corrected.
int db_count()
Return the number of double-bit errors detected.
void reset_counters()
Reset both single-bit and double-bit counters to 0.
void scrub()
Read the memory element-by-element and write each element back. Note that scrub() automatically calls reset_counters() to reset the error counters.
void inject_error(addr, mask)
Inject an error at the given address based on the masking bits.