12.15 FIR Filter Example Project

The DSP_Intrinsics example project compares a FIR filter implementation based on DSP Built-in Functions with traditional C code that uses no DSP optimizations. First, the filter is implemented for Q15 input data held in arrays of short variables:

short coeff[BUFSIZE] __attribute__ ((aligned(4)));
short delay[BUFSIZE] __attribute__ ((aligned(4)));

The traditional C code version for Q15 inputs doesn't use SIMD variables or DSP Built-in Functions. Instead, it implements the Q15 x Q15 multiplication by first checking the saturation condition (both operands are 0x8000, which is -1.0 in Q15) and otherwise left shifting the integer multiplication result by 1 bit before adding it to the accumulator.

long long FIR_Filter_Traditional_16(short *delay, short *coeff, int buflen)
{
    int i;
    short x, y;
    
    // 64-bit accumulator for result
    long long ac0 = 0;
    
    for (i = 0; i < buflen; i++) {
        x = coeff[i];
        y = delay[i];
        
        // check saturation condition
        if ((unsigned short)x == 0x8000 && (unsigned short)y == 0x8000) {
            ac0 += 0x7fffffff;
        } else {
            // multiply (Q15 x Q15) needs left shift
            // result is added to accumulator variable
            ac0 += ((x * y) << 1);
        }
    }
    return ac0;
}

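For this C implementation the compiler does not have enough information to auto-vectorize the loop, so it generates inefficient assembly code in which each product is multiplied, shifted, and added to the 64-bit accumulator with separate instructions.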

# Illustrative generated assembly code
FIR_Filter_Traditional_16:
	…
	mul	$24,$13,$15
	sll	$25,$24,1
	addu	$12,$2,$25
	sra	$9,$25,31
	sltu	$11,$12,$2
	addu	$3,$3,$9
	move	$2,$12
	addu	$3,$11,$3
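
The Q15 multiplication step described above can be checked in isolation on a host machine. The following is a minimal sketch, not code from the example project: the helper name q15_mul_sat is hypothetical, and fixed-width types from stdint.h stand in for short and int. It multiplies by 2 instead of shifting, which gives the same result while avoiding a left shift of a negative value.

```c
#include <stdint.h>

/* Hypothetical helper (not part of the example project):
   Q15 x Q15 -> Q31 product with the saturation rule described above. */
int32_t q15_mul_sat(int16_t x, int16_t y)
{
    /* -1.0 * -1.0 = +1.0 cannot be represented in Q31, so clamp it */
    if (x == INT16_MIN && y == INT16_MIN) {
        return INT32_MAX;               /* 0x7fffffff */
    }
    /* The integer product is in Q30 format; doubling it yields Q31.
       No other operand pair can overflow 32 bits after doubling. */
    return (int32_t)x * (int32_t)y * 2;
}
```

For example, 0x4000 represents 0.5 in Q15, so q15_mul_sat(0x4000, 0x4000) returns 0x20000000, which is 0.25 in Q31.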

In the DSP Intrinsics version of the filter, the input buffers are cast to the v2q15 SIMD vector type defined above in section SIMD Variables. Inside the loop, the "__builtin_mips_dpaq_s_w_ph" DSP built-in function is called once per pair of Q15 values. The result of each call is stored in the accumulator variable, of type Q32.31 (64-bit), represented as integer type a64 (see definition in section Integer representation of Q15 and Q31).

a64 FIR_Filter_Intrinsics_16(short *delay, short *coeff, int buflen)
{
    int i;    
    v2q15 *my_delay = (v2q15 *)delay;
    v2q15 *my_coeffs = (v2q15 *)coeff;
    // 64-bit accumulator for result
    a64 ac0 = 0;
    for (i = 0; i < buflen/2; i++) {
        ac0 = __builtin_mips_dpaq_s_w_ph (ac0,
                                          my_delay[i],
                                          my_coeffs[i]);
    }
    return ac0;
}

This function generates "dpaq_s.w.ph" assembly DSP instructions that apply the "Dot Product with Accumulate" operation on two sets of Q15 values. The result is stored in one of the four 64-bit accumulators in the DSP-enhanced core.

# Illustrative generated assembly code
FIR_Filter_Intrinsics_16:
	…
	mtlo	$0
	mthi	$0
	addiu	$8,$7,-1
	li	$6,1
	andi	$10,$8,0x7
	dpaq_s.w.ph	$ac0,$2,$3
	addiu	$3,$4,4
	beq	$6,$7,.L60
	addiu	$2,$5,4
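
To make the "Dot Product with Accumulate" operation concrete, the sketch below models it in portable C so that it can be run on a host without a DSP-enhanced core. It is an illustration under stated assumptions, not the project's code: the names mul_q15_sat, dpaq_s_w_ph_model and fir_model_16 are hypothetical, a plain 64-bit integer stands in for the hardware accumulator, the instruction's overflow flag is not modeled, and the buffer length is assumed to be even.

```c
#include <stdint.h>

/* Q15 x Q15 -> Q31 with saturation of the -1.0 * -1.0 case */
static int32_t mul_q15_sat(int16_t a, int16_t b)
{
    if (a == INT16_MIN && b == INT16_MIN) {
        return INT32_MAX;
    }
    return (int32_t)a * (int32_t)b * 2;
}

/* Hypothetical host-side model of one dot-product-accumulate step:
   the two Q15 products of a pair are added to the 64-bit accumulator. */
int64_t dpaq_s_w_ph_model(int64_t acc, const int16_t a[2], const int16_t b[2])
{
    acc += mul_q15_sat(a[0], b[0]);
    acc += mul_q15_sat(a[1], b[1]);
    return acc;
}

/* FIR sum built from the model; buflen is assumed to be even */
int64_t fir_model_16(const int16_t *delay, const int16_t *coeff, int buflen)
{
    int64_t acc = 0;
    for (int i = 0; i + 1 < buflen; i += 2) {
        /* one step consumes two delay values and two coefficients */
        acc = dpaq_s_w_ph_model(acc, &delay[i], &coeff[i]);
    }
    return acc;
}
```

Because each step consumes two samples, the loop runs half as many iterations as the traditional version, which matches the buflen/2 bound in the intrinsics function.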

The project targets the PIC32MZ2048EFM144 device. The tools used are MPLAB X IDE v3.10, the MPLAB XC32 v1.40 compiler and the PIC32 MZ EF Starter Kit/Simulator. The optimization level is set to "-O3" with the "-funroll-loops" option enabled. Using an internal timer to count the ticks during calls to the two functions operating on the same Q15 data buffers reveals that the Intrinsics version is approximately 4.52 times faster than the traditional C version.

Example timing output for input data buffers of size 2048:

16-bit without DSP Intrinsics: timer ticks 1180
16-bit with DSP Intrinsics: timer ticks: 261

The project also includes filter implementations for buffers of 32-bit integer data, using approaches similar to those of the previous 16-bit versions.

The first implementation uses the multiply and add operators. One 32-bit operand is cast to 64-bit before the multiplication, as described previously in 12.13.2 Multiply and Add "32-bit int * 32-bit int + 64-bit long long = 64-bit long long int":

for (i = 0; i < buflen; i++) {
	acc += (long long)data[i] * coeff[i];
}

# Illustrative generated assembly code
FIR_Filter_32:
	…
	lwx	$24,$2($4)
	lwx	$25,$2($5)
	addiu	$2,$2,4
	madd	$ac0,$24,$25
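
The role of the 64-bit cast can be seen in a stand-alone version of the loop. This is a minimal sketch that assumes a function name (fir_filter_32) and fixed-width types from stdint.h, neither of which comes from the project. Widening one operand before the multiplication makes the product a 64-bit value; without the cast the multiplication would be carried out in 32 bits and could overflow.

```c
#include <stdint.h>

/* Hypothetical stand-alone version of the 32-bit multiply-add loop */
int64_t fir_filter_32(const int32_t *data, const int32_t *coeff, int buflen)
{
    int64_t acc = 0;
    for (int i = 0; i < buflen; i++) {
        /* Widen one operand first so the multiply is done in 64 bits */
        acc += (int64_t)data[i] * coeff[i];
    }
    return acc;
}
```

For instance, 0x40000000 * 4 equals 2^32, which does not fit in a 32-bit int but is accumulated correctly here.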

The Intrinsics variant uses the "__builtin_mips_madd" built-in function to perform the multiply-add.

for (i = 0; i < buflen; i++) {
	acc = __builtin_mips_madd (acc, data[i], coeff[i]);
}

# Illustrative generated assembly code
FIR_Filter_Intrinsics_32:
	…
	lw	$17,0($3)
	lw	$24,0($2)
	addiu	$7,$7,1
	addiu	$3,$3,4
	addiu	$2,$2,4
	madd	$ac0,$17,$24
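
If the same source also has to build for a target or host without the DSP extension, the built-in call can be wrapped behind a preprocessor check. The sketch below is one possible arrangement, not the project's code: the names madd_step and fir_filter_madd_32 are hypothetical, the a64 typedef is assumed to match the definition referenced earlier in this chapter, and the check relies on GCC defining the __mips_dsp macro when the DSP ASE is enabled, which should be confirmed for the compiler version in use.

```c
/* Assumed to match the a64 definition referenced earlier in this chapter */
typedef long long a64;

/* Hypothetical wrapper: one multiply-add step */
static inline a64 madd_step(a64 acc, int x, int y)
{
#if defined(__mips_dsp)
    /* DSP ASE enabled: use the built-in function */
    return __builtin_mips_madd(acc, x, y);
#else
    /* Portable fallback: 64-bit multiply and add operators */
    return acc + (long long)x * y;
#endif
}

a64 fir_filter_madd_32(const int *data, const int *coeff, int buflen)
{
    a64 acc = 0;
    for (int i = 0; i < buflen; i++) {
        acc = madd_step(acc, data[i], coeff[i]);
    }
    return acc;
}
```

On a host build the fallback path is compiled, so the arithmetic can be unit tested before running on the device.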

The tick counts for the 32-bit implementations are very similar, because in both cases the compiler now generates the "MADD" DSP instruction.

Example timing output for input data buffers of size 2048:

32-bit without DSP Intrinsics: timer ticks 520
32-bit with DSP Intrinsics: timer ticks 484
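
The speedup figures follow directly from the tick counts. As a small worked example, the hypothetical helper below (not part of the project) divides the baseline tick count by the optimized one.

```c
/* Hypothetical helper: ratio of baseline ticks to optimized ticks.
   Returns 0.0 when the optimized count is zero to avoid dividing by zero. */
double speedup(unsigned baseline_ticks, unsigned optimized_ticks)
{
    if (optimized_ticks == 0) {
        return 0.0;
    }
    return (double)baseline_ticks / (double)optimized_ticks;
}
```

With the tick counts listed above, speedup(1180, 261) is about 4.52 for the 16-bit versions and speedup(520, 484) is about 1.07 for the 32-bit versions.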