12.15 FIR Filter Example Project
The DSP_Intrinsics example project shows the advantages of a FIR filter implementation based on DSP Built-in Functions compared to traditional C code, without any DSP optimizations. First, the filter is implemented for Q15 input data represented as arrays of short variables:
short coeff[BUFSIZE] __attribute__ ((aligned(4)));
short delay[BUFSIZE] __attribute__ ((aligned(4)));
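For readers less familiar with the format: Q15 stores a value in the range [-1, 1) as a 16-bit signed integer scaled by 2^15, and a Q15 x Q15 product is naturally a Q31 value scaled by 2^31. The following is a minimal sketch of the conversions, not part of the example project; the helper names double_to_q15, q15_to_double and q31_to_double are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: convert a real value to Q15, clamping to the
 * representable range [-1.0, 1.0 - 2^-15] and rounding to nearest. */
static int16_t double_to_q15(double v)
{
    double scaled = v * 32768.0;            /* scale by 2^15 */
    if (scaled >= 32767.0)
        return INT16_MAX;                   /* clamp at 0x7fff */
    if (scaled <= -32768.0)
        return INT16_MIN;                   /* clamp at 0x8000, i.e. -1.0 */
    return (int16_t)(scaled < 0 ? scaled - 0.5 : scaled + 0.5);
}

/* Hypothetical helper: interpret a Q15 value as a real number. */
static double q15_to_double(int16_t q)
{
    return q / 32768.0;
}

/* Hypothetical helper: interpret a Q31-scaled accumulator as a real number. */
static double q31_to_double(int64_t acc)
{
    return (double)acc / 2147483648.0;      /* divide by 2^31 */
}
```

With these helpers, 0.5 maps to 0x4000 and -1.0 maps to 0x8000, the only value whose square cannot be represented in Q31.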
The traditional C version for Q15 inputs does not use SIMD variables or DSP Built-in Functions. Instead, it implements each Q15 x Q15 multiplication explicitly. It first checks the saturation condition (both operands equal 0x8000, which represents -1.0 in Q15) and, in that case, adds the saturated value 0x7fffffff to the accumulator. Otherwise it left-shifts the integer product by 1 bit to obtain a Q31 value before adding it to the accumulator.
long long FIR_Filter_Traditional_16(short *delay, short *coeff, int buflen)
{
    int i;
    short x, y;
    // 64-bit accumulator for result
    long long ac0 = 0;
    for (i = 0; i < buflen; i++) {
        x = coeff[i];
        y = delay[i];
        // check saturation condition
        if ((unsigned short)x == 0x8000 && (unsigned short)y == 0x8000) {
            ac0 += 0x7fffffff;
        } else {
            // multiply (Q15 x Q15) needs left shift
            // result is added to accumulator variable
            ac0 += ((x * y) << 1);
        }
    }
    return ac0;
}
For this C implementation the compiler generates inefficient assembly code. It does not have enough information to auto-vectorize the loop.
# Illustrative generated assembly code
FIR_Filter_Traditional_16:
…
mul $24,$13,$15
sll $25,$24,1
addu $12,$2,$25
sra $9,$25,31
sltu $11,$12,$2
addu $3,$3,$9
move $2,$12
addu $3,$11,$3
In the DSP Intrinsics version of the filter, the input buffers are cast to the v2q15 SIMD vector type defined above in section SIMD Variables. Inside the loop, the "__builtin_mips_dpaq_s_w_ph" DSP built-in function is called. The result of the call is stored in the accumulator variable, of type Q32.31 (64-bit), represented as integer type a64 (see definition in section Integer representation of Q15 and Q31).
a64 FIR_Filter_Intrinsics_16(short *delay, short *coeff, int buflen)
{
    int i;
    v2q15 *my_delay = (v2q15 *)delay;
    v2q15 *my_coeffs = (v2q15 *)coeff;
    // 64-bit accumulator for result
    a64 ac0 = 0;
    for (i = 0; i < buflen/2; i++) {
        ac0 = __builtin_mips_dpaq_s_w_ph (ac0,
                                          my_delay[i],
                                          my_coeffs[i]);
    }
    return ac0;
}
This function generates "dpaq_s.w.ph" assembly DSP instructions that apply the "Dot Product with Accumulate" operation on two sets of Q15 values. The result is stored in one of the four 64-bit accumulators in the DSP-enhanced core.
# Illustrative generated assembly code
FIR_Filter_Intrinsics_16:
…
mtlo $0
mthi $0
addiu $8,$7,-1
li $6,1
andi $10,$8,0x7
dpaq_s.w.ph $ac0,$2,$3
addiu $3,$4,4
beq $6,$7,.L60
addiu $2,$5,4
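To make the effect of the packed instruction easier to follow, the arithmetic of one "dpaq_s.w.ph" step can be modeled in portable C. This is only a sketch, assuming the behavior described above (two Q15 x Q15 products, each shifted left by 1 bit and saturated when both operands are 0x8000, then added to a 64-bit accumulator). The names q15_mul_sat, dpaq_model and fir_q15_model are hypothetical and do not appear in the example project.

```c
#include <stdint.h>

/* One Q15 x Q15 product widened to Q31, saturating the single overflow case. */
static int32_t q15_mul_sat(int16_t a, int16_t b)
{
    if (a == INT16_MIN && b == INT16_MIN)
        return INT32_MAX;               /* -1.0 * -1.0 does not fit in Q31 */
    /* Multiplying by 2 is the 1-bit left shift, written so that it is
     * well defined for negative products too. */
    return (int32_t)a * (int32_t)b * 2;
}

/* Hypothetical model of one packed step: two lanes, one accumulator. */
static int64_t dpaq_model(int64_t acc, const int16_t x[2], const int16_t y[2])
{
    acc += q15_mul_sat(x[0], y[0]);
    acc += q15_mul_sat(x[1], y[1]);
    return acc;
}

/* Hypothetical model of the whole loop: consumes the buffers two
 * elements at a time, so buflen is assumed to be even. */
static int64_t fir_q15_model(const int16_t *delay, const int16_t *coeff,
                             int buflen)
{
    int64_t acc = 0;
    for (int i = 0; i + 1 < buflen; i += 2)
        acc = dpaq_model(acc, &delay[i], &coeff[i]);
    return acc;
}
```

A model like this can be run on a host to cross-check results from the device, for example by comparing its output with that of the traditional C version on the same buffers.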
The project targets the PIC32MZ2048EFM144 device. The tools used are MPLAB X IDE v3.10, the MPLAB XC32 v1.40 compiler and the PIC32 MZ EF Starter Kit/Simulator. The optimization level is set to "-O3" with the "-funroll-loops" option enabled. Counting ticks of an internal timer across calls to the two functions, both operating on the same Q15 data buffers, shows that the Intrinsics version is approximately 4.52 times faster than the traditional C version.
Example timing output for input data buffers of size 2048:
16-bit without DSP Intrinsics: timer ticks 1180
16-bit with DSP Intrinsics: timer ticks 261
The project also includes filter implementations for buffers of 32-bit integer data, using approaches similar to those of the 16-bit versions above.
The first implementation uses the multiply and add operators. The 32-bit data value is cast to 64 bits before the multiplication, as described previously in 12.13.2 Multiply and Add "32-bit int * 32-bit int + 64-bit long long = 64-bit long long int":
for (i = 0; i < buflen; i++) {
    acc += (long long)data[i] * coeff[i];
}
# Illustrative generated assembly code
FIR_Filter_32:
…
lwx $24,$2($4)
lwx $25,$2($5)
addiu $2,$2,4
madd $ac0,$24,$25
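The reason for the cast can be illustrated with a small host-side sketch. Casting one operand to a 64-bit type makes the whole multiplication 64-bit, so products that exceed 32 bits are preserved. The helper names mac32_wide and product_low32 are hypothetical and are used here only for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: one multiply-accumulate step with a 64-bit product. */
static int64_t mac32_wide(int64_t acc, int32_t a, int32_t b)
{
    /* The cast widens a; b is then converted implicitly, so the
     * multiplication itself is carried out in 64 bits. */
    return acc + (int64_t)a * b;
}

/* For comparison only: the low 32 bits that a 32-bit multiply would keep.
 * Unsigned arithmetic wraps modulo 2^32, so this is well defined. */
static uint32_t product_low32(int32_t a, int32_t b)
{
    return (uint32_t)a * (uint32_t)b;
}
```

For example, with a = 0x40000000 and b = 4 the true product is 2^32: mac32_wide(0, a, b) returns 4294967296, while product_low32(a, b) returns 0 because the upper bits are lost.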
The Intrinsics variant uses the "__builtin_mips_madd" built-in function to perform the multiply-add.
for (i = 0; i < buflen; i++) {
    acc = __builtin_mips_madd (acc, data[i], coeff[i]);
}
# Illustrative generated assembly code
FIR_Filter_Intrinsics_32:
…
lw $17,0($3)
lw $24,0($2)
addiu $7,$7,1
addiu $3,$3,4
addiu $2,$2,4
madd $ac0,$17,$24
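If the same source also has to build for a target without the DSP extension (for instance, a host-side unit test), the built-in call can be wrapped. The sketch below is one possible arrangement and is not taken from the example project. It assumes that the compiler defines the "__mips_dsp" macro when DSP support is enabled, as GCC-based MIPS compilers do, and that a64 is a typedef for long long. The names madd32 and fir32_sketch are hypothetical.

```c
/* Assumed to match the a64 typedef used with the DSP built-ins. */
typedef long long a64;

/* Hypothetical wrapper: use the DSP built-in when available,
 * otherwise fall back to the equivalent plain C expression. */
static inline a64 madd32(a64 acc, int a, int b)
{
#if defined(__mips_dsp)
    return __builtin_mips_madd(acc, a, b);
#else
    return acc + (a64)a * b;        /* signed 32 x 32 -> 64 multiply-add */
#endif
}

/* Hypothetical 32-bit filter loop built on the wrapper. */
static a64 fir32_sketch(const int *data, const int *coeff, int buflen)
{
    a64 acc = 0;
    for (int i = 0; i < buflen; i++)
        acc = madd32(acc, data[i], coeff[i]);
    return acc;
}
```

Since the plain C form already compiles to "madd" in this project, such a wrapper mainly helps portability rather than performance.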
The tick counts for the 32-bit implementations are very similar (within about 7% of each other). In both cases the compiler generates the "MADD" DSP instruction.
Example timing output for input data buffers of size 2048:
32-bit without DSP Intrinsics: timer ticks 520
32-bit with DSP Intrinsics: timer ticks 484
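The relative speedups implied by the tick counts quoted in this section can be checked with a one-line ratio. The helper name speedup is hypothetical; the inputs are the example figures listed above.

```c
/* Hypothetical helper: how many times faster the optimized run is. */
static double speedup(unsigned baseline_ticks, unsigned optimized_ticks)
{
    return (double)baseline_ticks / (double)optimized_ticks;
}
```

speedup(1180, 261) is approximately 4.52 for the 16-bit filters, while speedup(520, 484) is approximately 1.07 for the 32-bit filters, consistent with the compiler emitting "madd" in both 32-bit variants.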