I'm having trouble understanding the DSP performance of the XCORE-200.

I have an XCORE that (I think) is running at 500 MHz. It's based on the XCORE-200 multichannel reference design, so it's an XU216.

I have installed my DSP code into the ADC pathway using a new core, and I need to run DSP on 32 channels of audio at 48 kHz. Each channel needs to be processed with a 10-tap FIR filter. That's it.

That's 10 taps/channel * 32 channels/frame * 48000 frames/s = 15,360,000 taps/second. Should that be achievable?

I'm getting really slow performance: it's taking 23.5 µs to process one frame, which works out to 320 taps / 23.5 µs ≈ 13,617,000 taps/second, short of the 15,360,000 taps/second I need.

Shouldn't the DSP performance of the XCORE-200 be much better than this? More like 65 million MACs/second? It is 1 MAC per tap, correct?

In order to be as fast and explicit as possible, I didn't use a loop, so here's the DSP code:


    #define N_COEFFS 10
    #define QFMTN 28

    #pragma unsafe arrays
    static inline void processSamples(int samplesIn[], int samplesOut[])
    {
        samplesOut[0]  = dsp_filters_fir(samplesIn[0],  delay_coeffs_3, filter_states_0,  N_COEFFS, QFMTN);
        samplesOut[1]  = dsp_filters_fir(samplesIn[1],  delay_coeffs_3, filter_states_1,  N_COEFFS, QFMTN);
        samplesOut[2]  = dsp_filters_fir(samplesIn[2],  delay_coeffs_2, filter_states_2,  N_COEFFS, QFMTN);
        samplesOut[3]  = dsp_filters_fir(samplesIn[3],  delay_coeffs_2, filter_states_3,  N_COEFFS, QFMTN);
        samplesOut[4]  = dsp_filters_fir(samplesIn[4],  delay_coeffs_1, filter_states_4,  N_COEFFS, QFMTN);
        samplesOut[5]  = dsp_filters_fir(samplesIn[5],  delay_coeffs_1, filter_states_5,  N_COEFFS, QFMTN);
        samplesOut[6]  = dsp_filters_fir(samplesIn[6],  delay_coeffs_0, filter_states_6,  N_COEFFS, QFMTN);
        samplesOut[7]  = dsp_filters_fir(samplesIn[7],  delay_coeffs_0, filter_states_7,  N_COEFFS, QFMTN);
        samplesOut[8]  = dsp_filters_fir(samplesIn[8],  delay_coeffs_3, filter_states_8,  N_COEFFS, QFMTN);
        samplesOut[9]  = dsp_filters_fir(samplesIn[9],  delay_coeffs_3, filter_states_9,  N_COEFFS, QFMTN);
        samplesOut[10] = dsp_filters_fir(samplesIn[10], delay_coeffs_2, filter_states_10, N_COEFFS, QFMTN);
        samplesOut[11] = dsp_filters_fir(samplesIn[11], delay_coeffs_2, filter_states_11, N_COEFFS, QFMTN);
        samplesOut[12] = dsp_filters_fir(samplesIn[12], delay_coeffs_1, filter_states_12, N_COEFFS, QFMTN);
        samplesOut[13] = dsp_filters_fir(samplesIn[13], delay_coeffs_1, filter_states_13, N_COEFFS, QFMTN);
        samplesOut[14] = dsp_filters_fir(samplesIn[14], delay_coeffs_0, filter_states_14, N_COEFFS, QFMTN);
        samplesOut[15] = dsp_filters_fir(samplesIn[15], delay_coeffs_0, filter_states_15, N_COEFFS, QFMTN);
        samplesOut[16] = dsp_filters_fir(samplesIn[16], delay_coeffs_2, filter_states_16, N_COEFFS, QFMTN);
        samplesOut[17] = dsp_filters_fir(samplesIn[17], delay_coeffs_2, filter_states_17, N_COEFFS, QFMTN);
        samplesOut[18] = dsp_filters_fir(samplesIn[18], delay_coeffs_1, filter_states_18, N_COEFFS, QFMTN);
        samplesOut[19] = dsp_filters_fir(samplesIn[19], delay_coeffs_1, filter_states_19, N_COEFFS, QFMTN);
        samplesOut[20] = dsp_filters_fir(samplesIn[20], delay_coeffs_2, filter_states_20, N_COEFFS, QFMTN);
        samplesOut[21] = dsp_filters_fir(samplesIn[21], delay_coeffs_2, filter_states_21, N_COEFFS, QFMTN);
        samplesOut[22] = dsp_filters_fir(samplesIn[22], delay_coeffs_1, filter_states_22, N_COEFFS, QFMTN);
        samplesOut[23] = dsp_filters_fir(samplesIn[23], delay_coeffs_1, filter_states_23, N_COEFFS, QFMTN);
        samplesOut[24] = dsp_filters_fir(samplesIn[24], delay_coeffs_0, filter_states_24, N_COEFFS, QFMTN);
        samplesOut[25] = dsp_filters_fir(samplesIn[25], delay_coeffs_0, filter_states_25, N_COEFFS, QFMTN);
        samplesOut[26] = dsp_filters_fir(samplesIn[26], delay_coeffs_2, filter_states_26, N_COEFFS, QFMTN);
        samplesOut[27] = dsp_filters_fir(samplesIn[27], delay_coeffs_2, filter_states_27, N_COEFFS, QFMTN);
        samplesOut[28] = dsp_filters_fir(samplesIn[28], delay_coeffs_1, filter_states_28, N_COEFFS, QFMTN);
        samplesOut[29] = dsp_filters_fir(samplesIn[29], delay_coeffs_1, filter_states_29, N_COEFFS, QFMTN);
        samplesOut[30] = dsp_filters_fir(samplesIn[30], delay_coeffs_1, filter_states_30, N_COEFFS, QFMTN);
        samplesOut[31] = dsp_filters_fir(samplesIn[31], delay_coeffs_1, filter_states_31, N_COEFFS, QFMTN);
        samplesOut[32] = samplesIn[32];
        samplesOut[33] = samplesIn[33];
    }

Any idea why this is taking so long (over 23 µs) to process?

What can I do to improve performance?

Thanks,

-Caleb