XCORE 200 DSP Performance: dsp_filters_fir

ccrome
Active Member
Posts: 62
Joined: Wed Sep 23, 2015 1:15 am

Post by ccrome »

Hi there,
I'm having trouble understanding the DSP performance of the XCORE-200.

I have an XCORE that (I think) is running at 500 MHz. It's based on the XCORE-200 multichannel reference design, so it's an XU216.

I have inserted my DSP code into the ADC pathway using a new core, and I need to run DSP on 32 channels of audio at 48 kHz. Each channel needs to be processed with a 10-tap FIR filter. That's it.

That's 10 taps/channel * 32 channels/frame * 48000 frames/s = 15,360,000 taps/second. Should that be achievable?

I'm getting really slow performance: it's taking 23.5 µs to process 1 frame, for an equivalent of 320 taps / 23.5 µs = 13,617,000 taps/second.

Shouldn't the DSP performance of the XCORE-200 be much better than this? More like 65 million MACs/second? It is 1 MAC per tap, correct?
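As a cross-check of the arithmetic above, here is a minimal C sketch of the real-time budget. The function names are made up for illustration and are not part of lib_dsp. It only restates the numbers in this post: 10 taps, 32 channels, 48 kHz, and a measured 23.5 µs per frame.

```c
#include <stdint.h>

/* Taps that must be computed per second for a bank of FIR filters. */
static uint64_t required_taps_per_second(uint32_t taps_per_channel,
                                         uint32_t channels,
                                         uint32_t sample_rate_hz)
{
    return (uint64_t)taps_per_channel * channels * sample_rate_hz;
}

/* Time available to process one frame, in nanoseconds (rounded down). */
static uint64_t frame_period_ns(uint32_t sample_rate_hz)
{
    return 1000000000ull / sample_rate_hz;
}

/* Returns 1 if the measured per-frame time fits inside one sample period. */
static int meets_real_time(uint64_t measured_ns, uint32_t sample_rate_hz)
{
    return measured_ns <= frame_period_ns(sample_rate_hz);
}
```

With these inputs the frame period at 48 kHz comes out at roughly 20.8 µs, so a measured 23.5 µs per frame does not fit in one sample period.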

In order to be as fast and explicit as possible, I didn't use a loop, so here's the DSP code:

Code:

#define N_COEFFS 10
#define QFMTN 28
#pragma unsafe arrays
static inline void processSamples(int samplesIn[], int samplesOut[])
{
    samplesOut[0   ] = dsp_filters_fir(samplesIn[0   ], delay_coeffs_3 , filter_states_0 , N_COEFFS, QFMTN);
    samplesOut[1   ] = dsp_filters_fir(samplesIn[1   ], delay_coeffs_3 , filter_states_1 , N_COEFFS, QFMTN);
    samplesOut[2   ] = dsp_filters_fir(samplesIn[2   ], delay_coeffs_2 , filter_states_2 , N_COEFFS, QFMTN);
    samplesOut[3   ] = dsp_filters_fir(samplesIn[3   ], delay_coeffs_2 , filter_states_3 , N_COEFFS, QFMTN);
    samplesOut[4   ] = dsp_filters_fir(samplesIn[4   ], delay_coeffs_1 , filter_states_4 , N_COEFFS, QFMTN);
    samplesOut[5   ] = dsp_filters_fir(samplesIn[5   ], delay_coeffs_1 , filter_states_5 , N_COEFFS, QFMTN);
    samplesOut[6   ] = dsp_filters_fir(samplesIn[6   ], delay_coeffs_0 , filter_states_6 , N_COEFFS, QFMTN);
    samplesOut[7   ] = dsp_filters_fir(samplesIn[7   ], delay_coeffs_0 , filter_states_7 , N_COEFFS, QFMTN);
    samplesOut[8   ] = dsp_filters_fir(samplesIn[8   ], delay_coeffs_3 , filter_states_8 , N_COEFFS, QFMTN);
    samplesOut[9   ] = dsp_filters_fir(samplesIn[9   ], delay_coeffs_3 , filter_states_9 , N_COEFFS, QFMTN);
    samplesOut[10  ] = dsp_filters_fir(samplesIn[10  ], delay_coeffs_2 , filter_states_10, N_COEFFS, QFMTN);
    samplesOut[11  ] = dsp_filters_fir(samplesIn[11  ], delay_coeffs_2 , filter_states_11, N_COEFFS, QFMTN);
    samplesOut[12  ] = dsp_filters_fir(samplesIn[12  ], delay_coeffs_1 , filter_states_12, N_COEFFS, QFMTN);
    samplesOut[13  ] = dsp_filters_fir(samplesIn[13  ], delay_coeffs_1 , filter_states_13, N_COEFFS, QFMTN);
    samplesOut[14  ] = dsp_filters_fir(samplesIn[14  ], delay_coeffs_0 , filter_states_14, N_COEFFS, QFMTN);
    samplesOut[15  ] = dsp_filters_fir(samplesIn[15  ], delay_coeffs_0 , filter_states_15, N_COEFFS, QFMTN);
    samplesOut[16  ] = dsp_filters_fir(samplesIn[16  ], delay_coeffs_2 , filter_states_16, N_COEFFS, QFMTN);
    samplesOut[17  ] = dsp_filters_fir(samplesIn[17  ], delay_coeffs_2 , filter_states_17, N_COEFFS, QFMTN);
    samplesOut[18  ] = dsp_filters_fir(samplesIn[18  ], delay_coeffs_1 , filter_states_18, N_COEFFS, QFMTN);
    samplesOut[19  ] = dsp_filters_fir(samplesIn[19  ], delay_coeffs_1 , filter_states_19, N_COEFFS, QFMTN);
    samplesOut[20  ] = dsp_filters_fir(samplesIn[20  ], delay_coeffs_2 , filter_states_20, N_COEFFS, QFMTN);
    samplesOut[21  ] = dsp_filters_fir(samplesIn[21  ], delay_coeffs_2 , filter_states_21, N_COEFFS, QFMTN);
    samplesOut[22  ] = dsp_filters_fir(samplesIn[22  ], delay_coeffs_1 , filter_states_22, N_COEFFS, QFMTN);
    samplesOut[23  ] = dsp_filters_fir(samplesIn[23  ], delay_coeffs_1 , filter_states_23, N_COEFFS, QFMTN);
    samplesOut[24  ] = dsp_filters_fir(samplesIn[24  ], delay_coeffs_0 , filter_states_24, N_COEFFS, QFMTN);
    samplesOut[25  ] = dsp_filters_fir(samplesIn[25  ], delay_coeffs_0 , filter_states_25, N_COEFFS, QFMTN);
    samplesOut[26  ] = dsp_filters_fir(samplesIn[26  ], delay_coeffs_2 , filter_states_26, N_COEFFS, QFMTN);
    samplesOut[27  ] = dsp_filters_fir(samplesIn[27  ], delay_coeffs_2 , filter_states_27, N_COEFFS, QFMTN);
    samplesOut[28  ] = dsp_filters_fir(samplesIn[28  ], delay_coeffs_1 , filter_states_28, N_COEFFS, QFMTN);
    samplesOut[29  ] = dsp_filters_fir(samplesIn[29  ], delay_coeffs_1 , filter_states_29, N_COEFFS, QFMTN);
    samplesOut[30  ] = dsp_filters_fir(samplesIn[30  ], delay_coeffs_1 , filter_states_30, N_COEFFS, QFMTN);
    samplesOut[31  ] = dsp_filters_fir(samplesIn[31  ], delay_coeffs_1 , filter_states_31, N_COEFFS, QFMTN);
    samplesOut[32  ] = samplesIn[32];
    samplesOut[33  ] = samplesIn[33];
}

Any idea why this is taking so long (over 23 µs) to process?

What can I do to improve performance?

Thanks,
-Caleb


andrew
Experienced Member
Posts: 114
Joined: Fri Dec 11, 2009 10:22 am

Post by andrew »

The best performance you are going to get depends on your specific filter design. lib_mic_array achieves a very high number of FIR taps per instruction because each filter has been designed to get the best performance for its particular situation. For example:
Stage 1 of the mic_array converts PDM to PCM and decimates by 8 (0.5 instructions per tap)
Stage 2 decimates by 4 (1.75 instructions per tap)
Stage 3 decimates by N (2 instructions per tap)

The minimum work you have to do is load the data, load the coefs and multiply them together.

Code:

	ldd b, a, coef_pointer[i]
	ldd d, c, data_pointer[i]
	maccs lo, hi, c, a
	maccs lo, hi, d, b

You'll probably want to increment the index:

Code:

	add i, i, N

Depending on how you want to manage memory, you might want to save the data back shifted by one sample (before i is incremented):

Code:

	std c, prev_d, data_pointer[i]

This will give you a respectable real FIR.

I recommend using the ldd and std instructions as shown above to take full advantage of the load/store bandwidth.
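To make the data flow of that loop easier to follow, here is a portable C reference model that handles two taps per iteration. It is only a sketch under stated assumptions, not the lib_dsp or lib_mic_array implementation: the function name, the even tap count, the "newest sample at index 0" state layout, and the final right shift by q_format are all choices made for illustration.

```c
#include <stdint.h>

/*
 * Reference model of a direct-form FIR processing two taps per iteration,
 * mirroring the ldd / maccs / std pattern.
 * Assumptions (not from the thread): num_taps is even, state[] holds the
 * previous inputs with the newest at index 0, and the result is scaled by
 * an arithmetic right shift of q_format bits.
 */
static int32_t fir_two_taps_per_iter(int32_t input,
                                     const int32_t coeffs[],
                                     int32_t state[],
                                     int num_taps,
                                     int q_format)
{
    int64_t acc = 0;        /* plays the role of the hi:lo register pair */
    int32_t carry = input;  /* sample to be written into the next slot   */

    for (int i = 0; i < num_taps; i += 2) {
        /* "ldd" x2: fetch two coefficients and two state words */
        int32_t c0 = coeffs[i], c1 = coeffs[i + 1];
        int32_t s0 = state[i],  s1 = state[i + 1];

        /* "std": write the delay line back shifted by one sample */
        state[i]     = carry;
        state[i + 1] = s0;

        /* "maccs" x2: 32x32 -> 64-bit multiply-accumulate */
        acc += (int64_t)c0 * carry;
        acc += (int64_t)c1 * s0;

        carry = s1;         /* becomes the first sample of the next pair */
    }
    return (int32_t)(acc >> q_format);
}
```

Feeding an impulse through a zeroed state should return the coefficients one by one, scaled by the impulse height, which is a quick way to sanity-check any hand-optimised version against this model.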

Post by ccrome »

Hi Andrew,
Thanks for the reply. 1.75 to 2 instructions per tap sounds pretty reasonable. Regarding the code that you show here:
I'm looking at dsp_filters.c in lib_dsp now. It does look like they are doing pretty close to what you suggest.

They do:

Code:

ldd coef0, coef1
ldd state0, state1
std state (i.e. shift the state registers)
maccs coef0, state0
maccs coef1, state1

So, it looks like 5 instructions for 2 taps. Not too terrible, but it does add up. At that rate, my 10 taps * 32 channels/frame * (5/2) instructions/tap = 800 instructions/frame (plus some overhead, let's say 200 instructions). So, I'm at 1000 instructions per frame, which is getting squeaky, I guess.

By my calculations, I need to finish in something like 17 µs. 17 µs at 62.5 MIPS is about 1062 instructions. So... it's definitely squeaky for 1 core.

I can split the filters between 2 cores, I guess, and get the job done. I don't quite know how that works, but it should be manageable, I suspect.
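The instruction budget can be written out as a short C sketch. The helper names are hypothetical, and the 200-instruction overhead is the rough guess from this post rather than a measured figure. The per-core rate is left as a parameter, since 500 MHz shared by 8 cores gives 62.5 MIPS while 5 or fewer cores give 100 MIPS each.

```c
#include <stdint.h>

/* Instructions needed per frame. The per-tap cost is given as a ratio
 * (instr_num / instr_den), e.g. 5/2 for "5 instructions per 2 taps". */
static uint32_t instructions_needed(uint32_t taps, uint32_t channels,
                                    uint32_t instr_num, uint32_t instr_den,
                                    uint32_t overhead)
{
    return (taps * channels * instr_num) / instr_den + overhead;
}

/* Instructions a core can issue in budget_us microseconds at the given
 * per-core rate in MIPS (1 MIPS == 1 instruction per microsecond). */
static double instructions_available(double budget_us, double core_mips)
{
    return budget_us * core_mips;
}
```

With 10 taps, 32 channels, 5/2 instructions per tap and 200 instructions of overhead, about 1000 instructions are needed per frame. A 17 µs budget provides 1062.5 instructions at 62.5 MIPS and 1700 at 100 MIPS.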

Thanks again,
-Caleb

Post by ccrome »

Wait... now I'm confused about performance again.

I see in the XU200 data sheet, it says 100 MIPS/core for up to 5 cores. I'm only using 3 cores on that tile, so I should have a full 100MIPS, right?

In that case it should run in real time... Which makes me think I have some setting wrong.

How can I verify that my part is running at the right speed, and that I'm getting all the MIPS I'm supposed to?

Thanks,
-Caleb

Post by andrew »

If your filter is symmetric, then you could save on loading the coefficients by reusing each one once it is loaded.
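A minimal C sketch of that idea, assuming an even number of taps and a history buffer with the newest sample at index 0 (the function name and buffer layout are illustrative, not taken from lib_dsp). Only the first half of the coefficients is stored, and each one is loaded once and used for two multiply-accumulates.

```c
#include <stdint.h>

/*
 * Symmetric (linear-phase) FIR: coeffs[k] == coeffs[num_taps - 1 - k], so
 * only the first half is stored in half_coeffs[].
 * Assumptions: num_taps is even and history[0] is the newest sample.
 */
static int32_t fir_symmetric(const int32_t half_coeffs[],
                             const int32_t history[],
                             int num_taps,
                             int q_format)
{
    int64_t acc = 0;
    for (int k = 0; k < num_taps / 2; k++) {
        int32_t c = half_coeffs[k];                     /* one load ...  */
        acc += (int64_t)c * history[k];                 /* ... two MACs  */
        acc += (int64_t)c * history[num_taps - 1 - k];
    }
    return (int32_t)(acc >> q_format);
}
```

This keeps the same number of multiply-accumulates and halves only the coefficient loads, which is the saving described above.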

Post by andrew »

For a tile you get 500 MHz of instructions. That means you get 500 million issue slots per second. When 5 cores are running, you get 100 million issues per second per core. When 8 are running, each gets 62.5 million. Fewer than 5 cores will also get 100 million each, as the pipeline length is five.
Cores running   Issues per second per core (millions)
1               100
2               100
3               100
4               100
5               100
6               83.3
7               71.4
8               62.5
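The table follows from a simple rule: the tile's issue rate is divided by the number of running cores, but never by fewer than the pipeline length of five. A small C sketch of that rule (the function name is made up for illustration):

```c
/*
 * Per-core issue rate in millions of instructions per second.
 * tile_mhz is the tile clock (500 in this thread); active_cores is the
 * number of cores currently running on that tile.
 */
static double per_core_mips(double tile_mhz, int active_cores)
{
    const int pipeline_depth = 5;
    /* A single core can issue at most once every pipeline_depth cycles. */
    int divisor = active_cores > pipeline_depth ? active_cores : pipeline_depth;
    return tile_mhz / divisor;
}
```

For a 500 MHz tile this gives 100 for 1 to 5 cores and 62.5 for 8 cores, matching the table.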

The chip should be performing as specified by your xn file. If you want to check it, you could run your program on the simulator and see how many instructions execute for your FIR.