Multicore interpolation slower than single-core

satov
Member
Posts: 12
Joined: Wed Apr 30, 2025 3:34 pm


Post by satov »

Hello! I made a simple x2 interpolator using a polyphase FIR. I want to see the advantage of going multicore by calculating the two phases of the FIR in parallel. To my surprise, calling the function interp_x2 twice sequentially on the same core takes much less time than using the par statement.

interp_x2 uses the VPU and is written in ASM. The code below is the infinite loop of the task I run from main. I was inspired by the repo https://github.com/xmos/xmath_walkthrough/ where the par statement is used in a similar fashion. The interp_x2 function is not that long (~25 instruction bundles). Could it be that spawning the two threads has so much overhead that the benefit of going multicore cannot be seen?


#define R 2
#define NCOEFF 48
while (1) {
    rx_frame(datain, FRAME_SIZE, c_audio);

    xscope_start(PROBE1);
    for(int i=0; i<FRAME_SIZE; i++) {
        pos = update_state(state, pos, NCOEFF, datain[i]);
        unsafe {
            par {
                interp_x2(&pstate[pos1], &pcoeff[0],      &pout[i*R]);		// FIR phase 1
                interp_x2(&pstate[pos1], &pcoeff[NCOEFF], &pout[i*R+1]);	// FIR phase 2
            }
        }
    }
    xscope_stop(PROBE1);

    tx_frame(c_audio, out, R*FRAME_SIZE);
}
infiniteimprobability
Verified
XCore Legend
Posts: 1172
Joined: Thu May 27, 2010 10:08 am

Post by infiniteimprobability »

Hi,
sounds like an interesting project! Can I ask how many tasks/threads you have in your par{} statement in total? I see 2 tasks par'd there but there may be more outside?

The xcore is essentially a barrel processor with a pipeline depth of 5. It was designed so that a given thread occupies only one stage of the pipeline at any one time.

The practical effect of this is that the fastest a single thread can run is f/5, which for a 600 MHz core clock means 120 MHz. So if you have 1, 2, 3, 4 or 5 threads active, each gets 120 MHz. If you have more than 5 active, then the worst-case rate is f/n, so all 8 threads active will yield 75 MHz per thread. So for pure DSP, it doesn't win you anything (MHz-wise) to go with more than 5 workers unless it is convenient to split that way.

If at any time a thread is paused on an event or resource, it will no longer be in the run set, so the remaining threads may get something closer to f/5.
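
The scheduling rule described here can be written down as a small C helper. This is only a sketch of the f/5 and f/n arithmetic from the post; `per_thread_mhz` and `PIPELINE_DEPTH` are hypothetical names for illustration and are not part of any XMOS library.

```c
#include <assert.h>

/* Pipeline depth quoted in the post (assumption: 5 stages). */
#define PIPELINE_DEPTH 5

/* Approximate worst-case issue rate, in MHz, of each thread given the core
 * clock and the number of threads currently in the run set (not paused).
 * Hypothetical helper for illustration only. */
static double per_thread_mhz(double core_mhz, int active_threads)
{
    assert(active_threads >= 1);

    /* Up to PIPELINE_DEPTH threads each get f/5; beyond that, f/n. */
    int divisor = active_threads > PIPELINE_DEPTH ? active_threads
                                                  : PIPELINE_DEPTH;
    return core_mhz / divisor;
}
```

With a 600 MHz core clock, `per_thread_mhz(600.0, 4)` gives 120 and `per_thread_mhz(600.0, 8)` gives 75, matching the figures quoted above.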

You can play with the scheduling using set_core_priority_mode() which forces it to be f/5, at the expense of MHz for any other active threads. There's a good appnote which explains use of this mode here https://www.xmos.com/documentation/XM-0 ... v1.0.0.pdf
Engineer at XMOS
Ross
Verified
XCore Legend
Posts: 1269
Joined: Thu Dec 10, 2009 9:20 pm
Location: Bristol, UK

Post by Ross »

I'd be sure to compile at a decent optimisation level, O2 at least.
Technical Director @ XMOS. Opinions expressed are my own
satov
Member
Posts: 12
Joined: Wed Apr 30, 2025 3:34 pm

Post by satov »

Hi, thanks for the replies.

The optimization level is -O3. On that tile I'm running two tasks, plus the two in the par statement from the code I posted. So there must be 4 in total (the 2 I launch from main and the 2 in the par statement).

I added a long loop like this


    ldc r4, 1
    shl r4, r4, 16
.loop:
    sub r4, r4, 1
    bt r4, .loop
inside the interp_x2 function, and with that the multicore version actually gave a x1.99 speed improvement. If I remove the loop, the multicore version takes ~0.95 us/sample and the single-core version ~0.61 us. It's as if the par statement adds an overhead comparable to the function itself (which is ~20 instruction bundles), and that kills the performance.
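
One rough way to read these timings is with a simple cost model, which is an assumption and not something measured: per sample, the sequential version costs U + 2W and the par version costs U + W + O, where U is the delay-line update, W is one FIR phase and O is a fixed fork/join overhead. It also assumes both threads run at full rate. The helper names below are hypothetical.

```c
/* Per-sample time with both phases run back to back on one thread. */
static double t_sequential(double update_us, double phase_us)
{
    return update_us + 2.0 * phase_us;
}

/* Per-sample time with the two phases run in parallel, paying a fixed
 * fork/join overhead (assumed constant, independent of the phase length). */
static double t_parallel(double update_us, double phase_us, double overhead_us)
{
    return update_us + phase_us + overhead_us;
}

/* Speed-up of the parallel version; values below 1 mean par is slower. */
static double par_speedup(double update_us, double phase_us, double overhead_us)
{
    return t_sequential(update_us, phase_us) /
           t_parallel(update_us, phase_us, overhead_us);
}
```

In this model the difference between the two versions is O - W, so the quoted 0.95 us and 0.61 us would put the overhead about 0.34 us above the cost of one phase. When W is made very large, as with the added loop, the ratio tends towards 2, which is consistent with the x1.99 figure.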

I was thinking about launching the two FIR tasks from main as well, and using a lock or similar to synchronize them with the task that gets new samples and updates the delay line. But I cannot believe that the overhead, in terms of added instructions introduced by the lock mechanism, is smaller than that of the par statement.

With this example I was just experimenting with multicore polyphase FIR; my goal is to interpolate x64 with three cascaded polyphase FIRs. Multicore fits well with polyphase FIR, because all phases operate on the same delay line, so you can update the delay line once and then trigger all phases in parallel.
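
To make the shared delay line idea concrete, here is a minimal single-threaded sketch in plain C. It is not the VPU/ASM interp_x2 from the first post: `TAPS`, `push_sample`, `fir_phase` and `interp_sample` are hypothetical names, and the tap count is kept tiny for readability. The delay line is updated once per input sample, and each phase is an independent dot product over the same state, which is the part that could be handed to separate threads.

```c
#include <stddef.h>
#include <stdint.h>

#define R    2   /* interpolation factor, as in the thread */
#define TAPS 4   /* taps per phase (hypothetical small value) */

/* Shift one new sample into the shared delay line, newest at index 0. */
static void push_sample(int32_t state[TAPS], int32_t x)
{
    for (size_t k = TAPS - 1; k > 0; k--)
        state[k] = state[k - 1];
    state[0] = x;
}

/* One FIR phase: dot product of the delay line with that phase's taps. */
static int64_t fir_phase(const int32_t state[TAPS], const int32_t coeff[TAPS])
{
    int64_t acc = 0;
    for (size_t k = 0; k < TAPS; k++)
        acc += (int64_t)state[k] * coeff[k];
    return acc;
}

/* Produce R output samples from one input sample. The loop over phases
 * only reads the state, so its iterations are independent of each other. */
static void interp_sample(int32_t state[TAPS], const int32_t coeff[R][TAPS],
                          int32_t x, int64_t out[R])
{
    push_sample(state, x);
    for (size_t p = 0; p < R; p++)
        out[p] = fir_phase(state, coeff[p]);
}
```

Because the phases never write to the delay line, the only synchronization needed is between the update and the start of the phases, and again before the next update.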