Multicore interpolation slower than single-core

satov · Tue Sep 09, 2025 8:04 pm

Hello! I made a simple x2 interpolator using a polyphase FIR. I want to see the advantage of going multicore by calculating the two phases of the FIR in parallel. To my surprise, calling the function interp_x2 twice sequentially on the same core takes much less time than using the par statement.

interp_x2 uses the VPU and is written in ASM. The code below is the infinite loop of the task I run from main. I was inspired by the repo https://github.com/xmos/xmath_walkthrough/ where the par statement is used in a similar fashion. The interp_x2 function is not that long (~25 instruction bundles). Could it be that there is so much overhead in spawning the two threads that the benefit of going multicore is lost?

Code:

#define R 2
#define NCOEFF 48
while (1) {
    rx_frame(datain, FRAME_SIZE, c_audio);

    xscope_start(PROBE1);
    for(int i=0; i<FRAME_SIZE; i++) {
        pos = update_state(state, pos, NCOEFF, datain[i]);
        unsafe {
            par {
                interp_x2(&pstate[pos1], &pcoeff[0],      &pout[i*R]);		// FIR phase 1
                interp_x2(&pstate[pos1], &pcoeff[NCOEFF], &pout[i*R+1]);	// FIR phase 2
            }
        }
    }
    xscope_stop(PROBE1);

    tx_frame(c_audio, out, R*FRAME_SIZE);
}

infiniteimprobability · Thu Sep 11, 2025 12:07 pm

Hi,
sounds like an interesting project! Can I ask how many tasks/threads you have in your par{} statement in total? I see 2 tasks par'd there but there may be more outside?

The xcore is essentially a barrel processor with a pipeline depth of 5. It was designed so that each thread occupies only one stage of the pipeline at any one time.

The practical effect of this is that the fastest a single thread can run is f/5, which for a 600MHz core clock means 120MHz. So if you have 1, 2, 3, 4 or 5 threads active, each gets 120MHz. If you have more than 5 active then the worst case MHz is f/n. So all 8 threads active will yield you 75MHz per thread. So for pure DSP, it doesn't win you anything (MHz-wise) to go with more than 5 workers unless it is convenient to split that way.
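The rule of thumb above can be written down as a tiny helper. This is only a sketch in plain C: `per_thread_mhz` and `PIPELINE_DEPTH` are made-up names for illustration, and it models just the worst case where every counted thread is runnable.

```c
#include <assert.h>

#define PIPELINE_DEPTH 5u  /* pipeline depth quoted above */

/* Hypothetical helper: worst-case issue rate per thread, in MHz,
 * when `active_threads` threads are all runnable at the same time. */
unsigned per_thread_mhz(unsigned core_mhz, unsigned active_threads)
{
    assert(active_threads >= 1u);
    /* Up to PIPELINE_DEPTH threads each get f/5; beyond that it is f/n. */
    unsigned divisor = active_threads > PIPELINE_DEPTH ? active_threads
                                                       : PIPELINE_DEPTH;
    return core_mhz / divisor;
}
```

With a 600MHz core clock, `per_thread_mhz(600, 4)` gives 120 and `per_thread_mhz(600, 8)` gives 75, matching the figures above.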

If at any time a thread is paused on an event or resource, it will no longer be in the run set, so the remaining threads may get something closer to f/5.

You can play with the scheduling using set_core_priority_mode(), which forces a thread to run at f/5, at the expense of MHz for any other active threads. There's a good appnote which explains the use of this mode here: https://www.xmos.com/documentation/XM-0 ... v1.0.0.pdf

Ross · Thu Sep 11, 2025 2:12 pm

I'd be sure to compile at a decent optimisation level, O2 at least.

satov · Thu Sep 11, 2025 9:14 pm

Hi, thanks for the replies.

The optimization is -O3. On that tile I'm running two tasks, plus the two in the par statement from the code I posted. So there must be 4 in total (the 2 I launch from main and the 2 in the par statement).

I added a long loop like this

Code:

    ldc r4, 1
    shl r4, r4, 16
.loop:
    sub r4, r4, 1
    bt r4, .loop

inside the interp_x2 function, and with that in place the multicore version is indeed x1.99 faster. If I remove the loop, the multicore version takes ~0.95 us/sample and the single-core version ~0.61 us. It's as if the par statement adds an overhead comparable to the function itself (~20 instruction bundles), which kills the performance.
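To put those timings in perspective, they can be converted into per-thread issue slots. A rough sketch in plain C, assuming the 600MHz core clock and f/5 issue rate from the earlier reply (the actual clock of this board is not stated in the thread); both function names are made up for illustration.

```c
#include <assert.h>

/* Hypothetical helper: convert a duration in microseconds into issue slots
 * for one thread running at thread_mhz (1 MHz = 1 slot per microsecond). */
double us_to_slots(double us, double thread_mhz)
{
    assert(us >= 0.0 && thread_mhz > 0.0);
    return us * thread_mhz;
}

/* Hypothetical helper: extra time of the par version compared with an
 * ideal 2-way split, where the ideal is half of the sequential time. */
double par_overhead_us(double seq_us, double par_us)
{
    assert(seq_us >= 0.0 && par_us >= 0.0);
    return par_us - seq_us / 2.0;
}
```

With 0.61 us sequential and 0.95 us using par, `par_overhead_us` gives about 0.645 us per sample, which at an assumed 120MHz per thread is roughly 77 issue slots. These are slots, not necessarily executed instructions, since time spent waiting on synchronisation counts as well.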

I was thinking about launching the two FIR tasks from main as well, and using a lock or similar to synchronize them with the task that gets new samples and updates the delay line. But I cannot believe that the overhead, in terms of instructions added by the lock mechanism, is smaller than that of the par statement.

With this example I was just experimenting with multicore polyphase FIR. My goal is to interpolate x64 with three cascaded polyphase FIRs. Multicore fits well with polyphase FIR because all phases operate on the same delay line, so you can just update the delay line once and then trigger all phases in parallel.
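For reference, the "update the delay line once, then run every phase on it" structure can be sketched as a scalar model in plain C. This is not the VPU/ASM interp_x2 from the first post: `fir_phase`, `interp_x2_ref` and the 4-tap size are hypothetical, the delay line is a simple shift register rather than a circular buffer, and the two phases run one after the other where the posted code uses par.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define R     2  /* interpolation factor, as in the posted code */
#define NTAPS 4  /* taps per phase; small value chosen only for illustration */

/* Hypothetical scalar stand-in for one FIR phase: dot product of the
 * shared delay line with that phase's coefficients. */
long fir_phase(const int *state, const int *coeff, size_t ntaps)
{
    long acc = 0;
    for (size_t k = 0; k < ntaps; k++)
        acc += (long)state[k] * coeff[k];
    return acc;
}

/* Push one input sample, then produce R outputs from the same delay line.
 * coeff holds phase 0 in coeff[0..NTAPS-1] and phase 1 in
 * coeff[NTAPS..2*NTAPS-1], mirroring the pcoeff layout in the thread. */
void interp_x2_ref(int *state, const int *coeff, int sample, long *out)
{
    assert(state && coeff && out);
    /* Update the shared delay line once per input sample (newest first). */
    memmove(&state[1], &state[0], (NTAPS - 1) * sizeof state[0]);
    state[0] = sample;
    /* Each phase only reads the delay line, so the phases are independent. */
    for (size_t p = 0; p < R; p++)
        out[p] = fir_phase(state, &coeff[p * NTAPS], NTAPS);
}
```

Because each phase only reads `state` and writes its own `out[p]`, the loop over `p` is the part that could be handed to separate threads.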
