What is the best way to split audio processing into multiple

sonicemotion
New User
Posts: 3
Joined: Mon Jan 11, 2016 11:37 am


Post by sonicemotion »

I am currently evaluating whether it is possible to run an existing audio algorithm on an XU2xx device. Most probably it will require more than 100 MIPS, so I will need to split the processing into different tasks. The structure of the algorithm is basically a chain of audio processing blocks. The following two options come to mind:

1) Split audio processing blocks into parallel tasks on multiple cores. For example a multi channel EQ.

2) Process an audio block in one core and pass the output to the next core for processing the second block when ready and so on.

I favour option one, because I would not need to make a lot of changes to the structure of the current codebase. I fear, however, that there will be some overhead if I try to parallelize tasks at such a low level.

From what I understand, option two is better aligned with the design principles for XMOS, but it means that I would need to restructure my codebase. I would also introduce an audio delay for every additional core I add to the chain.

Are both options possible to implement? Which one would you recommend going for?
User avatar
infiniteimprobability
Verified
XCore Legend
Posts: 1126
Joined: Thu May 27, 2010 10:08 am

Post by infiniteimprobability »

Good question - a lot of us do this on a regular basis, so I am sure there will be different opinions. There is probably no universal answer to this.

There is the issue that a block of MIPS (a thread/core, etc.) is generally between 62.5 and 100 MIPS. I use the term MIPS here, but actually it's wrong because xCore200 can do two instructions per clock - MHz is more accurate. Anyway, let's just use MIPS for now.

Personally, I like to map channels to cores. For example, I had an app which needed 30-80 MIPS per channel of processing (HQ sample rate conversion), so it was an easy choice to make... one core per audio channel. It has the advantage of looking nice in the code too - one processing task instantiated multiple times. This is approach 1).
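As a sketch of what "one processing task instantiated multiple times" looks like (plain C, all names hypothetical - on the actual device the loop body would be a task in a par statement, one per core):

```c
#include <stdint.h>

#define NUM_CHANNELS 4

/* Hypothetical per-channel processing: apply a Q1.31 gain to one sample. */
static int32_t process_channel(int32_t sample, int32_t gain_q31) {
    return (int32_t)(((int64_t)sample * gain_q31) >> 31);
}

/* Host-side sketch: the same processing function instantiated once per
   channel. On xCore this loop would become a par { } of identical tasks. */
static void process_frame(const int32_t in[NUM_CHANNELS],
                          int32_t out[NUM_CHANNELS],
                          const int32_t gains[NUM_CHANNELS]) {
    for (int ch = 0; ch < NUM_CHANNELS; ch++)
        out[ch] = process_channel(in[ch], gains[ch]);
}
```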

Approach 2) needs manual splitting of stages into groups that consume (as close as possible, minus overhead) the available MIPS per core.

I don't really see a strong case for choosing one over the other unless the MIPS usage of the blocks maps really neatly onto what you have in your application. I find the main effort goes into the bits at the ends. For example, converting between sample-based (like most XMOS I/O) and block-based (many DSP algorithms use this) formats, and FIFOs where the rates take a while to synchronise, like when an external PLL is involved...

PS: you know you can control the number of MIPS each core gets now? For example, you could have two 100 MIPS cores and 6 x 50 MIPS cores. See set_core_priority_on() in xs1.h. This can help size the MIPS chunks to match what your DSP tasks take.
sonicemotion
New User
Posts: 3
Joined: Mon Jan 11, 2016 11:37 am

Post by sonicemotion »

Thanks for the warning about sample- vs. block-based processing. This is definitely something I need to keep in mind when planning the port of my existing code.

Mapping channels to cores sounds like a good solution to me.
Unfortunately, most processing blocks in my algorithm mix channels together, so it is not possible to create separate, independent input-to-output tasks.
Parallelization is only possible within some blocks of the algorithm deeper down in the code.

Let's assume I would like to do the following audio processing:

L in --> Gain --> \         / --> Filter --> L out
                    Mixer
R in --> Gain --> /         \ --> Filter --> R out


Would it pay off to map the two channels in the Gain and Filter blocks to two cores even if each task consists of just one (or a few) multiplications? Or is the overhead too big in this case?
User avatar
infiniteimprobability
Verified
XCore Legend
Posts: 1126
Joined: Thu May 27, 2010 10:08 am

Post by infiniteimprobability »

Would it pay off to map the two channels in the Gain and Filter blocks to two cores even if each task consists of just one (or a few) multiplications? Or is the overhead too big in this case?
It's up to you really - if you have bucketloads of cores spare, then why not? Personally, I'd just inline the mixing functions (i.e. call a function) and use shared memory, because it's only a handful of instructions. There are a million ways to share memory...
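For the two-channel chain above, an inlined, single-core version could look something like this (plain C sketch; the function names are made up, and the filter is just a one-pole low-pass placeholder for whatever your real filter is):

```c
#include <stdint.h>

/* All values Q1.31; names are hypothetical placeholders. */
static int32_t gain_q31(int32_t x, int32_t g) {
    return (int32_t)(((int64_t)x * g) >> 31);
}

/* Equal-weight two-input mix; halve each input to avoid overflow. */
static int32_t mix2(int32_t a, int32_t b) {
    return (a >> 1) + (b >> 1);
}

/* One-pole low-pass: y += a * (x - y); filter state kept by the caller. */
static int32_t onepole(int32_t x, int32_t *y, int32_t a) {
    *y += (int32_t)(((int64_t)a * ((int64_t)x - *y)) >> 31);
    return *y;
}

/* The whole L/R chain for one sample pair: just inlined function calls. */
static void process_lr(int32_t in_l, int32_t in_r,
                       int32_t *out_l, int32_t *out_r,
                       int32_t g, int32_t a,
                       int32_t *state_l, int32_t *state_r) {
    int32_t m = mix2(gain_q31(in_l, g), gain_q31(in_r, g));
    *out_l = onepole(m, state_l, a);
    *out_r = onepole(m, state_r, a);
}
```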
One thing you may need to consider when sharing memory, though, is synchronisation - you can use a spin lock (https://www.xmos.com/support/libraries) or a simple channel to do this:

Task 1
- Do Task 1 processing code and write output to shared memory
- Send token ( c_synch <: 0;)

Task 2
- Wait for token (c_synch :> int tmp;)
- Do Task 2 processing code fetching inputs from shared memory
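On a host machine you can model that token handoff with a C11 atomic standing in for the channel (purely illustrative - on the device you would use a real chanend, and the names here are made up):

```c
#include <pthread.h>
#include <stdatomic.h>

static int shared_buf;           /* the shared memory between the two tasks */
static atomic_int token = 0;     /* stands in for the c_synch channel */

static void *task1(void *arg) {
    shared_buf = *(int *)arg * 2;                           /* Task 1 processing */
    atomic_store_explicit(&token, 1, memory_order_release); /* c_synch <: 0 */
    return NULL;
}

static void *task2(void *arg) {
    while (!atomic_load_explicit(&token, memory_order_acquire))
        ;                                                   /* c_synch :> tmp */
    *(int *)arg = shared_buf + 1;                           /* Task 2 processing */
    return NULL;
}
```

The release/acquire pair guarantees Task 2 sees Task 1's write to shared memory before it reads it, which is the same ordering guarantee a channel token gives you on xCore.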


By the way, a neat way of doing mixing with saturation, taking advantage of the new instructions in xCore200, is shown below. It expects data in Q1.31 format, but can easily be modified for other fixed-point formats by changing the "31" value.


    long long result_tmp = 0;
    result_tmp += (long long) sample_0 * gain_0;  // Q1.31 x Q1.31 product, accumulated in 64 bits
    result_tmp += (long long) sample_1 * gain_1;
    result_tmp = lsats(result_tmp, 31);           // saturate so a 32-bit extract at bit 31 cannot overflow
    sample = lextract(result_tmp, 31, 32);        // extract 32 bits starting at bit 31 -> Q1.31 result
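If you want to sanity-check that logic on a host compiler (where lsats/lextract aren't available), the portable C equivalent is roughly:

```c
#include <stdint.h>

/* Portable stand-in for the lsats/lextract pair: accumulate two
   Q1.31 x Q1.31 products in 64 bits, shift back down, and clamp. */
static int32_t mix2_q31(int32_t s0, int32_t g0, int32_t s1, int32_t g1) {
    int64_t acc = (int64_t)s0 * g0 + (int64_t)s1 * g1;
    int64_t out = acc >> 31;               /* back to Q1.31 */
    if (out > INT32_MAX) out = INT32_MAX;  /* saturate like lsats */
    if (out < INT32_MIN) out = INT32_MIN;
    return (int32_t)out;
}
```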