I will push one implementation to GitHub on Monday.

For 3000 taps, one FIR thread in reality runs (5n+m)

Testing performance, Running FIR-filter for 1 sec on a single thread with 3000 filter taps

Filtered 6660 samples during 1 second

**19980 kTaps per sec.**

CRC32 checksum for all filtered samples was: 0xED9B0990

Calculating the CRC32 checksum from the XC implementation, this might take some time

Correct Checksum for filtered datasequence is: 0xED9B0990

With 4 FIR threads on one core, plus a distributing thread to avoid any extra filter latency (5 threads on one core in total):

This tree-structured implementation should perform worse than a ring implementation, so these should be worst-case numbers.

Testing performance, Running FIR-filter for 1 sec on quad threads with 3000 filter taps

Filtered 25538 samples during 1 second

**76614 kTaps per sec.**

CRC32 checksum for all filtered samples was: 0x1E2D8273

Calculating the CRC32 checksum from the XC implementation, this might take some time

Correct Checksum for filtered datasequence is: 0x1E2D8273
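The kTaps figures in both runs are simply samples per second times the tap count. A small sketch of that arithmetic, using only the numbers printed above (the helper name `ktaps_per_sec` is made up for illustration):

```python
def ktaps_per_sec(samples_per_sec: int, taps: int) -> float:
    """Throughput in thousands of filter taps processed per second."""
    return samples_per_sec * taps / 1000

TAPS = 3000
single = ktaps_per_sec(6660, TAPS)   # one FIR thread
quad = ktaps_per_sec(25538, TAPS)    # four FIR threads plus distributor

print(single)                    # 19980.0
print(quad)                      # 76614.0
print(round(quad / single, 2))   # 3.83
```

So the 4-thread tree version runs about 3.83 times faster than the single thread on the same core.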

Using 4 cores you have several choices. One way is to calculate:

sample 1 on stdcore[0]

sample 2 on stdcore[1]

sample 3 on stdcore[2]

sample 4 on stdcore[3]

sample 5 on stdcore[0] :> outputting the filter result from sample 1

sample 6 on stdcore[1] :> outputting the filter result from sample 2

i.e. a latency of 4 samples in the filter.
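The round-robin scheme above can be sketched as a small schedule generator. This is plain Python that only models the bookkeeping, not XC code, and it assumes each core hands back its previous result at the moment it receives its next sample; `round_robin_schedule` is a hypothetical name.

```python
NUM_CORES = 4  # stdcore[0] .. stdcore[3]

def round_robin_schedule(num_samples: int, num_cores: int = NUM_CORES):
    """Return a list of (sample, core, finished_sample) tuples.

    Sample k (1-based) goes to core (k - 1) % num_cores. When a core
    receives sample k it outputs the result for sample k - num_cores,
    or None while the pipeline is still filling.
    """
    schedule = []
    for k in range(1, num_samples + 1):
        core = (k - 1) % num_cores
        finished = k - num_cores if k > num_cores else None
        schedule.append((k, core, finished))
    return schedule

for sample, core, finished in round_robin_schedule(6):
    note = f" :> outputs result of sample {finished}" if finished else ""
    print(f"sample {sample} on stdcore[{core}]{note}")
```

Each core gets four sample periods to finish one full 3000-tap filter pass, which is where the 4-sample latency comes from.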

Another solution is to run maybe 15 or 19 FIR filter threads and one distribution thread to balance the load over the cores. This might work well if you can run one sample ahead.
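One possible reading of the distribution-thread idea is that the taps of a single output sample are cut into chunks, each FIR thread computes a partial sum, and the partial sums are added up. That split is an assumption, not necessarily how the benchmarked code divides the work. A sequential Python sketch (no real threads, made-up test data, hypothetical function names) that checks the split gives the same result as one long dot product:

```python
def fir_direct(history, coeffs):
    """Reference: one output sample as a single dot product."""
    return sum(h * c for h, c in zip(history, coeffs))

def fir_split(history, coeffs, num_workers):
    """Cut the taps into at most num_workers contiguous chunks, compute
    one partial sum per chunk, then add the partial sums together."""
    n = len(coeffs)
    chunk = -(-n // num_workers)  # ceiling division
    partials = [
        fir_direct(history[i:i + chunk], coeffs[i:i + chunk])
        for i in range(0, n, chunk)
    ]
    return sum(partials)

# Integer test data so the comparison is exact.
coeffs = [(i % 7) - 3 for i in range(3000)]
history = [(i * 5) % 11 - 5 for i in range(3000)]

print(fir_split(history, coeffs, 15) == fir_direct(history, coeffs))
```

With 3000 taps, 15 workers get 200 taps each; 19 workers get 158 taps each, with the last one taking the remaining 156.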