Re: Howto build a FIR with 3000 points
Posted: Sat Oct 15, 2011 10:33 pm
The best optimization depends on many things: is there enough memory to use double data, can we accept some extra latency through the FIR filter, etc.
I will push one implementation to GitHub on Monday.
For 3000 taps, one FIR thread in reality runs (5n+m)
Testing performance, Running FIR-filter for 1 sec on a single thread with 3000 filter taps
Filtered 6660 samples during 1 second
19980 kTaps per sec.
CRC32 checksum for all filtered samples was: 0xED9B0990
Calculating the CRC32 checksum from the XC implementation, this might take some time
Correct Checksum for filtered datasequence is: 0xED9B0990
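The checksum comparison in the log can be tried off-target with something like the sketch below. It is only an illustration in Python, not the XC test harness: the function names, the 32-bit little-endian packing of the outputs and the small test filter are all assumptions of mine.

```python
import struct
import zlib

def fir_reference(samples, taps):
    """Plain FIR: y[n] = sum over k of taps[k] * x[n-k], with x[<0] = 0."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * samples[n - k]
        out.append(acc)
    return out

def fir_ring(samples, taps):
    """Same filter, using a circular delay line (hypothetical 'fast' version)."""
    n_taps = len(taps)
    ring = [0] * n_taps
    pos = 0
    out = []
    for x in samples:
        ring[pos] = x
        acc = 0
        idx = pos
        for h in taps:
            acc += h * ring[idx]          # ring[idx] holds x[n-k]
            idx = idx - 1 if idx > 0 else n_taps - 1
        out.append(acc)
        pos = pos + 1 if pos + 1 < n_taps else 0
    return out

def crc32_of_samples(values):
    """CRC32 over outputs packed as 32-bit little-endian words (assumed format)."""
    crc = 0
    for v in values:
        crc = zlib.crc32(struct.pack("<I", v & 0xFFFFFFFF), crc)
    return crc

# Small made-up data set, just to exercise both versions.
test_taps = [3, -1, 4, 1, -5, 9, 2, -6]
test_samples = [(i * 37) % 101 - 50 for i in range(200)]

crc_fast = crc32_of_samples(fir_ring(test_samples, test_taps))
crc_ref = crc32_of_samples(fir_reference(test_samples, test_taps))
print("fast: 0x%08X  reference: 0x%08X" % (crc_fast, crc_ref))
```

If the two values printed differ, the optimized version has a bug somewhere; matching values only show that both versions agree on this particular input.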
Running 4 FIR threads on one core, plus a distributing thread to avoid any extra filter latency (5 threads on one core in total):
This tree-structured implementation should be the worst possible compared to a ring implementation, so these should be worst-case numbers.
Testing performance, Running FIR-filter for 1 sec on quad threads with 3000 filter taps
Filtered 25538 samples during 1 second
76614 kTaps per sec.
CRC32 checksum for all filtered samples was: 0x1E2D8273
Calculating the CRC32 checksum from the XC implementation, this might take some time
Correct Checksum for filtered datasequence is: 0x1E2D8273
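As a sanity check on the two logs, the kTaps figures are just samples per second times the number of taps. A short Python sketch of that arithmetic (the variable and function names are mine):

```python
TAPS = 3000

single_thread_samples = 6660    # samples filtered in 1 s, single thread
quad_thread_samples = 25538     # samples filtered in 1 s, 4 FIR threads + distributor

def ktaps_per_sec(samples_per_sec, taps=TAPS):
    """Multiply-accumulates per second, in thousands."""
    return samples_per_sec * taps // 1000

single_ktaps = ktaps_per_sec(single_thread_samples)
quad_ktaps = ktaps_per_sec(quad_thread_samples)
speedup = quad_thread_samples / single_thread_samples

print(single_ktaps)        # 19980
print(quad_ktaps)          # 76614
print(round(speedup, 2))   # 3.83
```

So the 4-thread tree version gives about 3.83 times the single-thread throughput in these runs.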
With 4 cores you have several choices. One way is to calculate:
sample 1 on stdcore[0]
sample 2 on stdcore[1]
sample 3 on stdcore[2]
sample 4 on stdcore[3]
sample 5 on stdcore[0] :> outputting the filter result from sample 1
sample 6 on stdcore[1] :> outputting the filter result from sample 2
i.e. a latency of 4 samples in the filter.
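The round-robin scheme can be modelled roughly as below. This is a plain Python sketch of the scheduling only, not XC code: it uses 0-based sample numbers, assumes every core can see the whole input history, and all names in it are made up for the example.

```python
NUM_CORES = 4

def fir_at(samples, taps, n):
    """FIR output for sample index n: sum of taps[k] * x[n-k], x[<0] = 0."""
    return sum(h * samples[n - k] for k, h in enumerate(taps) if n - k >= 0)

def round_robin_schedule(samples, taps, num_cores=NUM_CORES):
    """Return a list of (sample_index, core, finished) tuples.

    'finished' is the (index, value) result that the same core hands over
    when it picks up a new sample, or None while the pipeline is filling.
    """
    pending = [None] * num_cores      # one in-flight result per core
    log = []
    for n in range(len(samples)):
        core = n % num_cores
        finished = pending[core]      # result for sample n - num_cores
        pending[core] = (n, fir_at(samples, taps, n))
        log.append((n, core, finished))
    return log

demo_taps = [1, 2, 3]
demo_samples = [5, -2, 7, 1, 0, 4, 3, -1]
schedule = round_robin_schedule(demo_samples, demo_taps)

for n, core, finished in schedule:
    if finished is None:
        print("sample %d on core %d" % (n, core))
    else:
        print("sample %d on core %d, outputting result for sample %d"
              % (n, core, finished[0]))
```

Each core gets four sample periods to finish one output, which is where the 4-sample latency comes from.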
Another solution is to run maybe 15 or 19 FIR filter threads and one distribution thread to balance the load over the cores. It might be good if you can run one sample ahead.
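One possible way to share the work between the filter threads is to give each one a contiguous block of taps and let the distribution thread add up the partial sums. That split is my assumption, the post does not say how the taps are divided. A Python sketch using a thread pool in place of XC threads, with hypothetical helper names:

```python
from concurrent.futures import ThreadPoolExecutor

def split_taps(num_taps, num_workers):
    """Return (start, stop) ranges covering num_taps as evenly as possible."""
    base, extra = divmod(num_taps, num_workers)
    ranges = []
    start = 0
    for w in range(num_workers):
        stop = start + base + (1 if w < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

def partial_sum(history, taps, start, stop):
    """Work of one FIR thread; history[k] holds x[n-k]."""
    return sum(taps[k] * history[k] for k in range(start, stop))

def distributed_fir_output(history, taps, num_workers, pool):
    """Work of the distribution thread: hand out blocks, add the results."""
    futures = [pool.submit(partial_sum, history, taps, a, b)
               for a, b in split_taps(len(taps), num_workers)]
    return sum(f.result() for f in futures)

# Made-up coefficients and input history, 3000 taps as in the thread.
taps = [(k % 17) - 8 for k in range(3000)]
history = [(k * 7) % 11 - 5 for k in range(3000)]

with ThreadPoolExecutor(max_workers=15) as pool:
    y15 = distributed_fir_output(history, taps, 15, pool)

block_sizes_19 = [b - a for a, b in split_taps(3000, 19)]
print(y15, block_sizes_19)
```

With 15 threads every block is 200 taps; with 19 threads the blocks are 157 or 158 taps, so the load is not perfectly even but close.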