Howto build a FIR with 3000 points

Post by **lilltroll** » Sat Oct 08, 2011 3:08 pm

Have you considered multirate filtering?

Do you really need 3000 taps for frequencies between 20 and 48 kHz?

You could run 1500 taps at 48 kHz instead, or maybe 1000 taps at 32 kHz (same length of the impulse response), and something with far fewer taps at 96 kHz for the highest frequencies.
(A human over the age of 20 in the industrialized world typically doesn't hear much above 16 kHz anymore, since we are exposed to much more sound and noise than evolution fitted human hearing to handle. We would need a redesign of the stapedius muscle in the ear for the new world of noise.)
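
The tap counts above follow from keeping the impulse response the same length in time: 3000 taps at 96 kHz span 31.25 ms, and so do 1500 taps at 48 kHz or 1000 taps at 32 kHz. A minimal sketch of that arithmetic, where the function name `taps_for_same_duration` is made up for illustration:

```c
#include <assert.h>

/* Number of taps needed at rate fs_hz to cover the same time span as
   ref_taps taps at ref_fs_hz. Hypothetical helper, rounds to nearest. */
static unsigned taps_for_same_duration(unsigned ref_taps,
                                       unsigned ref_fs_hz,
                                       unsigned fs_hz)
{
    assert(ref_fs_hz != 0);
    /* 64-bit intermediate so ref_taps * fs_hz cannot overflow */
    unsigned long long num = (unsigned long long)ref_taps * fs_hz;
    return (unsigned)((num + ref_fs_hz / 2) / ref_fs_hz);
}
```

Calling it with (3000, 96000, 48000) gives 1500 and with (3000, 96000, 32000) gives 1000, the two figures quoted in the post.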

Tiny tiny muscle in the pic.

Anyway, multirate is very easy to implement on XMOS.
http://en.wikipedia.org/wiki/Multi-rate ... processing
http://en.wikipedia.org/wiki/Quadrature_mirror_filter
https://docs.google.com/viewer?a=v&q=ca ... b5ySdKchGQ

Post by **lilltroll** » Sun Oct 09, 2011 2:13 pm

Well, maybe it wasn't so hard after all.

I have written a program that is intended to calculate the
1st sample on Core0
2nd sample on Core1
3rd sample on Core2
4th sample on Core3
5th sample on Core0
...

The impulse response is 3000 taps long.

Meaning that you will have a delay of 4 samples + CODEC.
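
The round-robin scheme described above can be sketched in plain C. This is only an illustration of the scheduling arithmetic, not the XC program from the post; `core_for_sample` and `per_core_budget_ns` are hypothetical names, and samples are counted from 0 here.

```c
#include <assert.h>

enum { NUM_CORES = 4 };

/* Core that computes output sample n (0-based) when consecutive samples
   are handed out round-robin: 0 -> Core0, 1 -> Core1, ..., 4 -> Core0. */
static unsigned core_for_sample(unsigned long n)
{
    return (unsigned)(n % NUM_CORES);
}

/* Time each core has to finish one output, in nanoseconds. A core only
   receives every cores-th sample, so its deadline is cores sample
   periods. The result is truncated to whole nanoseconds. */
static unsigned long long per_core_budget_ns(unsigned fs_hz, unsigned cores)
{
    assert(fs_hz != 0);
    return (unsigned long long)cores * 1000000000ULL / fs_hz;
}
```

At 96 kHz with 4 cores the budget is about 41.7 µs per output, which is where the 4-sample delay comes from. With 3000 taps that works out to roughly 72 MTaps/s per core.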

I timed the results and it looks like this.

Console:
Starting
911ms total time for 24000 samples on 1 core
Stopping

Everything below 1 s is a success, so :D (The other 3*24000 samples are calculated on the other 3 cores.)

Distributed FIR filtering is a to-do item in the filter module anyway, so this helps Henk a little :geek:
But I haven't verified that the result is correct yet. The intention is to let MATLAB send the signal via UART to my XDK, and to check the real-time result against the same FIR filtering done in MATLAB.

(It will work on any XC-1 card as well)

Post by **dirk1980** » Sun Oct 09, 2011 2:35 pm

You have written the whole program?
And it is running on an L1 system?

Well, I think I have to learn more about the XMOS.
I don't understand your firAsm code; I need more comments in the code.
15 years of C and assembler is not enough.
New system, new problems!

Can you post the code somewhere?

Post by **lilltroll** » Sun Oct 09, 2011 2:53 pm

It is running on a G4, the XDK kit.

Here is the final result. What was the penalty for distributing it to 4 cores?

Starting
913ms total time for 96000 samples on 4 cores
Stopping

Only 2 ms :D :D :D
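
The two console timings can be turned into throughput figures. A small sketch, assuming every output sample costs the full 3000 taps; the helper names are invented for this example:

```c
#include <assert.h>

/* Filter throughput in millions of taps per second. */
static double mtaps_per_second(unsigned long samples, unsigned taps,
                               double seconds)
{
    assert(seconds > 0.0);
    return (double)samples * (double)taps / seconds / 1e6;
}

/* True if `samples` outputs at sample rate fs_hz were produced in no more
   time than the audio itself lasts, i.e. the filter keeps up in real time. */
static int meets_real_time(unsigned long samples, unsigned fs_hz,
                           double seconds)
{
    assert(fs_hz != 0);
    return seconds <= (double)samples / (double)fs_hz;
}
```

With the numbers above, 24000 samples in 911 ms is about 79 MTaps/s on one core, and 96000 samples in 913 ms is about 315 MTaps/s over four cores. One second of 96 kHz audio processed in 913 ms passes the real-time check with roughly 9% to spare.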

There should only be minor bugs in the calculation, if any. Everything seems to be working.
I need to comment things myself before I forget them, since I had to change things in firAsm to use a shared memory space on each core.

Post by **lilltroll** » Sun Oct 09, 2011 3:00 pm

Which CODEC chip do you intend to use?
If it is a TI or Cirrus, I probably have the register setup already, but not for Wolfson.

Post by **dirk1980** » Sun Oct 09, 2011 6:17 pm

G4 is perfect, no problem for me.

The CODEC / ADC / DAC choice is still open.
All the setup is done from the ARM7.
The filter system just runs as a fixed block in the middle of the ADC / DAC / ARM7 chain.

Maybe I can change this chain and replace the ARM7 with a second XMOS G4.

Post by **lilltroll** » Sun Oct 09, 2011 7:05 pm

The audio setup is done before the filtering starts. You have a lot of pins to use. I do not think you need another G4 unless you want to run in stereo.

The threads aren't static. Since the distributor thread on one core, which feeds all the other threads, takes some time, there is still time left on the other 3 cores if you need to do more in parallel with the filtering. As it is now, it could be optimized, but it seems that it doesn't need to be.

The XDK has a stereo CODEC capable of running at 96 kHz, so I can probably test-connect the filter to the CODEC.

The offset of the memory pointers to the old states of the input signal is still wrong in some threads. Hmm, you really have to think to get it right when distributing it.

It must pass this first test

MATLAB

Code:

>> filter([0:2999],1,1:13)

ans =

     0     1     4    10    20    35    56    84   120   165   220   286   364

That has to happen before I push it to git; otherwise you will be even more confused, since the little code on gist was proven to be right.
If you haven't done it yet, install the Git plugin in XDE. Even if you don't have a card yet, you can run things in the simulator to see it for yourself, including timing :)
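
For checking a distributed implementation against that MATLAB test, a plain direct-form reference is handy. The sketch below is not the code from the repository; `fir_reference` is a hypothetical name, and it follows the same convention as `filter(h, 1, x)` with zero initial state.

```c
#include <assert.h>
#include <stddef.h>

/* Direct-form FIR with zero initial state:
   y[n] = sum over k = 0 .. min(n, ntaps-1) of h[k] * x[n-k] */
static void fir_reference(const long long *h, size_t ntaps,
                          const long long *x, long long *y,
                          size_t nsamples)
{
    assert(h != NULL && x != NULL && y != NULL);
    for (size_t n = 0; n < nsamples; n++) {
        long long acc = 0;
        /* samples before x[0] are zero, so stop at k == n */
        for (size_t k = 0; k < ntaps && k <= n; k++)
            acc += h[k] * x[n - k];
        y[n] = acc;
    }
}
```

Feeding it h[k] = k for k = 0..2999 and x = 1..13 reproduces the MATLAB row above, 0 1 4 10 20 ... 364, so the output of the four cores can be compared sample by sample against it.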

This is your lucky day because I'm in bed with a bit of a cold and it is raining outside. Once I'm back at work I probably cannot make everything ready to deploy for you, at least not for free. But this was good training for me, since I need to write distributed adaptive FIR filters, and a lot of them, but not at 96 kHz.

The XDK is expensive, but the XC-1A card is not. It has free pins for a codec, including I2C/SPI to set it up.
But if it is a commercial project, the money for the XDK is maybe not a big deal.

If you prefer the L series, it is possible to use that as well. The BGA package needs many layers on the PCB. With the L you can do it with 2 or 4 layers. And if you need more power you can just add one more.
Maybe a cute little row of L1-48 is interesting, or dual L2s.

The G4 has very high bandwidth in the interconnect between the channels, but I didn't use all of it, since I wanted the code to be L compatible.
The L series has fewer interconnections between the switches.

Can it be squeezed into three 500 MHz devices? Maybe, but the timing might be tight in the end.

Post by **bearcat** » Mon Oct 10, 2011 5:22 pm

A FIR can be implemented in a little over 3 instructions per tap. Using indexed loads (and double buffers) allows 10 taps (12 is possible) to execute per loop. The loop overhead, which is 3 instructions in my code, is then spread out over the 10 taps. I preload the loop constants in registers. (10 taps x 3 + 3) / 10 = 3.3 instructions per tap. There might be a cache stall in there, which could lower the execution rate. I have not timed my code to determine its actual execution rate. Here's a snippet:

Code:

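    // Results go in the output list ("=r"); "0"/"1" tie the old h:l accumulator to the maccs outputs.
    asm("ldw %0, %1[1]" : "=r" (j) : "r" (sindex));   // j = word at sindex[1]
    asm("ldw %0, %1[1]" : "=r" (k) : "r" (findex));   // k = word at findex[1]
    asm("maccs %0, %1, %2, %3" : "=r" (h), "=r" (l) : "r" (j), "r" (k), "0" (h), "1" (l));   // h:l += j * k
    asm("ldw %0, %1[2]" : "=r" (j) : "r" (sindex));   // j = word at sindex[2]
    asm("ldw %0, %1[2]" : "=r" (k) : "r" (findex));   // k = word at findex[2]
    asm("maccs %0, %1, %2, %3" : "=r" (h), "=r" (l) : "r" (j), "r" (k), "0" (h), "1" (l));   // h:l += j * k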
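
For readers without an XMOS toolchain, the same idea can be written in portable C: each tap is two loads plus one signed 32x32 multiply accumulated into 64 bits, and the loop is unrolled by 10 so the loop overhead is paid once per 10 taps. This is only an analogue of the snippet above, not bearcat's code; `fir_mac_unrolled` and `instructions_per_tap` are names invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { UNROLL = 10 };

/* Dot product of samples and coeffs, unrolled UNROLL taps per iteration.
   ntaps must be a multiple of UNROLL (3000 is). */
static int64_t fir_mac_unrolled(const int32_t *samples,
                                const int32_t *coeffs, size_t ntaps)
{
    assert(ntaps % UNROLL == 0);
    int64_t acc = 0;
    for (size_t i = 0; i < ntaps; i += UNROLL) {
        const int32_t *s = samples + i;
        const int32_t *c = coeffs + i;
        /* each line: load sample, load coefficient, multiply-accumulate */
        acc += (int64_t)s[0] * c[0];
        acc += (int64_t)s[1] * c[1];
        acc += (int64_t)s[2] * c[2];
        acc += (int64_t)s[3] * c[3];
        acc += (int64_t)s[4] * c[4];
        acc += (int64_t)s[5] * c[5];
        acc += (int64_t)s[6] * c[6];
        acc += (int64_t)s[7] * c[7];
        acc += (int64_t)s[8] * c[8];
        acc += (int64_t)s[9] * c[9];
    }
    return acc;
}

/* Average instruction count per tap for an unrolled loop. */
static double instructions_per_tap(unsigned taps_per_loop,
                                   unsigned instr_per_tap,
                                   unsigned loop_overhead)
{
    assert(taps_per_loop != 0);
    return (double)(taps_per_loop * instr_per_tap + loop_overhead)
           / (double)taps_per_loop;
}
```

`instructions_per_tap(10, 3, 3)` gives the 3.3 quoted in the post, and unrolling by 12 would bring it down to 3.25. These counts ignore any pipeline or fetch stalls.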

Post by **lilltroll** » Mon Oct 10, 2011 7:33 pm

bearcat wrote: A FIR can be implemented in a little over 3 instructions per tap. Using indexed loads (and double buffers) allows 10 taps (12 is possible) to execute per loop. The loop overhead, which is 3 instructions in my code, is then spread out over the 10 taps. I preload the loop constants in registers. (10 taps x 3 + 3) / 10 = 3.3 instructions per tap. There might be a cache stall in there, which could lower the execution rate. I have not timed my code to determine its actual execution rate. Here's a snippet:
Code:
    asm("ldw %0, %1[1]" : "=r" (j) : "r" (sindex));
    asm("ldw %0, %1[1]" : "=r" (k) : "r" (findex));
    asm("maccs %0, %1, %2, %3" : "=r" (h), "=r" (l) : "r" (j), "r" (k), "0" (h), "1" (l));
    asm("ldw %0, %1[2]" : "=r" (j) : "r" (sindex));
    asm("ldw %0, %1[2]" : "=r" (k) : "r" (findex));
    asm("maccs %0, %1, %2, %3" : "=r" (h), "=r" (l) : "r" (j), "r" (k), "0" (h), "1" (l));

In your example only the maccs leaves room to fetch 32 bits of instructions into the instruction buffer, and that is not enough to fetch 2 ldw and 1 maccs. You will get an FNOP. Test it in the simulator or XTA. An FNOP is useless, so it is better to put an instruction there that actually does some work and also allows the instruction fetch.

See http://archive.xmoslinkers.org/conf-talk-6h-wf @ 49:25 -> 53:15
The idea with XMOS is to not place several load-word/store-word instructions directly after each other.
See @ 55:13
They do not want a race between data access and instruction access, because that destroys determinism.
But if you compile it in a smart way, this is a way to actually run the ALU at 400 MHz, always doing an operation, and to access the SRAM at 400 MHz with a read or write, but never have a memory collision.

I do not believe you until I see a trace from the simulator :ugeek:

What might be possible is to hide the wraparound check needed for a circular buffer between the ldw instructions if you unroll the loop, thus keeping 20 MTaps/s at 100 MHz without the double data/coefficient method.
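
For context, the "double data" method mentioned above is commonly done by writing every input sample twice, NTAPS entries apart, so that the most recent NTAPS samples always sit in one contiguous run and the inner loop needs no wraparound check. The sketch below shows that general technique in C under that assumption; it is not taken from either poster's code, all names are hypothetical, and NTAPS is kept small for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { NTAPS = 8 };   /* the thread uses 3000; 8 keeps the example short */

typedef struct {
    int32_t buf[2 * NTAPS];  /* every sample is stored twice, NTAPS apart */
    size_t  pos;             /* next write position, 0 .. NTAPS-1 */
} delay_line;

/* Store x in both halves, then advance the write position. */
static void delay_push(delay_line *d, int32_t x)
{
    assert(d->pos < NTAPS);
    d->buf[d->pos] = x;
    d->buf[d->pos + NTAPS] = x;
    d->pos = (d->pos + 1) % NTAPS;   /* the only wraparound, once per sample */
}

/* NTAPS contiguous samples, oldest first, newest last. */
static const int32_t *delay_window(const delay_line *d)
{
    return &d->buf[d->pos];
}

/* One output sample: y = sum over k of h[k] * x[n-k]. */
static int64_t fir_step(delay_line *d, const int32_t *h, int32_t x)
{
    delay_push(d, x);
    const int32_t *w = delay_window(d);
    int64_t acc = 0;
    for (size_t k = 0; k < NTAPS; k++)
        acc += (int64_t)h[k] * w[NTAPS - 1 - k];   /* no index check needed */
    return acc;
}
```

The cost is twice the sample memory and one extra store per input sample. The alternative raised in the post keeps a single buffer and pays for the wraparound check inside the unrolled loop instead.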

Post by **bearcat** » Tue Oct 11, 2011 5:50 am

Great insight, lilltroll. My initial quick timing calculations didn't take the pipeline stalls into account. My implementation is 3.3 instructions per tap as stated, though. After reviewing my lab notes, prior timings on real hardware indicate that your rule of thumb of 20 MTaps/sec/thread for a FIR is pretty accurate.

Using those FNOPs for further optimizations could be interesting. I had looked for the "best" optimized filter routines from XMOS earlier and did not find them. I had thought XMOS might have a filter library that is fully optimized, since they write these themselves. Is one available? Are yours the best optimized?
