"High-End Audio" IIR filters

XCore Project reviews, ideas, videos and proposals.
User avatar
lilltroll
XCore Expert
Posts: 956
Joined: Fri Dec 11, 2009 3:53 am
Location: Sweden, Eskilstuna

Post by lilltroll »

[quote="lilltroll"]First the filter coeff's in Q30.2 format

Code: Select all

B0={1073741824,-1072758407,0}
A0={1073741824,-1073741824,0}
...
[/quote]
Meaning on a normalized scale
A[sub]0[/sub]=1 A[sub]1[/sub]=-1

Opps that is a nasty one. It will make a DC signal grow to infinity, it is simply written upside down, (a copy and paste error from MATLAB).


bearcat
Respected Member
Posts: 282
Joined: Fri Mar 19, 2010 4:49 am

Post by bearcat »

Thanks for the info, always informative.

I have written a 48 bit version of the IIR for the XMOS. 48 bits works out computational effecient with the XMOS architechure. A 32 bit BiQuad IIR is about 23 instructions, where as the 48 bit version was only about 60% more instructions, or around 44 instructions. Much better than 64 bits. You may want to try that option also.

I have also implemented a optional fixed bit shift on the feed forward coefficients to achive a few more bits of resolution for the coefficients as the input signal is garbage at the lowest bits anyways or non-existent for 24 bit inputs, for example. Granted, not as important as the feed back.

Where as 32 bit DACS are now available from TI, actually achieving that SNR for an analog output would be quite a challenge. Most audio equipment is pushing to get 105+dB of SNR (at least non-professional stuff without high voltage outputs). Good for marketing, though.

Edit: updated instruction count after looking at code
Last edited by bearcat on Fri May 11, 2012 3:18 am, edited 1 time in total.
User avatar
lilltroll
XCore Expert
Posts: 956
Joined: Fri Dec 11, 2009 3:53 am
Location: Sweden, Eskilstuna

Post by lilltroll »

How does 48bit look like in pseudo code ?

Here is 96bit MAC and 64bit feedback states

int64Y=( (int32B0*int32X0+int32B1*int32X1+int32B2*int32X2)<<32-int32A1*int64Y1-int32A2*int64Y2) / int32A0
int64Y2=int64Y1; int64Y1=int64Y;
int32X2=int32X1; int32X1=int32X0;

It is calculated with help of some wrapping to aviod the instructions for memory copy of the internal filter states
while{
int96ACC=(int32B1*int32X2+int32B2*int32X1)<<32
cin :> X1
int96ACC+=( int32B0*int32X1)<<32-int32A2*int64Y1-int32A1*int64Y2)
int64Y1=int96ACC>>30
cout <: round32MSB(int64Y1)
int96ACC=(int32B1*int32X1+int32B2*int32X2)<<32
cin :> X2
int96ACC+=( int32B0*int32X2)<<32-int32A2*int64Y2-int32A1*int64Y1)
int64Y2=int96ACC>>30
cout <: round32MSB(int64Y2)
}
bearcat
Respected Member
Posts: 282
Joined: Fri Mar 19, 2010 4:49 am

Post by bearcat »

In my 48 bit code, I only implemented 48 bits for the two feedback stored states and their coefficients. Forward coefficients were 32 bits, but are 6 bit shifted if possible, and shifted sample 6 bits. Of course, these could also be 48 bits with more instructions. I am using indexed loads for speed in the assembler code, and 12 isn't enough to provide 48 bit forward coefficients also.

For the 48 bits just perform a long multiply with the appropriate shifts between the high and low values. With optimizations, correct shifts, and a couple tricks (like 1 bit shifting a HL pair), it can be done in about 44 instructions give or take. Which is a significant improvement over a full double implementation.

I did perform SNR and THD measurements at the time on the 32 and 48 bit filters, but did not find the data to include here.
User avatar
Gaston
Junior Member
Posts: 4
Joined: Wed Jan 27, 2010 10:39 am

Post by Gaston »

bearcat, would you mind sharing your code? We could then possibly remeasure its performance and have more options for people to choose from.

As long as we can meet SNR and THD good enough to allow digital volume control we have enough headroom.
User avatar
lilltroll
XCore Expert
Posts: 956
Joined: Fri Dec 11, 2009 3:53 am
Location: Sweden, Eskilstuna

Post by lilltroll »

bearcat wrote:In my 48 bit code, I only implemented 48 bits for the two feedback stored states and their coefficients. Forward coefficients were 32 bits, but are 6 bit shifted if possible, and shifted sample 6 bits. Of course, these could also be 48 bits with more instructions. I am using indexed loads for speed in the assembler code, and 12 isn't enough to provide 48 bit forward coefficients also.

For the 48 bits just perform a long multiply with the appropriate shifts between the high and low values. With optimizations, correct shifts, and a couple tricks (like 1 bit shifting a HL pair), it can be done in about 44 instructions give or take. Which is a significant improvement over a full double implementation.

I did perform SNR and THD measurements at the time on the 32 and 48 bit filters, but did not find the data to include here.
OK, I get it, you avoid the shifting in /A0, when the result is shifted 32*n bits. Hmm, I will test Q2.32 for the A coeff, and x<<=2
User avatar
lilltroll
XCore Expert
Posts: 956
Joined: Fri Dec 11, 2009 3:53 am
Location: Sweden, Eskilstuna

Post by lilltroll »

That speeds up the code. Increasing the coeff. length speeds up the filter :)

I'm using 39 "real" instructions, making it in 42.5 clockcycles / sample including the FNOPS.
That includes round(y) at the end (instead of floor(y))

That means I can run a SOS link @ >2.8 MHz on a C5 thread.

B-coef. uses Q2.30, A-coef. uses Q2.32, feedback loop uses 64-bit internal states, fordward loop uses 32 bits, and the accumulator uses 96 bits.

Lets test several SOS link on one thread next.
bearcat
Respected Member
Posts: 282
Joined: Fri Mar 19, 2010 4:49 am

Post by bearcat »

The bit shifting is nothing new or unique, it's documented by many other people. So let's look at an actual case. Here's the coefficients for a 250Hz low pass 2nd order IIR:

Code: Select all

{  //crossML1 250Hz, 31bits, A1/2, B0,B1,B2,A1,A2,BitShift
42517323,  // 0.00030935424661235637
85034647,  // 0.00061870849322471274
42517323,  // 0.00030935424661235637
2093407256,  // 0.97481871793958885242
-2041988198,  // -0.95087485286562711817
6
},
My MATLAB code looks at the coefficients and determines how many bits are wasted by B0, B1, and B2. If its possible to bit shift by 6 bits, it sets the BitShift field accordingly. My filter "loader" then sets the appropriate data structures to run the bit shift IIR which shifts right the input sample by 6 bits.

This adds resolution since the sample value is going to be something like 0xffffff00 for a 24 bit audio sample and the lower 8 bits are zeros. My crossovers are the second to last operation to be performed on the data prior to going to the DAC's, which means they can be in 0.32 format for highest resolution. The last operation is a MACC for volumes.

Forgot to mention, that you do need a little headroom or the values will overflow from the filter due to a small overshoot sometimes. I reduce the volume by about 1dB earlier in the code.
User avatar
lilltroll
XCore Expert
Posts: 956
Joined: Fri Dec 11, 2009 3:53 am
Location: Sweden, Eskilstuna

Post by lilltroll »

Bearcat, do you use 2-comp. longmult or do test the sign and then branch to unsigned longmult ?

An example of 2-comp. longmult adding earlier 96-bit accumulator value hi:mi:lo

Code: Select all

// s6 = 31
//Calc (int96) ACC+ =|2-complement|= Y2sign:Y2hi:Y2lo * A2sign:A2hi:Alo
	ldw s1, y2hi
	ldw s2, y2lo
	ldw s3, a2hi
	ldw s4, a2lo
.align 4
	lmul   s5, lo, s2, s4, zero, lo  	//Ylo*Alo	-- LO --
	lmul   s5, mi, s2, s3, s5, mi  	//Ylo*Ahi	-- MID --
	maccu  s5, mi, s1, s4		  		//Yhi*Alo	-- MID --
	lmul   s5, hi, s1, s3, s5, hi  	//Yhi*Ahi	-- HI --
	ashr   s3, s3, s6					//extract sign from Ahi
	ashr   s1, s1, s6					//extract sign from Yhi
	maccu  s5, hi, s2, s3		 		//Ylo*Asign	-- HI --
	maccu  s5, hi, s1, s4		 		//Ysign*Alo	-- HI --
User avatar
lilltroll
XCore Expert
Posts: 956
Joined: Fri Dec 11, 2009 3:53 am
Location: Sweden, Eskilstuna

Post by lilltroll »

while{
int96ACC=(int32B1*int32X2+int32B2*int32X1)<<32
cin :> int30X1
int32X1<<=2
int96ACC+=( int32B0*int32X1)<<32-int34A2*int62Y1-int34A1*int62Y2)
int62Y1=int96ACC>>32
cout <: round30MSB(int62Y1)
int96ACC=(int32B1*int32X1+int32B2*int32X2)<<32
cin :> int30X2
int32X2<<=2
int96ACC+=( int32B0*int32X2)<<32-int34A2*int62Y2-int34A1*int62Y1)
int62Y2=int96ACC>>32
cout <: round30MSB(int62Y2)
}

Which will work as long I have 2 bits of headroom in the input X and in the output Y, (including headroom for overshoot)