Problem when realizing DSP algorithm: LMS on xCORE-200 MC Audio

akp
XCore Expert
Posts: 578
Joined: Thu Nov 26, 2015 11:47 pm

Post by akp »

The problem with using Q31 is that you could get overflow when you multiply-accumulate into a 64-bit register. Here is some explanatory text from ARM for their similar function (arm_fir_q31), which uses 32-bit input data and a 64-bit accumulator, with a 32-bit output:
Scaling and Overflow Behavior:

The function is implemented using an internal 64-bit accumulator. The accumulator has a 2.62 format and maintains full precision of the intermediate multiplication results but provides only a single guard bit. Thus, if the accumulator result overflows it wraps around rather than clip. In order to avoid overflows completely the input signal must be scaled down by log2(numTaps) bits. After all multiply-accumulates are performed, the 2.62 accumulator is right shifted by 31 bits and saturated to 1.31 format to yield the final result.
So it appears you must scale down your input by log2(num_taps) bits to ensure you don't get overflow. For example, with 128 taps you would need to shift all your data right by log2(128) = 7 bits. That's why I said @CousinItt's suggestion of using Q7.24 to avoid overflow is so useful.
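
To make the prescaling concrete, here is a minimal C sketch of a Q1.31 FIR output computed with a 64-bit accumulator. The names `ceil_log2` and `fir_q31_prescaled` are hypothetical and are not taken from the XMOS or ARM libraries. It assumes that right-shifting a negative `int32_t` is an arithmetic shift, which is implementation-defined in C but is what mainstream compilers do.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: smallest s such that (1 << s) >= n, i.e. ceil(log2(n)). */
static unsigned ceil_log2(size_t n)
{
    unsigned s = 0;
    while (((size_t)1 << s) < n) {
        s++;
    }
    return s;
}

/* Hypothetical Q1.31 FIR for a single output sample.
 * x[] holds the num_taps most recent inputs, h[] the coefficients.
 * Each input is shifted right by ceil(log2(num_taps)) bits before the
 * multiply, so the sum of num_taps Q2.62 products cannot exceed 2^62
 * in magnitude and therefore fits in the signed 64-bit accumulator. */
static int32_t fir_q31_prescaled(const int32_t *x, const int32_t *h,
                                 size_t num_taps)
{
    unsigned shift = ceil_log2(num_taps);
    int64_t acc = 0;

    for (size_t i = 0; i < num_taps; i++) {
        int32_t xs = x[i] >> shift;      /* assumes arithmetic shift */
        acc += (int64_t)xs * h[i];       /* Q1.31 * Q1.31 -> Q2.62 */
    }

    acc >>= 31;                          /* Q2.62 -> Q1.31 (plus guard bit) */
    if (acc > INT32_MAX) acc = INT32_MAX; /* saturate to 1.31 */
    if (acc < INT32_MIN) acc = INT32_MIN;
    return (int32_t)acc;
}
```

Note that the output is smaller than the unscaled filter output by a factor of 2^shift, and the shifted-out low bits of the input are lost; that is the price of the extra headroom.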


cjf1699
Active Member
Posts: 48
Joined: Fri Mar 16, 2018 2:30 pm

Post by cjf1699 »

Thank you! But why log2(num_taps)? I can't figure it out.
CousinItt
Respected Member
Posts: 360
Joined: Wed May 31, 2017 6:55 pm

Post by CousinItt »

Sorry, been absent for a while. If it's not too late, yes you're correct. The more taps you have, the larger the accumulated result could be. Hence it makes sense to use a fixed-point format with more headroom (more integer bits, fewer fractional bits) when you have more taps, to avoid the risk of overflow.
akp
XCore Expert
Posts: 578
Joined: Thu Nov 26, 2015 11:47 pm

Post by akp »

I think cjf1699's problem may be that he/she doesn't really understand fixed-point numbers; otherwise he/she could figure out why overflow is possible.

It's pretty obvious intuitively that if you multiply two n-bit numbers together you can get a 2n-bit result. A FIR filter then adds num_taps of those products together, and every doubling of the number of terms can add one more bit to the sum, so the accumulator may need up to about 2n + log2(num_taps) bits. Prescaling the input down by log2(num_taps) bits is (essentially) the same as dividing by 2^(log2(num_taps)) = num_taps. So the sum of num_taps prescaled products is back to about the size of a single full-scale product, i.e. 2n bits, which fits the accumulator (or close enough for intuition anyway).
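
To put numbers on the bit growth, here is a small C sketch that computes the worst-case magnitude of a sum of products and checks whether it fits a given accumulator width. The helpers `worst_case_sum` and `fits_signed` are hypothetical names made up for this illustration, and 16-bit data is used only so the arithmetic stays well inside `int64_t`.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical helper: largest possible |sum| of num_taps products of two
 * n_bits-wide signed values, with the data prescaled right by `prescale`
 * bits. Intended for n_bits <= 16 and modest num_taps so that the result
 * fits comfortably in int64_t. */
static int64_t worst_case_sum(unsigned n_bits, size_t num_taps,
                              unsigned prescale)
{
    int64_t max_mag = (int64_t)1 << (n_bits - 1); /* |most negative value| */
    int64_t data = max_mag >> prescale;           /* prescaled data magnitude */
    return data * max_mag * (int64_t)num_taps;
}

/* True if a magnitude fits a signed accumulator of acc_bits bits
 * (conservatively requiring mag <= 2^(acc_bits - 1) - 1). */
static bool fits_signed(int64_t mag, unsigned acc_bits)
{
    return mag <= (((int64_t)1 << (acc_bits - 1)) - 1);
}
```

With 16-bit data and 128 taps, `worst_case_sum(16, 128, 0)` is 2^37, which does not fit a 32-bit (2n) accumulator, whereas `worst_case_sum(16, 128, 7)` is 2^30, which does. The same reasoning, scaled up to 32-bit data and a 64-bit accumulator, is where the log2(num_taps) figure comes from.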