How do I target the LDIV instruction?

Malcolm
Junior Member
Posts: 7
Joined: Wed Dec 11, 2013 1:53 pm

Post by Malcolm »

I'd like to access the LDIV instruction from C++ and am currently attempting to do so by:

Code:

	static uint32_t divu_fract (uint32_t dividend, uint32_t divisor)
	{
		return (uint32_t)((((uint64_t)dividend) << 32) / divisor);
	}

Unfortunately this compiles to a library call to __udivdi3 rather than LDIVU.

I've found https://www.xmos.com/node/16493, which tells me about intrinsics to target maccs/maccu, but searching xs1.h I can't find a corresponding intrinsic for LDIVU.

I've also tried

Code:

static inline void xmos_ldivu(uint32_t &quotient, uint32_t &remainder, uint32_t numeratorLo, uint32_t numeratorHi, uint32_t denominator)
{
    asm("LDIVU %0, %1, %2, %3, %4" : "=r"(quotient), "=r"(remainder) : "r"(numeratorLo), "r"(denominator), "r"(numeratorHi));
}

and this produces the error "A00000 Use of "LDIVU" outside of architectural syntax mode.", which I can't find any further information on.


sethu_jangala
XCore Expert
Posts: 589
Joined: Wed Feb 29, 2012 10:03 am

Post by sethu_jangala »

If you look at the XS1 Architecture manual in the following link (Page 136): https://www.xmos.com/download/public/Th ... 7879A).pdf

The LDIVU instruction only works if the quotient fits in a 32-bit word, that is, if the high word of the double-word dividend is less than the divisor. The operation is intended to be used as a building block for long division.
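
To make that precondition concrete, here is a minimal portable C++ sketch of the arithmetic LDIVU performs, going only by the description above. The names `LdivuResult` and `ldivu_model` are hypothetical (they are not XMOS APIs), and the sketch asserts the precondition rather than modelling what the hardware does when it is violated.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reference model of LDIVU (not an XMOS API).
// Divides the 64-bit value (hi:lo) by d.
struct LdivuResult {
    uint32_t quotient;
    uint32_t remainder;
};

static inline LdivuResult ldivu_model(uint32_t hi, uint32_t lo, uint32_t d)
{
    // Precondition: the high word must be less than the divisor,
    // otherwise the quotient would not fit in 32 bits.
    assert(hi < d);
    const uint64_t n = (static_cast<uint64_t>(hi) << 32) | lo;
    return { static_cast<uint32_t>(n / d), static_cast<uint32_t>(n % d) };
}
```

For a fractional divide such as (dividend << 32) / divisor, the dividend is the high word and the low word is zero, so the result is only valid when dividend < divisor.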

Sethu.

Post by Malcolm »

I don't see how your answer addresses my question, unless you're telling me that I'll get LDIVU if I precede my snippet with some kind of pragma to tell the compiler to assume dividend < divisor?

I'm not asking if LDIVU is the instruction I need. I understand its operation but am asking how I get it into my program.
segher
XCore Expert
Posts: 844
Joined: Sun Jul 11, 2010 1:31 am

Post by segher »

Write "ldivu" in lower case; the order of the arguments is
q r nh nl d (quotient, remainder, numerator high word, numerator low word, denominator).

Do you want optimised 64/32 and 64/64 routines as well?
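
Applied to the wrapper from the first post, that fix might look like the sketch below. The inline asm line uses only the lower-case mnemonic and operand order given above. The `__XS1B__` / `__XS2A__` guard and the host fallback are assumptions added so the sketch also compiles off-target; check them against your own toolchain.

```cpp
#include <cassert>
#include <cstdint>

// Quotient and remainder of the 64-bit value (numeratorHi:numeratorLo)
// divided by denominator. Requires numeratorHi < denominator.
static inline void xmos_ldivu(uint32_t &quotient, uint32_t &remainder,
                              uint32_t numeratorLo, uint32_t numeratorHi,
                              uint32_t denominator)
{
    assert(numeratorHi < denominator);
#if defined(__XS1B__) || defined(__XS2A__)
    // Assumption: these macros identify an xCORE target.
    // Lower-case mnemonic, operand order q, r, nh, nl, d.
    asm("ldivu %0, %1, %2, %3, %4"
        : "=r"(quotient), "=r"(remainder)
        : "r"(numeratorHi), "r"(numeratorLo), "r"(denominator));
#else
    // Portable fallback so the sketch can be tried on a host compiler.
    const uint64_t n = (static_cast<uint64_t>(numeratorHi) << 32) | numeratorLo;
    quotient  = static_cast<uint32_t>(n / denominator);
    remainder = static_cast<uint32_t>(n % denominator);
#endif
}

// The fractional divide from the first post: (dividend << 32) / divisor.
static inline uint32_t divu_fract(uint32_t dividend, uint32_t divisor)
{
    uint32_t q, r;
    xmos_ldivu(q, r, 0, dividend, divisor);
    (void)r;  // remainder not needed here
    return q;
}
```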

Post by Malcolm »

Thank you, problem solved. The A00000 error was indeed due to the upper case opcode and your operand order does appear to work for me.

However your order of operands differs from that stated on p136 of A9R518E. How did you find out the correct operand order is q r nh nl d?

My only other long division is the same but dividing signed by unsigned. Which I do by adding (divisor << 31) to the dividend, using ldivu and then subtracting 1<<31.

Post by segher »

Malcolm wrote: Thank you, problem solved. The A00000 error was indeed due to the upper case opcode and your operand order does appear to work for me.

However your order of operands differs from that stated on p136 of A9R518E. How did you find out the correct operand order is q r nh nl d?

The architectural syntax (as used in the architecture book) is written in upper case; the normal assembler syntax is in lower case, and has a much nicer operand order. Pretty much everything about it is nicer to use really; unless you think having to write "BRBU" and "BRFU" is nicer than just "bu" for both. There is a doc called something like "assembly language manual" that describes it all (it used to be called xas99.pdf).

Malcolm wrote: My only other long division is the same but dividing signed by unsigned. Which I do by adding (divisor << 31) to the dividend, using ldivu and then subtracting 1<<31.

That rounds down; you usually want to round towards zero.
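
To illustrate the difference, here is a portable C++ sketch of the bias trick described in the quote. `divs_floor` is a hypothetical name, and the 64-bit `/` stands in for the single ldivu a real xCORE version would use. Adding divisor * 2^31 makes the dividend non-negative, the unsigned divide then yields floor(n / d) + 2^31, and subtracting 2^31 leaves the floored quotient.

```cpp
#include <cassert>
#include <cstdint>

// Floor division of a signed 64-bit dividend by an unsigned 32-bit divisor.
// Requires -(d * 2^31) <= n < d * 2^31 so that the quotient fits in int32_t.
static inline int32_t divs_floor(int64_t n, uint32_t d)
{
    const int64_t bias = static_cast<int64_t>(d) << 31;  // d * 2^31
    assert(n >= -bias && n < bias);                      // also rejects d == 0
    // Add the bias in unsigned arithmetic to avoid signed overflow.
    const uint64_t biased = static_cast<uint64_t>(n) + static_cast<uint64_t>(bias);
    // High word of 'biased' is now < d, which is the ldivu precondition.
    const uint32_t q = static_cast<uint32_t>(biased / d);
    // Remove the 2^31 offset from the quotient.
    return static_cast<int32_t>(static_cast<int64_t>(q) - (int64_t{1} << 31));
}
```

For example, divs_floor(-7, 2) gives -4, whereas the C expression -7 / 2 gives -3.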

here is my 64/64 (and 64/32) routine btw (just showing off, sorry :-) )

Post by Malcolm »

Actually, for my applications (audio processing) I usually want to round down, and it's a bit of a pain that C rounds to zero. I want my roundoff error to be as independent of the data as is feasible in the circumstances, but rounding to zero makes the roundoff error change range completely as the data passes through zero.

I guess the set of multiple precision arithmetic fanciers isn't as big as it might be, so I have taken a peek along with the required two coffees.

I like the concept of using mkmsk & 2 lmuls to do a 64 bit shift - and positively sneaky to add r4 into the result because you haven't got a zeroed register to hand and know it won't carry into the word that you're interested in.

Performing a multiply-and-accumulate by a -ve operand with the unsigned lmuls and lsubs ensured that second mug got good use.

Nice

PS - thanks for pointing me towards the assembly book. How silly of me to think that when the architecture book gives me a load of opcodes that they're the real deal.
Ross
XCore Expert
Posts: 972
Joined: Thu Dec 10, 2009 9:20 pm
Location: Bristol, UK

Post by Ross »

Malcolm wrote: How silly of me to think that when the architecture book gives me a load of opcodes that they're the real deal.

They quite literally are the "real deal". However, it's easier to use the normal assembler syntax (as documented by the assembly programming manual), as segher suggests. You can switch between the two with the syntax directive in an asm file:

.syntax architectural

or

.syntax default
lilltroll
XCore Expert
Posts: 956
Joined: Fri Dec 11, 2009 3:53 am
Location: Sweden, Eskilstuna

Post by lilltroll »

In many cases of audio processing you have a signal x(n) and a variable k that you would like to divide the signal by, where k changes much less often than the signal.

It might be much more efficient to instead calculate a = 1/k, doing the division only when k is updated, and then perform a*x(n) on each sample. Using signed MAC the a*x can be performed in one cycle with a 64 bit result.

Just a tip to consider.
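
A rough portable C++ sketch of that idea, assuming a 32-bit fixed-point reciprocal a = floor(2^32 / k) and taking the high word of the 64-bit product as the result. The names `make_reciprocal` and `scale_sample` and the scaling by 2^32 are illustrative assumptions, not anything from the XMOS libraries. Because a is truncated, the result approximates x / k and can be slightly off when k is not a power of two.

```cpp
#include <cassert>
#include <cstdint>

// Computed only when k changes: a = floor(2^32 / k).
// Requires k >= 2 so that the reciprocal fits in 32 bits.
static inline uint32_t make_reciprocal(uint32_t k)
{
    assert(k >= 2);
    return static_cast<uint32_t>((uint64_t{1} << 32) / k);
}

// Per-sample work: one 32x32 -> 64 multiply, keeping the high word.
static inline int32_t scale_sample(int32_t x, uint32_t a)
{
    const int64_t product = static_cast<int64_t>(x) * static_cast<int64_t>(a);
    // Assumes an arithmetic right shift for negative values
    // (guaranteed from C++20, the usual behaviour before that).
    return static_cast<int32_t>(product >> 32);
}
```

With k = 4 the reciprocal is exactly 2^30, so scale_sample(1000, make_reciprocal(4)) gives 250.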

Post by segher »

lilltroll wrote: Using signed MAC the a*x can be performed in one cycle with a 64 bit result.

How do you do that? It usually costs me three cycles...