Is there a trick to left shift a HL pair efficiently?

bearcat
Respected Member
Posts: 283
Joined: Fri Mar 19, 2010 4:49 am

Post by bearcat »

I would really like to have a double-register SHL instruction to use after a MAC.
I can't find anything to do this without a brute-force sequence of maybe 3-4 instructions.

Any tricks? Did I miss something?


segher
XCore Expert
Posts: 844
Joined: Sun Jul 11, 2010 1:31 am

Post by segher »

The "brute force" way is 5 insns, unless shifting by a multiple of 8 bits:
(input is H:L and N, output is H:L):

sub nn,bpw,N
shl H,H,N
shr t,L,nn
or H,H,t
shl L,L,N

If N is a constant, the SUB can be replaced by an LDC or left out if nn is okay for a SHRI.
xcc uses this sequence for constant shifts, and does a libcall for variable shifts (and that
libcall does way too much work).
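
As a rough cross-check, the five-instruction sequence can be modelled in C on a host machine. This is only a sketch: the names `shl32`, `shr32` and `dshl_brute` are made up for illustration, and it assumes that an XS1 shift by 32 or more yields 0.

```c
#include <stdint.h>

#define BPW 32u  /* bits per word */

/* Assumed XS1 behaviour: a shift count of BPW or more gives 0. */
uint32_t shl32(uint32_t a, uint32_t n) { return n >= BPW ? 0u : a << n; }
uint32_t shr32(uint32_t a, uint32_t n) { return n >= BPW ? 0u : a >> n; }

/* Host-side model of the 5-insn sequence; intended for n in 0..32. */
uint64_t dshl_brute(uint32_t h, uint32_t l, uint32_t n)
{
    uint32_t nn, t;
    nn = BPW - n;        /* sub nn,bpw,N */
    h = shl32(h, n);     /* shl H,H,N    */
    t = shr32(l, nn);    /* shr t,L,nn   */
    h |= t;              /* or  H,H,t    */
    l = shl32(l, n);     /* shl L,L,N    */
    return ((uint64_t)h << 32) | l;
}
```

With these assumptions, `dshl_brute(0x12345678, 0x9ABCDEF0, 4)` gives `0x23456789ABCDEF00`.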


Here's a neat 3-insn sequence (2 insns if N is fixed, in a loop for example):

mkmsk m,N
shl H,H,N
maccu H,L,m,L
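
A minimal host-side C model of this sequence may help in seeing what it computes. The function names are hypothetical, and the model assumes that `mkmsk` with a count of 32 or more gives all ones, that `shl` by 32 or more gives 0, and that `maccu H,L,x,y` adds the unsigned product x*y to the 64-bit value H:L.

```c
#include <stdint.h>

/* Assumed XS1 behaviour for large counts (see the lead-in). */
uint32_t shl32(uint32_t a, uint32_t n) { return n >= 32u ? 0u : a << n; }
uint32_t mkmsk(uint32_t n) { return n >= 32u ? 0xFFFFFFFFu : (1u << n) - 1u; }

/* Model of: mkmsk m,N ; shl H,H,N ; maccu H,L,m,L */
uint64_t dshl_maccu(uint32_t h, uint32_t l, uint32_t n)
{
    uint32_t m = mkmsk(n);                 /* m = 2**N - 1 */
    h = shl32(h, n);                       /* H = H * 2**N (low 32 bits) */
    /* (H:L) += m * L, i.e. L + (2**N - 1) * L = L * 2**N */
    return (((uint64_t)h << 32) | l) + (uint64_t)m * l;
}
```

In this model `dshl_maccu(0x12345678, 0x9ABCDEF0, 4)` gives `0x23456789ABCDEF00`, and any N above 32 behaves like N = 32.
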
richard
Respected Member
Posts: 318
Joined: Tue Dec 15, 2009 12:46 am

Post by richard »

segher wrote:
sub nn,bpw,N
shl H,H,N
shr t,L,nn
or H,H,t
shl L,L,N

This doesn't work if the shift amount is greater than 32.

segher wrote:
mkmsk m,N
shl H,H,N
maccu H,L,m,L

Very nice :D

Post by segher »

richard wrote:
segher wrote:
sub nn,bpw,N
shl H,H,N
shr t,L,nn
or H,H,t
shl L,L,N
This doesn't work if the shift amount is greater than 32.

True, but there is no good way to do that anyway. Well, actually it's easier
on xs1 than on most other cpus:

sub nn,bpw,N # this is not a single insn actually, no "sub from" on xs1
shl H,H,N
shr t,L,nn
or H,H,t
sub nn,N,bpw
shl t,L,nn
or H,H,t
shl L,L,N

(still better than the libcall, and works for all (unsigned) N).
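
The eight-step version can be checked the same way with a small C model. This is a sketch under stated assumptions: the helper names are invented, shifts by 32 or more are taken to yield 0, and the two `sub` steps are modelled as plain 32-bit wrap-around subtraction.

```c
#include <stdint.h>

#define BPW 32u  /* bits per word */

/* Assumed XS1 behaviour: a shift count of BPW or more gives 0. */
uint32_t shl32(uint32_t a, uint32_t n) { return n >= BPW ? 0u : a << n; }
uint32_t shr32(uint32_t a, uint32_t n) { return n >= BPW ? 0u : a >> n; }

/* Model of the longer sequence; intended to work for any unsigned n. */
uint64_t dshl_brute_any(uint32_t h, uint32_t l, uint32_t n)
{
    uint32_t nn, t;
    nn = BPW - n;        /* wraps to a huge count when n > 32 */
    h = shl32(h, n);
    t = shr32(l, nn);    /* bits of L that cross into H when n <= 32 */
    h |= t;
    nn = n - BPW;        /* wraps to a huge count when n < 32 */
    t = shl32(l, nn);    /* L << (n - 32) when n is in 32..63, else 0 */
    h |= t;
    l = shl32(l, n);
    return ((uint64_t)h << 32) | l;
}
```

For n = 32 both partial terms equal L, so OR-ing them twice is harmless; for n of 64 or more every term is 0.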

segher wrote:
mkmsk m,N
shl H,H,N
maccu H,L,m,L

This works for N in 0..bpw inclusive. Making it work for all N isn't trivial
unfortunately. A "difference or zero" insn or a "mask select" insn would
help, but those don't exist.

richard wrote:
Very nice :D

Glad you like it! Would you have more puzzles like this? :-)

Post by bearcat »

Good idea with the shl and MACCU. I will have to scratch my head on that sequence a little more to reason it out.

Making the mask prior to an IIR loop would work fine. Using a register based SHL would allow for different shifts.

For fixed-point scaling in an IIR, needing to shift more than 11 positions would be terrible. So that is not an issue in general.

I'll have to run some measurements to see if it's worth the extra instructions in an IIR, but two instructions is a lot better than I thought possible.

Thanks all.

Post by bearcat »

After looking at it some more, I think the contents of the new L register must be zero prior to the MACCU.

So more like:

mkmsk m,N
shl H,H,N
maccu H,0,m,L

where 0 is a register holding the value 0; that register is overwritten, so it cannot be reused.

If a register holding zero is available and the mkmsk is not counted, this may add only 1 instruction, since you must do a SHL anyway to shift the prior MAC H. But probably still 2, since the zero register is destroyed and will be needed on the next loop iteration.

Hmmm.... How about a "phantom" register that is always zero? Would make applying gains 2 instructions instead of 4, or save two instructions on the initial MACs.

Thinking of the above, has anyone been able to use the "LSUB 0" trick (like lsub (h), (l), (h), (h), (h)) to zero two registers using inline assembly? Every combination I have tried ends up with the XCC compiler adding three "LDC 0"s right before the instruction. I did not try using multiple asm instructions in a single inline assembly line. Will try that.

I have also learned that XCC can add instructions in between your inline assembly statements, so if you are doing any branching, you need to use a label (best practice anyway).

Post by segher »

bearcat wrote:
After looking at it some more. The contents of new L register must be zero prior to the MACCU.

No, it's correct as I wrote it.

We want to compute (H:L)*2**N
= (H:0)*2**N + (0:L)*2**N
= (H*2**N:0) + (0:L) + (0:L)*(2**N - 1)
= (H*2**N:L) + (0:L)*(2**N - 1)

bearcat wrote:
Hmmm.... How about a "phantom" register that is always zero?

I use a register like that in my Forth system, since some important primitives
need it (byte accesses, some comparisons against 0, indeterminate counted
loops); and I don't do register allocation yet, so there are enough registers.
When you _do_ do register allocation, you don't want to sacrifice one to always
be zero; 12 registers is not a lot.

bearcat wrote:
Thinking of the above, has anyone been able to use the "LSUB 0" trick (like lsub (h), (l), (h), (h), (h)) to zero two registers using inline assembly? Every combination I have tried ends up with the XCC compiler adding three "LDC 0"s right before the instruction????? I did not try using multiple asm instructions in a single inline assembly line. Will try that.

asm("lsub %0,%1,%0,%0,%0" : "=r"(h), "=r"(l));
should do the trick. This doesn't do what you want though; you'll need
asm("lsub %0,%1,%0,%0,%2" : "=r"(h), "=r"(l) : "r"(0));
which needs a zero in a reg already (only the low bit matters -- you can be
tricky and use some pointer you already have live).
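
To illustrate why the all-same-register form is not enough, here is a small host-side C model of lsub. It is a sketch resting on an assumption about the instruction: that `lsub d,e,x,y,v` computes x - y - (v & 1), writes the 32-bit result to d and the borrow out to e. The struct and function names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint32_t d;  /* result word (assumed first operand)  */
    uint32_t e;  /* borrow out  (assumed second operand) */
} lsub_result;

/* Assumed model of "lsub d,e,x,y,v"; only the low bit of v is used. */
lsub_result lsub(uint32_t x, uint32_t y, uint32_t v)
{
    /* 64-bit arithmetic so a borrow shows up in the top bit. */
    uint64_t wide = (uint64_t)x - y - (v & 1u);
    lsub_result r;
    r.d = (uint32_t)wide;
    r.e = (uint32_t)(wide >> 63);
    return r;
}
```

In this model `lsub(h, h, h)` gives zero in both outputs only when h is even; with an odd h the borrow-in makes the result all ones and sets the borrow out, which is why the fifth operand has to be a value whose low bit is known to be 0.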

bearcat wrote:
I have also learned that XCC can add instructions inbetween your inline assembly statements

Write your asm as a single asm() if you don't want that.

bearcat wrote:
so if you are doing any branching, need to use a label (best practice anyways).

It isn't permitted to jump out of an asm() at all.
Post by bearcat »

After scratching my head some more, you are correct. The maccu is in fact:

maccu H,L,m,L

Great, this adds only 1 extra instruction after all (assuming the mask generation isn't included).

The lsub syntax I had used was a little more wordy, but similar (used %0-%4). I will try this again, but different combinations always added 3 "ldc 0" instructions earlier.

I have used:
asm("label:"::);
....
asm("bt %0, label" :: "r" (loop));
Compiles and runs with no errors. So I do not think there is checking of branching outside of a single asm.

I was wrong in thinking a "phantom" register of zero would reduce a gain calculation. It would not.

Thanks for the help.

Post by segher »

bearcat wrote:
The lsub syntax I had used was a little more wordy, but similar (used %0-%4). I will try this again, but different combinations always added 3 "ldc 0" instructions earlier.

You tell the compiler it needs to put 0 in three different registers. The compiler
puts 0 in three registers. Sounds just as expected to me!

You need to tell the compiler when it should use the same register in multiple
places.

bearcat wrote:
I have used:
asm("label:"::);
....
asm("bt %0, label" :: "r" (loop));
Compiles and runs with no errors.

That is pure luck. The compiler is for example free to put a stack adjustment
between those two asm()s. Or swap the order of the two asm()s.

bearcat wrote:
So I do not think there is checking of branching outside of a single asm.

The compiler does not check such branches _at all_, it knows nothing about the content
of your asm()s; as far as the compiler is concerned, it is some black box that it can
optimise just like anything else, taking into account the constraints you gave and the
(data flow) dependencies those result in.

Post by segher »

segher wrote:
Here's a neat 3-insn sequence (2 insns if N is fixed, in a loop for example):

mkmsk m,N
shl H,H,N
maccu H,L,m,L

This of course is for N=0..32 only; for N>32 it computes the same thing as
when N=32. To work for _all_ N, do this:

shr N2,N,1
sub N,N,N2
mkmsk m,N2
shl H,H,N2
maccu H,L,m,L
mkmsk m,N
shl H,H,N
maccu H,L,m,L
You can put N2 in the same reg as m by reordering a bit:

shr m,N,1
sub N,N,m
shl H,H,m
mkmsk m,m
maccu H,L,m,L
shl H,H,N
mkmsk m,N
maccu H,L,m,L
Any improvements to that? :-)
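
For anyone wanting to experiment, the reordered eight-instruction sequence can be modelled on a host in C. This is a sketch only: the helper names are invented, and it assumes `shl` by 32 or more gives 0, `mkmsk` with 32 or more gives all ones, and `maccu H,L,x,y` adds the unsigned product x*y to the 64-bit value H:L.

```c
#include <stdint.h>

/* Assumed XS1 behaviour for large counts (see the lead-in). */
uint32_t shl32(uint32_t a, uint32_t n) { return n >= 32u ? 0u : a << n; }
uint32_t mkmsk(uint32_t n) { return n >= 32u ? 0xFFFFFFFFu : (1u << n) - 1u; }

/* Model of maccu: returns (h:l) + x*y as one 64-bit value. */
uint64_t maccu(uint32_t h, uint32_t l, uint32_t x, uint32_t y)
{
    return (((uint64_t)h << 32) | l) + (uint64_t)x * y;
}

/* Two half-shifts, each at most 32 once n <= 64; larger n clamps to 64. */
uint64_t dshl_any(uint32_t h, uint32_t l, uint32_t n)
{
    uint32_t m;
    uint64_t acc;
    m = n >> 1;                    /* shr m,N,1     */
    n = n - m;                     /* sub N,N,m     */
    h = shl32(h, m);               /* shl H,H,m     */
    m = mkmsk(m);                  /* mkmsk m,m     */
    acc = maccu(h, l, m, l);       /* maccu H,L,m,L */
    h = (uint32_t)(acc >> 32);
    l = (uint32_t)acc;
    h = shl32(h, n);               /* shl H,H,N     */
    m = mkmsk(n);                  /* mkmsk m,N     */
    return maccu(h, l, m, l);      /* maccu H,L,m,L */
}
```

In this model `dshl_any(0, 1, 63)` gives `0x8000000000000000`, and any shift of 64 or more gives 0.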