code alignment to 16 bytes
errsuberlin
Junior Member
Posts: 4
Joined: Mon Jun 30, 2014 10:22 pm

Post by errsuberlin »

Hello,

In the lib_xcore_math XS3 assembly code I see numerous ".align 16" statements, mostly for jump targets, like so:

Code:

    {   ldaw r11, sp[STACK_VEC_A_SHR]           ;   bf len, .L_loop_bot                     }
    {                                           ;   bu .L_loop_top                          }

.align 16
.L_loop_top:
        {   sub len, len, 1                         ;   vclrdr                                  }
        {                                           ;   vlmacc b[0]                             }
        {   add b, b, _32                           ;   vlsat r11[0]                            }
        {   add a, a, _32                           ;   vstr a[0]                               }
        {                                           ;   bt len, .L_loop_top                     }
.L_loop_bot:

    {   mkmsk tail, tail                        ;   bf tail, .L_finish                      }

I wonder why this is done. I see no such restriction in the XS3 architecture doc, and forcing the code at .L_loop_top onto a 4-byte-aligned but not 16-byte-aligned address (like 0x80414) does not cause any obvious errors, exceptions or the like.

Should I keep these .align 16 statements when writing my own assembly code? After all, the "bu .L_loop_top" seems an unnecessary statement if 4-byte alignment is sufficient.

Thanks,
Ralf
fabriceo
Respected Member
Posts: 257
Joined: Mon Jan 08, 2018 4:14 pm

Post by fabriceo »

Hello,
the goal here is to minimize the "fetch no-ops" (FNOPs), as each one adds 1 thread cycle when the instruction pipeline is empty.

I've also used this, successfully to some extent. The "bu .L_loop_top" is used to jump to the 16-byte-aligned location. This takes 1 instruction, but the pipeline is then systematically loaded with 16 bytes x 8 bits = 128 bits, which is the memory width (see the ISA).
The instructions from .L_loop_top onward can then be executed without FNOPs.

You can test this yourself by writing a function in pure assembly, using the gettime instruction at the beginning and computing the difference at the end. With various alignments and 1, 2, 3 or 4 successive instructions like LDW, you will eventually capture FNOPs! If you insert some non-memory instructions (using only registers) in between, you will remove the FNOPs.

Some XMOS contributor will certainly bring more clarity on this. Hope this helps.
fabriceo

Post by errsuberlin »

Ah, cool, will test, thank you!
upav
Verified
Active Member
Posts: 32
Joined: Wed May 22, 2024 3:30 pm

Post by upav »

You can also see FNOPs in xsim processor traces (enable them with xsim --trace-to your.file --enable-fnop-tracing). You can use this to evaluate FNOPs in your assembly (or C).
Personally, I go down to optimising for FNOPs only in super-tight cases (DSP loops, USB...); for the most part, writing the API in assembly is enough of an optimisation.

Sorry, the instruction buffer and alignments are only vaguely touched on by the ISA guide. If you have any further questions, don't hesitate to ask here ;)
Pavel
xmos software engineer

Post by errsuberlin »

I've tested with vect_s16_scale on a 16 KB buffer and evidently could not make the instruction buffer run empty, i.e. there is no significant difference in the time spent in the function, whatever address the branch target is placed at.
But I can see the instruction loading description in the ISA, and I understand that the .align 16 might avoid FNOPs depending on the actual number and kind of instructions in the loop.

So, thanks again for your quick answer, it allows me to go on and spend no more time on the topic.

PS. After seeing the last post, I enabled the FNOP trace and indeed there are none in the vect_s16_scale loop. A cool analysis method. I have tight code to tune from time to time, so that will help. It also helps to understand in which situations the instruction buffer runs empty.

Post by fabriceo »

FNOP hackathon
I'm sharing here an example of an optimized opcode dispatcher which benefits from the 16-byte alignment.
It also uses a retsp instead of a branch instruction to return to the loop entry point.
This makes it possible to bundle another R instruction with it without the risk of limiting the loop distance to 64 bytes; here it is a simple add instruction,
which moves the DSP execution pointer to the next opcode.

A simple NOP opcode takes 4 instructions as a bare minimum (3 for dispatch, 1 for return and increment).
The encoding used here is 1 word per opcode: the 16 LSBs contain a skip value (in bytes) to account for potential parameters of the opcode,
and the 16 MSBs contain the opcode itself. A BRU is used to transfer control to a jump table (4 bytes per jump).

This approach is used for creating DSP programs made of a list of opcodes, and is almost as performant as a hard-coded list of call/return pairs.

Code:

	.align	16
dspRuntimeExec:
	{ gettime r11			; dualentsp stack }
	{ mov ptr,r0  			; sub r11,r11,6 }
	stw r11,dp[timestart]
	{ ldc r0,0 			; bl dspRuntimeExec_load_opcode }		// LR is initialized once for all 
//this is aligned 16 byte here, due to above 4 dual issues instructions
dspRuntimeExec_load_opcode:
	{ nop			; ldw r11,ptr[0] }
	{ shr r2,r11,16 		; zext r11,16 }
	{ nop 			; bru r2 }

	//table of 32 bits jumps (using BRFU_lu6 to force 32bits encoding)
	{ gettime r0 		; retsp stack }			//0 dsp_END_OF_CODE
	{ add ptr,ptr,r11	; retsp 0 }				//1 dsp_NOP
	BRFU_lu6 	dsp_LOAD_PARAM1				//2 loading accu with stored value (forward branch, matching the BRFU_lu6 note above)
	...
dsp_LOAD_PARAM1:							//takes 6 core instruction in total to load "accu" with a value from the program
	ldw r0,ptr[1]
	{ add ptr,ptr,r11	; retsp 0 }
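
For readers who want to generate or check such a program on the host side, here is a minimal C sketch of the word layout described above (opcode in the upper 16 bits, skip in bytes in the lower 16 bits). The names dsp_op_t, dsp_encode, dsp_decode and dsp_next are made up for this illustration and are not part of the original code; the sketch also assumes the skip value is always a multiple of 4 so the pointer stays word-aligned.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side view of one program word:
   bits 31..16 = opcode, bits 15..0 = skip value in bytes. */
typedef struct {
    uint16_t opcode;
    uint16_t skip_bytes;
} dsp_op_t;

/* Pack an opcode and its skip value into one 32-bit program word. */
static inline uint32_t dsp_encode(uint16_t opcode, uint16_t skip_bytes) {
    return ((uint32_t)opcode << 16) | skip_bytes;
}

/* Split a program word the same way the dispatcher does. */
static inline dsp_op_t dsp_decode(uint32_t word) {
    dsp_op_t op;
    op.opcode = (uint16_t)(word >> 16);          /* mirrors: shr r2,r11,16 */
    op.skip_bytes = (uint16_t)(word & 0xFFFFu);  /* mirrors: zext r11,16   */
    return op;
}

/* Advance to the next opcode, as "add ptr,ptr,r11" does in the assembly. */
static inline const uint32_t *dsp_next(const uint32_t *ptr) {
    dsp_op_t op = dsp_decode(*ptr);
    assert((op.skip_bytes & 3u) == 0);  /* assumption: skip keeps word alignment */
    return (const uint32_t *)((const uint8_t *)ptr + op.skip_bytes);
}
```

With this layout, an opcode followed by one parameter word would carry a skip of 8, and a parameterless opcode such as the NOP a skip of 4 (both values are assumptions derived from the description, not taken from the original program).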
	

Post by fabriceo »

Hi guys,
I came to some results after testing on an XU216 (dual issue); please correct me if I'm wrong:

It seems that when the CPU sees a "resource" instruction which doesn't access memory to do a load/store, it uses the opportunity to load the pipeline with 16 instructions (= 32 bytes = 8 dual-issue instructions or 8 prefixed instructions), starting at the address of this instruction rounded down to the lower 16-byte alignment, due to the 128-bit memory width.

So the following code executed in only 8 CPU instructions:

Code:

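.align 16
entry:
or r0,r0,r0			// "resource" instruction, no memory access
ldd r1,r2,sp[0]		// 6 successive loads ("x 6" in shorthand)
ldd r1,r2,sp[0]
ldd r1,r2,sp[0]
ldd r1,r2,sp[0]
ldd r1,r2,sp[0]
ldd r1,r2,sp[0]
retsp 0

It also seems that the branch instruction is less powerful and only loads 16 bytes at the destination address, also rounded down to the lower 16-byte alignment.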

As a consequence:
When a destination starts with 2 load/store instructions, it is advised to set this target as .align 8; otherwise there is a risk that the first one sits at the end of a 16-byte chunk, and a FNOP would then be required just after it, before the second one.

When a destination starts with 3 load/store instructions, it is required to set this target as .align 16; otherwise there is a risk that they straddle a 16-byte chunk boundary, and a FNOP would then be required in between the 3 load/stores.

When a destination starts with 4, 5 or 6 load/store instructions, it is required to set this target as .align 16 and to put a "resource" instruction first, which will load the full pipeline with up to 6 load/stores without a FNOP.
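
The chunk arithmetic behind these rules can be sketched in a few lines of C. This is only a model of "does a run of instructions cross a 16-byte boundary", not a model of the real fetch timing; it assumes every instruction is a 4-byte encoding, and the names chunk_base and fits_in_one_chunk are invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FETCH_CHUNK_BYTES 16u  /* 128-bit memory width, as discussed above */
#define INSTR_BYTES 4u         /* assumption: every instruction is 32 bits */

/* Address rounded down to the start of its 16-byte chunk. */
static inline uint32_t chunk_base(uint32_t addr) {
    return addr & ~(FETCH_CHUNK_BYTES - 1u);
}

/* True if n consecutive instructions starting at addr all sit
   inside the same 16-byte chunk (no boundary crossing). */
static inline bool fits_in_one_chunk(uint32_t addr, unsigned n) {
    assert(n > 0);
    uint32_t last_byte = addr + n * INSTR_BYTES - 1u;
    return chunk_base(addr) == chunk_base(last_byte);
}
```

For the address from the first post, chunk_base(0x80414) gives 0x80410. Two instructions at an 8-byte-aligned address such as 0x80418 stay in one chunk, while the same two at 0x8041C cross into the next one, which is the case the .align 8 advice is meant to avoid.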


for feedback and thoughts
Last edited by fabriceo on Mon Jun 30, 2025 4:38 pm, edited 2 times in total.

Post by fabriceo »

In addition, the retsp X instruction always requires a FNOP to preload the pipeline with the instructions at the return address when the stack size X is not 0.
But when it is 0, retsp 0 behaves like the branch instruction explained above. So, for the same reason as branching to an .align 16 target, it is recommended to place the bl call instruction at the end of a 16-byte chunk, to maximise the pipeline reload by the retsp instruction; the same goes for retsp X.
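
As a rough illustration of the "bl at the end of a 16-byte chunk" advice, the C sketch below computes how many padding bytes would have to precede a call so that its return address lands on a 16-byte boundary. The function name padding_before_call is hypothetical, and the size of the call encoding is passed in rather than assumed.

```c
#include <assert.h>
#include <stdint.h>

#define FETCH_CHUNK_BYTES 16u  /* 128-bit memory width */

/* Padding (in bytes) to insert before a call located at call_addr so that
   the call ends exactly on a 16-byte boundary, i.e. the return address
   is the first byte of a new chunk. */
static inline uint32_t padding_before_call(uint32_t call_addr, uint32_t call_bytes) {
    assert(call_bytes > 0 && call_bytes <= FETCH_CHUNK_BYTES);
    uint32_t ret_addr = call_addr + call_bytes;
    return (FETCH_CHUNK_BYTES - (ret_addr % FETCH_CHUNK_BYTES)) % FETCH_CHUNK_BYTES;
}
```

In the dispatcher posted earlier the same effect is obtained without padding: the four 4-byte instructions before dspRuntimeExec_load_opcode fill one aligned chunk, so the bl is its last instruction.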

Also, it might be better in source code to use (or force) a "resource" instruction instead of letting the CPU issue a FNOP, because a FNOP loads only 128 bits (8 instructions) instead of 2 x 128 bits (please confirm).