code alignment to 16 bytes

Technical questions regarding the XTC tools and programming with XMOS.
errsuberlin
Junior Member
Posts: 4
Joined: Mon Jun 30, 2014 10:22 pm

code alignment to 16 bytes

Post by errsuberlin »

Hello,

In the lib_xcore_math XS3 assembly code I see numerous ".align 16" directives, mostly at jump targets, like so:

Code: Select all

    {   ldaw r11, sp[STACK_VEC_A_SHR]           ;   bf len, .L_loop_bot                     }
    {                                           ;   bu .L_loop_top                          }

.align 16
.L_loop_top:
        {   sub len, len, 1                         ;   vclrdr                                  }
        {                                           ;   vlmacc b[0]                             }
        {   add b, b, _32                           ;   vlsat r11[0]                            }
        {   add a, a, _32                           ;   vstr a[0]                               }
        {                                           ;   bt len, .L_loop_top                     }
.L_loop_bot:

    {   mkmsk tail, tail                        ;   bf tail, .L_finish                      }
I wonder why this is done. I see no such restriction in the XS3 architecture doc, and forcing the code at .L_loop_top onto a 4-byte aligned but 16-byte non-aligned address (like 0x80414) does not induce any obvious errors, exceptions or the like.

Should I keep these .align 16 directives when writing my own assembly code? After all, the "bu .L_loop_top" seems to be an unnecessary instruction if 4-byte alignment is sufficient.

Thanks,
Ralf
fabriceo
XCore Addict
Posts: 254
Joined: Mon Jan 08, 2018 4:14 pm

Post by fabriceo »

Hello,
the goal here is to minimize the "fetch no-ops" (FNOPs), as each of these adds 1 thread cycle when the instruction buffer is empty.

I've used this too, successfully to some extent. The "bu .L_loop_top" is used to jump to the 16-byte aligned location; this costs 1 instruction, but the instruction buffer is then systematically loaded with 16 bytes × 8 bits = 128 bits, which is the memory width (see the ISA).
The instructions from .L_loop_top onwards can then be executed without FNOPs.

You can test this yourself by writing a function in pure assembly, using the gettime instruction at the beginning and computing the difference at the end, as in the sketch below. With various alignments and 1, 2, 3 or 4 successive memory instructions like LDW you will eventually capture FNOPs! If you insert some non-memory instructions (using only registers) in between, you will remove the FNOPs.
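
A minimal sketch, not taken from any library: the function name time_probe, the .issue_mode directive and the instruction mix are just my illustration. Call it from C as int time_probe(int *p) and compare the returned tick counts while moving or removing the .align:

Code: Select all

    .text
    .issue_mode single
    .globl time_probe
    .align 16                       // move or drop this to compare alignments
time_probe:
    entsp 0
    gettime r1                      // timestamp at entry
    ldw r11, r0[0]                  // a few successive memory instructions:
    ldw r11, r0[1]                  // this is where FNOPs would show up
    ldw r11, r0[2]                  // when the instruction buffer runs empty
    ldw r11, r0[3]
    gettime r2                      // timestamp at exit
    sub r0, r2, r1                  // return the elapsed ticks
    retsp 0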

Certainly some XMOS contributor will bring more clarity on this. Hope this helps,
fabriceo
errsuberlin
Junior Member
Posts: 4
Joined: Mon Jun 30, 2014 10:22 pm

Post by errsuberlin »

Ah, cool, will test, thank you!
upav
Verified
Member++
Posts: 30
Joined: Wed May 22, 2024 3:30 pm

Post by upav »

You can also see FNOPs in xsim processor traces (trace with xsim --trace-to your.file --enable-fnop-tracing). You can use this to evaluate FNOPs in your assembly (or C); see the example invocation below.
Personally, I go down to optimising for FNOPs only in super-tight cases (DSP loops, USB...); for the most part, writing the API in assembly is enough of an optimisation.
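
For instance, something along these lines (my_app.xe and trace.txt are just placeholder names, and the grep assumes the fetch no-ops appear literally as "fnop" entries in the trace):

Code: Select all

xsim --trace-to trace.txt --enable-fnop-tracing my_app.xe
grep -ic fnop trace.txt     # rough count of fetch no-ops in the run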

Sorry, the instruction buffer and alignment are only vaguely touched on by the ISA guide; if you have any further questions, don't hesitate to ask here ;)
Pavel
xmos software engineer
errsuberlin
Junior Member
Posts: 4
Joined: Mon Jun 30, 2014 10:22 pm

Post by errsuberlin »

I've tested with vect_s16_scale on a 16 kbyte buffer, and could obviously not provoke any instruction fetch penalty, i.e. there is no significant difference in the time spent in the function, whatever address the branch target is placed at.
But I can see the instruction loading description in the ISA, and I understand that the .align 16 might avoid FNOPs depending on the actual number and kind of instructions in the loop.

So, thanks again for your quick answer; it allows me to move on and spend no more time on the topic.

PS. After seeing the last post, I enabled the FNOP trace and indeed there are none in the vect_s16_scale loop. Cool analysis method, indeed. I have tight code to tune from time to time, so that will help. It also helps to understand in which situations the instruction buffer runs empty.
fabriceo
XCore Addict
Posts: 254
Joined: Mon Jan 08, 2018 4:14 pm

Post by fabriceo »

FNOP hackathon
I'm sharing here an example of an optimized opcode dispatcher which benefits from the 16-byte alignment.
It also uses a retsp instead of a branch instruction to return to the loop entry point.
This makes it possible to bundle another R instruction with it without the risk of limiting the loop distance to 64 bytes; here it is a simple add instruction,
which moves the dsp execution pointer to the next opcode.

A simple NOP opcode takes 4 instructions as a bare minimum (3 for dispatch, 1 for return and increment).
The encoding used here is 1 word for each opcode: the 16 lsb contain a skip value (in number of bytes) to take care of potential parameters for the opcode,
and the 16 msb contain the opcode itself. A BRU is used to transfer control to a jump table (4 bytes per jump entry).

This approach is used for creating dsp programs made of a list of opcodes and is almost as performant as a hard-coded list of calls/returns.

Code: Select all

	.align	16
dspRuntimeExec:
	{ gettime r11			; dualentsp stack }		// start timestamp, enter dual-issue mode
	{ mov ptr,r0  			; sub r11,r11,6 }		// ptr = opcode list (r0), adjust the start timestamp
	stw r11,dp[timestart]						// save the adjusted start time
	{ ldc r0,0 			; bl dspRuntimeExec_load_opcode }		// LR is initialized once for all 
//this is 16-byte aligned here, due to the above 4 dual-issue instructions
dspRuntimeExec_load_opcode:
	{ nop			; ldw r11,ptr[0] }			// fetch the next opcode word
	{ shr r2,r11,16 		; zext r11,16 }			// r2 = opcode (16 msb), r11 = skip in bytes (16 lsb)
	{ nop 			; bru r2 }				// dispatch into the jump table below

	//table of 32 bits jumps (using BRFU_lu6 to force 32bits encoding)
	{ gettime r0 		; retsp stack }			//0 dsp_END_OF_CODE
	{ add ptr,ptr,r11	; retsp 0 }				//1 dsp_NOP
	BRFU_lu6 	dsp_LOAD_PARAM1				//2 loading accu with stored value
	...
dsp_LOAD_PARAM1:							//takes 6 core instructions in total to load "accu" with a value from the program
	ldw r0,ptr[1]
	{ add ptr,ptr,r11	; retsp 0 }