code alignment to 16 bytes

errsuberlin
Junior Member
Posts: 4
Joined: Mon Jun 30, 2014 10:22 pm

Post by errsuberlin »

Hello,

in the lib_xcore_math XS3 assembly code I see numerous ".align 16" statements, mostly for jump targets, like so:

    {   ldaw r11, sp[STACK_VEC_A_SHR]           ;   bf len, .L_loop_bot                     }
    {                                           ;   bu .L_loop_top                          }

.align 16
.L_loop_top:
        {   sub len, len, 1                         ;   vclrdr                                  }
        {                                           ;   vlmacc b[0]                             }
        {   add b, b, _32                           ;   vlsat r11[0]                            }
        {   add a, a, _32                           ;   vstr a[0]                               }
        {                                           ;   bt len, .L_loop_top                     }
.L_loop_bot:

    {   mkmsk tail, tail                        ;   bf tail, .L_finish                      }

I wonder why this is done. I see no such restriction in the XS3 architecture doc, and forcing the code at .L_loop_top onto a 4-byte-aligned but not 16-byte-aligned address (like 0x80414) does not cause any obvious errors, exceptions or the like.

Should I keep these .align 16 statements when writing my own assembly code? After all, the "bu .L_loop_top" seems an unnecessary statement if 4-byte alignment is sufficient.

Thanks,
Ralf
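
For reference, the alignment arithmetic behind the 0x80414 example can be checked with a few lines of plain C. This is only an illustrative sketch: the helper names `is_aligned` and `align_up` are made up for this example and are not part of lib_xcore_math or the XTC tools.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: true if addr is a multiple of align.
 * align must be a non-zero power of two (4, 16, ...). */
static int is_aligned(uint32_t addr, uint32_t align)
{
    assert(align != 0u && (align & (align - 1u)) == 0u);
    return (addr & (align - 1u)) == 0u;
}

/* Hypothetical helper: round addr up to the next multiple of align.
 * Returns addr unchanged if it is already aligned. */
static uint32_t align_up(uint32_t addr, uint32_t align)
{
    assert(align != 0u && (align & (align - 1u)) == 0u);
    return (addr + align - 1u) & ~(align - 1u);
}
```

With these, `is_aligned(0x80414, 4)` is true and `is_aligned(0x80414, 16)` is false, while `align_up(0x80414, 16)` gives 0x80420, so the next 16-byte boundary is 12 bytes further on.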
fabriceo
XCore Addict
Posts: 253
Joined: Mon Jan 08, 2018 4:14 pm

Post by fabriceo »

Hello,
the goal here is to minimize the "fetch no-ops" (FNOPs), as each one adds 1 thread cycle when the instruction pipeline is empty.

I have also used this, successfully to some extent. The "bu .L_loop_top" is there to jump to the 16-byte-aligned location. It costs 1 instruction, but the pipeline is then systematically loaded with 16 bytes x 8 bits = 128 bits, which is the memory width (see the ISA).
The instructions from .L_loop_top onward can then be executed without FNOPs.
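
One rough way to picture the effect of alignment is to count how many 128-bit memory lines a loop body straddles. The sketch below is plain C and only does that counting: it does not model the XS3 instruction buffer or predict FNOPs, and the function name `fetch_lines_spanned` and any loop sizes passed to it are hypothetical (the thread does not state the encoded size of the loop).

```c
#include <assert.h>
#include <stdint.h>

#define FETCH_BYTES 16u /* 16 bytes x 8 bits = 128-bit memory width */

/* Hypothetical helper: number of 16-byte memory lines touched by code
 * occupying the byte range [start, start + size_bytes).
 * This only counts lines; it says nothing about actual FNOP behaviour. */
static uint32_t fetch_lines_spanned(uint32_t start, uint32_t size_bytes)
{
    assert(size_bytes > 0u);
    uint32_t first = start / FETCH_BYTES;
    uint32_t last = (start + size_bytes - 1u) / FETCH_BYTES;
    return last - first + 1u;
}
```

For an assumed 16-byte loop body, `fetch_lines_spanned(0x80420, 16)` is 1, whereas the same body placed at 0x80414 gives 2, because it crosses the boundary at 0x80420.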

You can test this yourself by writing a function in pure assembly, using the gettime instruction at the beginning and at the end and computing the difference. With various alignments and 1, 2, 3 or 4 successive instructions like LDW you will eventually capture FNOPs! If you insert some non-memory instructions (using only registers) in between, you will remove the FNOPs.
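
When comparing two such timestamps, the difference is best taken in unsigned 32-bit arithmetic so that a timer wrap between the two readings does not matter. A minimal sketch in plain C, assuming the two values come from gettime as 32-bit counts and that the reference clock runs at 100 MHz (an assumption; check your own clock configuration). The names `elapsed_ticks`, `ticks_to_ns` and `REF_CLOCK_HZ` are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define REF_CLOCK_HZ 100000000u /* ASSUMPTION: 100 MHz reference clock */

static_assert(REF_CLOCK_HZ > 0u, "reference clock must be non-zero");

/* Ticks between two 32-bit timestamps. Unsigned subtraction is
 * modulo 2^32, so a single counter wrap between start and end is fine. */
static uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert reference-clock ticks to nanoseconds (10 ns per tick at the
 * assumed 100 MHz). The multiplication is done in 64 bits. */
static uint64_t ticks_to_ns(uint32_t ticks)
{
    return (uint64_t)ticks * 1000000000u / REF_CLOCK_HZ;
}
```

For example, a start value of 0xFFFFFFF0 and an end value of 0x00000010 give 0x20 (32) ticks, which is 320 ns under the 100 MHz assumption. Differences of one or two ticks between alignments are what you would be looking for, so it helps to time many loop iterations rather than a single pass.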

I am sure an XMOS contributor will bring more clarity on this. Hope this helps.
fabriceo
errsuberlin
Junior Member
Posts: 4
Joined: Mon Jun 30, 2014 10:22 pm

Post by errsuberlin »

Ah, cool, will test, thank you!
upav
Verified
Member++
Posts: 30
Joined: Wed May 22, 2024 3:30 pm

Post by upav »

You can also see FNOPs in xsim processor traces (you can trace with xsim --trace-to your.file --enable-fnop-tracing). You can use this to evaluate FNOPs in your assembly (or C).
Personally, I go down to optimising for FNOPs only in super-tight cases (DSP loops, USB...). For the most part, writing the API in assembly is enough of an optimisation.

Sorry, the instruction buffer and alignment are only vaguely touched on by the ISA guide. If you have any further questions, don't hesitate to ask here ;)
Pavel
xmos software engineer
errsuberlin
Junior Member
Posts: 4
Joined: Mon Jun 30, 2014 10:22 pm

Post by errsuberlin »

I've tested with a vect_s16_scale on a 16 kbyte buffer and evidently could not starve the instruction buffer, i.e. there is no significant difference in the time spent in the function, whatever address the branch target is put on.
But I can see the instruction loading description in the ISA, and I understand that the .align 16 might avoid FNOPs depending on the actual number and kind of instructions in the loop.

So, thanks again for your quick answer, it allows me to go on and spend no more time on the topic.

PS. After seeing the last post, I enabled the FNOP trace and indeed there are none in the vect_s16_scale loop. Cool analysis method, indeed. I have tight code to tune from time to time, so that will help. It also helps to understand in which situations the instruction buffer runs empty.