code alignment to 16 bytes

Technical questions regarding the XTC tools and programming with XMOS.
errsuberlin
New User
Posts: 3
Joined: Mon Jun 30, 2014 10:22 pm

Post by errsuberlin »

Hello,

In the lib_xcore_math XS3 assembly code I see numerous ".align 16" directives, mostly before jump targets, like so:

    {   ldaw r11, sp[STACK_VEC_A_SHR]           ;   bf len, .L_loop_bot                     }
    {                                           ;   bu .L_loop_top                          }

.align 16
.L_loop_top:
        {   sub len, len, 1                         ;   vclrdr                                  }
        {                                           ;   vlmacc b[0]                             }
        {   add b, b, _32                           ;   vlsat r11[0]                            }
        {   add a, a, _32                           ;   vstr a[0]                               }
        {                                           ;   bt len, .L_loop_top                     }
.L_loop_bot:

    {   mkmsk tail, tail                        ;   bf tail, .L_finish                      }

I wonder why this is done. I see no such requirement in the XS3 architecture doc, and forcing the code at .L_loop_top onto an address that is 4-byte aligned but not 16-byte aligned (like 0x80414) does not cause any obvious errors, exceptions or the like.

Should I keep these ".align 16" directives when writing my own assembly code? After all, the "bu .L_loop_top" seems unnecessary if 4-byte alignment is sufficient.

Thanks,
Ralf
fabriceo
XCore Addict
Posts: 253
Joined: Mon Jan 08, 2018 4:14 pm

Post by fabriceo »

Hello,
the goal here is to minimize "fetch no-ops" (FNOPs), as each one adds 1 thread cycle when the instruction pipeline is empty.

I've also used this, successfully to some extent. The "bu .L_loop_top" is used to jump to the 16-byte aligned location. This takes 1 instruction, but then the pipeline is systematically loaded with 16 bytes x 8 bits = 128 bits, which is the memory width (see ISA).
The instructions from .L_loop_top onward can then be executed without FNOPs.
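
To make the alignment arithmetic concrete, here is a small host-side C sketch (not xcore code). It assumes the 128-bit (16-byte) fetch width mentioned above and simply counts how many 16-byte lines a block of code overlaps. The helper names are made up for illustration, and this is only a rough model: it says nothing about real cycle counts or when the core actually issues an FNOP.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fetch width in bytes (128 bits, as stated in the post above). */
#define FETCH_BYTES 16u

static_assert((FETCH_BYTES & (FETCH_BYTES - 1u)) == 0u,
              "FETCH_BYTES must be a power of two");

/* Returns nonzero if addr lies on a 16-byte boundary. */
static inline int is_fetch_aligned(uint32_t addr)
{
    return (addr % FETCH_BYTES) == 0u;
}

/* Padding bytes an ".align 16" would need to insert in front of addr. */
static inline uint32_t align_padding(uint32_t addr)
{
    return (FETCH_BYTES - addr % FETCH_BYTES) % FETCH_BYTES;
}

/* Number of 16-byte lines overlapped by code occupying [addr, addr + len). */
static inline uint32_t fetch_lines(uint32_t addr, uint32_t len)
{
    if (len == 0u) {
        return 0u;
    }
    uint32_t first = addr / FETCH_BYTES;
    uint32_t last = (addr + len - 1u) / FETCH_BYTES;
    return last - first + 1u;
}
```

For a hypothetical 32-byte loop body, fetch_lines(0x80420, 32) gives 2 lines, while the same body at 0x80414 (the address from the question) gives 3, and align_padding(0x80414) is 12. In this simplified model the aligned loop needs one fetch fewer per iteration, which is the effect the ".align 16" is after.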

You can test this yourself by writing a function in pure assembly, using the gettime instruction at the beginning and computing the difference at the end. With various alignments and 1, 2, 3, or 4 successive instructions like LDW, you will eventually capture FNOPs! If you insert some non-memory instructions (using only registers) in between, you will remove the FNOPs.

Some XMOS contributor will certainly bring more clarity on this. Hope this helps.
fabriceo
errsuberlin
New User
Posts: 3
Joined: Mon Jun 30, 2014 10:22 pm

Post by errsuberlin »

Ah, cool, will test, thank you!