Dual issue mode / 64 bit memory alignment

dsteinwe
XCore Addict
Posts: 144
Joined: Wed Jun 29, 2016 8:59 am


Post by dsteinwe »

Dual issue mode is documented in great detail in "xCORE-200: The XMOS XS2 Architecture" (Chapter '5.2 Single and Dual Issue', p. 11):
An XS2 has two lanes: the memory lane can execute all memory instructions,
branches, and basic arithmetic, and the resource lane can execute all resource
instructions and basic arithmetic. Each thread can choose to execute in dual issue
mode, in which case the processor will execute two 16-bit instructions or a single
32-bit instruction in a single thread cycle. In dual issue mode, all instructions
must be aligned: 32-bit instructions must be 32-bit aligned and pairs of 16-bit
instructions must be aligned on a 32-bit boundary. The program counter is
always aligned to a 32-bit boundary and points to an issue slot rather than to
an individual instruction. [...]

Where two instructions are executed simultaneously, any destination operands
should be disjoint. If they are not disjoint, an exception will be raised.

When the resource lane stalls a thread, the other lane will be stalled also. This is
normally not observable, except when an interrupt or an exception is raised. On
an interrupt or exception, no registers will be overwritten, and the PC will point to
the instruction to be reexecuted.

If an instruction in one of the two lanes causes an exception, then this exception is
reported. If the other lane is executing an instruction then this second instruction
is aborted. If the instructions in both lanes cause an exception, then only one
exception is reported, and both instructions are aborted, but any memory store
which is in progress will complete. On an exception, the saved PC value is set to
the instruction that caused the exception.

[...]
In short: in dual issue mode, the processor can execute two instructions truly concurrently, potentially doubling performance.

I think the details above are very important when you write assembler code. When you write xc or C code, the compiler takes care of these details for you. You can switch between the modes per function with [[dual_issue]] and [[single_issue]]. Merely enabling optimization also makes the compiler use dual issue mode.

Pages 286 to 289 list the instructions that can be issued on each lane. The load instructions, for example, process 8, 16 or 32 bit values. The XS2 is a 32 bit MCU, so at first glance it is not surprising that the largest loadable value is 32 bits.

My questions:
============
1) Some threads here on the forum claim that in dual issue mode the memory of arrays and structs must be aligned to 64 bit, otherwise unpredictable things happen. Is this really true? I have not yet found this restriction anywhere in the official XMOS documents, and I think it would also contradict the XS2 architecture.
2) If it is true, is there a short code example showing that unpredictable things happen when the 64 bit alignment is ignored?
3) And why must it be 64 bit and not 32 bit?
4) Why do you need alignment at all?
5) Are there other data types that have to be aligned?
6) What "unpredictable things" can happen in detail?



Post by dsteinwe »

On p. 289 I have found some 64 bit load and store instructions. These instructions are 32 bits wide. The load instructions start with "ldd" and the store instructions with "std". That means you can also read and write 64 bits at once, which could be evidence for the 64 bit alignment requirement.
fabriceo
XCore Addict
Posts: 186
Joined: Mon Jan 08, 2018 4:14 pm

Post by fabriceo »

Hi, I only use 64bit alignment for data that I want to load/store with the ldd/std instructions. It also helps the compiler to use these instructions when working with long long variables. I don't see any other reasons.

Dual issue is very convenient and efficient when you write assembler code yourself. It is a nice exercise in optimizing code.
The compiler is not good at generating optimized code for dual lane.
For the sw usb audio, Ross is using the compiler flags:
-Os -mno-dual-issue
which bring good optimization without generating dual issue code.

What I find very good is the inlining capability: the compiler can really optimize register usage.
For a big, say, 1000-line .c file, I now prefer to break it down into several .h files with static inline functions.
Also, sometimes it is better to use inline asm instructions than to write a routine in assembler, because this way the register allocation is optimized.
CousinItt
Respected Member
Posts: 365
Joined: Wed May 31, 2017 6:55 pm

Post by CousinItt »

1) Some threads here on the forum claim that in dual issue mode the memory of arrays and structs must be aligned to 64 bit, otherwise unpredictable things happen. Is this really true? I have not yet found this restriction anywhere in the official XMOS documents, and I think it would also contradict the XS2 architecture.
Yes, it's really true. It's explained in section 7.3 of xCORE-200: The XMOS XS2 Architecture:
Pairs of words can be accessed in a single instruction. This requires the address to be aligned on a two-word boundary; it must be a multiple of Bpw x 2.
2) If it is true, is there a short code example showing that unpredictable things happen when the 64 bit alignment is ignored?
How about this?

Code: Select all

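#include <stdio.h>
#include <stdint.h>

/* Test data: two 64-bit values, i.e. four 32-bit words. */
uint64_t mydata[2] = {0xA1A1A1A1B2B2B2B2, 0xC3C3C3C3D4D4D4D4};

void test(const uint32_t * pdata)
{
    uint32_t x0;
    uint32_t x1;

    /* Load two consecutive 32-bit words with a single double-word load (ldd). */
    asm("ldd %0,%1,%2[0]":"=r"(x1),"=r"(x0):"r"(pdata));
    printf("%x, %x\n", x0, x1);
}

int main(void)
{
    uint32_t * ptr = (uint32_t *) mydata;

    test(ptr);  /* start of the array */
    ptr++;      /* advance by one 32-bit word (4 bytes) */
    test(ptr);  /* address is now 4 bytes past the start */

    return 0;
}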
In my build the simulator prints:

Code: Select all

b2b2b2b2, a1a1a1a1
Unhandled exception: LOAD_STORE, data: 0x00041c2c
3) And why must it be 64 bit and not 32 bit?
Because the architecture guide says so.
4) Why do you need alignment at all?
The core makes a 64-bit wide access to memory in one operation. Presumably it is less costly (in hardware and time) to require the access to be aligned than to access two consecutive 64-bit words when the data is not aligned (also see question 3!).
5) Are there other data types that have to be aligned?
As far as I know it only relates to the double word load and store instructions: LDDSP, STDSP, LDDI, STDI, LDD, STD.
6) What "unpredictable things" can happen in detail?
You will get an ET_LOAD_STORE exception, but it can appear to happen randomly, depending on how the data happens to be aligned in a particular build. So some code might work or stop working if, for example, you change an unrelated software module or a compiler/linker setting.

Post by dsteinwe »

Thank you all for the interesting answers. Until now I have not been familiar with assembler programming, but I can read simple assembler code.

I have played a little with the code example. The following is also valid:

Code: Select all

uint32_t mydata[4] = {0xA1A1A1A1, 0xB2B2B2B2, 0xC3C3C3C3, 0xD4D4D4D4};
The second thing I have learned is that the address in the pointer must be divisible by 8 (when using ldd). That means writing "ptr+=2;" instead of "ptr++;" fixes the exception. Now I have written:

Code: Select all

uint32_t mydata[3] = {0xA1A1A1A1, 0xB2B2B2B2, 0xC3C3C3C3};
No exception occurs. However, the second ldd now also reads the memory after the array, which leads to unpredictable results. It becomes even more dramatic when data is written past the end of the array: then no one can predict the behaviour of the program anymore.

CousinItt wrote:
3) And why must it be 64 bit and not 32 bit?
Because the architecture guide says so.
4) Why do you need alignment at all?
The core makes a 64-bit wide access to memory in one operation. Presumably it is less costly (in hardware and time) to require the access to be aligned than to access two consecutive 64-bit words when the data is not aligned (also see question 3!).
5) Are there other data types that have to be aligned?
As far as I know it only relates to the double word load and store instructions: LDDSP, STDSP, LDDI, STDI, LDD, STD.
fabriceo wrote:
I only use 64bit alignment for data that I want to load/store with ldd/std instruction. Also this helps the compiler to use these instructions when playing with long long variable. I don't see other reasons.
I agree with you, CousinItt and fabriceo. As far as I have read the "xCORE-200: The XMOS XS2 Architecture" document, there is no general rule that 64 bit alignment must always be applied. It is only required for instructions that process double words, such as LDDSP, STDSP, LDDI, STDI, LDD and STD. This also means that the 64 bit alignment is not tied to the issue mode, but to the instructions.

If you write your own assembler code, as fabriceo does and suggests, you are responsible for the alignment yourself. You also have to be familiar with assembler programming.

If you write C or xc code, the compiler is responsible for the alignment. I assume that the compiler handles the alignment correctly for you. If not, please leave a comment and an example.

The statement
The compiler is not good at generating optimized code for dual lane.
For the sw usb audio, Ross is using the compiler flags:
-Os -mno-dual-issue
which bring good optimization without generating dual issue code.
surprised me. It prompted me to compile and test one of my own runtime-critical libraries with these settings. It works as well as with "-O3 -g". Amazing! Then I took a look at the module lib_xud at version 2.2.4:

Code: Select all

VERSION = 2.2.4

MODULE_XCC_FLAGS = $(XCC_FLAGS) \
                   -O3 \
                   -fasm-linenum \
                   -fcomment-asm \
                   -DXUD_FULL_PIDTABLE=1 \
                   -g

XCC_FLAGS_XUD_IoLoop.S = $(MODULE_XCC_FLAGS) -fschedule

XCC_FLAGS_endpoint0.xc = $(MODULE_XCC_FLAGS) -Os
XCC_FLAGS_dfu.xc = $(MODULE_XCC_FLAGS) -Os
XCC_FLAGS_dfu_flash.xc = $(MODULE_XCC_FLAGS) -Os

XCC_FLAGS_XUD_Client.xc = $(MODULE_XCC_FLAGS) -mno-dual-issue
XCC_FLAGS_XUD_Main.xc = $(MODULE_XCC_FLAGS) -mno-dual-issue
XCC_FLAGS_XUD_PhyResetUser.xc = $(MODULE_XCC_FLAGS) -mno-dual-issue
XCC_FLAGS_XUD_Support.xc = $(MODULE_XCC_FLAGS) -mno-dual-issue
XCC_FLAGS_XUD_IOLoopCall.xc = $(MODULE_XCC_FLAGS) -mno-dual-issue
XCC_FLAGS_XUD_Signalling.xc = $(MODULE_XCC_FLAGS) -mno-dual-issue -Wno-return-type
XCC_FLAGS_XUD_TestMode.xc = $(MODULE_XCC_FLAGS) -mno-dual-issue
XCC_FLAGS_XUD_SetCrcTableAddr.c = $(MODULE_XCC_FLAGS) -mno-dual-issue
XCC_FLAGS_XUD_User.c = $(MODULE_XCC_FLAGS) -mno-dual-issue
OK, so finding the optimal compiler optimization settings is more challenging than expected. I guess the developers of lib_xud have measured the timing of each unit in order to tune the compiler options.

I have one more question: the cited text from "xCORE-200: The XMOS XS2 Architecture" uses the term "lane". Is a lane essentially the same as a pipeline? If not, what is a lane?