Is the XMOS instruction set liable to change?

Technical questions regarding the XTC tools and programming with XMOS.
Heater
Respected Member
Posts: 296
Joined: Thu Dec 10, 2009 10:33 pm

Is the XMOS instruction set liable to change?

Post by Heater »

Firstly, I must say I'm surprised there have been no threads started here by now. Are there really no ASM, or prospective ASM, coders out there?

Back to the question. One very good reason for avoiding ASM is the way in which processors and their instruction sets have a habit of disappearing, going obsolete or evolving in incompatible ways. Is the plan for the XMOS instruction set to remain as it is for a long time? Or is it considered malleable in future iterations of devices?

Yes, I know C and XC can compile to efficient and compact code and paper over any architectural changes, but sometimes you just have to wring things out a bit :)
Last edited by Heater on Fri Mar 25, 2011 12:08 pm, edited 1 time in total.


Berni
Respected Member
Posts: 363
Joined: Thu Dec 10, 2009 10:17 pm

Post by Berni »

Well, I was planning on writing some SPI code in ASM to make it go as fast as possible.

And I'm guessing they are going to change the assembler a bit. One of the biggest problems is the 64KB of RAM; a lot of people wanted more. There is a slight problem with more RAM, as the addressing is 16-bit in assembler, which means they have already reached the maximum addressable RAM. They might extend the address length or use paging.
Heater
Respected Member
Posts: 296
Joined: Thu Dec 10, 2009 10:33 pm

Post by Heater »

Oh no! It would be a shame to have to deal with memory pages/segments again on a 32-bit CPU.

I haven't looked at the assembly language or the instruction layouts very hard yet, but it looks like the memory space could be stretched quite a long way without modifying the instructions.

As far as I can see, most regular data access is via the stack pointer, the data pointer or the constant pointer using an offset of up to 16 bits. Relative branches are limited to 16 bits, if I understand correctly. So presumably each of the 8 threads on a core could live in its own 64K space before we run into trouble. Not bad for starters.

I find it hard to believe the instruction set has been designed without future RAM expansion, either internal to the chip or external, in mind.
dave
Member++
Posts: 31
Joined: Thu Dec 10, 2009 10:11 pm

Post by dave »

There are no plans to change the XMOS instruction set. It is likely that some extra instructions will be added in future products, and provision has been made for this, especially in the 32-bit format. In fact, very few of the available codes in the 32-bit format are used at present. These instructions can have up to six register operands (as in the case of the long multiply instruction, LMUL).

The address range of the processor is 2**32 bytes. Although the instructions are mainly 16-bit, this does not affect the address range, which is determined by the size of the registers. All addressing instructions (that is load, store and load-address instructions) use one register as a (32-bit) base address and supply an offset either from the instruction or from another register.

The stack and data regions are accessed using instructions which have a 6-bit offset in the 16-bit format and a 16-bit offset in the 32-bit format. The offsets are scaled before they are used so in fact 2**16 words can be accessed relative to the SP or DP. Access to data structures can use any operand register as a (32-bit) base address - this is combined with a scaled (32-bit) offset which can come from another operand register.
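The offset scaling described above can be sketched in Python. This is a hedged illustration of the arithmetic only, not XMOS tooling; the 4-byte word size follows from the 32-bit architecture, and the offset widths are taken from the post above:

```python
WORD_BYTES = 4  # XS1 word size, assumed from the 32-bit architecture

def ldw_sp_address(sp, offset_words):
    """Effective address of an SP-relative load/store: the encoded
    offset counts words, so it is scaled by 4 before being added."""
    return sp + offset_words * WORD_BYTES

# 16-bit offset field in the 32-bit format gives 2**16 word offsets,
# so the window reachable above SP (or DP) is 2**16 * 4 bytes = 256 KB.
reach_bytes = (2**16) * WORD_BYTES
print(hex(ldw_sp_address(0x1000, 3)))  # 0x100c
print(reach_bytes)                     # 262144
```

This is why a 16-bit offset field does not limit the region to 64KB: scaling by the word size stretches the reach fourfold.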

The branch instructions that use offsets from the PC have 6-bit offsets in the 16-bit format and 16-bit offsets in the 32-bit format. The offsets are scaled before they are used so these instructions reach up to 2**16 (16-bit) instructions backwards or forwards. The call (branch and link) instructions have 10-bit offsets in the 16-bit format and 20-bit offsets in the 32-bit format so these instructions reach up to 2**20 (16-bit) instructions backwards or forwards. There are similar instructions which load an address relative to the PC allowing access to data regions within the program (or loading of program entry-points).
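The same scaling argument applies to the PC-relative instructions. A small sketch of the reach arithmetic (illustrative only; direction bits and the separate forward/backward encodings are ignored here):

```python
def reach_instructions(offset_bits):
    """Number of 16-bit instruction positions reachable by a scaled
    PC-relative offset field of the given width; offsets are scaled
    by the 2-byte instruction granularity before use."""
    return 2 ** offset_bits

# 32-bit format: 16-bit offsets for branches, 20-bit offsets for calls.
print(reach_instructions(16))  # 65536 instructions for branches
print(reach_instructions(20))  # 1048576 instructions for calls
```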

The CP is mainly intended for constant data and for holding function entry addresses. For example, calls can be made via the first 1024 words relative to CP using a 16-bit call instruction.

The use of both a 16-bit and a 32-bit format for these instructions makes things easy for compiling and programming whilst keeping programs compact - in practice the 16-bit format is large enough most of the time. The use of relative addressing means that position-independent code is efficient and easy to produce.

It is also worth pointing out that the instruction set could be used without modification for processors with a different wordlength - such as 64 bits. This is primarily because it separates address arithmetic and normal arithmetic.
nisma
Active Member
Posts: 53
Joined: Sun Dec 13, 2009 5:39 pm

Post by nisma »

What is the timing for the asm instructions ?
Heater
Respected Member
Posts: 296
Joined: Thu Dec 10, 2009 10:33 pm

Post by Heater »

Thank you for the comprehensive reply David.

It's good to hear that any effort put into assembler coding will have a long shelf life. Not that I'm proposing a mass return to assembler level but for some tasks it's a natural. For example when wanting to emulate some other instruction set.

Time for some serious reading of the architecture manual...
dave
Member++
Posts: 31
Joined: Thu Dec 10, 2009 10:11 pm

Post by dave »

As far as instruction timing goes - all instructions except the divides take four cycles to pass through the execution pipeline. Of course, some instructions (such as input-output instructions) may cause the thread to pause - with the result that they will be re-executed later.

With four or fewer threads active, the instructions in a single thread will execute in four cycles each - 10ns/instruction or 100 MIPS/thread at 400MHz. With five threads active they will execute in five cycles each - 80 MIPS/thread at 400MHz; with eight threads active they will execute in eight cycles each - 50 MIPS/thread at 400MHz.

There are situations in which a thread may be unable to issue an instruction because its instruction buffer is empty - this will slightly reduce performance. This can be minimised with a little care in the ordering/aligning of instructions - sometime soon the assembler will be able to do this automatically. In any case, the performance is completely deterministic and predictable.

There is an explanation of instruction issue in the architecture manual - section 5 starting on page 8.
Heater
Respected Member
Posts: 296
Joined: Thu Dec 10, 2009 10:33 pm

Post by Heater »

David, could you elaborate on your statement "In any case, the performance is completely deterministic and predictable" ?

From the Instruction Set Architecture document I find "multiple threads may share the same division unit." which implies that two threads performing divisions will influence each other's execution timing in rather unpredictable ways.

Compounded by "The division may take up to bpw thread-cycles" which implies an execution time dependency on the value of the operands. Again unpredictable.

It looks to me as if the execution speed of a simple loop such as the one shown below cannot be determined through simple instruction cycle counting.

Code:

loop:   in      d, r
        divu    d, x, y
        out     r, d
        brbu    loop
Presumably determinism is to be achieved through the use of timers and port timers. Given that this loop does not know how many threads it is running alongside, its execution speed is variable in any case, and the use of timers is essential.
nisma
Active Member
Posts: 53
Joined: Sun Dec 13, 2009 5:39 pm

Post by nisma »

dave wrote:As far as instruction timing goes - all instructions except the divides take four cycles to pass through the execution pipeline.
Branch instructions, and especially conditional branch instructions, usually have to clear the pipeline unless there is a dedicated (look-ahead) branch cache or similar. Does a branch, and especially the instruction that follows it, run at 100MHz speed (presuming 4 threads)? Or, on a branch, is the instruction prefetch buffer invalidated, code loaded from RAM at some speed, and that 32-bit buffer only evaluated 4 clock cycles later, resulting in an execution time far from the 4-clock-cycle instruction?

Further, it seems there is one core running at 400MHz, doing up to 8-way thread switching in hardware through its instruction pipeline. Is it possible to exchange a sleeping thread using some sort of register/privileged instruction at runtime (in asm)? What is the time needed to load a new opcode into the opcode register if the instructions are not aligned, or missing (after a branch)?

How about RAM? From the docs it seems the CPU is internally clocked at 400MHz. RAM access is shared between opcode fetch (four instances) and the prefetch pipeline or ALU, depending on the implementation. Does this mean that, in the case of concurrent usage, one thread/operation must wait, extending the real execution time of the instruction?
dave
Member++
Posts: 31
Joined: Thu Dec 10, 2009 10:11 pm

Post by dave »

As far as the Divide (and Remainder) instructions are concerned, I should have given a more extensive explanation.

The threads share the unit that executes these instructions; this unit runs at the core clock rate (400MHz). It takes bpw (32) core clock cycles to execute one of these instructions. This does not depend on the operands.

So if N threads are executing, it may take N*bpw core cycles to complete a divide instruction - a thread may have to wait for N-1 other threads to complete their divide instructions. And N core cycles is a thread cycle - so N*bpw core cycles is bpw thread cycles.

So - there is potential interference between divide instructions in different threads. But any one is guaranteed to complete in at most bpw thread cycles.
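The worst-case bound above can be sketched as arithmetic in Python (a hedged illustration of the reasoning in this post, with bpw = 32 for the 32-bit word length; not a model of the actual divide unit):

```python
BPW = 32  # bits per word: core cycles the shared divide unit needs per divide

def divide_worst_case_core_cycles(active_threads):
    """Worst case: a thread's divide queues behind divides issued by
    all N-1 other threads, then takes BPW core cycles itself."""
    return active_threads * BPW

def divide_worst_case_thread_cycles(active_threads):
    """N core cycles make one thread cycle, so the bound is BPW
    thread cycles regardless of how many threads are active."""
    return divide_worst_case_core_cycles(active_threads) // active_threads

print(divide_worst_case_core_cycles(8))    # 256 core cycles
print(divide_worst_case_thread_cycles(8))  # 32 thread cycles
```

So the interference between threads cancels out when expressed in thread cycles, which is what makes the bound usable for deadline reasoning.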

The principle used in the XMOS processors is that timed operations should be done using timers and timed ports; a thread simply has to have completed its processing before each of its timing deadlines.

Of course, it's possible to avoid use of division in real-time performance critical threads ...