Several questions about XMOS

User avatar
infiniteimprobability
XCore Legend
Posts: 1126
Joined: Thu May 27, 2010 10:08 am

Post by infiniteimprobability »

Here are a couple of pictures which hopefully explain it. Basically, each active thread gets one clock before the next thread is scheduled, and that holds down to a minimum of four threads. That means a thread is never more than n clocks away from its next instruction (where n is the number of active threads, with a floor of 4).

The practical upshot is that each thread has a guaranteed worst-case MIPS - that, along with predictable instruction timing (and no interrupts needed, thanks to events), means you can build software that is 100% deterministic - deterministic enough to deliver hardware interfaces in software.
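To put numbers on it, take a 400MHz device (an XS1-G4, say - the clock speed here is just for illustration): with 4 active threads each thread is guaranteed 400/4 = 100 MIPS; with 8 active threads each is still guaranteed 400/8 = 50 MIPS.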


Heater
Respected Member
Posts: 296
Joined: Thu Dec 10, 2009 10:33 pm

Post by Heater »

infiniteimprobability,

That is a very nice explanation and set of diagrams.

The problem is this statement:
that along with predictable instruction timing (and no interrupts needed due to the events) means that you can build software that is 100% deterministic
which is NOT exactly true :)

Recently I have been running some timing tests like so:
1) A blob of code to be timed is wrapped in a timing loop that calculates how many timer ticks it takes and prints the result. This loop is run as a thread.

2) Another thread function is defined that is basically a "do nothing" loop.

3) The first thread is timed while running alongside from 0 to 7 instances of the "do nothing" loop (1 to 8 threads in total).

Results:

a) When running a total thread count from 1 to 4, the timing loop repeatedly reports the same time for the blob's execution.

b) When running 4 of the "do nothing" threads (5 threads in total), the blob's execution time increases by 25%, as expected.

c) When running 5 of the "do nothing" threads (6 threads in total), the blob's execution time increases by the same amount again.

d) And so on up to 8 threads total.

HOWEVER, at somewhere around 6 or 7 total threads (I forget exactly) the blob's execution time alternates between two values that differ by 1.
That is to say, there is a 10ns jitter (one tick of the 100MHz reference clock) in its execution time.

Not much, you might say. True, but it's not 100% deterministic.

I was about to post a question about this observation; I will do so once I have boiled my test down to a few lines of code to post here.

The next issue is divide and modulus. I understood from comments David May made a while back that executing these instructions could cause other threads to jitter. I have yet to demonstrate that to myself experimentally.
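For anyone who wants to try it, here is a minimal sketch of one way to test this (the function name and constants are mine, not from any XMOS example): run it in place of one of the "do nothing" threads and time the blob as before.

Code:

// Hypothetical stress thread: hammers divide and modulus so any
// effect on the other threads' timing can be observed.
void divmod_thread()
{
    unsigned x = 12345;
    while (1)
    {
        x = x / 7;   // should compile to a divide instruction
        x = x % 3;   // should compile to a modulus instruction
        x += 12345;  // keep the operands non-trivial
    }
}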
omega7
Active Member
Posts: 32
Joined: Thu Jun 03, 2010 12:16 pm

Post by omega7 »

I presume that the four different colours represent the four pipeline stages?

But I cannot match this with the information in "Programming XC on XMOS Devices" (chapter "Thread performance", page 37), which says:

"Because individual threads may be delayed on I/O, their unused processor cycles can be taken by other threads. Thus, for more than four threads, the performance of each thread is often higher than the minimum shown above."

If XMOS is deterministic, (thread) performance should not depend on threads being delayed on I/O, should it? Or am I missing something? I suspect I am confusing two things here...

Martin
User avatar
jonathan
Respected Member
Posts: 377
Joined: Thu Dec 10, 2009 6:07 pm

Post by jonathan »

OK, so basically...

When a thread is "waiting" for an event to happen, it goes to sleep. By default, no instructions are issued from that thread, so it is not allocated an instruction scheduling slot (of which there are between 4 and 8).

What this means is that if you have 5 threads, but one of them is "waiting" for an event to occur (such as a specific input on a specific port), the four threads that are not "waiting" - i.e. the threads that are "ready" - will in fact run consecutively, while the fifth thread is descheduled.

This means that you can have more than 4 threads running concurrently and still get 4-threaded performance, as long as no more than 4 threads are ever "ready" at once.

It should be stated that this is the default behaviour. As discussed in a related thread, "fast mode" can be used instead, which I believe guarantees a thread its scheduling slot even while it is "waiting". This is effectively polling behaviour: it (probably) lowers the overall instruction throughput of the program, in return for saving on average a few cycles of event response time, because the thread does not have to be rescheduled into an available slot once the event arrives.
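To make "waiting" concrete, here is a minimal sketch of a thread that deschedules itself until a port event fires (the port and the pin condition are placeholders, not from any real design):

Code:

#include <xs1.h>

void wait_for_pin(in port p)
{
    int value;
    while (1)
    {
        // The select deschedules this thread: it consumes no
        // instruction slots until the pin condition is met.
        select
        {
            case p when pinseq(1) :> value:
                // Rescheduled here once the pin reads 1.
                break;
        }
    }
}

For fast mode, I believe the call is set_thread_fast_mode_on() from <xs1.h>, but check the documentation before relying on that.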

Hope this helps.
User avatar
jonathan
Respected Member
Posts: 377
Joined: Thu Dec 10, 2009 6:07 pm

Post by jonathan »

I think the wording on slide 17 above is misleading and wrong.

1. "Each thread executes a minimum every 4 clock ticks - f/4 MHz"
2. "Each thread executes a minimum every 8 clock ticks - f/8 MHz"

The first should really say: "Each ready thread can execute at most once every four clock ticks, maximum of f/4 MHz".

The second should really say: "Each ready thread executes at least once every eight clock ticks, minimum of f/8 MHz and maximum of f/4 MHz".

Even those statements don't quite capture the event-driven nature of XMOS scheduling. However, I do think they should be corrected, as they are clearly confusing: the first, as written, implies threads can run faster than f/4 MHz (which they can't).
User avatar
jonathan
Respected Member
Posts: 377
Joined: Thu Dec 10, 2009 6:07 pm

Post by jonathan »

Heater, would love to see your code that exhibits this "non-deterministic" behaviour.
User avatar
jonathan
Respected Member
Posts: 377
Joined: Thu Dec 10, 2009 6:07 pm

Post by jonathan »

jonathan wrote: Heater, would love to see your code that exhibits this "non-deterministic" behaviour.
At present I can think of one explanation only... and it would occur only at exactly 7 threads.
Heater
Respected Member
Posts: 296
Joined: Thu Dec 10, 2009 10:33 pm

Post by Heater »

jonathan,

OK, here is the minimal program I came up with that demonstrates a 1-clock jitter in the execution of a timing loop with various numbers of threads running. As shown, 1 thread runs the timer loop and 4 just loop idly. The results look like this:

Code:

Determinism test:
Run time = 1 timer ticks
Run time = 1 timer ticks
Run time = 2 timer ticks
Run time = 1 timer ticks
Run time = 1 timer ticks
Results for various numbers of threads:

Code:

Threads    Clocks
8          2 
7          2/1
6          1
5          2/1
4          1
3          1
2          1
1          1
As you can see, a total of 5 or 7 threads results in a one-clock jitter.

Here is the program:

Code:

#include <stdio.h>
#include <platform.h>

// Idle thread: just burns its scheduling slot.
void waste_thread()
{
    while (1)
    {
    }
}

void timed_thread()
{
    long startTime, endTime;
    timer t;

    printf("Determinism test:\n");
    while (1)
    {
        // Start benchmark timer
        t :> startTime;

        // The blob of code to be timed would go here; it is left
        // empty, so the interval measured is just two back-to-back
        // timer reads.

        // Stop benchmark timer
        t :> endTime;

        printf("Run time = %d timer ticks\n", endTime - startTime);
    }
}

int main()
{
    par
    {
        on stdcore[0]: timed_thread();
        on stdcore[0]: waste_thread();
        on stdcore[0]: waste_thread();
        on stdcore[0]: waste_thread();
        on stdcore[0]: waste_thread();
        //on stdcore[0]: waste_thread();
        //on stdcore[0]: waste_thread();
        //on stdcore[0]: waste_thread();
    }
    return 0;
}
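To reproduce the table above, uncomment the waste_thread lines one at a time - each adds one more thread, up to the 8-thread maximum (the timed thread plus 7 idle ones).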
Heater
Respected Member
Posts: 296
Joined: Thu Dec 10, 2009 10:33 pm

Post by Heater »

Strangely enough, this jitter does not show up when the determinism test is built in debug mode - the execution times are just much longer.
User avatar
davidnorman
Junior Member
Posts: 6
Joined: Fri Mar 18, 2011 10:43 am

Post by davidnorman »

I have seen this effect too. It took a while to track down what was happening: at first we thought it was clock retiming between the L2's two cores, then unknown delays in the channels, but in the end it turned out to be thread scheduling.