Several questions about XMOS

Heater · Post by **Heater** » Tue Apr 12, 2011 5:51 pm

Taking up Woody's suggestion of collecting timing results and then dumping them with printf, I now have a version of the original jitter test exercised with 1 timed_thread and from 1 to 7 waste_threads.
The timed_thread collects results of 100 runs into an array and then prints them.
This has been run for ref clock settings of 100, 200 and 400 MHz.

Here are the results:

Code:

Sys Clk = 400MHz Ref Clk = 100MHz
---------------------------------
Thread  Clocks
8       10
7       8/9
6       7/8
5       6/7
4       5
3       5
2       5
1       5

Sys Clk = 400MHz Ref Clk = 200MHz
---------------------------------
Thread  Clocks
8       20
7       17/18
6       15
5       12/13
4       10
3       10
2       10
1       10

Sys Clk = 400MHz Ref Clk = 400MHz
---------------------------------
Same as 100MHz Ref Clk!!!

And here is the code:

Code:

#include <stdio.h>
#include <platform.h>

void waste_thread()
{
	while(1);
}

int results[100];
int x;

void timed_thread()
{
    long startTime, endTime;
    timer t;
    int i;
    int loops = 100;

    printf ("Determinism test:\n");
    while (1)
    {
    	for (i = 0; i < loops; i++)
    	{
    		// Start benchmark timer
    		t :> startTime;

    		x++;

    		// Stop benchmark timer
    		t :> endTime;

    		// Record resulting run time
    		results[i] = endTime - startTime;
    	}
    	for (i = 0; i < loops; i++)
    	{
    		printf ("Run time %d = %d timer ticks\n", i, results[i]);
    	}
    }
}

int main()
{
	par
	{
        on stdcore[0]: timed_thread();
        on stdcore[0]: waste_thread();
        on stdcore[0]: waste_thread();
        on stdcore[0]: waste_thread();
        on stdcore[0]: waste_thread();
        on stdcore[0]: waste_thread();
        //on stdcore[0]: waste_thread();
        //on stdcore[0]: waste_thread();
	}
 	return 0;
}

Conclusions?

1) If you want execution determinism you need to have 1, 2, 3, 4 or 8 threads.
2) If you want execution determinism in a thread it should not use div or mod (not sure about mul) or it should be the only thread that uses div/mod.
3) Tweaking the clocks does not buy much and probably causes some confusion and inconvenience.
4) Barring complications with div/mod we are looking at 10ns jitter here, which should be OK in many cases.
5) XMOS' claims of 100% execution determinism are a trifle optimistic.
6) For real determinism on real-world I/O you need to use port timers and such. The subject of my next experiments....

Did I get this right so far?
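
One way to read the figures above, offered as a sketch and not as a description of how the hardware is documented: assume a thread is issued one instruction every max(4, N) core clocks when N threads are active, and assume the timed region spans 5 issue slots (an inference from the 5-tick result for 1 to 4 threads, not taken from a disassembly). The C model below, whose function names are invented for illustration, gives the smallest and largest tick counts a timer could report, depending on where the region starts relative to a reference clock edge.

```c
#include <assert.h>

/* Core clocks between two issue slots of the same thread.
   ASSUMPTION: one slot every max(4, threads) core clocks. */
static int slot_spacing(int threads)
{
    assert(threads >= 1 && threads <= 8);
    return threads < 4 ? 4 : threads;
}

/* Core clocks taken by `slots` consecutive instructions of one thread.
   ASSUMPTION: every instruction takes exactly one issue slot. */
static int region_core_clocks(int slots, int threads)
{
    assert(slots > 0);
    return slots * slot_spacing(threads);
}

/* Fewest reference ticks observable: the region starts right on a tick edge. */
static int min_ticks(int core_clocks, int core_per_ref)
{
    assert(core_per_ref > 0);
    return core_clocks / core_per_ref;
}

/* Most reference ticks observable: the region starts one core clock
   before a tick edge. */
static int max_ticks(int core_clocks, int core_per_ref)
{
    assert(core_per_ref > 0);
    return (core_clocks + core_per_ref - 1) / core_per_ref;
}
```

With 5 slots and 4 core clocks per reference tick (400 MHz core, 100 MHz reference), 5 threads give 25 core clocks, so 6 or 7 ticks, and 8 threads give 40 core clocks, so exactly 10. With 2 core clocks per tick (200 MHz reference), 7 threads give 17 or 18. These agree with the 100 MHz and 200 MHz tables, which is the only evidence offered here for the two assumptions.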

segher · Post by **segher** » Tue Apr 12, 2011 7:37 pm

The timing always is fully deterministic. But it is much easier to predict the timing
if you do not need to consider other threads; a big factor in that is not having more
than four threads running at once.

If you need fast I/O, it gets a lot more complicated, since you need to consider
the "outside world". Usually, your best choice is to design things in such a way
that exact timing of most things _does not matter_. This is true on all platforms
of course, not just on XS1.

The nice thing of the "jitter" you see is that it is <10ns (the clock period of the
slower of the two clocks in your crossed clock domains), not >10us as you see
on most other platforms!

For what it's worth, if you want to see the exact same number of reference clock
cycles on every iteration of your loop, you can do that for any number of threads:
just make sure every iteration takes an exact multiple of the reference clock
period, i.e. a multiple of 4 core cycles in the standard 400MHz/100MHz setup.
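
The rule in the last paragraph can be checked with a few lines of C. This is a sketch under two assumptions that the post does not state: a thread is issued one instruction every max(4, N) core clocks with N active threads, and every instruction in the loop takes exactly one issue slot. The function name is made up for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* True if a loop of `instructions` issue slots always spans the same number
   of reference ticks, i.e. its length in core clocks is an exact multiple
   of the reference period.
   ASSUMPTION: slot spacing is max(4, threads) core clocks. */
static bool constant_tick_count(int instructions, int threads, int core_per_ref)
{
    assert(instructions > 0 && core_per_ref > 0);
    assert(threads >= 1 && threads <= 8);

    int spacing = threads < 4 ? 4 : threads;
    int core_clocks = instructions * spacing;
    return core_clocks % core_per_ref == 0;
}
```

With the standard divider of 4, a 5-instruction loop on 5 threads (25 core clocks) fails the check, while padding it to 8 instructions (40 core clocks) passes. With 4 or 8 threads, any instruction count passes under this model.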

Heater · Post by **Heater** » Tue Apr 12, 2011 8:23 pm

segher,

segher wrote: The timing always is fully deterministic.

Yes and no.

Yes, I would expect that every time I run my program, barring interaction with the outside world, it will do exactly the same thing. In that way deterministic as all digital computers are.

No: if it takes me forever to analyse my program to figure out what it will do, or if it's excessively hard to write my program to do what I want at the time I want it, or if I have to run the thing on a simulator or a real chip to find out what it does, then that is not very deterministic. That is "I can't determine what will happen".

As you say, sticking to four threads and avoiding divides helps a great deal with my determining how it will behave. This is an issue that would not exist if we just had a core per thread but there we are.

As you say, when interaction with the outside world is introduced things get harder. If I really need that handful of instructions (and no more) to handle an event and produce a result, then I really need determinism: I need to stick to fewer than 5 threads and avoid divides. This is probably just a symptom of working on the edge of the device's capabilities, though.

segher wrote: The nice thing of the "jitter" you see is that it is <10ns

Oh yes.

segher wrote: For what it's worth, if you want to see the exact same number of reference clock
cycles on every iteration of your loop, you can do that for any number of threads:
just make sure every iteration takes an exact multiple of the reference clock
period, i.e. a multiple of 4 core cycles in the standard 400MHz/100MHz setup.

Yep, I have achieved that on occasion now.

segher · Post by **segher** » Wed Apr 13, 2011 8:45 am

Heater wrote: If I really need that handful of instructions (and no more) to handle an event and produce a result then I really need determinism. I need to stick to fewer than 5 threads

As I've said a hundred times now (and I will no more): no, it is still fully deterministic with more
than four threads. It just doesn't time exactly the same as with four or fewer threads (well, duh!)

It is *very* easy to analyse timing in such cases, compared to "usual" CPUs/MCUs/whatever
you want to call it.

For most cases, buffered and timed ports make it easy to get the exact I/O timing you want; you
just need to make your code fast enough, and worst case time is easy to analyse, and very
very close to best case time, too! This holds whether you use one thread, or four, or seven,
or eight.

davelacey · Post by **davelacey** » Wed Apr 13, 2011 10:12 am

Hi,
I'll try and avoid the word deterministic here since it can cause confusion.
The xcore provides two really important properties:

1. Fast event signalling based on particular hardware events (e.g. the tick of a clock, be it the reference clock or an externally sourced clock).

2. Predictable (or deterministic, if you like) thread execution which makes it possible to determine worst case execution time of a thread independent of the other threads.

Between the two of these you can produce code that inputs or outputs data with strict timing requirements. With respect to this:

* There is nothing special about using 5 or fewer threads.
* There is the caveat that we are working on the assumption that the divide unit is not used.

What 2) above doesn't mean is that you can use instruction timing to time your interaction with the outside world. This was never the intention of the design.

Let's take an example. Here is some code that uses timers to output a clock in software. In between outputs to the port it calls the do_some_stuff function:

Code:

timer tmr;
int t;
int x = 0;
tmr :> t;
t += PERIOD;
while (1) {
   tmr when timerafter(t) :> void;
   p <: x;
   do_some_stuff();
   x = ~x;
   t += PERIOD;
}

Two things matter about this code. Firstly, there will be a variable amount of time between the timer event and the port output. This gives jitter on the output, and that jitter depends on the number of threads running, in a somewhat unpredictable but bounded way (I think it is certainly safe to say the cycle-to-cycle jitter will be < 100ns, and probably a lot better than that).

Secondly, it is important that the worst case time between the timer inputs is less than PERIOD clock ticks. If nothing in do_some_stuff uses divide this is easy to determine statically. This is exactly what the XTA tool does.
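
As a rough illustration of this second point, the check amounts to comparing a worst-case path length with PERIOD. The sketch below is not how XTA works internally. It assumes a 400 MHz core, a 100 MHz reference clock, one issue slot per instruction, and a worst case of one slot every 8 core clocks (all eight threads busy). The function and constant names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum {
    CORE_PER_REF  = 4,  /* ASSUMPTION: 400 MHz core / 100 MHz reference */
    WORST_SPACING = 8   /* ASSUMPTION: all 8 threads active */
};

/* Worst-case time for `instructions` issue slots, in reference ticks,
   rounded up to a whole tick. */
static int worst_case_ticks(int instructions)
{
    assert(instructions > 0);
    int core_clocks = instructions * WORST_SPACING;
    return (core_clocks + CORE_PER_REF - 1) / CORE_PER_REF;
}

/* True if the path between two timer inputs is guaranteed to take
   less than `period_ticks` reference ticks. */
static bool meets_period(int instructions, int period_ticks)
{
    return worst_case_ticks(instructions) < period_ticks;
}
```

Under these assumptions a 40-instruction path needs at most 80 ticks, so it fits a PERIOD of 100 but not a PERIOD of 80. Because the bound already assumes eight busy threads, it does not change when other threads are added.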

If the jitter caused by the delay between the timer input and the port output is not good enough, as I mentioned before, the hardware provides a tighter mechanism for port output events:

Code:

int t;
int x = 0;
p <: 0 @ t;
t += PERIOD;
while (1) {
   p @ t <: x;
   do_some_stuff();
   x = ~x;
   t += PERIOD;
}

Here the port output happens based directly on the port's clock. There will still be some jitter on the output due to the physical design of the chip, but it will be much smaller (< 2ns and probably a lot better than that). The other concern is the same as before: the time between port outputs must take no more than PERIOD clock ticks.

I hope this helps understand the "determinism" of the architecture better and how it should be used. All I've said here is independent of the number of threads running.

Dave

Heater · Post by **Heater** » Wed Apr 13, 2011 10:34 am

segher,

Strangely enough I agree with you "in most cases" and that we should stop flogging a dead horse with this debate. We just have a slightly different perspective on this determinism issue.

I have to attempt one last lash, because in the limit determinism vaporizes. Imagine:

1) One of my threads is required to suck bits in or blow bits out as fast as a single core will go. This might end up as a software-timed thing, as setting up timers and such costs extra instructions. As a trivial case, imagine implementing an 8-bit comparator: two times 8 bits in, one bit out, working at the performance limit of a core.

2) All is well until I decide I need to make use of another thread for some other task, perhaps adopting some nice open source object whose internals I know nothing about.

Boom, if that extra functionality requires I now use more than 4 threads my original task starts to fail.

Of course the reverse is true, I might want to adopt some existing code but it won't fly because of what my application is doing.

The point is that, given an "object A" and an "object B" that work perfectly well by themselves, I can't be sure that they will work together without analysing the combination. The determinism of one is intimately linked to the other. I cannot look at my nice comparator code in isolation and say that it will work in all applications all the time.

Now, as you say, in most cases this should not be an issue, namely when working well within performance limits and using timers to straighten things out.

The xcores do indeed do a far better job at this than a traditional MCU with interrupts etc. However, it's not quite the determinism you would expect from having 8 separate cores instead of 8 threads in one core.

It is only this little quibble that causes me to question when it is claimed that the xcores have 100% execution determinism.

Woody · Post by **Woody** » Wed Apr 13, 2011 2:00 pm

I guess the real issue here is why you need determinism.

You need determinism for 2 reasons to my mind:
1. to ensure that your I/O occurs at the correct time. This should usually be achieved using port timers rather than relying on the code's execution time being the same irrespective of the number of active threads.

2. to ensure that your code executes fast enough to keep up with the data rate. For this you need to know the worst-case timing; you're not that worried how long it actually takes, so long as it is quick enough. It is straightforward to compute this on an XCore using XTA, whereas on other platforms it can often be a major headache for real-time applications.

For reference, changes in code execution time come down to the following:
* threads becoming active/pausing
* interactions with other threads using the same synchroniser, link, lock and divide resources
* interactions between thread code and the reference clock
* external events

I'm bound to have forgotten a scenario above, but I can't think of it at the moment.

segher · Post by **segher** » Wed Apr 13, 2011 3:35 pm

* being interrupted by a real interrupt (not an event)
* debug mode fires

Yeah, too obvious, I know :-)

Woody · Post by **Woody** » Wed Apr 13, 2011 3:40 pm

Eek: 'real interrupts', what are they?!

lilltroll · Post by **lilltroll** » Wed Apr 13, 2011 9:23 pm

May I recommend the wave viewer:
[attachment: Waves.png]

You can easily see waiting threads, as well as the thread ID and which instruction is issued.

In this example thread 4 is running in the beginning, but after an in with an empty buffer it falls asleep, resulting in 012301230123

It might be easier to follow things over a larger time-frame this way.
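
For readers without the attachment, the issue pattern described above can be reproduced with a small round-robin model in C. It is only a sketch: the function name and the `sleep_at` table are invented for the example, a paused thread is assumed to be skipped without costing a slot, and the model shows only the order of issue, not any idle slots the real pipeline may have with fewer than four runnable threads.

```c
#include <assert.h>

/* Write the IDs of the threads issued over `slots` consecutive issue slots
   into `out` as digits, visiting runnable threads in round-robin order.
   `sleep_at[id]` is the slot index from which thread `id` is paused
   (use a value >= slots for a thread that never pauses).
   '-' marks a slot in which every thread is paused.
   `out` must have room for slots + 1 characters. */
static void issue_sequence(const int *sleep_at, int threads, int slots, char *out)
{
    assert(threads >= 1 && threads <= 8);
    assert(slots >= 0);

    int next = 0;                        /* thread to consider first */
    for (int s = 0; s < slots; s++) {
        int tried = 0;
        /* Skip over threads that are asleep at slot s. */
        while (tried < threads && s >= sleep_at[next]) {
            next = (next + 1) % threads;
            tried++;
        }
        out[s] = (tried == threads) ? '-' : (char)('0' + next);
        next = (next + 1) % threads;
    }
    out[slots] = '\0';
}
```

Called with five threads, `sleep_at = {99, 99, 99, 99, 5}` and 13 slots, the model yields `0123401230123`: thread 4 is issued once and then drops out of the rotation, leaving the repeating 0123 pattern.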
