Threads, Stackspace and recursion

Technical questions regarding the XTC tools and programming with XMOS.
Folknology
XCore Legend
Posts: 1274
Joined: Thu Dec 10, 2009 10:20 pm

Threads, Stackspace and recursion

Post by Folknology »

With the current toolchain, how is multi-threaded stack space calculated or arranged to allow for recursion using XC/C/C++, or even just C/C++ with ASM threads?

I am guessing that the tools can calculate each thread's stack requirement from the code, except where recursion is present. Is this the case? Are there limits to recursion support, given multiple per-thread stacks, or is some other method employed, like some sort of fancy unified stack, to solve the issue?

regards
Al


Heater
Respected Member
Posts: 296
Joined: Thu Dec 10, 2009 10:33 pm

Post by Heater »

Looks like you need:

#pragma stackfunction n

and/or

#pragma stackcalls n

See the XMOS Compiler Collection document for details.

If you look at the assembler output of the compilers you will see that for each function the stack usage is specified in assembler directives (or whatever they are called), so that the linker/mapper can figure out the required stack size for each thread.

Of course that all breaks down for recursive call graphs, at which point one needs the above pragmas to force the issue. Then you are on your own regarding how big the stack needs to be, I guess.
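To make that concrete, here is a minimal sketch of how the pragma sits above a recursive function. The word count is a hand-picked illustration (roughly per-frame words times an expected maximum depth), not a measured figure, and compilers other than xcc will simply ignore the unknown pragma:

```c
/* Hand-sized stack budget: the static analysis cannot bound recursion,
   so we claim a worst case ourselves. The value 40 is illustrative
   (assume ~4 words per frame and a maximum depth of 10). */
#pragma stackfunction 40
int fact(int n)
{
    if (n <= 1)
        return 1;
    return n * fact(n - 1); /* the recursive call the tools cannot size */
}
```

The burden is then on the caller to guarantee the recursion depth actually stays within the budget the pragma declares.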
Folknology
XCore Legend
Posts: 1274
Joined: Thu Dec 10, 2009 10:20 pm

Post by Folknology »

Right, so with an XC project (which could include C/C++) the compiler calculates function stack requirements and annotates them with stack words so that XMapper can correctly predict the stack size. If, however, it detects that you are using multiple threads and sees recursion that it cannot optimise out, it then warns or produces an error?

At which point you then add "#pragma stackfunction n" above the offending function, where n is the estimated worst-case requirement based on the expected recursion depth.

I also presume that in the case of a C/C++-only project (no XC) that uses ASM to create async or sync threads, the compiler wouldn't issue a warning or error, because it won't detect the threads?

Obviously in the latter case one has to allocate stacks directly; can one get access to the stack estimates already added by the compiler?
Or does anyone have any good strategies for this kind of usage, other than just allocating large fixed stack blocks?

regards
Al
Heater
Respected Member
Posts: 296
Joined: Thu Dec 10, 2009 10:33 pm

Post by Heater »

The only way I know is to compile your code to ASM and then look at those stack-word annotations.
Then figure out what your maximum recursion depth is going to be and add the appropriate pragma to the source.
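For reference, the per-function annotation in the compiler's assembly output looks something like the listing below. The symbol naming follows the XMOS `.nstackwords` convention; the function name and word count here are invented for illustration:

```asm
    .globl  myfunc.nstackwords
    .set    myfunc.nstackwords, 6   // words of stack myfunc needs
    .globl  myfunc
myfunc:
    entsp   6                       // extend the stack by 6 words on entry
    // ... function body ...
    retsp   6                       // pop 6 words and return
```

The mapper sums these `.nstackwords` values over each thread's call graph, which is exactly the sum that a recursive cycle makes unbounded.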
Jamie
Experienced Member
Posts: 99
Joined: Mon Dec 14, 2009 1:01 pm

Post by Jamie »

I also presume that in the case of a C/C++-only project (no XC) that uses ASM to create async or sync threads, the compiler wouldn't issue a warning or error, because it won't detect the threads?
The stack usage is calculated in the same way for single-threaded programs, as input to the resource checks.
Obviously in the latter case one has to allocate stacks directly; can one get access to the stack estimates already added by the compiler?
Or does anyone have any good strategies for this kind of usage, other than just allocating large fixed stack blocks?
As far as I'm aware there's no good answer to this question, and there are a number of different approaches. For example, Google Go gets around the problem by allocating initial space on the stack and then, when this is exhausted, allocating more dynamically in fixed-size chunks on the heap, maintaining the relations as a linked-list structure. Alternatively, the Cilk language uses a 'cactus stack', which dynamically allocates and deallocates frames in the stack area in a similar way. Both these methods incur extra overhead on function calls, though, and as they employ dynamic allocation they break scheduling invariance.
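A minimal sketch of that linked-chunk idea, in the spirit of the Go/Cilk approaches described above. The chunk size and the `frame_alloc`/`frame_free` API are invented for illustration, and each frame is assumed to fit within one chunk:

```c
#include <stdlib.h>

#define CHUNK_WORDS 256

typedef struct chunk {
    struct chunk *prev;      /* link back to the caller's chunk */
    size_t used;             /* words consumed in this chunk */
    long words[CHUNK_WORDS]; /* the actual frame storage */
} chunk;

/* Allocate a frame of n words, growing onto a fresh heap chunk when the
   current one is exhausted. The malloc here is the per-call overhead
   (and scheduling-invariance breaker) mentioned above. */
static long *frame_alloc(chunk **top, size_t n)
{
    if (*top == NULL || (*top)->used + n > CHUNK_WORDS) {
        chunk *c = malloc(sizeof *c);
        c->prev = *top;
        c->used = 0;
        *top = c;
    }
    long *frame = &(*top)->words[(*top)->used];
    (*top)->used += n;
    return frame;
}

/* Pop a frame; release the chunk once it empties. */
static void frame_free(chunk **top, size_t n)
{
    (*top)->used -= n;
    if ((*top)->used == 0) {
        chunk *old = *top;
        *top = old->prev;
        free(old);
    }
}
```

Each thread keeps its own `chunk *top`, so many logical stacks share one heap without fixed pre-allocation, at the price of a test (and sometimes an allocation) on every call.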
Folknology
XCore Legend
Posts: 1274
Joined: Thu Dec 10, 2009 10:20 pm

Post by Folknology »

As far as I'm aware there's no good answer to this question, and there are a number of different approaches. For example, Google Go gets around the problem by allocating initial space on the stack and then, when this is exhausted, allocating more dynamically in fixed-size chunks on the heap, maintaining the relations as a linked-list structure. Alternatively, the Cilk language uses a 'cactus stack', which dynamically allocates and deallocates frames in the stack area in a similar way. Both these methods incur extra overhead on function calls, though, and as they employ dynamic allocation they break scheduling invariance.
I was thinking about this earlier and figured that if I were ever to design a multithreaded CPU I might redesign the stack section to operate in a different manner, using partial dynamic lookup-table addressing that could allocate frames/chunks of thread stack memory from an overall common stack space. Through its dynamic lookup-table address mapping, the stack would appear to the thread as a continuous sequence of stack memory, despite being made up of interleaved chunks of common stack space. But back here in the real world we have XS1, so I will have to do something in the meantime; maybe just blocks for now ;-)
The stack usage is calculated in the same way for single-threaded programs, as input to the resource checks.
Could you expand on this a little, please, to help me understand it better?

regards
Al
Folknology
XCore Legend
Posts: 1274
Joined: Thu Dec 10, 2009 10:20 pm

Post by Folknology »

Let me also ask some more specific questions regarding the stack-space calculation/allocation in the C/C++-with-ASM case:

Assume I have a bunch of functions named A, B, C and D, and my thread creation is something like:


void C(void);

void main()
{
..
A();
B();
..
create_async_thread(&C, stackoffset);
..
}

void C(void)
{
..
D();
..
}
create_async_thread allocates the thread resource, sets the thread's cp, dp, sp, lr and pc = &C, then starts the thread.
Assume that functions A and B are called only in the initial main thread (which creates the async thread), while D is only called from C, and C is only run by the async thread. Would the optimised toolset then calculate the stack size to include only the effects of A and B, but not C and D, which I would have to account for in stackoffset?

regards
Al
jonathan
Respected Member
Posts: 377
Joined: Thu Dec 10, 2009 6:07 pm

Post by jonathan »

Been a while since I looked at this, but Per Brinch Hansen had a good scheme to permit parallel recursion; not sure if anyone's mentioned it above.

http://dl.dropbox.com/u/7049108/brinchhansen.pdf
Jamie
Experienced Member
Posts: 99
Joined: Mon Dec 14, 2009 1:01 pm

Post by Jamie »

create_async_thread allocates the thread resource, sets the thread's cp, dp, sp, lr and pc = &C, then starts the thread.
Assume that functions A and B are called only in the initial main thread (which creates the async thread), while D is only called from C, and C is only run by the async thread. Would the optimised toolset then calculate the stack size to include only the effects of A and B, but not C and D, which I would have to account for in stackoffset?
Not sure what you mean by 'optimised toolset', but the compiler would have no knowledge of the new thread and hence its stack requirements. This must be accounted for by the programmer in the stack offset.
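In practice that "account for it yourself" fallback usually means carving the worker's stack out of a block you size by hand. A minimal sketch, where `create_async_thread` is the hypothetical routine from the example above and the word counts are invented estimates (e.g. read off the compiler's stack-word annotations for C and D):

```c
#include <stdint.h>

/* Hand-estimated worst-case words for the async thread's call chain
   C() -> D(). These numbers are illustrative, not measured. */
#define C_STACK_WORDS 64
#define D_STACK_WORDS 32
#define WORKER_STACK_WORDS (C_STACK_WORDS + D_STACK_WORDS)

static uint32_t worker_stack[WORKER_STACK_WORDS];

/* XS1 stacks grow downward, so the new thread's initial sp should point
   at the top word of its private block; this is the value you would
   pass as the stack offset when setting up the thread. */
uint32_t *worker_initial_sp(void)
{
    return &worker_stack[WORKER_STACK_WORDS - 1];
}
```

The main thread's automatically computed stack (covering A and B) stays untouched; only the block above has to be sized, and over-estimating it is the usual safety margin.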
Folknology
XCore Legend
Posts: 1274
Joined: Thu Dec 10, 2009 10:20 pm

Post by Folknology »

The Per Brinch Hansen scheme is interesting and would be great for building virtual threads; however, in this case, using C for example, it would require its own thread for allocation/deallocation. It would also add latency to blocks/functions and generally slow things down.
I think it would need an internal hardware implementation around the stack/scheduler to work effectively without hitting performance. Definitely worth looking at from a CPU design perspective, though; implementing hardware support for this sort of scheme on a threaded processor would be a great improvement.

But for now it seems I'm back with fixed stack sizes/estimations for XS1.

regards
Al