Fast Distributed Process Creation with the XMOS XS1 Arch.

Jamie · Post by **Jamie** » Tue May 24, 2011 9:07 am

For anyone interested: I've recently written up a load of the work I did last year based around the XMP-64 board in a paper which is being published in the proceedings of this year's Communicating Process Architectures conference (http://www.wotug.org/cpa2011/).

Here's the abstract:

The provision of mechanisms for processor allocation in current distributed parallel programming models is very limited. This makes difficult, or even prohibits, the expression of a large class of programs which require a run-time assessment of their required resources. This includes programs whose structure is irregular, composite or unbounded. Efficient allocation of processors requires a process creation mechanism able to initiate and terminate remote computations quickly. This paper presents the design, demonstration and analysis of an explicit mechanism to do this, implemented on the XMOS XS1 architecture, as a foundation for a more dynamic scheme. It shows that process creation can be made efficient so that it incurs only a fractional overhead of the total runtime and that it can be combined naturally with recursion to enable rapid distribution of computations over a system.

You can view the paper on the arXiv: http://arxiv.org/abs/1105.3843

Although I don't really mention it in the paper, all of the results are obtained from my 'sire' implementation which is an XCore project on Github (https://github.com/xcore/tool_sire). I've made a load of changes since then so it's now not in a particularly functional state, but you can check out the 'v0.1' tag to get a version which reflects the described functionality.

Comments, questions or criticisms welcome!

phalt · Post by **phalt** » Tue May 24, 2011 10:00 am

Just reading through it now, this is a brilliant paper!

Well done :)

Post by **Folknology** » Tue May 24, 2011 12:09 pm

Hey Jamie really like Sire and the paper

I have looked through quite a bit of your sire code over last few months when investigating C thread creation etc.. its a very interesting implementation and clearly a lot of good work has gone into it, thanks for sharing. I guess one of your next steps is to improve things like stack management and move from fixed size to something more dynamic, given the recent C threads/stackspace thread here on Xcore are you likely to implement Per Brinch Hansen method, as true determinism doesn't seem to be an obstacle for Sire?

regards
Al

Jamie · Post by **Jamie** » Tue May 24, 2011 9:01 pm

I guess one of your next steps is to improve things like stack management and move from fixed size to something more dynamic, given the recent C threads/stackspace thread here on Xcore are you likely to implement Per Brinch Hansen method, as true determinism doesn't seem to be an obstacle for Sire?

The way it works at the moment is each thread is allocated a fixed-size chunk of memory for its stack. This keeps it simple and deterministic although it is clearly not a particularly efficient use of memory; it should be sufficient though for the programs I will want to run to perform experiments.

Another way of dealing with the problem of determinism and unbounded stack usage is to exploit parallelism, permitting only one recursive thread per stack and distributing other recursive threads over different cores. Then, instead of threads competing for resources resulting in non-deterministic behaviour, each unbounded thread will coexist with threads whose stack usage is bounded and known at compile time so it occupies a fixed-sized area of memory. I'm not sure this is something I will peruse with my work but it's an interesting idea that makes you think that there could be merit in completely different parallel execution models.

Post by **Folknology** » Tue May 24, 2011 9:29 pm

Yeah I had some similar thoughts about passing out deterministic and non deterministic tasks in an intelligent matter, that would work nicely on many core systems.

However it gets tricky with single or small core designs, however in certain applications manual partitioning can be applied to a similar effect.

regards
Al

Jamie · Post by **Jamie** » Wed May 25, 2011 10:17 am

Folknology wrote:Yeah I had some similar thoughts about passing out deterministic and non deterministic tasks in an intelligent matter, that would work nicely on many core systems.

However it gets tricky with single or small core designs, however in certain applications manual partitioning can be applied to a similar effect.

You might find this paper called "The Role and Future of Occam" by Peter Welch interesting as it mentions some of this stuff: http://books.google.co.uk/books?hl=en&l ... &q&f=false

Post by **Folknology** » Wed May 25, 2011 1:19 pm

HeHe

"This paper assumes no previous experience with occam, but C-programmers should sit down first"

LOL

P.S. I notice there are pages missing from that section..

regards
Al

Fast Distributed Process Creation with the XMOS XS1 Arch.

Fast Distributed Process Creation with the XMOS XS1 Arch.

Re: Fast Distributed Process Creation with the XMOS XS1 Arch.

Re: Fast Distributed Process Creation with the XMOS XS1 Arch.

Re: Fast Distributed Process Creation with the XMOS XS1 Arch.

Re: Fast Distributed Process Creation with the XMOS XS1 Arch.

Re: Fast Distributed Process Creation with the XMOS XS1 Arch.

Re: Fast Distributed Process Creation with the XMOS XS1 Arch.