Topology and failure

Post by **Folknology** » Mon Jul 15, 2013 12:47 pm

I am currently working on a new project that potentially involves modular Xmos boards which, among other things, manage communication. By design the system needs to be modular and scalable from just a few boards to tens of boards. One of the key reasons for proposing a modular Xmos solution is the seamless communication and CSP model it offers for agile development; other candidates would require intermediate communication protocols, extra overheads and complex tool setups. However, the modular system needs to be robust, and I would also like to be able to add redundancy to the design. I therefore have some questions about selecting suitable Xmos link topologies (beyond my current, fairly basic experience).

Each module would contain a single L1. There could be anything from two to tens of these modules, plus additional LX modules on the network (topology) that act as routers/supervisors/converters...

Initially I assumed a simple master-slave chain:

M-->S0-->S1-  ->SX

But then I naturally realised that this could be unreliable, as a single board failure can break the network (communications) and isolate the slaves further along the chain. So I figured a star topology might be more reliable:

   Sax<-  -Sa1<--Sa0 <--\   / --> Sb0-->Sb1-  ->Sbx
                          M 
   Scx<-  -Sc1<--Sc0 <--/   \--> Sd0-->Sd1-  ->Sdx

Obviously this could also be expanded to multiple masters for redundancy. However, I have only seen double chains before, never quad chains. Are quad chains supported out of the box by the Xmos toolchain, or would I be entering uncharted territory and have to roll my own, so to speak?

The other question I keep coming back to is what happens when any of the modules in the above topologies fails. Does the network itself still fail, even with the quad, more star-like topology?
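
To make the failure question concrete, here is a minimal sketch that models each board as a graph node and each link as a bidirectional edge, then counts how many boards the master M can no longer reach after one board dies. It only looks at physical connectivity, not at how packets are actually routed, and the board names, chain length and arm sizes are hypothetical.

```python
from collections import deque

def reachable(links, start, failed):
    """Nodes reachable from `start` over bidirectional `links`,
    treating every node in `failed` as dead (it neither answers nor forwards)."""
    if start in failed:
        return set()
    adj = {}
    for a, b in links:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in adj.get(node, []):
            if nb not in seen and nb not in failed:
                seen.add(nb)
                queue.append(nb)
    return seen

def chain_from(root, names):
    """Links for a daisy chain root-names[0]-names[1]-..."""
    nodes = [root] + names
    return list(zip(nodes, nodes[1:]))

def lost_nodes(links, failed):
    """Nodes the master M can no longer reach (failed nodes included)."""
    nodes = {n for link in links for n in link}
    return nodes - reachable(links, "M", failed)

# One 12-slave chain versus a star of four 3-slave arms (hypothetical sizes).
chain = chain_from("M", [f"S{i}" for i in range(12)])
star = [link for arm in "abcd"
        for link in chain_from("M", [f"S{arm}{i}" for i in range(3)])]

print(len(lost_nodes(chain, {"S1"})))  # 11: S1 and every slave after it
print(len(lost_nodes(star, {"Sb0"})))  # 3: only arm b is cut off
print(len(lost_nodes(star, {"M"})))    # 13: the master is a single point of failure
```

In this model the star limits the damage to the arm containing the failed board, but the master itself remains a single point of failure, which is one argument for the multiple-master variant.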

Another possibility I have considered is a ring:

->M-->S0-->S1-  ->SX-\
\___________________/

This has the advantage that a single break could still allow communication to occur. Is this sort of topology supported?
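
In graph terms, a ring stays connected after any single node failure, whereas a line does not. A minimal sketch, assuming bidirectional links and hypothetical node names; it checks physical connectivity only, not whether the configured routes would actually use the surviving path.

```python
from collections import deque

def reachable(links, start, failed):
    """Nodes reachable from `start` over bidirectional `links`,
    skipping every node listed in `failed`."""
    if start in failed:
        return set()
    adj = {}
    for a, b in links:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in adj.get(node, []):
            if nb not in seen and nb not in failed:
                seen.add(nb)
                queue.append(nb)
    return seen

ring = ["M", "S0", "S1", "S2", "S3"]
# Close the loop: the last slave links back to the master.
ring_links = list(zip(ring, ring[1:] + ring[:1]))

print(sorted(reachable(ring_links, "M", {"S1"})))        # ['M', 'S0', 'S2', 'S3']
print(sorted(reachable(ring_links, "M", {"S0", "S2"})))  # ['M', 'S3']
```

With one failed slave every other board is still physically reachable, but two failures can still cut a segment off (here S1).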

I really just wanted to get some feedback from the Xcore community and Xmos in this area. If you have any experience with, or just ideas about, Xmos topologies for these (or other) applications, I would be very interested.

Thoughts?

regards
Al

Post by **lilltroll** » Mon Jul 15, 2013 7:51 pm

Do you know when the star topology will be supported by the toolchain?
12.x does not support stars for the L1.

edit:
Ohh, found this topic today:
http://www.xcore.com/forum/viewtopic.php?p=14275#p14275

Post by **richard** » Tue Jul 16, 2013 12:03 pm

I've posted a bit more information about what will be supported in the 13 tools on the following thread.

http://www.xcore.com/forum/viewtopic.php?f=27&t=2245

Folknology wrote: Another possibility I have considered is a ring:
->M-->S0-->S1-  ->SX-\
\___________________/
This has the advantage that a single break could still allow communication to occur, is this sort of topology supported?

A ring isn't necessarily any more fault tolerant than a line. The routing tables determine the route that a packet takes on the network, and they are normally set up once, before the application starts running. If the routing tables specify that a packet from node A to node B goes via route R and there is a problem with route R, then that packet will be lost. It doesn't matter whether there is another route the packet could have taken, since only the routing tables are used to decide which direction to send the packet (i.e. the hardware won't automatically route around the problem by dynamically changing the routing tables).
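
A toy model of this behaviour: each node holds a fixed next-hop table set up before the application starts, and forwarding never deviates from it. This is a simplification under stated assumptions (the real switch hardware uses a more compact table format, and the ring, node names and all-clockwise routing here are hypothetical), but it shows why a surviving path doesn't help when the configured route is broken.

```python
def static_tables(ring):
    """Hypothetical fixed tables set up once at boot: every packet travels clockwise."""
    n = len(ring)
    return {
        node: {dst: ring[(i + 1) % n] for dst in ring if dst != node}
        for i, node in enumerate(ring)
    }

def deliver(tables, src, dst, down_links):
    """Forward a packet hop by hop using only the fixed tables.

    Returns the list of nodes visited, or None if the next link on the
    configured route is down. No alternative route is ever tried."""
    path = [src]
    node = src
    while node != dst:
        nxt = tables[node][dst]
        if frozenset((node, nxt)) in down_links:
            return None  # configured route is broken: the packet is lost
        path.append(nxt)
        node = nxt
    return path

ring = ["M", "S0", "S1", "S2", "S3"]
tables = static_tables(ring)
down = {frozenset(("S0", "S1"))}

print(deliver(tables, "M", "S2", set()))  # ['M', 'S0', 'S1', 'S2']
print(deliver(tables, "M", "S2", down))   # None, although M-S3-S2 is intact
```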

It may be possible to detect the link error state in your application and reconfigure the routing tables dynamically based on this, but this is something you would need to hand-roll, and I suspect it would be a lot of work.
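
A minimal sketch of the reconfiguration step only, assuming the application can somehow learn which links are down: recompute shortest-path next-hop tables over the links that still work. It leaves out the genuinely hard parts, such as detecting the fault reliably and getting the new tables onto every node safely, and the function and node names are hypothetical.

```python
from collections import deque

def build_tables(nodes, links, down_links=frozenset()):
    """Compute next-hop tables over the links that are still up.

    tables[node][dst] is the neighbour `node` should forward to for `dst`.
    A BFS outward from each destination gives every other node its
    neighbour one step closer to it (a shortest path). Unreachable
    destinations are simply left out of a node's table."""
    adj = {n: [] for n in nodes}
    for a, b in links:
        if frozenset((a, b)) not in down_links:
            adj[a].append(b)
            adj[b].append(a)
    tables = {n: {} for n in nodes}
    for dst in nodes:
        seen = {dst}
        queue = deque([dst])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    tables[nb][dst] = node  # nb reaches dst via node
                    queue.append(nb)
    return tables

ring = ["M", "S0", "S1", "S2", "S3"]
links = list(zip(ring, ring[1:] + ring[:1]))

before = build_tables(ring, links)
# Hypothetical: the application has noticed that the S0-S1 link is in an
# error state (detection is not modelled here) and rebuilds every table.
after = build_tables(ring, links, down_links={frozenset(("S0", "S1"))})

print(before["M"]["S1"], after["M"]["S1"])  # S0 S3
```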

What error cases are you trying to guard against? What is the desired behaviour when one of those error cases occurs?

Post by **Folknology** » Tue Jul 16, 2013 12:26 pm

richard wrote: What error cases are you trying to guard against? What is the desired behaviour when one of those error cases occurs?

Well, it's damage limitation on one hand, in that if a node goes down it should have minimal effect on the other running nodes (or at least such effects should be minimised). The more complex solution being considered is to build redundancy into the topology (running spare nodes that are already mapped in), so that a failing node could be ignored and its role externally patched over to the spares.
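
One way to picture the spare-node idea is as a table from logical roles to physical boards, with idle spares already wired in. The sketch below, with hypothetical role and board names, moves any role whose board has failed onto the next free spare; it assumes the spares are already reachable through the routing tables, as described above.

```python
def remap(assignment, spares, failed):
    """Move every role hosted on a failed board onto a free spare.

    assignment: dict mapping role -> physical board (hypothetical names)
    spares: idle boards already mapped into the topology, used in order
    Returns (new_assignment, remaining_spares); raises if spares run out."""
    new = dict(assignment)
    free = [s for s in spares if s not in failed]
    for role, node in sorted(assignment.items()):
        if node in failed:
            if not free:
                raise RuntimeError(f"no spare left for role {role!r}")
            new[role] = free.pop(0)
    return new, free

roles = {"sensor": "S0", "motor": "S1", "display": "S2"}
new_roles, left = remap(roles, spares=["X0", "X1"], failed={"S1"})
print(new_roles)  # {'sensor': 'S0', 'motor': 'X0', 'display': 'S2'}
print(left)       # ['X1']
```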
