Fault-Tolerant Programming

Post by **jonathan** » Fri Dec 11, 2009 8:19 am

Fault-tolerant parallel programming (and supporting tools) on arrays of XMOS cores/chips seems an obvious project. It will probably take some inventiveness to design automated systems for fault-tolerant programming. For example, you could automatically replicate code across several threads, cores, or chips and then collect and collate the results of the executed code. There is a wide range of techniques to choose from: self-checking cores, self-checking chips, etc. Which techniques you use probably depends on the application and its reliability criteria.
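To make the replicate-and-collate idea concrete, here is a minimal sketch in Python rather than XC, using ordinary threads as stand-ins for XMOS threads or cores. The name `run_replicated` and the simple majority vote are assumptions chosen for illustration, not an existing tool, and results are assumed to be hashable so they can be counted.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_replicated(task, arg, replicas=3):
    """Run task(arg) on several worker threads and majority-vote the results."""

    def safe_call(_):
        try:
            return ("ok", task(arg))
        except Exception as exc:  # a crashed replica simply casts no vote
            return ("error", repr(exc))

    # Hypothetical stand-in for spreading the code over threads/cores/chips.
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        outcomes = list(pool.map(safe_call, range(replicas)))

    # Collate: count identical results from the replicas that completed.
    votes = Counter(value for status, value in outcomes if status == "ok")
    if not votes:
        raise RuntimeError("all replicas failed")
    value, count = votes.most_common(1)[0]
    if count <= replicas // 2:
        raise RuntimeError("no majority among replicas")
    return value
```

With three replicas this masks one wrong or crashed replica; tolerating more faults would need more replicas, and a real system would also have to protect the voter itself.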

A nice thing about implementing the techniques in software is that you can turn fault tolerance on for correctness-critical tasks and, for latency- or bandwidth-sensitive tasks with a margin for error, dynamically adapt the code and turn it off on the fly. This would allow sensing and control equipment to be combined within a single system, tolerating some sensing input errors while retaining fault-tolerant control for the mission-critical parts.
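The on/off switch could look something like the following sketch, again in Python for illustration only. `AdaptiveRunner` and its `fault_tolerant` flag are hypothetical names, and the redundancy here is temporal (the same task repeated and voted on) simply to keep the example short.

```python
from collections import Counter


class AdaptiveRunner:
    """Runs a task once, or several times with majority voting, per a runtime flag."""

    def __init__(self, replicas=3):
        self.replicas = replicas
        self.fault_tolerant = True  # may be flipped at any time while running

    def run(self, task, arg):
        if not self.fault_tolerant:
            # Fast path for latency-sensitive work: one execution, no checking.
            return task(arg)
        # Checked path: repeat the task and accept only a majority result.
        results = [task(arg) for _ in range(self.replicas)]
        value, count = Counter(results).most_common(1)[0]
        if count <= self.replicas // 2:
            raise RuntimeError("no majority among replicas")
        return value
```

The checked path costs `replicas` executions per call, which is the latency that gets traded away when the flag is turned off.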

Post by **Folknology** » Fri Dec 11, 2009 10:36 am

This thread is really interesting, particularly given my recent experience using Erlang/OTP.

Erlang/OTP has some great built-in features to help with fault tolerance.

For example, processes are designed to be able to fail and can be automatically restarted by a supervisor process within the virtual machine in which they run.

If a process depends on another, it can also keep an eye on the health of that second process. Erlang/OTP has a built-in function called link that lets processes monitor each other: if a linked process dies, the other process is notified. Exit and error handlers can then be written to deal with these situations. There are standard idioms for dealing with the different error types and for trapping exit signals. It is common in Erlang/OTP to have supervisory hierarchies of processes to help manage monitoring and fault tolerance.

As a minimum, a fault-tolerant system requires two systems. These can be two threads inside one core, but multiple separate systems are better, depending on the possible causes of failure and the amount of redundancy needed. Erlang/OTP does not differentiate between the two: it uses the same model for processes whether they run as threads on the same system or as processes on other local or remote systems.
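As a rough illustration of the link-and-restart idea, here is a sketch in Python, not Erlang and not XC. A queue stands in for the link that delivers exit notifications, and `supervise` is a hypothetical helper, not part of OTP or any library.

```python
import queue
import threading


def supervise(worker, max_restarts=3):
    """Run worker() in a thread; restart it whenever it dies abnormally.

    Returns the number of restarts that were needed.
    """
    exits = queue.Queue()  # plays the role of the link: carries exit notices

    def linked():
        try:
            worker()
            exits.put(("normal", None))
        except Exception as exc:  # abnormal exit: report it to the supervisor
            exits.put(("error", exc))

    restarts = 0
    while True:
        threading.Thread(target=linked, daemon=True).start()
        reason, exc = exits.get()  # block until the linked worker exits
        if reason == "normal":
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError("restart limit reached") from exc
        restarts += 1
```

A restart limit is included so that a worker that can never succeed does not loop forever. This sketch only catches failures that surface as exceptions; a hung worker would need something like a timeout or heartbeat on top.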

As I am still coming up to speed on XC, I am not sure how much of this can be translated to XCores, but I would imagine the use of channels to be critical to implementing a similar fault-tolerant linking system, along with standard code templates and idioms. Either way, I am definitely interested in such a project.

regards
Al

Post by **leon_heller** » Fri Dec 11, 2009 1:33 pm

Plessey, Roke Manor, bought one of my 16-module transputer systems 24 years ago, and designed a fault-tolerant system for it. They'd show it to a prospective customer and ask him (it probably was a him in those days) to pull out two or three modules at random. It kept running, of course. My modules were mounted vertically, unlike the later Inmos ones, which was ideal for that type of system. They weren't designed for hot-swapping, but it worked. Something like that would be quite easy with the XMOS chips, from a mechanical point of view.

That system of mine cost about £13,000, by the way!

Leon
