I'd like to see it used as a crypto tool, generating key sets or even (educationally, of course) cracking codes. I can imagine it would handle WEP/WPA cracking in a heartbeat if coded right :D
An implementation of Folding@Home or Seti@Home might be fun too...
What would you do with an XDK or XMP-64?
-
- Member++
- Posts: 21
- Joined: Fri Dec 11, 2009 3:42 pm
-
- XCore Legend
- Posts: 1274
- Joined: Thu Dec 10, 2009 10:20 pm
Regarding the memory/bandwidth issue
Maybe the design could be improved by having more modular hypercubes interleaved with SRAM or suchlike between certain geometric sections. That is, some of the nodes within the hypercube are specialised towards storage (maybe dual-ported in nature). Actually the board should contain a mixture of perhaps 3 specialised node types:
1) Interface nodes - connecting outside the hypercube.
2) Standard processing/transforming nodes - inside the hypercube.
3) Storage nodes - inside the hypercube.
One might even have sub-specialisations of the interfaces into external I/O and external memory.
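As an (entirely invented, untested) XC sketch of how the three roles might be wired up - core numbering, function names and the data flow are all just illustration:

#include <platform.h>

// Stub roles, just to show the shape: data flows edge -> processing -> storage.
void interface_node(chanend out) { out <: 42; }          // edge: external I/O
void processing_node(chanend in, chanend out) {
    int v;
    in :> v;
    out <: v * 2;                                        // some transform
}
void storage_node(chanend in) { int v; in :> v; }        // sink/store

int main(void) {
    chan a, b;
    par {
        on stdcore[0]: interface_node(a);
        on stdcore[1]: processing_node(a, b);
        on stdcore[2]: storage_node(b);
    }
    return 0;
}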
Thoughts?
regards
Al
-
- Respected Member
- Posts: 377
- Joined: Thu Dec 10, 2009 6:07 pm
Al - this is an interesting suggestion.
As long as the different nodes use a homogeneous instruction set architecture and computational model I would be in favour of something constructed like this.
Heterogeneous system architectures on a large scale become unmanageable for the compiler and therefore pass the complexity onto the programmer. It is unlikely that as a programmer you want to have to worry about this level of complexity.
In my opinion (and paraphrasing people like Leslie Valiant and Per Brinch Hansen, amongst others) the only scalable and simple way to program large concurrent systems is to use a variety of concurrent design patterns, or to exploit very simple (concurrent) algorithms, on a homogeneous underlying architecture.
The more complex and specialized you make parts of the board the more you lose the computational model of "many threads with their own memory and channels", and therefore the harder it becomes to program the system to do anything sensible.
Now, this said, you could already envisage a board with some L1s near the edges and G4s at the centre. The L1s would be used for the I/O and the G4s as the "brains". You wouldn't necessarily need to bring out the G4 I/O, just expose enough internally on the board to give very high-speed internal board communication.
With memory then the question becomes how you get data into the board from the outside, and where you store it during computation. These are two separate problems.
Dealing with high data rates externally may mean finding a way to break down the data in such a way that it can be received and buffered to internal memory. The hardest problem for concurrent programming systems is handling a single and insanely fast serial stream of data - this represents a huge bottleneck. Using the degree of parallelism available to you involves distributing data rapidly, which makes it far simpler to handle multiple lower-bandwidth streams than a single aggregated one.
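As a minimal untested sketch of what I mean by distributing a fast stream - the names, block size and worker count are all invented for illustration:

#define NWORKERS 4
#define BLOCK    256

// Deal fixed-size blocks from one fast input channel round-robin to
// NWORKERS outgoing channels, turning one hot stream into several cooler ones.
void distributor(chanend input, chanend workers[NWORKERS]) {
    int word;
    while (1) {
        for (int w = 0; w < NWORKERS; w++) {
            for (int i = 0; i < BLOCK; i++) {
                input :> word;
                workers[w] <: word;
            }
        }
    }
}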
During computation is a different matter. Assuming you have been able to distribute the input stream sensibly and rapidly, on the current revision of the XMP-64 you already have 4MB of internal RAM, distributed between the cores. Now the question is - how much do you need to keep in local memory at any one time, how dependent are you on external memory latency, how dependent are you on external memory bandwidth? Various algorithms, such as raytracing (mentioned earlier), have extremely favourable locality properties (if you arrange the scene appropriately in memory) meaning even maintaining a relatively slow external memory interface using software caching ought to provide good (good enough?) performance.
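To make the software-caching point concrete, here is a rough, untested sketch of a direct-mapped scene cache refilled over a channel from a storage node - the sizes and the request protocol are invented:

#define LINES     64
#define LINE_SIZE 16   // words per line; sizes purely illustrative

// Read one word of scene data through a small direct-mapped cache held in
// local memory; on a miss, fetch the whole line from a remote storage node
// over the 'mem' channel.
void render_worker(chanend mem) {
    int cache[LINES][LINE_SIZE];
    int tag[LINES];
    int addr, line, base, word;

    for (int i = 0; i < LINES; i++)
        tag[i] = -1;                       // all lines start empty

    addr = 1234;                           // hypothetical scene address
    line = (addr / LINE_SIZE) % LINES;
    base = addr - (addr % LINE_SIZE);
    if (tag[line] != base) {               // miss: refill the line
        mem <: base;                       // ask the storage node for it
        for (int i = 0; i < LINE_SIZE; i++)
            mem :> cache[line][i];
        tag[line] = base;
    }
    word = cache[line][addr % LINE_SIZE];  // a hit from then on
}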
There is no doubt that for some applications having available a chip or storage node with substantially more memory and a hardware memory controller would be very useful. I'd be interested to see a list of these: both hobbyist and commercial.
With this in mind, I think it would be equally interesting to see what you definitely couldn't do with an XMP-64.
-
- Respected Member
- Posts: 296
- Joined: Thu Dec 10, 2009 10:33 pm
jonathan: Re having G's in the middle and L's around the edge. As far as I can tell the external links are not compatible between L's and G's: one has a HELLO token, the other does not.
Do you know if I'm right in saying you really cannot connect an L link to a G link?
-
- XCore Legend
- Posts: 1274
- Joined: Thu Dec 10, 2009 10:20 pm
Once again, Heater, the incompatibility between L's and G's is something I was unaware of - thanks for the heads-up!
Jonathan, I need to think more about the storage issue; your points are valid and make good counter-arguments. There is perhaps another solution:
How about XMOS building something like a 'G3' which uses fewer cores (3 or 2) and adds much more onboard memory; these then become the storage-specialised nodes. Remember that in code one can already target specific cores, so targeting special G3 nodes to perform the more memory-intensive functions is feasible. To make it work one would have to devise common idioms (with XC supporting code/libraries) that specialise in common concurrent realtime storage design patterns. I cannot yet see what this would look like in an event-driven paradigm, but I am sure something like this could work. Maybe it is just a capacity problem, whereby the nearer you get to the centre of the hypercube the greater the storage requirement becomes. In reality the benefit might come from placing several of the G3's at critical intervals throughout the hypercube.
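To make the idiom concrete, something like this (completely untested; the protocol, names and sizes are invented) could run on each imagined G3:

#define MEM_WORDS 4096   // only 16KB here; the imagined G3 would allow far more
#define REQ_READ  0
#define REQ_WRITE 1

// A storage node as a server thread: it owns a buffer and services
// read/write requests arriving over a channel.
void storage_server(chanend c) {
    int mem[MEM_WORDS];
    int op, addr;
    while (1) {
        c :> op;                 // request type
        c :> addr;               // word address
        if (op == REQ_READ)
            c <: mem[addr];      // reply with the stored word
        else
            c :> mem[addr];      // accept the word to store
    }
}

Several clients could presumably be handled by selecting over an array of channel ends rather than serving a single one.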
P.S. I am way out of my league here, having not actually used the XMP-64 or similar; I'm just going on what I have read. It is also entirely possible that the storage is just a problem with the programmer ;-)
*Update - you kind of alluded to this in your second-to-last paragraph - sorry I missed that. Doh!
-
- XCore Legend
- Posts: 1274
- Joined: Thu Dec 10, 2009 10:20 pm
I was trying to recall some of my thinking regarding concurrency and memory issues from other things I have read. One of the interesting things I looked at a few years ago was Sun's Fortress language, which actually looked to me like a language for building DSLs (I actually wrote a piece about it for some analysts). Although Fortress is designed for very different architectures to XMOS, I remember reading about how it too had issues with processes/threads and storage access. Although I cannot find the original documents I read, I did find this link which may be worth a quick glance: http://projectfortress.sun.com/Projects ... ss.1.0.pdf (PDF).
In Fortress one can specify and use special memory/thread combinations and idioms called regions, allowing design patterns to be chosen to help optimise concurrent code sections or tasks, etc.
Just thought it might be worth mentioning.
-
- Respected Member
- Posts: 363
- Joined: Thu Dec 10, 2009 10:17 pm
Well, making these "storage nodes" just have a RAM chip on them might be good enough. Since the links between them are already a bit slower, there is time to do reads/writes to an external chip. This also means you can have a huge amount of memory: using a 32-bit port you can address up to 4GB. Not that you would use that much, but a single 64MB RAM chip on those storage nodes would be very useful. The amount of RAM used in PCs has driven the price of fast RAM chips right down.
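Purely as an illustration of the port idea - a real design needs chip selects and strobe timing, and the port choices here are arbitrary examples:

#include <xs1.h>

out port p_addr = XS1_PORT_32A;   // address bus to the external RAM
in  port p_data = XS1_PORT_8A;    // data bus (read path only, for brevity)

// Drive an address and sample a byte back; all the control-signal
// timing a real SRAM needs is omitted.
unsigned char ram_read(unsigned addr) {
    unsigned char d;
    p_addr <: addr;
    p_data :> d;
    return d;
}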
In the ray-tracing app you could have those 4 storage nodes keep the whole 3D scene in memory and stream it out to their group of processing nodes. Once all the nodes are done they can each pass their little square piece to one of the storage nodes, where it gets assembled into the whole picture. Then one of the nodes that handles Ethernet asks the storage node for the image and displays it on the website. It would be really cool if it could be made to render at 10fps or even more, so the image would constantly update on the website and create a video. Or you could have it render a really complex scene, with lots of reflection and everything, in a short time (say 10 seconds). I just have a feeling that this massively parallel processing can do the difficult job of raytracing very fast, as you can calculate over 200 pixels simultaneously. That's what a good demo program for the XMP-64 should be about: using the huge number of parallel threads to do very quickly a job that would have taken a normal computer a long time.
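A rough, untested sketch of one worker in that scheme - the tile format and the request protocol are invented:

#define TILE_PIXELS 64   // e.g. an 8x8 tile, purely illustrative

// Each worker pulls a tile id from a storage node, renders it into
// local memory, and ships the finished pixels back for assembly.
void tile_worker(chanend store) {
    int tile[TILE_PIXELS];
    int id;
    while (1) {
        store <: 1;                         // ask for the next tile
        store :> id;
        if (id < 0) break;                  // no tiles left
        for (int p = 0; p < TILE_PIXELS; p++)
            tile[p] = 0;                    // ray tracing would go here
        store <: id;                        // return the finished tile
        for (int p = 0; p < TILE_PIXELS; p++)
            store <: tile[p];
    }
}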
XMOS should try experimenting with some extra memory in their XMP-64.
-
- XCore Legend
- Posts: 1274
- Joined: Thu Dec 10, 2009 10:20 pm
Hmm, there is another problem: the actual placement of the external memories on the PCB. It could pose an issue unless they perhaps went underneath the G's? That could scupper the designed interconnect convenience. Anyone familiar with the XMP-64 layout, or who has laid out their own multi-chip board, care to comment?
-
- Member++
- Posts: 26
- Joined: Sat Dec 12, 2009 6:45 am
I'm planning to buy an XMP-64 and see if it can be used (almost) as a co-processor to process data coming from our new particle tracker:
Particle detectors can be regarded as large-area cameras taking snapshots of the collisions of high-energy particles. The way the particles get distributed in your detectors provides you with signatures of interesting events. An XMP-64 would be able to take a fast decision on whether an event is interesting enough to warrant further processing/storage or whether it can be thrown away.
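A first-cut, untested sketch of the sort of trigger decision I mean - the event format here is completely made up:

#define REGIONS   16   // hit counts per detector region, invented format
#define THRESHOLD 5    // "interesting" if this many regions fired

// Keep an event only if enough detector regions saw hits;
// otherwise it is simply dropped.
void trigger(chanend detector, chanend storage) {
    int hits[REGIONS];
    int fired;
    while (1) {
        fired = 0;
        for (int r = 0; r < REGIONS; r++) {
            detector :> hits[r];
            if (hits[r] > 0)
                fired++;
        }
        if (fired >= THRESHOLD)
            for (int r = 0; r < REGIONS; r++)
                storage <: hits[r];          // forward for further processing
    }
}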
The above picture shows a rough layout of how 4 4-core processors would work together. I got as far as successfully implementing this scheme on one XC-1. I have 4 XC-1's, but it is difficult to get 2 of them to talk to each other over their XLinks; linking 4 XC-1's together is virtually impossible, and I decided to stop wasting my time trying to figure out a way of doing it. The XMP-64 was supposed to come out in December, and I was a bit disappointed that I couldn't play with it under the Christmas tree. Hopefully in January.
Gerrit
-
- Member++
- Posts: 21
- Joined: Fri Dec 11, 2009 3:42 pm
Wow, now that's a cool project idea! I was watching a blurb about Brookhaven on History or Nat Geo last night; I've always been intrigued by particle accelerators! I think this would be a great project and could help XMOS in the education/research fields... Please keep us up to date - even though it's way above my head, I'd love to follow progress on this.