This is all on a 500 MIPS L1 device.
I have 32 pairs of pixels, which I sample via an ADC. I need to perform some operations on each pair as fast as possible. Which would be better?
*Common: 1 thread samples all of the data. This thread has a channel to every other thread for data comms.
Method 1: Launch 4 threads that deal with 16 pixels each, so that each thread has full speed.
Method 2: Launch 32 threads all at once.
Two questions:
1) Which would perform better, if at all different?
2) Is it possible to terminate threads, and then restart? I don't want the sampling thread to be slowed down while the other threads don't have anything to do.
EDIT: What's it called when you have set points in each thread that are blocking, and the code is only "released" (allowed to continue) once all of the blocks are met?
Which Program Flow?
-
- Respected Member
- Posts: 395
- Joined: Tue May 18, 2010 12:25 am
-
- XCore Expert
- Posts: 844
- Joined: Sun Jul 11, 2010 1:31 am
rp181 wrote:
> This is all on a 500 MIPS L1 Device.
> I have 32 pairs of pixels, which I sample via an ADC. I need to perform some operations on each pair as fast as possible. Which would be better?
> *Common: 1 Thread to sample all of the data. This thread has a channel for every other thread for data comms.
> Method 1: Launch 4 threads that deal with 16 pixels each, so that each thread has full speed.
> Method 2: Launch 32 threads all at once
An L1 device has one core, which has only eight threads.
> 2) Is it possible to terminate threads, and then restart?
You cannot easily kill threads. You normally let threads kill themselves, instead.
> I don't want the sampling thread to be slowed down while the other threads don't have anything to do.
When threads are waiting for something, they aren't scheduled (unless they are in fast mode).
> EDIT: Whats it called when you have set points in each thread that are blocking, and only "release", or allow the code to continue, until all of the blocks are met?
Synchronisation.
-
- XCore Legend
- Posts: 1274
- Joined: Thu Dec 10, 2009 10:20 pm
You will need to use either a G4, 2*L2 or 4*L1 for that number of threads.
What is the sample rate, and what needs doing to the sample values? Can one thread deal with more pairs to reduce the number of threads?
regards
Al
-
- Respected Member
- Posts: 395
- Joined: Tue May 18, 2010 12:25 am
Yes, I decided on doing 16 pairs per thread. The ADC's sampling rate will probably be 40 million samples per second, multiplexed across the 32 pairs of sensors (64 photodiodes).
I need to go through each pair, sample it 3 to 5 times to normalize the data, and compute ((a-b)/(a+b)) as fast as possible. The results will be used to drive actuators. The target update rate is 100 kHz, but faster is better.
As for the operation ((a-b)/(a+b)): this is a ratio, so it is a float. Right now, I pass the values to a C++ file, which can handle floats, multiply the result by 10000000 and pass it back. Eventually, the XC file will have to use a channel to give the result to another thread. This is my current solution:
Is this a good solution? The four lines represent a channel to each of the four threads in the middle column, and the blocks are synchronisation points.
-
- XCore Expert
- Posts: 844
- Joined: Sun Jul 11, 2010 1:31 am
You shouldn't use floating point. It is far too slow, and depending on what exactly you are trying to do, it might not even be accurate enough.
Instead, you can use one of various kinds of scaled integer, which, if you're smart about it, you can do with few or no divide instructions.
What are the range and accuracy of your ADC data? How many bits?
-
- Respected Member
- Posts: 395
- Joined: Tue May 18, 2010 12:25 am
I haven't nailed down the specific ADC, but it will be 8-bit, probably 0 to 1.5 V.
-
- XCore Legend
- Posts: 1274
- Joined: Thu Dec 10, 2009 10:20 pm
As segher suggests, use integer rather than floating-point math to keep it deterministic and fast enough for your application; you could probably do it with a single L2.
regards
Al
-
- Respected Member
- Posts: 395
- Joined: Tue May 18, 2010 12:25 am
I actually did move up to a QFN L2 device, but one core was going to be devoted to USB (the output goes through USB, and the XMOS needs to be the host). Do you not think it is possible with one core?
-
- XCore Legend
- Posts: 1274
- Joined: Thu Dec 10, 2009 10:20 pm
For your averaging, perhaps choose 4 samples, as dividing by 4 is a nice easy binary division: a right shift by 2 bits. Given that you only need 8-bit samples, you could accumulate into a 10-bit sum and then shift to get the averaging nice and fast. I am sure Segher could give you the ASM for that bit easily enough ;-)
The ((a-b)/(a+b)) will be more tricky, and my integer DSP kung fu is weak. Has anyone else down here got an idea to streamline or shortcut that operation?
regards
Al
-
- Respected Member
- Posts: 395
- Joined: Tue May 18, 2010 12:25 am
Wait, how exactly does bit shifting average it? I have never heard of these shortcuts...
You say bit-shift right by 2, but do I shift the 8-bit or the 10-bit value? And where do the other 3 numbers come in? Care to provide a simple example? :D
EDIT: OK, I think I get it. If I have 255, 240, 214 and 235, I add them all to get 944, which is 1110110000 in binary. 1110110000 >> 2 is 11101100, or 236. I must say, that's really cool!