Which Program Flow?

rp181
Respected Member
Posts: 395
Joined: Tue May 18, 2010 12:25 am

Post by rp181 »

This is all on a 500 MIPS L1 Device.

I have 32 pairs of pixels, which I sample via an ADC. I need to perform some operations on each pair as fast as possible. Which would be better?

*Common: 1 Thread to sample all of the data. This thread has a channel for every other thread for data comms.

Method 1: Launch 4 threads that deal with 16 pixels each, so that each thread has full speed.
Method 2: Launch 32 threads all at once

Two questions:
1) Which would perform better, if at all different?
2) Is it possible to terminate threads, and then restart? I don't want the sampling thread to be slowed down while the other threads don't have anything to do.

EDIT: What's it called when you have set points in each thread that block, and only "release", or allow the code to continue, once all of the blocks are met?


segher
XCore Expert
Posts: 844
Joined: Sun Jul 11, 2010 1:31 am

Post by segher »

rp181 wrote: This is all on a 500 MIPS L1 Device.

I have 32 pairs of pixels, which I sample via an ADC. I need to perform some operations on each pair as fast as possible. Which would be better?

*Common: 1 Thread to sample all of the data. This thread has a channel for every other thread for data comms.

Method 1: Launch 4 threads that deal with 16 pixels each, so that each thread has full speed.
Method 2: Launch 32 threads all at once

An L1 device has one core, which has only eight threads.

rp181 wrote: 2) Is it possible to terminate threads, and then restart?

You cannot easily kill threads. You normally let threads kill themselves, instead.

rp181 wrote: I don't want the sampling thread to be slowed down while the other threads don't have anything to do.

When threads are waiting for something, they aren't scheduled (unless they are in fast mode).

rp181 wrote: EDIT: What's it called when you have set points in each thread that block, and only "release", or allow the code to continue, once all of the blocks are met?

Synchronisation.
Folknology
XCore Legend
Posts: 1274
Joined: Thu Dec 10, 2009 10:20 pm

Post by Folknology »

You will need to use either a G4, 2*L2 or 4*L1 for that number of threads.

What is the sample rate, and what needs doing to the sample values? Can one thread deal with more pairs to reduce the number of threads?

regards
Al
rp181
Respected Member
Posts: 395
Joined: Tue May 18, 2010 12:25 am

Post by rp181 »

Yes, I decided on doing 16 pairs a thread. The sampling rate of the ADC is probably going to be 40 million samples a second, multiplexed to the 32 pairs of sensors (64 photodiodes).

I need to go through each pair, sample 3 to 5 times to normalize the data, and do ((a-b)/(a+b)), as fast as possible. This information will be used to drive actuators. The target is 100 kHz, but faster is better.

As for the operation: ((a-b)/(a+b)). This is a ratio, so it is a float. Right now, I pass the information to a C++ file, which can handle floats, multiply the result by 10000000, and pass it back. Eventually, the XC file will have to use a channel to give the information to another thread. This is my current solution:



Is this a good solution? The 4 lines means a channel to each of the 4 threads in the middle column. Blocks are synchronization points.
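
The numbers in this post allow a quick check of the 100 kHz target. The following is a minimal sketch in C, not code from the thread: the function name scans_per_second is hypothetical, and it assumes the ADC converts back to back with no multiplexer settling time between channels.

```c
#include <stdint.h>

/* Complete scans of all channels per second, given the ADC sample rate,
   the number of multiplexed channels, and the number of readings taken
   per channel for averaging. Assumes no dead time between conversions. */
static uint32_t scans_per_second(uint32_t adc_rate_hz,
                                 uint32_t channels,
                                 uint32_t samples_per_channel)
{
    return adc_rate_hz / (channels * samples_per_channel);
}
```

With 40,000,000 samples per second and 64 photodiodes, this gives 208333 scans per second at 3 readings each, 156250 at 4, and 125000 at 5, so under these assumptions the ADC itself leaves some margin above 100 kHz.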
segher
XCore Expert
Posts: 844
Joined: Sun Jul 11, 2010 1:31 am

Post by segher »

You shouldn't use floating point. It is way too slow, and depending on what exactly you
are trying to do, it might not even be exact enough.

Instead, you can use one of various kinds of scaled-integer arithmetic, which, if you're
smart about it, needs no divide instructions at all, or only a few.

What are the range and accuracy of your ADC data? How many bits?
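
One possible form of the scaled-integer idea, as a minimal sketch in C rather than segher's actual method: the ratio is returned multiplied by a power of two instead of as a float. The names ratio_q15 and RATIO_SHIFT are hypothetical, the 2^15 scale is an arbitrary choice, and 8-bit unsigned samples are assumed. This version still uses one integer divide per pair.

```c
#include <stdint.h>

#define RATIO_SHIFT 15  /* assumed scale: 1.0 is represented as 32768 */

/* Returns (a - b) / (a + b) multiplied by 2^15, using integer math only.
   a and b are 8-bit readings; the result lies in [-32768, 32768].
   Integer division truncates toward zero. */
static int32_t ratio_q15(uint8_t a, uint8_t b)
{
    int32_t diff = (int32_t)a - (int32_t)b;
    int32_t sum  = (int32_t)a + (int32_t)b;

    if (sum == 0) {
        return 0;  /* both readings zero: ratio undefined, 0 chosen by convention */
    }
    /* Multiply rather than shift, so negative differences are well defined. */
    return (diff * (1 << RATIO_SHIFT)) / sum;
}
```

For example, a = 3 and b = 1 give 16384, which is 0.5 at this scale. The largest intermediate value is 255 * 32768, which fits comfortably in 32 bits.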
rp181
Respected Member
Posts: 395
Joined: Tue May 18, 2010 12:25 am

Post by rp181 »

I haven't nailed down the specific ADC, but it will be 8-bit, probably 0-1.5 V.
Folknology
XCore Legend
Posts: 1274
Joined: Thu Dec 10, 2009 10:20 pm

Post by Folknology »

As segher suggests, use integer rather than floating-point math to keep it deterministic and fast enough for your application. You could probably do it with a single L2.

regards
Al
rp181
Respected Member
Posts: 395
Joined: Tue May 18, 2010 12:25 am

Post by rp181 »

I actually did move up to a QFN L2 device, but 1 core was going to be devoted to USB (the output is through USB, and the XMOS needs to be the host). Do you not think it is possible with 1 core?
Folknology
XCore Legend
Posts: 1274
Joined: Thu Dec 10, 2009 10:20 pm

Post by Folknology »

For your averaging, perhaps choose 4 samples, as this gives a nice easy binary division: a right shift by 2 bits. Given that you only need 8-bit samples, you could use a 10-bit accumulator and then shift to get the averaging nice and fast. I am sure Segher could give you the ASM for that bit easily enough ;-)

The ((a-b)/(a+b)) will be more tricky, and my integer DSP kung fu is weak. Has anyone else down here got an idea to streamline/shortcut that operation?


regards
Al
rp181
Respected Member
Posts: 395
Joined: Tue May 18, 2010 12:25 am

Post by rp181 »

Wait, how exactly does bit shifting average it? I have never heard of these shortcuts...

You say to bitshift right by 2: is that the 8-bit or the 10-bit value? And where do the other 3 numbers come in? Care to provide a simple example? :D

EDIT: OK, I think I get it. So if I have 255, 240, 214, and 235, I add all of them to get 944. The binary of this is 1110110000. 1110110000 >> 2 is 11101100, or 236. I must say, that's really cool!
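
The worked example above can be written as a short C sketch. The function name average4 is hypothetical; the code assumes 8-bit samples, so the sum of four never exceeds 1020 and fits in 10 bits.

```c
#include <stdint.h>

/* Average four 8-bit samples. The sum needs at most 10 bits
   (4 * 255 = 1020), and dividing by 4 is a right shift by 2.
   The result is truncated, not rounded. */
static uint8_t average4(const uint8_t s[4])
{
    uint16_t sum = (uint16_t)(s[0] + s[1] + s[2] + s[3]);
    return (uint8_t)(sum >> 2);
}
```

Calling average4 on the samples 255, 240, 214 and 235 returns 236, matching the hand calculation. The shift only works because 4 is a power of two; averaging 3 or 5 samples would need a real division or a multiply by a scaled reciprocal.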