Which Program Flow?

rp181 · Post by **rp181** » Sat Mar 26, 2011 11:34 pm

This is all on a 500 MIPS L1 Device.

I have 32 pairs of pixels, which I sample via an ADC. I need to perform some operations on each pair as fast as possible. Which would be better?

*Common: 1 Thread to sample all of the data. This thread has a channel for every other thread for data comms.

Method 1: Launch 4 threads that deal with 16 pixels each, so that each thread has full speed.
Method 2: Launch 32 threads all at once

Two questions:
1) Which would perform better, if at all different?
2) Is it possible to terminate threads, and then restart? I don't want the sampling thread to be slowed down while the other threads don't have anything to do.

EDIT: What is it called when each thread has set points that block, and the code is only "released", or allowed to continue, once all of the blocks are met?

segher · Post by **segher** » Sun Mar 27, 2011 1:06 am

rp181 wrote:This is all on a 500 MIPS L1 Device.

I have 32 pairs of pixels, which I sample via an ADC. I need to perform some operations on each pair as fast as possible. Which would be better?

*Common: 1 Thread to sample all of the data. This thread has a channel for every other thread for data comms.

Method 1: Launch 4 threads that deal with 16 pixels each, so that each thread has full speed.
Method 2: Launch 32 threads all at once

An L1 device has one core, which has only eight threads.

2) Is it possible to terminate threads, and then restart?

You cannot easily kill threads. You normally let threads kill themselves, instead.

I don't want the sampling thread to be slowed down while the other threads don't have anything to do.

When threads are waiting for something, they aren't scheduled (unless they are in fast mode).

EDIT: What is it called when each thread has set points that block, and the code is only "released", or allowed to continue, once all of the blocks are met?

Synchronisation.

Post by **Folknology** » Sun Mar 27, 2011 12:16 pm

You will need to use either a G4, 2*L2 or 4*L1 for that number of threads.

What is the sample rate, and what needs doing to the sample values? Can 1 thread deal with more pairs to reduce the number of threads?

regards
Al

rp181 · Post by **rp181** » Sun Mar 27, 2011 3:17 pm

Yes, I decided on doing 16 pairs a thread. The sampling rate of the ADC is probably going to be 40 million samples a second, multiplexed to the 32 pairs of sensors (64 photodiodes).

I need to go through each pair, sample 3 to 5 times to normalize the data, and do ((a-b)/(a+b)), as fast as possible. This information will be used to drive actuators. The target is 100 kHz, but faster is better.
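
As a rough check of the figures above, the update rate can be estimated from the ADC rate. This is only a sketch: it assumes every one of the 64 photodiodes is sampled n times per update and that no time is lost switching the multiplexer. The symbols f_ADC, n and f_update are introduced here for illustration.

```latex
f_{\text{update}} = \frac{f_{\text{ADC}}}{64\,n},
\qquad f_{\text{ADC}} = 40 \times 10^{6}\ \text{samples/s}

n = 3:\ f_{\text{update}} \approx 208.3\ \text{kHz}, \qquad
n = 4:\ f_{\text{update}} = 156.25\ \text{kHz}, \qquad
n = 5:\ f_{\text{update}} = 125\ \text{kHz}
```

Under those assumptions, all three sample counts leave the ADC itself above the 100 kHz target; any multiplexer settling time would reduce these figures.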

As for the operation: ((a-b)/(a+b)). This is a ratio, so it is a float. Right now, I have it pass the information to a C++ file, which can handle floats, and then I multiply by 10000000 and pass it back. Eventually, the XC file will have to use a channel to give the information to another thread. This is my current solution:

Is this a good solution? The 4 lines means a channel to each of the 4 threads in the middle column. Blocks are synchronization points.

segher · Post by **segher** » Sun Mar 27, 2011 6:38 pm

You shouldn't use floating point. It is way too slow, and depending on what exactly you
are trying to do, it might not even be exact enough.

Instead, you can use one of various kinds of scaled integer arithmetic, which, if you're
smart about it, needs no divide instructions at all, or only a few.

What are the range and accuracy of your ADC data? How many bits?
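
For readers following along, here is a minimal sketch of one scaled-integer approach to ((a-b)/(a+b)), written in plain C rather than XC. It assumes 8-bit unsigned samples; the names `scaled_ratio` and `RATIO_SCALE` are hypothetical and not from the thread. It still uses one divide per pair, so it illustrates the scaling idea only, not a divide-free version.

```c
#include <stdint.h>

/* Hypothetical scale factor: the result is the ratio multiplied by 10000.
 * With 8-bit inputs, |a - b| <= 255, so 255 * 10000 fits easily in 32 bits. */
#define RATIO_SCALE 10000

/* Returns (a - b) / (a + b) scaled by RATIO_SCALE, truncated toward zero.
 * Returns 0 when both samples are 0, to avoid dividing by zero. */
static int32_t scaled_ratio(uint8_t a, uint8_t b)
{
    int32_t diff = (int32_t)a - (int32_t)b;
    int32_t sum  = (int32_t)a + (int32_t)b;

    if (sum == 0) {
        return 0;
    }
    return (diff * RATIO_SCALE) / sum;
}
```

For example, `scaled_ratio(200, 100)` gives 3333, which stands for roughly 0.3333. Note that a scale of 10000000 with 8-bit inputs could reach 255 * 10000000, which does not fit in a signed 32-bit integer, so the multiplier needs to be smaller or the intermediate wider.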

rp181 · Post by **rp181** » Sun Mar 27, 2011 6:57 pm

I haven't nailed down the specific ADC, but it will be 8-bit, probably 0-1.5 V.

Post by **Folknology** » Sun Mar 27, 2011 8:25 pm

As segher suggests, use integer rather than floating-point math to keep it deterministic and fast enough for your application. You could probably do it with a single L2.

regards
Al

rp181 · Post by **rp181** » Sun Mar 27, 2011 8:27 pm

I actually did move up to a QFN L2 device, but 1 core was going to be devoted to USB (the output is through USB, and the XMOS needs to be the host). Do you not think it is possible with 1 core?

Post by **Folknology** » Sun Mar 27, 2011 9:01 pm

For your averaging, perhaps choose 4 samples, as this makes for a nice easy binary division: a right shift by 2 bits. Given that you only need 8-bit samples, you could accumulate into 10 bits and then shift, to get the averaging nice and fast. I am sure Segher could give you the ASM for that bit easily enough ;-)

The ((a-b)/(a+b)) will be more tricky, and my integer DSP kung fu is weak. Anyone else down here got an idea to streamline/shortcut that operation?

regards
Al

rp181 · Post by **rp181** » Sun Mar 27, 2011 9:09 pm

Wait, how exactly does bit shifting average it? I have never heard of these shortcuts...

You say bitshift right 2, bitshift the 8 or 10 bit value? And where do the other 3 numbers come in? Care to provide a simple example? :D

EDIT: OK, I think I get it. So if I have 255, 240, 214, and 235, I add all of them to get 944. The binary of this is 1110110000. 1110110000 >> 2 is 11101100, or 236. I must say, that's really cool!
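
The same worked example can be sketched in plain C. This is an illustration only, assuming 8-bit samples; the function name `average4` is hypothetical. Four 8-bit values sum to at most 1020, which is why a 10-bit accumulator is enough.

```c
#include <stdint.h>

/* Average four 8-bit samples without a divide instruction.
 * The sum is at most 4 * 255 = 1020, so it fits in 10 bits;
 * shifting right by 2 divides by 4 and discards the remainder. */
static uint8_t average4(uint8_t s0, uint8_t s1, uint8_t s2, uint8_t s3)
{
    uint16_t sum = (uint16_t)(s0 + s1 + s2 + s3);
    return (uint8_t)(sum >> 2);
}
```

Calling `average4(255, 240, 214, 235)` returns 236, matching the numbers above. The shift rounds down; adding 2 to the sum before shifting would round to the nearest value instead.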
