Which Program Flow?

Technical questions regarding the XTC tools and programming with XMOS.
Folknology
XCore Legend
Posts: 1274
Joined: Thu Dec 10, 2009 10:20 pm

Post by Folknology »

Yes, you normally accumulate the samples in a register and then shift, for maximum performance.

Regarding the L2 USB usage question, you could use the USB core for some other operations, which may give you more room for manoeuvre. See the following from the XUD library docs:
Due to I/O requirements the library requires a guaranteed MIPS rate to ensure correct
operation. This means that thread count restrictions are in place that depend
on the speed of the device. The USB thread must run at at least 80 MIPS, and the
threads that communicate with the USB thread must have a guaranteed 80 MIPS.
This means that for an XS1 running at 400MHz there should be no more than five
threads executing at any one time that USB is being used. For a 500MHz device no
more than 6 threads shall execute at any one time.
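
The 5 and 6 thread limits quoted above fall out of dividing the core clock by the required 80 MIPS. A rough sketch in C, assuming (the quote does not say this) that the core shares its cycles evenly among active threads and that no single thread gets more than a quarter of them; the function names are made up for illustration:

```c
#include <assert.h>

#define XUD_MIN_MIPS   80u  /* guaranteed rate required by the XUD docs */
#define PIPELINE_SLOTS  4u  /* assumption: a thread gets at most 1/4 of the cycles */

/* Assumed per-thread rate: core clock shared evenly, capped at clock / 4. */
unsigned mips_per_thread(unsigned core_mhz, unsigned n_threads)
{
    assert(n_threads > 0);
    unsigned divisor = n_threads < PIPELINE_SLOTS ? PIPELINE_SLOTS : n_threads;
    return core_mhz / divisor;
}

/* Largest thread count that still leaves every thread at least 80 MIPS. */
unsigned max_threads_for_usb(unsigned core_mhz)
{
    if (core_mhz / PIPELINE_SLOTS < XUD_MIN_MIPS)
        return 0;                       /* even one thread would be too slow */
    return core_mhz / XUD_MIN_MIPS;     /* integer division rounds down */
}
```

With these assumptions a 400MHz part allows 5 threads and a 500MHz part allows 6, matching the figures in the docs.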
regards
Al


rp181
Respected Member
Posts: 395
Joined: Tue May 18, 2010 12:25 am

Post by rp181 »

Do I use a logical shift or an arithmetic shift (>> or >>>)?
Folknology
XCore Legend
Posts: 1274
Joined: Thu Dec 10, 2009 10:20 pm

Post by Folknology »

I would use unsigned with logical shifts (SHR), unless you have a reason for using signed.
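
As a minimal sketch of what that looks like in plain C (XC should be much the same, but treat that as an assumption): C has no >>> operator, and >> on an unsigned value is a logical shift. The function name and the four-sample count are just for illustration:

```c
#include <stdint.h>

/* Average four 8-bit samples: accumulate in one unsigned word, then shift. */
uint32_t average4(const uint8_t samples[4])
{
    uint32_t acc = 0;           /* the compiler will normally keep this in a register */
    for (int i = 0; i < 4; i++)
        acc += samples[i];      /* at most 4 * 255 = 1020, so no overflow */
    return acc >> 2;            /* logical shift: divide by 4, rounding down */
}
```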

Your next problem might be not having enough registers!

regards
Al
rp181
Respected Member
Posts: 395
Joined: Tue May 18, 2010 12:25 am

Post by rp181 »

What do you mean by not enough registers? All of this low-level stuff is new to me.
Folknology
XCore Legend
Posts: 1274
Joined: Thu Dec 10, 2009 10:20 pm

Post by Folknology »

Well, normally you would accumulate and shift directly in registers without memory access, but given the number of accumulators you need per thread that isn't directly possible. So basically you have to compromise on the ideal somewhat, either by multiplexing the pixels in blocks or by using memory rather than registers for accumulators.

To be honest, ASM isn't my forte; segher is the man for that, or one of the XMOS folk perhaps. You could try putting it together in XC and running it through the timing analysers to see how fast it can be optimised, and perhaps go from there, moving to ASM as you need to.

regards
Al
rp181
Respected Member
Posts: 395
Joined: Tue May 18, 2010 12:25 am

Post by rp181 »

Is register access an ASM-only thing? I don't see why so many registers would be needed. As it stands now, it would only be averaging 1 set of 4 points at a time, at the same time as collection. Processing is what occurs in parallel, and as far as I can see, it is simply doing (a-b)/(a+b) on a bunch of data. These threads never see the un-normalized data.

The code is slowly coming out :)
Folknology
XCore Legend
Posts: 1274
Joined: Thu Dec 10, 2009 10:20 pm

Post by Folknology »

Sorry, I am not privy to your exact application's operation and therefore can't understand the exact sampling requirements, but let me take a general example to explain my point (which may or may not be relevant):

Say I had 16 pixel values to sample and average. I would normally instruct the multiplexer to give them to me in order, starting at pixel 0, then 1, 2, 3... until F. This inputting of data would then cycle, so to average these I would need to keep 16 accumulators. This would also give me even sampling across the array of pixels.

But maybe (in your app) you are expecting to sample each pixel 4 times before moving to the next pixel, thus the data you receive would be {0.0,0.1,0.2,0.3},{1.0,1.1,1.2,1.3}...{F.0,F.1,F.2,F.3}. Then you only need a single accumulator, because you are averaging each pixel before moving on to the next. This gives a somewhat more lumpy sample across the pixel array because you are no longer averaging across the entire array over time (but this might be OK if this is your app's aim).

Alternatively you could block or interlace the samples, which is a halfway house between the two approaches above. Maybe I average all the odd samples first, then the even samples; this effectively halves the accumulators required. Or perhaps I conjure up some sort of hybrid using blocks and/or interlacing of averaging etc.
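
To make the accumulator count concrete, here is a rough C sketch of the first two orderings. The function names, the flat `stream` buffer and the fixed 16 x 4 layout are assumptions for illustration, not anything taken from the actual project:

```c
#include <stdint.h>

#define NUM_PIXELS        16
#define SAMPLES_PER_PIXEL  4

/* Interleaved order (0,1,..,F,0,1,..,F,...): one accumulator per pixel. */
void average_interleaved(const uint8_t *stream, uint32_t avg[NUM_PIXELS])
{
    uint32_t acc[NUM_PIXELS] = {0};
    for (int pass = 0; pass < SAMPLES_PER_PIXEL; pass++)
        for (int p = 0; p < NUM_PIXELS; p++)
            acc[p] += stream[pass * NUM_PIXELS + p];
    for (int p = 0; p < NUM_PIXELS; p++)
        avg[p] = acc[p] >> 2;           /* divide by 4 */
}

/* Grouped order (0.0,0.1,0.2,0.3,1.0,...): a single accumulator is reused. */
void average_grouped(const uint8_t *stream, uint32_t avg[NUM_PIXELS])
{
    for (int p = 0; p < NUM_PIXELS; p++) {
        uint32_t acc = 0;
        for (int s = 0; s < SAMPLES_PER_PIXEL; s++)
            acc += stream[p * SAMPLES_PER_PIXEL + s];
        avg[p] = acc >> 2;              /* divide by 4 */
    }
}
```

The first version needs 16 live accumulators, which is more than will fit in registers; the second needs only one.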

Hopefully this explains what I meant :?

regards
Al
rp181
Respected Member
Posts: 395
Joined: Tue May 18, 2010 12:25 am

Post by rp181 »

Yes, in my application, it should be fine to average and then move on (it may change...).

So I have basic code, and as far as I can tell (never used the timer before, and this is a simulator run), it is way too slow - and this is with substituting reading the actual ADC with return 211;, and not driving the actuators. The XMOS project is here:

http://ge.tt/4N3RsdK

main:
processorThread:
ADC:
segher
XCore Expert
Posts: 844
Joined: Sun Jul 11, 2010 1:31 am

Post by segher »

Okay. So you have 8-bit a and b, and need to do something with a quantity (a-b)/(a+b).
The a and b are actually averaged over 4 samples.

First note that you do not need to average, you can just as well just take the sum over
those 4 samples; that will become a 10-bit number, still small enough for what we will
do, and no overflow of course. All of a, b, a-b, a+b will be four times too big, but
that magically disappears in the (a-b)/(a+b) formula.
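
Written out, with A and B as the four-sample sums and \bar{a}, \bar{b} as the averages (notation introduced here just for this step):

```latex
\[
A = \sum_{i=1}^{4} a_i = 4\bar{a}, \qquad B = \sum_{i=1}^{4} b_i = 4\bar{b}
\]
\[
\frac{A - B}{A + B}
  = \frac{4\bar{a} - 4\bar{b}}{4\bar{a} + 4\bar{b}}
  = \frac{\bar{a} - \bar{b}}{\bar{a} + \bar{b}}
\]
```

This holds for exact averages; it also means the sums keep the two low bits that an integer divide-by-four would throw away.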

How now to proceed depends on what you want to _do_ with that quantity. You can
compute a scaled integer for it, by multiplying a-b with some biggish scalefactor, make
sure it cannot overflow, and then dividing by a+b; the result will be scaled by that
scalefactor. The division takes quite a few cycles though, you don't want to do many
divisions.
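
A minimal sketch of the scaled-integer version in C. The scale factor of 1024, the function name and the handling of a+b == 0 are all assumptions picked for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define SCALE 1024   /* hypothetical scale factor */

/* sum_a, sum_b: sums of four 8-bit samples, so each is at most 1020.
   Returns SCALE * (a-b)/(a+b), truncated toward zero. */
int32_t scaled_ratio(uint32_t sum_a, uint32_t sum_b)
{
    assert(sum_a <= 1020 && sum_b <= 1020);
    int32_t num = (int32_t)sum_a - (int32_t)sum_b;   /* -1020 .. 1020 */
    int32_t den = (int32_t)(sum_a + sum_b);          /* 0 .. 2040 */
    if (den == 0)
        return 0;                    /* assumption: treat 0/0 as 0 */
    return (num * SCALE) / den;      /* |num * SCALE| <= 1044480, no overflow */
}
```

The result lies between -1024 and 1024, so 512 stands for a ratio of 0.5.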

Another option is to use a table for the division, since you only have 2^11 entries.
Call the scalefactor M, then the table entry j contains M/j. This gives a bit more
rounding error and uses 8kB of memory, of course (you can trade more rounding
error for smaller table size, if you wish).
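
A rough C sketch of the table approach, assuming 10-bit sums as above. The scale factor of 65536, the names, and the choice to map a+b == 0 to zero are illustrative assumptions:

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_SIZE 2048          /* 2^11 possible values of a+b */
#define SCALE      (1 << 16)     /* hypothetical scale factor M */

static int32_t recip[TABLE_SIZE];    /* 2048 * 4 bytes = 8 kB */

/* Fill the table once at start-up: entry j holds M/j (truncated). */
void init_recip(void)
{
    recip[0] = 0;                /* assumption: a+b == 0 gives a result of 0 */
    for (int j = 1; j < TABLE_SIZE; j++)
        recip[j] = SCALE / j;
}

/* Approximates M * (a-b)/(a+b) with one multiply instead of a divide. */
int32_t ratio_from_table(uint32_t sum_a, uint32_t sum_b)
{
    assert(sum_a <= 1020 && sum_b <= 1020);      /* keeps the index in range */
    int32_t num = (int32_t)sum_a - (int32_t)sum_b;
    return num * recip[sum_a + sum_b];           /* |result| never exceeds M */
}
```

The rounding error is visible here: for a=1020, b=0 this gives 65280 rather than the exact 65536, because 65536/1020 is truncated to 64 in the table.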

But, do you need the division at all? If e.g. all you do is comparing two of those
quantities, you can use

(a-b)/(a+b) < (c-d)/(c+d)
<=>
(a-b)(c+d) < (a+b)(c-d)

(assuming a+b and c+d are positive; have to play a bit with signs otherwise).
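
In C that comparison could look like the sketch below, where a, b and c, d are two separate pairs of four-sample sums (each at most 1020). The function name is made up for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when (a-b)/(a+b) < (c-d)/(c+d), computed without any division.
   Only valid when a+b > 0 and c+d > 0. */
bool ratio_less(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    assert(a + b > 0 && c + d > 0);
    /* With 10-bit inputs each product is at most 1020 * 2040, well within 32 bits. */
    int32_t lhs = ((int32_t)a - (int32_t)b) * (int32_t)(c + d);
    int32_t rhs = (int32_t)(a + b) * ((int32_t)c - (int32_t)d);
    return lhs < rhs;
}
```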

So, what do you need to do with the quantities?


Oh, btw: I wouldn't write this in machine code, higher level languages are just fine
for this kind of stuff. Only when it is too slow and you need that extra little bit
of speed should you use asm (or if you are doing something really weird, of course).
Algorithmic improvements help WAY more. The main trick for making code faster
is to do less work.
rp181
Respected Member
Posts: 395
Joined: Tue May 18, 2010 12:25 am

Post by rp181 »

But, do you need the division at all? If e.g. all you do is comparing two of those
quantities, you can use

(a-b)/(a+b) < (c-d)/(c+d)
<=>
(a-b)(c+d) < (a+b)(c-d)

(assuming a+b and c+d are positive; have to play a bit with signs otherwise).

So, what do you need to do with the quantities?

First off, thanks for such an in-depth reply, always appreciated!

What are the values c and d? And is '<' actually an operation? I don't know how much I can talk about the application (I'll find out), but basically, the pair of values will coordinate with a target value, which is ideally zero, but in reality will be slightly off. Say (a-b)/(a+b) gives 1 in the first case, and 0.5 in the second case. I use the difference between the calculated and target value to drive an actuator, in a direction that would bring the calculated value closer to the target.

EDIT:
I don't know if I am doing this right. If I use the timer to get a time difference of 12458 for 1 cycle, how do I convert this into time? I am doing:

12458 * 10 ns = 124580 ns
WolframAlpha says this is ~8 kHz (http://www.wolframalpha.com/input/?i=124580+ns)
Is this right? If it is, this is way too slow - even without doing "real" stuff. I can't even find out where this time is coming from.
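
The same conversion as a small C sketch, assuming one timer tick is 10 ns as in the calculation above (that is, a 100 MHz reference clock); the helper names are made up:

```c
#include <assert.h>
#include <stdint.h>

#define TICK_NS 10u   /* assumption: one timer tick = 10 ns */

/* Elapsed time in nanoseconds (overflows above roughly 4.29 seconds). */
uint32_t ticks_to_ns(uint32_t ticks)
{
    return ticks * TICK_NS;
}

/* Loop rate in Hz for a loop taking `ticks` timer ticks, rounded down. */
uint32_t ticks_to_hz(uint32_t ticks)
{
    assert(ticks > 0);
    return 1000000000u / ticks_to_ns(ticks);
}
```

For 12458 ticks this gives 124580 ns and 8026 Hz, which agrees with the ~8 kHz figure.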