Which Program Flow?

segher
XCore Expert
Posts: 844
Joined: Sun Jul 11, 2010 1:31 am

Post by segher »

The c,d in my example is just another pair like your a,b: for example, your target value (as a fraction).

My point was that you can perhaps keep the fractions as fractions, i.e. as a pair of numbers,
numerator and denominator, and never compute the fraction as a scaled integer or floating
point or whatever.
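
In code, the idea is something like this (just a sketch, with made-up names):

Code: Select all

// Compare the fraction a/b against a target fraction c/d without dividing:
// for positive denominators, a/b < c/d exactly when a*d < c*b.
static int ratioLessThan(int a, int b, int c, int d) {
	// assumes b > 0 and d > 0, and that the products fit in an int
	return a * d < c * b;
}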

If your timer clock is 100MHz (the default), then 12k timer ticks per sample works out to
about 8kHz (100e6 / 12000 ≈ 8.3kHz), yes. What did that code do? 12k cycles is a lot. Did it use floating point?


rp181
Respected Member
Posts: 395
Joined: Tue May 18, 2010 12:25 am

Post by rp181 »

Ah, I didn't think of keeping it as a fraction. I will see if that works.

I put a timer before the code that "collected" the information, and another after it transmitted. All of the code is in my previous post. Here is the timer:



Now that I think about it, could all of that time be because of synchronization issues? I am going to add some more timers to the actual processing thread and see...

EDIT: I added some more timing marks, and it turns out that a huge chunk (~9000 of the 12000 cycles) comes from the collectData routine. I really don't see how, as it isn't doing much at all.
segher
XCore Expert
Posts: 844
Joined: Sun Jul 11, 2010 1:31 am

Post by segher »

Did you perhaps accidentally compile without optimisation?

Ideally, that code should be almost completely optimised away; you probably will need
to declare the local functions as "static" for that to work though.

It cannot be a synchronisation issue since the code you mention being slow does not
contain any I/O or channel comm or such.
rp181
Respected Member
Posts: 395
Joined: Tue May 18, 2010 12:25 am

Post by rp181 »

Optimization was on, but I upped it from -O2 to -O3 (most). Making the functions static didn't do much. However, I did build as a release instead of debug. Now the timer difference is 2671, which is about 37 kHz. Much better, but still way too slow. Here is the timing output:

Code: Select all

Process Thread: 1317, 67 	 1384
Process Thread: 1754, 67 	 1821
Process Thread: 2212, 70 	 2282
Process Thread: 2683, 70 	 2753
ADC Time: 849,450,443,458,471 	 2671
For the Process Thread the columns are: time to receive data, time to process data, total.
For the ADC: time to collect data, time to send to process thread 1, thread 2, thread 3, thread 4, total.

Shouldn't the time of sending match the time of receiving?

Processor Code:
ADC Code:

EDIT: I changed the channels to streaming channels; it is now 75 kHz.
Folknology
XCore Legend
Posts: 1274
Joined: Thu Dec 10, 2009 10:20 pm

Post by Folknology »

Does it speed up if you turn processPair into a macro?
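
Something along these lines (a sketch, assuming processPair is still the expression from your earlier post):

Code: Select all

// Macro version of processPair: the preprocessor pastes the expression
// in place, so there is no call overhead at all.
#define PROCESS_PAIR(a, b)  (1000000 * ((a) - (b)) / 1000000 * ((a) + (b)))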

I would also be interested to see whether using transactions would change the performance; this application would certainly lend itself to them.

regards
Al
Folknology
XCore Legend
Posts: 1274
Joined: Thu Dec 10, 2009 10:20 pm

Post by Folknology »

OK, I have now got my caffeine levels up to normal and have had a good look at your code. Here are a few points:

1) If you were to assume perfect concurrency in this application, then it shouldn't be 75 kHz, it should be 4 * 75 kHz, or 300 kHz, because you are processing in parallel; after all, that's what the XS1 is about. Clearly this is an "ideal" and not realistic: inefficiencies in the real world will make it less than 4 times faster.

2) Currently your example is not running the threads in parallel; rather, it's running them in sequence. I'm referring here to the running of ProcessorThread.

3) The reason this matters for your simulation and timings is that ProcessorThread uses a division, which not only takes multiple cycles but also relies on a common resource shared amongst the threads. That is, timing a single thread is not an accurate indicator for several threads, because of resource contention, in this case on the division unit.

4) The other problem is the assumption about how the data comes in, which may make things even more complex. At the moment you are assuming an array of data, as if it were the last array collected from the ADC; in reality the data is coming at you continually, so there are possible optimisations to lower the latency if that is important to the application (it would translate into reaction/response time).

Try restructuring it to take account of above and see what the results look like.

regards
Al
rp181
Respected Member
Posts: 395
Joined: Tue May 18, 2010 12:25 am

Post by rp181 »

Ok, so I decided on an isolated approach: eliminate as much inter-thread communication as possible. As it stands right now, the 64 sensors will be split into sets of 16, each set connected to one of 4 separate ADCs. Each of these ADCs will be connected to different pins, so that different threads can access them. Now every single thread has its own ADC and, as a result, does not need to rely on the other threads (except for the common USB output, which is on another core). This offers:
1) Better reaction time, I believe. A thread can sample and then immediately process, without worrying about handing the data to another thread. Rather than storing it in an array, I may just sample-process-sample-process. This might be faster/better for the application.
2) Scalability. It shouldn't be hard to just add an L1 and double the capability (e.g., twice as fast, or twice as many detectors/actuators).

I tried this out, with 16 points per thread, and I am now getting ~480 kHz. (Again, I only measured 1 thread, but if each quadrant is updated at 480 kHz, then the whole array would also update at 480 kHz, right? Or, because it is a round-robin sequencer, would it actually be 4 times that?)

If I didn't mention it earlier, I switched to an L2. One core will have these 4 processing threads, and one core will be responsible for controlling the actuators and will have a USB input (the XMOS will be a USB host).

Some Questions:
1) Do you say it is running in sequence because that is how the earlier version transmitted the data? Or am I using the par{} statement wrong?
2) Is the division unit the only shared resource in this case? Does this really have a significant impact?
3) What do you mean by creating a macro?

Here is the current code (all squished into 1 file; quadrant is 1, 2, 3, or 4):

Code: Select all

#include <stdio.h>

static void collectData(unsigned char data[16], int quadrant); //fill an array with new data
static unsigned char readADC(int ADC_Port); //read a single value
static unsigned char readNormalizedADC(int ADC_Port); //read an averaged value
static int processPair(int a, int b);

void ProcessorThread(streaming chanend usbOut, char quadrant) {
	unsigned char currentData[16];
	short counter1;
	short ret;

	timer t;
	unsigned long time;
	unsigned long time2;
	unsigned long time3;
	unsigned long time4;
	unsigned long time5;


	short cycles;
	t :> time;
	for (cycles = 0; cycles < 10; cycles++) {
		t :> time2;
		collectData(currentData, quadrant);
		t :> time3;
		for (counter1 = 0; counter1 < 14; counter1 += 2) {
			ret = processPair(currentData[counter1], currentData[counter1+1]);
		}
		t :> time4;
	}
	t :> time5;

	printf("Time Quadrant %i: %ld,%ld,%ld\n",(int)quadrant,((time5-time)), (time3-time2), (time4-time3));
}

static void collectData(unsigned char data[16], int quadrant) {
	short counter;
	short start = (quadrant-1)*16;
	for (counter = start; counter < quadrant*16; counter++) {
		data[counter-start] = readNormalizedADC(counter);
	}
}

static unsigned char readNormalizedADC(int ADC_Port) {
	short sum = 0;
	sum += readADC(ADC_Port);
	sum += readADC(ADC_Port);
	sum += readADC(ADC_Port);
	sum += readADC(ADC_Port);
	return (sum >> 2);
}

static unsigned char readADC(int ADC_Port) {
	return 1;
}
static int processPair(int a, int b) {
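	// Note: * and / associate left to right, so for 8-bit ADC values this
	// works out to exactly (a - b) * (a + b); the 1000000 factors cancel.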
	return (1000000 * (a - b) / 1000000 * (a + b));
}
In readNormalizedADC, I got rid of the for loop and just copy-pasted the read 4 times. This made it ~900 ns faster. Taking away the division, however (just returning 1), showed very little decrease in time. I forgot the numbers, but I will try again.

EDIT: BTW, thanks for looking through my code :D

EDIT: I just tried processing the values directly, without storing them, and the speed is now 927.6 kHz! Now I am just hoping I am calculating the frequency right =p. New segment:

Code: Select all

for (counter1 = 0; counter1 < 14; counter1 += 2) {
	ret = processPair(readNormalizedADC(counter1), readNormalizedADC(counter1+1));
	//ret = processPair(currentData[counter1],currentData[counter1+1]);
}
Folknology
XCore Legend
Posts: 1274
Joined: Thu Dec 10, 2009 10:20 pm

Post by Folknology »

Going with 4 ADCs makes it much easier from a concurrency POV, which is good news.

To test your stuff concurrently, just use a replicator:

Code: Select all

streaming chan c[4];
par{
  on stdcore [0] : usbThread(c);
  par (int i = 0; i < 4; i++)
    on stdcore [1] : ProcessorThread(c[i], i + 1);
}

Obviously change the quadrant char to an int.
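
i.e. the prototype would become something like:

Code: Select all

void ProcessorThread(streaming chanend usbOut, int quadrant);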

regards
Al
MaxFlashrom
Experienced Member
Posts: 82
Joined: Fri Nov 05, 2010 2:59 pm

Post by MaxFlashrom »

I would like to offer a word of warning on using arithmetic shift right (shift all the bits right, while preserving the sign bit MSb) for performing integer division. You'll frequently read that this divides an integer by two. Well, sort of, but it's not the whole story. This trick is popular with embedded hackers trying to squeeze the last clock cycle out of their processor. On modern designs like the XMOS L1, this may be misguided, especially for multiplies as they execute in a single cycle. For division it may be possible to save a few clock cycles.
Using arithmetic shift right for integer division is particularly problematic for negative, signed 2's complement numbers. The assembler instruction "ashr" does not behave the same as "divs" (signed integer divide): it rounds the answer toward minus infinity rather than toward 0. In addition, you'll find that -1 shifted any number of places to the right is still -1!

Take -11/4. In 8-bit 2's complement binary, -11 is 11110101; shifted right twice, losing 2 LSbs, that gives 11111101, which is -3 and not -2, and probably not what you wanted.
Take -1: that's 11111111, and arithmetic shifted 4 places right it is still 11111111. Yes, still -1. Yes, I've been there and found out the hard way.
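
You can see it in a few lines of plain C (assuming the compiler implements >> on signed values as an arithmetic shift, as xcc does):

Code: Select all

#include <stdio.h>

int main(void) {
	int n = -11;
	printf("%d %d\n", n / 4, n >> 2);  /* prints -2 -3 */
	printf("%d\n", -1 >> 4);           /* prints -1: still -1 after any shift */
	return 0;
}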

The ANSI C standard does not specifically define an arithmetic right shift; it only defines >>. (In Java, by contrast, >> is defined as an arithmetic shift and >>> as a logical shift.) My ANSI K&R says that

Code: Select all

signed number >> unsigned shift amount
is implementation defined. Most compilers seem to generate an arithmetic shift right (XMOS does).

Writing val >>= 8 rather than val /= 256, where val is signed, does NOT generate the same code or result on a compiler that's not broken.

The following when compiled for XMOS XS1 gives:

Code: Select all

	signed val = -11;
	val /= 2;
	val >>=2;

	signed val = -11;
0x000102dc <main+32>:  ldw   (lru6)      r11, cp[0x2]
0x000102e0 <main+36>:  stw   (ru6)       r11, sp[0x9]
	val /= 2;
0x000102e2 <main+38>:  ldw   (ru6)       r1, sp[0x9]
0x000102e4 <main+40>:  ldc   (ru6)       r0, 0x2
0x000102e6 <main+42>:  divs  (l3r)      r0, r1, r0
0x000102ea <main+46>:  stw   (ru6)       r0, sp[0x9]
	val >>=2;
0x000102ec <main+48>:  ldw   (ru6)       r0, sp[0x9]
0x000102ee <main+50>:  ashr  (l2rus)    r0, r0, 0x2
0x000102f2 <main+54>:  stw   (ru6)       r0, sp[0x9]
So you've saved a few bytes of RAM and clock cycles and got the wrong answer.
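
With optimisation on, by the way, a correct compiler can still use a shift for a signed divide by a power of two; it just biases the value first. Roughly (a sketch in plain C):

Code: Select all

/* Truncating divide by 4 without a divide instruction: add 3 to negative
   values before the arithmetic shift so the result rounds toward zero. */
int div_by_4(int x) {
	return (x + ((x >> 31) & 3)) >> 2;   /* e.g. -11 -> -2, -1 -> 0, 7 -> 1 */
}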
For a more complex combination of shifts (some kind of magic division by 3, for instance) it may well be slower even if you get it right.
Henry S. Warren's book Hacker's Delight is a great source of bit-hacking gems; see
http://www.hackersdelight.org/
Here's a signed division by 3. Very clever, but possibly not quicker than a hardware divide instruction:

Code: Select all

int divs3b(int n) {
   int q, r;

   n = n + (n>>31 & 2);         // Add 2 if n < 0.
   q = (n >> 2) + (n >> 4);     // q = n*0.0101 (approx).
   q = q + (q >> 4);            // q = n*0.01010101.
   q = q + (q >> 8);
   q = q + (q >> 16);
   r = n - q*3;                 // 0 <= r <= 14.
   if (r < 0) printf("Negative r = %d\n", r);
   return q + (11*r >> 5);      // Returning q + r/3.
// return q + (5*(r + 1) >> 4);         // Alternative 1.
// return q + ((r + 5 + (r << 2)) >> 4);// Alternative 2.
}
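
A quick sanity check against the ordinary divide (my own sketch, not from the book) could be:

Code: Select all

#include <stdio.h>

/* Compare divs3b() above with the compiler's signed divide. */
int main(void) {
	int i;
	for (i = -30; i <= 30; i++)
		if (divs3b(i) != i / 3)
			printf("mismatch at %d: %d vs %d\n", i, divs3b(i), i / 3);
	return 0;
}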

Max.
rp181
Respected Member
Posts: 395
Joined: Tue May 18, 2010 12:25 am

Post by rp181 »

I tried a replicator earlier; it turns out I had just forgotten the second par statement and thought it was a for loop :D

As for shifting to divide: I should never, ever get a negative number, so I don't see an issue. I may just leave it as division, as it showed very little difference in time, and memory shouldn't be an issue.

EDIT: Tried it again, and the timers show 0 difference between division and shift.