I have an xCORE-200 Explorer board and I am having trouble getting code to actually execute in parallel.
I essentially have two threads: one generates data over a streaming channel, and the other processes ("elaborates") that data.
The data words are 32 bits wide. I want to process the upper 16 bits and the lower 16 bits separately and simultaneously, then rebuild the 32-bit word.
The output port TT is toggled around the elabora() calls so that I can measure their actual execution time externally with an oscilloscope.
```xc
#include <xs1.h>
#include <platform.h>

// Produce test thread: sends 10 words with the same value in both 16-bit halves
void ProduceThread(streaming chanend c_out) {
  int i;
  for (i = 0; i < 10; i++) {
    c_out <: (i + (i << 16));
    Wait_us(5);  // 5 us delay helper (definition not shown)
  }
}

// Elabora test function: 10-tap multiply-accumulate
unsigned int elabora(unsigned int DataIn, unsigned int h[]) {
  unsigned int i, DataOut;
  DataOut = 0;
  for (i = 0; i < 10; i++) {
    DataOut = DataOut + DataIn * h[i];
  }
  return DataOut;
}

// Elaborate test thread
void ElaborateThread(streaming chanend c_in, streaming chanend c_out,
                     unsigned int Array[], out port TT) {
  unsigned int I, Q, InData, OutI, OutQ;
  unsigned int i;
  unsigned int hQ[100], hI[100];  // only the first 10 taps are used by elabora()

  for (i = 0; i < 100; i++) {
    hI[i] = Array[i];
    hQ[i] = Array[i];
  }
  for (i = 0; i < 10; i++) {
    c_in :> InData;
    // split data into I (upper 16 bits) and Q (lower 16 bits)
    I = InData >> 16;
    Q = InData & 0xFFFF;  // bitwise AND; '&&' would give only 0 or 1
    TT <: 1;              // start of the section timed on the oscilloscope
#if Q_ENABLE
    par {
      OutI = elabora(I, hI);
      OutQ = elabora(Q, hQ);
    }
#else
    OutI = elabora(I, hI);
    OutQ = 0;
#endif
    TT <: 0;              // end of the timed section
    // build 32-bit data
    c_out <: ((OutI & 0xFFFF) << 16) + (OutQ & 0xFFFF);
  }
}

// TestArray (coefficients) and port TT2 are declared elsewhere (not shown)
int main(void) {
  streaming chan c, cOut32;
  par {
    on tile[0]: ProduceThread(c);
    on tile[0]: ElaborateThread(c, cOut32, TestArray, TT2);
    // Other thread with cOut32 channel input
  }
  return 0;
}
```
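With Q_ENABLE=0, the TT pulse (the execution time of elabora()) lasts about 850 ns.
With Q_ENABLE=1, it lasts about 1500 ns.
So with Q_ENABLE=1 it seems that the statements inside the par{} block are executed sequentially: two calls back to back would take roughly 2 × 850 ns = 1700 ns, and 1500 ns is much closer to that than to the ~850 ns I would expect if the two calls overlapped.
If I remove the par{} statement, as below, I get the same timing.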
```xc
...
//par {
  OutI = elabora(I, hI);
  OutQ = elabora(Q, hQ);
//}
...
```
Also, when I compile with Q_ENABLE=0 the resource usage is 2 threads (logical cores); with Q_ENABLE=1 it is 3 threads.
So the code is split into separate threads, but those threads still run one after the other.
Any suggestions? I want the two elabora() calls to actually run in parallel.
Thanks