High Speed (10MHz) DAQ

ikellymo
Junior Member
Posts: 7
Joined: Tue Jun 11, 2013 4:23 am

Post by ikellymo »

Hi All,

I'm relatively new to XMOS and am starting what I'm realizing is an ambitious project - an expandable high speed waveform generator/analyzer. I'd really like to see this through though. With a little bit more success I'll create a formal project for it. My concept is that separate chips will be responsible for driving DAC and reading ADC. Because my target data rate is so high (16bit@10MHz) there are some challenges with communication across the xlinks - but I think this can be overcome (calculations would be done on-chip, results pushed off-chip - sections of waveform could be sampled in real-time, viewable from ethernet connected software).

Right now, though, I'm just working on the basics: generating a sine wave from a look-up table at a 10MHz update rate.

I'll also say that I've looked at

Web Oscilloscope
http://archive.xmoslinkers.org/node/284

High Speed Data Collection project
http://www.xcore.com/projects/high-spee ... ac-and-adc

High Speed data collection (DAC and ADC) topic
http://www.xcore.com/forum/viewtopic.ph ... +speed+DAC

Generating Sine Waves topic
http://www.xcore.com/forum/viewtopic.php?f=26&t=2213

and not sure if I should respond to one of these or start a new topic... but new topic it is!

~:~

So far I'm using the ethernet module to send and receive TELNET commands to my XC-2 board (G4 device) and parsing the packets into channel commands. I can drive my LTC1668 16bit parallel DAC, and at first glance it works without a problem.

But my sine waves should be 100kHz (10MHz / 100 samples, i.e. a 10us period), and instead they sit at 15us. This makes sense once I run the simulation and see that the DAC update interval alternates between 100 and 200ns, which averages 150ns per sample, but I don't know why this happens.
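As a quick sanity check of that arithmetic, here is a minimal host-side sketch in plain C (not XC). The function name `lut_period_ns` is made up for illustration; it only adds up the update intervals over one pass through the table.

```c
#include <assert.h>

/* Time for one full pass through the look-up table, in nanoseconds,
   when the update interval alternates between two values
   (even_ns for even-numbered samples, odd_ns for odd-numbered ones). */
unsigned lut_period_ns(unsigned samples, unsigned even_ns, unsigned odd_ns)
{
    assert(samples % 2 == 0);      /* sketch assumes an even table length */
    return (samples / 2) * (even_ns + odd_ns);
}
```

With 100 samples, a steady 100ns interval gives 10000ns (100kHz), while alternating 100/200ns gives 15000ns, matching the 15us period seen on the scope.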

[Attachment: waveform.png]
This comes after a lot of debugging and headscratching, and is the result of simplifying the code as much as I know how to do at the moment.

Code: Select all

on stdcore[1] : out port dacOut = XS1_PORT_16A;
on stdcore[1] : out port clkOut = XS1_PORT_1E;
on stdcore[1] : clock    clk    = XS1_CLKBLK_1;
on stdcore[0] : out port x0ledB = PORT_LED_0_1;


// cosine look-up table in header
static short cosLUT[100] = {
		65535	,
		65470	,
		65277	,
		64955	,
		64506	,
		.
                .
                .
		64955	,
		65277	,
		65470
};

void initialize(out port dacOut, out port clkOut, clock clk)
{
	configure_clock_rate(clk, 100, 10); // rate = 100/10 MHz = 10MHz clock
	configure_out_port(dacOut, clk, 0);
	configure_port_clock_output(clkOut, clk);
	start_clock(clk);
}

void outputMain(out port dacOut, out port clkOut, clock clk)
{
	short value = 0;
	initialize(dacOut, clkOut, clk);

	while(1) {
		dacOut <: cosLUT[value];
		value++;
		if(value>99) {
			value = 0;
		}
	}
	return;
}


int main(void) {

    chan c_xtcp[1];
    chan c_tasks;
    chan wv[4];

	par
	{
          // The main ethernet/tcp server
          on ETHERNET_DEFAULT_TILE:
             ethernet_xtcp_server(xtcp_ports,
                                  ipconfig,
                                  c_xtcp,
                                  1);


          // The webserver
          on tile[0]: xprot(c_xtcp[0], c_tasks);

          // The taskmaster
          on tile[0]: taskmaster(c_tasks);

          on tile[1]: outputMain(dacOut, clkOut, clk);

	}
	return 0;
}

So I'd like to ask:
A) for help in meeting my 100ns timing deadline
B) for thoughts about using the XMOS architecture to achieve 10MHz signal acquisition and response. After extensive reading I'm starting to doubt this was a good choice; maybe I'd be better off with an FPGA, but I really like the XMOS environment.
C) Is there interest in this project? Possibly having the end result available as a dev platform?

End goal is DC to 500kHz modulation/demodulation, highly modular/expandable (n-channels), multi-frequency mod/demod possible, real-time amplitude / frequency control, ethernet controlled for waveform viewing, industrial communication protocols for analytical result communication.
Last edited by ikellymo on Sun Jul 21, 2013 6:24 pm, edited 1 time in total.


dan
Experienced Member
Posts: 102
Joined: Mon Feb 22, 2010 2:30 pm

Post by dan »

Hi Ikellymo,

This is not actually a difficult thing to solve with XMOS. 10 MHz output is not a big ask!

You should use a buffered port. See the HOWTO example "How to use buffering for port output" in the tools. In fact, you should probably work through all the port HOWTO examples as a start.

Anyway, back to your problem. You are using a 16 bit port, but we can set its transfer width to 32 like so:

Code: Select all

out buffered port:32 dacOut = XS1_PORT_16A;
Leave the rest of the setup as is.

Now, each output to dacOut should be a 32 bit value holding two of your sine wave samples (alter the LUT to hold two samples in each location). The port will take 32 bit values and automatically clock them out in 16 bit chunks at 10 MHz. This should give you time to get round your loop.

If your code gets back to doing the next 32 bit output to dacOut before the port has emptied itself of its previous data, the output instruction will just block until the port is ready.
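A minimal sketch of the LUT repacking described above, written in plain C so it can be checked on a host PC. The helper name `pack_pairs` is hypothetical. It also assumes the port shifts out the least significant bits of each 32 bit word first, so the earlier sample goes in the low half; check this against the port documentation for your device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack consecutive pairs of 16-bit samples into 32-bit words.
   Assumption: the buffered port outputs the low 16 bits first,
   so samples[i] goes in bits 15..0 and samples[i+1] in bits 31..16. */
void pack_pairs(const uint16_t *samples, size_t count, uint32_t *words)
{
    assert(count % 2 == 0);        /* sketch assumes an even sample count */
    for (size_t i = 0; i < count; i += 2) {
        words[i / 2] = (uint32_t)samples[i]
                     | ((uint32_t)samples[i + 1] << 16);
    }
}
```

A 100-entry table becomes 50 words, and the output loop then issues one output per 200ns instead of one per 100ns.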

That should work to achieve your target, but to make it even faster it wouldn't hurt to unroll your loop.

Exactly the same technique can be used for reading from the ADCs.
segher
XCore Expert
Posts: 844
Joined: Sun Jul 11, 2010 1:31 am

Post by segher »

Hi Ian,

In your example code you're not clocking the output port from your
10MHz clock, so it runs from the reference clock, which is 100MHz.
Your code obviously is not fast enough to make that all of the time,
so you get some 20ns intervals as well (which is pretty impressive
already -- if the system clock is 500MHz and you have not more than
four threads running, you get an instruction on one particular thread
every 8ns).

Post by ikellymo »

Thanks for the replies.

@ segher, I realized I made a mistake in my post. I said the DAC gets updated at a rate alternating between 10 and 20ns, this should be 100 and 200ns (100ns or 10MHz is the target).

I believe the lines in the initialize block

Code: Select all

void initialize(out port dacOut, out port clkOut, clock clk)
{
   configure_clock_rate(clk, 100, 10); // rate = 100/10 MHz = 10MHz clock
   configure_out_port(dacOut, clk, 0);
   configure_port_clock_output(clkOut, clk);
   start_clock(clk);
}
set the port to be clocked at 10MHz ?

Some basic timing analysis tells me that my while loop

Code: Select all

   
while(1) {
      dacOut <: cosLUT[value];
      value++;
      if(value>99) {
         value = 0;
      }
   }
takes 150/90ns
[Attachment: timing.png]
and extracting the conditional block

Code: Select all

      
if(value>99) {
         value = 0;
      }
leaves me with 90ns. I'm still a little confused why the out statement has unknown timing associated with it.

@ dan,

I understand that buffering the port will speed things up a bit, but I didn't initially try this because I would think that buffering a 16bit port with a 32bit buffer only buys a little bit of time: I still have to run the calculation/lookup twice and do a bitshift. I was trying to understand why the simple output and loop tends to be slower. I'm looking to expand this very simple code to allow for active control of the output waveform, including an input chanend and maybe a multiplier/divider to control the amplitude. Another input chanend to control the frequency would also be nice.

I could potentially split my 16bit ports into 2 8bit ports each buffered by 32bits (allowing 400ns for processing) or 4 4bit ports (allowing 800ns for processing) which I'm exploring in the timing analyzer right now, but it seems like synchronization will become the challenge, as well as the tradeoff of the added bitshift operations. Need to think about it more.
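The processing budgets quoted above follow from how many samples fit in one 32 bit transfer. A small host-side sketch in plain C, with a made-up helper name `refill_budget_ns`:

```c
#include <assert.h>

/* Time available between refills of a buffered port, in nanoseconds:
   (transfer width / port width) samples per word, one sample per tick. */
unsigned refill_budget_ns(unsigned transfer_bits, unsigned port_bits,
                          unsigned sample_ns)
{
    assert(port_bits != 0 && transfer_bits % port_bits == 0);
    return (transfer_bits / port_bits) * sample_ns;
}
```

At a 100ns sample period this gives 200ns for a 16bit port, 400ns for 8bit ports and 800ns for 4bit ports, at the cost of splitting each sample across more ports.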

But thanks a lot for your suggestions, I'll think about the 32bit wide LUT and if it's possible to still do some math on it.

Post by segher »

ikellymo wrote:I believe the lines in the initialize block [...]
set the port to be clocked at 10MHz ?
Yeah, I cannot read.
ikellymo wrote:Some basic timing analysis tells me that my while loop

Code: Select all

   
while(1) {
      dacOut <: cosLUT[value];
      value++;
      if(value>99) {
         value = 0;
      }
   }
takes 150/90ns
Your "cosLUT" array is an array of signed short ints. If you make it
unsigned, the timing will become better. Same for "value". Try something
like

Code: Select all

unsigned short cosLUT[] = { ... };

...

unsigned j;
for (;;) {
    for (j = -100; j; j++)
        dacOut <: cosLUT[j+100];
}
which should generate something like

Code: Select all

    <set r4 to -100>
    <set r5 to cosLUT+200>
    <set r6 to dacOut>
0:  mov r0,r4
1:  ld16s r1,r5[r4]
    out res[r6],r1
    addi r4,r4,1
    bt r4,1b
    bu 0b
which is 6 insns worst case, 4 insns normally. Unrolling can bring it
closer to 2 insns per iteration; and doing the 32-bit buffer halves that
again. 6 insns with 8 threads and system clock 500MHz gives 96ns
already, I think it will just work :-)
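The 8ns and 96ns figures in this thread can be reproduced with a short host-side calculation in plain C. This is only a sketch of the rule of thumb used above (a thread issues at most one instruction every four system clock cycles, or every n cycles when n > 4 threads are active); the function names are invented for illustration.

```c
#include <assert.h>

/* Interval between instructions issued by one thread, in nanoseconds.
   Integer division truncates, so pick clock rates that divide evenly. */
unsigned insn_interval_ns(unsigned sysclk_mhz, unsigned active_threads)
{
    assert(sysclk_mhz != 0 && active_threads != 0);
    unsigned slots = active_threads < 4 ? 4 : active_threads;
    return slots * 1000u / sysclk_mhz;
}

/* Time for a loop body of `insns` instructions on one thread. */
unsigned loop_time_ns(unsigned insns, unsigned sysclk_mhz,
                      unsigned active_threads)
{
    return insns * insn_interval_ns(sysclk_mhz, active_threads);
}
```

At 500MHz this gives 8ns per instruction with up to four threads and 16ns with eight, so the 6-instruction worst case lands at 96ns, just inside the 100ns deadline.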
ikellymo wrote:I'm still a little confused why the out statement has unknown timing associated with it.
Probably because that code can stall, when the port buffer is full?

Post by ikellymo »

Ok, thanks again for all the advice. The first code was more of a speed test, and now I'm implementing real functionality. I decided to go with two buffered 8bit ports, which buys me 400ns to output 4 samples at a 10MHz update rate. I'm using DDS techniques to output a dynamically variable frequency, variable amplitude signal.

Code: Select all

void initialize(out buffered port:32 dacOutA, out buffered port:32 dacOutB, out port clkOut, clock clk)
{
	configure_clock_rate(clk, 100, 10); // rate = 100/10 MHz = 10MHz clock
	configure_out_port(dacOutA, clk, 0);
	configure_out_port(dacOutB, clk, 0);
	configure_port_clock_output(clkOut, clk);
	start_clock(clk);
}
phaseGen() increments the LUT index (phase) based on requested frequency and streams out the phase. 190ns loop time.

Code: Select all

void phaseGen(chanend freq, streaming chanend phaseOut)
{
	unsigned int frequency = 100000;
	unsigned int inc = 256;
	unsigned short phaseAccum = 0;

	while(1) {
		select {
			case freq :> frequency:
				inc = (frequency*65535/(10000000));
				break;
			default:
				phaseOut <: (unsigned int)(phaseAccum >> 8);
				phaseAccum += inc;
				phaseOut <: (unsigned int)(phaseAccum >> 8);
				phaseAccum += inc;
				phaseOut <: (unsigned int)(phaseAccum >> 8);
				phaseAccum += inc;
				phaseOut <: (unsigned int)(phaseAccum >> 8);
				phaseAccum += inc;
				break;
		}
	}
}
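One thing worth checking in the increment calculation: with 32 bit unsigned arithmetic, the product frequency*65535 wraps for any frequency above 65537Hz, so a 100kHz request would not yield the intended increment. The sketch below is plain C for a host PC, not a drop-in replacement; `dds_increment` and `dds_step` are hypothetical names, and it assumes a 16 bit accumulator whose top byte indexes the table, as in phaseGen().

```c
#include <assert.h>
#include <stdint.h>

/* Tuning word for a 16-bit phase accumulator:
   inc = f_out * 2^16 / f_sample.
   The 64-bit intermediate avoids the 32-bit wrap described above. */
uint16_t dds_increment(uint32_t f_out_hz, uint32_t f_sample_hz)
{
    assert(f_sample_hz != 0 && f_out_hz < f_sample_hz);
    return (uint16_t)(((uint64_t)f_out_hz << 16) / f_sample_hz);
}

/* Return the 8-bit table index for the current sample (top byte of the
   accumulator), then advance the accumulator, wrapping at 2^16. */
uint8_t dds_step(uint16_t *accum, uint16_t inc)
{
    uint8_t index = (uint8_t)(*accum >> 8);
    *accum = (uint16_t)(*accum + inc);
    return index;
}
```

For 100kHz at a 10MHz sample rate this gives an increment of 655, whereas the wrapped 32 bit expression evaluates to 225.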
amplitudeConversion() converts that phase to an amplitude (range: 0-0xFFFF), using (x >> 16) + 1 as a fast divide, and streams out the value. 390ns loop time.

Code: Select all

void amplitudeConversion(streaming chanend phase, chanend amplitude, streaming chanend valueOut)
{
	unsigned int ph[4] = {0,0,0,0};
	unsigned int amp = 65535;

	while(1) {
		select {
			case amplitude :> amp:
				break;

			default:
				phase :> ph[0];
				phase :> ph[1];
				phase :> ph[2];
				phase :> ph[3];

				valueOut <: ((cosLUT[ph[0]]*amp) >> 16) + 1;
				valueOut <: ((cosLUT[ph[1]]*amp) >> 16) + 1;
				valueOut <: ((cosLUT[ph[2]]*amp) >> 16) + 1;
				valueOut <: ((cosLUT[ph[3]]*amp) >> 16) + 1;
				break;
		}
	}
}
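The scaling step can be tried on a host PC with the plain C sketch below. It assumes the table holds unsigned 16 bit values, as suggested earlier in the thread, and `scale_sample` is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

/* Scale a 16-bit sample by a 16-bit amplitude, using a shift by 16 in
   place of a divide by 65536, plus 1 so full scale maps to 0xFFFF. */
uint16_t scale_sample(uint16_t sample, uint16_t amp)
{
    uint32_t product = (uint32_t)sample * amp;  /* at most 65535^2, fits */
    assert((product >> 16) < 0xFFFFu);          /* so +1 stays in 16 bits */
    return (uint16_t)((product >> 16) + 1);
}
```

Note that the +1 means a zero sample comes out as 1 rather than 0, which may or may not matter for the DAC in question.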
outputMain() takes the values and outputs them to the ports (combined as 1 16bit port). 260ns loop time.

Code: Select all

void outputMain(out buffered port:32 dacOutA, out buffered port:32 dacOutB, out port clkOut, clock clk, streaming chanend valueIn)
{

	unsigned int value1[4] = {0,0,0,0};
	unsigned int outA;
	unsigned int outB;

	initialize(dacOutA, dacOutB, clkOut, clk);

	outA = 0xFFFFFFFF;
	outB = 0xFFFFFFFF;

	dacOutA <: outA;
	dacOutB <: outB;
	sync(dacOutA);
	sync(dacOutB);

	while(1) {

		dacOutB <: outB;
		dacOutA <: outA;

		valueIn :> value1[0];
		valueIn :> value1[1];
		valueIn :> value1[2];
		valueIn :> value1[3];

		outB = (((value1[3] & 0x000000FF) << 24) | ((value1[2] & 0x000000FF) << 16) | ((value1[1] & 0x000000FF) << 8) | (value1[0] & 0x000000FF));
		outA = (((value1[3] & 0x0000FF00) << 16) | ((value1[2] & 0x0000FF00) << 8)  | (value1[1] & 0x0000FF00)        | ((value1[0] & 0x0000FF00) >> 8));

	}
	return;
}
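The byte shuffling for outA and outB is easy to get wrong, so here is a host-side sketch in plain C that performs the same split and rebuilds each sample as a check. The names `split_lanes` and `lane_sample` are hypothetical, and the sketch assumes each port shifts out the least significant byte of its 32 bit word first.

```c
#include <assert.h>
#include <stdint.h>

/* Split four 16-bit samples across two 32-bit words, one per 8-bit port.
   lo collects the low bytes (outB in the post), hi the high bytes (outA).
   Sample 0 sits in bits 7..0 of each word so it is shifted out first. */
void split_lanes(const uint16_t s[4], uint32_t *lo, uint32_t *hi)
{
    *lo = 0;
    *hi = 0;
    for (int i = 0; i < 4; i++) {
        *lo |= (uint32_t)(s[i] & 0x00FFu) << (8 * i);
        *hi |= (uint32_t)(s[i] >> 8) << (8 * i);
    }
}

/* Rebuild the 16-bit value the DAC would see on clock tick i. */
uint16_t lane_sample(uint32_t lo, uint32_t hi, int i)
{
    assert(i >= 0 && i < 4);
    return (uint16_t)((((hi >> (8 * i)) & 0xFFu) << 8)
                      | ((lo >> (8 * i)) & 0xFFu));
}
```

The rebuilt value is only correct if both ports start shifting on the same clock edge, which is the alignment problem discussed next.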
Looks like everything works, except I can't for the life of me get the ports synced/aligned.
[Attachment: forForum.png]
I'm also optimizing at -O3. My hypothesis is that outputMain calculates the portA output bits and outputs them, then calculates the portB output bits and outputs them, but the calculation takes more than 100ns so the ports get misaligned. Even though I separated the calculation in the code using outA/outB variables, I think the optimization of the compiler kills my separation. I've tried many combinations of sync() calls on A,B,A/B, tried inline calls to sync() (it just delays), tried different orders of dacOutA/dacOutB, a lot.

Any advice of how to force this using assembly or XC? Thanks a million.

Post by segher »

ikellymo wrote:My hypothesis is that outputMain calculates the portA output bits and outputs them, then calculates the portB output bits and outputs them, but the calculation takes more than 100ns so the ports get misaligned.
Look at the generated code to see if that is true. xobjdump -d...
ikellymo wrote:Even though I separated the calculation in the code using outA/outB variables, I think the optimization of the compiler kills my separation. I've tried many combinations of sync() calls on A,B,A/B, tried inline calls to sync() (it just delays), tried different orders of dacOutA/dacOutB, a lot.

Any advice of how to force this using assembly or XC?
Immediately before the outputs, you can do

Code: Select all

asm("" : : "r"(outA), "r"(outB));
which a) forces both outA and outB to be in registers by then, and b) forces
the outputs to go after that. That should be enough.
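To show where such a barrier sits relative to the two outputs, here is a rough analogue in plain C using the GCC/Clang extended asm syntax. It is not XC: the two volatile pointers merely stand in for the ports, and `emit_pair` is an invented name.

```c
#include <assert.h>
#include <stdint.h>

/* Write two precomputed words back to back.
   The empty asm takes both values as register inputs, so the compiler
   must finish computing them before this point; being volatile, it is
   also not moved past the volatile stores that follow. */
void emit_pair(volatile uint32_t *portA, volatile uint32_t *portB,
               uint32_t outA, uint32_t outB)
{
    assert(portA && portB);
    __asm__ __volatile__("" : : "r"(outA), "r"(outB));
    *portB = outB;
    *portA = outA;
}
```

The point is only the ordering: all the arithmetic for both words happens before the barrier, leaving nothing but the two output instructions after it.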

Post by ikellymo »

Hi Segher,
I checked the disassembly in the timing analyzer and found I was right: the outputs were significantly separated.
[Attachment: Screen Shot 2013-07-25 at 11.52.01 AM.png]
Inserting your code worked like a charm:
[Attachment: Screen Shot 2013-07-25 at 11.50.54 AM.png]
Leaving successful generation:
[Attachment: photo.jpg]
SINCERE THANKS!!!

Post by segher »

Congrats! And good luck with the rest of the project.