Optimizing ADC Read

rp181 · Post by **rp181** » Tue Dec 30, 2014 7:12 pm

Hi guys,

In my current project, reading two parallel ADCs (AD7938) has become the bottleneck in operating speed. I'm posting here to see if anyone has ideas to speed up the reads. The ADC is capable of 1.5 MSPS, but I'm achieving a read rate of about half that.

Physically, there are two ADCs that share a 12-bit data bus (three 4-bit ports), RD, and WR; each has its own CS pin. An extended channel number of 0-15 passed into the function translates into channel 0-7 on either adc0 or adc1.

Code: Select all

unsigned int CLOCK_OUTPUT = 0xF5555555;

void initADC(ADCPair &adc) {
    //25.5 MHz max square wave for ADC
    configure_clock_rate(adc.clkBlk, 100, 2);
    configure_out_port(adc.clk, adc.clkBlk, 1);
    start_clock(adc.clkBlk);

    adc.convst <: 1;
    adc.cs[0] <: 1; //ADC0 chip select
    adc.cs[1] <: 1; //ADC1 chip select
    adc.clk <: 1;
    adc.wr <: 1;
    adc.rd <: 1;
}

inline int readADC(ADCPair &adc, Bus12b &bus, int exChan) {
    int data;
    int msb, mid, lsb;
    //Convert extended channel 0-15 to adc number and channel 0-7
    int adcNum = exChan > 7;
    int channel = exChan - adcNum * 8;
    //Based on ADC registers
    int output = channel << 5;

    //Write the channel to read
    adc.cs[adcNum] <: 0;
    adc.wr <: 0;

    bus.msb <: (output >> 8);
    bus.mid <: ((output >> 4) & 0b1111);
    bus.lsb <: (output & 0b1111);

    adc.wr <: 1;
    adc.cs[adcNum] <: 1;

    //Conversion
    adc.convst <: 0;
    adc.clk <: CLOCK_OUTPUT; //At least 14 pulses required
    sync(adc.clk);
    adc.convst <: 1;
    //Read
    adc.cs[adcNum] <: 0;
    adc.rd <: 0;
    adc.rd <: 0;
    adc.rd <: 0; //At least 30ns low

    bus.msb :> msb;
    bus.mid :> mid;
    bus.lsb :> lsb;

    data = ((msb << 8) + (mid << 4) + lsb);

    adc.cs[adcNum] <: 1;
    adc.rd <: 1;

    return data;
}

Any insight would be great!

mon2 · Post by **mon2** » Wed Dec 31, 2014 2:11 pm

Hi. A few comments...

Code: Select all

    //25.5 MHz max square wave for ADC
    configure_clock_rate(adc.clkBlk, 100, 2);

Does that not equate to a 50 MHz clock rate? The max for the ADC is 25.5 MHz, so this should be:

Code: Select all

    //25.5 MHz max square wave for ADC
    configure_clock_rate(adc.clkBlk, 100, 4);

Or is it that your manually created pulse train, via the following bit pattern and a 50 MHz clock rate, results in a 25 MHz clock at the ADC?

Code: Select all

unsigned int CLOCK_OUTPUT = 0xF5555555;

Please confirm the true clock present at the ADC's clock input with your logic analyzer.

Code: Select all

adc.clk <: CLOCK_OUTPUT; //At least 14 pulses required

So you are manually clocking out, one bit at a time, the following bit pattern (i.e. 28 bits of 'adc.clk'):

Code: Select all

adc.clk <: b1111 0101 0101 0101 0101 0101 0101 0101; //At least 14 pulses required

Could this not be reduced to 16 bits? Best to confirm whether the above bit pattern results in the required max of a 25 MHz clock to the ADC.

Why generate the clock manually (bit-bang) for the ADC? Why not let the XMOS device generate the clock at 25 MHz and feed it directly to the ADC devices?

Consider remapping the data ports to allow fewer transactions with the ADC.

Consider using buffered ports rather than the sequential setup code. For guidance, have a look at the SDRAM component, which I recall squeezes out bandwidth using similar buffered port methods.

Disclaimer: Still very much in learning mode about XMOS devices but the above comments may help to achieve your goal.

segher · Post by **segher** » Wed Dec 31, 2014 5:12 pm

Hi rp181,

About 0.75MS/s means about 160 processor cycles per
sample. I count about 100 in your code (two thirds of
that "wasted" in the sync instruction); that would give
about 1.2MS/s.
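The arithmetic behind those figures can be sketched in plain C. This assumes (it is not stated in the thread) that a single xCORE thread issues at most one instruction every four core clocks, i.e. 125 million instructions per second at 500 MHz. The function names are made up for illustration.

```c
#include <assert.h>

/* Assumption: one thread issues at most one instruction every
   4 core clocks, so a 500 MHz core gives 125 M instruction slots/s. */
#define CORE_HZ 500000000u
#define CLOCKS_PER_INSTR 4u

/* Instruction slots available per sample at a given sample rate. */
unsigned cycles_per_sample(unsigned rate_hz) {
    assert(rate_hz > 0);
    return (CORE_HZ / CLOCKS_PER_INSTR) / rate_hz;
}

/* Sample rate reachable if one sample costs `cycles` instruction slots. */
unsigned rate_for_cycles(unsigned cycles) {
    assert(cycles > 0);
    return (CORE_HZ / CLOCKS_PER_INSTR) / cycles;
}
```

Under that assumption, 0.75 MS/s leaves roughly 166 slots per sample, and a 100-slot read would allow about 1.25 MS/s, which is in line with the rounded numbers above.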

So where are all those extra cycles gone? You do run
the xcore at 500MHz, don't you? Look at LA traces to
see what went where (or use the timing analyser if you
know how to drive that); look at the generated machine
code to see what is wrong.

rp181 · Post by **rp181** » Wed Dec 31, 2014 7:00 pm

Thanks for the feedback!

mon2:
The clk pin is actually buffered to 32 bits (and clocked). 14 pulses equates to 28 pin states, so there's not many wasted cycles. The 25.5 MHz is the frequency of the square wave (rising edge to rising edge) which is two states, so a pin transition rate of 50 MHz equates to 25 MHz square wave (confirmed with a scope).
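As a quick sanity check of the pulse count, the pattern can be walked bit by bit in plain C. This is only a sketch: it assumes the buffered port shifts the word out least significant bit first and that the pin sits high beforehand (initADC drives clk to 1, and the pattern itself ends in 0xF). The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Count falling edges seen on the pin when `word` is shifted out
   LSB first, with the pin at level `idle` (0 or 1) beforehand. */
unsigned falling_edges(uint32_t word, unsigned idle) {
    assert(idle <= 1);
    unsigned prev = idle;
    unsigned edges = 0;
    for (int i = 0; i < 32; i++) {
        unsigned bit = (word >> i) & 1u;
        if (prev == 1u && bit == 0u) {
            edges++;
        }
        prev = bit;
    }
    return edges;
}

/* Square-wave frequency for a pin that toggles once per port clock tick. */
unsigned square_wave_hz(unsigned port_clock_hz) {
    return port_clock_hz / 2u;
}
```

With these assumptions, 0xF5555555 yields 14 falling edges (the seven 0x5 nibbles), and a 50 MHz port clock gives a 25 MHz square wave.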

segher:
The core is running at 500MHz (as far as the XN file dictates). The timing analyzer says that 1 function call takes 520ns worst case, but when timed (100000 executions with a timer), a function call storing into an array was taking ~1300 ns. I'm going to re-do some testing using traces instead of timers and try and see where the slow down is happening.

rp181 · Post by **rp181** » Wed Dec 31, 2014 7:26 pm

Ok, so I timed it again and I am getting 1360 ns. Using a scope and a DAC output, I measure 1500 ns (which makes sense because the DAC takes a bit of time to write to). I don't have a great scope/LA right now, so I'm making do with a 20 MHz scope.

I'm not sure where the discrepancy between the timing analyzer (520ns) and execution (1360ns) is coming from. If it matters, the release is -O2.

Timing code:

Code: Select all

        t :> startTime;
        for (unsigned int i = 0; i < 100000; i++) {
            readADC(adc, bus, 1);
        }
        t :> stopTime;

        printf("%i0ns\n", (stopTime - startTime)); //Prints 136000010ns (total for all 100000 calls)
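For reference, turning the tick difference into a per-call figure can be sketched in plain C. It assumes a 10 ns timer tick (100 MHz reference clock), which is what the "%i0ns" format string implies; the function name is hypothetical. Loop overhead is included in the result, so the true cost of one readADC call is slightly lower.

```c
#include <assert.h>
#include <stdint.h>

#define NS_PER_TICK 10u /* assumed 100 MHz reference timer */

/* Average time per call in nanoseconds, from two timer readings. */
uint32_t ns_per_call(uint32_t start_ticks, uint32_t stop_ticks, uint32_t calls) {
    assert(calls > 0);
    /* Unsigned subtraction stays correct across one timer wraparound. */
    uint32_t elapsed = stop_ticks - start_ticks;
    return (uint32_t)(((uint64_t)elapsed * NS_PER_TICK) / calls);
}
```

The printed total of 136000010 ns corresponds to 13600001 ticks, so the average comes out at 1360 ns per call.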

Assembly for the function from the timing analyzer. I've never really dealt with assembly, so it'll take me a while to go through it.

Code: Select all

0x1017c readADC:
           0x1017c 	entsp (u6) 0x5
           0x1017e 	stw (ru6) r4, sp[0x4]
           0x10180 	stw (ru6) r5, sp[0x3]
           0x10182 	stw (ru6) r6, sp[0x2]
           0x10184 	stw (ru6) r7, sp[0x1]
           0x10186 	stw (ru6) r8, sp[0x0]
           0x10188 	mkmsk (rus) r3, 0x3
           0x1018a 	lss (3r) r4, r3, r2
           0x1018c 	ldw (3r) r3, r0[r4]
           0x1018e 	ldc (ru6) r11, 0x0
           0x10190 	out (r2r) res[r3], r11
           0x10192 	ldw (2rus) r6, r0[0x3]
           0x10194 	out (r2r) res[r6], r11
           0x10196 	shl (2rus) r4, r4, 0x3
           0x10198 	sub (3r) r4, r2, r4
           0x1019a 	shl (2rus) r2, r4, 0x5
           0x1019c 	ashr (l2rus) r5, r2, 0x8
           0x101a0 	ldw (2rus) r2, r1[0x0]
           0x101a2 	out (r2r) res[r2], r5
           0x101a4 	shl (2rus) r4, r4, 0x1
           0x101a6 	ldc (ru6) r5, 0xe
           0x101a8 	and (3r) r4, r4, r5
           0x101aa 	ldw (2rus) r5, r1[0x1]
           0x101ac 	out (r2r) res[r5], r4
           0x101ae 	ldw (2rus) r4, r1[0x2]
           0x101b0 	out (r2r) res[r4], r11
           0x101b2 	mkmsk (rus) r1, 0x1
           0x101b4 	out (r2r) res[r6], r1
           0x101b6 	out (r2r) res[r3], r1
           0x101b8 	ldw (2rus) r6, r0[0x2]
           0x101ba 	out (r2r) res[r6], r11
           0x101bc 	ldw (2rus) r7, r0[0x5]
           0x101be 	ldw (lru6) r8, dp[0x3]
           0x101c2 	out (r2r) res[r7], r8
           0x101c4 	syncr (1r) res[r7]
           0x101c6 	out (r2r) res[r6], r1
           0x101c8 	out (r2r) res[r3], r11
           0x101ca 	ldw (2rus) r6, r0[0x4]
           0x101cc  	out (r2r) res[r6], r11
           0x101ce 	out (r2r) res[r6], r11
           0x101d0 	out (r2r) res[r6], r11
           0x101d2 	setc (ru6) res[r2], 0x1
           0x101d4 	in (2r) r0, res[r2]
           0x101d6 	setc (ru6) res[r5], 0x1
           0x101d8 	in (2r) r11, res[r5]
           0x101da 	setc (ru6) res[r4], 0x1
           0x101dc 	in (2r) r2, res[r4]
           0x101de 	out (r2r) res[r3], r1
           0x101e0 	shl (2rus) r0, r0, 0x8
           0x101e2 	shl (2rus) r3, r11, 0x4
           0x101e4 	add (3r) r0, r3, r0
           0x101e6 	add (3r) r0, r0, r2
           0x101e8 	out (r2r) res[r6], r1
           0x101ea 	ldw (ru6) r8, sp[0x0]
           0x101ec 	ldw (ru6) r7, sp[0x1]
           0x101ee 	ldw (ru6) r6, sp[0x2]
           0x101f0 	ldw (ru6) r5, sp[0x3]
           0x101f2 	ldw (ru6) r4, sp[0x4]
           0x101f4 	retsp (u6) 0x5
           0x101f6 	add (2rus) r0, r0, 0x0

EDIT: So the timing analyzer apparently is ignoring the waiting for the clock pulse, which is ~600ns (scoped) by itself.
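The scoped figure is consistent with simple arithmetic on the buffered port. The sketch below assumes that sync() returns only once the whole 32-bit word has been shifted out, and that the port clock is the 50 MHz configured in initADC; the function name is made up.

```c
#include <assert.h>

/* Time in ns to shift `bits` out of a port clocked at `port_clock_mhz`. */
unsigned shift_out_ns(unsigned bits, unsigned port_clock_mhz) {
    assert(port_clock_mhz > 0);
    return bits * 1000u / port_clock_mhz;
}
```

Under those assumptions the full 32-bit word takes 640 ns, and the 28 pin states that carry the 14 pulses take 560 ns of that.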

I read over the data sheet again, and it looks like the only way to read that quickly is to read old data while the clock signal is converting the new data. With this random-read method, it looks like I'm not going to be able to go much faster.

I think I'll just have to have a new thread constantly reading adc data sequentially into a buffer to get 1.5 MSPS.

segher · Post by **segher** » Wed Dec 31, 2014 8:50 pm

Another way to get a nice performance boost is to operate
both ADCs in parallel. You already are doing that -- just
reading the result from only one of-em!

I'll have a look at your generated machine code.

segher · Post by **segher** » Wed Dec 31, 2014 9:00 pm

Some ways to get better generated code:

-- Use unsigned vars wherever possible;
-- Use bitmasking, not e.g. "bla > 7";
-- Use inshr/outshr, in XC: ":> >>" etc.;
-- Inline this function in a loop to decrease overhead a lot.
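As an illustration of the bitmasking point, the extended-channel split from readADC can be written with a shift and a mask on unsigned values. This is a plain C sketch rather than the code that was eventually used; the AdcSel type and split_channel name are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper type: which ADC, and which channel on it. */
typedef struct {
    unsigned adc_num;
    unsigned channel;
} AdcSel;

/* Split an extended channel 0-15 into ADC number and channel 0-7. */
AdcSel split_channel(unsigned ex_chan) {
    assert(ex_chan < 16);
    AdcSel s;
    s.adc_num = ex_chan >> 3;  /* bit 3 selects adc0 or adc1 */
    s.channel = ex_chan & 7u;  /* low three bits are the channel */
    return s;
}
```

For inputs 0-15 this gives the same result as `exChan > 7` followed by `exChan - adcNum * 8`, without the comparison and multiply.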

rp181 · Post by **rp181** » Wed Dec 31, 2014 11:54 pm

Thanks for all of the advice! I implemented most of it, and my in-application speed went from 30 kHz to 60 kHz!

segher · Post by **segher** » Thu Jan 01, 2015 5:45 pm

Happy new year!

Do you maybe know which optimisation helped how much?
And/or can you show your current source code and generated
machine code?

Cheers,

Segher

rp181 · Post by **rp181** » Thu Jan 01, 2015 7:51 pm

Converting everything to unsigned gave a boost of a couple kHz (didn't measure the impact of changing the shifts). I didn't really find much information on outshr and inshr, so not sure if that's applicable.
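For what it's worth, the idea behind inshr (":> >>" in XC) can be emulated in plain C. The sketch assumes the commonly described behaviour: the accumulator is shifted right by the port width and the new input lands in the top bits. That is worth checking against the architecture manual before relying on it. Both function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates the assumed inshr behaviour for a port `width` bits wide. */
uint32_t inshr(uint32_t acc, uint32_t port_value, unsigned width) {
    assert(width > 0 && width < 32);
    uint32_t mask = (1u << width) - 1u;
    return (acc >> width) | ((port_value & mask) << (32u - width));
}

/* Assemble a 12-bit sample from three 4-bit reads, lowest nibble first. */
uint32_t assemble12(uint32_t msb, uint32_t mid, uint32_t lsb) {
    uint32_t acc = 0;
    acc = inshr(acc, lsb, 4);
    acc = inshr(acc, mid, 4);
    acc = inshr(acc, msb, 4);
    return acc >> 20; /* the three nibbles now sit in bits 20..31 */
}
```

The result matches `(msb << 8) + (mid << 4) + lsb`, but the ports have to be read lsb first, and whether it actually saves instructions here is untested.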

Previously, the operation was as follows:
- Read 2 adc channels
- Compute stuff with adc values
- Output to the DAC
- Loop through channels 2k and 2k+1, k = [0,7]

The new flow is:
- Read all 16 ADC channels (8 function calls) in an ADC thread
- Get the ADC values from the ADC thread
- Compute all DAC outputs
      --> While this is happening, the ADC thread is reading the next batch of ADC values

The change in flow is what gave the bulk of the improvement (at a slight cost in latency), mostly due to the parallel reading. Having the ADCs read while doing other stuff also gave a nice boost. I may be able to go slightly faster, but the ADC datasheet is iffy on what happens when the RD pin is low before the CS pin is low.

ADC function to read the ADCs in parallel (combined into an int):

Code: Select all

inline unsigned int readBothADC(ADCPair &adc, Bus12b &bus, int channel) {
    unsigned short data1, data2;
    unsigned short msb, mid, lsb;

    //Write the channel to read to both ADCs
    adc.cs[0] <: 0;
    adc.cs[1] <: 0;
    adc.wr <: 0;

    bus.msb <: 0; //Always zero, so cut out some instructions
    bus.mid <: (channel << 1);
    bus.lsb <: 0; //Always zero, so cut out some instructions

    adc.wr <: 1;
    adc.cs[0] <: 1;
    adc.cs[1] <: 1;

    //Conversion
    adc.convst <: 0;
    adc.clk <: CLOCK_OUTPUT;
    sync(adc.clk);
    adc.convst <: 1;

    //Read channel
    adc.cs[0] <: 0;
    adc.rd <: 0;
    adc.rd <: 0;
    adc.rd <: 0;

    bus.msb :> msb;
    bus.mid :> mid;
    bus.lsb :> lsb;

    data1 = ((msb << 8) + (mid << 4) + lsb);

    adc.cs[0] <: 1;
    adc.rd <: 1;

    //Read channel + 8
    adc.cs[1] <: 0;
    adc.rd <: 0;
    adc.rd <: 0;
    adc.rd <: 0;

    bus.msb :> msb;
    bus.mid :> mid;
    bus.lsb :> lsb;

    data2 = ((msb << 8) + (mid << 4) + lsb);

    adc.cs[1] <: 1;
    adc.rd <: 1;

    return (data1 << 16) + data2;
}
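On the receiving side, the combined value has to be split back into the two samples. A minimal plain C sketch, assuming the layout used in readBothADC (adc0 sample in the upper 16 bits, adc1 sample in the lower 16); the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 12-bit samples the same way readBothADC does. */
uint32_t pack_pair(uint32_t adc0, uint32_t adc1) {
    assert(adc0 < 4096u && adc1 < 4096u);
    return (adc0 << 16) | adc1;
}

/* Sample from adc0 (channel k). */
uint32_t unpack_adc0(uint32_t packed) {
    return packed >> 16;
}

/* Sample from adc1 (channel k + 8). */
uint32_t unpack_adc1(uint32_t packed) {
    return packed & 0xFFFFu;
}
```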

Assembly for that function:

Code: Select all

0x10a40 readBothADC:
           0x10a40 	entsp (u6) 0x6
           0x10a42 	stw (ru6) r4, sp[0x5]
           0x10a44 	stw (ru6) r5, sp[0x4]
           0x10a46 	stw (ru6) r6, sp[0x3]
           0x10a48 	stw (ru6) r7, sp[0x2]
           0x10a4a 	stw (ru6) r8, sp[0x1]
           0x10a4c 	stw (ru6) r9, sp[0x0]
           0x10a4e 	ldw (2rus) r6, r0[0x0]
           0x10a50 	ldc (ru6) r5, 0x0
           0x10a52 	out (r2r) res[r6], r5
           0x10a54 	ldw (2rus) r3, r0[0x1]
           0x10a56 	out (r2r) res[r3], r5
           0x10a58 	ldw (2rus) r7, r0[0x3]
           0x10a5a 	out (r2r) res[r7], r5
           0x10a5c 	ldw (2rus) r4, r1[0x0]
           0x10a5e 	out (r2r) res[r4], r5
           0x10a60 	shl (2rus) r2, r2, 0x1
           0x10a62 	ldw (2rus) r11, r1[0x1]
           0x10a64 	out (r2r) res[r11], r2
           0x10a66 	ldw (2rus) r2, r1[0x2]
           0x10a68 	out (r2r) res[r2], r5
           0x10a6a 	mkmsk (rus) r1, 0x1
           0x10a6c 	out (r2r) res[r7], r1
           0x10a6e 	out (r2r) res[r6], r1
           0x10a70 	out (r2r) res[r3], r1
           0x10a72 	ldw (2rus) r7, r0[0x2]
           0x10a74 	out (r2r) res[r7], r5
           0x10a76 	ldw (2rus) r8, r0[0x5]
           0x10a78 	ldw (lru6) r9, dp[0xc]
           0x10a7c 	out (r2r) res[r8], r9
           0x10a7e 	syncr (1r) res[r8]
           0x10a80 	out (r2r) res[r7], r1
           0x10a82 	out (r2r) res[r6], r5
           0x10a84 	ldw (2rus) r7, r0[0x4]
           0x10a86 	out (r2r) res[r7], r5
           0x10a88 	out (r2r) res[r7], r5
           0x10a8a 	out (r2r) res[r7], r5
           0x10a8c 	setc (ru6) res[r4], 0x1
           0x10a8e 	in (2r) r0, res[r4]
           0x10a90 	setc (ru6) res[r11], 0x1
           0x10a92 	in (2r) r9, res[r11]
           0x10a94 	setc (ru6) res[r2], 0x1
           0x10a96 	in (2r) r8, res[r2]
           0x10a98 	out (r2r) res[r6], r1
           0x10a9a 	out (r2r) res[r7], r1
           0x10a9c 	out (r2r) res[r3], r5
           0x10a9e 	out (r2r) res[r7], r5
           0x10aa0 	out (r2r) res[r7], r5
           0x10aa2 	out (r2r) res[r7], r5
           0x10aa4 	shl (2rus) r0, r0, 0x8
           0x10aa6 	shl (2rus) r5, r9, 0x4
           0x10aa8 	add (3r) r0, r5, r0
           0x10aaa 	add (3r) r0, r0, r8
           0x10aac 	shl (2rus) r0, r0, 0x10
           0x10aae 	setc (ru6) res[r4], 0x1
           0x10ab0 	in (2r) r4, res[r4]
           0x10ab2 	setc (ru6) res[r11], 0x1
           0x10ab4 	shl (2rus) r4, r4, 0x8
           0x10ab6 	in (2r) r11, res[r11]
           0x10ab8 	shl (2rus) r11, r11, 0x4
           0x10aba 	add (3r) r11, r11, r4
           0x10abc 	setc (ru6) res[r2], 0x1
           0x10abe 	in (2r) r2, res[r2]
           0x10ac0 	add (3r) r2, r11, r2
           0x10ac2 	zext (rus) r2, 0x10
           0x10ac4 	or (3r) r0, r2, r0
           0x10ac6 	out (r2r) res[r3], r1
           0x10ac8 	out (r2r) res[r7], r1
           0x10aca 	ldw (ru6) r9, sp[0x0]
           0x10acc 	ldw (ru6) r8, sp[0x1]
           0x10ace 	ldw (ru6) r7, sp[0x2]
           0x10ad0 	ldw (ru6) r6, sp[0x3]
           0x10ad2 	ldw (ru6) r5, sp[0x4]
           0x10ad4 	ldw (ru6) r4, sp[0x5]
           0x10ad6 	retsp (u6) 0x6

Happy New Year to you too, and thanks for all the help!
