Overhead of select and interface notifications.

Technical questions regarding the XTC tools and programming with XMOS.
CousinItt
Respected Member
Posts: 360
Joined: Wed May 31, 2017 6:55 pm

Post by CousinItt »

Well, that's more information. Without the optimisation your code will run slower, so it may be better to keep -O2 and add just enough side effects to stop the code being optimised out. Running at 10 MHz with a four-bit port is likely to be risky anyway. I vaguely remember that the ready notification plus a single word transfer takes a significant chunk of a microsecond. It should be easy to confirm that with a separate test project that transfers dummy data using the same ready/get arrangement, with test pins to show the timing; you could then estimate the effect of different optimisation levels.

For comparison, I wrote an alternative I2S implementation for four-bit ports. The bit clock runs at about 8.2 MHz and it can manage this transfer (a single movable pointer) using an interface, but it does not use ready notification. The client task just blocks on the get function until the receiver has more data. You might want to consider doing something similar. Alternatively there's always streaming channels.

Using a single-bit port instead, there should be no trouble in fitting it all into a 3.2 us cycle.
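
As a rough sketch of the arithmetic behind these figures (plain C for the host, not xC): this assumes the 10 MHz is the bit clock, one sample is clocked in per cycle, and a frame is 32 bits. The helper name ns_per_port_read is made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: nanoseconds between successive port reads,
 * assuming one sample per bit-clock cycle and that the port hands over
 * a word once transfer_width bits have been collected. */
static uint32_t ns_per_port_read(uint32_t bit_clock_hz,
                                 uint32_t port_width,
                                 uint32_t transfer_width)
{
    uint32_t samples_per_read = transfer_width / port_width;
    uint32_t ns_per_sample = 1000000000u / bit_clock_hz;
    return samples_per_read * ns_per_sample;
}
```

At 10 MHz this gives 100 ns per read for a four-bit port read one sample at a time, 800 ns for a four-bit port buffered to 32 bits, and 3200 ns (the 3.2 us cycle) for a single-bit port buffered to 32 bits.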


RedDave
Experienced Member
Posts: 77
Joined: Fri Oct 05, 2018 4:26 pm

Post by RedDave »

I take back my conclusion about optimisation...

I decided to start pretty much from scratch.

The code below does nothing but raise a line while the port is being processed and copy one of the bits of the four-bit port onto an output.
This works and I can see the pulses and copied trace on the scope.

Uncommenting the get_data() case causes it to only get every other port read.

You can see from the width of the pulse that the code takes less than a third of the available processing time. Why does adding the case even affect it?
Without case: [Image]
With case: [Image]


Code:

on tile[0] : in buffered port:4 p_comms0 = XS1_PORT_4C;
#define FRAME1  (comms0 & 0x01)
#define SD03  (comms0 & 0x04)

on tile[0] : out port p_debug = XS1_PORT_1E;
on tile[0] : out port p_debug2 = XS1_PORT_1F;

void tdc_loop(server tdc_if i_tdc)
{
    int comms0;
    int sd03;
    unsigned int data;

    while(TRUE)
    {
        select
        {
        case p_comms0 :> comms0:
            p_debug2 <: 1;
            sd03 = (SD03 ? 1 : 0);
            p_debug <: sd03;
            p_debug2 <: 0;
            break;
 /*       case i_tdc.get_data() -> int x:
            x = data;
            break;
 */       }
    }
}
Attachments: scope_4.png (20 KiB), scope_5.png (19.92 KiB)
CousinItt
Respected Member
Posts: 360
Joined: Wed May 31, 2017 6:55 pm

Post by CousinItt »

OK, so we're now on 50 ns per div. Your top trace shows the cycle time, which is increasing when you add the get_data case because it can (and will) happen pretty much any time. You don't need to include the port read in the select statement - this is allowing the processor to deal with whatever event happens first. To force things into the right order you can do this:

Code:

   while(TRUE)
   {
      p_comms0 :> comms0;
      p_debug2 <: 1;
      sd03 = (SD03 ? 1 : 0);
      p_debug <: sd03;
      p_debug2 <: 0;

      select
      {
         case i_tdc.get_data() -> int x:
            x = data;
            break;
      }
   }
The input will block until data is available, and the sequence enforces one transfer per port read.
RedDave
Experienced Member
Posts: 77
Joined: Fri Oct 05, 2018 4:26 pm

Post by RedDave »

That's not the logic needed though. This is waiting for get_data() to be called after every bit read. If I wanted that it would be better to push the data from a client rather than notifying from a server.

Logic is...

Read incoming data.
When FRAME becomes low, then SD03 contains the most significant bit of a value. The next 31 clocks will clock in the rest of the value.
At this point, I copy out the value and call data_ready().
Wait for FRAME to go low again.

Client task has until the end of the next 32 bit value to get_data. Waiting for the get_data to have been called is likely to result in missing the first bit(s) of the next value. Especially in this initial test pattern case, where data is coming in continually.
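
The framing logic described above can be sketched as a small per-sample state machine. This is plain C for testing on a host, not the xC task itself. It assumes FRAME is bit 0 and SD03 is bit 2 of each four-bit sample (as in the earlier #defines) and that FRAME is active low. The names tdc_rx_t and tdc_rx_push are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define FRAME_MASK 0x1u  /* bit 0 of the sample, assumed active low */
#define SD03_MASK  0x4u  /* bit 2 of the sample */

typedef struct {
    int bit_pos;     /* next bit index to fill; -1 while waiting for FRAME */
    uint32_t value;  /* value being assembled, most significant bit first */
} tdc_rx_t;

/* Feed one four-bit sample. Returns true and writes *out when the 32nd
 * bit of a value has just been received. */
static bool tdc_rx_push(tdc_rx_t *rx, unsigned sample, uint32_t *out)
{
    uint32_t bit = (sample & SD03_MASK) ? 1u : 0u;

    if (rx->bit_pos < 0) {
        if (sample & FRAME_MASK)
            return false;            /* FRAME still high: keep waiting */
        rx->value = bit << 31;       /* FRAME low: this sample is the MSB */
        rx->bit_pos = 30;
        return false;
    }

    rx->value |= bit << rx->bit_pos;
    if (rx->bit_pos-- == 0) {        /* bit 0 received: value complete */
        *out = rx->value;
        return true;                 /* bit_pos is now -1: wait for FRAME */
    }
    return false;
}
```

The point where tdc_rx_push returns true corresponds to the "copy out the value and call data_ready()" step; everything else has to fit inside one sample period.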

A solution without a select may help if I use volatile unsafe memory to get the data into another task.
mon2
XCore Legend
Posts: 1913
Joined: Thu Jun 10, 2010 11:43 am

Post by mon2 »

Hi. Could you use:

Single Bit buffered clocked ports with a 32 bit depth?

Then mate the clock for such port(s) to your LCLKIN. Now, the respective buffered port will autonomously read in your 32 bits of data and you can dissect as required after the capture.
RedDave
Experienced Member
Posts: 77
Joined: Fri Oct 05, 2018 4:26 pm

Post by RedDave »

Single-bit ports would be easiest. I need to run three (or four) of these channels and am controlling various other hardware, all of which works best with single-bit ports, so these are in short supply.

So, I'm desperately trying to make this work with a multibit port.
RedDave
Experienced Member
Posts: 77
Joined: Fri Oct 05, 2018 4:26 pm

Post by RedDave »

I have something working...

I'm using volatile shared memory to get the data out.
It is all very tight. With a for loop rather than the eight "copies" of the bit check, it doesn't run quickly enough. The hard-coded lines mean that the bit shifts etc. are done at compile time and there is no need for the loop checking.

Thanks for your help.

If anyone spots any further efficiency improvements, I'd be glad to hear them. Ideally I'd be running two of these channels from each four-bit port, but that is unlikely to happen.

Having made this work, I am now going to see whether I can find enough spare one bit ports to do it the best way.

Code:

// -------------------------------------------
#define READ_PORT(val)   p_comms0 :> val
#define REPORT_DATA(val)   unsafe {*p_comms = val;}
// -------------------------------------------

#define FRAME_ACTIVE_HIGH   (0)

// FRAME is bit 0 of nibble N in the 32-bit port word
#define FRAME_BIT(N)    (0x01 << ((N)*4))

#if FRAME_ACTIVE_HIGH
#define FRAME_NONE  (0)
#define FRAME(val, N)   ((val) & FRAME_BIT(N))
#else
#define FRAME_NONE  FRAME_BIT_MOST_RECENT_4
#define FRAME(val, N)   (!((val) & FRAME_BIT(N)))
#endif

// SD03 is bit 2 of nibble N in the 32-bit port word
#define SD03_BIT(N)     (0x04 << ((N)*4))
#define SD03(val, N)    ((val) & SD03_BIT(N))

#define CHECK_NYBBLE(port_read, n)\
        if (bit_pos == -1)                                      \
        {                                                       \
            if (FRAME(port_read, n))                            \
            {                                                   \
                REPORT_DATA(data);                              \
                bit_pos = 30;                                   \
                data = SD03(port_read, n) ? (0x1 << 31) : 0;    \
            }                                                   \
        }                                                       \
        else                                                    \
        {                                                       \
            data |= SD03(port_read, n) ? (0x1 << bit_pos) : 0;  \
            bit_pos--;                                          \
        }

void tdc_loop()
{
    volatile int * unsafe p_comms;
    unsafe {
        p_comms = &g_comms;
    }

    int port_read;
    int data = 1;
    int bit_pos = -1;

    // Read until most recent is not frame
    do
    {
        READ_PORT(port_read);
        unsafe {*p_comms = port_read;}
    } while (FRAME(port_read, 7));

    while(TRUE)
    {
        READ_PORT(port_read);

        p_debug2 <: 1;
        CHECK_NYBBLE(port_read, 0)
        CHECK_NYBBLE(port_read, 1)
        CHECK_NYBBLE(port_read, 2)
        CHECK_NYBBLE(port_read, 3)
        CHECK_NYBBLE(port_read, 4)
        CHECK_NYBBLE(port_read, 5)
        CHECK_NYBBLE(port_read, 6)
        CHECK_NYBBLE(port_read, 7)
        p_debug2 <: 0;
    }
}
CousinItt
Respected Member
Posts: 360
Joined: Wed May 31, 2017 6:55 pm

Post by CousinItt »

OK, if you really have to use a four-bit port and are receiving 32 bits on all of the pins, you should increase the buffer size to 32 bits simply to give yourself maximum time for processing. You can use a conditional trigger* on your frame signal to start reading the port; see section 2.5 in the XS1 ports document. You may need to flush your buffer before starting, and again if you ever lose sync. Each sample will contain 4 bits, so you will need to read the port four times to get all four blocks of eight samples, which you can hand over to your client over the interface, e.g. using a movable pointer. This minimises the overhead in your receiver task in case it gets another frame immediately.

Your client can then separate the interleaved samples following the method shown in AN10129.
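
For reference, a plain bit-by-bit way to separate one pin's data from the four port words is sketched below in host-side C. This is not the method from AN10129, which is not reproduced here. It assumes the oldest sample sits in the least significant nibble of each word (the order the CHECK_NYBBLE code above walks them in) and that data arrives most significant bit first. The name extract_pin is made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: rebuild the 32-bit value carried on one pin of a
 * four-bit port from four 32-bit port words (8 samples per word).
 * words[0] is the first word read after the frame trigger. */
static uint32_t extract_pin(const uint32_t words[4], unsigned pin)
{
    uint32_t value = 0;

    for (unsigned w = 0; w < 4; w++) {
        for (unsigned n = 0; n < 8; n++) {
            /* sample n of word w lives in nibble n; pick out this pin's bit */
            uint32_t bit = (words[w] >> (4u * n + pin)) & 1u;
            value = (value << 1) | bit;  /* MSB arrives first */
        }
    }
    return value;
}
```

Because this runs in the client, its cost comes out of the client's time rather than the receiver's, which is the point of handing over the raw words.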

HTH

[EDIT:] * Assuming no other signals change before the frame signal does.
RedDave
Experienced Member
Posts: 77
Joined: Fri Oct 05, 2018 4:26 pm

Post by RedDave »

Sorry, I only included the functional snippet of code. This is the code for a 32-bit buffer. Making it work without that would not have been in any way feasible!

Code:

on tile[0] : in buffered port:32 p_comms0 = XS1_PORT_4C;