XMOS Port Max Sampling Rate

Technical discussions around xCORE processors (e.g. xcore-200 & xcore.ai).
Jcvc
Member++
Posts: 18
Joined: Wed May 07, 2025 11:13 pm

Post by Jcvc »

Hi,

I'm developing an application on the XU-316 in which I want to sample data from 1 to 4 pins at as high a sample rate as possible (ideally at least 100MHz). The sampled data is then stored in a buffer (or buffers, depending on processing time) for on-chip processing.

To that end, I've started evaluating the sampling rate I can achieve from a single input port. The challenge I'm facing is that reading from the input port and incrementing the buffer position limits the sampling rate much more than I expected.

A snapshot of my sample code is below:

Code:

    ...
    port_enable(p_sampler);
    clock_enable(samplerClk);
    // Using the reference clock for now: I also tried the 600MHz
    // core clock and the result was the same.
    clock_set_source_clk_ref(samplerClk);
    //clock_set_divide(samplerClk, 0);
    port_set_clock(p_sampler, samplerClk);
    clock_start(samplerClk);

    timeStampStart = get_reference_time();
    while (buffer_pos < BUFFER_SIZE)
    {
        buffer0[buffer_pos++] = port_in(p_sampler);
    }
    timeStampFinish = get_reference_time();
    ...

When I take the number of ticks (10ns each) spent in the loop and work out the corresponding frequency, I can't get beyond ~6MHz. Each read plus increment seems to take ~150ns, which seems quite high to me.
Any thoughts on what I could do to improve the sampling rate several-fold? What is the expected maximum sampling rate of an input port?
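
For reference, the arithmetic behind the ~150ns and ~6MHz figures can be written out as below. This is only a host-side sketch in plain C: the helper names `ns_per_iteration` and `rate_mhz` are hypothetical, and the 10ns tick length is the one quoted in the post.

```c
#include <assert.h>
#include <stdint.h>

/* Reference-timer tick length in ns (10ns per tick, as quoted in the post). */
#define REF_TICK_NS 10.0

/* Average time per loop iteration, in ns, from two timer readings. */
double ns_per_iteration(uint32_t start_ticks, uint32_t finish_ticks,
                        uint32_t iterations)
{
    assert(iterations > 0);
    /* Unsigned subtraction stays correct if the 32-bit timer wrapped once. */
    uint32_t elapsed = finish_ticks - start_ticks;
    return (elapsed * REF_TICK_NS) / iterations;
}

/* Convert a period in ns to a rate in MHz. */
double rate_mhz(double period_ns)
{
    assert(period_ns > 0.0);
    return 1000.0 / period_ns;
}
```

As an illustration, 15000 ticks over 1000 iterations works out to 150ns per iteration, i.e. a little under 6.7MHz.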
xhuw
Verified
Active Member
Posts: 57
Joined: Wed May 22, 2024 2:36 pm

Post by xhuw »

You probably want to look into enabling the serdes in the port. Have a look at https://www.xmos.com/documentation/XM-0 ... t_buffered and the corresponding parts of the architecture manual. You can configure the port to shift data into the serdes on its own clock, so the processor needs to service the port (port_in) much less frequently, which will maximise your read rate.
XMOS Software Engineer

Post by Jcvc »

Thanks for replying so quickly :)
Yup, I had tried it, but unfortunately I'm still bottlenecked by the single read operation, which has to execute whether I use a 1, 4, 8 or 32-bit port. This read plus increment still takes ~150ns, so every time I read from the port I lose the samples that arrive during those ~150ns.
Joe
Verified
Experienced Member
Posts: 118
Joined: Sun Dec 13, 2009 1:12 am

Post by Joe »

You would need to increase the transfer width of the port to maximise performance. Just declaring a port as buffered isn't enough - the transfer width defaults to the width of the port.

The transfer width is the width of the data you will IN/OUT from the port.

Setting the transfer width to 32 bits will maximise performance as you are transferring bigger chunks of serialised data so need to do so less often. For a 1-bit port running at 100MHz with a 32 bit transfer size you would only need to do INs at a rate of 100/32 = 3.125MHz.

This is the key to the buffered mode of the ports in that they offload all the fast data transferring from the processor allowing it to do other things while the port is busy moving data around.

For a 4-bit port, the data rate is 4x higher so the INs will have to be at 12.5MHz for this example. Still perfectly possible.
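
The two figures above (3.125MHz and 12.5MHz) follow from one small formula, sketched here in plain C. The function name `required_in_rate_mhz` is hypothetical, and the sketch assumes the transfer width is a whole multiple of the port width.

```c
#include <assert.h>

/*
 * Average rate (in MHz) at which INs must be issued so that no word is dropped.
 *   port_clock_mhz : rate at which the port samples its pins
 *   port_width     : number of pins on the port (1, 4, 8, ...)
 *   transfer_width : number of bits moved by each IN
 */
double required_in_rate_mhz(double port_clock_mhz, unsigned port_width,
                            unsigned transfer_width)
{
    assert(port_width > 0 && transfer_width >= port_width);
    /* Each port clock captures port_width bits; one IN drains transfer_width bits. */
    unsigned samples_per_in = transfer_width / port_width;
    return port_clock_mhz / samples_per_in;
}
```

With a transfer width equal to the port width (the default mentioned above), the same formula gives one IN per port clock, which is why the unbuffered loop cannot keep up.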

I've sampled data from a 1-bit port at over 200MHz for some testing. The only limitation is how fast you can run your IN, STORE, IN, STORE loop.

You should be fine at the speeds you're talking about but to get close to absolute maximum performance you might need to unroll the loops (#pragma unroll) or at least check the disassembly and/or xsim to check the timing.

For your case you'd use e.g. port_start_buffered(p_sampler, 32); to set the transfer width to 32.

A final note: because the port has two registers (the 32 bit shift register transfers to the 32 bit transfer register when full), you don't have to do an IN at perfectly regular intervals. You can be a bit late with an IN every so often; as long as you do two INs within two cycles you won't lose data (a cycle here being the time the shift register takes to fill). That is, after an IN the transfer register is empty while you wait for the shift register to fill (1 cycle). Its contents are then moved to the transfer register and won't be overwritten for another cycle, until the shift register has filled again. This gives you a bit of slack for extra instructions such as loop counters.
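
The slack described above can be explored with a small timing model. This is a simplified sketch in plain C, not a cycle-accurate description of the hardware: the function name `first_lost_word`, the `gap` array and the exact deadline rule are assumptions made for illustration, based only on the two-register description in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified model (assumption) of a port with a shift register and a
 * transfer register:
 *   - word j reaches the transfer register at time (j + 1) * fill_time;
 *   - it is lost if it has not been read by time (j + 2) * fill_time,
 *     when the next word is ready to replace it;
 *   - an IN issued early blocks until its word is available.
 * gap[j] is the time the thread spends on other instructions before IN j.
 * Returns the index of the first lost word, or -1 if no word is lost.
 */
long first_lost_word(const uint32_t *gap, size_t n_words, uint32_t fill_time)
{
    assert(fill_time > 0);
    uint64_t now = 0; /* time at which the previous IN completed */
    for (size_t j = 0; j < n_words; j++) {
        uint64_t available = (uint64_t)(j + 1) * fill_time;
        uint64_t deadline = available + fill_time;
        uint64_t issue = now + gap[j];
        if (issue >= deadline) {
            return (long)j; /* word j was replaced before this IN ran */
        }
        now = (issue > available) ? issue : available; /* IN blocks if early */
    }
    return -1;
}
```

In this model, a single late IN is absorbed as long as it still arrives within one fill time of the word becoming available, whereas a loop that is consistently slower than the fill time eventually drops a word.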

Joe
XMOS hardware grey beard.

Post by Jcvc »

Joe, thank you very much for your detailed answer.

Joe wrote: "I've sampled data from a 1-bit port at over 200MHz for some testing. The only limitation is how fast you can run your IN, STORE, IN, STORE loop."

This is my problem at the moment (unless there's some other problem that I'm not picking up). My read is as simple as 'buffer[buffer_pos++] = port_in(p_sampler);', but I'm losing samples while it executes.
I had assumed that the processor reading from the input port's buffer would not stop the port from capturing samples. However, I lose the samples that arrive during the time the processor takes to execute port_in(), which seems to be quite long.

Joe wrote: "You can be a bit late every so often with the INs as long as you do two INs within two cycles you won't lose data"

Does that mean that a port_in() should effectively take only 1 clock cycle? That would be ~1.67ns, which is not quite what I'm observing.

I'm on the XU-316, which 'limits' me to a 600MHz internal clock, so I can only clock the input port at up to 300MHz. To prevent aliasing (Nyquist theorem), that means I can only sample signals with a maximum frequency of 150MHz.

Post by Joe »

The IN instruction does run in a single cycle. The IN instruction blocks until the transfer register is full (meanwhile the shift register is filling with data).

If you are losing samples then it's because the INs are happening too slowly: if you don't do an IN to read the transfer register before the shift register fills again, that data is lost.

XU316 does come in an 800MHz version also.

Remember the thread (core) speed is different to the processor speed. If you are using a 600MHz part with 5 threads or fewer, the thread speed is 600/5 = 120MHz. With more than 5 threads, the thread speed is 600/thread count, so down to 75MHz with 8 threads.

The thread speed is the rate at which instructions will run.
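
The rule above can be written as a one-line calculation. The sketch below is plain C with a hypothetical function name, `thread_rate_mhz`; the upper bound of 8 threads is taken from the example in this post and is an assumption of the sketch.

```c
#include <assert.h>

/*
 * Per-thread instruction rate in MHz, following the rule described above:
 * core clock / 5 for up to five threads, core clock / thread count beyond that.
 */
double thread_rate_mhz(double core_clock_mhz, unsigned active_threads)
{
    /* Assumption: between 1 and 8 threads, as in the example above. */
    assert(active_threads >= 1 && active_threads <= 8);
    unsigned divisor = (active_threads < 5) ? 5 : active_threads;
    return core_clock_mhz / divisor;
}
```

For a 600MHz part this gives 120MHz for one to five threads and 75MHz for eight; the same rule applied to an 800MHz part gives 160MHz for a single thread.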

You will have no problem sampling at 100MHz (clocking the port at 100MHz) with a buffered 1-bit port with 32 bit transfer size.

What are you using as the stimulus for your 1-bit port?

Post by Joe »

I would check the disassembly with xobjdump -D <your_binary.xe> and see what your loop looks like. Sometimes the compiler can add in array bounds checking which would compromise the timing.

Post by Jcvc »

Joe wrote: "The IN instruction does run in a single cycle. The IN instruction blocks until the transfer register is full (meanwhile the shift register is filling with data)."

Thank you, this helps me understand the port read process!

Joe wrote: "XU316 does come in an 800MHz version also."

Actually, my bad: I'm using the EVK for this test, and it uses the C32, so it's the 800MHz part. I was using the MC-316 this morning, which is 600MHz, and assumed the EVK was the same, my apologies! But yeah, this means I can push a bit further.

Joe wrote: "Remember the thread(core) speed is different to the processor speed. If you are using a 600MHz part and using 5 threads or less then the thread speed is 600/5 = 120MHz. If using more than 5 threads, the thread speed is 600/thread count so down to 75MHz with 8 threads. The thread speed is the rate at which instructions will run."

I'm still running on a single thread. For proof-of-concept purposes (and to test the limits), I'm just running this from the main function, without any parallel statements.

Joe wrote: "What are you using as the stimulus for your 1-bit port?"

I'm using the xCORE clock, configured as below (clocking at 400MHz, not the 300MHz I mentioned in an earlier comment):

Code:

    clock_set_source_clk_xcore(samplerClk);
    clock_set_divide(samplerClk, 1);

That should indeed allow me to sample at up to ~200MHz.

Now, a few improvements based on the suggestions in your previous comment: adding '#pragma unroll' has indeed raised the rate at which I can sample the signal. I can currently sample a signal of up to ~50MHz without losing bits between port ins. I still need to test slightly higher frequencies, but if I go straight to 100MHz I see data bits going missing.

Joe wrote: "I would check the disassembly with xobjdump -D <your_binary.xe> and see what your loop looks like. Sometimes the compiler can add in array bounds checking which would compromise the timing."

Thanks for the suggestion. I'll look into this in more detail this afternoon, but at first glance the copy operation itself does take 1 cycle, whereas port_in takes 6 clock cycles, with some padding as well?

Code:

<port_in>:
             0x0008039c: ff 17:       nop (0r)
             0x0008039e: 80 7f:       dualentsp (u6)  0x0
             0x000803a0: c0 b6:       in (2r)         r0, res[r0]
             0x000803a2: ff 17:       nop (0r)
             0x000803a4: ff 17:       nop (0r)
             0x000803a6: c0 77:       retsp (u6)      0x0

The 6 cycles would explain why I can currently sample a 50MHz signal but not a 100MHz one: 6 cycles at an 800MHz clock means the maximum sample rate would be ~133.3MHz.

I'll carry on debugging and, once again, thank you Joe, I can't thank you enough!

Post by Joe »

One misconception I can correct here quickly: The thread speed won't go above core clock/5 even if you only have one thread running.

The disassembly looks pretty inefficient. At the highest speed it should just be an unrolled loop of IN and store to memory instructions with the address in memory hardcoded.
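
To put a number on how tight the loop has to be, the sketch below estimates how many instruction issue slots a thread gets per transferred word. It is plain host-side C with a hypothetical function name, `slots_per_word`, and it assumes the thread issues at core clock/5 as described in this post.

```c
#include <assert.h>

/*
 * Approximate number of instruction issue slots available to one thread
 * for each word the port delivers.
 * Assumption: the thread issues at core_clock / 5.
 */
double slots_per_word(double core_clock_mhz, double port_clock_mhz,
                      unsigned port_width, unsigned transfer_width)
{
    assert(port_clock_mhz > 0.0 && port_width > 0 &&
           transfer_width >= port_width);
    double thread_rate_mhz = core_clock_mhz / 5.0;
    /* Words delivered per microsecond: bits captured per us / bits per word. */
    double words_per_us = port_clock_mhz * port_width / transfer_width;
    return thread_rate_mhz / words_per_us;
}
```

Under these assumptions, an 800MHz part sampling a 1-bit port at 100MHz gets about 51 slots per word with a 32-bit transfer width, about 13 for a 4-bit port, and fewer than 2 when the transfer width is left at 1 bit, which is less than the six instructions in the port_in disassembly posted earlier.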

Post by Jcvc »

Joe wrote: "One misconception I can correct here quickly: The thread speed won't go above core clock/5 even if you only have one thread running."

Ohh ok, I didn't know that, thank you Joe for the information :) Fortunately I hadn't needed it before xD
I had the idea that it could be at least 300/400MHz (depending on the chip clock) because of the first paragraph here: https://www.xmos.com/documentation/XM-0 ... lock-rates.

Joe wrote: "The disassembly looks pretty inefficient. At the highest speed it should just be an unrolled loop of IN and store to memory instructions with the address in memory hardcoded."

Yeah, I tried changing it to a for loop instead, but it made no difference. I'll see if I can improve it.