Strange assembly generated code depending on C/xC context

azs · Post by **azs** » Wed Feb 28, 2024 10:46 am

Hi there !

I'm working on an Rx/Tx interface between an FPGA and XMOS (XEF232).
This is a custom interface, I write both FPGA and XMOS code, but it's mainly like a MII interface : a 8-bits bus, a data_valid and a clock (for one direction, same for the other).
On my evaluation board, I have access only to 4-bit bus due to wiring between devices, but that's not the deal here. This is just to set the hardware context.

Let's focus on the Rx part (i.e. from FPGA to XMOS), as Tx seems to work pretty well.
The idea is to have a task that continuously waits for input data with strobing (configure_in_port_strobed_slave) in a select. When data_valid goes IDLE, end of transmission, I then call a swap_buf interface function.

Here is the concept code (the busy pin is for flow control, just ignore it):

void Eth_IF_Receiver (client interface ETH_bufswap i_eth,
        uint32_t  rx_buffer[],
        in buffered port:32 p_data,
        clock clk,
        in port p_ready_in,
        out port p_busy_out )
{
       {
        uint32_t rcv_ptr = 0;
        uint32_t * l_rx = &rx_buffer[0];

        configure_in_port_strobed_slave(p_data, p_ready_in, clk);
        start_clock(clk);
        p_busy_out <: 0;

        while (1) {
            select {

            case p_data :> uint32_t data :
                l_rx[rcv_ptr] = data ;
                if (rcv_ptr < (ETH_BUF_SIZE-1))
                    rcv_ptr++;
                break ;

            case p_ready_in when pinseq(0x0) :> void: // Detect the IDLE state
                if (rcv_ptr != 0)
                {
                    p_busy_out <: 1;
                    i_eth.swap(l_rx, rcv_ptr);
                    rcv_ptr = 0;
                    p_busy_out <: 0;
                }
                break;
            }
        }
    }
}

It is obvious that the clock rate from the FPGA should be slow enough to not loose a word in the "case p_data :> uint32_t data :"
And here is the deal ! This part is really time critical, and thus depends on the generate assembly code. The more instruction generated for this case statement, the slower I should run the FPGA.

With such code, I understand that Xtime compiler makes a lot of control on "l_rx[rcv_ptr] = data" to avoid any out of bound write in the memory. It cannot be done at compilation time, so it is done at runtime, inserting many assemby instructions. This is fair, but not optimized for performance :)
First idea was to declare the buffer pointer as "unsafe". This worked very well, as now assembly code was really minimalist, ensuring a high clock rate.

But as I need to transfer the buffer pointer to another task with the move function, I need to declare the pointer as movable : uint32_t * movable l_rx = &rx_buffer[0];
Doing so, the "l_rx[rcv_ptr] = data" instruction is now again assembled with way to many instructions !
The only way I found to bypass this issue was to use a direct assembly code instead of C code :

asm("stw %0, %1[%2]" :: "r"(data), "r"(l_rx), "r"(rcv_ptr)); // <-> l_rx[rcv_ptr] = data ;

To check validity of my reception, I added printf before the swap task.
Here is the code :

void Eth_IF_Receiver (client interface ETH_bufswap i_eth,
        uint32_t  rx_buffer[],
        in buffered port:32 p_data,
        clock clk,
        in port p_ready_in,
        out port p_busy_out )
{
        uint32_t rcv_ptr = 0;
        uint32_t * movable  l_rx = &rx_buffer[0];


        configure_in_port_strobed_slave(p_data, p_ready_in, clk);
        start_clock(clk);
        p_busy_out <: 0;

        while (1) {
            select {

            case p_data :> uint32_t data :
                asm("stw %0, %1[%2]" :: "r"(data), "r"(l_rx), "r"(rcv_ptr));
                if (rcv_ptr < (ETH_BUF_SIZE-1))
                    rcv_ptr++;
                break ;

            case p_ready_in when pinseq(0x0) :> void: // Detect the IDLE state
                if (rcv_ptr != 0)
                {
                    p_busy_out <: 1;
                    for ( unsigned idx = 0 ; idx < rcv_ptr ; idx ++){
                        printf(.......);
                    }
                    i_eth.swap(l_rx, rcv_ptr);
                    rcv_ptr = 0;
                    p_busy_out <: 0;
                }
                break;
            }
        }
    }
}

Now here is the funny thing and the real issue !!
With this exact code, everything is fine, works at a high clock rate adjusted regarding number of assembly lines generated in the reception case statement.
But if I remove the printf() FOR loop, or even if I declare the "idx" variable OUTSIDE the for loop, the assembly generated code gets completely weird !!! The select statement is replicated 4 times (but with slight variations), and ruins my clock rate performance, as now the reception loop depends on too many assembly instructions to follow the rythm.

For my system to work at a proper clock speed, I need to leave an empty "for" loop with the index variable declared inside the loop. Any other way (no loop, as I do not need to printf or do anything here, or a loop with an index variable declared outside of it) leads to a complicated assembly code that ruins performance.

Any one can have an idea on what happens here ??
Many thanks for highlight and help on this topic.

Cheers,

fabriceo · Post by **fabriceo** » Wed Feb 28, 2024 6:01 pm

Hello!
when you remove a printf and the code becomes "hectic", from my humble experience it means the optimization level as something to do with your problem :)
printing something will tell the compiler that something is used and then it will not remove some piece of code. Also due to the fact that you print something, the compiler need to effectively create the code for computing something at a certain point in time and this might completely change the optimization behavior...

typically, with your asm setw instruction as it is written, the compiler believes you do nothing in memory ! try to add a 3rd semicolon : "memory" before the ")" and see what this change.

also you can replace the asm setw by a

Code: Select all

volatile unsigned * unsafe ptr = &rx_buffer[0];
and then use instead 
unsafe { ptr[rcv_ptr] = data; }

The compiler will appreciate that and lets see what happens.

also don't hesitate to use the #pragma unsafe arrays, if you know what you do, just before your function/task then the compiler will not generate the bound checks.

hope this helps

azs · Post by **azs** » Thu Feb 29, 2024 10:52 am

Thanks fabriceo for this feedback.
Sure there are some interesting things in your reply.

But I will clarify my issue, as I understand that this is not easy to understand :D
The change in assembly code is not directly related to the printf. Of course, with our without printf, different code is generated, but that's not the issue. I will try to clarify.

The big deal is here :

Case #1 :

case p_ready_in when pinseq(0x0) :> void: // Detect the IDLE state
                if (rcv_ptr != 0)
                {
                    p_busy_out <: 1;
                    for (unsigned idx = 0 ; idx < rcv_ptr ; idx++)
                    {
                       ;
                    }
                    i_eth.swap(l_rx, rcv_ptr);
                    rcv_ptr = 0;
                    p_busy_out <: 0;
                }
                break;

Case #2 :

case p_ready_in when pinseq(0x0) :> void: // Detect the IDLE state
                if (rcv_ptr != 0)
                {
                    p_busy_out <: 1;
                    for (idx = 0 ; idx < rcv_ptr ; idx++)
                    {
                       ;
                    }
                    i_eth.swap(l_rx, rcv_ptr);
                    rcv_ptr = 0;
                    p_busy_out <: 0;
                }
                break;

In the 2 cases, the for loop is empty and does nothing.
The only difference is that in case #1 the inde is declared "inside" the loop, and in case #2, it is declared outside, at the begining of the function.

Case #1 generates a "short" assembly code, the "select" code is generated only one time.
Case #2 generates a "long" assembly code, where the "select" code is replicated 4 times !

However, I will try the "memory" tip in inline asm code (I have read about this but was not sure how to use it).
I will also retry the unsafe tip, but as my pointer needs to be "movable", I'm not sure it will work ?
I tried to declare 2 pointers on the same buffer few days ago, one unsafe and one movable. The assembly code was okay (no bound control code) but the "move" function seems to not work properly, as my ping-pong buffer seems to not swap.

Thanks,

azs · Post by **azs** » Thu Feb 29, 2024 11:18 am

To add a quick feedback, I confirm that working in xC with unsafe pointer, instead of inline assembly, does not work, as I assume it's in conflict with the movable attribute. My buffer swap function does not work in this case.
Adding :"memory" in the inline assembly does not change the global behaviour.
Assembly generated code around this particular line is still okay (same as before), but same issue for the "global" generated code : still depends of the stupid empty "for" loop and its declaration.

Will call Mulder and Scully......

Strange assembly generated code depending on C/xC context

Strange assembly generated code depending on C/xC context

Re: Strange assembly generated code depending on C/xC context

Re: Strange assembly generated code depending on C/xC context

Re: Strange assembly generated code depending on C/xC context