I'm working on an Rx/Tx interface between an FPGA and XMOS (XEF232).
This is a custom interface, I write both FPGA and XMOS code, but it's mainly like a MII interface : a 8-bits bus, a data_valid and a clock (for one direction, same for the other).
On my evaluation board, I have access only to 4-bit bus due to wiring between devices, but that's not the deal here. This is just to set the hardware context.
Let's focus on the Rx part (i.e. from FPGA to XMOS), as Tx seems to work pretty well.
The idea is to have a task that continuously waits for input data with strobing (configure_in_port_strobed_slave) in a select. When data_valid goes IDLE, end of transmission, I then call a swap_buf interface function.
Here is the concept code (the busy pin is for flow control, just ignore it):
void Eth_IF_Receiver (client interface ETH_bufswap i_eth, uint32_t rx_buffer[], in buffered port:32 p_data, clock clk, in port p_ready_in, out port p_busy_out ) { { uint32_t rcv_ptr = 0; uint32_t * l_rx = &rx_buffer[0]; configure_in_port_strobed_slave(p_data, p_ready_in, clk); start_clock(clk); p_busy_out <: 0; while (1) { select { case p_data :> uint32_t data : l_rx[rcv_ptr] = data ; if (rcv_ptr < (ETH_BUF_SIZE-1)) rcv_ptr++; break ; case p_ready_in when pinseq(0x0) :> void: // Detect the IDLE state if (rcv_ptr != 0) { p_busy_out <: 1; i_eth.swap(l_rx, rcv_ptr); rcv_ptr = 0; p_busy_out <: 0; } break; } } } }It is obvious that the clock rate from the FPGA should be slow enough to not loose a word in the "case p_data :> uint32_t data :"
And here is the deal ! This part is really time critical, and thus depends on the generate assembly code. The more instruction generated for this case statement, the slower I should run the FPGA.
With such code, I understand that Xtime compiler makes a lot of control on "l_rx[rcv_ptr] = data" to avoid any out of bound write in the memory. It cannot be done at compilation time, so it is done at runtime, inserting many assemby instructions. This is fair, but not optimized for performance :)
First idea was to declare the buffer pointer as "unsafe". This worked very well, as now assembly code was really minimalist, ensuring a high clock rate.
But as I need to transfer the buffer pointer to another task with the move function, I need to declare the pointer as movable : uint32_t * movable l_rx = &rx_buffer[0];
Doing so, the "l_rx[rcv_ptr] = data" instruction is now again assembled with way to many instructions !
The only way I found to bypass this issue was to use a direct assembly code instead of C code :
asm("stw %0, %1[%2]" :: "r"(data), "r"(l_rx), "r"(rcv_ptr)); // <-> l_rx[rcv_ptr] = data ;
To check validity of my reception, I added printf before the swap task.
Here is the code :
void Eth_IF_Receiver (client interface ETH_bufswap i_eth, uint32_t rx_buffer[], in buffered port:32 p_data, clock clk, in port p_ready_in, out port p_busy_out ) { uint32_t rcv_ptr = 0; uint32_t * movable l_rx = &rx_buffer[0]; configure_in_port_strobed_slave(p_data, p_ready_in, clk); start_clock(clk); p_busy_out <: 0; while (1) { select { case p_data :> uint32_t data : asm("stw %0, %1[%2]" :: "r"(data), "r"(l_rx), "r"(rcv_ptr)); if (rcv_ptr < (ETH_BUF_SIZE-1)) rcv_ptr++; break ; case p_ready_in when pinseq(0x0) :> void: // Detect the IDLE state if (rcv_ptr != 0) { p_busy_out <: 1; for ( unsigned idx = 0 ; idx < rcv_ptr ; idx ++){ printf(.......); } i_eth.swap(l_rx, rcv_ptr); rcv_ptr = 0; p_busy_out <: 0; } break; } } } }
Now here is the funny thing and the real issue !!
With this exact code, everything is fine, works at a high clock rate adjusted regarding number of assembly lines generated in the reception case statement.
But if I remove the printf() FOR loop, or even if I declare the "idx" variable OUTSIDE the for loop, the assembly generated code gets completely weird !!! The select statement is replicated 4 times (but with slight variations), and ruins my clock rate performance, as now the reception loop depends on too many assembly instructions to follow the rythm.
For my system to work at a proper clock speed, I need to leave an empty "for" loop with the index variable declared inside the loop. Any other way (no loop, as I do not need to printf or do anything here, or a loop with an index variable declared outside of it) leads to a complicated assembly code that ruins performance.
Any one can have an idea on what happens here ??
Many thanks for highlight and help on this topic.
Cheers,