RGMII lib_ethernet crashes (XCore200)
Posted: Wed May 03, 2017 9:21 am
Hi all,
I'm currently debugging a very strange problem and now think I'm finally at a point where I need your help:
I have a small board featuring an XMOS XCore200 (XE232-1024), a Gigabit Ethernet PHY connected to the RGMII interface at Tile 3 and another 100 Mbit PHY connected via MII to Tile 2. There are also some other distributable tasks, running at core 3, which operate some I/Os via lib_gpio and are bound to Tile 3. The software (mainly also running on Tile 2) basically forwards Ethernet frames from one interface to the other (L2 bridge), and that part works. So far so good.
Sporadically, especially at "not too low" rates (~ 20 Mbit full duplex), the CPU crashes. I thought it might be a race condition of some kind (it happens mainly if the rates per direction are slightly different), triggered when two packets in RX and TX at one of the interfaces have a certain, unlucky time correlation. Trying to track down the problem using the XTAG debugger, I was quite shocked to see that the crashes are caused by "random" ET_LOAD_STORE, ET_ILLEGAL_RESOURCE or even ET_ILLEGAL_PC CPU exceptions within the RGMII TX thread (Tile 3, Core 0). It looks like the code execution is interrupted and afterwards continued with modified registers. I was lucky to find a very good example of that (see screenshot attached):
The ET_LOAD_STORE (unaligned memory access) happens at rgmii_buffering.xc:671 at the instruction
ldw r3, r2[0x00]
Indeed, r2 = 523403 is not aligned. Let's go back by a few instructions. The history leading to that point is
[1] shl r2, r1, 0x4 // r2 = r1 * 16
[2] add r2, r8, r2 // r2 += r8
[3] ldw r3, r2[0x00] // r3 = *r2
The relevant registers:
r1 = 0
r8 = 523400
With these register values, there is no way to get r2 = 523403: instruction [1] should leave r2 = 0, so instruction [2] should give r2 = 523400. I therefore suspect that the code execution was interrupted between instructions [1] and [2], leaving r2 with a value of 3 (523403 - 523400) before execution continued at instruction [2].
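To make the arithmetic explicit, here is a small C sketch of the same address calculation. It is only a model of the three instructions above, not code from lib_ethernet, and the function names are made up for illustration.

```c
#include <stdint.h>

/* Models instructions [1] and [2]: r2 = r8 + (r1 << 4).
   Hypothetical helper, not part of lib_ethernet. */
static uint32_t expected_address(uint32_t r1, uint32_t r8)
{
    uint32_t r2 = r1 << 4;   /* [1] shl r2, r1, 0x4 */
    r2 = r8 + r2;            /* [2] add r2, r8, r2  */
    return r2;
}

/* ldw loads a 32-bit word, so the address must be a multiple of 4. */
static int is_word_aligned(uint32_t addr)
{
    return (addr & 0x3u) == 0;
}

/* Given the r2 observed at the faulting ldw and the r8 operand,
   returns the value r2 must have held when [2] executed. */
static uint32_t implied_r2_before_add(uint32_t observed_r2, uint32_t r8)
{
    return observed_r2 - r8;
}
```

With r1 = 0 and r8 = 523400, expected_address() returns 523400, which is word aligned. The observed 523403 is not, and implied_r2_before_add(523403, 523400) returns 3, the value r2 appears to have picked up between [1] and [2].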
Of course, the exception does not always happen at this exact point, only somewhere around it, but this example shows very clearly how it happens. At other points, the relevant registers have a much longer history, so it is not as easy to trace back.
As there are no interrupts active on that core (the only interrupts I could find are the RX_DONE and RX_ERROR interrupts within the RGMII RX LLD, running on a different core on the same tile), I don't understand how this can happen. I have already tried disabling XScope, debug messages and everything else I could think of. In the end, I think that the RX and TX threads might somehow still interfere.
Does anybody have any idea how I can find out what exactly is causing these problems?
EDIT:
I've removed all the GPIO tasks running at Tile 3, such that the only tasks there are now
rgmii_ethernet_mac
rgmii_ethernet_mac_config
Still, the problem persists.
Thank you very much and have a nice day!
Max