RGMII lib_ethernet crashes (XCore200)

maxwinkel · Post by **maxwinkel** » Wed May 03, 2017 9:21 am

Hi all,

I'm currently debugging a very strange problem and now think I'm finally at a point where I need your help:

I have a small board featuring an XMOS XCore200 (XE232-1024), a Gigabit Ethernet PHY connected to the RGMII interface at Tile 3 and another 100 Mbit PHY connected via MII to Tile 2. There are some other distributable tasks running at core 3 which operate some IOs via lib_gpio which are bound to Tile 3. The software (mainly also running on Tile 2) in principle forwards Ethernet frames from one interface to another (L2 bridge), which is working. So far so good.

Sporadically, especially at "not too low" rates (~ 20 Mbit full duplex), the CPU crashes. I thought it might be a race condition (it happens mainly when the rates per direction are slightly different), triggered when two packets in RX and TX at one of the interfaces have a certain unlucky time correlation. Trying to track down the problem using the XTAG debugger, I was quite shocked to see that the crashes are caused by "random" ET_LOAD_STORE, ET_ILLEGAL_RESOURCE or even ET_ILLEGAL_PC CPU exceptions within the RGMII TX thread (Tile 3, Core 0). It looks like the code execution is interrupted and afterwards continued with modified registers. I was lucky to find a very good example of that (see screenshot attached):

ET_LOAD_STORE_1.PNG

The ET_LOAD_STORE (unaligned memory access) happens at rgmii_buffering.xc:671 at the instruction

ldw r3, r2[0x00]

Indeed, r2 = 523403 is not aligned. Let's go back by a few instructions. The history leading to that point is

[1] shl r2, r1, 0x4 // r2 = r1 * 16
[2] add r2, r8, r2 // r2 += r8
[3] ldw r3, r2[0x00] // r3 = *r2

The relevant registers:
r1 = 0
r8 = 523400

With these register values, there is no chance to get r2 = 523403. I therefore suspect, that the code execution was interrupted between instructions [1] and [2], leaving r2 with a value of 3 before continuing execution at instruction [2].
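The arithmetic above can be replayed in a few lines of Python. This is only a sketch of the reasoning: the `replay` helper and its `r2_after_shl` parameter are hypothetical names, and the override models the suspected corruption between [1] and [2], not anything the hardware is known to do.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide


def replay(r1, r8, r2_after_shl=None):
    """Replay instructions [1] and [2]; optionally corrupt r2 in between."""
    r2 = (r1 << 4) & MASK32        # [1] shl r2, r1, 0x4
    if r2_after_shl is not None:
        r2 = r2_after_shl          # hypothetical: r2 changed between [1] and [2]
    r2 = (r8 + r2) & MASK32        # [2] add r2, r8, r2
    return r2


expected = replay(r1=0, r8=523400)
observed = 523403

print("expected r2:", expected, "word aligned:", expected % 4 == 0)
print("observed r2:", observed, "word aligned:", observed % 4 == 0)

# Which intermediate r2 would explain the observed address?
needed = observed - 523400
print("r2 needed after [1]:", needed)
print("replay with that value:", replay(0, 523400, r2_after_shl=needed))
```

With the recorded register values the replay yields a word-aligned address, so the unaligned value only appears if r2 is not 0 when [2] executes.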

Of course, the exception does not always happen at this exact point (but somewhere around it) but in this example it's very clear to see how it happens. At other points, the relevant registers have a very long history and it's therefore not that easy to trace it back.

As there are no interrupts active at that core (the only interrupts I could find are the RX_DONE and RX_ERROR interrupts within the RGMII RX LLD running on a different core on the same tile), I don't understand how this can happen. I already tried to disable XScope and debug messages and everything else I could imagine. In the end I think that somehow the RX and TX threads might still interfere.

Does anybody have any idea how I can find out what exactly is causing these problems?

EDIT:
I've removed all the GPIO tasks running at Tile 3, such that the only tasks there are now

rgmii_ethernet_mac
rgmii_ethernet_mac_config

Still, the problem persists.

Thank you very much and have a nice day!

Max

mon2 · Post by **mon2** » Wed May 03, 2017 1:46 pm

Have you tried different compiler optimization flags to see if that changes the results or even the generated compiled code ?

maxwinkel · Post by **maxwinkel** » Wed May 03, 2017 3:07 pm

Hi!

Thank you for the suggestion. I was compiling without optimization all the time, but due to the module build info, the lib_ethernet was compiled with -O3. I've now compiled all my code with -O3 and it's currently running - let's see. But still, it should work with or without optimization.

I have a very wild idea: The crashing TX buffer thread is running on lcore 0, the LLD RX thread on lcore 7. Every time the crash occurred, the RX thread was suspended within rx_data (could be random, could be systematic). As far as I understand it, each lcore/hardware thread has its own set of registers and its own interrupts. But since they are using a common pipeline for instruction execution, could it be possible that, due to a bug (timing error / unforeseen number of clock cycles for a very specific combination of instructions), the result of an instruction on lcore 7 could be written to a register on lcore 0?

What leads me to this possible suspect:
- The problem only occurs full duplex with specific rates. There seems to be a correlation between RX and TX which should not be there.
- The crashing thread is running on lcore 0, the possibly interfering thread on lcore 7. In the instruction pipeline, the instructions of lcore 0 are executed right after the instructions of lcore 7.
- The RX thread of lcore 7 is always within the rx_data "loop".
- The rx_data loop writes to the registers r5, r6 and r8, which are also involved in the crashes [to be completely honest: In the example above it looks like r2 is overwritten...]
- The crc32_inc instruction seems to be a very complex operation.
- lcore7 is running in fast, dual issue mode, which I think might be another complication.

I know it's a wild idea and very unlikely, but I know such things CAN happen. Do you have any information if there are known hardware issues in that direction?

Again, thank you very much!

Best regards,

Max

maxwinkel · Post by **maxwinkel** » Wed May 03, 2017 4:02 pm

Hi all,

with -O3, the code still crashes. Now I've caught this one:

ET_ILLEGAL_RESOURCE.PNG

ldw r0, sp[0xd] // r0 = *(sp + 4*13)
setv res[r0], r11 // set event vector for resource pointed to by r0 to r11

The content of the memory at the given location is 0x80191302. However, r0 has a value of 0x80190302. There is a '1' missing at bit 12.
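The difference between the two values can be confirmed with an XOR. A minimal sketch in Python, where `differing_bits` is a hypothetical helper name:

```python
def differing_bits(a, b):
    """Return the positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [i for i in range(32) if (diff >> i) & 1]


in_memory = 0x80191302  # word stored at the stack location
in_r0 = 0x80190302      # value found in r0 after the ldw

bits = differing_bits(in_memory, in_r0)
print("differing bit positions:", bits)
print("XOR mask:", hex(in_memory ^ in_r0))
```

The two words differ in exactly one bit position, which is what a single flipped bit on the load would look like.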

At this point I would tend to say, that this particular CPU might be broken (it happens), but we have several identical boards which all show this behaviour. Even though it's not impossible, that ALL CPUs are broken, I think it's more likely, that it's either a hardware bug, or a very tricky software issue which I just can't find.

Again, the possibly suspicious core 7 is in the RX loop:

rgmii_rx_lld.S:835
-> in tmp1, res[p_rxd]
stw tmp1, ptr[0]
crc32_inc crc, tmp1, poly, ptr, 4

Thank you for your help!

Best regards,

Max

peter · Post by **peter** » Wed May 03, 2017 8:13 pm

Hi Max,

This is indeed very strange behaviour. I would be very surprised if it is a CPU bug given how many other applications we have running reliably. Not impossible, but it seems more likely it is a tricky software issue. It is hard to understand what you are seeing in terms of what is in the memory/registers compared to what logically could be happening. I agree it doesn't make sense without the ability for interrupts to be going off.

It would be good to ensure that there are no issues coming from the speed changing / auto-negotiation. Have you tried configuring the RGMII to run with auto-negotiation turned off? Have you tried restricting it to 100Mb/s? It is a completely different code instance at 100Mb/s, but it would be good to know whether random issues persist or not.

These bits are controlled by the line:

  smi_configure(smi, phy_address, LINK_1000_MBPS_FULL_DUPLEX, SMI_ENABLE_AUTONEG);

Another thing I spotted in looking at the low-level code for the rgmii_rx_lld.S is that it does look like there might be a bug in there that means that those cores can return to the main with interrupts still enabled. Not sure if the path can be exercised, but if the speed change were to occur in the middle of receiving a packet then it looks like the exit path from rx_speed_change_handler doesn't clear the interrupt enable bit.

It feels like the line

  clrsr XS1_SR_ININT_SET(0,1)

should be:

  clrsr XS1_SR_IEBLE_SET(0,1)

Does that make any difference? Clearly, having tasks continuing to execute with interrupts enabled could break things.
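To illustrate why the choice of mask matters, here is a small Python model of the two variants. It is a sketch under assumptions: the bit positions for IEBLE and ININT below are my assumption and should be checked against the xs1 headers, and `field_set` and `clrsr` are hypothetical stand-ins for the macro and the instruction, not the real definitions.

```python
# Assumed bit positions in the status register (verify against the xs1 headers).
SR_IEBLE_SHIFT = 1  # interrupt enable bit (assumption)
SR_ININT_SHIFT = 3  # "in interrupt" flag (assumption)


def field_set(value, shift, bit):
    """Model of an XS1_SR_<FIELD>_SET(value, bit) style macro for a 1-bit field."""
    mask = 1 << shift
    return (value & ~mask) | ((bit << shift) & mask)


def clrsr(sr, mask):
    """Model of clrsr: clear the status-register bits selected by mask."""
    return sr & ~mask


# State inside the handler: interrupts enabled and currently in an interrupt.
sr = field_set(field_set(0, SR_IEBLE_SHIFT, 1), SR_ININT_SHIFT, 1)

after_original = clrsr(sr, field_set(0, SR_ININT_SHIFT, 1))
after_proposed = clrsr(sr, field_set(0, SR_IEBLE_SHIFT, 1))

ieble = 1 << SR_IEBLE_SHIFT
print("IEBLE still set after original line:", bool(after_original & ieble))
print("IEBLE still set after proposed line:", bool(after_proposed & ieble))
```

In this model the original line leaves the interrupt enable bit set on exit, while the proposed line clears it.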

infiniteimprobability · Thu May 04, 2017 8:38 am

Could you tell us a bit more about the board? What you are seeing doesn't look unlike a soft error. The most likely cause for this is inadequate decoupling or poor supply regulation.

For example, some of our older ICs featured DC-DC convertors and when the passives around them are not laid out properly (big current loops with PCB trace/via inductance) you can get similar behaviour to what you are reporting, which was caused by noisy supply rails. They were all rock solid when the layout issues were addressed.

One way to quickly test this hypothesis is to raise the core voltage (up to 1.05 V is fine) and/or lower the chip temperature to see if that reduces the rate of errors you are seeing.

Conversely, lowering the core voltage to 0.95 V and heating the IC should make it worse.

If there are no changes in failure rate from this experiment, then soft errors from supply rail noise from poor decoupling can be ruled out.

maxwinkel · Post by **maxwinkel** » Thu May 04, 2017 9:06 am

Hi all,

after seeing the bit error in the last screenshot I also got the idea that the core power supply might be insufficient. I had checked the supplies before (at the output of the converters), but now I've indeed found a filter which was rated for 300 mA only (instead of up to 1.5 A according to the datasheet) and which lowered the core voltage to only 0.7 V. I've replaced the filter and so far it's been running without any issue for 20 minutes.

20 minutes is not yet proof, but I'm quite certain. I'm sorry that I bothered you with the problem. I was so fixated on a software issue that it took me too long to reconsider the hardware.

Thank you very much for your help!

Best regards,

Max

infiniteimprobability · Thu May 04, 2017 9:08 am

That's great news. Thanks for feeding back the fix. It's reassuring to know that our chips can (just about) function with such a low core voltage when doing heavy lifting!

maxwinkel · Post by **maxwinkel** » Thu May 04, 2017 9:36 am

It's reassuring to know that our chips can (just about) function with such a low core voltage when doing heavy lifting!

Yes, I was really surprised when I saw the 0.7 V (without network traffic, i.e. most threads idle waiting for events) and still it's (mostly) working. You guys really did a great job designing them!

Now, after 1 hour without any errors (not even bit errors on the line, which I had around once per minute before), I think I can conclude that the problem is solved. Again, thank you very much!