Troubleshooting Hardware


4499 roboRIO Randomly Disconnecting

NI,

Over the past few months, 4499 has been having issues with roboRIOs randomly losing all network connections (and perhaps crashing completely). When this happens, the FRC Driver Station loses its connection to the roboRIO and any existing SSH sessions into the RIO die. This happens over both Ethernet and USB. The RS-232 serial console on the RIO also dies. Surprisingly, the RIO can still be pinged over either Ethernet or USB and typically replies in 1 to 2 ms. When this happens, we typically reboot the robot, and then everything works normally until the error happens again. The error seems random, with anywhere from a few minutes to a few hours passing between occurrences.

We first started having this issue in September 2020, and it has stayed with us to this date. Over the past few months, we have tried three different roboRIOs and have updated from the 2020 to the 2021 version of the FRC control system, but the issue has persisted. To debug it, we have looked at the roboRIO logs on the driver station as well as the log files on the roboRIO itself; however, nothing has stood out to us as the core cause.

Recently, we connected to the serial console on the roboRIO via the RS-232 port to try to get more insight. Attached is the text dump from the serial console, which starts when the RIO boots and ends when the error occurs (and the serial connection dies).

Please let us know if you have any thoughts on this issue, or if there are other ways of debugging it that you would recommend. Thanks.

Message 1 of 10

It sounds like the software crashed. Do you have a startup program running? If yes, have you tried disabling the startup app?

Message 2 of 10

Are you running anything on the I2C port? For this year's game, a common example is the REV Color Sensor that came with the KOP.

Which language are you using? Java/C++/LabVIEW?

Message 3 of 10

Right now the robot code is set to launch when the RIO boots up. I'll try disabling that and see whether the RIO still disconnects when no code is running on the device.

Message 4 of 10

We are programming in Java. Our team doesn't have anything specifically on the I2C bus itself, but we do use a NavX in SPI mode plugged into the MXP Port. 

Message 5 of 10

MXP is fine.  There's an odd bug with the I2C/SPI port itself when connected to I2C devices that behaves like what you're describing.  It's rare and there are workarounds.  But, that's not what you're running into.

 

I'd keep your code running at startup.  Is there anything showing in the console on the driver station, specifically anything showing an error around the time you're losing connectivity?

Message 6 of 10

We are not seeing any specific error right before the robot crashes. Occasionally, we get the error:

Error 11 [CAN SPARK MAX] IDs 7, Received parameter invalid error parameter id 78

Is there a chance the CAN bus could be linked to this issue?

When the robot disconnects, it does so extremely abruptly, so it's possible that the robot doesn't even get the chance to send the error message to the driver station.

Also, are there any threads related to the I2C bug you mentioned? I'd like to read through them and see what workarounds people have come up with. I suspect you are right that this is a different issue, but I'd like to check all the same.

Thanks.

Message 7 of 10

If you're not using the I2C port, it's a different issue.  The workaround folks have used successfully is to reduce the rate they refresh data from their sensor.
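For anyone who does hit the I2C case, the "reduce the refresh rate" workaround could look roughly like the Java sketch below. This is only an illustration under stated assumptions: `ThrottledSensor` is a hypothetical helper, not part of WPILib or any vendor library, and the supplier stands in for whatever call actually reads your sensor. The clock is injected so the class can be exercised off-robot.

```java
import java.util.function.DoubleSupplier;
import java.util.function.LongSupplier;

/**
 * Hypothetical helper: caches the last sensor reading and only touches the
 * hardware again once a minimum interval has elapsed.
 */
final class ThrottledSensor {
    private final DoubleSupplier source;  // assumed wrapper around the real sensor read
    private final LongSupplier clockMs;   // time source in milliseconds
    private final long minIntervalMs;     // minimum time between hardware reads
    private long lastReadMs;
    private double cached;
    private boolean hasValue = false;

    ThrottledSensor(DoubleSupplier source, LongSupplier clockMs, long minIntervalMs) {
        this.source = source;
        this.clockMs = clockMs;
        this.minIntervalMs = minIntervalMs;
    }

    /** Returns a fresh reading if the interval has elapsed, otherwise the cached one. */
    double get() {
        long now = clockMs.getAsLong();
        if (!hasValue || now - lastReadMs >= minIntervalMs) {
            cached = source.getAsDouble();  // the only place the sensor is actually read
            lastReadMs = now;
            hasValue = true;
        }
        return cached;
    }
}
```

In robot code you might construct it with `System::currentTimeMillis` as the clock and call `get()` from the periodic loop as often as you like; the sensor itself would then be read at most once per interval. What interval actually avoids the lockup is not something this thread establishes, so treat it as a value to experiment with.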

 

It doesn't sound like the CAN error is a likely culprit.  I'd agree with your general sentiment that the error isn't being sent.

 

How reproducible is the behavior?  Does it happen once a week?  Are you able to sit down and work for a while and see it?  Ideally, we'd try to narrow this down a bit.  If we're able to force it to occur, I'd want to start with a pretty basic example program.  If we see the error there, we're likely looking at something independent of your specific code.  If the error goes away, that helps us focus our troubleshooting a bit.  Until we have a clearer picture, it's the Spiderman meme where every component is pointing at the others placing blame.  If we can't force the error, this becomes a bit harder to narrow down.
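One possible way to get timing data out of a basic example program, given that the console goes silent when the failure hits, is a small heartbeat file. The sketch below is an assumption-laden illustration, not an NI or WPILib facility: `Heartbeat` is a hypothetical class, and the log path is whatever writable location you choose. Each call appends a counter and timestamp and opens the file with `SYNC` so the line is pushed to storage rather than left in a buffer, which means the last line in the file after a reboot approximates when the program stopped running.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.Instant;

/** Hypothetical helper: appends "<count> <timestamp>" lines to a log file. */
final class Heartbeat {
    private final Path logFile;
    private long count = 0;

    Heartbeat(Path logFile) {
        this.logFile = logFile;
    }

    /** Appends one heartbeat line and requests a synchronous write to storage. */
    void beat(Instant now) throws IOException {
        count++;
        String line = count + " " + now + System.lineSeparator();
        Files.write(
                logFile,
                line.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE,
                StandardOpenOption.APPEND,
                StandardOpenOption.SYNC);
    }
}
```

Calling `beat(Instant.now())` about once a second should be plenty; writing on every loop iteration would put needless wear on the flash. The counter is there because the controller's clock may not be set correctly, so the count times your chosen period gives a second estimate of uptime before the failure.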

Message 8 of 10

This is happening to many teams, including ours, and has carried over from 2020.  There are several instances where people have commented that "NI is aware of the issue and able to reproduce it"; however, we've not heard anything from NI.  Here is a good source of information on the subject, including ways to reproduce the bug and help mitigate its impact:  https://www.chiefdelphi.com/t/robot-keeps-losing-com-and-code/378945

Message 9 of 10

We've also seen this using a navX MXP device configured to use SPI instead of I2C. With SPI it occurs less frequently, but it still happens.

Message 10 of 10