NI Linux Real-Time Discussions

Checking cDAQ-9134 storage (NAND) health

Hello All,

We recently saw a case where an unplanned power loss caused our unit to display the following:


NI-cDAQ-9134-019E3C0F nidevldu: [nidevldu] Fatal: Could not connect to MXS. MAX configuration files may be corrupt.

NI-cDAQ-9134-019E3C0F -2147220712

NI-cDAQ-9134-019E3C0F Error code could not be found. Reinstalling the driver might fix the issue. Otherwise, contact National Instruments technical support.

This has prompted me to start looking into how we can keep better track of the NAND health within the unit. I see that the opkg repo has libatasmart, but I'm not sure if this is going to work or if this is the best way to look for bad blocks/errors, detected ECC errors, and NAND wear.

Does anyone have any suggestions?

Just an FYI, I have to travel for a few days, so I might not be able to respond right away. It doesn't mean I'm not grateful for any guidance provided.

Message 1 of 4
(3,364 Views)

jefferyanderson wrote:

Hello All,

We recently saw a case where an unplanned power loss caused our unit to display the following:


NI-cDAQ-9134-019E3C0F nidevldu: [nidevldu] Fatal: Could not connect to MXS. MAX configuration files may be corrupt.

NI-cDAQ-9134-019E3C0F -2147220712

NI-cDAQ-9134-019E3C0F Error code could not be found. Reinstalling the driver might fix the issue. Otherwise, contact National Instruments technical support.

This has prompted me to start looking into how we can keep better track of the NAND health within the unit. I see that the opkg repo has libatasmart, but I'm not sure if this is going to work or if this is the best way to look for bad blocks/errors, detected ECC errors, and NAND wear.

Does anyone have any suggestions?

The issue you have encountered is not a NAND issue, so nothing in SMART would have helped with this specific issue. (For other potential issues, maybe. But not this one.) What you are describing appears to be a userspace file integrity issue involving an NI software component (nidevldu). It might be a bug in the NI software stack, or it might be a bug at some lower level, in software or hardware.

Have you filed a service request over this yet?
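
When searching for the status code or quoting it in a service request, it can help to have it in hexadecimal as well as decimal. The sketch below only reinterprets the signed value from the log as an unsigned 32-bit number; the helper name `to_hex32` is made up for illustration, and the snippet says nothing about what the code means.

```python
def to_hex32(code: int) -> str:
    """Return the unsigned 32-bit hexadecimal form of a signed status code."""
    # Masking with 0xFFFFFFFF maps a negative value onto its two's-complement
    # 32-bit representation.
    return f"0x{code & 0xFFFFFFFF:08X}"


print(to_hex32(-2147220712))  # prints 0x80040318
```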

Message 2 of 4
(3,316 Views)

Richard,
Sorry for the delayed response, I replied earlier but it seems to have fallen into a black hole.

I think you're likely correct, this is probably 'just' MAX database corruption. That said, I'd expect that NAND issues could manifest themselves just like this - the data corruption could impact OS operation, NI software, or the user applications.

This got me thinking that I'm presently in the dark with how our flash is wearing/behaving - I can't see the # of bad blocks, ECC errors, etc, so I'm looking for recommendations on how to monitor this, perhaps with a tool like smartmontools or similar.

 

Do you have any recommendations for this?
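
For anyone who wants to experiment with the smartmontools route, here is a minimal sketch. It rests on several assumptions that are not confirmed anywhere in this thread: that smartmontools 7.0 or later (the first release with JSON output) is installed on the target, that the storage is exposed as an ATA device, and that its node is `/dev/sda`. The function names are hypothetical, and which attributes a given drive reports is vendor-specific.

```python
import json
import subprocess


def read_smart_json(device: str) -> dict:
    """Run smartctl against `device` and return its JSON report as a dict."""
    # smartctl's exit status is a bitmask, so a nonzero value is not
    # necessarily a failure; do not raise on it.
    result = subprocess.run(
        ["smartctl", "--json", "-H", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return json.loads(result.stdout)


def summarize(report: dict) -> dict:
    """Pull the overall health flag and raw attribute values from a report."""
    passed = report.get("smart_status", {}).get("passed")
    table = report.get("ata_smart_attributes", {}).get("table", [])
    attrs = {row["name"]: row["raw"]["value"] for row in table}
    return {"passed": passed, "attributes": attrs}


# Example use on the target (device path is an assumption):
#   print(summarize(read_smart_json("/dev/sda")))
```

Logging the summary periodically would at least show trends over time, though whether any of the reported attributes are meaningful depends on the drive.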

Thanks.

Message 3 of 4
(3,308 Views)

No problem regarding delayed responses... heh. <blush>

 

I guess the main flaw I see with your logic is the presumption that NAND corruption is abnormal. On the contrary, it's so incredibly common that if NAND management *didn't* correct virtually all errors automatically, NAND would be rather unusable. There is not much that can exist between "working SSD" and "bricked SSD": a well-written SSD firmware will continue to correct errors until there are not enough unworn NAND blocks left to save all data.

 

And in fact, this whole process is sufficiently complex that, in general with SSDs, you are at least as likely to get bitten by firmware issues as by actual NAND wear. And no amount of monitoring can alert you to that sort of thing. Not coincidentally, the state of NAND monitoring is pretty horrible; the standardized SMART metrics are notoriously unreliable at predicting SSD failure. Some SSD vendors will provide low-level utilities to query for the sort of information you're asking about if you ask really nicely, but to be honest, while those can be useful in some cases, I wouldn't expect them to be in most.

 

So I would recommend not trying to monitor this -- at least, unless/until there is something demonstrably actionable that you can observe.

Message 4 of 4
(3,135 Views)