Switch Hardware and Software


Multiple NI-Switch problems with PXI-2566 including blue screen

We've been having intermittent problems with our PXI-2566 for months now.

 

It seems that at some point the device and/or driver gets into a state where opening the device causes a "blue screen." It traps in niswdk.dll (addr: ae9db759, base: ae9b7000, datestamp: 488e1ebe). The NI-Switch version on this system is 3.8.0f1. We now have the system configured to do a full kernel dump (as suggested in another thread).
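When comparing traps across reboots or machines, it can help to report the fault as an offset into the module rather than as an absolute address, since the load address may differ from one boot to the next. A minimal sketch of that arithmetic, using the addr and base values quoted above (the function name `format_module_offset` is made up for illustration):

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Write "<module>+0x<offset>" into buf and return the offset, where
 * offset = faulting address - module load address.
 * Hypothetical helper; not part of any NI or Windows API.
 */
uint32_t format_module_offset(char *buf, size_t buf_len, const char *module,
                              uint32_t fault_addr, uint32_t module_base)
{
    uint32_t offset = fault_addr - module_base;
    snprintf(buf, buf_len, "%s+0x%lX", module, (unsigned long)offset);
    return offset;
}
```

With addr ae9db759 and base ae9b7000 this gives niswdk.dll+0x24759, which is the value to compare between crashes.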

 

We've seen some bad viStatus values (that we haven't been able to decode) returned from NI-Switch:

 

niSwitch_InitWithTopology - 0xBFFA6767 

niSwitch_Connect - 0xBFFA4B50 -- I think this is the error that starts the downward spiral...

niSwitch_Disconnect & niSwitch_Connect  - 0xBFFA495D (after 0xBFFA4B50 error)
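As a first step in decoding these, it may help to convert the hex values to the signed decimal form that error lookups usually expect. ViStatus is a signed 32-bit value where negative means error, positive means warning, and zero means success. The sketch below only does that conversion; the function names are made up for illustration, and it does not look up any message text (the driver's own niSwitch_error_message call, if your installed version provides it, is the place to get that).

```c
#include <stdint.h>

/*
 * Reinterpret a 32-bit hex status pattern as the signed value a
 * ViStatus holds. The high-bit case is handled by hand to avoid
 * relying on implementation-defined narrowing conversions.
 */
int32_t status_to_signed(uint32_t raw)
{
    if (raw & 0x80000000u) {
        uint32_t magnitude = (uint32_t)(~raw) + 1u; /* two's complement */
        return (int32_t)(-(int64_t)magnitude);
    }
    return (int32_t)raw;
}

/* Classify a status: negative = error, positive = warning, 0 = success. */
const char *status_kind(int32_t status)
{
    if (status < 0) {
        return "error";
    }
    return status > 0 ? "warning" : "success";
}
```

For the three codes above this gives 0xBFFA6767 = -1074108569, 0xBFFA4B50 = -1074115760 and 0xBFFA495D = -1074116259, all classified as errors.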

 

We get a blue screen the next time we start our app -- best guess is on the niSwitch_InitWithTopology ("foo", NISWITCH_TOPOLOGY_2566_16_SPDT, VI_FALSE, VI_TRUE, ) call. 

 

We'd seen these kinds of problems several months back & switched out to a different PXI-2566 & they went away, but now they've come back. Not sure how much the relays have been stressed, but even if they were, I don't see why we'd get these types of failures. The card passes self-test (from MAX), but I get errors running the soft front panel & can't get to the relay counts:

 

error.GIF 

 

The system is running XP. The application uses MSVC, VISA, etc. This doesn't have anything to do with powering on/off the PXI chassis (another reason for blue screens).

 

Any ideas for fixes or debugging?

 

Message 1 of 8

Hi Dean,

 

Error -200055 typically occurs if the switch resource name has invalid characters in it, but 'foo' is valid and after looking at the error codes I believe this is not related to the underlying problem.  Tell us more about the frequency of this failure.  Are you seeing this every time we run a particular piece of code, or are we running the same code many times before failure?  How long are we running the system between restarts?  Have you seen this behavior when using the shipping example code?

 

0xBFFA6767 indicates that a low-level hardware failure has occurred.  This directly leads to 0xBFFA4B50, which indicates that the hardware is not responding.  Tell us more about the hardware setup.  What PXI chassis are we using?  Is the PXI system using an integrated controller, or are we using a PC connected via an MXI connection?  In either case, what is the make and model of the controlling hardware? 

 

What other PXI devices do we have in the system?  Have we seen failures on these devices?

 

If we move the 2566 to a new slot in the PXI chassis, do we continue to see errors?  If we have another chassis, does placing the 2566 in said chassis resolve the issue?

 

The next time we see error 0xBFFA6767, let's immediately reset the 2566 in Measurement & Automation Explorer (MAX).  Resetting the device should get us closer to the actual fault condition and thus give us a better idea of the actual error. 

 

Also, can you elaborate on "another reason for blue screens"?

 

One last note: the mechanical wear on the relays wouldn't cause this type of failure.  It is possible that if we greatly exceed the specifications of the switch module, we could break down the insulation between the signals and the digital backend of the card, but this would typically destroy the card such that it would never again function.  Since your behavior is intermittent, I wouldn't expect this to be the case.  Still, it's never a bad idea to remove all signals from the front connector to see if the error persists.  

-John Sullivan
Problem Solver
Message 2 of 8

OK, 1 easy answer: the device name was set to "PXI1-NI2566". MAX allows this & we have no trouble using this name from our application, but apparently the "-" is invalid from the soft front panel app. So at least I can get to the switch counts now (and they're all <2000).

 

We did get another 0xBFFA6767 this morning, followed shortly by a blue screen & have a kernel dump. Does that help in tracking this down?

 

We've been running this application for many months. Its use of the 2566 is pretty simple and has not changed. It fails intermittently & we haven't been able to correlate the failure with any other events. The system in general has been pretty stable in terms of hardware & software changes. 

 

There are a lot of devices in this system -- this is the only one we're having problems with. We have a rack-mounted PC interfacing to separate PXI & VXI chassis. Here are summaries from MAX & Device Manager: 

 

max.GIF 

 

PCI devs.GIF 

 

 

Is this enough detail? The MXI interface card (MXI-4?) is currently PCIe, but we had this same problem with the PCI version of the card.

 

Can you be more specific about resetting after a 0xBFFA6767 to get closer to the fault?

 

The "other blue screen" events I was referring to were in a different thread on the NI forums. Those problems had to do with power cycling the PXI (or powering it on after the PC) -- we know from experience not to do that.

 

Message 3 of 8

Hey Dean,


Having a dash in the Device name in MAX is a known issue with switch devices.  This was reported to R&D (# 42742) for further investigation.  Until this behavior is corrected, we strongly recommend not using dashes as identifiers for NI Switch Modules.

 

If we remove the 2566 code from our test procedure, does this prevent the blue screen from occurring?  It's possible that there is a larger issue that manifests due to the sequence of events in your test.  The kernel dump could help us determine the state of the system at failure.  

 

What is the make and model of the rack-mounted system?

 

It sounds like you're seeing this error on a daily basis.  Next time we see error 0xBFFA6767, let's immediately open MAX and click the 'Reset Device' button.  Since this error code pertains to a hardware failure, the device reset should fail with an additional error code.  It's possible that this error code will provide us more detail. 

 

An even better way to reset the device would be to programmatically insert the Reset Device VI after each NI-Switch function call, inside a case structure that runs only on error.  This is only a temporary diagnostic tool to isolate the root error.  If any error occurs, we'll immediately reset the device and then look at the resulting error:

error.png

 

We could easily make each case structure a subVI, but for simplicity I've left it expanded here.  The green "No Error" case is shown as a reference. 

 

 

 

On an unrelated note, let's make sure we identify the PXI system as an External PC:

sdkf.png

-John Sullivan
Problem Solver
Message 4 of 8

We cannot remove the 2566 code and do anything useful. The errors are intermittent, but the 2566 seems to be the common thread. In fact, we are not in a position where we can change anything (software or hardware) at the moment. 

 

The chassis is a NI PXI-1044. The 2566 is in slot 2. 

 

For now, we'll plan to reset the switch from MAX after we see the 0xBFFA6767. When we can change the code we'll look to add niSwitch_reset() calls after any bad status is returned from niSwitch_InitWithTopology(), niSwitch_Connect(), or niSwitch_Disconnect(). Note that we already call niSwitch_reset() after a successful niSwitch_InitWithTopology() call -- this doesn't look necessary, but it also shouldn't cause a problem.
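A rough sketch of what that reset-on-error check could look like in C. To keep it compilable without the NI headers, the reset is passed in as a function pointer and the types are stand-ins for ViStatus and ViSession; in the real application the callback would be a thin wrapper around niSwitch_reset(). The names `check_and_reset`, `diag_result` and `simulated_reset_failing` are made up for illustration, and the -1 returned by the simulated reset is a placeholder, not a real driver code.

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-ins so the sketch builds without visatype.h / niswitch.h. */
typedef int32_t  status_t;   /* same width and sign as ViStatus */
typedef uint32_t session_t;  /* same width as ViSession */

typedef status_t (*reset_fn)(session_t vi);

typedef struct {
    status_t call_status;     /* status of the original driver call */
    status_t reset_status;    /* status of the follow-up reset, if any */
    int      reset_attempted; /* 1 if the reset was issued */
} diag_result;

/*
 * If the preceding driver call failed (negative status), reset the
 * device right away and log both codes, so the reset's own status is
 * captured close to the original fault.
 */
diag_result check_and_reset(const char *call_name, status_t call_status,
                            session_t vi, reset_fn reset)
{
    diag_result r = { call_status, 0, 0 };
    if (call_status < 0) {
        r.reset_attempted = 1;
        r.reset_status = reset(vi);
        fprintf(stderr, "%s failed: 0x%08lX; reset returned 0x%08lX\n",
                call_name,
                (unsigned long)(uint32_t)call_status,
                (unsigned long)(uint32_t)r.reset_status);
    }
    return r;
}

/* Simulated reset for trying the helper without hardware (placeholder code). */
status_t simulated_reset_failing(session_t vi)
{
    (void)vi;
    return -1;
}
```

Each niSwitch_Connect() / niSwitch_Disconnect() return value would be passed through check_and_reset() together with the session handle, and the logged pair of codes saved for the next failure report.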

 

So at this point, we're waiting for the next "event." Do you want the kernel dump? (If so, what's the best way to get it to you?) Any ideas for an app to force the problem (when we get the tester back to debug)? E.g., just call niSwitch_InitWithTopology() in a loop? If you suspect the chassis, any diagnostic ideas there? 

 

Thanks. 

Message 5 of 8

Hi Dean,

 

Just checking in to see if you've been able to reproduce this behavior.  We'd love to see the kernel dump.  You can zip it up and post it here on the forum if you'd like, or with your permission I can contact you via email with further instructions.  If you'd like to increase the rate of error, let's remove the reset so that the driver has more time to reproduce this instability while your code is running.  We should keep running the code you've been using already, but I would recommend using a simulated switch so that we can run your code more frequently, without waiting on the code for other hardware actions or software processing steps, to increase the likelihood of a failure in the driver.

-John Sullivan
Problem Solver
Message 6 of 8

We haven't seen the problem since the last post. We tried looping on the init / reset for tens of thousands of iterations without any errors and are in the process of trying some variations.
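For reference, the kind of loop described here can be sketched as below. It is not the actual test code: the open and close steps are passed in as callbacks so the sketch builds without the NI headers, and in a real run they would wrap niSwitch_InitWithTopology() (plus niSwitch_reset()) and niSwitch_close(). All names in the block are made up for illustration, including the simulated callbacks and the placeholder error value.

```c
#include <stdint.h>

typedef int32_t  status_t;   /* stand-in for ViStatus */
typedef uint32_t session_t;  /* stand-in for ViSession */

typedef status_t (*open_fn)(session_t *vi);
typedef status_t (*close_fn)(session_t vi);

typedef struct {
    unsigned long completed;   /* iterations that finished cleanly */
    status_t      first_error; /* first negative status seen, or 0 */
} stress_result;

/* Open and close the device repeatedly; stop at the first error. */
stress_result stress_open_close(unsigned long iterations,
                                open_fn open_session, close_fn close_session)
{
    stress_result r = { 0, 0 };
    unsigned long i;
    for (i = 0; i < iterations; ++i) {
        session_t vi = 0;
        status_t st = open_session(&vi);
        if (st < 0) { r.first_error = st; return r; }
        st = close_session(vi);
        if (st < 0) { r.first_error = st; return r; }
        r.completed++;
    }
    return r;
}

/* Simulated callbacks for a dry run without hardware. */
static unsigned long sim_open_calls = 0;

status_t sim_open_ok(session_t *vi) { *vi = 1u; return 0; }
status_t sim_close_ok(session_t vi) { (void)vi; return 0; }

/* Fails on the third call with a placeholder error value. */
status_t sim_open_fails_on_third(session_t *vi)
{
    *vi = 1u;
    return (++sim_open_calls == 3) ? -1 : 0;
}
```

Recording the iteration count at the first failure gives a rough failure rate to compare between variations of the loop.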

 

Let's continue this via email & I'll get you the dump.

 

Thanks, 

Dean 

Message 7 of 8

Hey Dean,

 

Thanks for sending the dump via our FTP site.  The kernel dump definitely exhibited some good diagnostic information, but we have yet to see how exactly the driver got into this bad state. After analyzing the file, we'd like to set up a copy of your system here at NI to see if we can further analyze the issue.

 

Can you send us a copy of the code you're running along with information on how long it runs before failure (ballpark)?  In addition, if there's anything special about the setup we need to be aware of, let us know.  Also, please include a screenshot of the Software tab in MAX so we can replicate the installed software.

 

If you'd rather send us the entire system, we can definitely arrange for that, too.  Just let us know.

-John Sullivan
Problem Solver
Message 8 of 8