
Multifunction DAQ


Intermittent failure with PCIe-6251

Solved!

We have been having an intermittent failure that has been getting worse, and it appears to be due to the PCIe-6251 card.  We have LabVIEW 7.0 and DAQmx 8.1 on the system, as well as a PCI-GPIB and a PCI-6711.  Periodically the computer would lock up, and upon reset it either could not find the card or could not reset it until power to the entire test system was cut long enough for the +5V standby rail to drain away.  The problem seemed to be exacerbated by ambient temperature: when the test system was moved away from other external heat sources (i.e., the chiller for the laser), the problem seemed to disappear or was reduced to less than once a week, usually when the LabVIEW program was first run.

 

The test cabinet itself was rebuilt recently, and the problem came back with a vengeance.  After additional cooling was added and the problem persisted, we started monitoring case temperatures.  The internal system temperature is under 42 C worst case, as reported by the chipset.  We upgraded the power supply to make sure it was not supply-line sag causing the problems.  When we catch the problem in the error log, the culprit is either an ATI video driver or NIPALK.sys.  Because the production line keeps going down over this, we installed fresh software on a different machine and moved the cards over.  It seemed to run fine for maybe 2 hours, then started throwing memory parity errors.  That computer is outside of the test rack and cold.  If we pull the PCIe card, the parity errors go away and all is fine.  The problem is that this card is the heart of the control system.  The original computer only crashes 2-3 times a day, so it takes a fair amount of time to verify a fix.

 

Unless I misunderstand it, DAQmx 8.1 is the newest driver that supports LabVIEW 7.0, and DAQmx 8.0.1 is the only other driver that supports both the card and LabVIEW 7.0.

 

Are there known issues with this card that I am not finding?

Message 1 of 11

I moved the card in the second computer to the PCIe slot that happens to be at the bottom of the motherboard, rather than the one closest to the processor.  It was left on overnight and has been running in production for 6 hours now with no crashes.

 

In the first computer the 6251 was right next to the onboard video chipset; in the second, right next to the RAM.  Both computers use the same ATI video driver, but we only had video crashes on the first and parity errors on the second.  I am beginning to suspect that this card has a serious RF interference issue, but I do not have any equipment to verify this.

Message 2 of 11

Dpete,

 

I'm sorry to hear that you're having so much trouble with the PCIe card.  We have not heard of this type of issue before, which makes me even more curious about it.  To answer the question in your first post: the card does not have any known issues beyond what you've already found.  That being said, I'd like to get a kernel dump from your Windows machine so we can see what's going on.  I'm also curious whether the problem goes away when the external connections are removed from the card.  The last thing I'd point you to is the CE specifications for the card, which will hopefully address your RF question.

 

Please post back with the kernel dump and the results from unplugging the connections.  Thanks for bringing this to our attention; I look forward to hearing back from you!
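(A side note on capturing that dump: by default Windows may be configured for only a small minidump, or none at all. A minimal sketch of switching to a full kernel dump via the `CrashControl` registry key, assuming an elevated command prompt and a reboot afterwards:)

```shell
:: CrashDumpEnabled values: 0 = none, 1 = complete, 2 = kernel, 3 = small minidump
reg add "HKLM\SYSTEM\CurrentControlSet\Control\CrashControl" /v CrashDumpEnabled /t REG_DWORD /d 2 /f
:: After the next bugcheck, the dump is written to %SystemRoot%\MEMORY.DMP
```

Keep in mind this only fires on an actual bugcheck (blue screen); a hard lockup or NMI parity halt may never reach the dump path at all.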

 

Sincerely,

Chris G in AE
Message 3 of 11
Solution
Accepted by topic author Dpete

The chassis that had the parity errors has so far had a best run of 20 hours between crashes.  When it generates a parity error there is no entry in the event log, so I am not sure a kernel dump will be generated.  After yesterday morning's crash, the computer would not restart until the PCIe card had been powered off for a fair amount of time and the case opened to cool.  (That card was the warmest part in the system, by anecdotal methods; no part was what I would consider hot.)  So the first case was retrofitted with case fans to directly cool the PCIe card, and that system went back into service a few minutes ago.  If that fails, we will go to LabVIEW 8.6 on that system to see if newer drivers help, and finally throw the PCIe card away and move the I/O down into the VXI chassis.  The latter may be forced depending on when the system crashes next.  I'm a bit leery of swapping from 7.0 to 8.6 on a production line with no time available to debug old software against the new version of LabVIEW.

 

Which external connections are you referring to?  The auxiliary power connection, or the connection to the BNC-2090A?  At this point we are down to just the digital I/O being used on that card.

Message 4 of 11
My mistake: that chassis made it 40 hours to the next failure.  It failed about 24 hours after my earlier post.
Message 5 of 11

Dpete,

 

The external connections I was referring to are anything that plugs into the cards in your system.  Basically, we'd like to isolate the machine to see whether it's an internal error or something coming in on one of the lines.

 

I'm a bit confused at this point about whether you're still having the crash issue.  It sounds like you are, and that you're about to upgrade to 8.6 to see if the problem goes away.  I agree with you that upgrading from 7.0 to 8.6 without compatibility testing is a risky move, especially since 7.0 is no longer an officially supported version.

 

Additionally, since we can't get a kernel dump, we won't be able to see where everything is crashing, which is going to make it very difficult to pin down the error.

 

I'll have to check on the next steps, but at this point it's looking like there isn't much else we can do.  Please let me know the results of your latest test and we'll go from there.  Thanks, and have a great day!

Sincerely,

Chris G in AE
Message 6 of 11

The system has been running since Friday night, so 3.5 days, but only two full production shifts so far.  I'm not sure I'm willing to call the problem solved yet, but it looks like adding a 110 CFM fan pointed at that card is helping.  This is the joy of intermittent failures: if it fails, you know you need to keep looking; if it does not, you are still not sure.  We are still running 7.0 on it.
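Dpete's point about verifying intermittent fixes can be made roughly quantitative. A minimal sketch (my own illustration, not from the thread), assuming the pre-fix crashes behave like a Poisson process at the "2-3 times a day" rate mentioned earlier:

```python
import math

# Assumed pre-fix crash rate: ~2.5 crashes/day (midpoint of "2-3 times a day").
lam = 2.5

def p_no_crash_by_luck(days):
    """Probability an *unfixed* system happens to survive `days` crash-free."""
    return math.exp(-lam * days)

# Two clean days on an unfixed machine would be very unlikely:
print(f"p(2 clean days by luck) = {p_no_crash_by_luck(2):.4f}")   # ~0.0067

# Crash-free run needed before luck alone is <5% likely:
needed = -math.log(0.05) / lam
print(f"run {needed:.2f} days crash-free for 95% confidence")     # ~1.20 days
```

The flip side, as this thread goes on to show, is that the model only rules out the *old* failure rate: a rarer residual failure mode can still survive weeks of clean running.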

Message 7 of 11
Going on 2 weeks now with no crashes.  At this point I'd be willing to chalk it up as a thermal management issue: the cooling in both cases was apparently insufficient for the heat that card puts out.
Message 8 of 11

Problem not solved.  The system just locked up three times in a row.  However it causes the problem, the way it crashes prevents the system from generating a memory dump log.  My best guess is still direct memory access (DMA) corruption.

Message 9 of 11

Hi Dpete,

 

This issue sounds really odd; different behaviors occur on each of the two computers you tried.

 

I don't think you mentioned this, but is this behavior something you have only started seeing recently?  Or is this a relatively new test setup that has always had the issue?

 

Rasheel
Message 10 of 11