Dr. Damien's Development - Hard Drive Crashes

DFGray · ‎01-26-2010

When you write software for a living, it is important to keep the code you write safe. This has several levels, but one that is often neglected is the physical level. What happens if your hard drive crashes or your motherboard dies? Let's look at a couple of experiences I have had.

When I was a young engineer designing medical devices at Hewlett-Packard Laboratories, the motherboard on my workstation failed, taking the hard drive with it. I was not using any sort of source code control. I was doing weekly backups at the time and was somewhat lucky: I only lost a couple of days of work. But I also had to rebuild my entire computer, so I lost almost a week of time.

Fast forward to a couple of weeks ago. I was using a dedicated workstation to test a set of new build processes for the project I am working on. I came in Monday morning and noticed the machine had rebooted and halted in the ICH7 BIOS with a disk-failed error. I rebooted the machine and ran the disk diagnostics to confirm the disk was actually bad. It was. Fortunately, the machine was set up with a RAID 5 disk system, so I replaced the drive, rebooted the machine, and I was back up with no loss of data and only an hour of lost time. The machine was slow for the day as it rebuilt the redundant information, but it was usable.

What would have happened had the motherboard died, as happened in the first case? All code was under source code control and I was backing up anything not submitted on a nightly basis. I estimate I would have lost about a day of time and no code. This is a sharp contrast to the first case, where I lost about a week.

So what is the probability you will suffer a hard drive failure? It is a lot higher than you may think. Check out the Google study, Failure Trends in a Large Disk Drive Population. Annualized failure rates in the first year declined from 3% in the first three months to about 2% for the rest of the year. The rates then shot up to about 8% per year for the next few years. This gives a cumulative probability of losing a hard drive in three years of about 17% (1 - 0.98 × 0.92 × 0.92 ≈ 0.17).
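
The 17% figure can be reproduced with a short calculation. This is a minimal sketch, assuming annual failure rates of roughly 2%, 8%, and 8% for years one through three (rounded from the numbers above) and assuming that each year's outcome is independent. The function name is illustrative only.

```python
def cumulative_failure_probability(annual_rates):
    """Probability that a drive fails at least once over the given years.

    Assumes each entry is the failure probability for one year and that
    the years are independent of each other (a simplification).
    """
    survival = 1.0
    for rate in annual_rates:
        survival *= 1.0 - rate  # chance of surviving this year as well
    return 1.0 - survival


# Assumed rates: about 2% in year one, about 8% in years two and three.
three_year = cumulative_failure_probability([0.02, 0.08, 0.08])
print(f"{three_year:.1%}")  # prints 17.1%
```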

However, things are not as simple as that. Hard drive failure rates are correlated with model and manufacturer. Pick the right model or manufacturer and you may be good for ten years. Pick poorly and your failure rates may be much higher. Unfortunately, the only good way to know if a drive is reliable is historical data. By the time you have this data, the drive is obsolete and a poor purchase.

The best way to alleviate this problem is to use a RAID system for your drives. RAID systems can be set up to give you redundancy (RAID 1), speed (RAID 0), or both (RAID 10 or RAID 5). RAID 0 systems are particularly dangerous, because if any one drive fails, the entire system fails. Since they are composed of multiple drives, the probability of failure is higher than for a single drive.
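
To put a rough number on the RAID 0 risk: the array is lost if any one drive fails, so its failure probability grows with the drive count. The sketch below assumes drives fail independently and reuses the three-year figure of about 17% per drive from above. The function name is made up for illustration.

```python
def raid0_failure_probability(per_drive_failure, drive_count):
    """Chance that a RAID 0 array is lost, i.e. that ANY drive fails.

    Assumes every drive has the same failure probability and that
    drives fail independently of each other.
    """
    if drive_count < 1:
        raise ValueError("drive_count must be at least 1")
    all_survive = (1.0 - per_drive_failure) ** drive_count
    return 1.0 - all_survive


# Compare a single drive with two- and four-drive stripes.
for drives in (1, 2, 4):
    risk = raid0_failure_probability(0.17, drives)
    print(f"{drives} drive(s): {risk:.0%}")
```

Under these assumptions the risk rises from 17% for one drive to about 31% for two and about 53% for four.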

I prefer a RAID 5, since it gives you a performance boost and redundancy with minimal disk investment (it requires three or more disks). Most modern motherboards support RAID 5, so you usually do not need to buy a dedicated controller (although performance will typically be better if you do). In the ten years I have been at National Instruments, I have had four drive failures at work and two at home. None resulted in data loss, although that was luck for the first couple.
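
The "minimal disk investment" point is easier to see as usable capacity per RAID level. This is a rough sketch for equal-sized disks using the standard capacity rules (RAID 5 gives up one disk's worth of space to parity, RAID 10 gives up half to mirroring). It ignores filesystem and controller overhead, and the function name is illustrative only.

```python
def usable_capacity_gb(level, disk_count, disk_size_gb):
    """Usable space in GB for an array of equal-sized disks."""
    if level == "RAID 0":
        minimum, data_disks = 2, disk_count       # striping, no redundancy
    elif level == "RAID 1":
        minimum, data_disks = 2, 1                # every disk holds the same copy
    elif level == "RAID 5":
        minimum, data_disks = 3, disk_count - 1   # one disk's worth of parity
    elif level == "RAID 10":
        minimum, data_disks = 4, disk_count // 2  # striped mirror pairs
    else:
        raise ValueError(f"unsupported level: {level}")
    if disk_count < minimum:
        raise ValueError(f"{level} needs at least {minimum} disks")
    if level == "RAID 10" and disk_count % 2:
        raise ValueError("RAID 10 needs an even number of disks")
    return data_disks * disk_size_gb


# Four 500 GB disks: RAID 5 keeps 1500 GB usable, RAID 10 keeps 1000 GB.
print(usable_capacity_gb("RAID 5", 4, 500))
print(usable_capacity_gb("RAID 10", 4, 500))
```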

What if you cannot afford more than one drive? There are a couple of symptoms of imminent hard drive failure that, if you notice them in time, can help you prevent data loss by cloning your drive before it is too late.

Slow Performance: In many cases, poor hard drive performance is indicative of an impending failure. The disk is actually failing repeatedly, but the redundancy built into the electronics keeps it functioning. I had a workstation that took 15 minutes to boot Windows XP. When the hard drive was replaced, it took two minutes.
Thunking Sound: A thunking sound from your hard drive indicates imminent failure. The thunk is the hard drive mechanism attempting to recalibrate itself on a frequent basis. The interval between these thunks can be anywhere from a few seconds to a few minutes. If I hear it, I replace the drive.

To wrap up:

Use a RAID 1, RAID 5, or RAID 10 disk configuration.
Use a source code control system.
Use nightly backups for files which are not under source code control.
Be aware of slow performance and thunking noises coming from your computer.
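
The nightly backup step can be as simple as copying recently changed files to another drive. Here is a minimal sketch, assuming the backup directory lives outside the source tree (ideally on a different physical disk) and that "recent" means modified in the last 24 hours. The function name and parameters are hypothetical.

```python
import shutil
import time
from pathlib import Path


def backup_recent_files(source_dir, backup_dir, max_age_hours=24):
    """Copy files modified within max_age_hours into backup_dir.

    Relative paths are preserved. Returns the relative paths copied.
    Assumes backup_dir is NOT inside source_dir.
    """
    source = Path(source_dir)
    target = Path(backup_dir)
    cutoff = time.time() - max_age_hours * 3600
    copied = []
    for path in sorted(source.rglob("*")):
        if path.is_file() and path.stat().st_mtime >= cutoff:
            relative = path.relative_to(source)
            destination = target / relative
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, destination)  # copy2 also keeps timestamps
            copied.append(relative)
    return copied
```

Scheduling something like this with cron or the Windows Task Scheduler gives you the nightly part. It complements source code control rather than replacing it.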

PhillipBrooks · ‎01-26-2010

Most new drives include S.M.A.R.T. (description below from Wikipedia):

Self-Monitoring, Analysis, and Reporting Technology, or S.M.A.R.T. (sometimes written as SMART), is a monitoring system for computer hard disks to detect and report on various indicators of reliability, in the hope of anticipating failures.

When a failure is anticipated by S.M.A.R.T., the drive is typically replaced and returned to the manufacturer, who uses these dead drives to discover where faults lie and how to prevent them from reoccurring on the next generation of hard disk drives.

Servers and business-model computers often come with software that will alert the admin via SNMP or email.

Some computer and drive manufacturers also include basic diagnostic programs that access the drive's SMART data. There are also independent tools available for SMART that can be found using Google.

If you have performance problems or hear the sounds mentioned, you could use one of these tools to diagnose your drive.
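
One such independent tool is `smartctl` from the open-source smartmontools package. Below is a minimal sketch, assuming `smartctl` is installed and that its `-H` health report for ATA drives contains a line of the form `SMART overall-health self-assessment test result: PASSED`. The device path and the helper names are only illustrative, and running it usually needs administrator rights.

```python
import subprocess

# Assumed wording of the verdict line in `smartctl -H` output for ATA drives.
HEALTH_PREFIX = "SMART overall-health self-assessment test result:"


def parse_health(report_text):
    """Return the verdict (e.g. 'PASSED'), or None if no verdict line is found."""
    for line in report_text.splitlines():
        if line.startswith(HEALTH_PREFIX):
            return line[len(HEALTH_PREFIX):].strip()
    return None


def check_drive(device="/dev/sda"):
    """Run `smartctl -H` on a device and return the parsed verdict."""
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes to flag problems
    )
    return parse_health(result.stdout)


# Example (hypothetical device path):
# verdict = check_drive("/dev/sda")
```

A passing verdict does not guarantee a healthy drive, so treat it as one signal among several.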

Now is the right time to use %^<%Y-%m-%dT%H:%M:%S%3uZ>T
If you don't hate time zones, you're not a real programmer.
"You are what you don't automate"
Inplaceness is synonymous with insidiousness

Joseph_Loo · ‎01-26-2010

You should read the paper from Google. Basically, it says that if the drive is failing, the SMART data will reflect it. If you want to predict the failure using SMART data, however, it is not very good.

Ben · ‎01-26-2010

Now there is a topic I can comment on!

I spent ten years at DEC (Digital Equipment Corporation), where I was described in one of my performance appraisals as "The Premier Large Disk Specialist". So first we start with a joke.

Q: There are two types of disk in the world, what are they?

A: Disks that are bad and disks that are not bad yet.

If you ever saw the inside of a hard drive in operation and you are like me, you would be surprised they even work. Heads the size of a pin are literally flying on a cushion of air over a surface at more than 200 miles an hour. When they stop flying, they "crash" into the surface that stores the data. In the old days of CDC 9766s, the aftereffects of a head crash could look like a shot-gun went off inside the drive.

Trivia:

The heads are positioned using tracking information written to the platters. When the logic loses track of what track it is over, it has to find home and locate the tracks again. The "thunking" sound is the heads' homing action.

War Stories:

I worked with another engineer for three days ($250 an hour) trying to recover a disk that held source code that a team had been developing for a month. No, I did not get the data back.

An associate was reporting the results of a head rebuild project he had completed the night before on a disk drive that used interchangeable platters (a disk pack). After rebuilding and testing, the customer put his backup back in the rebuilt drive. Afterwards, while standing behind the systems, they noticed a fine brown dust blowing out of the back of the drive. The customer asked:

Q: What is that?

My Buddy replied...

A: Data.

Take-home story:

Make a habit of using Ctrl-S.

Backup often to a different spindle.

Ben

Retired Senior Automation Systems Architect with Data Science Automation, LabVIEW Champion, Knight of NI, and Prepper

tst · ‎01-26-2010

Ben wrote:
...like a shot-gun went off inside the drive.

Did you try that to make sure your comparison is accurate? 😉

___________________
Try to take over the world!

Ben · ‎01-26-2010

tst wrote:
Ben wrote:
...like a shot-gun went off inside the drive.
Did you try that to make sure your comparison is accurate? 😉

Oh you caught me!

No, I made that judgment based on the shrapnel I found embedded in the back of the head pre-amp circuit board.

Ben

GregFreeman · ‎01-26-2010

Ben wrote:

I worked with another engineer for three days ($250 an hour) trying to recover a disk that held source code that a team had been developing for a month. No, I did not get the data back.

Drives should fail more often if that's the price they pay.

Ray.R · ‎01-26-2010

Knowing the linear velocity of the shrapnel, you could probably calculate and compare to that of a shotgun.

Maybe invent a myth and get Mythbusters to verify.. 😉

Ben · ‎01-26-2010

That was the hourly rate for each of us. Back then a 1G hard drive (RP-07) could cost you $150K. The HDA (Head Disk Assembly) cost $12K alone and required a 12-hour procedure to replace. There were only two people in the region qualified to do the procedure, since a mistake cost us $12K. I needed a lift-gate van to get the HDA to the site and the replaced unit back to the office.

Ben

DFGray · ‎01-26-2010

When I was a grad student, we had a 10MByte hard drive connected to our embedded PDP-11/23. It failed once a year, so we would never trust data on it, except as a transient storage mechanism. Permanent storage was on 8" floppies; they were more reliable!
