Virtual machines and a corrupt hard drive

This month saw another first for C&C: a disk “failing” (developing bad sectors) in our Hyper-V server, which hosted a number of development staging server VMs.

What is a bad sector?

For quick context, a hard drive works much like a record player: a spinning disk with a needle on an arm that moves back and forth to read and write data on the disk.

The disk is made up of millions of “sectors”, which are little pockets of space that your data is written to and read from. If a sector becomes corrupt in any way (usually physical damage through wear and tear over time), the data in it becomes very difficult to retrieve.  If the sector was empty, the next data written to it may be lost, which damages the integrity of whichever file or folder occupies that space.

In short, you’re going to have a bad time.

What is a Virtual Machine?

We make use of VMs (virtual machines) for various reasons and tasks here at Code and Create.  In this instance, we ran an Ubuntu server build to deploy and host a web application that we were tasked with modifying and upgrading.

By using a VM we are able to contain and manage a dedicated environment for the web application, and make any changes to the environment without having to worry about other web apps or services being affected.

Each VM has resources allocated from the host server, including its own virtual hard drive (VHD).  In this instance, we allocated a flexible 120GB VHD, which was stored on the physical server’s secondary 500GB hard drive.  Being flexible means the VHD only occupies the amount of space the VM actually uses, and grows on the physical server’s hard drive as the VM uses more space (as opposed to bulk allocating the entire 120GB from day 1).
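To make the “flexible” idea concrete, here is a toy sketch in Python.  It is not how Hyper-V implements anything; the ThinDisk class and its fields are made-up names that only illustrate the difference between the size the VM is promised and the space actually taken on the host.

```python
class ThinDisk:
    """Toy model of a flexible virtual disk (hypothetical, for illustration only)."""

    def __init__(self, max_size_gb: int) -> None:
        self.max_size_gb = max_size_gb  # size the VM believes it has
        self.used_on_host_gb = 0        # space actually consumed on the host disk

    def write(self, size_gb: int) -> None:
        """Grow the backing file only when the VM writes new data."""
        if self.used_on_host_gb + size_gb > self.max_size_gb:
            raise ValueError("virtual disk is full")
        self.used_on_host_gb += size_gb


disk = ThinDisk(max_size_gb=120)
disk.write(30)
print(disk.used_on_host_gb)  # 30
```

The catch, as we found out, is that every bit of growth lands on a new part of the physical disk, healthy or not.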

It’s broken, time to investigate.

Usage-wise, the application eventually became unusable, to the point where the only thing rendered on screen was a 500 internal server error.  *yikes*.

The first step is always to check the log files.  Log files are our friends.  SSHing into the server to check the error logs worked, but was unusually slow.  Viewing the application logs in nano hinted that the disk was mounted read-only (something I wouldn’t know how to do if I wanted to!)  Very unusual, and a little worrying.
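For anyone wanting to confirm a read-only mount rather than infer it from log errors, here is a rough Python sketch.  It assumes the usual /proc/mounts layout on Linux (device, mount point, file system type, options); the function name and the sample text are made up for illustration.

```python
from typing import List


def read_only_mounts(mounts_text: str) -> List[str]:
    """Return the mount points whose options include 'ro'."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if "ro" in options:
            result.append(mount_point)
    return result


# Made-up sample in the /proc/mounts layout.
sample = (
    "/dev/sda1 / ext4 ro,relatime 0 0\n"
    "/dev/sdb1 /data ext4 rw,relatime 0 0\n"
)
print(read_only_mounts(sample))  # ['/']
```

On a live server you would pass in the contents of /proc/mounts instead of the sample.  Some system mounts are read-only by design, so it is the root or data disk showing up here that should worry you.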

Next up was an IT professional’s go-to, fix-all approach: turn it off and back on again.  This always works, and should always be done.  Except this time it didn’t work, and actually made things worse.

Now we’re in full disaster mode.  I’m looking at a Linux boot loader screen spitting out error upon error, with no particularly helpful messages hinting at how to proceed.  Reading between the lines (you could interpret this as “googling”), everything was pointing towards hardware issues; something a VM shouldn’t experience!

Sure enough, a quick launch of Speccy to read the health status of the server highlighted some scary information: the disk was likely to fail imminently, and had already developed bad sectors.

What went wrong?

As the development server was used for testing, continuous upgrades and more testing, we took VM snapshots along the way as easy roll-back points should we break anything.  As time went on, the VHD grew and grew, occupying more space on the physical server’s 500GB disk, until it grew into a bad sector.

Self-confession time: the initial symptoms were completely overlooked.  I had reports that the server was running “unusually slow” when performing operations, something I put down to a poor internet connection on the other party’s side, which was also a factor!

To fix this, I attempted to migrate all of the VM’s data onto another disk in the same server.  This is a standard process, which would normally complete without a hitch.  However, it failed time after time at around the 40% mark, indicating it was hitting a bad sector on the disk and aborting the operation.

Recovering the VM (and its data).

I have no doubt that there is some very good software available for recovering data from failed / soon to fail disks… However, wherever it is, it is not on the first page of Google!

Every specialist “data repairing” or “bad sector repairing” tool I tried failed me.  Until I let Microsoft’s own Checkdisk do its thing.

Checkdisk is a veteran application included in the Windows OS which will scan and repair data on its disks.  One thing I learned during this process is that it records the bad sectors it finds on the disk and tells the OS to actively avoid them.  This is great for short term use, but relying on this is a BAD idea.  If you ever find yourself in a similar situation, use Checkdisk to retrieve your data and migrate it to a healthy disk ASAP!

I was fortunate, as this is not a 100% guaranteed fix process.  The magical command I used was:

Chkdsk d: /f /r /x

The “d:” is the drive letter of my bad disk, and the flags “/f /r /x” instruct the program to fix errors, recover data from bad sectors and dismount the drive first.  (This makes the disk unusable whilst the program is running, which is a good thing!)

This process took over 24 hours to run and, to untrained and panicked eyes, will look as though it may have crashed, often hanging on a printed line suggesting there are 999 hours remaining… fear not, have faith, let it work (let’s face it, you can’t do anything else…)
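If you would rather script the command than type it from memory, here is a small Python sketch that assembles it with each flag documented.  The function and dictionary names are my own for illustration; only the chkdsk command and its three flags come from the command above.

```python
import string

# The three flags used above, with what each one asks chkdsk to do.
CHKDSK_FLAGS = {
    "/f": "fix errors on the disk",
    "/r": "locate bad sectors and recover readable data",
    "/x": "force the volume to dismount first",
}


def build_chkdsk_command(drive_letter: str) -> list:
    """Build the chkdsk argument list for a single drive letter."""
    drive = drive_letter.strip().rstrip(":").lower()
    if len(drive) != 1 or drive not in string.ascii_lowercase:
        raise ValueError(f"not a drive letter: {drive_letter!r}")
    return ["chkdsk", f"{drive}:", *CHKDSK_FLAGS]


command = build_chkdsk_command("D")
print(" ".join(command))  # chkdsk d: /f /r /x
```

The sketch only builds the command.  Actually running it (for example with subprocess.run from an administrator prompt on Windows) is deliberately left out, as this is not something you want to trigger by accident.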

Once it had finished, I still wasn’t able to migrate the VM’s data, but I was able to delete the snapshots, which triggers Hyper-V to merge the VHD back into a single file, one that would naturally land on a different part of the disk.  And now that the OS had a record of which bad sectors to avoid, the VHD was successfully rebuilt onto healthy sectors!

After the merge had completed, a final attempt to migrate the data was successful, and the entire VM booted clean and healthy.

Final thoughts

Back up your data and perform regular health checks on your servers.

It’s very easy to set a server up and put your head in the sand… it’s worked for over 2 years in my eyes!  Until the day it doesn’t!

I will put time aside in the future to find some good health reporting software, something which could check the integrity of disks and push email notifications on any worrying finds.  I haven’t found anything as yet, so if you know of anything, get in touch!
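As a starting point, the check itself could be as small as the Python sketch below.  It is only a sketch: the attribute names follow the SMART names reported by tools such as smartmontools, the readings are made up, treating any count above zero as worrying is my own assumption, and collecting the values and sending the email are left out entirely.

```python
# SMART counters that relate to bad sectors (names as reported by smartmontools).
WATCHED_ATTRIBUTES = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def disk_warnings(raw_values: dict) -> list:
    """Return a warning string for each watched counter above zero."""
    warnings = []
    for name in WATCHED_ATTRIBUTES:
        count = raw_values.get(name, 0)  # a missing counter is treated as zero
        if count > 0:
            warnings.append(f"{name} = {count}")
    return warnings


# Made-up readings for illustration; a real check would read them from the disk.
readings = {"Reallocated_Sector_Ct": 8, "Current_Pending_Sector": 0}
for warning in disk_warnings(readings):
    print("WARNING:", warning)
```

Run on a schedule, with the print swapped for an email, something like this would have flagged our disk long before the 500 errors did.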