Replacing Failing Drive in Linux Raid 1 mdadm Array

I've had a disk drive complaining about having trouble writing to certain blocks (the specific error is "I/O error, dev sdc, sector 1446901823", repeated across a bunch of different sectors).

This raid mirror isn't super critical, since it only holds backups from the other machines on my home network, but I'm starting to think it's time to replace the failing drive and eventually upgrade the size of my backup mirror (md0) partition from 1TB to 2TB. I'm using about 90% of the space right now, but probably have a good 3-4 months before I need the extra space.

Overall, the process was much simpler than I expected and very straightforward. Other than the synchronization of the new drive the whole process took about 20 minutes. It was nice not having to do this with a system down, but I could do it much quicker having been through this test run.

1. Preparation before starting:

I just wanted to gather some background on my drive system before I start unplugging and plugging back in drives.

First I just verify that the array is currently okay:
sudo mdadm --detail /dev/md0
/dev/md0:
        Version : 0.90
  Creation Time : Fri Mar 11 22:53:54 2011
     Raid Level : raid1
     Array Size : 976759936 (931.51 GiB 1000.20 GB)
  Used Dev Size : 976759936 (931.51 GiB 1000.20 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent
    Update Time : Sat Apr 14 08:30:19 2012
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
           UUID : 298307dc:5f5c8273:3b33ca75:284608dc (local to host AMD-ubuntu)
         Events : 0.176
    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       33        1      active sync   /dev/sdc1
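
If you want to script this check rather than read the report by eye, something like the following could work. It is only a sketch: `array_is_healthy` is a name I made up, and the rule it applies (state starts with clean or active, is not degraded, and there are zero failed devices) is my own assumption about what counts as "okay".

```shell
#!/bin/sh
# Sketch: decide whether an array looks healthy from `mdadm --detail` text.
# array_is_healthy is a hypothetical helper; it reads the report on stdin.
array_is_healthy() {
    awk -F' : ' '
        # Strip the leading padding from the field label before comparing.
        { gsub(/^ +/, "", $1) }
        $1 == "State"          { state = $2 }
        $1 == "Failed Devices" { failed = $2 }
        END {
            # Assumed rule: clean or active, not degraded, no failed devices.
            ok = (state ~ /^(clean|active)/) && (state !~ /degraded/) && (failed + 0 == 0)
            exit (ok ? 0 : 1)
        }'
}
```

It would be used as `sudo mdadm --detail /dev/md0 | array_is_healthy && echo "md0 looks healthy"`.
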
Then run a check on the overall disk usage on the machine:
df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb1             224G   59G  155G  28% /
udev                  1.9G  8.0K  1.9G   1% /dev
tmpfs                 780M  1.1M  779M   1% /run
none                  5.0M     0  5.0M   0% /run/lock
none                  2.0G  1.6M  2.0G   1% /run/shm
/dev/md0              917G  825G   84G  91% /media/1TB
Then list out the various disk drives, just in case I run into any unforeseen problems:
sudo fdisk -l
Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders, total 1953525168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1              63  1953520064   976760001   fd  Linux raid autodetect
Disk /dev/sdb: 250.1 GB, 250059350016 bytes
255 heads, 63 sectors/track, 30401 cylinders, total 488397168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000d466c
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *          63   476696744   238348341   83  Linux
/dev/sdb2       476696745   488392064     5847660    5  Extended
/dev/sdb5       476696808   488392064     5847628+  82  Linux swap / Solaris
Disk /dev/md0: 1000.2 GB, 1000202174464 bytes
2 heads, 4 sectors/track, 244189984 cylinders, total 1953519872 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk /dev/md0 doesn't contain a valid partition table
Disk /dev/sdc: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders, total 1953525168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000ca3d0
   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1              63  1953520064   976760001   fd  Linux raid autodetect
Verify that sdc is the drive that smartctl has detected issues with, and grab its serial number so that I pull the correct drive out of my PC:
sudo smartctl --attributes --log=selftest /dev/sdc1
=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F2 EG
Device Model:     SAMSUNG HD103SI
Serial Number:    S1VSJ90S939439
LU WWN Device Id: 5 0024e9 20115afa9
Firmware Version: 1AG01118
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Sat Apr 14 08:46:50 2012 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       20%     21002         1365630457
# 2  Short offline       Completed without error       00%     20158         -
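
To avoid copying the serial number by hand, it can be pulled out of the smartctl report. This is a small sketch; `serial_of` is a hypothetical helper name, and it assumes the report contains a "Serial Number:" line like the one above.

```shell
#!/bin/sh
# Sketch: print the serial number from smartctl information text on stdin.
# serial_of is a hypothetical helper name, not part of smartmontools.
serial_of() {
    # Drop the "Serial Number:" label and its padding, print what is left.
    sed -n 's/^Serial Number:[[:space:]]*//p'
}
```

For example, `sudo smartctl --info /dev/sdc | serial_of` should print the same serial that is on the drive's label. The symlinks under /dev/disk/by-id/ usually include the model and serial too, which is another way to double check which device node belongs to which physical drive.
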

2. Removing the failed drive from the array

First thing to do is fail the suspect drive in the md0 array. Checking the status of the raid after this shows the sdc1 partition failed:
sudo mdadm --manage /dev/md0 --fail /dev/sdc1
Next thing is to remove the drive partition from the md0 array:
sudo mdadm --manage /dev/md0 --remove /dev/sdc1
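
The two commands can be wrapped together so the remove only happens if the fail step worked. This is a sketch under my own assumptions: `retire_member` and the `DRY_RUN` variable are names I invented, and with `DRY_RUN=1` the function only prints what it would run.

```shell
#!/bin/sh
# Sketch: fail a member and then remove it from the array.
# retire_member and DRY_RUN are hypothetical names used for illustration.
retire_member() {
    array=$1
    member=$2
    # With DRY_RUN set, prefix each command with echo so nothing is changed.
    run=${DRY_RUN:+echo}
    # Mark the member faulty first; only remove it if that step succeeded.
    $run sudo mdadm --manage "$array" --fail "$member" &&
    $run sudo mdadm --manage "$array" --remove "$member"
}
```

Running `DRY_RUN=1 retire_member /dev/md0 /dev/sdc1` prints the two commands above; dropping `DRY_RUN=1` runs them for real. Between the two steps, `cat /proc/mdstat` marks the failed member with `(F)`.
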

3. Physically replacing the failed drive and restarting kubuntu

Then I shut down the machine. Since I only have a limited number of SATA connectors, I actually needed to remove the old drive and plug the new drive in its place.

Then I rebooted the PC. During boot, Linux asked whether I wanted to continue booting with the degraded array. I said yes, since this was expected and the array is set up as a data-only drive, and Kubuntu came up normally.

4. Partition the new drive

I created a single partition using all of the space on the new disk drive:
sudo fdisk /dev/sdc 
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0x60be4700.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
The device presents a logical sector size that is smaller than
the physical sector size. Aligning to a physical sector (or optimal
I/O) size boundary is recommended, or performance may be impacted.
Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4, default 1): 1
First sector (2048-3907029167, default 2048):
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-3907029167, default 3907029167):
Using default value 3907029167
Command (m for help): t
Selected partition 1
Hex code (type L to list codes): fd
Changed system type of partition 1 to fd (Linux raid autodetect)
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
WARNING: If you have created or modified any DOS 6.x
partitions, please see the fdisk manual page for additional
information.
Syncing disks.
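
About the alignment warning in the fdisk output: the new drive reports 512-byte logical sectors on top of larger physical sectors, so the partition's starting byte offset should be a multiple of the physical sector size. The arithmetic can be checked with a few lines of shell. This is just a sketch, `is_aligned` is a hypothetical helper, and the 4096-byte physical sector size in the example is what I'd expect from a drive like this rather than something the warning itself states.

```shell
#!/bin/sh
# Sketch: is a partition start aligned to the physical sector size?
# Usage: is_aligned START_SECTOR LOGICAL_BYTES PHYSICAL_BYTES
# is_aligned is a hypothetical helper name.
is_aligned() {
    start=$1
    logical=$2
    physical=$3
    # Byte offset of the partition start must divide evenly by the
    # physical sector size.
    [ $(( start * logical % physical )) -eq 0 ]
}
```

With the default start of 2048, `is_aligned 2048 512 4096` succeeds (2048 * 512 = 1048576 bytes, which is exactly 256 physical sectors). The old-style start of 63 used on the 1TB drives would not pass on a 4096-byte drive, since 63 * 512 = 32256 is not a multiple of 4096. The actual sizes for a drive can be read from /sys/block/sdc/queue/logical_block_size and /sys/block/sdc/queue/physical_block_size.
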
A quick check to make sure Linux sees the new partition and that I partitioned it correctly:
sudo fdisk -l
Disk /dev/sdb: 250.1 GB, 250059350016 bytes
255 heads, 63 sectors/track, 30401 cylinders, total 488397168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000d466c
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *          63   476696744   238348341   83  Linux
/dev/sdb2       476696745   488392064     5847660    5  Extended
/dev/sdb5       476696808   488392064     5847628+  82  Linux swap / Solaris
Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders, total 1953525168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1              63  1953520064   976760001   fd  Linux raid autodetect
Disk /dev/md0: 1000.2 GB, 1000202174464 bytes
2 heads, 4 sectors/track, 244189984 cylinders, total 1953519872 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk /dev/md0 doesn't contain a valid partition table
Disk /dev/sdc: 2000.4 GB, 2000398934016 bytes
81 heads, 63 sectors/track, 765633 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x60be4700
   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1            2048  3907029167  1953513560   fd  Linux raid autodetect

5. Add the new partition to the array

The last step was to add this new partition to the existing array:
sudo mdadm --manage /dev/md0 --add /dev/sdc1
The following command let me monitor the re-sync process, which took a few hours:
watch cat /proc/mdstat
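
If you'd rather have just the percentage (for a script or a notification) instead of watching the whole file, the progress figure can be pulled out of /proc/mdstat. A minimal sketch, assuming the usual "recovery = NN.N%" or "resync = NN.N%" progress line; `resync_progress` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print the rebuild percentage from an mdstat-style file.
# resync_progress is a hypothetical helper; it reads /proc/mdstat unless
# another file is given, and prints nothing when no rebuild is running.
resync_progress() {
    file=${1:-/proc/mdstat}
    # Keep only the percentage from the first recovery/resync progress line.
    sed -n -E 's/.*(recovery|resync) *= *([0-9.]+%).*/\2/p' "$file" | head -n 1
}
```

Calling `resync_progress` with no arguments reads the live file. To simply block until the rebuild is finished, `sudo mdadm --wait /dev/md0` should also do the job.
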

Currently I'm just replacing the one drive with a 2TB drive. Later, when the other 1TB drive fails (or if I need the space sooner), it will get replaced with at least a 2TB drive so the backup array can grow. This was my first time replacing a drive in a Linux software raid 1 array and it went pretty smoothly.

Thanks to the sources below that helped me put together my plan for Kubuntu.
Sources used:
http://en.wikipedia.org/wiki/Mdadm
http://www.cyberciti.biz/tips/linux-find-out-if-harddisk-failing.html
http://zackreed.me/articles/64-increasing-the-size-of-mdadm-raid1-disks



Comments

  1. Did the re-sync process recreate the proper sized swap partition or do you have to make your own swap partition? I am gathering to use gparted and create identical partitions to match the original partitions before replacing the failed drive.

  2. This was just a data drive with no swap partition. I'd assume if you had multiple partitions in your array -you'd need to partition them before adding them back to their respective array.

  3. I'm in the process of doing something similar, but the problem is I forgot to remove the failed partition before the disk got physically removed (it's in a datacenter). I tried removing the partition and then re-adding it, but I end up with a number 0 and number 2 (number 1 used to be the old disk) for each of the md partitions. Is this a problem and is it possible to solve it without having to re-install the old disk physically?

  4. There's no need to "sudo" the `watch cat /proc/mdstat`.
