Replacing Failing Drive in Linux Raid 1 mdadm Array
I've had a disk drive complaining about trouble writing to certain blocks (the specific error is "I/O error, dev sdc, sector 1446901823", repeated across a bunch of different sectors).
This raid mirror isn't super critical; it holds backups from the other machines on my home network. Still, I'm starting to think it's time to replace the failing drive and eventually upgrade the size of my backup mirror (md0) partition from 1TB to 2TB. I'm using about 90% of the space right now, but probably have a good 3-4 months before I need the space.
Overall, the process was much simpler than I expected and very straightforward. Other than the synchronization of the new drive, the whole process took about 20 minutes. It was nice not having to do this with a system down, but I could do it much quicker having been through this test run.
1. Preparation before starting:
I just wanted to gather some background on my drive system before I started unplugging drives and plugging them back in. First I just verify that the array is currently okay:
sudo mdadm --detail /dev/md0
/dev/md0:
Version : 0.90
Creation Time : Fri Mar 11 22:53:54 2011
Raid Level : raid1
Array Size : 976759936 (931.51 GiB 1000.20 GB)
Used Dev Size : 976759936 (931.51 GiB 1000.20 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Sat Apr 14 08:30:19 2012
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
UUID : 298307dc:5f5c8273:3b33ca75:284608dc (local to host AMD-ubuntu)
Events : 0.176
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 33 1 active sync /dev/sdc1
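If you would rather script this check than read the output by eye, here is a minimal sketch. The function names `array_state` and `array_is_clean` are my own (hypothetical), and the parsing assumes the `State : ...` line looks like the output above.

```shell
#!/bin/sh
# Print the value of the "State :" line from `mdadm --detail` output on stdin.
# Trailing whitespace is stripped from the value.
array_state() {
    awk -F' : ' '$1 ~ /^[ \t]*State$/ { sub(/[ \t]+$/, "", $2); print $2; exit }'
}

# Succeed only if the reported state is exactly "clean".
array_is_clean() {
    [ "$(array_state)" = "clean" ]
}

# Usage on a live system (needs root; /dev/md0 is the array from this post):
#   sudo mdadm --detail /dev/md0 | array_is_clean && echo "md0 is clean"
```

Anything other than a plain "clean" (for example "clean, degraded") makes the check fail, which is what I would want before pulling a drive on purpose.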
Then run a check on the overall disk usage on the machine:
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sdb1 224G 59G 155G 28% /
udev 1.9G 8.0K 1.9G 1% /dev
tmpfs 780M 1.1M 779M 1% /run
none 5.0M 0 5.0M 0% /run/lock
none 2.0G 1.6M 2.0G 1% /run/shm
/dev/md0 917G 825G 84G 91% /media/1TB
Then list out the various disk drives, just in case I run into any unforeseen problems:
sudo fdisk -l
Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders, total 1953525168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Device Boot Start End Blocks Id System
/dev/sda1 63 1953520064 976760001 fd Linux raid autodetect
Disk /dev/sdb: 250.1 GB, 250059350016 bytes
255 heads, 63 sectors/track, 30401 cylinders, total 488397168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000d466c
Device Boot Start End Blocks Id System
/dev/sdb1 * 63 476696744 238348341 83 Linux
/dev/sdb2 476696745 488392064 5847660 5 Extended
/dev/sdb5 476696808 488392064 5847628+ 82 Linux swap / Solaris
Disk /dev/md0: 1000.2 GB, 1000202174464 bytes
2 heads, 4 sectors/track, 244189984 cylinders, total 1953519872 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk /dev/md0 doesn't contain a valid partition table
Disk /dev/sdc: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders, total 1953525168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000ca3d0
Device Boot Start End Blocks Id System
/dev/sdc1 63 1953520064 976760001 fd Linux raid autodetect
Verify that the sdc1 drive is the drive that smartctl has detected issues with and grab its serial number so that I pull the correct drive out of my PC:
sudo smartctl --attributes --log=selftest /dev/sdc1
=== START OF INFORMATION SECTION ===
Model Family: SAMSUNG SpinPoint F2 EG
Device Model: SAMSUNG HD103SI
Serial Number: S1VSJ90S939439
LU WWN Device Id: 5 0024e9 20115afa9
Firmware Version: 1AG01118
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 3b
Local Time is: Sat Apr 14 08:46:50 2012 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 20% 21002 1365630457
# 2 Short offline Completed without error 00% 20158 -
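Since the serial number is what I match against the label on the physical drive, a small helper to pull just that field out can save some squinting. This is only a sketch: `drive_serial` is a hypothetical name of mine, and it assumes the `Serial Number:` line is formatted as in the information section above.

```shell
#!/bin/sh
# Print the drive serial number from smartctl's information section on stdin.
drive_serial() {
    awk -F':' '/^Serial Number:/ { sub(/^[ \t]+/, "", $2); print $2; exit }'
}

# Usage on a live system (needs root; -i prints the information section):
#   sudo smartctl -i /dev/sdc | drive_serial
# The names under /dev/disk/by-id/ usually embed the model and serial too,
# so `ls -l /dev/disk/by-id/ | grep sdc` is another way to cross-check.
```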
2. Removing the failed drive from the array
First thing to do is fail the suspect drive in the md0 array. Checking the status of the raid after this shows the sdc1 partition failed:
sudo mdadm --manage /dev/md0 --fail /dev/sdc1
Next thing is to remove the drive partition from the md0 array:
sudo mdadm --manage /dev/md0 --remove /dev/sdc1
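To confirm that the right member was marked failed (run this after the --fail step and before the --remove step), the status can be read from /proc/mdstat, where a failed member carries an `(F)` suffix. Here is a rough sketch; `failed_members` is a hypothetical helper name, and the sample line in the comment is illustrative rather than captured from my machine.

```shell
#!/bin/sh
# List the members of array $1 that /proc/mdstat text on stdin marks with (F).
# An array line looks roughly like: md0 : active raid1 sdc1[1](F) sda1[0]
failed_members() {
    awk -v md="$1" '$1 == md && $2 == ":" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /\(F\)$/) { sub(/\[.*/, "", $i); print $i }
    }'
}

# Usage on a live system:
#   failed_members md0 < /proc/mdstat
```

Once the failed member has been removed it no longer appears on the array line at all, so the helper prints nothing at that point.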
3. Physically replacing the failed drive and restarting kubuntu
Then I shut down the machine. Since I only have a limited number of SATA connectors, I actually needed to remove the old drive and plug the new drive in its place. Then I rebooted the PC. When rebooting, Linux asked if I wanted to continue booting with the degraded array. I said yes, since this was expected and this drive is set up as a data-only drive, and Kubuntu came up as expected.
4. Partition the new drive
I created a single partition using all of the space on the new disk drive:
sudo fdisk /dev/sdc
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0x60be4700.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
The device presents a logical sector size that is smaller than
the physical sector size. Aligning to a physical sector (or optimal
I/O) size boundary is recommended, or performance may be impacted.
Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4, default 1): 1
First sector (2048-3907029167, default 2048):
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-3907029167, default 3907029167):
Using default value 3907029167
Command (m for help): t
Selected partition 1
Hex code (type L to list codes): fd
Changed system type of partition 1 to fd (Linux raid autodetect)
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
WARNING: If you have created or modified any DOS 6.x
partitions, please see the fdisk manual page for additional
information.
Syncing disks.
A quick check to make sure Linux sees the new partition and that I partitioned it correctly:
sudo fdisk -l
Disk /dev/sdb: 250.1 GB, 250059350016 bytes
255 heads, 63 sectors/track, 30401 cylinders, total 488397168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000d466c
Device Boot Start End Blocks Id System
/dev/sdb1 * 63 476696744 238348341 83 Linux
/dev/sdb2 476696745 488392064 5847660 5 Extended
/dev/sdb5 476696808 488392064 5847628+ 82 Linux swap / Solaris
Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders, total 1953525168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Device Boot Start End Blocks Id System
/dev/sda1 63 1953520064 976760001 fd Linux raid autodetect
Disk /dev/md0: 1000.2 GB, 1000202174464 bytes
2 heads, 4 sectors/track, 244189984 cylinders, total 1953519872 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk /dev/md0 doesn't contain a valid partition table
Disk /dev/sdc: 2000.4 GB, 2000398934016 bytes
81 heads, 63 sectors/track, 765633 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x60be4700
Device Boot Start End Blocks Id System
/dev/sdc1 2048 3907029167 1953513560 fd Linux raid autodetect
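The fdisk warning about logical versus physical sector size is worth a quick sanity check. The new drive reports 512-byte logical and 4096-byte physical sectors, so the partition should start on a multiple of 4096 bytes. Below is a small sketch of that arithmetic; `is_aligned` is a hypothetical helper name, and the numbers come from the fdisk output above.

```shell
#!/bin/sh
# Succeed if a partition starting at sector $1 (logical sectors of $2 bytes)
# begins on a multiple of the physical sector size $3 (in bytes).
is_aligned() {
    [ $(( ($1 * $2) % $3 )) -eq 0 ]
}

# New sdc1: 2048 * 512 = 1048576 bytes, an exact multiple of 4096.
is_aligned 2048 512 4096 && echo "start sector 2048: aligned"

# The old drives start at sector 63: 63 * 512 = 32256 bytes, which leaves a
# remainder of 3584 when divided by 4096.
is_aligned 63 512 4096 || echo "start sector 63: not aligned on a 4K drive"
```

That is why accepting fdisk's default first sector of 2048 is fine here, while copying the old drive's layout (start sector 63) onto the new drive would not have been.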
5. Add the new partition to the array
The last step was to add this new partition to the existing array:
sudo mdadm --manage /dev/md0 --add /dev/sdc1
The following command allowed me to monitor the re-sync process, which took a few hours:
watch cat /proc/mdstat
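If you only want the progress figure (for a log file or a notification, say), it can be picked out of the /proc/mdstat text. This is a minimal sketch; `rebuild_progress` is a hypothetical name, and it assumes the kernel prints a line containing `recovery = NN.N%` (or `resync = NN.N%`) while the mirror rebuilds.

```shell
#!/bin/sh
# Print the rebuild progress (for example "12.6%") found in /proc/mdstat text
# on stdin. Prints nothing when no recovery or resync is in progress.
rebuild_progress() {
    awk '{
        for (i = 1; i + 2 <= NF; i++)
            if (($i == "recovery" || $i == "resync") && $(i + 1) == "=") {
                print $(i + 2)
                exit
            }
    }'
}

# Usage on a live system:
#   rebuild_progress < /proc/mdstat
# To simply block until the rebuild is finished, mdadm also has a wait option:
#   sudo mdadm --wait /dev/md0
```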
Currently I'm just replacing the one drive with a 2 TB drive. Later, when the other 1TB drive fails (or if I need the space), it will get replaced with at least a 2 TB drive to support my backup array. This was my first time replacing a drive in a Linux software raid 1 array and it went pretty smoothly.
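A mirror can only be as large as its smallest member, so md0 stays at 1TB until the second drive has been replaced as well. I have not done the resize step yet, so treat the following as a rough, untested sketch of what that later step would probably look like. `grow_plan` is a hypothetical helper that only prints the commands, and `resize2fs` assumes an ext2/3/4 filesystem sitting directly on the array.

```shell
#!/bin/sh
# Print (do not run) the commands that would grow array $1 once BOTH mirror
# members are larger than the current array size.
grow_plan() {
    # Let the array use all available space on its members.
    echo "mdadm --grow $1 --size=max"
    # Assumption: an ext2/3/4 filesystem directly on the array.
    echo "resize2fs $1"
}

# Usage: review the output, then run each command with sudo.
#   grow_plan /dev/md0
```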
Thanks to the sources below that helped me put together my plan for Kubuntu.
Sources used:
http://en.wikipedia.org/wiki/Mdadm
http://www.cyberciti.biz/tips/linux-find-out-if-harddisk-failing.html
http://zackreed.me/articles/64-increasing-the-size-of-mdadm-raid1-disks
Did the re-sync process recreate the proper sized swap partition or do you have to make your own swap partition? I am gathering to use gparted and create identical partitions to match the original partitions before replacing the failed drive.
This was just a data drive with no swap partition. I'd assume if you had multiple partitions in your array, you'd need to partition them before adding them back to their respective array.
I'm in the process of doing something similar, but the problem is I forgot to remove the failed partition before the disk got physically removed (it's in a datacenter). I tried removing the partition and then re-adding it, but I end up with a number 0 and number 2 (number 1 used to be the old disk) for each of the md partitions. Is this a problem and is it possible to solve it without having to re-install the old disk physically?
There's no need to "sudo" the `watch cat /proc/mdstat`.
Thank you - change made!