When a disk in a software RAID (mdadm) array fails, the output of the cat /proc/mdstat command shows [2/1] and [U_] for the affected arrays. This means that only one of the two member disks is still active in each array.
cat /proc/mdstat
md0 : active raid1 nvme0n1p2[0] nvme1n1p2[1](F)
      614336 blocks super 1.0 [2/1] [U_]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md1 : active raid1 nvme1n1p3[1](F) nvme0n1p3[0]
      959237120 blocks super 1.2 [2/1] [U_]
      bitmap: 6/8 pages [24KB], 65536KB chunk

md127 : active raid1 nvme1n1p1[1](F) nvme0n1p1[0]
      16759808 blocks super 1.2 [2/1] [U_]
As you can see in the above output, there are three md devices, each of which is a two-disk RAID-1 array.
In this example, the disk is used for the operating system and has three partitions: /boot, /, and swap.
md0 - /boot
md1 - /
md127 - swap
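As a quick check of which arrays are degraded, the status field can be parsed instead of read by eye. The following is a minimal sketch, not part of the original procedure: the function name degraded_arrays is made up for illustration, and it assumes the usual /proc/mdstat layout where the [UU] or [U_] field appears on a line after the array name.

```shell
# List md arrays whose status field (e.g. [U_]) shows a missing member.
# Reads /proc/mdstat by default; pass a file name to check saved output instead.
# NOTE: degraded_arrays is a hypothetical helper name, not an mdadm feature.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                 # remember the current array
        name != "" && match($0, /\[[U_]+\]/) {      # first [UU]/[U_] field after the name
            if (index(substr($0, RSTART, RLENGTH), "_"))
                print name                          # "_" marks a missing member
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}

# Usage: degraded_arrays              (live system)
#        degraded_arrays saved.txt    (saved copy of /proc/mdstat)
```

With the output shown above, this would list md0, md1, and md127, one per line.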
To identify which partitions to remove from the RAID arrays, run the following commands one at a time:
mdadm --detail /dev/md0 | grep faulty
mdadm --detail /dev/md1 | grep faulty
mdadm --detail /dev/md127 | grep faulty
You'll see a line similar to the following in the output of each command:
1 259 7 1 faulty /dev/nvme1n1p3
Alternatively, you can get this information from the cat /proc/mdstat output shown above, which shows (F) beside each faulty partition. For example, for md0 the failed partition is nvme1n1p2 (or /dev/nvme1n1p2), for md1 it's nvme1n1p3 (or /dev/nvme1n1p3), and so on.
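Reading the (F) markers can also be scripted. The sketch below is only an illustration: faulty_members is a made-up helper name, and it assumes member entries look like nvme1n1p2[1](F), as in the output above.

```shell
# Print "<array> <partition>" for every member flagged (F) in mdstat output.
# Reads /proc/mdstat by default; pass a file name to parse saved output.
# NOTE: faulty_members is a hypothetical helper name used for illustration.
faulty_members() {
    awk '
        /^md[0-9]+ :/ {
            for (i = 3; i <= NF; i++)               # skip "mdX" and ":"
                if ($i ~ /\(F\)$/) {
                    dev = $i
                    sub(/\[[0-9]+\].*$/, "", dev)   # nvme1n1p2[1](F) -> nvme1n1p2
                    print $1, "/dev/" dev
                }
        }
    ' "${1:-/proc/mdstat}"
}
```

For the example output, this prints one line per array, pairing md0 with /dev/nvme1n1p2, md1 with /dev/nvme1n1p3, and md127 with /dev/nvme1n1p1.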
To get the serial number of the faulty disk follow this article.
Note: if you can't get the serial of the faulty disk because it's no longer functioning, you can get the serial of the working disk so at least you know which disk(s) to keep in the system and replace the other one.
Step 1: Remove the Faulty Disk from RAID array
Run the following commands to remove partitions of the faulty disk from the RAID array:
mdadm --manage /dev/md0 --remove /dev/nvme1n1p2
mdadm --manage /dev/md1 --remove /dev/nvme1n1p3
mdadm --manage /dev/md127 --remove /dev/nvme1n1p1
Important: replace the partition names with the correct ones on your system.
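Sometimes a failing disk still has partitions that are not yet flagged (F). mdadm will not remove an active member, so those need --fail before --remove. The following sketch only prints the commands for review and runs nothing. The name print_detach_commands is hypothetical, and the partition naming it matches (disk name, optional "p", then digits) is an assumption that fits nvme and sd devices.

```shell
# Print (not run) the mdadm commands that detach every partition of one disk.
# $1 = disk name without /dev/, e.g. nvme1n1
# $2 = mdstat file to read (default: /proc/mdstat)
# NOTE: print_detach_commands is a hypothetical helper name.
print_detach_commands() {
    awk -v disk="$1" '
        /^md[0-9]+ :/ {
            for (i = 3; i <= NF; i++) {
                dev = $i
                sub(/\[[0-9]+\].*$/, "", dev)                  # strip "[1]" or "[1](F)"
                if (dev !~ ("^" disk "p?[0-9]+$")) continue     # not on this disk
                if ($i ~ /\(F\)$/)                              # already marked faulty
                    printf "mdadm --manage /dev/%s --remove /dev/%s\n", $1, dev
                else                                            # still active: fail it first
                    printf "mdadm --manage /dev/%s --fail /dev/%s --remove /dev/%s\n", $1, dev, dev
            }
        }
    ' "${2:-/proc/mdstat}"
}

# Usage: print_detach_commands nvme1n1
```

Review the printed commands against your own partition names before running any of them.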
Now, shut down the system and replace the faulty disk.
Step 2: Prepare the New Disk
After physically replacing the faulty disk, copy the partition table from the existing (working) disk to the new one:
sgdisk -R /dev/NEW-DISK /dev/EXISTING-DISK
For example, in my output, the faulty disk (which has been replaced with a new one) is /dev/nvme1n1 and the existing, working disk is /dev/nvme0n1 so the command should be as follows:
sgdisk -R /dev/nvme1n1 /dev/nvme0n1
The above command copies the partition table together with the disk and partition GUIDs. To avoid conflicts, randomize the GUIDs on the new disk using this command:
sgdisk -G /dev/NEW-DISK
If the sgdisk command is not found, install the gdisk package. On RHEL/AlmaLinux, use the following command:
dnf install gdisk
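Because sgdisk -R takes the target disk first and the source disk second, swapping the two arguments would overwrite the healthy disk. A small wrapper can make the order explicit and allow a dry run. This is a sketch only: clone_partition_table and the DRY_RUN variable are names invented for this example.

```shell
# Copy the GPT from the healthy disk to the replacement, then randomize GUIDs.
# Usage: clone_partition_table NEW_DISK EXISTING_DISK
# Set DRY_RUN=1 to print the commands instead of running them.
# NOTE: clone_partition_table and DRY_RUN are hypothetical names.
clone_partition_table() {
    new=$1
    existing=$2
    if [ -z "$new" ] || [ -z "$existing" ] || [ "$new" = "$existing" ]; then
        echo "usage: clone_partition_table NEW_DISK EXISTING_DISK (must differ)" >&2
        return 1
    fi
    run=${DRY_RUN:+echo}                    # "echo" in dry-run mode, empty otherwise
    $run sgdisk -R "$new" "$existing" &&    # target first, source second
    $run sgdisk -G "$new" &&                # randomize disk and partition GUIDs
    $run sgdisk -p "$new"                   # print the new table for review
}

# Usage: DRY_RUN=1 clone_partition_table /dev/nvme1n1 /dev/nvme0n1
```

Running it with DRY_RUN=1 first lets you confirm which disk is the target before anything is written.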
Step 3: Re-add the New Disk Partitions
Run the following commands to re-add the new disk partitions to the RAID array:
mdadm --manage /dev/md0 --add /dev/nvme1n1p2
mdadm --manage /dev/md1 --add /dev/nvme1n1p3
mdadm --manage /dev/md127 --add /dev/nvme1n1p1
Important: replace the partition names with the correct ones on your system.
The rebuild process will start immediately after you add the partitions. You can monitor the process using the command:
cat /proc/mdstat
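During a rebuild, /proc/mdstat includes a progress line with a percentage and an estimated finish time. The sketch below extracts just those values for each array. The name rebuild_progress is made up for this example, and it assumes the usual "recovery = 12.6% (...) finish=67.4min speed=..." layout.

```shell
# Print "<array> <percent> <time-remaining>" for arrays that are rebuilding.
# Reads /proc/mdstat by default; pass a file name to parse saved output.
# NOTE: rebuild_progress is a hypothetical helper name.
rebuild_progress() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                       # remember the current array
        /(recovery|resync) *=/ {
            pct = ""; eta = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /%$/)       pct = $i             # e.g. 12.6%
                if ($i ~ /^finish=/) eta = substr($i, 8)  # text after "finish="
            }
            if (pct != "") print name, pct, eta           # skips resync=DELAYED lines
        }
    ' "${1:-/proc/mdstat}"
}

# Usage: rebuild_progress
```

To refresh the raw view automatically instead, watch -n 5 cat /proc/mdstat reruns the command every five seconds.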
Optional: Speed Up the Rebuilding Process
To speed up the rebuilding process, you can raise the maximum rebuild speed, which is set in KB/s, to 400MB/s or more if you are using NVMe disks:
echo 400000 > /proc/sys/dev/raid/speed_limit_max
This value will reset after reboot.
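If you would rather restore the previous limit yourself after the rebuild instead of waiting for a reboot, it helps to record the old value before changing it. A minimal sketch follows. The name set_rebuild_limit is invented for this example, and the optional second argument exists only so the function can be tried against a scratch directory.

```shell
# Raise the md rebuild ceiling and report the previous value.
# $1 = new limit in KB/s, e.g. 400000
# $2 = directory holding speed_limit_max (default: /proc/sys/dev/raid)
# NOTE: set_rebuild_limit is a hypothetical helper name.
set_rebuild_limit() {
    dir=${2:-/proc/sys/dev/raid}
    old=$(cat "$dir/speed_limit_max") || return 1     # remember the current limit
    echo "$1" > "$dir/speed_limit_max" || return 1    # needs root on a live system
    echo "speed_limit_max: $old -> $1 KB/s"
}

# Usage: set_rebuild_limit 400000
```

Once the rebuild finishes, writing the reported old value back to the same file restores the original limit.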