GNU Linux/Software RAID

The following content is a Work In Progress and may contain broken links, incomplete directions or other errors. Once the initial work is complete this notice will be removed. Please contact me via Twitter with any questions and I'll try to help you out.


These are my scratch notes for recovering Software RAID arrays on a GNU/Linux box. The examples here are for a CentOS 5.x box, but presumably any recent GNU/Linux distro with Software RAID support via <code>mdadm</code> could be used. In case it's not clear, I'm a newbie when it comes to Software RAID, so some of these steps may be redundant or nonsensical. If so, please feel free to point that out so I can make this easier to read.


== The problem report ==

This started off with me receiving emails from mdadm (which was monitoring the RAID devices on a 1U server with 4 physical disks) reporting a DegradedArray event on md device <code>/dev/md0</code>.

<pre>
This is an automatically generated mail message from mdadm running on server.example.org

A DegradedArray event had been detected on md device /dev/md0.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1] [raid6] [raid5] [raid4]
md2 : active raid5 sdd1[3] sdc1[2] sdb2[1]
      2917676544 blocks level 5, 256k chunk, algorithm 2 [4/3] [_UUU]

md1 : active raid1 sdd2[1] sdc2[0]
      8385856 blocks [2/2] [UU]

md3 : active raid5 sdd3[3] sdc3[2] sdb3[1]
      2917700352 blocks level 5, 256k chunk, algorithm 2 [4/3] [_UUU]

md0 : active raid1 sdb1[1]
      8385792 blocks [2/1] [_U]

unused devices: <none>
</pre>
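
The <code>[2/1] [_U]</code> status for <code>md0</code> is what marks the degraded state: two member slots, but only one active. As a quick sketch, any array with a missing member can be spotted by looking for an underscore in that status field:

<syntaxhighlight lang="bash">
# Print the status line of any array with a missing member (the underscore),
# plus the preceding line that names the md device
grep -B 1 '\[.*_.*\]' /proc/mdstat
</syntaxhighlight>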

== Determining what disks or partitions the RAID device is composed of ==

Based on a previous conversation with another tech, I knew that a RAID device could be composed of entire disks or of partitions from multiple disks. The advantage of using partitions instead of entire disks is the ease with which you can satisfy the requirement that all RAID members be the same size. In this case, partitions were used to assemble the RAID devices instead of entire disks.
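
For illustration only (not part of this recovery), an array built from partitions rather than whole disks is created by pointing <code>mdadm</code> at the partitions; the device names below are hypothetical:

<syntaxhighlight lang="bash">
# Create a two-member RAID1 array from two partitions (hypothetical devices)
mdadm --create /dev/md9 --level=1 --raid-devices=2 /dev/sdx1 /dev/sdy1
</syntaxhighlight>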

Assuming that your main root partition is still operational, it's time to collect some information.


=== mdadm.conf contents ===

<syntaxhighlight lang="bash">
cat /etc/mdadm.conf
</syntaxhighlight>

<pre>
DEVICE partitions
MAILADDR root
ARRAY /dev/md0 level=raid1 num-devices=2 uuid=cbae8de5:892d4ac9:c1cb8fb2:5f4ab019
ARRAY /dev/md3 level=raid5 num-devices=4 uuid=a5690093:5c58a8d9:ac966bcf:a00660c2
ARRAY /dev/md2 level=raid5 num-devices=4 uuid=a45e768c:246aca55:1c012e56:58dd3958
ARRAY /dev/md1 level=raid1 num-devices=2 uuid=183e0f5d:2ac92a56:f064a724:9c4cc3a4
</pre>
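
As an aside, if <code>mdadm.conf</code> were ever missing or stale, equivalent ARRAY lines can be generated from the arrays that are currently assembled:

<syntaxhighlight lang="bash">
# Print an ARRAY line for every currently assembled array; review the output
# before appending it to /etc/mdadm.conf
mdadm --detail --scan
</syntaxhighlight>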


=== Partitions list ===

<syntaxhighlight lang="bash">
fdisk -l
</syntaxhighlight>

<pre>
Disk /dev/sda: 2000.3 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1        1044     8385898+  fd  Linux raid autodetect
/dev/sda2            1045      122122   972559035   fd  Linux raid autodetect
/dev/sda3          122123      243201   972567067+  fd  Linux raid autodetect

Disk /dev/sdb: 2000.3 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1        1044     8385898+  fd  Linux raid autodetect
/dev/sdb2            1045      122122   972559035   fd  Linux raid autodetect
/dev/sdb3          122123      243201   972567067+  fd  Linux raid autodetect

Disk /dev/sdc: 2000.3 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1   *           1      121078   972559003+  fd  Linux raid autodetect
/dev/sdc2          121079      122122     8385930   fd  Linux raid autodetect
/dev/sdc3          122123      243201   972567067+  fd  Linux raid autodetect

Disk /dev/sdd: 2000.3 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1   *           1      121078   972559003+  fd  Linux raid autodetect
/dev/sdd2          121079      122122     8385930   fd  Linux raid autodetect
/dev/sdd3          122123      243201   972567067+  fd  Linux raid autodetect
</pre>
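
Since every RAID member here carries the <code>fd</code> (Linux raid autodetect) partition type, the listing can be narrowed to just those partitions if the full output is too noisy:

<syntaxhighlight lang="bash">
# Show only the partitions flagged as Linux raid autodetect
fdisk -l | grep "Linux raid autodetect"
</syntaxhighlight>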


=== Determining array members based on block size ===

Since RAID devices require that all array members be the same size, I realized that to find the other array member I could determine the size (in blocks) of the surviving member and look for partitions of a matching size. So if <code>/dev/sdb1</code> is the remaining member of the <code>/dev/md0</code> array and it is <code>8385898</code> blocks in size, the other member would also need to be the same size.

<syntaxhighlight lang="bash">
fdisk -l | grep 8385898
</syntaxhighlight>

<pre>
/dev/sda1   *           1        1044     8385898+  fd  Linux raid autodetect
/dev/sdb1   *           1        1044     8385898+  fd  Linux raid autodetect
</pre>

However, that approach could mislead you if two different arrays happen to use identically sized partitions, so thankfully there is an easier way to find the array members.
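
As a side note, <code>mdadm</code> can also read the RAID superblock on an individual partition; the reported UUID identifies which array the partition belongs to and can be compared against the UUIDs in <code>mdadm.conf</code>:

<syntaxhighlight lang="bash">
# Inspect the RAID superblock on a member partition; the reported UUID
# identifies the array it belongs to
mdadm --examine /dev/sda1
</syntaxhighlight>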


=== Determining array members based on mdadm output ===

<code>mdadm</code> allows us to get the list of array members for a specified array with a short command:

<syntaxhighlight lang="bash">
mdadm --misc --detail /dev/md0
</syntaxhighlight>

<pre>
/dev/md0:
        Version : 0.90
  Creation Time : Wed Jul 13 23:04:19 2011
     Raid Level : raid1
     Array Size : 8385792 (8.00 GiB 8.59 GB)
  Used Dev Size : 8385792 (8.00 GiB 8.59 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Aug  7 17:50:16 2012
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : cbae8de5:892d4ac9:c1cb8fb2:5f4ab019
         Events : 0.4493518

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
</pre>

In this case we see that both <code>/dev/sda1</code> and <code>/dev/sdb1</code> make up the <code>/dev/md0</code> RAID device, so we'll need to make sure both are active.


== Repairing the root RAID device ==

To find out which array member is missing, let's first see which one is still active:


<syntaxhighlight lang="bash">
cat /proc/mdstat | grep md0
</syntaxhighlight>

<pre>
md0 : active raid1 sdb1[1]
</pre>

So, it appears that <code>/dev/sda1</code> needs to be added back.
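
Since the member that dropped out may be sitting on a failing disk, it may be worth a quick health check before re-adding it (this assumes <code>smartmontools</code> is installed, which the <code>smartd</code> messages later on suggest):

<syntaxhighlight lang="bash">
# Report the overall SMART health assessment for the suspect disk
smartctl -H /dev/sda
</syntaxhighlight>

If the disk looks healthy enough, re-add the partition to the array: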

<syntaxhighlight lang="bash">
mdadm --add /dev/md0 /dev/sda1
</syntaxhighlight>

Snippet from <code>/var/log/messages</code> related to the last command:

<pre>
Aug  7 14:16:00 lockss1 kernel: md: bind<sda1>
Aug  7 14:16:00 lockss1 kernel: RAID1 conf printout:
Aug  7 14:16:00 lockss1 kernel:  --- wd:1 rd:2
Aug  7 14:16:00 lockss1 kernel:  disk 0, wo:1, o:1, dev:sda1
Aug  7 14:16:00 lockss1 kernel:  disk 1, wo:0, o:1, dev:sdb1
Aug  7 14:16:00 lockss1 kernel: md: syncing RAID array md0
Aug  7 14:16:00 lockss1 kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Aug  7 14:16:00 lockss1 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
Aug  7 14:16:00 lockss1 kernel: md: using 128k window, over a total of 8385792 blocks.
Aug  7 14:22:31 lockss1 kernel: md: md0: sync done.
Aug  7 14:22:31 lockss1 kernel: RAID1 conf printout:
Aug  7 14:22:31 lockss1 kernel:  --- wd:2 rd:2
Aug  7 14:22:31 lockss1 kernel:  disk 0, wo:0, o:1, dev:sda1
Aug  7 14:22:31 lockss1 kernel:  disk 1, wo:0, o:1, dev:sdb1
Aug  7 14:26:10 lockss1 smartd[3387]: Device: /dev/sda, 1 Currently unreadable (pending) sectors
</pre>
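
While the resync is running, the rebuild progress can be followed live; a simple sketch:

<syntaxhighlight lang="bash">
# Refresh /proc/mdstat every couple of seconds to watch the rebuild progress
watch -n 2 cat /proc/mdstat
</syntaxhighlight>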
Once the resync had finished, <code>/proc/mdstat</code> showed <code>md0</code> with both members active again:

<syntaxhighlight lang="bash">
tail /proc/mdstat
</syntaxhighlight>

<pre>
md1 : active raid1 sdd2[1] sdc2[0]
      8385856 blocks [2/2] [UU]

md3 : active raid5 sdd3[3] sdc3[2] sdb3[1]
      2917700352 blocks level 5, 256k chunk, algorithm 2 [4/3] [_UUU]

md0 : active raid1 sda1[0] sdb1[1]
      8385792 blocks [2/2] [UU]

unused devices: <none>
</pre>

Even with the smartd error, it looks like <code>/dev/md0</code> is holding. We'll have to go back at some point and run <code>fsck</code> on it from a rescue disc so the filesystem isn't mounted while we're trying to verify its consistency.
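
A rough sketch of those follow-up checks; both assume the box has been booted from rescue media so that <code>/dev/md0</code> is not mounted, and the <code>e2fsck</code> call assumes the filesystem is ext3 (typical for a CentOS 5.x install):

<syntaxhighlight lang="bash">
# Check SMART health and attributes (including pending sectors) on the suspect disk
smartctl -H -A /dev/sda

# Force a full filesystem check on the unmounted array (ext3 assumed)
e2fsck -f /dev/md0
</syntaxhighlight>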