Building a Linux storage server

I didn’t want to install full-featured NAS software like FreeNAS or Unraid. Instead I had a few goals I wanted to achieve (in no specific order):

  • HDD storage should be quite resilient against hardware outage
  • Lots of storage 😉
  • Decent pricing
  • Low power consumption
  • Not too slow (in regards to I/O performance)
  • NFS/CIFS (Samba) support
  • iSCSI support (e.g. for Kubernetes Persistent Storage)
  • Only Linux tools

That’s it 😉 So let’s talk about how I achieved these goals.

Working with hard disks, partitions, filesystems, and so on is potentially dangerous as you can lose data if you run the wrong command at the wrong time! So be careful, especially if you operate on hard disks and/or partitions that already contain data! I’m not responsible for your data 😉

Resilience against hardware outage basically comes down to two main factors:

  1. Good hardware
  2. A suitable RAID level

Regarding “good hardware” I decided to go with five Toshiba Enterprise MG08 HDDs. They are made for running 24/7 and I also had good experience with Toshiba HDDs in general over the years. Also their pricing is quite ok.
When it comes to resilience against HDD outage there are basically four options: RAID 1 (mirror), RAID 5 (parity), RAID 6 (double parity) and RAID 10 (mirror/stripe). With RAID 1 and 10 I’d lose 50% of the available storage capacity with 5 disks. That’s not acceptable for me. If performance had been a priority I’d most probably have used RAID 10, but capacity matters more to me. So the most obvious option was of course RAID 5. But that one has at least one issue: with disks of multiple TBytes a RAID rebuild takes many hours at the very least. With RAID 5 you can only lose one disk if you don’t use a spare disk, and the probability that another disk fails during a rebuild that takes one to two days is quite high. That risk is too high for me. RAID 6 allows losing two disks, and in my case with five HDDs I get 60% capacity utilization (at least 10 percentage points more than the 50% of RAID 1 or 10). That’s way better. Of course RAID 6 with an additional spare disk would be optimal, but that’s also very expensive.

As the boot/root disk I decided to use a small dedicated SSD. No need for redundancy here as it can be replaced quite easily.

As mentioned above I decided to use five Toshiba Enterprise MG08ACA16TE HDDs with a capacity of 16 TByte each. With RAID 6 I get a usable capacity of 48 TByte or about 43.66 TiB (tebibytes).
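
As a quick sanity check, here is a tiny shell snippet (just an illustration; the disk count and size are the ones from my setup) that computes the usable capacity per RAID level:

bash

# Usable capacity with five 16 TB disks:
# RAID 5 keeps (n-1) disks of data, RAID 6 keeps (n-2), RAID 1/10 roughly 50%.
DISKS=5; SIZE_TB=16
echo "RAID 5   : $(( (DISKS - 1) * SIZE_TB )) TB ($(( (DISKS - 1) * 100 / DISKS ))%)"
echo "RAID 6   : $(( (DISKS - 2) * SIZE_TB )) TB ($(( (DISKS - 2) * 100 / DISKS ))%)"
echo "RAID 1/10: $(( DISKS * SIZE_TB / 2 )) TB (50%)"
# 48 TB (decimal) is about 43.66 TiB, which matches the usable RAID 6 capacity.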

I also bought the disks from two different dealers; the more dealers, the better. If you buy from one dealer only, the probability that you get disks from the same production batch is quite high. If that batch has a production issue it affects all your disks. Buying from different dealers is no guarantee that you get disks from different batches, but it increases the probability. This might further reduce the chance that more than one disk fails while you replace a failed disk and rebuild the RAID array, and rebuilding a RAID array with big disks can take quite some time! And just to remind you that this isn’t a theoretical issue: I once had two Samsung EVO 870 SSDs report SMART errors at around the same time. Both had serial numbers that differed only in the last digit. Searching the Internetz it became clear that quite a few people had problems with disks that had the same production date as mine.

Important note: Whatever RAID level you choose (besides RAID 0 which gives you no redundancy at all) you still need to back up your data! If you delete your data by accident (rm -fr / e.g.) no RAID level will protect it!

I already mentioned the HDDs used and their pricing. So here are the other components I used that fulfill the decent pricing and low power consumption requirements:

  • ITX board: ASRock J5040-ITX - This board comes with an Intel quad-core Pentium J5040 (4 cores/4 threads, up to 3.2 GHz, 10W TDP), one PCIe 2.0 x1 slot (used for the additional SATA controller mentioned below) and four SATA3 ports. So together with the additional SATA controller I have 10 SATA ports available. If you can’t get a board with the J5040 processor, an Intel Celeron J4125 is also good enough; it’s only a little bit slower.
  • Power supply: be quiet! ATX 400W System Power 9 BN245 - To have enough power for the board, the RAM and all disks working at the same time (e.g. during a RAID rebuild) I chose a 400W power supply. I wouldn’t go below that. This power supply is also very quiet.
  • Additional SATA controller: MZHOU PCIe SATA Controller Expansion Card 6Gbps SATA 3.0 PCIe 6-Port - As everything RAID related is managed by Linux I didn’t need a full-fledged RAID controller. The only important thing is that the card is well supported by Linux. In this case it uses Marvell 88SE9215 and ASMedia ASM109X chips, which are well supported. But don’t expect performance miracles from this SATA controller 😉 Still, for my use case it’s good enough. I connected the boot SSD and two of the RAID disks to the SATA ports the ASRock board offers and the other three RAID disks to the MZHOU SATA controller. That should distribute the I/O load a bit.
  • RAM/Memory: HyperX Impact HX432S20IB2K2/16 16GB Kit (2x8GB) 3200MHz DDR4 CL20 SODIMM - While the board officially only supports 8 GB of RAM this RAM kit works very well. The Internet also contains reports that even 2x16GB SODIMMs work without problems.

Additionally I had an old tower case lying around, a Samsung EVO 850 SSD which I used as boot disk and a Samsung EVO 840 SSD that I wanted to share via iSCSI for Kubernetes later.

NFS and CIFS (Samba) are well supported by Linux. Both protocols are used to share a mountpoint with all my videos to whatever device needs it. In my case that’s mainly an Amazon Fire TV Cube and my desktop PC.
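
Just to give an idea what that will look like on the server side later, here is a minimal sketch of an NFS export and a Samba share definition. The mountpoint /mnt/videos and the 192.168.1.0/24 network are placeholders, and the nfs-utils and samba packages (plus their services) are assumed to be installed and enabled:

bash

# NFS: export /mnt/videos read-only to the local network
echo '/mnt/videos 192.168.1.0/24(ro,no_subtree_check)' >> /etc/exports
exportfs -ra

# Samba: a read-only [videos] share
cat >> /etc/samba/smb.conf <<'EOF'
[videos]
   path = /mnt/videos
   read only = yes
   guest ok = no
EOF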

I also wanted some storage for my little Kubernetes cluster. iSCSI is well supported by Linux via open-iscsi, and Kubernetes has an iSCSI volume driver.
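
On the client side the open-iscsi part roughly boils down to discovering and logging in to a target with iscsiadm. The portal IP and the IQN below are just placeholders:

bash

# Discover the iSCSI targets offered by the storage server (placeholder IP)
iscsiadm -m discovery -t sendtargets -p 192.168.1.10

# Log in to one of the discovered targets (placeholder IQN)
iscsiadm -m node -T iqn.2021-08.local.uhura:k8s -p 192.168.1.10 --login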

Using the Logical Volume Manager (LVM) makes it easy and flexible to provide block storage for both use cases.

I’ll use Archlinux which always provides the latest Linux tools/software available. But Ubuntu or basically any other Linux distribution should be good enough. The tools used throughout this blog post should be available on basically every Linux.

So let’s finally start…

After the storage server was assembled I installed Archlinux. The installation itself is of course out of scope and also depends on your Linux distribution. As mentioned I installed Archlinux on the SSD and didn’t touch the RAID disks at all during the installation. After the freshly installed Linux booted for the first time I checked what the disk setup looked like. lsblk is a nice little utility for that task (make sure the util-linux package is installed, which is normally the case):

bash

[root@uhura ~]# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda      8:0    0 232.9G  0 disk 
├─sda1   8:1    0   512M  0 part /boot
├─sda2   8:2    0     2G  0 part [SWAP]
├─sda3   8:3    0    30G  0 part /
└─sda4   8:4    0 200.4G  0 part 
sdb      8:16   0  14.6T  0 disk 
sdc      8:32   0  14.6T  0 disk 
sdd      8:48   0  14.6T  0 disk 
sde      8:64   0 232.9G  0 disk 
sdf      8:80   0  14.6T  0 disk 
sdg      8:96   0  14.6T  0 disk 

So as you can see the root (/) filesystem ended up on /dev/sda3 (the SAMSUNG EVO 850 SSD). I used 30 GB for it which should be enough as it will only contain the Linux OS. /dev/sde is the SAMSUNG EVO 840 SSD mentioned above, but that one isn’t important here. So the Toshiba disks for my RAID 6 are /dev/sd(b|c|d|f|g).

Before I actually did anything RAID related I prepared for the case that an HDD fails: I wrote down the serial numbers of all disks that are part of the RAID. This makes it easier to identify a failed disk later in case you need to replace it. This way you really make sure to remove the failed disk and not a still working one.

For this task we need hdparm installed:

bash

pacman -S hdparm

To get the serial number of /dev/sdb, for example, this command can be used:

bash

hdparm -i /dev/sdb | grep Serial | awk '{ print $4 }'
SerialNo=42AAEEEYFVGG

And for all my disks:

bash

for X in b c d f g; do
  echo "HDD /dev/sd${X}: $(hdparm -i /dev/sd${X} | grep Serial | awk '{ print $4 }')"
done

HDD /dev/sdb: SerialNo=42C123FYEVGG
HDD /dev/sdc: SerialNo=42N123531VGG
HDD /dev/sdd: SerialNo=42A123FMNVGG
HDD /dev/sdf: SerialNo=42C123Y3NVGG
HDD /dev/sdg: SerialNo=42S123F5FVGG

Save this information on a different computer, mobile, tablet or print it out. Afterwards I put a label with the device path and serial number on every disk. So if a disk fails at any time I have all the information needed to identify and remove it, right on the disk itself. More about that later.

Let’s start doing something useful by creating a partition on every RAID disk. You can use whatever tool you want for this task. I used gdisk, which allowed me to give the partitions a name. This is nice as these partition names later show up in /dev/disk/by-partlabel/ and can be used for creating the RAID, for example. This makes identifying the partitions easier.

After testing I’ll create only one big partition on every disk used for the RAID. All these partitions are later used to create LVM volume groups and logical volumes. So there will be one big disk array and on top of that the LVM volume groups.
But for testing a RAID failure and rebuild it makes sense to start with a very small partition on every disk (which I deleted again after I was done with testing) as the rebuild process can take VERY long with big partitions. So I’ll create a 1 GB partition on every disk to make recovery faster. As mentioned, this is only for testing purposes, but the final partitions are created the same way, just bigger.

So let’s install gdisk (it comes with the gptfdisk package):

bash

pacman -S gptfdisk

Start the tool and configure /dev/sdb:

bash

gdisk /dev/sdb

gdisk should normally create a partition table of type GPT automatically.

I created a new partition by pressing n and entering the Partition number (usually 1). Make sure that the First sector starts at 2048 for optimal alignment and performance. That should be the default nowadays anyway.

Now gdisk asks for the Last sector. If you follow along and just add a test partition, enter +1G which gives you a partition of roughly 1 GB. If you decide that you already want to create the partition with its final size then I suggest NOT using the whole disk! The reason is that if you need to replace the disk later (because of a disk error e.g.) you might get a disk that doesn’t have quite as many sectors as your old one, even if the new one has the same nominal size! For that reason it makes sense to leave about 100 MB of disk space unused. To calculate the Last sector we need some information about the disk which we get by pressing the p key:

bash

Command (? for help): p
Disk /dev/sdb: 31251759104 sectors, 14.6 TiB
...
Sector size (logical/physical): 512/4096 bytes
...
First usable sector is 2048, last usable sector is 31251759070
...

If we want to keep 100 MB of free space the formula to get the sector count for 100 MB of disk space is:

bash

free space in MB * 1024 * 1024 / logical sector size = sector count

E.g.:

bash

100MB * 1024 * 1024 / 512 = 204800

The first (and only) partition starts at sector 2048. The last usable sector is 31251759070 and 204800 sectors should stay free at the end of the disk, so the value for Last sector is 31251759070 - 204800 = 31251554270.
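
The same calculation as a small shell snippet (the last usable sector is the one from the gdisk output above):

bash

LAST_USABLE=31251759070   # "last usable sector" from gdisk's p output
SECTOR_SIZE=512           # logical sector size
FREE_MB=100               # space to leave unused at the end of the disk
echo $(( LAST_USABLE - FREE_MB * 1024 * 1024 / SECTOR_SIZE ))
# -> 31251554270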

For Hex code or GUID enter fd00 which is for Linux RAID.

gdisk also allows setting a partition name by pressing c. I used nas1 as partition name on the first disk and then just count up for the partitions on the other disks e.g. nas2|3|4|5.

So in my case the final partition table looks like this (I used 120 MB for the free space so the numbers are a little bit different):

bash

Command (? for help): p
Disk /dev/sdb: 31251759104 sectors, 14.6 TiB
Model: TOSHIBA MG08ACA1
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): 511F9A50-7158-FA4D-B0A3-D28AFA9A5A5F
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 2048, last usable sector is 31251759070
Partitions will be aligned on 2048-sector boundaries
Total free space is 245727 sectors (120.0 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048     31251513343   14.6 TiB    FD00  nas1

The same needs to be done for all the other RAID disks before continuing.
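
If you prefer to script this instead of walking through gdisk interactively for every disk, an sgdisk loop like the following should do the same thing (sgdisk is part of the same gptfdisk package). This is only a sketch: adjust the end sector to your disks, or use +1G relative sizing for the small test partitions:

bash

END=31251554270   # last sector as calculated above (leaves ~100 MB free at the end)
N=1
for X in b c d f g; do
  sgdisk --new=1:2048:${END} --typecode=1:fd00 --change-name=1:nas${N} /dev/sd${X}
  N=$((N + 1))
done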

If all partitions on all disks are ready the RAID setup can start. First the mdadm package is needed as it contains the tools to create and manage a RAID:

bash

pacman -S mdadm

Next I loaded a few kernel modules. Normally they should be loaded automatically but just to be sure:

bash

modprobe -v raid456
modprobe -v dm-mod

Next create a RAID named nas (--name). If --name is not specified the basename of the device is used, which would be nas here anyway since the device is /dev/md/nas. --level specifies the RAID level and that’s of course 6 as I want to create a RAID 6 array. --raid-devices specifies the number of disks the array consists of and that’s 5. And finally all the partitions that are part of the array need to be specified. As you can see, at this point the partition names can be used, which is very handy:

bash

mdadm --create /dev/md/nas --name=nas --level=6 --raid-devices=5 /dev/disk/by-partlabel/nas[12345]

Additionally you can also specify the following (important) parameters, but they are the defaults nowadays anyway (check the man page man mdadm to verify):

bash

--chunk=512 --metadata=1.2

A chunk size of 512 (in KiB) is the default of current mdadm versions and normally sufficient (you can also see the 512k chunk in the /proc/mdstat output below). You definitely want to use --metadata=1.2 if it’s not the default of your mdadm. From the man page:

plain

Use the new version-1 format superblock. This has fewer restrictions.
It can easily be moved between hosts with different endian-ness, and
a recovery operation can be checkpointed and restarted. The different
sub-versions store the superblock at different locations on the device,
either at the end (for 1.0), at the start (for 1.1) or 4K from the
start (for 1.2). "1" is equivalent to "1.2" (the commonly preferred
1.x format). "default" is equivalent to "1.2".

So if you now fire off the command above it will start creating the array and you can watch the status, e.g.:

bash

cat /proc/mdstat 

Personalities : [raid6] [raid5] [raid4] 
md127 : active raid6 sdg2[4] sdf2[3] sdd2[2] sdc2[1] sdb2[0]
      43655645184 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
      [>....................]  resync =  0.0% (2559952/14551881728) finish=1705.0min speed=142219K/sec
      bitmap: 109/109 pages [436KB], 65536KB chunk

With my setup it took about 24 hours using the full size of all disks. With five 1 GB partitions it is only a matter of a minute or so. Once the build is done the status looks like this:

bash

cat /proc/mdstat 

Personalities : [raid6] [raid5] [raid4] 
md127 : active raid6 sdg2[4] sdf2[3] sdd2[2] sdc2[1] sdb2[0]
      43655645184 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
      bitmap: 0/109 pages [0KB], 65536KB chunk

To make the array setup permanent let’s save it:

bash

mdadm --examine --scan --verbose >> /etc/mdadm.conf

The result will look like this:

bash

ARRAY /dev/md/nas  level=raid6 metadata=1.2 num-devices=5 UUID=b31b3747:2c3d907e:68eccf46:3920aae0 name=uhura:nas
   devices=/dev/sdg2,/dev/sdf2,/dev/sdd2,/dev/sdc2,/dev/sdb2

It makes sense to have a backup of that file somewhere else in case the disk array fails and you need to repair a degraded array.

To get some more information about the array and its state you can use the following command:

bash

mdadm --misc --detail /dev/md/nas

/dev/md/nas:
           Version : 1.2
     Creation Time : Sun Aug 15 22:37:49 2021
        Raid Level : raid6
        Array Size : 46876870656 (44705.27 GiB 48001.92 GB)
     Used Dev Size : 15625623552 (14901.76 GiB 16000.64 GB)
      Raid Devices : 5
     Total Devices : 5
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Tue Aug 17 00:55:05 2021
             State : clean 
    Active Devices : 5
   Working Devices : 5
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : uhura:nas  (local to host uhura)
              UUID : 3bb811ef:60fc09a7:59d3b046:498ed04c
            Events : 18701

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1
       3       8       97        3      active sync   /dev/sdg1
       4       8      113        4      active sync   /dev/sdh1

Later I’ll set up a few LVM resources so let’s install the required lvm2 package:

bash

pacman -S lvm2

To initialize the RAID during boot /etc/mkinitcpio.conf needs to be updated. mkinitcpio can use a hook to assemble the arrays on boot. For more information see RAID - Configure mkinitcpio. Add the mdadm_udev and lvm2 hooks to the HOOKS array in /etc/mkinitcpio.conf after udev e.g.:

bash

HOOKS=(base udev mdadm_udev lvm2 autodetect modconf block filesystems keyboard fsck)

And recreate the ramdisk:

bash

mkinitcpio -p linux

This should also output something like this:

bash

...
  -> Running build hook: [mdadm_udev]
Custom /etc/mdadm.conf file will be used in initramfs for assembling arrays.
...

Now reboot the host. After it’s back let’s check if the RAID is still there:

bash

mdadm --misc --detail /dev/md/nas

If you want you can run a RAID check from time to time (e.g. every month). This is called scrubbing. See also RAID - Scrubbing in the Archlinux Wiki. The check operation scans the drives for bad sectors and automatically repairs them. This can take quite some time! For this we can’t use /dev/md/nas as the device name; we need to figure out the mdXXX name:

bash

ls -al /dev/md/nas

lrwxrwxrwx 1 root root 8 Sep 25 16:48 /dev/md/nas -> ../md127

So in my case the name is md127. To start the check:

bash

echo check > /sys/block/md127/md/sync_action

You can check the current state of the check like this:

bash

cat /proc/mdstat 

Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sdg2[3] sdd2[2] sdh2[4] sdc2[1] sdb2[0]
      43655645184 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
      [>....................]  check =  0.0% (2917056/14551881728) finish=1911.8min speed=126828K/sec
      bitmap: 0/109 pages [0KB], 65536KB chunk

As said this can take VERY long for a big RAID. But one can stop the check again:

bash

echo idle > /sys/block/md127/md/sync_action

To see whether the check found any errors (once the scrubbing is done) you need to check /sys/block/md127/md/mismatch_cnt (use your array’s mdXXX name):

bash

cat /sys/block/md127/md/mismatch_cnt

It is a good idea to set up a cron job or systemd timer as root to schedule a periodic scrub and also send status emails. See raid-check in the AUR which can assist with this. For typical platter drives, scrubbing can take approximately six seconds per gigabyte (that is one hour forty-five minutes per terabyte), so plan the start of your cron job or timer appropriately. For my five 16 TByte drives the check output above estimates roughly 32 hours (the member disks are scanned in parallel), so it still blocks quite some time…
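
A minimal sketch of such a cron job, assuming the array is md127 as in my case and that a cron daemon (cronie e.g.) is installed and enabled:

bash

# As root: append a monthly scrub (every 1st of the month at 03:00) to the crontab
(crontab -l 2>/dev/null; echo '0 3 1 * * echo check > /sys/block/md127/md/sync_action') | crontab -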

So let’s create the first LVM resource, a physical volume (PV):

bash

pvcreate /dev/md/nas

  Physical volume "/dev/md/nas" successfully created.

To get some information about that PV use pvdisplay (again, /dev/md/nas is just an alias (symlink) for /dev/md127, so both are the same device):

bash

pvdisplay

  "/dev/md127" is a new physical volume of "<40.66 TiB"
  --- NEW Physical volume ---
  PV Name               /dev/md127
  VG Name               
  PV Size               <40.66 TiB
  Allocatable           NO
  PE Size               0   
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               FOGS1V-LAZg-71mo-fM2C-dW6P-SpJG-7dLwlX

Let’s create a volume group (VG) using the just created physical volume:

bash

vgcreate nas /dev/md127

And again get some details about that VG:

bash

vgdisplay

  --- Volume group ---
  VG Name               nas
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               <40.66 TiB
  PE Size               4.00 MiB
  Total PE              10658116
  Alloc PE / Size       0 / 0
  Free  PE / Size       10658116 / <40.66 TiB
  VG UUID               MdPs73-CYVl-Sf59-3LBp-mKh0-0O0v-WAE3C7

Utilizing this VG we can create a logical volume (LV). In this example the LV will have a size of 2 GByte, it will be created in the VG nas and its name will be lvtest:

bash

lvcreate -L 2G nas -n lvtest

And again get some information about this logical volume:

bash

lvdisplay 

  --- Logical volume ---
  LV Path                /dev/nas/lvtest
  LV Name                lvtest
  VG Name                nas
  LV UUID                5H0q9H-klXA-m8l9-ZY59-PUHa-hZiF-mvZCkP
  LV Write Access        read/write
  LV Creation host, time uhura, 2021-08-12 00:03:27 +0200
  LV Status              available
  # open                 0
  LV Size                2.00 GiB
  Current LE             512
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     6144
  Block device           253:0

Now create an ext4 filesystem on that LV:

bash

mkfs.ext4 /dev/mapper/nas-lvtest

mke2fs 1.46.3 (27-Jul-2021)
Creating filesystem with 524288 4k blocks and 131072 inodes
Filesystem UUID: 1e9ea973-cab2-4010-820e-1cb9120e5554
Superblock backups stored on blocks: 
        32768, 98304, 163840, 229376, 294912

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done

And finally mount it e.g.:

bash

mount /dev/mapper/nas-lvtest /mnt
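
If this were a permanent filesystem (and not just this test LV) you would typically also add an entry to /etc/fstab so it gets mounted at boot, e.g.:

bash

echo '/dev/mapper/nas-lvtest /mnt ext4 defaults 0 2' >> /etc/fstab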

Before hitting a real device error I wanted to simulate one at least once. Of course you can’t simulate every possible error, but getting a bit of practice without real data at stake makes sense IMHO.

So make sure the hard drives (sdX, UUID, …) are labeled accordingly as already mentioned above. I’m also assuming there is no additional free SATA port and no spare drive in place. So in my case that’s just the RAID 6 consisting of 5 disks.

First let’s make sure that the array is in good shape:

bash

mdadm --detail /dev/md/nas

/dev/md/nas:
           Version : 1.2
     Creation Time : Tue Aug 10 22:05:17 2021
        Raid Level : raid6
        Array Size : 43655645184 (41633.27 GiB 44703.38 GB)
     Used Dev Size : 14551881728 (13877.76 GiB 14901.13 GB)
      Raid Devices : 5
     Total Devices : 5
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Thu Aug 12 21:51:58 2021
             State : clean 
    Active Devices : 5
   Working Devices : 5
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : uhura:nas  (local to host uhura)
              UUID : b31b3747:2c3d907e:68eccf46:3920aae0
            Events : 17489

    Number   Major   Minor   RaidDevice State
       0       8       18        0      active sync   /dev/sdb1
       1       8       34        1      active sync   /dev/sdc1
       2       8       50        2      active sync   /dev/sdd1
       3       8       98        3      active sync   /dev/sdg1
       4       8      114        4      active sync   /dev/sdh1

To simulate a device error we could now just pull the SATA cable of one of the hard disks. But if we plugged the cable in again the array would immediately start to rebuild because of the superblock: Linux RAID reserves a bit of space (called a superblock) on each component device. This space holds metadata about the RAID device and allows correct assembly of the array. If a matching superblock is detected on a disk the array rebuild starts. So we need to get rid of that superblock.
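
If you are curious what that superblock contains, mdadm can print it for any member device:

bash

# Prints the md metadata stored on /dev/sdb1: array UUID, RAID level,
# device role, update time and so on.
mdadm --examine /dev/sdb1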

So let’s fail the first device of the array manually:

bash

mdadm --fail /dev/md/nas /dev/sdb1

The array status looks like that now:

bash

mdadm --detail /dev/md/nas

/dev/md/nas:
           Version : 1.2
     Creation Time : Tue Aug 10 22:05:17 2021
        Raid Level : raid6
        Array Size : 43655645184 (41633.27 GiB 44703.38 GB)
     Used Dev Size : 14551881728 (13877.76 GiB 14901.13 GB)
      Raid Devices : 5
     Total Devices : 5
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Thu Aug 12 22:09:40 2021
             State : clean, degraded 
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 1
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : uhura:nas  (local to host uhura)
              UUID : b31b3747:2c3d907e:68eccf46:3920aae0
            Events : 17491

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       34        1      active sync   /dev/sdc1
       2       8       50        2      active sync   /dev/sdd1
       3       8       98        3      active sync   /dev/sdg1
       4       8      114        4      active sync   /dev/sdh1

/proc/mdstat now shows sdb1 as failed:

bash

cat /proc/mdstat

Personalities : [raid6] [raid5] [raid4] 
md127 : active raid6 sdg1[3] sdh1[4] sdb1[0](F) sdc1[1] sdd1[2]
      43655645184 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/4] [_UUUU]
      bitmap: 0/109 pages [0KB], 65536KB chunk

unused devices: <none>

Let’s verify that the mountpoint is still there:

bash

[root@host ~]# df -h
Filesystem                Size  Used Avail Use% Mounted on
dev                       7.7G     0  7.7G   0% /dev
run                       7.7G  808K  7.7G   1% /run
/dev/sda3                  30G  3.2G   26G  11% /
tmpfs                     7.7G     0  7.7G   0% /dev/shm
tmpfs                     7.7G     0  7.7G   0% /tmp
/dev/sda1                 511M   74M  438M  15% /boot
tmpfs                     1.6G     0  1.6G   0% /run/user/0
/dev/mapper/nas-lvtest  2.0G  707M  1.1G  39% /mnt

I copied a file to mountpoint /mnt/ and it’s still there:

bash

[root@host ~]# ls -al /mnt/
total 723168
drwxr-xr-x  3 root root      4096 Aug 12 00:07 .
drwxr-xr-x 17 root root      4096 Jul 20 23:01 ..
-rw-r--r--  1 root root 740492896 Aug 12 00:07 00003.ts
drwx------  2 root root     16384 Aug 12 00:06 lost+found

And the file still can be read despite the fact that the disk array is missing one disk:

bash

[root@host ~]# file /mnt/00003.ts 
/mnt/00003.ts: MPEG transport stream data

Just to be sure we remove the device from the array (which basically already happened, as seen above):

bash

mdadm --remove /dev/md/nas /dev/sdb1

Executing dmesg one should also see this operation in the kernel log:

bash

dmesg

[Thu Aug 12 22:09:40 2021] md/raid:md127: Disk failure on sdb1, disabling device.
                           md/raid:md127: Operation continuing on 4 devices.

Now the RAID superblock on the failed device can be deleted e.g.:

bash

mdadm --zero-superblock /dev/sdb1

And again verify that the /mnt mountpoint and the file is still there:

bash

ls -al /mnt/

total 723168
drwxr-xr-x  3 root root      4096 Aug 12 00:07 .
drwxr-xr-x 17 root root      4096 Jul 20 23:01 ..
-rw-r--r--  1 root root 740492896 Aug 12 00:07 00003.ts

Now the SATA cable of the failed device can be pulled and the host rebooted (just to have a clean state). Let’s have a look at /proc/mdstat again and see the difference: one disk is missing now (note that the remaining disks may get different device names after the reboot, so the names below don’t match the ones from before):

bash

cat /proc/mdstat 

Personalities : [raid6] [raid5] [raid4] 
md127 : active raid6 sdf1[3] sdc1[2] sdg1[4] sdb1[1]
      43655645184 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/4] [_UUUU]
      bitmap: 1/109 pages [4KB], 65536KB chunk

Shut down the computer and plug the HDD cable in again. After the host has started we’ll see that device sdb is still not part of the array:

bash

cat /proc/mdstat

Personalities : [raid6] [raid5] [raid4] 
md127 : active raid6 sdg2[3] sdc2[1] sdh2[4] sdd2[2]
      43655645184 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/4] [_UUUU]
      bitmap: 1/109 pages [4KB], 65536KB chunk

unused devices: <none>

Normally you would now need to partition the new drive, but in my case the partition is still there as it’s the same device and the partition table wasn’t deleted (just the RAID superblock was zeroed and the SATA cable was unplugged). If the hard disk had been replaced with a completely new drive, a new partition table (type GPT) and a new partition (type Linux RAID) would need to be created as I did at the beginning (see the sgdisk sketch below for a shortcut). So in my case it looks like this:

bash

sfdisk -l /dev/sdb

Disk /dev/sdb: 14.55 TiB, 16000900661248 bytes, 31251759104 sectors
Disk model: TOSHIBA MG08ACA1
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 511F9A50-7158-FA4D-B0A3-D28AFA9A5A5F

Device     Start         End     Sectors  Size Type
/dev/sdb1   2048 31251513343 31251511296 14.6T Linux RAID
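
By the way, if the disk really had been replaced with a brand new one, one way to recreate the partition layout is to copy the GPT from one of the healthy RAID members with sgdisk and then randomize the GUIDs. This is only a sketch (here /dev/sdc is assumed to be a healthy member and /dev/sdb the new, empty disk); double check source and target, as mixing them up overwrites the partition table of a healthy disk:

bash

# Replicate the partition table of the healthy disk /dev/sdc onto the new disk /dev/sdb ...
sgdisk --replicate=/dev/sdb /dev/sdc
# ... and give the new disk its own random disk and partition GUIDs
sgdisk --randomize-guids /dev/sdb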

As the partition /dev/sdb1 is already there I can add the “new” partition again to the RAID:

bash

mdadm --add /dev/md/nas /dev/sdb1

Now in /proc/mdstat the recovery process can be monitored:

bash

cat /proc/mdstat

Personalities : [raid6] [raid5] [raid4] 
md127 : active raid6 sdb1[5] sdg1[3] sdc1[1] sdh1[4] sdd1[2]
      43655645184 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/4] [_UUUU]
      [>....................]  recovery =  0.1% (18229760/14551881728) finish=1708.9min speed=141741K/sec
      bitmap: 0/109 pages [0KB], 65536KB chunk

Depending on the disk/array size the recovery process can take some time. Nevertheless /etc/mdadm.conf should be updated again after the change (making a backup of the old file first):

bash

cp /etc/mdadm.conf /etc/mdadm.conf.backup
mdadm --examine --scan --verbose > /etc/mdadm.conf

In my case the output is basically the same as the previous content of /etc/mdadm.conf. The order of the devices in the devices= list of the ARRAY entry will most probably be different, but that doesn’t matter; the important thing is that every member of the RAID is listed there.

If you only did the testing you can now remove the mountpoint, filesystem and LVM resources, e.g.:

bash

umount /mnt
lvremove /dev/nas/lvtest
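
If you followed along with the small 1 GB test partitions you will probably also want to tear down the volume group, the physical volume and the array itself before repartitioning the disks with their final size. Roughly like this (a sketch; double check the device names before running anything like it):

bash

vgremove nas                     # remove the (now empty) volume group
pvremove /dev/md127              # remove the LVM physical volume label
mdadm --stop /dev/md127          # stop the array
# wipe the md superblock from every member partition
mdadm --zero-superblock /dev/disk/by-partlabel/nas[12345]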

To repeat it again, do NOT forget these important points:

  • Be careful when working with filesystems, partitions, hard disks and LVM resources!
  • Whatever RAID level you choose you still NEED a backup of your data (RAID0 gives you NO redundancy at all of course)!
  • Install a cronjob/systemd timer that scrubs/checks your RAID regularly for bad sectors!
  • Monitor the result of the scrubbing process by either sending a status email or by providing metrics to a monitoring/alerting software like Prometheus
  • Use S.M.A.R.T. to monitor your drives! S.M.A.R.T. self tests probably shouldn’t run at the same time as the scrubbing process as this has a performance impact (see the smartctl sketch below). Also see my blog post Easy hard disk health monitoring on Linux with scrutiny and S.M.A.R.T
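
For the S.M.A.R.T. part the smartmontools package provides smartctl. A quick health overview and a manual short self test for a single disk look roughly like this:

bash

pacman -S smartmontools

smartctl -H /dev/sdb        # overall health self-assessment (PASSED/FAILED)
smartctl -A /dev/sdb        # SMART attributes (reallocated sectors, temperature, ...)
smartctl -t short /dev/sdb  # start a short self test; check the result later with "smartctl -a"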