Building a Linux storage server
Introduction
I didn't want to install full-featured NAS software like FreeNAS or Unraid. Instead I had a few goals I wanted to achieve (in no specific order):
- HDD storage should be quite resilient against hardware outage
- Lots of storage 😉
- Decent pricing
- Low power consumption
- Not too slow (in regards to I/O performance)
- NFS/CIFS (Samba) support
- iSCSI support (e.g. for Kubernetes Persistent Storage)
- Only Linux tools
That's it 😉 So let's talk about how I achieved these goals.
Disclaimer
Working with hard disks, partitions, filesystems, and so on is potentially dangerous as you can lose data if you run the wrong command at the wrong time! So be careful, especially if you operate on hard disks and/or partitions that already contain data! I'm not responsible for your data 😉
HDD storage should be quite resilient against HDD outage
This basically includes two main factors:
- Good hardware
- A suitable RAID level
Regarding "good hardware" I decided to go with five Toshiba Enterprise MG08 HDDs. They are made for running 24/7 and I have had good experience with Toshiba HDDs in general over the years. Their pricing is also quite okay.
When it comes to resilience against HDD outage there are basically four options: RAID 1 (mirror), RAID 5 (parity), RAID 6 (double parity) and RAID 10 (mirror/stripe). With RAID 1 and 10 I'd lose 50% of the available storage capacity with 5 disks. That's not acceptable for me. If performance had been a priority I'd most probably have used RAID 10, but capacity matters more to me. So the most obvious option was of course RAID 5. But that one has at least one issue: with today's multi-terabyte disks a RAID rebuild takes many hours (at least). With RAID 5 you can only lose one disk if you don't use a spare disk, and the probability that another disk fails during a rebuild that takes one to two days is quite high. That risk is too high for me. RAID 6 allows losing two disks and in my case with five HDDs I get 60% capacity utilization (which is still 10% more than with RAID 1/10). That's way better. Of course RAID 6 with an additional spare disk would be optimal but that's also quite expensive.
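To put the capacity trade-off into numbers, here is a little back-of-the-envelope sketch in plain shell arithmetic for five 16 TByte disks (it ignores filesystem and metadata overhead):
# Rough usable capacity with five 16 TB disks, ignoring overhead
DISKS=5; SIZE_TB=16
echo "RAID 1/10: $(( DISKS * SIZE_TB / 2 )) TB (50%, only one mirror half usable)"
echo "RAID 5:    $(( (DISKS - 1) * SIZE_TB )) TB (80%, survives 1 failed disk)"
echo "RAID 6:    $(( (DISKS - 2) * SIZE_TB )) TB (60%, survives 2 failed disks)"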
As boot/root disk I decided to use a small dedicated SSD. No need for redundancy here as it can be replaced quite easily.
Lots of storage
As mentioned above I decided to use five Toshiba Enterprise MG08ACA16TE HDDs with a capacity of 16 TByte each. With RAID 6 I get a usable capacity of 48 TByte or 43.66 TiB (tebibytes).
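As a quick sanity check of those numbers: RAID 6 leaves three of the five disks' capacity for data, and converting decimal TByte to binary TiB explains the smaller-looking figure, e.g. with bc:
# 3 data disks * 16 TB each, converted from decimal TB to binary TiB
echo "scale=2; 3 * 16 * 10^12 / 2^40" | bc
# 43.65 (≈ 43.66 TiB)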
I also bought the disks from two different dealers. The more dealers the better: if you buy from one dealer only, the probability is quite high that you get disks from the same production batch. If that batch has a production issue it affects all your disks. Buying from different dealers doesn't guarantee disks from different batches, but it increases the probability, which further reduces the risk that more than one disk fails while you replace a failed disk and rebuild the RAID array. Rebuilding a RAID array with big disks can take quite some time! And just to remind you that this isn't a theoretical issue: I once had two Samsung EVO 870 SSDs report SMART errors at around the same time. Both had serial numbers that differed only in the last digit. Searching the Internet it became clear that quite a few people had problems with disks that had the same production date as mine.
Important note: Whatever RAID level you choose (besides RAID 0 which gives you no redundancy at all) you still need to back up your data! If you delete your data by accident (rm -fr / e.g.) no RAID level will protect your data!
Decent pricing and low power consumption
I already mentioned the HDDs used and their pricing. So here are the other components I used that fulfill the decent pricing and low power consumption requirements:
- ITX board: ASRock J5040-ITX - This board contains an Intel Quad-Core Pentium J5040 with up to 3.2 GHz (4/4 cores/threads) and a TDP of 10W, one PCIe 2.0 x1 slot (used for the additional SATA controller mentioned below) and four SATA3 ports. So together with the additional SATA controller I have 10 SATA ports available. If you can't get a board with the J5040 processor, an Intel Celeron Processor J4125 is also good enough. It's only a little bit slower.
- Power supply: be quiet! ATX 400W System Power 9 BN245 - To have enough power for the board, the RAM and the case where all disks are working at the same time (e.g. during a RAID rebuild) I chose a 400W power supply. I wouldn't go below that. This power supply is also very silent.
- Additional SATA controller: MZHOU PCIe SATA Controller Expansion Card 6Gbps SATA 3.0 PCIe 6-Port - As everything RAID related is managed by Linux I didn't need a full-fledged RAID controller. The only important thing is that it is well supported by Linux. In this case the card uses Marvell 88SE9215 and ASMedia ASM109X chips and they are well supported by Linux. But don't expect performance miracles from this SATA controller 😉 Still, for my use case it's good enough. I connected the boot SSD and two of the RAID disks to the SATA ports the ASRock board offers and the other three RAID disks to the MZHOU SATA controller. That should distribute the I/O load a bit.
- RAM/Memory: HyperX Impact HX432S20IB2K2/16 16GB Kit (2x8GB) 3200MHz DDR4 CL20 SODIMM - While the board officially only supports 8 GB of RAM this RAM kit works very well. The Internet also contains reports that even 2x16GB SODIMMs work without problems.
Additionally I had an old tower case around, a Samsung EVO 850 SSD which I used as boot disk, and a Samsung EVO 840 SSD I wanted to use later for Kubernetes storage shared via iSCSI.
NFS/CIFS (Samba)/iSCSI support
NFS and CIFS (Samba) are well supported by Linux. Both protocols are used to share a mountpoint with all my videos to whatever device needs it. In my case that's mainly an Amazon FireTV Cube and my desktop PC.
I also wanted some storage for my little Kubernetes cluster. iSCSI is well supported by Linux via open-iscsi. Additionally Kubernetes has an iSCSI volume driver.
Using the Logical Volume Manager (LVM) makes it easy and flexible to provide block storage for both use cases.
Only Linux tools
I'll use Archlinux which always provides the latest Linux tools/software available. But Ubuntu or basically any other Linux distribution should be good enough. The tools used throughout this blog post should be available on basically every Linux distribution.
So let's finally start…
Install Linux
After the storage server was assembled I installed Archlinux. The installation process itself is out of scope here and also depends on your Linux distribution. As mentioned I installed Archlinux on the SSD and didn't touch the RAID disks at all during installation. After the freshly installed Linux booted for the first time I checked what the disk setup looked like. lsblk is a nice little utility for that task (make sure the util-linux package is installed, which is normally the case):
[root@uhura ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 232.9G 0 disk
├─sda1 8:1 0 512M 0 part /boot
├─sda2 8:2 0 2G 0 part [SWAP]
├─sda3 8:3 0 30G 0 part /
└─sda4 8:4 0 200.4G 0 part
sdb 8:16 0 14.6T 0 disk
sdc 8:32 0 14.6T 0 disk
sdd 8:48 0 14.6T 0 disk
sde 8:64 0 232.9G 0 disk
sdf 8:80 0 14.6T 0 disk
sdg 8:96 0 14.6T 0 disk
So as you can see the root (/) partition ended up on /dev/sda3 (the SAMSUNG EVO 850 SSD). I used 30 GB for this one which should be enough as it will only contain the Linux OS. /dev/sde is the SAMSUNG EVO 840 SSD mentioned above but that one isn't important here. So the Toshiba disks for my RAID 6 are /dev/sd(b|c|d|f|g).
Write down HDD serial numbers
Before I actually did anything RAID related I prepared for the case that a HDD fails. I wrote down the serial numbers of all disks that are part of the RAID. This makes it easier to identify a failed disk later in case you need to replace it, and ensures you really remove the failed disk and not a still working one.
For this task we need hdparm installed:
pacman -S hdparm
To get the serial number of /dev/sdb e.g. this command can be used:
hdparm -i /dev/sdb | grep Serial | awk '{ print $4 }'
SerialNo=42AAEEEYFVGG
And for all my disks:
for X in b c d f g; do
echo "HDD /dev/sd${X}: $(hdparm -i /dev/sd${X} | grep Serial | awk '{ print $4 }')"
done
HDD /dev/sdb: SerialNo=42C123FYEVGG
HDD /dev/sdc: SerialNo=42N123531VGG
HDD /dev/sdd: SerialNo=42A123FMNVGG
HDD /dev/sdf: SerialNo=42C123Y3NVGG
HDD /dev/sdg: SerialNo=42S123F5FVGG
Save this information on a different computer, mobile phone or tablet, or print it out. Afterwards I put a label on every disk with the device path and serial number. So if a disk fails at any time I have all the information needed to identify the failed disk right on the disk itself. More about that later.
Partitioning
Let's start doing something useful by creating a partition on every RAID disk. You can use whatever tool you want for this task. I used gdisk, which allows giving the partitions a name. This is nice as these partition names can later be found in /dev/disk/by-partlabel/ and used for creating the RAID e.g. This makes identifying the partitions easier.
After testing I'll create only one big partition on every disk used for the RAID. All these partitions are later used to create LVM volume groups and logical volumes. So there will be one big disk array and on top of that the LVM volume groups.
But for testing a RAID failure and rebuild it makes sense to start with a very small partition on every disk (which I deleted again after I was done with testing) as the rebuild process can take VERY long with big partitions. So I'll create a 1 GB partition on every disk to make recovery faster. As mentioned this is only for testing purposes; the final partitions are created the same way, just bigger.
So let's install gdisk:
pacman -S gdisk
Start the tool and configure /dev/sdb:
gdisk /dev/sdb
gdisk should normally create a partition table of type GPT automatically.
I created a new partition by pressing n and entering the Partition Number (usually 1). Make sure that the First sector starts at 2048 for optimal alignment and performance reasons. That should be the default nowadays anyway.
Now gdisk asks for the Last sector. If you follow along and just add a test partition, enter +1G which will give you a partition of roughly 1 GB. If you decide that you already want to create a partition with the final size then I suggest NOT using the whole disk! The reason is that if you need to replace the disk later (because of a disk error e.g.) you might get a disk that doesn't have as many sectors as your old one, even if the new one has the same nominal size! For that reason it makes sense to leave about 100 MB of disk space unused. To calculate the Last sector we need some information about the disk, which we get by pressing the p key:
Command (? for help): p
Disk /dev/sdb: 31251759104 sectors, 14.6 TiB
...
Sector size (logical/physical): 512/4096 bytes
...
First usable sector is 2048, last usable sector is 31251759070
...
If we want to keep 100 MB of free space the formula to get the sector count for 100 MB of disk space is:
free space in MB * 1024 * 1024 / logical sector size = sector count
E.g.:
100 MB * 1024 * 1024 / 512 = 204800
The first (and only) partition starts at sector 2048 and 204800 sectors should be left free at the end. The last usable sector is 31251759070. So the value for Last sector is 31251759070 - 204800 - 2048 = 31251552222.
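The same calculation as a small shell snippet (the numbers are the ones gdisk reported above for this disk; adjust them for yours):
FREE_MB=100; SECTOR_SIZE=512
LAST_USABLE=31251759070
FREE_SECTORS=$(( FREE_MB * 1024 * 1024 / SECTOR_SIZE ))   # 204800
echo $(( LAST_USABLE - FREE_SECTORS - 2048 ))             # 31251552222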
For Hex code or GUID enter fd00 which is the code for Linux RAID.
gdisk also allows setting a partition name by pressing c. I used nas1 as the partition name on the first disk and then just counted up for the partitions on the other disks, e.g. nas2|3|4|5.
So in my case the final partition table looks like this (I used 120 MB for the free space so the numbers are a little bit different):
Command (? for help): p
Disk /dev/sdb: 31251759104 sectors, 14.6 TiB
Model: TOSHIBA MG08ACA1
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): 511F9A50-7158-FA4D-B0A3-D28AFA9A5A5F
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 2048, last usable sector is 31251759070
Partitions will be aligned on 2048-sector boundaries
Total free space is 245727 sectors (120.0 MiB)
Number Start (sector) End (sector) Size Code Name
1 2048 31251513343 14.6 TiB FD00 nas1
The same needs to be done for all the other RAID disks before continuing.
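Before continuing it's worth a quick check that all partition labels show up as expected (assuming you used the names nas1 to nas5 as I did):
ls -l /dev/disk/by-partlabel/
# should list nas1 ... nas5 pointing to the corresponding sdX1 partitions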
If all partitions on all disks are ready the RAID setup can start. First the mdadm package is needed as it contains the tools to create a RAID:
pacman -S mdadm
Next I loaded a few kernel modules. Normally they should be loaded automatically but just to be sure:
modprobe -v raid6
modprobe -v dm-mod
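To verify that the relevant modules are actually loaded you can quickly check with lsmod (module names can vary a bit between kernel versions, e.g. the RAID 6 code may show up as raid456):
lsmod | grep -E 'raid|dm_mod'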
Next create a RAID named nas (--name). If not specified, the basename of the device is used, which is /dev/md/nas and therefore the same as the value of --name. --level specifies the RAID level and that's of course 6 as I want to create a RAID 6 array. --raid-devices specifies the number of disks the array consists of and that's 5. And finally all the partitions that are used for the array need to be specified. As you can see, at this point the partition names can be used, which is very handy:
mdadm --create /dev/md/nas --name=nas --level=6 --raid-devices=5 /dev/disk/by-partlabel/nas[12345]
Additionally you can also specify the following (important) parameters explicitly, but they are the defaults nowadays anyway (check the manpage man mdadm to verify):
--chunk=512 --metadata=1.2
A chunk size of 512 (in KByte) is the default and normally sufficient (you can also see it in the /proc/mdstat output below). You definitely want to use --metadata=1.2 if it's not the default of your mdadm. From the man page:
Use the new version-1 format superblock. This has fewer restrictions.
It can easily be moved between hosts with different endian-ness, and
a recovery operation can be checkpointed and restarted. The different
sub-versions store the superblock at different locations on the device,
either at the end (for 1.0), at the start (for 1.1) or 4K from the
start (for 1.2). "1" is equivalent to "1.2" (the commonly preferred
1.x format). "default" is equivalent to "1.2".
So if you now fire the command above it will start creating the array and you can watch the status e.g.:
cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sdg2[4] sdf2[3] sdd2[2] sdc2[1] sdb2[0]
43655645184 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
[>....................] resync = 0.0% (2559952/14551881728) finish=1705.0min speed=142219K/sec
bitmap: 109/109 pages [436KB], 65536KB chunk
With my setup it took about 24 hours using the full size of all disks. With five 1 GB partitions it is only a matter of a minute or so. Once the build is done the status looks like this:
cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sdg2[4] sdf2[3] sdd2[2] sdc2[1] sdb2[0]
43655645184 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
bitmap: 0/109 pages [0KB], 65536KB chunk
To make the array setup permanent let's save the configuration:
mdadm --examine --scan --verbose >> /etc/mdadm.conf
The result will look like this:
ARRAY /dev/md/nas level=raid6 metadata=1.2 num-devices=5 UUID=b31b3747:2c3d907e:68eccf46:3920aae0 name=uhura:nas
devices=/dev/sdg2,/dev/sdf2,/dev/sdd2,/dev/sdc2,/dev/sdb2
It makes sense to have a backup of that file somewhere else in case the disk array fails and you need to repair a degraded array.
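For example (just a sketch; backuphost and the target path are placeholders for wherever you keep such backups):
scp /etc/mdadm.conf root@backuphost:/srv/backups/uhura-mdadm.conf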
To get some more information about the array and its state you can use the following command:
mdadm --misc --detail /dev/md/nas
/dev/md/nas:
Version : 1.2
Creation Time : Sun Aug 15 22:37:49 2021
Raid Level : raid6
Array Size : 46876870656 (44705.27 GiB 48001.92 GB)
Used Dev Size : 15625623552 (14901.76 GiB 16000.64 GB)
Raid Devices : 5
Total Devices : 5
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Tue Aug 17 00:55:05 2021
State : clean
Active Devices : 5
Working Devices : 5
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : bitmap
Name : uhura:nas (local to host uhura)
UUID : 3bb811ef:60fc09a7:59d3b046:498ed04c
Events : 18701
Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
1 8 33 1 active sync /dev/sdc1
2 8 49 2 active sync /dev/sdd1
3 8 97 3 active sync /dev/sdg1
4 8 113 4 active sync /dev/sdh1
Later I'll set up a few LVM resources so let's install the required lvm2 package:
pacman -S lvm2
To initialize the RAID during boot /etc/mkinitcpio.conf needs to be updated. mkinitcpio can use a hook to assemble the arrays on boot. For more information see RAID - Configure mkinitcpio. Add the mdadm_udev and lvm2 hooks to the HOOKS array in /etc/mkinitcpio.conf after udev e.g.:
HOOKS=(base udev mdadm_udev lvm2 autodetect modconf block filesystems keyboard fsck)
And recreate the ramdisk:
mkinitcpio -p linux
This should also output something like this:
...
-> Running build hook: [mdadm_udev]
Custom /etc/mdadm.conf file will be used in initramfs for assembling arrays.
...
Now reboot the host. After it's back let's check if the RAID is still there:
mdadm --misc --detail /dev/md/nas
Check array (scrubbing)
If you want you can run a RAID check from time to time (around every month). This is called scrubbing. See also RAID - Scrubbing in the Archlinux Wiki. The check operation scans the drives for bad sectors and automatically repairs them. This can take quite some time! For this we can't use /dev/md/nas as device name. We need to figure out the mdXXX name:
ls -al /dev/md/nas
lrwxrwxrwx 1 root root 8 Sep 25 16:48 /dev/md/nas -> ../md127
So in my case the name is md127. To start the check:
echo check > /sys/block/md127/md/sync_action
You can check the current state of the check like this:
cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sdg2[3] sdd2[2] sdh2[4] sdc2[1] sdb2[0]
43655645184 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
[>....................] check = 0.0% (2917056/14551881728) finish=1911.8min speed=126828K/sec
bitmap: 0/109 pages [0KB], 65536KB chunk
As said this can take VERY long for a big RAID. But one can stop the check again:
echo idle > /sys/block/md127/md/sync_action
To see if the check found any errors (once the scrubbing is done) you need to check /sys/block/md127/md/mismatch_cnt:
cat /sys/block/md127/md/mismatch_cnt
It is a good idea to set up a cron job or systemd timer as root to schedule a periodic scrub and also send status emails. See raid-check in the AUR which can assist with this. For typical platter drives, scrubbing can take approximately six seconds per gigabyte (that is about one hour forty-five minutes per terabyte), so plan the start of your cron job or timer appropriately. Since all members of the array are read in parallel, for my 16 TByte drives that means a bit more than a day, which also matches the finish estimate /proc/mdstat shows above…
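A minimal sketch of such a cron entry for root, assuming the md127 name from above (the raid-check script from the AUR is a more complete solution with reporting):
# /etc/cron.d/raid-scrub (sketch): start a scrub at 02:00 on the 1st of every month
0 2 1 * * root echo check > /sys/block/md127/md/sync_action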
LVM
So let's create the first LVM resource, a physical volume (PV):
pvcreate /dev/md/nas
Physical volume "/dev/md/nas" successfully created.
To get some information about that PV use the following command (again, /dev/md/nas is just an alias of /dev/md127 so both are the same device):
pvdisplay
"/dev/md127" is a new physical volume of "<40.66 TiB"
--- NEW Physical volume ---
PV Name /dev/md127
VG Name
PV Size <40.66 TiB
Allocatable NO
PE Size 0
Total PE 0
Free PE 0
Allocated PE 0
PV UUID FOGS1V-LAZg-71mo-fM2C-dW6P-SpJG-7dLwlX
Let's create a volume group (VG) using the just created physical volume:
vgcreate nas /dev/md127
And again get some details about that VG:
vgdisplay
--- Volume group ---
VG Name nas
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 1
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 0
Open LV 0
Max PV 0
Cur PV 1
Act PV 1
VG Size <40.66 TiB
PE Size 4.00 MiB
Total PE 10658116
Alloc PE / Size 0 / 0
Free PE / Size 10658116 / <40.66 TiB
VG UUID MdPs73-CYVl-Sf59-3LBp-mKh0-0O0v-WAE3C7
Utilizing this VG we can create a logical volume (LV). In this example this LV will have a size of 2 GByte, it will be created on VG nas and its name will be lvtest:
lvcreate -L 2G nas -n lvtest
And again get some information about this logical volume:
lvdisplay
--- Logical volume ---
LV Path /dev/nas/lvtest
LV Name lvtest
VG Name nas
LV UUID 5H0q9H-klXA-m8l9-ZY59-PUHa-hZiF-mvZCkP
LV Write Access read/write
LV Creation host, time uhura, 2021-08-12 00:03:27 +0200
LV Status available
# open 0
LV Size 2.00 GiB
Current LE 512
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 6144
Block device 253:0
Now create an ext4 filesystem on that LV:
mkfs.ext4 /dev/mapper/nas-lvtest
mke2fs 1.46.3 (27-Jul-2021)
Creating filesystem with 524288 4k blocks and 131072 inodes
Filesystem UUID: 1e9ea973-cab2-4010-820e-1cb9120e5554
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912
Allocating group tables: done
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done
And finally mount it e.g.:
mount /dev/mapper/nas-lvtest /mnt
Simulating a drive error
Before hitting a real device error I wanted to simulate one at least once. Of course you can't simulate every possible error, but getting a bit of practice without real data at stake makes sense IMHO.
So make sure the hard drives (sdX, UUID, …) are labeled accordingly as already mentioned above. I'm also assuming there is no additional free SATA port and no spare drive in place. So in my case that's just the RAID 6 consisting of 5 disks.
First lets make sure that the array is in good shape:
mdadm --detail /dev/md/nas
/dev/md/nas:
Version : 1.2
Creation Time : Tue Aug 10 22:05:17 2021
Raid Level : raid6
Array Size : 43655645184 (41633.27 GiB 44703.38 GB)
Used Dev Size : 14551881728 (13877.76 GiB 14901.13 GB)
Raid Devices : 5
Total Devices : 5
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Thu Aug 12 21:51:58 2021
State : clean
Active Devices : 5
Working Devices : 5
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : bitmap
Name : uhura:nas (local to host uhura)
UUID : b31b3747:2c3d907e:68eccf46:3920aae0
Events : 17489
Number Major Minor RaidDevice State
0 8 18 0 active sync /dev/sdb1
1 8 34 1 active sync /dev/sdc1
2 8 50 2 active sync /dev/sdd1
3 8 98 3 active sync /dev/sdg1
4 8 114 4 active sync /dev/sdh1
To simulate a device error we could now just pull the SATA cable of one of the hard disks. But if we plugged the cable in again the array would immediately start to rebuild because of the superblock: Linux RAID reserves a bit of space (called a superblock) on each component device. This space holds metadata about the RAID device and allows correct assembly of the array. If a matching superblock is detected on a disk the array rebuild starts. So we need to get rid of that superblock.
So let's fail the first device of the array manually:
mdadm --fail /dev/md/nas /dev/sdb1
The array status looks like that now:
mdadm --detail /dev/md/nas
/dev/md/nas:
Version : 1.2
Creation Time : Tue Aug 10 22:05:17 2021
Raid Level : raid6
Array Size : 43655645184 (41633.27 GiB 44703.38 GB)
Used Dev Size : 14551881728 (13877.76 GiB 14901.13 GB)
Raid Devices : 5
Total Devices : 5
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Thu Aug 12 22:09:40 2021
State : clean, degraded
Active Devices : 4
Working Devices : 4
Failed Devices : 1
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : bitmap
Name : uhura:nas (local to host uhura)
UUID : b31b3747:2c3d907e:68eccf46:3920aae0
Events : 17491
Number Major Minor RaidDevice State
- 0 0 0 removed
1 8 34 1 active sync /dev/sdc1
2 8 50 2 active sync /dev/sdd1
3 8 98 3 active sync /dev/sdg1
4 8 114 4 active sync /dev/sdh1
/proc/mdstat
now shows sdb1
as failed:
cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sdg1[3] sdh1[4] sdb1[0](F) sdc1[1] sdd1[2]
43655645184 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/4] [_UUUU]
bitmap: 0/109 pages [0KB], 65536KB chunk
unused devices: <none>
Let's verify that the mountpoint is still there:
[root@host ~]# df -h
Filesystem Size Used Avail Use% Mounted on
dev 7.7G 0 7.7G 0% /dev
run 7.7G 808K 7.7G 1% /run
/dev/sda3 30G 3.2G 26G 11% /
tmpfs 7.7G 0 7.7G 0% /dev/shm
tmpfs 7.7G 0 7.7G 0% /tmp
/dev/sda1 511M 74M 438M 15% /boot
tmpfs 1.6G 0 1.6G 0% /run/user/0
/dev/mapper/nas-lvtest 2.0G 707M 1.1G 39% /mnt
I copied a file to the mountpoint /mnt/ and it's still there:
[root@host ~]# ls -al /mnt/
total 723168
drwxr-xr-x 3 root root 4096 Aug 12 00:07 .
drwxr-xr-x 17 root root 4096 Jul 20 23:01 ..
-rw-r--r-- 1 root root 740492896 Aug 12 00:07 00003.ts
drwx------ 2 root root 16384 Aug 12 00:06 lost+found
And the file still can be read despite the fact that the disk array is missing one disk:
[root@host ~]# file /mnt/00003.ts
/mnt/00003.ts: MPEG transport stream data
Just to be sure we remove the device (which basically already happened as seen above):
mdadm --remove /dev/md/nas /dev/sdb1
Executing dmesg one should also see this operation in the kernel log:
dmesg
[Thu Aug 12 22:09:40 2021] md/raid:md127: Disk failure on sdb1, disabling device.
md/raid:md127: Operation continuing on 4 devices.
Now the RAID superblock on the failed device can be deleted e.g.:
mdadm --zero-superblock /dev/sdb1
And again verify that the /mnt mountpoint and the file are still there:
ls -al /mnt/
total 723168
drwxr-xr-x 3 root root 4096 Aug 12 00:07 .
drwxr-xr-x 17 root root 4096 Jul 20 23:01 ..
-rw-r--r-- 1 root root 740492896 Aug 12 00:07 00003.ts
Now the SATA cable of the failed device can be pulled and the host rebooted (just to have a clean state). Afterwards let's have a look at /proc/mdstat again and see the difference (sdb1 is missing):
cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sdf1[3] sdc1[2] sdg1[4] sdb1[1]
43655645184 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/4] [_UUUU]
bitmap: 1/109 pages [4KB], 65536KB chunk
Shut down the computer and plug in the HDD cable again. Once the host has started we'll see that device sdb is still not part of the array:
cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sdg2[3] sdc2[1] sdh2[4] sdd2[2]
43655645184 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/4] [_UUUU]
bitmap: 1/109 pages [4KB], 65536KB chunk
unused devices: <none>
Normally you would now need to partition the new drive, but in my case the partition is still there as it's the same device and the partition table wasn't deleted (just the RAID superblock was zeroed and the SATA cable was unplugged). If the hard disk had been replaced with a completely new drive, a new partition table (type GPT) and a new partition (type Linux RAID) would need to be created as I did at the beginning. So in my case it looks like this:
sfdisk -l /dev/sdb
Disk /dev/sdb: 14.55 TiB, 16000900661248 bytes, 31251759104 sectors
Disk model: TOSHIBA MG08ACA1
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 511F9A50-7158-FA4D-B0A3-D28AFA9A5A5F
Device Start End Sectors Size Type
/dev/sdb1 2048 31251513343 31251511296 14.6T Linux RAID
As the partition /dev/sdb1 is already there I can add the "new" partition again to the RAID:
mdadm --add /dev/md/nas /dev/sdb1
Now in /proc/mdstat the recovery process can be monitored:
cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sdb1[5] sdg1[3] sdc1[1] sdh1[4] sdd1[2]
43655645184 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/4] [_UUUU]
[>....................] recovery = 0.1% (18229760/14551881728) finish=1708.9min speed=141741K/sec
bitmap: 0/109 pages [0KB], 65536KB chunk
Depending on the disk/array size the recovery process can take some time. Nevertheless /etc/mdadm.conf should be updated again after the change (making a backup of the old file beforehand):
cp /etc/mdadm.conf /etc/mdadm.conf.backup
mdadm --examine --scan --verbose > /etc/mdadm.conf
In my case the output is basically the same as the previous content of /etc/mdadm.conf. The order of the devices listed in the devices list of the ARRAY keyword will most probably be different, but that doesn't matter. The important thing is that every member of the RAID is listed there.
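A quick way to compare the newly generated file with the backup is a simple diff (any differences should only be in the order of the devices= list):
diff /etc/mdadm.conf.backup /etc/mdadm.conf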
If you were only testing you can now remove the mountpoint, filesystem and LVM resources e.g.:
umount /mnt
lvremove /dev/nas/lvtest
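If you also want to tear down the test array itself before creating the final, full-size partitions, a sketch of the remaining cleanup could look like this (double-check the device names against your own setup first):
vgremove nas                  # remove the volume group (all LVs must be gone)
pvremove /dev/md/nas          # remove the LVM physical volume label
mdadm --stop /dev/md/nas      # stop the array
mdadm --zero-superblock /dev/disk/by-partlabel/nas[12345]   # wipe the RAID superblocks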
Closing notes
To repeat it again: do NOT forget these important points:
- Be careful when working with filesystems, partitions, hard disks and LVM resources!
- Whatever RAID level you choose you still NEED a backup of your data (RAID0 gives you NO redundancy at all of course)!
- Install a cronjob/systemd timer that scrubs/checks your RAID regularly for bad sectors!
- Monitor the result of the scrubbing process by either sending a status email or by providing metrics to a monitoring/alerting software like Prometheus
- Use S.M.A.R.T. to monitor your drives!
S.M.A.R.T. self tests most probably shouldn't run at the same time as the scrubbing process as this has a performance impact. Also see my blog post Easy hard disk health monitoring on Linux with scrutiny and S.M.A.R.T.
Acknowledgement
- Software-RAID 0,1,5,6 oder 10 (German only, but an excellent guide…)
- RAID - Archlinux
- LVM on software RAID - Archlinux