System startup fails with degraded RAID + encryption

Bug #324997 reported by Tapani Rantakokko on 2009-02-03
This bug affects 2 people
Affects: mdadm (Ubuntu)
Importance: Undecided
Assigned to: Unassigned

Bug Description

Binary package hint: mdadm

Release: Ubuntu 8.04.2, from alternate installation media. Probably also affects Ubuntu 8.10.

Module: mdadm. Possibly affects others as well.

Version: 2.6.3. Probably also affects 2.6.7, which is used in 8.10.

Situation: I have installed / (root) and swap on RAID 1 partitions with encryption and LVM. When one of the RAID disks is missing from the system, a prompt should ask whether I want to start with a degraded RAID setup or not. Answering Yes should start the system.

What happens: This depends on whether "quiet splash" boot options are in use or not, as follows:

- When "quiet splash" is used, a prompt for the LUKS encryption password appears normally, but authentication fails again and again even if the user types the correct password. At this point the user probably thinks that the password is wrong, or that something is broken in the encryption layer. Both assumptions are incorrect; in fact there is nothing wrong with either the encryption or the password, yet the system fails to start. By pressing CTRL+ALT+F1, one can see these messages on screen, which also point in the wrong direction:
Starting up ...
Loading, please wait...
Setting up cryptographic volume md1_crypt (based on /dev/md1)
cryptsetup: cryptsetup failed, bad password or options?
cryptsetup: cryptsetup failed, bad password or options?

- Without "quiet splash" boot options, the user can observe system messages on screen during boot process. At a certain point, the missing RAID disk causes a long wait. During this wait, one can see these messages on screen:
Command failed: Not a block device
cryptsetup: cryptsetup failed, bad password or options?
... other stuff ...
Command failed: Not a block device
cryptsetup: cryptsetup failed, bad password or options?
Command failed: Not a block device
cryptsetup: cryptsetup failed, bad password or options?
cryptsetup: maximum number of tries exceeded
Done.
Begin: Waiting for root file system... ...
After a few minutes, the system asks whether the user wants to start with a degraded RAID setup. After answering Yes and another few minutes of waiting, the system drops to a command line (BusyBox), i.e. fails to start. This probably happens because the root filesystem is encrypted and needs to be opened with a password, but the password prompt already failed during the long RAID wait period.

Note that it is possible to start the system from BusyBox by typing "cryptsetup luksOpen /dev/md1 md1_crypt", entering the LUKS password, and finally pressing CTRL+D. This proves that the encryption works and that the problem is related to the degraded RAID, which is handled by mdadm. The system also starts properly when all RAID disks are present; the problem only appears with a degraded RAID.
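
The recovery steps above can be sketched as a small guarded sequence. This is only a sketch of what the reporter did by hand; the device (/dev/md1) and mapping name (md1_crypt) are this reporter's and will differ per system.

```shell
# Sketch of the BusyBox recovery from this report. /dev/md1 and md1_crypt
# are the reporter's names; adjust them to your own layout.
unlock_root() {
    dev=$1; name=$2
    if [ -b "$dev" ]; then
        # Prompts interactively for the LUKS passphrase
        cryptsetup luksOpen "$dev" "$name"
    else
        echo "$dev is not a block device (RAID array not assembled yet?)"
        return 1
    fi
}

# In the initramfs shell one would run the function and then press CTRL+D
# to let the boot continue:
#   unlock_root /dev/md1 md1_crypt
```

After the mapping appears under /dev/mapper, pressing CTRL+D resumes the normal boot, as described above.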

More information:
For a long time, it was not possible to boot properly with a degraded RAID setup. This bug was never present in Debian, only in Ubuntu; see bug 120375. A solution appeared in 8.10 and was recently backported to 8.04.2; see bug 290885. Apparently the fix did not cover encryption, so this bug likely affects all recent Ubuntu releases, including 8.10.

I consider this quite important to fix, as many RAID 1 users also need encryption to protect their data, especially in the business world. Although there is a workaround for starting the system with a degraded RAID and encryption, the error messages clearly point in the wrong direction (encryption is not the problem), and even a user who knows what is going on will find it too complicated to look up the workaround by googling in a stressful situation where a production server fails to start.

Suggestions:
- When usplash is used, there should be a notification about a possibly degraded RAID array *before* the LUKS encryption password is prompted, so that the user can figure out that the failure to boot is related to the RAID disks, not encryption
- When a degraded RAID is detected, the LUKS password should be prompted *after* the user has answered the question about starting the system with a degraded RAID
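
The first suggestion could be sketched roughly like this. This is only an illustration of the idea, not Ubuntu's actual initramfs code; the sysfs path /sys/block/md*/md/degraded is an assumption about where the kernel exposes the per-array degraded flag.

```shell
# Hedged sketch of suggestion 1: before prompting for the LUKS passphrase,
# check a RAID array's "degraded" flag and warn the user.
warn_if_degraded() {
    flag_file=$1    # e.g. /sys/block/md1/md/degraded (assumed sysfs path)
    [ -f "$flag_file" ] || return 0
    if [ "$(cat "$flag_file")" = "1" ]; then
        echo "WARNING: RAID array is degraded; a failing passphrase prompt"
        echo "below may be caused by the array, not by a wrong password."
    fi
}

# On a real system one would loop over /sys/block/md*/md/degraded here,
# calling warn_if_degraded on each, before invoking cryptsetup.
```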

Tapani Rantakokko (trantako) wrote :

In my original bug report, I used 8.04.1 that had been live-upgraded to 8.04.2. Recently, I tried installing the same setup (RAID 1 + encryption + LVM) on another computer, this time directly from the Ubuntu 8.04.2 Alternate installation disk. The installation went fine, but on the very first boot after installation the computer did not start. I got these messages when I rebooted without the "quiet splash" boot options:

md: invalid raid superblock magic on sda
raid array is not clean -- starting background reconstruction
Check root= bootarg: cat /proc/cmdline
or missing modules, devices: cat /proc/modules; ls /dev
ALERT! /dev/mapper/lvm_group-lvm_root does not exist

I got dropped to BusyBox. "ls /dev/mapper" did not reveal any md devices, so the error seemed correct. Then I thought to test "cat /proc/mdstat". Everything was fine there: both disks up, resync ongoing... At first I could not figure out what was going on, as the same setup method had previously worked. Then I realized that I was now using the 8.04.2 installation disk, which includes some changes to the degraded-RAID boot procedure, and that I had already filed this bug about degraded RAID and encryption.

And, what do you know, after typing the same fix in BusyBox, "cryptsetup luksOpen /dev/md1 md1_crypt", everything was fine again: with "ls /dev/mapper" I could see my devices. I waited until the RAID sync was over (cat /proc/mdstat shows the progress) and then continued booting with CTRL+D. It worked. I rebooted, and this time the LUKS password was queried and the system booted correctly.

So, what happened is very much related to this same bug: when you have LUKS encryption on top of a RAID 1 array, and the array gets degraded either because a disk is missing or, as in this case, needs to be resynced, the system won't start. Now it seems even more fatal, as the system fails to boot the very first time after installation, even if both disks are present and working. And once again, the error messages do not point in the right direction.

In this case, the problem could be handled at least partially by printing a message that says a RAID resync is in progress, how to check its progress, and that once it finishes the computer can be restarted and will boot normally. Currently, the user gets confused and probably thinks the installation did not work at all, although there is nothing wrong with it.
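
Such a message could report the resync progress straight from /proc/mdstat. A minimal sketch; it parses a sample mdstat line here, since a live /proc/mdstat with a resync in progress only exists on an affected machine, but the line format is the kernel's standard mdstat output.

```shell
# Sketch: extract resync progress (e.g. "8.3%") from a /proc/mdstat line.
# On a real system one would pipe /proc/mdstat through this filter.
resync_progress() {
    sed -n 's/.*resync *= *\([0-9.]*%\).*/\1/p'
}

sample='      [=>...................]  resync =  8.3% (163968/1950656) finish=0.9min speed=32793K/sec'
printf '%s\n' "$sample" | resync_progress
```

A boot-time message built from this could tell the user: "RAID resync at 8.3%; when it completes, reboot and the system will start normally."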

jerryharsh (gerald-jerryh) wrote :

I just installed 8.10 with kernel 2.6.27-11 (Alternate CD, 64-bit, RAID 1 + encryption) on an AMD Phenom II 940.

Everything worked fine until I deliberately disconnected one of the mirror drives, after placing a 'fallback' option and an alternate boot drive in menu.lst (I wasn't aware of the 'bootdegraded=true' option at the time). So I received the full-page screen explaining what was occurring. While I was reading this screen, it advanced to a second screen and paused for my input, but the "yes/no" input question appears ON THIS 2ND SCREEN. And, of course, by then it was too late to answer 'y' or 'N'.
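
For reference, the 'bootdegraded=true' option mentioned above is passed on the kernel line of a menu.lst stanza. A hedged example; the kernel version is the one this commenter names, and the root device path is borrowed from the mapper path reported earlier in this bug, so both are illustrative rather than prescriptive:

```
title  Ubuntu 8.10, kernel 2.6.27-11-generic (boot degraded)
root   (hd0,0)
kernel /boot/vmlinuz-2.6.27-11-generic root=/dev/mapper/lvm_group-lvm_root ro bootdegraded=true
initrd /boot/initrd.img-2.6.27-11-generic
```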

I believe this method of asking for user input after it is too late should be considered a bug.

Mikhail (mikhail-manuilov) wrote :

I'm experiencing the same problem on Ubuntu 9.04 server (see attachment). Kernel: 2.6.28-13-server, mdadm: 2.6.7.1.

ceg (ceg) wrote :

This is Bug #251164: boot impossible due to missing initramfs failure hook / event-driven initramfs.

Gareth Evans (garethevans-9) wrote :

Still a problem on 14.04.2. FYI, here is a workaround I stumbled upon, though I don't know if it works:
https://feeding.cloud.geek.nz/posts/the-perils-of-raid-and-full-disk-encryption-on-ubuntu/

Any chance of a fix?

Mechanix (mechanix) wrote :

Yes, I can confirm this. Using 14.04. The installation was successful, but booting into a degraded RAID system won't work. This is a major issue when using encryption and RAID, as it gives you a false sense of security if one drive fails.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in mdadm (Ubuntu):
status: New → Confirmed