Comment 0 for bug 1587142

Benoît Thébaudeau (btheb) wrote : Reboot hangs once RAID1 resynced

I'm booting a fully patched 16.04 from an Intel Rapid Storage Technology enterprise RAID1 volume (ThinkServer TS140 with two SATA ST1000NM0033-9ZM drives, ext4 root partition, no LVM, UEFI mode).

If the RAID volume is recovering or resyncing for whatever reason, then `sudo systemctl reboot` and `sudo systemctl poweroff` work fine (I had to `sudo systemctl --now disable lvm2-lvmetad lvm2-lvmpolld lvm2-monitor` in order to get that consistently). However, once the recovery/resync has completed and the volume is clean, the reboot and poweroff commands above hang forever after "Reached target Shutdown.". Note that issuing `sudo swapoff -a` beforehand (as suggested in bug #1464917) does not help.
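
For reference, the two states can be told apart before issuing the reboot with nothing more than the standard md interfaces (the same ones quoted further below):

$ grep -A 2 '^md126' /proc/mdstat
$ cat /sys/block/md126/md/sync_action

The latter reads "resync" while the array is resyncing and "idle" once it is clean.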

If the server is then forcibly restarted with the power button, the Intel Matrix Storage Manager reports a "Normal" status for the RAID1 volume, but Ubuntu resyncs the volume anyway:

[ 1.223649] md: bind<sda>
[ 1.228426] md: bind<sdb>
[ 1.230030] md: bind<sdb>
[ 1.230738] md: bind<sda>
[ 1.232985] usbcore: registered new interface driver usbhid
[ 1.233494] usbhid: USB HID core driver
[ 1.234022] md: raid1 personality registered for level 1
[ 1.234876] md/raid1:md126: not clean -- starting background reconstruction
[ 1.234956] input: CHESEN USB Keyboard as /devices/pci0000:00/0000:00:14.0/usb3/3-10/3-10:1.0/0003:0A81:0101.0001/input/input5
[ 1.236273] md/raid1:md126: active with 2 out of 2 mirrors
[ 1.236797] md126: detected capacity change from 0 to 1000202043392
[ 1.246271] md: md126 switched to read-write mode.
[ 1.246834] md: resync of RAID array md126
[ 1.247325] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[ 1.247503] md126: p1 p2 p3 p4
[ 1.248269] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
[ 1.248774] md: using 128k window, over a total of 976759940k.
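
Presumably the IMSM metadata is still flagged dirty when the power button is used, which would explain the resync despite the "Normal" status shown by the Intel Matrix Storage Manager. This can be checked right after such a forced restart (the relevant fields also appear in the full `mdadm -E` output at the end of this report):

$ sudo mdadm -E /dev/sda | grep -E 'Migrate State|Dirty State'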

Note that the pain of this "resync upon every (re)boot" cannot even be slightly relieved with a write-intent bitmap, because mdadm does not support bitmaps for IMSM containers:

$ sudo mdadm --grow --bitmap=internal /dev/md126
mdadm: Cannot add bitmaps to sub-arrays yet

I also get the following in syslog during boot, when the individual drives are detected, but it seems to be harmless:

May 30 17:26:07 wssrv1 systemd-udevd[608]: Process '/sbin/mdadm --incremental /dev/sdb --offroot' failed with exit code 1.
May 30 17:26:07 wssrv1 systemd-udevd[608]: Process '/lib/udev/hdparm' failed with exit code 1.

May 30 17:26:07 wssrv1 systemd-udevd[606]: Process '/sbin/mdadm --incremental /dev/sda --offroot' failed with exit code 1.
May 30 17:26:07 wssrv1 systemd-udevd[606]: Process '/lib/udev/hdparm' failed with exit code 1.
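
(For what it's worth, the udev rule issuing this mdadm call can be located with something like the following, should it matter:)

$ grep -rn 'offroot' /lib/udev/rules.d/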

During a resync, `sudo sh -c 'echo idle > /sys/block/md126/md/sync_action'` does stop it as expected, but it restarts immediately even though nothing seems to have triggered it:

May 30 18:17:02 wssrv1 kernel: [ 3106.826710] md: md126: resync interrupted.
May 30 18:17:02 wssrv1 kernel: [ 3106.836320] md: checkpointing resync of md126.
May 30 18:17:02 wssrv1 kernel: [ 3106.836623] md: resync of RAID array md126
May 30 18:17:02 wssrv1 kernel: [ 3106.836625] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
May 30 18:17:02 wssrv1 kernel: [ 3106.836626] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
May 30 18:17:02 wssrv1 kernel: [ 3106.836627] md: using 128k window, over a total of 976759940k.
May 30 18:17:02 wssrv1 kernel: [ 3106.836628] md: resuming resync of md126 from checkpoint.
May 30 18:17:02 wssrv1 mdadm[982]: RebuildStarted event detected on md device /dev/md/Volume0
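
A crude way to see this, using only the interfaces already mentioned, is to poll sync_action right after writing "idle" to it:

$ sudo sh -c 'echo idle > /sys/block/md126/md/sync_action'
$ for i in $(seq 5); do cat /sys/block/md126/md/sync_action; sleep 1; done

Here it goes straight back to "resync", consistent with the log above.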

I attach screenshots of the hanging shutdown log, taken after a `sudo sh -c 'echo 8 > /proc/sys/kernel/printk'`. The second screenshot shows that the kernel has deadlocked in md_write_start(). Note that `sudo systemctl start debug-shell` is unusable on this machine at this point: Ctrl+Alt+F9 does switch to tty9, but the keyboard does not work there.

I have also tried much lower values for vm.dirty_background_ratio and vm.dirty_ratio, but to no avail.
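
(For illustration, "much lower" can be set like this; the numbers below are only examples, not the exact values I used:)

$ sudo sysctl vm.dirty_background_ratio=1 vm.dirty_ratio=2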

The mainline kernel 4.6.0-040600-generic (4.6.0-040600.201605151930_amd64) from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.6-yakkety/ did not help either.

More information below:

$ lsb_release -rd
Description: Ubuntu 16.04 LTS
Release: 16.04

$ uname -a
Linux wssrv1 4.4.0-22-generic #40-Ubuntu SMP Thu May 12 22:03:46 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

$ apt-cache policy systemd
systemd:
  Installed: 229-4ubuntu6
  Candidate: 229-4ubuntu6
  Version table:
 *** 229-4ubuntu6 500
        500 http://fr.archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     229-4ubuntu4 500
        500 http://fr.archive.ubuntu.com/ubuntu xenial/main amd64 Packages

$ cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md126 : active raid1 sda[1] sdb[0]
      976759808 blocks super external:/md127/0 [2/2] [UU]
      [>....................] resync = 3.3% (32651584/976759940) finish=85.9min speed=183164K/sec

md127 : inactive sdb[1](S) sda[0](S)
      5288 blocks super external:imsm

unused devices: <none>

$ sudo mdadm -D /dev/md127
/dev/md127:
        Version : imsm
     Raid Level : container
  Total Devices : 2

Working Devices : 2

           UUID : e9bb2216:cb1bbc0f:96943390:bb65943c
  Member Arrays : /dev/md/Volume0

    Number   Major   Minor   RaidDevice

       0       8        0        -        /dev/sda
       1       8       16        -        /dev/sdb

$ sudo mdadm -D /dev/md126
/dev/md126:
      Container : /dev/md/imsm0, member 0
     Raid Level : raid1
     Array Size : 976759808 (931.51 GiB 1000.20 GB)
  Used Dev Size : 976759940 (931.51 GiB 1000.20 GB)
   Raid Devices : 2
  Total Devices : 2

          State : clean, resyncing
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

  Resync Status : 5% complete

           UUID : 3d724b1d:ac75cddb:600ac81a:ccdc2090
    Number   Major   Minor   RaidDevice State
       1       8        0        0      active sync   /dev/sda
       0       8       16        1      active sync   /dev/sdb

$ sudo mdadm -E /dev/sda
/dev/sda:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.1.00
    Orig Family : 92b6e3e4
         Family : 92b6e3e4
     Generation : 00000075
     Attributes : All supported
           UUID : e9bb2216:cb1bbc0f:96943390:bb65943c
       Checksum : 5ad6e3c8 correct
    MPB Sectors : 2
          Disks : 2
   RAID Devices : 1

  Disk00 Serial : Z1W50P5E
          State : active
             Id : 00000000
    Usable Size : 1953519880 (931.51 GiB 1000.20 GB)

[Volume0]:
           UUID : 3d724b1d:ac75cddb:600ac81a:ccdc2090
     RAID Level : 1 <-- 1
        Members : 2 <-- 2
          Slots : [UU] <-- [UU]
    Failed disk : none
      This Slot : 0
     Array Size : 1953519616 (931.51 GiB 1000.20 GB)
   Per Dev Size : 1953519880 (931.51 GiB 1000.20 GB)
  Sector Offset : 0
    Num Stripes : 7630936
     Chunk Size : 64 KiB <-- 64 KiB
       Reserved : 0
  Migrate State : repair
      Map State : normal <-- normal
     Checkpoint : 201165 (512)
    Dirty State : dirty

  Disk01 Serial : Z1W519DN
          State : active
             Id : 00000001
    Usable Size : 1953519880 (931.51 GiB 1000.20 GB)

$ sudo mdadm -E /dev/sdb
/dev/sdb:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.1.00
    Orig Family : 92b6e3e4
         Family : 92b6e3e4
     Generation : 00000075
     Attributes : All supported
           UUID : e9bb2216:cb1bbc0f:96943390:bb65943c
       Checksum : 5ad6e3c8 correct
    MPB Sectors : 2
          Disks : 2
   RAID Devices : 1

  Disk01 Serial : Z1W519DN
          State : active
             Id : 00000001
    Usable Size : 1953519880 (931.51 GiB 1000.20 GB)

[Volume0]:
           UUID : 3d724b1d:ac75cddb:600ac81a:ccdc2090
     RAID Level : 1 <-- 1
        Members : 2 <-- 2
          Slots : [UU] <-- [UU]
    Failed disk : none
      This Slot : 1
     Array Size : 1953519616 (931.51 GiB 1000.20 GB)
   Per Dev Size : 1953519880 (931.51 GiB 1000.20 GB)
  Sector Offset : 0
    Num Stripes : 7630936
     Chunk Size : 64 KiB <-- 64 KiB
       Reserved : 0
  Migrate State : repair
      Map State : normal <-- normal
     Checkpoint : 201165 (512)
    Dirty State : dirty

  Disk00 Serial : Z1W50P5E
          State : active
             Id : 00000000
    Usable Size : 1953519880 (931.51 GiB 1000.20 GB)