Comment 3 for bug 1750519

Ryan Harper (raharper) wrote:

The logs indicate that we could not stop the RAID array after numerous retries:

Feb 20 08:41:24 infra1 cloud-init[2863]: Running command ['mdadm', '--detail', '--scan', '-v'] with allowed return codes [0, 1] (capture=True)
Feb 20 08:41:24 infra1 cloud-init[2863]: mdadm detail scan after assemble:
Feb 20 08:41:24 infra1 cloud-init[2863]: ARRAY /dev/md/infra1:0 level=raid5 num-devices=3 metadata=1.2 spares=1 name=infra1:0 UUID=9ab31ec9:f764d630:e27e2ca5:6a083092
Feb 20 08:41:24 infra1 cloud-init[2863]: devices=/dev/sdb,/dev/sdc,/dev/sdd

That's the RAID5 array.

Here's the storage tree layout:

Feb 20 08:41:26 infra1 cloud-init[2863]: Current device storage tree:
Feb 20 08:41:26 infra1 cloud-init[2863]: sda
Feb 20 08:41:26 infra1 cloud-init[2863]: sdb
Feb 20 08:41:26 infra1 cloud-init[2863]: `-- md127
Feb 20 08:41:26 infra1 cloud-init[2863]: sdc
Feb 20 08:41:26 infra1 cloud-init[2863]: `-- md127
Feb 20 08:41:26 infra1 cloud-init[2863]: sdd
Feb 20 08:41:26 infra1 cloud-init[2863]: `-- md127
Feb 20 08:41:26 infra1 cloud-init[2863]: sde
Feb 20 08:41:26 infra1 cloud-init[2863]: sdf
Feb 20 08:41:26 infra1 cloud-init[2863]: shutdown running on holder type: 'raid' syspath: '/sys/class/block/md127'

And curtin tries up to 60 times to stop the RAID:

Feb 20 08:41:26 infra1 cloud-init[2863]: mdadm: stop on /dev/md127 attempt 0
Feb 20 08:41:26 infra1 cloud-init[2863]: /sys/class/block/md127/md/sync_action/sync_max = idle
Feb 20 08:41:26 infra1 cloud-init[2863]: /sys/class/block/md127/md/sync_max/sync_max = max
Feb 20 08:41:26 infra1 cloud-init[2863]: mdadm: setting array sync_{min,max}=0
Feb 20 08:41:26 infra1 kernel: [ 86.124825] bcache: register_bcache() error opening /dev/md127: device already registered
Feb 20 08:41:26 infra1 cloud-init[2863]: Running command ['mdadm', '--manage', '--stop', '/dev/md127'] with allowed return codes [0] (capture=True)
Feb 20 08:41:26 infra1 cloud-init[2863]: mdadm stop failed, retrying
Feb 20 08:41:26 infra1 cloud-init[2863]: /proc/mdstat:
Feb 20 08:41:26 infra1 cloud-init[2863]: Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
Feb 20 08:41:26 infra1 cloud-init[2863]: md127 : active (auto-read-only) raid5 sdd[3] sdc[1] sdb[0]
Feb 20 08:41:26 infra1 cloud-init[2863]: 1953262592 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
Feb 20 08:41:26 infra1 cloud-init[2863]: bitmap: 0/8 pages [0KB], 65536KB chunk
Feb 20 08:41:26 infra1 cloud-init[2863]: unused devices: <none>
Feb 20 08:41:26 infra1 cloud-init[2863]: mdadm: stop failed, retrying in 0.2 seconds

...

Feb 20 08:41:40 infra1 cloud-init[2863]: mdadm: stop on /dev/md127 attempt 59
Feb 20 08:41:40 infra1 cloud-init[2863]: /sys/class/block/md127/md/sync_action/sync_max = idle
Feb 20 08:41:40 infra1 cloud-init[2863]: /sys/class/block/md127/md/sync_max/sync_max = 0
Feb 20 08:41:40 infra1 cloud-init[2863]: Running command ['mdadm', '--manage', '--stop', '/dev/md127'] with allowed return codes [0] (capture=True)
Feb 20 08:41:40 infra1 cloud-init[2863]: mdadm stop failed, retrying
Feb 20 08:41:40 infra1 cloud-init[2863]: /proc/mdstat:
Feb 20 08:41:40 infra1 cloud-init[2863]: Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
Feb 20 08:41:40 infra1 cloud-init[2863]: md127 : active (auto-read-only) raid5 sdd[3] sdc[1] sdb[0]
Feb 20 08:41:40 infra1 cloud-init[2863]: 1953262592 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
Feb 20 08:41:40 infra1 cloud-init[2863]: bitmap: 0/8 pages [0KB], 65536KB chunk
Feb 20 08:41:40 infra1 cloud-init[2863]: unused devices: <none>
Feb 20 08:41:40 infra1 cloud-init[2863]: mdadm: stop failed, retrying in 0.2 seconds
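
For reference, the stop-with-retries flow these log lines come from looks roughly like the sketch below. This is a minimal illustration, not curtin's actual code; the function name and the sysfs handling are assumptions based on what the log shows (quiesce the sync thread via sync_min/sync_max, run mdadm --manage --stop, sleep 0.2 seconds, give up after 60 attempts).

import subprocess
import time

def stop_md_device(devpath='/dev/md127', retries=60, wait=0.2):
    sysfs_md = '/sys/class/block/%s/md' % devpath.split('/')[-1]
    for attempt in range(retries):
        # Quiesce any resync/rebuild so the array can be stopped.
        for knob in ('sync_min', 'sync_max'):
            try:
                with open('%s/%s' % (sysfs_md, knob), 'w') as fh:
                    fh.write('0')
            except (IOError, OSError):
                pass  # the knob may reject the write; keep retrying
        try:
            # This is the command the log shows failing repeatedly.
            subprocess.check_call(['mdadm', '--manage', '--stop', devpath])
            return True
        except subprocess.CalledProcessError:
            time.sleep(wait)  # "retrying in 0.2 seconds"
    return False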

The interesting bit is the bcache error; it appears to be related to the array, but I don't know how at this time.
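
One way to check whether the kernel has already registered the md device with bcache is to look for the 'bcache' directory that sysfs creates under a registered backing device. A quick sketch (this uses the standard bcache sysfs interface, not anything from curtin):

import os

def is_registered_with_bcache(devname='md127'):
    # The kernel creates /sys/class/block/<dev>/bcache once the
    # device has been registered as a bcache backing device.
    return os.path.isdir('/sys/class/block/%s/bcache' % devname)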

Is this reproducible with wiped disks, or only on redeploy? You filed https://bugs.launchpad.net/curtin/+bug/1750580, and I've not yet figured out whether bcache supports using RAID devices. If bcache over RAID is not supported, then this is fallout from that lack of support.
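
If you can reproduce this, it would also help to see what still holds the array open when the stop fails. A sketch using the standard sysfs holders/ directory (illustrative, not curtin code):

import os

def holders(devname='md127'):
    # Anything stacked on top of this device (e.g. a bcache device)
    # shows up under holders/ and makes 'mdadm --stop' fail with EBUSY.
    path = '/sys/class/block/%s/holders' % devname
    return os.listdir(path) if os.path.isdir(path) else []

print(holders('md127'))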