An error occured handling 'bcache0': ValueError - Invalid number 0 of holding devices: "[]"

Bug #1750519 reported by Ante Karamatić
Affects: curtin
Status: Fix Released
Importance: Medium
Assigned to: Unassigned

Bug Description

I had a machine fail to deploy for a yet-unknown reason. I tried deploying it again and it failed rather quickly. The logs in MAAS show:

curtin: Installation started. (17.1-11-ga4c9636b-0ubuntu1~16.04.1)
third party drivers not installed or necessary.
Invalid number 0 of holding devices: "[]"
An error occured handling 'bcache0': ValueError - Invalid number 0 of holding devices: "[]"
Invalid number 0 of holding devices: "[]"
curtin: Installation failed with exception: Unexpected error while running command.
Command: ['curtin', 'block-meta', 'custom']
Exit code: 3
Reason: -
Stdout: Invalid number 0 of holding devices: "[]"
        An error occured handling 'bcache0': ValueError - Invalid number 0 of holding devices: "[]"
        Invalid number 0 of holding devices: "[]"

Stderr: ''

I've attached the curtin data. I'll see if I can come up with more verbose logs.

Tags: cpe-onsite

Revision history for this message
Ante Karamatić (ivoks) wrote :

Boot log, including verbose curtin data. Note that this is from a different machine, but with exactly the same disk configuration.

Revision history for this message
Ryan Harper (raharper) wrote :

The logs indicate that we could not stop the raid array after numerous retries.

Feb 20 08:41:24 infra1 cloud-init[2863]: Running command ['mdadm', '--detail', '--scan', '-v'] with allowed return codes [0, 1] (capture=True)
Feb 20 08:41:24 infra1 cloud-init[2863]: mdadm detail scan after assemble:
Feb 20 08:41:24 infra1 cloud-init[2863]: ARRAY /dev/md/infra1:0 level=raid5 num-devices=3 metadata=1.2 spares=1 name=infra1:0 UUID=9ab31ec9:f764d630:e27e2ca5:6a083092
Feb 20 08:41:24 infra1 cloud-init[2863]: devices=/dev/sdb,/dev/sdc,/dev/sdd

That's the RAID5 array

Here's the storage tree layout:

Feb 20 08:41:26 infra1 cloud-init[2863]: Current device storage tree:
Feb 20 08:41:26 infra1 cloud-init[2863]: sda
Feb 20 08:41:26 infra1 cloud-init[2863]: sdb
Feb 20 08:41:26 infra1 cloud-init[2863]: `-- md127
Feb 20 08:41:26 infra1 cloud-init[2863]: sdc
Feb 20 08:41:26 infra1 cloud-init[2863]: `-- md127
Feb 20 08:41:26 infra1 cloud-init[2863]: sdd
Feb 20 08:41:26 infra1 cloud-init[2863]: `-- md127
Feb 20 08:41:26 infra1 cloud-init[2863]: sde
Feb 20 08:41:26 infra1 cloud-init[2863]: sdf
Feb 20 08:41:26 infra1 cloud-init[2863]: shutdown running on holder type: 'raid' syspath: '/sys/class/block/md127'

And curtin tries up to 60 times to stop the raid:

Feb 20 08:41:26 infra1 cloud-init[2863]: mdadm: stop on /dev/md127 attempt 0
Feb 20 08:41:26 infra1 cloud-init[2863]: /sys/class/block/md127/md/sync_action/sync_max = idle
Feb 20 08:41:26 infra1 cloud-init[2863]: /sys/class/block/md127/md/sync_max/sync_max = max
Feb 20 08:41:26 infra1 cloud-init[2863]: mdadm: setting array sync_{min,max}=0
Feb 20 08:41:26 infra1 kernel: [ 86.124825] bcache: register_bcache() error opening /dev/md127: device already registered
Feb 20 08:41:26 infra1 cloud-init[2863]: Running command ['mdadm', '--manage', '--stop', '/dev/md127'] with allowed return codes [0] (capture=True)
Feb 20 08:41:26 infra1 cloud-init[2863]: mdadm stop failed, retrying
Feb 20 08:41:26 infra1 cloud-init[2863]: /proc/mdstat:
Feb 20 08:41:26 infra1 cloud-init[2863]: Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
Feb 20 08:41:26 infra1 cloud-init[2863]: md127 : active (auto-read-only) raid5 sdd[3] sdc[1] sdb[0]
Feb 20 08:41:26 infra1 cloud-init[2863]: 1953262592 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
Feb 20 08:41:26 infra1 cloud-init[2863]: bitmap: 0/8 pages [0KB], 65536KB chunk
Feb 20 08:41:26 infra1 cloud-init[2863]: unused devices: <none>
Feb 20 08:41:26 infra1 cloud-init[2863]: mdadm: stop failed, retrying in 0.2 seconds

...

Feb 20 08:41:40 infra1 cloud-init[2863]: mdadm: stop on /dev/md127 attempt 59
Feb 20 08:41:40 infra1 cloud-init[2863]: /sys/class/block/md127/md/sync_action/sync_max = idle
Feb 20 08:41:40 infra1 cloud-init[2863]: /sys/class/block/md127/md/sync_max/sync_max = 0
Feb 20 08:41:40 infra1 cloud-init[2863]: Running command ['mdadm', '--manage', '--stop', '/dev/md127'] with allowed return codes [0] (capture=True)
Feb 20 08:41:40 infra1 cloud-init[2863]: mdadm stop failed, retrying
Feb 20 08:41:40 infra1 cloud-init[2863]: /proc/mdstat:
Feb 20 08:41:40 infra1 cloud-init[2863]: Personalities : [raid6] [raid5] [rai...
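
[Editorial note] The retry behaviour in this excerpt amounts to a loop along the lines of the following minimal sketch; the attempt count and delay are taken from the log above, while the helper itself is an assumption, not curtin's actual code.

import subprocess
import time

def mdadm_stop(md_dev="/dev/md127", retries=60, delay=0.2):
    # Try to stop the array, retrying as long as mdadm reports it is busy.
    for attempt in range(retries):
        try:
            subprocess.check_call(["mdadm", "--manage", "--stop", md_dev])
            return
        except subprocess.CalledProcessError:
            # In the log above the array stayed busy (held by the stale
            # bcache binding), so every attempt failed until retries ran out.
            time.sleep(delay)
    raise RuntimeError("failed to stop %s after %d attempts" % (md_dev, retries))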

Changed in curtin:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Ryan Harper (raharper) wrote :

Note that the logs you pasted are not in the attached logs; they appear to be different issues. Namely, the description:

Stdout: Invalid number 0 of holding devices: "[]"
        An error occured handling 'bcache0': ValueError - Invalid number 0 of holding devices: "[]"

indicates a make-bcache failure: the kernel did not bind a backing device to a bcacheN device. This is also likely related to bcache over RAID5.
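
[Editorial note] For illustration, a check like the one behind that ValueError boils down to reading the backing device's holders directory in sysfs; the sketch below rests on that assumption, and the helper names are made up rather than curtin's actual code.

import os

def get_holders(kname):
    # e.g. kname='md0'; each entry here is a device currently holding md0
    # open, such as 'bcache0' once the kernel has bound the backing device.
    holders_dir = "/sys/class/block/%s/holders" % kname
    return os.listdir(holders_dir) if os.path.isdir(holders_dir) else []

def find_bcache_holder(backing_kname):
    holders = get_holders(backing_kname)
    if len(holders) == 0:
        # The situation in this report: make-bcache appeared to run, but the
        # kernel never attached a bcacheN device to the backing device.
        raise ValueError('Invalid number %s of holding devices: "%s"'
                         % (len(holders), holders))
    return holders[0]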

Revision history for this message
Ante Karamatić (ivoks) wrote :

I've deployed three machines with exactly the same config. One deployed, two didn't. When I removed the bcache from those other two, they still failed to deploy (saying that the md0 device is in use). I've wiped the disks now, and that will tell me whether it works on clean machines.

However, the expectation is that machines can be reused with MAAS, and therefore curtin should clean up the disks before it assembles any bcaches or RAIDs, IMHO.

Revision history for this message
Ryan Harper (raharper) wrote :

Here's what was happening. After one deployment, the RAID array had a bcache superblock on it. When we clear a RAID array, we wipe the RAID metadata, but not any of the data on the assembled device. So when the same disks were re-assembled into md0 again, it immediately looked like a bcache device, but it was inactive; it was not bound to the cache device.

There are two issues to fix:

1) When we attempt to validate that a bcache device was created, we need to check that it is actively running (/sys/class/block/<backing dev kname>/bcache/running == 1); the mere presence of a bcache sysfs dir is not sufficient. If it is not running, we should run make-bcache -B again, which would create a new superblock and activate it in the kernel.

2) In clear-holders, for layered devices (raid, bcache, lvm, crypt) we should wipe the superblock of the composed device (/dev/md0, /dev/lvoot, etc.) before we decompose it, so that if we put the same devices back together we don't "accidentally" restore old data.
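
[Editorial note] A minimal sketch of both fixes described above, assuming the sysfs path named in 1); the helper names (ensure_bcache_backing, wipe_superblock) are illustrative, not curtin's actual API.

import os
import subprocess

def bcache_is_running(backing_kname):
    # Fix 1: the presence of .../bcache alone is not enough; check 'running'.
    path = "/sys/class/block/%s/bcache/running" % backing_kname
    try:
        with open(path) as fp:
            return fp.read().strip() == "1"
    except IOError:
        return False

def ensure_bcache_backing(backing_dev):
    # If a stale superblock left the device registered but not running,
    # re-run make-bcache -B to write a fresh superblock and activate it.
    if not bcache_is_running(os.path.basename(backing_dev)):
        subprocess.check_call(["make-bcache", "-B", backing_dev])

def wipe_superblock(composed_dev, length=1024 * 1024):
    # Fix 2: zero the start of the composed device (e.g. /dev/md0) before
    # decomposing it, so reassembling the same members later cannot
    # accidentally resurrect old bcache/LVM/filesystem metadata.
    with open(composed_dev, "r+b") as fp:
        fp.write(b"\0" * length)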

Revision history for this message
Ryan Harper (raharper) wrote :

I was able to recreate this scenario by enabling a bcache over a RAID5 array and enabling curtin's dirty-disk mode in our test harness, which deploys the storage configuration twice.
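
[Editorial note] As a rough illustration of such a layout, a bcache-over-RAID5 setup in the style of curtin's custom storage config might look like the sketch below; the device ids mirror the logs above, but the entries are an approximation, not the attached configuration or the actual test-harness config.

import yaml  # PyYAML

# Approximate bcache-over-RAID5 layout; ids and keys are illustrative only.
storage_config = {
    "storage": {
        "version": 1,
        "config": [
            {"id": "sdb", "type": "disk", "path": "/dev/sdb", "wipe": "superblock"},
            {"id": "sdc", "type": "disk", "path": "/dev/sdc", "wipe": "superblock"},
            {"id": "sdd", "type": "disk", "path": "/dev/sdd", "wipe": "superblock"},
            {"id": "sde", "type": "disk", "path": "/dev/sde", "wipe": "superblock"},
            {"id": "md0", "type": "raid", "name": "md0", "raidlevel": 5,
             "devices": ["sdb", "sdc", "sdd"]},
            {"id": "bcache0", "type": "bcache", "name": "bcache0",
             "backing_device": "md0", "cache_device": "sde"},
        ],
    }
}
print(yaml.dump(storage_config, default_flow_style=False))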

Changed in curtin:
status: Triaged → Confirmed
Ryan Harper (raharper)
Changed in curtin:
status: Confirmed → Fix Committed
Revision history for this message
Ante Karamatić (ivoks) wrote :

With the patch, I'm still experiencing the problem described in comment #3, but I've seen that only once. I'm also experiencing this:

        finish: cmd-install/stage-partitioning/builtin/cmd-block-meta: SUCCESS: configuring raid: md0
        start: cmd-install/stage-partitioning/builtin/cmd-block-meta: configuring bcache: bcache0
        get_path_to_storage_volume for volume md0
        devsync for /dev/md0
        Running command ['partprobe', '/dev/md0'] with allowed return codes [0, 1] (capture=False)
        Running command ['udevadm', 'settle'] with allowed return codes [0] (capture=False)
        devsync happy - path /dev/md0 now exists
        return volume path /dev/md0
        get_path_to_storage_volume for volume sde
        Processing serial BTHC63800AX9480MGN via udev to BTHC63800AX9480MGN
        devsync for /dev/sde
        Running command ['partprobe', '/dev/sde'] with allowed return codes [0, 1] (capture=False)
        Running command ['udevadm', 'settle'] with allowed return codes [0] (capture=False)
        devsync happy - path /dev/sde now exists
        return volume path /dev/sde
        caching device does not yet exist at /sys/class/block/sde/bcache. Make cache and get uuid
        Running command ['make-bcache', '-C', '/dev/sde'] with allowed return codes [0] (capture=True)
        out=[UUID: 80409bc0-2339-47f5-97b3-1d27ddc3fdec
        Set UUID: b8dc8e2e-1b85-4761-9696-b23cb0cb69f0
        version: 0
        nbuckets: 915725
        block_size: 1
        bucket_size: 1024
        nr_in_set: 1
        nr_this_dev: 0
        first_bucket: 1
        ]
        check just created bcache /dev/sde if it is registered, try=1
        Running command ['udevadm', 'settle'] with allowed return codes [0] (capture=False)
        Found bcache dev /dev/sde at expected path /sys/fs/bcache/b8dc8e2e-1b85-4761-9696-b23cb0cb69f0
        validating bcache caching device '/dev/sde' from sys_path '/sys/fs/bcache/b8dc8e2e-1b85-4761-9696-b23cb0cb69f0'
        Resolving symlink /sys/fs/bcache/b8dc8e2e-1b85-4761-9696-b23cb0cb69f0/cache0 -> ../../../devices/pci0000:00/0000:00:02.0/0000:03:00.0/host0/port-0:0/expander-0:0/port-0:0:4/end_device-0:0:4/target0:0:4/0:0:4:0/block/sde/bcache
        Found match: bcache_device=/dev/sde target_device=sde
        bcache dev /dev/sde at path /sys/fs/bcache/b8dc8e2e-1b85-4761-9696-b23cb0cb69f0 successfully registered on attempt 1/60
        An error occured handling 'bcache0': RuntimeError - ('Unexpected old bcache device: %s', '/dev/md0')
        finish: cmd-install/stage-partitioning/builtin/cmd-block-meta: FAIL: configuring bcache: bcache0
        finish: cmd-install/stage-partitioning/builtin/cmd-block-meta: FAIL: curtin command block-meta
        Traceback (most recent call last):
          File "/curtin/curtin/commands/main.py", line 215, in main
            ret = args.func(args)
          File "/curtin/curtin/commands/block_meta.py", line 67, in block_meta
            meta_custom(args)
          File "/curtin/curtin/commands/block_meta.py", line 1243, in meta_custom
            handl...

Revision history for this message
Ante Karamatić (ivoks) wrote :

Another occurrence:

Mar 21 12:14:34 infra3 cloud-init[2891]: mdadm: stop failed, retrying in 0.2 seconds
Mar 21 12:14:34 infra3 cloud-init[2891]: mdadm: stop on /dev/md127 attempt 6
Mar 21 12:14:34 infra3 cloud-init[2891]: /sys/class/block/md127/md/sync_action/sync_max = recover
Mar 21 12:14:34 infra3 cloud-init[2891]: mdadm: setting array sync_action=idle
Mar 21 12:14:34 infra3 cloud-init[2891]: /sys/class/block/md127/md/sync_max/sync_max = 0
Mar 21 12:14:34 infra3 cloud-init[2891]: Running command ['mdadm', '--manage', '--stop', '/dev/md127'] with allowed return codes [0] (capture=True)
Mar 21 12:14:34 infra3 cloud-init[2891]: mdadm stop failed, retrying
Mar 21 12:14:34 infra3 cloud-init[2891]: /proc/mdstat:
Mar 21 12:14:34 infra3 cloud-init[2891]: Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
Mar 21 12:14:34 infra3 cloud-init[2891]: md127 : active raid5 sdd[3] sdc[1] sdb[0]
Mar 21 12:14:34 infra3 cloud-init[2891]: 1953262592 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
Mar 21 12:14:34 infra3 cloud-init[2891]: [>....................] recovery = 0.1% (1029720/976631296) finish=1016251.6min speed=0K/sec
Mar 21 12:14:34 infra3 cloud-init[2891]: bitmap: 2/8 pages [8KB], 65536KB chunk
Mar 21 12:14:34 infra3 cloud-init[2891]:
Mar 21 12:14:34 infra3 cloud-init[2891]: unused devices: <none>
Mar 21 12:14:34 infra3 cloud-init[2891]:
Mar 21 12:14:34 infra3 cloud-init[2891]: mdadm: stop failed, retrying in 0.2 seconds
Mar 21 12:14:34 infra3 cloud-init[2891]: mdadm: stop on /dev/md127 attempt 7
Mar 21 12:14:34 infra3 cloud-init[2891]: /sys/class/block/md127/md/sync_action/sync_max = recover
Mar 21 12:14:34 infra3 cloud-init[2891]: mdadm: setting array sync_action=idle
Mar 21 12:14:34 infra3 cloud-init[2891]: /sys/class/block/md127/md/sync_max/sync_max = 0
Mar 21 12:14:34 infra3 cloud-init[2891]: Running command ['mdadm', '--manage', '--stop', '/dev/md127'] with allowed return codes [0] (capture=True)
Mar 21 12:14:34 infra3 cloud-init[2891]: mdadm stop failed, retrying
Mar 21 12:14:34 infra3 cloud-init[2891]: /proc/mdstat:
Mar 21 12:14:34 infra3 cloud-init[2891]: Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
Mar 21 12:14:34 infra3 cloud-init[2891]: md127 : active raid5 sdd[3] sdc[1] sdb[0]
Mar 21 12:14:34 infra3 cloud-init[2891]: 1953262592 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
Mar 21 12:14:34 infra3 cloud-init[2891]: [>....................] recovery = 0.1% (1029720/976631296) finish=1016251.6min speed=0K/sec
Mar 21 12:14:34 infra3 cloud-init[2891]: bitmap: 2/8 pages [8KB], 65536KB chunk

Revision history for this message
Ryan Harper (raharper) wrote : Re: [Bug 1750519] Re: An error occured handling 'bcache0': ValueError - Invalid number 0 of holding devices: "[]"

Please attach the entire curtin install.log in debug mode; it's critical to see what the storage configuration was and what curtin cleared from the disks.

Revision history for this message
Scott Moser (smoser) wrote : Fixed in Cloud-init 18.1

This bug is believed to be fixed in curtin 18.1. If this is still a problem for you, please make a comment and set the state back to New.

Thank you.

Changed in curtin:
status: Fix Committed → Fix Released