An error occured handling 'bcache0': ValueError - Invalid number 0 of holding devices: "[]"
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| curtin | Fix Released | Medium | Unassigned | |
Bug Description
I had a machine fail to deploy for a yet-unknown reason. I've tried deploying it again and that failed rather quickly. Logs in MAAS show:
curtin: Installation started. (17.1-11-
third party drivers not installed or necessary.
Invalid number 0 of holding devices: "[]"
An error occured handling 'bcache0': ValueError - Invalid number 0 of holding devices: "[]"
Invalid number 0 of holding devices: "[]"
curtin: Installation failed with exception: Unexpected error while running command.
Command: ['curtin', 'block-meta', 'custom']
Exit code: 3
Reason: -
Stdout: Invalid number 0 of holding devices: "[]"
An error occured handling 'bcache0': ValueError - Invalid number 0 of holding devices: "[]"
Invalid number 0 of holding devices: "[]"
Stderr: ''
I've attached curtin data. I'll see if I can come up with more verbose logs.
Related branches
- Scott Moser: Approve on 2018-03-01
- Server Team CI bot: Approve (continuous-integration) on 2018-03-01
Diff: 381 lines (+120/-58), 7 files modified:
curtin/block/__init__.py (+11/-9)
curtin/block/clear_holders.py (+62/-28)
curtin/commands/block_meta.py (+9/-4)
examples/tests/mdadm_bcache.yaml (+22/-5)
tests/unittests/test_block.py (+10/-6)
tests/unittests/test_clear_holders.py (+4/-4)
tests/vmtests/test_mdadm_bcache.py (+2/-2)
Ante Karamatić (ivoks) wrote (#1):
Ante Karamatić (ivoks) wrote (#2):
Ryan Harper (raharper) wrote (#3):
The logs indicate that we could not stop the raid array after numerous retries.
Feb 20 08:41:24 infra1 cloud-init[2863]: Running command ['mdadm', '--detail', '--scan', '-v'] with allowed return codes [0, 1] (capture=True)
Feb 20 08:41:24 infra1 cloud-init[2863]: mdadm detail scan after assemble:
Feb 20 08:41:24 infra1 cloud-init[2863]: ARRAY /dev/md/infra1:0 level=raid5 num-devices=3 metadata=1.2 spares=1 name=infra1:0 UUID=9ab31ec9:
Feb 20 08:41:24 infra1 cloud-init[2863]: devices=
That's the RAID5 array.
Here's the storage tree layout:
Feb 20 08:41:26 infra1 cloud-init[2863]: Current device storage tree:
Feb 20 08:41:26 infra1 cloud-init[2863]: sda
Feb 20 08:41:26 infra1 cloud-init[2863]: sdb
Feb 20 08:41:26 infra1 cloud-init[2863]: `-- md127
Feb 20 08:41:26 infra1 cloud-init[2863]: sdc
Feb 20 08:41:26 infra1 cloud-init[2863]: `-- md127
Feb 20 08:41:26 infra1 cloud-init[2863]: sdd
Feb 20 08:41:26 infra1 cloud-init[2863]: `-- md127
Feb 20 08:41:26 infra1 cloud-init[2863]: sde
Feb 20 08:41:26 infra1 cloud-init[2863]: sdf
Feb 20 08:41:26 infra1 cloud-init[2863]: shutdown running on holder type: 'raid' syspath: '/sys/class/
And curtin tries up to 60 times to stop the raid:
Feb 20 08:41:26 infra1 cloud-init[2863]: mdadm: stop on /dev/md127 attempt 0
Feb 20 08:41:26 infra1 cloud-init[2863]: /sys/class/
Feb 20 08:41:26 infra1 cloud-init[2863]: /sys/class/
Feb 20 08:41:26 infra1 cloud-init[2863]: mdadm: setting array sync_{min,max}=0
Feb 20 08:41:26 infra1 kernel: [ 86.124825] bcache: register_bcache() error opening /dev/md127: device already registered
Feb 20 08:41:26 infra1 cloud-init[2863]: Running command ['mdadm', '--manage', '--stop', '/dev/md127'] with allowed return codes [0] (capture=True)
Feb 20 08:41:26 infra1 cloud-init[2863]: mdadm stop failed, retrying
Feb 20 08:41:26 infra1 cloud-init[2863]: /proc/mdstat:
Feb 20 08:41:26 infra1 cloud-init[2863]: Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
Feb 20 08:41:26 infra1 cloud-init[2863]: md127 : active (auto-read-only) raid5 sdd[3] sdc[1] sdb[0]
Feb 20 08:41:26 infra1 cloud-init[2863]: 1953262592 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
Feb 20 08:41:26 infra1 cloud-init[2863]: bitmap: 0/8 pages [0KB], 65536KB chunk
Feb 20 08:41:26 infra1 cloud-init[2863]: unused devices: <none>
Feb 20 08:41:26 infra1 cloud-init[2863]: mdadm: stop failed, retrying in 0.2 seconds
...
Feb 20 08:41:40 infra1 cloud-init[2863]: mdadm: stop on /dev/md127 attempt 59
Feb 20 08:41:40 infra1 cloud-init[2863]: /sys/class/
Feb 20 08:41:40 infra1 cloud-init[2863]: /sys/class/
Feb 20 08:41:40 infra1 cloud-init[2863]: Running command ['mdadm', '--manage', '--stop', '/dev/md127'] with allowed return codes [0] (capture=True)
Feb 20 08:41:40 infra1 cloud-init[2863]: mdadm stop failed, retrying
Feb 20 08:41:40 infra1 cloud-init[2863]: /proc/mdstat:
Feb 20 08:41:40 infra1 cloud-init[2863]: Personalities : [raid6] [raid5] [rai...
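For reference, the stop logic seen above amounts to a bounded retry loop around `mdadm --manage --stop`. A minimal sketch of that pattern (the function name, retry count, and delay are illustrative assumptions, not curtin's exact code):

```python
import subprocess
import time


def stop_mdadm_array(device="/dev/md127", retries=60, delay=0.2):
    """Attempt to stop an mdadm array, retrying while something in the
    kernel (here, the stale bcache registration) still holds it open."""
    for attempt in range(retries):
        proc = subprocess.run(["mdadm", "--manage", "--stop", device],
                              capture_output=True, text=True)
        if proc.returncode == 0:
            return True
        print("mdadm: stop on %s attempt %d failed, retrying in %.1f seconds"
              % (device, attempt, delay))
        time.sleep(delay)
    return False
```

In the log above, all 60 attempts fail because, per the kernel message at 08:41:26, bcache still has /dev/md127 registered.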
Changed in curtin:
importance: Undecided → Medium
status: New → Triaged
Ryan Harper (raharper) wrote (#4):
Note that the logs you pasted were not in the attached logs; they appear to be different issues. Namely, the error from the description:
Stdout: Invalid number 0 of holding devices: "[]"
An error occured handling 'bcache0': ValueError - Invalid number 0 of holding devices: "[]"
indicates a make-bcache failure: the kernel did not bind the backing device to a bcacheN device. This is also likely related to bcache over RAID5.
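Roughly, the ValueError comes from a sanity check on the sysfs holders of the bcache backing device: after `make-bcache` runs, the kernel should expose exactly one bcacheN holder. A rough sketch of the kind of check that produces this error (helper names are hypothetical; only the /sys/class/block/<dev>/holders layout is standard kernel sysfs):

```python
import os


def get_holders(device_name):
    """Return the kernel holders of a block device, e.g. ['bcache0']
    for md0 once the bcache backing device has been bound."""
    return os.listdir("/sys/class/block/%s/holders" % device_name)


def find_bcache_device(backing="md0"):
    holders = get_holders(backing)
    if len(holders) != 1:
        # The failure in this bug: make-bcache was run, but the kernel
        # never attached a bcacheN holder, so the list is empty.
        raise ValueError('Invalid number %s of holding devices: "%s"'
                         % (len(holders), holders))
    return holders[0]
```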
Ante Karamatić (ivoks) wrote (#5):
I've deployed three machines with exactly the same config. One deployed, two didn't. When I removed the bcache from those other two, they still failed to deploy (saying that the md0 device is in use). I've now wiped the disks, which will tell me whether it works on clean machines.
However, the expectation is that machines can be reused with MAAS, so curtin should clean up the disks before it assembles any bcache or RAID devices, IMHO.
Ryan Harper (raharper) wrote (#6):
Here's what was happening. After one deployment, the RAID array had a bcache superblock on it. When we clear a RAID array, we wipe the RAID metadata, but not any of the data on the assembled devices. So when the same disks were re-assembled into md0 again, it immediately looked like a bcache device, but it was inactive; not bound to the cache device.
There are two issues to fix (the second is sketched below):
1) When we attempt to validate that a bcache device was created, we need to check that it is actively running (/sys/class/
2) In clear-holders, for layered devices (raid, bcache, lvm, crypt) we should wipe the superblock of the composed device (/dev/md0, /dev/lvoot, etc.) before we decompose it, so that if we put the same devices back together we don't "accidentally" restore stale data.
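As an illustration of point 2, clearing stale signatures only requires overwriting the start of the composed device (the bcache superblock sits near the beginning) before it is stopped and decomposed. A hedged sketch, not the actual clear_holders change:

```python
import os


def wipe_superblock(devpath, length=1024 * 1024):
    """Zero the first `length` bytes of a composed device such as
    /dev/md0 so that stale bcache/LVM/filesystem signatures cannot
    reappear if the same member disks are reassembled later."""
    with open(devpath, "rb+") as fp:
        fp.write(b"\0" * length)
        fp.flush()
        os.fsync(fp.fileno())
```

From a shell, `wipefs -a /dev/md0` before stopping the array achieves a similar effect.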
Ryan Harper (raharper) wrote (#7):
I was able to recreate this scenario by enabling a bcache over a RAID5 array and enabling curtin's dirty-disk mode in our test harness, which deploys the storage configuration twice.
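For context, the reproduced layout is bcache layered over an mdadm RAID5. The shape of such a curtin storage configuration, expressed here as a Python dict (the ids, names, and which disks are used are assumptions inferred from the logs, not the attached config):

```python
storage_config = {
    "storage": {
        "version": 1,
        "config": [
            {"id": "sdb", "type": "disk", "path": "/dev/sdb"},
            {"id": "sdc", "type": "disk", "path": "/dev/sdc"},
            {"id": "sdd", "type": "disk", "path": "/dev/sdd"},
            {"id": "sde", "type": "disk", "path": "/dev/sde"},
            # RAID5 array assembled from three of the disks
            {"id": "md0", "type": "raid", "name": "md0",
             "raidlevel": 5, "devices": ["sdb", "sdc", "sdd"]},
            # bcache: md0 as the backing device, sde as the caching device
            {"id": "bcache0", "type": "bcache", "name": "bcache0",
             "backing_device": "md0", "cache_device": "sde"},
        ],
    }
}
```

Deploying this layout twice without wiping md0's contents is what left the stale bcache superblock behind.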
Changed in curtin:
status: Triaged → Confirmed
Changed in curtin:
status: Confirmed → Fix Committed
Ante Karamatić (ivoks) wrote (#8):
With the patch, I'm still experiencing the problem described in comment #3, but I've seen that only once. I'm also experiencing this:
finish: cmd-install/
start: cmd-install/
devsync for /dev/md0
Running command ['partprobe', '/dev/md0'] with allowed return codes [0, 1] (capture=False)
Running command ['udevadm', 'settle'] with allowed return codes [0] (capture=False)
devsync happy - path /dev/md0 now exists
return volume path /dev/md0
Processing serial BTHC63800AX9480MGN via udev to BTHC63800AX9480MGN
devsync for /dev/sde
Running command ['partprobe', '/dev/sde'] with allowed return codes [0, 1] (capture=False)
Running command ['udevadm', 'settle'] with allowed return codes [0] (capture=False)
devsync happy - path /dev/sde now exists
return volume path /dev/sde
caching device does not yet exist at /sys/class/
Running command ['make-bcache', '-C', '/dev/sde'] with allowed return codes [0] (capture=True)
out=[UUID: 80409bc0-
Set UUID: b8dc8e2e-
version: 0
nbuckets: 915725
block_size: 1
nr_in_set: 1
]
check just created bcache /dev/sde if it is registered, try=1
Running command ['udevadm', 'settle'] with allowed return codes [0] (capture=False)
Found bcache dev /dev/sde at expected path /sys/fs/
validating bcache caching device '/dev/sde' from sys_path '/sys/fs/
Resolving symlink /sys/fs/
Found match: bcache_
bcache dev /dev/sde at path /sys/fs/
An error occured handling 'bcache0': RuntimeError - ('Unexpected old bcache device: %s', '/dev/md0')
finish: cmd-install/
finish: cmd-install/
Traceback (most recent call last):
File "/curtin/
ret = args.func(args)
File "/curtin/
File "/curtin/
Ante Karamatić (ivoks) wrote (#9):
Another occurrence:
Mar 21 12:14:34 infra3 cloud-init[2891]: mdadm: stop failed, retrying in 0.2 seconds
Mar 21 12:14:34 infra3 cloud-init[2891]: mdadm: stop on /dev/md127 attempt 6
Mar 21 12:14:34 infra3 cloud-init[2891]: /sys/class/
Mar 21 12:14:34 infra3 cloud-init[2891]: mdadm: setting array sync_action=idle
Mar 21 12:14:34 infra3 cloud-init[2891]: /sys/class/
Mar 21 12:14:34 infra3 cloud-init[2891]: Running command ['mdadm', '--manage', '--stop', '/dev/md127'] with allowed return codes [0] (capture=True)
Mar 21 12:14:34 infra3 cloud-init[2891]: mdadm stop failed, retrying
Mar 21 12:14:34 infra3 cloud-init[2891]: /proc/mdstat:
Mar 21 12:14:34 infra3 cloud-init[2891]: Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
Mar 21 12:14:34 infra3 cloud-init[2891]: md127 : active raid5 sdd[3] sdc[1] sdb[0]
Mar 21 12:14:34 infra3 cloud-init[2891]: 1953262592 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
Mar 21 12:14:34 infra3 cloud-init[2891]: [>.....
Mar 21 12:14:34 infra3 cloud-init[2891]: bitmap: 2/8 pages [8KB], 65536KB chunk
Mar 21 12:14:34 infra3 cloud-init[2891]:
Mar 21 12:14:34 infra3 cloud-init[2891]: unused devices: <none>
Mar 21 12:14:34 infra3 cloud-init[2891]:
Mar 21 12:14:34 infra3 cloud-init[2891]: mdadm: stop failed, retrying in 0.2 seconds
Mar 21 12:14:34 infra3 cloud-init[2891]: mdadm: stop on /dev/md127 attempt 7
Mar 21 12:14:34 infra3 cloud-init[2891]: /sys/class/
Mar 21 12:14:34 infra3 cloud-init[2891]: mdadm: setting array sync_action=idle
Mar 21 12:14:34 infra3 cloud-init[2891]: /sys/class/
Mar 21 12:14:34 infra3 cloud-init[2891]: Running command ['mdadm', '--manage', '--stop', '/dev/md127'] with allowed return codes [0] (capture=True)
Mar 21 12:14:34 infra3 cloud-init[2891]: mdadm stop failed, retrying
Mar 21 12:14:34 infra3 cloud-init[2891]: /proc/mdstat:
Mar 21 12:14:34 infra3 cloud-init[2891]: Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
Mar 21 12:14:34 infra3 cloud-init[2891]: md127 : active raid5 sdd[3] sdc[1] sdb[0]
Mar 21 12:14:34 infra3 cloud-init[2891]: 1953262592 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
Mar 21 12:14:34 infra3 cloud-init[2891]: [>.....
Mar 21 12:14:34 infra3 cloud-init[2891]: bitmap: 2/8 pages [8KB], 65536KB chunk
Ryan Harper (raharper) wrote (#10): Re: [Bug 1750519] Re: An error occured handling 'bcache0': ValueError - Invalid number 0 of holding devices: "[]"
Please attach the entire curtin install.log in debug mode; it's critical to see what the storage configuration was and what curtin cleared from the disks.
This bug is believed to be fixed in curtin in 18.1. If this is still a problem for you, please make a comment and set the state back to New.
Thank you.
Changed in curtin:
status: Fix Committed → Fix Released
Boot log, including verbose curtin data. Note that this is from a different machine, but with exactly the same disk configuration.