Deploying a server with bcache on top of HDD and mdadm can frequently fail

Bug #2054672 reported by DUFOUR Olivier
Affects    Status         Importance  Assigned to          Milestone
MAAS       Fix Released   Critical    Alexsander de Souza
  3.3      Fix Committed  Undecided   Unassigned
  3.4      Fix Released   Undecided   Unassigned
curtin     Fix Committed  Undecided   Alexsander de Souza

Bug Description

Environment:
* MAAS 3.3 and 3.4
* Ubuntu 22.04
* Deployment / commissioning OS: 20.04 and 22.04
* Servers to deploy have slow drives such as HDDs

When deploying a server that uses bcache as the device for its rootfs, especially on top of software RAID (mdadm) and with slow drives such as hard drives, the storage configuration step of the Ubuntu installation fails quite frequently.

#
# Reproducer :
#
It is possible to recreate the slow-drive environment with libvirt using the following setup:
1) Create 6 or more VMs with (see the script "create-slow-vms.sh" for the exact commands, and the sketch after this list):
 * 3 vCPUs
 * 4 GB of RAM
 * 3 disks :
   * 1 x 10 GB fast disk, used as the bcache cache
   * 2 x 30 GB disks with limited IOPS (150 IOPS, 30 MB/s top speed)

2) Configure the following disk topology (see reproducer-storage-config.png):
 * /dev/vda --> 2 partitions
   - 1GB for md0
   - 29GB for md1
 * /dev/vdb --> 2 partitions
   - 1GB for md0
   - 29GB for md1
 * /dev/md0 --> ext4 for /boot
 * /dev/vdc (fast drive) --> bcache0 cache set
 * /dev/md1 --> bcache0 backend storage
 * /dev/bcache0 --> ext4 for /

3) Deploy Ubuntu 22.04 to all VMs
--> some of the VMs will fail with the same Curtin error

4) (Optional) Releasing the servers without erasing the drives and redeploying them right away seems to greatly increase the likelihood of the deployment failing.
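
For reference, below is a minimal sketch of the kind of commands create-slow-vms.sh runs; the attached script is the authoritative version, the VM and pool names are illustrative, and throttling via virsh blkdeviotune is just one way to obtain the 150 IOPS / 30 MB/s limits:

#!/bin/bash
# Illustrative sketch only -- see the attached create-slow-vms.sh for the real commands.
set -e

VM="slow-node1"                       # hypothetical VM name
POOL="/var/lib/libvirt/images"

# Two 30 GB disks to be throttled and one fast 10 GB disk (future bcache cache).
qemu-img create -f qcow2 "$POOL/$VM-slow1.qcow2" 30G
qemu-img create -f qcow2 "$POOL/$VM-slow2.qcow2" 30G
qemu-img create -f qcow2 "$POOL/$VM-fast.qcow2" 10G

# 3 vCPUs, 4 GB of RAM, PXE boot so MAAS can enlist/commission the machine.
virt-install --name "$VM" --vcpus 3 --memory 4096 \
  --disk path="$POOL/$VM-slow1.qcow2",bus=virtio \
  --disk path="$POOL/$VM-slow2.qcow2",bus=virtio \
  --disk path="$POOL/$VM-fast.qcow2",bus=virtio \
  --network network=default,model=virtio \
  --pxe --os-variant ubuntu22.04 --noautoconsole

# Throttle the two backing disks (vda, vdb) to emulate slow HDDs.
for dev in vda vdb; do
  virsh blkdeviotune "$VM" "$dev" \
    --total-iops-sec 150 --total-bytes-sec $((30*1024*1024)) --live --config
done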

#
# logs
#
I'm attaching some more logs to the bug report:
* quick-summary-logs.txt --> some logs from baremetal servers on the customer's hardware.
* reproducer-installation-output.txt --> full installation output from a failing deployment in my reproducer test.

# theory
At first glance, it looks like a race condition, because when reusing the same server and retrying the Ubuntu deployment, it may work fine.
It is probably triggered because the hard drives are already busy with mdadm syncing the disks, and become even slower when changes such as creating a bcache backing device are requested, at which point curtin hits the race condition and fails.
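
As a side note, the ongoing resync is easy to confirm on an affected node while curtin is running; these are standard md interfaces, shown here only as a generic pointer:

cat /proc/mdstat                       # overall resync progress and speed
cat /sys/block/md1/md/sync_action      # e.g. "resync" or "idle"
cat /sys/block/md1/md/sync_completed   # progress in sectors, e.g. "<done> / <total>"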

On a large deployment such as OpenStack, this makes the installation process cumbersome, as one or more servers may randomly fail to deploy.

Looking at the logs of the installation output from MAAS, curtin seems to fail to confirm the backend storage device.
# main differences
## working
2024-02-06T10:09:43+00:00 server-node3 cloud-init[2701]: check just created bcache /dev/md1 if it is registered, try=2
2024-02-06T10:09:43+00:00 server-node3 cloud-init[2701]: Running command ['udevadm', 'settle'] with allowed return codes [0] (capture=False)
2024-02-06T10:09:43+00:00 server-node3 cloud-init[2701]: TIMED udevadm_settle(): 0.018
2024-02-06T10:09:43+00:00 server-node3 cloud-init[2701]: Found bcache dev /dev/md1 at expected path /sys/class/block/md1/bcache
2024-02-06T10:09:43+00:00 server-node3 cloud-init[2701]: validating bcache backing device '/dev/md1' from sys_path '/sys/class/block/md1/bcache'
2024-02-06T10:09:43+00:00 server-node3 cloud-init[2701]: bcache device /sys/class/block/md1/bcache using bcache kname: bcache6
2024-02-06T10:09:44+00:00 server-node3 cloud-init[2701]: bcache device /sys/class/block/md1/bcache has slaves: ['md1']

## non-working
2024-02-06T10:09:52+00:00 server-node1 cloud-init[2698]: check just created bcache /dev/md1 if it is registered, try=2
2024-02-06T10:09:52+00:00 server-node1 cloud-init[2698]: Running command ['udevadm', 'settle'] with allowed return codes [0] (capture=False)
2024-02-06T10:09:52+00:00 server-node1 cloud-init[2698]: TIMED udevadm_settle(): 0.019
2024-02-06T10:09:52+00:00 server-node1 cloud-init[2698]: Found bcache dev /dev/md1 at expected path /sys/class/block/md1/bcache
2024-02-06T10:09:52+00:00 server-node1 cloud-init[2698]: validating bcache backing device '/dev/md1' from sys_path '/sys/class/block/md1/bcache'
2024-02-06T10:09:52+00:00 server-node1 cloud-init[2698]: bcache dev /dev/md1 at path /sys/class/block/md1/bcache successfully registered on attempt 2/60
2024-02-06T10:09:52+00:00 server-node1 cloud-init[2698]: devname '/dev/md1' had holders: []
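
For reference, the state these log lines describe can be checked by hand from sysfs on a node stuck at this step (a generic sketch, not taken from curtin itself; device names match the reproducer above):

udevadm settle                    # same settle curtin performs first
ls -d /sys/class/block/md1/bcache # present once /dev/md1 is registered as a backing device
ls /sys/block/bcache0/slaves      # lower devices of the assembled bcache device, should list md1
ls /sys/class/block/md1/holders   # should list the bcache device; empty in the failing case above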


Revision history for this message
DUFOUR Olivier (odufourc) wrote :

Subscribed ~Field High

It greatly penalises an ongoing deployment with a customer relying on bcache with hard drives.

Changed in maas:
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Alexsander de Souza (alexsander-souza)
milestone: none → 3.5.0
Changed in curtin:
assignee: nobody → Alexsander de Souza (alexsander-souza)
Changed in maas:
status: Triaged → In Progress
Changed in curtin:
status: New → Fix Committed
Changed in maas:
status: In Progress → Fix Committed
Revision history for this message
Alexsander de Souza (alexsander-souza) wrote :

Curtin updated to 23.1.1-1099-g585dd3a9-0ubuntu1~ubuntu22.04.1

Revision history for this message
DUFOUR Olivier (odufourc) wrote :

Hello

Thank you for your help so far.
I've run more tests in my lab with the daily build of Curtin (22.1-1153-gfc39d744-0ubuntu1+318~trunk~ubuntu22.04.1).

After a more in-depth analysis, I've noticed 3 different scenarios where installations with bcache can fail with Curtin:

1) If the servers are released without having their disks cleaned
--> curtin-logs-without-disk-erasing.tar
Problem: Curtin seems to fail to stop mdadm because bcache is on top of it; as a wild guess, curtin might need to stop bcache first and then mdadm in order to progress any further.

2) If the servers are released with only a quick disk erase and then redeployed
(This is a common scenario with hard drives, since the vast majority of them don't support secure erase like SSDs do, and using MAAS to fully erase hard drives can literally take multiple days to complete.)
--> curtin-logs-after-quick-disk-erase.tar
Problem: Partly related to the first issue, MAAS' quick erase method doesn't seem to be thorough enough to remove all the partition signatures, such as bcache, from the disks.

3) When using a commissioning script (manual-clean-disks.sh) to compensate for MAAS' quick erase not being thorough enough, the race condition can still happen after redeploying (the initial subject of this bug report).
I believe it might be fixed, since I cannot reproduce it in my lab, but I would need to test in the customer's environment to confirm definitively.
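
For the record, the cleaning the commissioning script performs is roughly the following; this is only a sketch with illustrative device names, and the attached manual-clean-disks.sh is the reference:

#!/bin/bash
# Illustrative sketch -- see the attached manual-clean-disks.sh for the real script.
# Assumes any md arrays or bcache devices using the disks were already stopped.
set -e

for disk in /dev/sda /dev/sdb /dev/sdc; do
  # Clear RAID superblocks on the disk and any of its partitions.
  mdadm --zero-superblock --force "$disk"* 2>/dev/null || true

  # Remove all known signatures (bcache, mdraid, filesystem, GPT/MBR).
  wipefs --all --force "$disk"

  # Belt and braces: zero the first few MiB, which cover the bcache superblock (4 KiB offset).
  dd if=/dev/zero of="$disk" bs=1M count=8 conv=fsync
done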

Revision history for this message
Alexsander de Souza (alexsander-souza) wrote :

We are discussing this greedy behaviour of bcache in https://bugs.launchpad.net/maas/+bug/1887558

Revision history for this message
DUFOUR Olivier (odufourc) wrote :

At least for issue #3, with the custom disk cleaning script, I've confirmed that the deployment is more reliable.

Do we have any idea of the timing for the first fix to be included in MAAS' snap?

Changed in maas:
milestone: 3.5.0 → 3.5.0-beta1
status: Fix Committed → Fix Released