[2.5] pod VMs fail to commission due to corrupted initrd
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
MAAS | New | Undecided | Unassigned | | |
Bug Description
On MAAS HA 2.5, some pod VMs reach the 'Ready' state, while a few appear to get stuck in commissioning.
2019-01-08-15:12:40 foundationcloud
Status: {u'landscapeamq
summary: |
- maas 2.5: pod VMs stuck on commissioning + [2.5] Pod VMs stuck on commissioning |
Changed in maas: | |
status: | Incomplete → New |
summary: |
- [2.5] Pod VMs stuck on commissioning + [2.5] pod VMs fail to commission due to corrupted initrd |
First I looked at the logs from `landscapesql-1`, which (according to your logs) seems to be in Ready state. Indeed, it looks like cloud-init runs and commissions the machine successfully.
2019-01-08T14:53:01+00:00 landscapesql-1 cloud-init[925]: All scripts successfully ran
2019-01-08T14:53:01+00:00 landscapesql-1 cloud-init[925]: Cloud-init v. 18.4-0ubuntu1~18.04.1 finished at Tue, 08 Jan 2019 14:53:01 +0000. Datasource DataSourceMAAS [http://10-244-40-0--21.maas-internal:5248/MAAS/metadata/]. Up 176.97 seconds
Then I looked at the logs for `landscapeha-1`, which seems to still be in 'Commissioning' state. In contrast, there is no rsyslog output, and the machine is terminated by libvirtd less than a second after attempting startup:
2019-01-08 14:49:22.811+0000: starting up libvirt version: 4.0.0, package: 1ubuntu8.6 (Christian Ehrhardt <email address hidden> Fri, 09 Nov 2018 07:42:01 +0100), qemu version: 2.11.1(Debian 1:2.11+dfsg-1ubuntu7.9), hostname: leafeon
...
2019-01-08 14:49:23.162+0000: shutting down, reason=destroyed
However, later in the logs I see the startup message with no corresponding shutdown message, so I don't know if the machine actually booted and is attempting to commission.
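To make the startup/shutdown comparison less manual, something like the following could pair the events in a per-domain libvirt log. This is only a rough sketch: the line format is assumed from the two excerpts quoted above, and `pair_boot_events` is a made-up helper name, not part of libvirt or MAAS.

```python
import re
from datetime import datetime

# Assumption: event lines look like the excerpts in this report, i.e.
# "<date> <time>.<ms><utc offset>: starting up ..." / "...: shutting down, ...".
EVENT_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})[+-]\d{4}: "
    r"(starting up|shutting down)"
)


def pair_boot_events(log_text):
    """Return (start, stop) datetime pairs; stop is None when no shutdown followed."""
    runs = []
    open_start = None
    for line in log_text.splitlines():
        match = EVENT_RE.match(line)
        if not match:
            continue  # ignore everything that is not a startup/shutdown line
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if match.group(2) == "starting up":
            if open_start is not None:
                # Previous startup never logged a shutdown.
                runs.append((open_start, None))
            open_start = stamp
        elif open_start is not None:
            runs.append((open_start, stop := stamp))
            open_start = None
    if open_start is not None:
        runs.append((open_start, None))
    return runs
```

Feeding it the contents of the domain's log (typically under `/var/log/libvirt/qemu/` on the hypervisor) would show which startups were destroyed within a fraction of a second and which one is still open, i.e. has a `None` stop time.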
Can you see any pattern regarding which hosts failed to commission, or is it random every time? Do these machines have a unique networking or storage configuration? (For example, maybe MAAS has created machines based on interface constraints without a PXE network. I thought we had a separate open bug on that, but I can't find it.)
Are these hosts still in commissioning state according to the MAAS database? What happens if you try to commission them again? Can you use a tool such as virt-manager to browse the hypervisor and check whether anything looks suspicious in the VM configuration, such as missing or incorrectly attached NICs or a duplicate MAC address?
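If browsing each VM by hand is tedious, the NIC checks could also be scripted against the domain XML (for example, the output of `virsh dumpxml <domain>` for each VM). The sketch below assumes the usual libvirt layout of `<interface>` elements with a `<mac address=.../>` child under `<devices>`; `nic_macs` and `audit_domains` are hypothetical helper names used only for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def nic_macs(domain_xml):
    """Return (domain name, lowercase MAC addresses) for one domain XML document."""
    root = ET.fromstring(domain_xml)
    name = root.findtext("name", default="(unnamed)")
    macs = [
        mac.get("address", "").lower()
        for mac in root.findall("./devices/interface/mac")
    ]
    return name, macs


def audit_domains(domain_xmls):
    """Return (domains without any NIC, {mac: [domains]} for MACs seen more than once)."""
    owners = defaultdict(list)
    no_nic = []
    for xml_text in domain_xmls:
        name, macs = nic_macs(xml_text)
        if not macs:
            no_nic.append(name)
        for mac in macs:
            owners[mac].append(name)
    duplicates = {mac: names for mac, names in owners.items() if len(names) > 1}
    return no_nic, duplicates
```

This only flags missing interfaces and repeated MAC addresses; it does not check whether a NIC is attached to the PXE network, which would still need a look at each interface's `<source>` element.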