MAAS power-on timeout is too low for LXD

Bug #2026181 reported by Simon Fels
This bug affects 1 person
Affects: MAAS
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

Hey :-)

We have a LXD cluster running on six Arm servers (Ampere eMAG and Altra). On the cluster we have a set of VMs that we manually registered with MAAS and deploy regularly from our CI for testing. We don't use the built-in pod functionality.

Quite often we run into the following problem: our CI allocates and deploys a machine, and MAAS starts to power on the VM. After some time it detects that the VM never started, stops the process, and marks the deployment as failed. Looking into the logs we see the following:

 Wed, 05 Jul. 2023 10:09:01 TFTP Request - bootaa64.efi
 Wed, 05 Jul. 2023 10:08:22 Failed to power on node - Power on for the node failed: Failed talking to node's BMC: Failed to power erk3dh. BMC never transitioned from off to on.
 Wed, 05 Jul. 2023 10:08:22 Node changed status - From 'Deploying' to 'Failed deployment'
 Wed, 05 Jul. 2023 10:08:22 Marking node failed - Power on for the node failed: Failed talking to node's BMC: Failed to power erk3dh. BMC never transitioned from off to on.
 Wed, 05 Jul. 2023 10:07:45 Powering on
 Wed, 05 Jul. 2023 10:07:35 Deploying

MAAS started to power on the VM at 10:07:45 and detected at 10:08:22, 37s later, that it was never successfully powered on. This roughly matches the DEFAULT_WAITING_POLICY (35s) in src/provisioningserver/drivers/power/__init__.py.
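To make the timing concrete, here is a minimal Python sketch of a poll-with-backoff loop of the kind the waiting policy suggests. It is not the MAAS implementation: the delay tuple is an assumption chosen to sum to 35s, and `wait_for_power_on`, `simulated_vm` and `ASSUMED_WAITING_POLICY` are hypothetical names used only for illustration.

```python
import time
from typing import Callable, Iterable

# Assumed delays (seconds) before each power-state check. They sum to 35,
# matching the 35s quoted above; they are not copied from the MAAS source.
ASSUMED_WAITING_POLICY = (1, 2, 2, 4, 6, 8, 12)


def wait_for_power_on(
    query_state: Callable[[], str],
    policy: Iterable[float] = ASSUMED_WAITING_POLICY,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once query_state() reports "on", False if the policy runs out."""
    for delay in policy:
        sleep(delay)  # wait before each check
        if query_state() == "on":
            return True
    return False


def simulated_vm(start_duration: float):
    """Build (query_state, sleep) for a fake VM that is "on" after start_duration seconds."""
    clock = {"now": 0.0}

    def fake_sleep(seconds: float) -> None:
        clock["now"] += seconds  # advance simulated time instead of really sleeping

    def query_state() -> str:
        return "on" if clock["now"] >= start_duration else "off"

    return query_state, fake_sleep


total_wait = sum(ASSUMED_WAITING_POLICY)

# A VM that needs 67s to start outlasts the 35s window and is reported as failed.
query, fake_sleep = simulated_vm(67)
slow_vm_ok = wait_for_power_on(query, sleep=fake_sleep)

# The same VM passes once the policy is extended by three more 12s steps (71s in total).
query, fake_sleep = simulated_vm(67)
extended_ok = wait_for_power_on(
    query, ASSUMED_WAITING_POLICY + (12, 12, 12), sleep=fake_sleep
)
```

With the assumed policy the slow VM is reported as failed, while the extended policy lets it through, which is the behaviour a higher or configurable timeout would give us.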

Checking the LXD logs, the VM is powered on by MAAS but finishes the start operation about 30s later than MAAS expects:

2023-07-05T10:07:46Z lxd.daemon[3211129]: time="2023-07-05T10:07:46Z" level=debug msg="Start started" instance=vm16 instanceType=virtual-machine project=default stateful=false
[...]
2023-07-05T10:08:53Z lxd.daemon[3211129]: time="2023-07-05T10:08:53Z" level=debug msg="Start finished" instance=vm16 instanceType=virtual-machine project=default stateful=false
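The gap can be checked directly from the timestamps in the two logs. This is only a small arithmetic sketch; `seconds_between` is a hypothetical helper, and the times are the ones quoted in the MAAS event log and the LXD log above.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def seconds_between(start: str, end: str) -> int:
    """Whole seconds between two same-day HH:MM:SS timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


# MAAS: "Powering on" until "Failed to power on node"
maas_window = seconds_between("10:07:45", "10:08:22")

# LXD: "Start started" until "Start finished"
lxd_start = seconds_between("10:07:46", "10:08:53")

# How long after MAAS gave up the VM actually finished starting
shortfall = seconds_between("10:08:22", "10:08:53")

print(maas_window, lxd_start, shortfall)  # 37 67 31
```

So the start operation takes 67s, while MAAS gives up after 37s, 31s before the VM is up.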

Some of the VMs have PCI passthrough enabled and may run on a busy system. We tried to shorten the time it takes to finish the start operation, but that is not easy. Is there a way to raise the timeout or make it configurable?

This is with MAAS 3.3.4.

Thanks!

Simon Fels (morphis)
description: updated
Alberto Donato (ack)
Changed in maas:
status: New → Triaged
importance: Undecided → Medium
milestone: none → 3.5.0
Changed in maas:
milestone: 3.5.0 → 3.5.x