Creating MAAS Virsh VM does not work (libvirt: error)

Bug #2053033 reported by Bartłomiej Poniecki-Klotz
40
This bug affects 6 people
Affects Status Importance Assigned to Milestone
MAAS
Fix Released
Critical
Björn Tillenius
3.3
Fix Released
Critical
Björn Tillenius
3.4
Fix Released
Critical
Björn Tillenius

Bug Description

When creating the Machines on KVM hosts, there is a problem with the connection to the virsh console.
The issue occurs during communication with virsh.

provisioningserver.drivers.pod.virsh.VirshError: Failed to login to virsh console.

Environment details:

$ snap list maas
Name Version Rev Tracking Publisher Notes
maas 3.3.5-13222-g.78dd996c0 32636 3.3/stable canonical✓ -

I used both MAAS CLI and UI to check power (easy way to call communication to the virsh console)

Logs:
- maas.log: https://pastebin.canonical.com/p/ZtGPSmbQp6/
- rackd.log: https://pastebin.canonical.com/p/6thxw55mrQ/
- regiond.log: https://pastebin.canonical.com/p/c9shkDFTD5/

The issue was previously discussed in the issue: https://bugs.launchpad.net/maas/+bug/1942390

Related branches

Revision history for this message
Bartłomiej Poniecki-Klotz (barteus) wrote :

subscribed ~field-critical

Changed in maas:
assignee: nobody → Björn Tillenius (bjornt)
status: New → In Progress
importance: Undecided → Critical
milestone: none → 3.5.x
Revision history for this message
Björn Tillenius (bjornt) wrote :

To re-iterate the findings from bug 1942390:

I got access to another environment and after adding some debugging, I noticed that virsh is intermittently returning this error:

b'libvirt: error : cannot execute binary /usr/bin/pkttyagent: '

In the journal, I see these messages:

Feb 12 03:47:01 leafeon audit[57692]: AVC apparmor="DENIED" operation="exec" profile="snap.maas.supervisor"
 name="/usr/bin/pkttyagent" pid=57692 comm="virsh" requested_mask="x" denied_mask="x" fsuid=0 ouid=0

Revision history for this message
Björn Tillenius (bjornt) wrote :

I still don't know why these errors are happening, but tomorrow I'll try to retry on such errors and see if that can work around the problem, until we find the root cause.

Revision history for this message
Jeff Lane  (bladernr) wrote :

We have recently also had virsh stop working on MAAS (3.4.0), none of our VMs are deployable. Looking at syslog I do see nearly 8000 errors like Bjorn mentioned in comment #2

Revision history for this message
Jeff Lane  (bladernr) wrote :

2024-02-12 10:44:03 provisioningserver.rpc.power: [critical] kvm-small-2: Power off failed.
        Traceback (most recent call last):
          File "/snap/maas/32469/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 661, in callback
            self._startRunCallbacks(result)
          File "/snap/maas/32469/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 763, in _startRunCallbacks
            self._runCallbacks()
          File "/snap/maas/32469/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 857, in _runCallbacks
            current.result = callback( # type: ignore[misc]
          File "/snap/maas/32469/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1750, in gotResult
            current_context.run(_inlineCallbacks, r, gen, status)
        --- <exception caught here> ---
          File "/snap/maas/32469/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 857, in _runCallbacks
            current.result = callback( # type: ignore[misc]
          File "/snap/maas/32469/lib/python3.10/site-packages/provisioningserver/rpc/power.py", line 242, in eb_cancelled
            failure.trap(CancelledError)
          File "/snap/maas/32469/usr/lib/python3/dist-packages/twisted/python/failure.py", line 451, in trap
            self.raiseException()
          File "/snap/maas/32469/usr/lib/python3/dist-packages/twisted/python/failure.py", line 475, in raiseException
            raise self.value.with_traceback(self.tb)
          File "/snap/maas/32469/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks
            result = current_context.run(
          File "/snap/maas/32469/usr/lib/python3/dist-packages/twisted/python/failure.py", line 489, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/snap/maas/32469/lib/python3.10/site-packages/provisioningserver/rpc/power.py", line 292, in change_power_state
            yield perform_power_driver_change(
          File "/snap/maas/32469/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks
            result = current_context.run(
          File "/snap/maas/32469/usr/lib/python3/dist-packages/twisted/python/failure.py", line 489, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/snap/maas/32469/lib/python3.10/site-packages/provisioningserver/drivers/power/__init__.py", line 372, in perform_power
            yield power_func(system_id, context)
          File "/snap/maas/32469/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1660, in _inlineCallbacks
            result = current_context.run(gen.send, result)
          File "/snap/maas/32469/lib/python3.10/site-packages/provisioningserver/drivers/pod/virsh.py", line 1316, in power_control_virsh
            raise VirshError("Failed to login to virsh console.")
        provisioningserver.drivers.pod.virsh.VirshError: Failed to login to virsh console.

Revision history for this message
Björn Tillenius (bjornt) wrote :

I still don't know why this is happening. The MAAS snap didn't change, but the core22 snap did change. pkttyagent comes from the core22 snap.

But it also doesn't happen on all machines. I have access to three machines with the same revision of the maas and core22 snaps, but only only one of the machines show the problem.

The good news is that it looks like we can work around it in MAAS. I've tried running virsh manually in the snap, and while the error is reported, the connection succeeds. So we only need to tweak the regexp a bit. It probably was intended for credentials errors.

I'm going to work on a fix in MAAS now. I'll also investigate pushing the fix to a hotfix snap branch, which can be used until we release the official point release.

Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
milestone: 3.5.x → 3.5.0
Revision history for this message
Björn Tillenius (bjornt) wrote :

I've created snap hotfix branches for 3.3 and 3.4:

  3.3/stable/hotfix-bug-2053033
  3.4/stable/hotfix-bug-2053033

Our automated tests haven't finished running yet, but I've tested them locally and can confirm that they work and fix the problem. More testing would be welcome. I'm confident they don't cause any serious problems, and it's easy to revert.

You can install them like this: sudo snap refresh --channel=3.4/stable/hotfix-bug-2053033 maas

I will report as soon as our automated testing finish.

When we release 3.4.1 and 3.3.6, we will include this fix as well and close the hotfix branches, which means that they will point to stable again.

Revision history for this message
Jeff Lane  (bladernr) wrote :

Unfortunately, that doesn't seem (so far) to have resolved this on my end... I

tracking: 3.4/stable/hotfix-bug-2053033
refresh-date: today at 18:12 EST
channels:
  3.4/stable: 3.4.0-14321-g.1027c7664 2024-01-05 (32469) 138MB -
  3.4/candidate: 3.4.0-14321-g.1027c7664 2024-01-05 (32550) 138MB -
  3.4/beta: ↑
  3.4/edge: 3.4.1-14332-g.9eb3ca242 2024-02-14 (33516) 138MB -
...
installed: 3.4.0-14322-g.2ac3386f1 (33528) 138MB -

but looking at our MAAS all the affected KVMs still show power errors (and it is only some of the virsh hosts, others seem OK for the moment and successfully report power status)

Revision history for this message
Björn Tillenius (bjornt) wrote :

Jeff, if it's only some of the virsh hosts that show an error, it's likely it's a different issue. The "Failed to login to virsh console." error is very generic and will show for a lot of different reasons.

I've already asked this in other channels, but for reference, could you please run the following?

  bjorn@maas-dev:~$ sudo snap run --shell maas.supervisor
  root@maas-dev:/home/bjorn# HOME=/var/snap/maas/current/root virsh --connect $power_address list

You can get the power_address from `maas $profile vm-host parameters $vm_host_id`

That should give you the underlying error.

Revision history for this message
Ethan Myers (ethanmyers) wrote :

I was running into this issue and can confirm the 3.4 hotfix worked for me. Thanks!

Revision history for this message
Bartosz Woronicz (mastier1) wrote :

I confirm this one fixed it for me:
installed: 3.4.1-14332-g.9eb3ca242 (33516) 138MB -

Revision history for this message
DUFOUR Olivier (odufourc) wrote :

I can confirm this can impact any existing deployments if refreshing the core22 to the latest stable version *and* restarting MAAS snap after.

With the previous version of the snap core22, there is no error related to pkttyagent like below :
#
# before refresh
#
ubuntu@infra-1:~$ sudo snap run --shell maas
root@infra-1:/home/ubuntu# virsh -c qemu+ssh://maas@10.244.40.30/system pool-info pool_virt_images
Name: pool_virt_images
...

#
# refreshing core22 snap
#
ubuntu@infra-1:~$ sudo snap refresh core22
core22 20240111 from Canonical✓ refreshed
#
# restart maas
#
ubuntu@infra-1:~$ sudo snap restart maas
2024-02-16T04:12:09Z INFO Waiting for "snap.maas.supervisor.service" to stop.
Restarted.
ubuntu@infra-1:~$ sudo snap run --shell maas
root@infra-1:/home/ubuntu# virsh -c qemu+ssh://maas@10.244.40.30/system pool-info pool_virt_images
libvirt: error : cannot execute binary /usr/bin/pkttyagent: Permission denied
Name: pool_virt_images
...

Revision history for this message
Nobuto Murata (nobuto) wrote :

Just for the record, I posted the summary of the situation as:
https://forum.snapcraft.io/t/enabling-polkitd-on-ubuntu-core/31295/2

It looks like it's the mixture of core22 adding a new package (polkitd) in the middle of the lifecycle and how MAAS was interacting with libvirt.

Revision history for this message
Alan Baghumian (alanbach) wrote :

I also encountered this issue today and installing the hot-fix on the Rack and Region Controllers resolved the issue.

Changed in maas:
milestone: 3.5.0 → 3.5.0-beta1
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.