stx-openstack: Unable to open /dev/kvm No such file or directory

Bug #1999445 reported by Thales Elero Cervi
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Thales Elero Cervi
Milestone: (none)

Bug Description

Brief Description
-----------------
Initial tests of stx-openstack on Debian failed. The tests used the stx-libvirt image based on stx-debian, together with the .deb packages for libvirt and qemu.
The libvirt pod initializes successfully and starts the host libvirtd, but the libvirtd log on the host shows that the kvm device could not be opened.

Severity
--------
Major: stx-openstack virtualization functions are degraded

Steps to Reproduce
------------------
* Upload stx-openstack (Debian stx)
* Helm-override libvirt image to use a custom built stx-libvirt
* Apply stx-openstack

Expected Behavior
------------------
libvirt log should not show a failure when opening /dev/kvm

Actual Behavior
----------------
libvirt log shows a failure when opening /dev/kvm

Reproducibility
---------------
Reproducible

System Configuration
--------------------
AIO-SX

Branch/Pull Time/Commit
-----------------------
master:
* starlingx/master/debian/monolithic/20221206T070000Z

+ https://review.opendev.org/c/starlingx/integ/+/866412

Last Pass
---------
N/A

Timestamp/Logs
--------------
$ sudo head -n 5 /var/log/libvirt/libvirtd.log
2022-12-06 10:55:40.938+0000: 3356907: info : libvirt version: 7.0.0, package: 3.stx.3 (STX Builder <email address hidden> Thu, 01 Dec 2022 21:21:14 +0000)
2022-12-06 10:55:40.938+0000: 3356907: info : hostname: controller-0
2022-12-06 10:55:40.938+0000: 3356907: error : virHostCPUGetTscInfo:1360 : Unable to open /dev/kvm: No such file or directory
2022-12-06 10:55:42.400+0000: 3356907: error : virHostCPUGetTscInfo:1360 : Unable to open /dev/kvm: No such file or directory
2022-12-06 10:55:42.406+0000: 3356907: error : virHostCPUGetTscInfo:1360 : Unable to open /dev/kvm: No such file or directory
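
To see how often the failure repeats in a longer log, the error lines can be grouped by message. The following is only a minimal sketch: the regular expression assumes every line follows the "<date> <time>: <pid>: <level> : <message>" shape of the excerpt above, and the names LINE_RE, SAMPLE_LOG and count_errors are hypothetical helpers, not part of libvirt.

```python
import re
from collections import Counter

# Assumed line shape, based only on the excerpt above:
# "<date> <time>: <pid>: <level> : <message>"
LINE_RE = re.compile(r"^(\S+ \S+): (\d+): (\w+) : (.*)$")

# Lines copied from the libvirtd.log excerpt in this report.
SAMPLE_LOG = """\
2022-12-06 10:55:40.938+0000: 3356907: info : hostname: controller-0
2022-12-06 10:55:40.938+0000: 3356907: error : virHostCPUGetTscInfo:1360 : Unable to open /dev/kvm: No such file or directory
2022-12-06 10:55:42.400+0000: 3356907: error : virHostCPUGetTscInfo:1360 : Unable to open /dev/kvm: No such file or directory
2022-12-06 10:55:42.406+0000: 3356907: error : virHostCPUGetTscInfo:1360 : Unable to open /dev/kvm: No such file or directory
"""


def count_errors(log_text):
    """Count error-level lines, keyed by their message text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group(3) == "error":
            counts[match.group(4)] += 1
    return counts


if __name__ == "__main__":
    for message, count in count_errors(SAMPLE_LOG).items():
        print(count, message)
```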

Test Activity
-------------
Developer Testing

Workaround
----------
N/A

Changed in starlingx:
assignee: nobody → Thales Elero Cervi (tcervi)

Thales Elero Cervi (tcervi) wrote :

This might only be happening on virtual deployments (Nested Virtualization). Need to get a physical environment to install and check.

I noticed that previously, on CentOS, the kvm character device (/dev/kvm) was present once stx-openstack was applied, even though my vbox test VMs did not have the "Enable VT-x/AMD-V" option enabled.

Now, on Debian, my vbox test VMs do not have the kvm character device (/dev/kvm), even after the application is successfully applied. When I enable the "Enable VT-x/AMD-V" option, though, the kvm device is there.

Will investigate it further.

Thales Elero Cervi (tcervi) wrote :

The kvm device is always available when stx is installed on labs (as long as the virtualization functions are enabled in the BIOS) and on virtual machines when Nested Virtualization is enabled ("Enable VT-x/AMD-V" in VirtualBox).
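
One way to check this precondition from inside a host or VM is to look for the vmx (Intel VT-x) or svm (AMD-V) CPU flags in /proc/cpuinfo. The sketch below only illustrates that check: has_hw_virt is a hypothetical helper name, and the two sample strings are made up for the example rather than taken from a real system.

```python
def has_hw_virt(cpuinfo_text: str) -> bool:
    """Return True if any CPU entry lists the vmx (VT-x) or svm (AMD-V) flag."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            if set(value.split()) & {"vmx", "svm"}:
                return True
    return False


# Illustrative /proc/cpuinfo fragments (assumed, not from a real machine).
SAMPLE_WITH_VTX = "processor\t: 0\nflags\t\t: fpu vme de pse vmx sse sse2\n"
SAMPLE_WITHOUT = "processor\t: 0\nflags\t\t: fpu vme de pse sse sse2\n"

if __name__ == "__main__":
    print(has_hw_virt(SAMPLE_WITH_VTX))
    print(has_hw_virt(SAMPLE_WITHOUT))
```

On a live system the function could be given the contents of /proc/cpuinfo instead of the samples.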

The problem is no longer whether the char device exists, but who owns it and which permissions are set. The Debian migration is incomplete with regard to /dev/kvm permissions, and this will be handled as part of this bug fix.
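
As a rough illustration of what "which permissions are set" means for the two mode strings listed further down in this comment (crw-rw-rw- on CentOS, crw-rw---- on Debian), the sketch below decodes a device mode with Python's stat module. The helper name describe_mode and the two constants are hypothetical; the numeric modes are simply the octal equivalents of those strings.

```python
import stat


def describe_mode(mode: int) -> dict:
    """Summarize who may read and write a device node with the given mode."""
    return {
        "string": stat.filemode(mode),
        "is_char_device": stat.S_ISCHR(mode),
        "group_rw": bool(mode & stat.S_IRGRP) and bool(mode & stat.S_IWGRP),
        "other_rw": bool(mode & stat.S_IROTH) and bool(mode & stat.S_IWOTH),
    }


# Character device with rw for user, group and others (CentOS listing).
CENTOS_MODE = stat.S_IFCHR | 0o666
# Character device with rw for user and group only (Debian listing).
DEBIAN_MODE = stat.S_IFCHR | 0o660

if __name__ == "__main__":
    print(describe_mode(CENTOS_MODE))
    print(describe_mode(DEBIAN_MODE))
```

With the Debian mode, only root and members of the owning group can open the device, so the group that owns /dev/kvm decides whether the qemu processes can use it. On a live host the same function could be given os.stat("/dev/kvm").st_mode.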

---------------------------------------------------------------------------------------------------
For reference, on a CentOS installation where the stx-libvirt-master-centos-* is the image used for the libvirt container, the users/groups and kvm device permissions are the following:

Host:
$ sudo cat /etc/group | egrep 'nova|libvirt|qemu|kvm'
nova:x:162:nova
libvirt:x:991:nova
kvm:x:36:qemu
qemu:x:107:
$ sudo cat /etc/passwd | egrep 'nova|libvirt|qemu|kvm'
nova:x:994:162:OpenStack Nova Daemons:/var/lib/nova:/sbin/nologin
qemu:x:107:107:qemu user:/:/sbin/nologin
$ ls -lha /dev/kvm
crw-rw-rw- 1 root kvm 10, 232 Dez 15 12:11 /dev/kvm

Container:
# cat /etc/group | egrep 'nova|libvirt|qemu|kvm'
kvm:x:36:qemu,nova
qemu:x:107:
libvirt:x:993:
nova:x:42424:
# cat /etc/passwd | egrep 'nova|libvirt|qemu|kvm'
qemu:x:107:107:qemu user:/:/sbin/nologin
nova:x:42424:42424:nova user:/var/lib/nova:/usr/sbin/nologin

That is not quite what is currently seen on an stx Debian installation:
Host:
$ sudo cat /etc/group | egrep 'nova|libvirt|qemu|kvm'
nova:x:162:nova
libvirt:x:991:nova
kvm:x:102:
$ sudo cat /etc/passwd | egrep 'nova|libvirt|qemu|kvm'
nova:x:994:162:OpenStack Nova Daemons:/var/lib/nova:/sbin/nologin
$ ls -lha /dev/kvm
crw-rw---- 1 root 36 10, 232 dez 14 20:25 /dev/kvm

Even after switching the libvirt container image to be the stx-libvirt-master-debian-*, some mismatch persists:
$ ls -lha /dev/kvm
crw-rw---- 1 root uuidd 10, 232 dez 14 23:54 /dev/kvm

This is probably because the users/groups inside the new container are not aligned with those on the host.
Container:
# cat /etc/passwd | egrep 'nova|libvirt|qemu|kvm'
nova:x:994:162:OpenStack Nova Daemons:/var/lib/nova:/sbin/nologin
libvirt-qemu:x:64055:109:Libvirt Qemu,,,:/var/lib/libvirt:/usr/sbin/nologin
# cat /etc/group | egrep 'nova|libvirt|qemu|kvm'
nova:x:162:nova
libvirt:x:991:nova
kvm:x:109:nova
libvirt-qemu:x:64055:libvirt-qemu

Apparently we need to align the kvm group inside the container with the group on the host.
It also seems that we missed a couple of libvirt packages when porting to Debian, mainly libvirt-daemon-system [1], which has a post-install script that creates the libvirt-qemu user accordingly.

Will be working on it now.

[1] https://salsa.debian.org/libvirt-team/libvirt/-/blob/debian/7.0.0-3/debian/libvirt-daemon-system.postinst

Thales Elero Cervi (tcervi) wrote :

Just for reference, when using the new stx-libvirt-master-debian-* image, the mismatch concerns the kvm group ID.

Inside the container, kvm group id is 109:
# cat /etc/group | egrep 'nova|libvirt|qemu|kvm'
nova:x:162:nova
libvirt:x:991:nova
kvm:x:109:nova
libvirt-qemu:x:64055:libvirt-qemu

On the host, however, this ID refers to the uuidd group:
$ sudo cat /etc/group | grep 109
uuidd:x:109:

That is why the kvm char device ends up with wrong ownership:
$ ls -lha /dev/kvm
crw-rw---- 1 root uuidd 10, 232 dez 14 23:54 /dev/kvm
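
The mismatch can be reproduced offline from the two /etc/group listings. The sketch below is only an illustration: the helper names are hypothetical, the container entries are the ones shown above, and the host entries combine the host listing from the previous comment with the uuidd line above. It resolves the container's kvm GID against the host's group table, which is effectively what ls does when it prints the owner of /dev/kvm on the host.

```python
def parse_group(text):
    """Parse /etc/group-style text into a {name: gid} mapping."""
    groups = {}
    for line in text.splitlines():
        parts = line.strip().split(":")
        if len(parts) >= 3 and parts[2].isdigit():
            groups[parts[0]] = int(parts[2])
    return groups


def host_name_for_container_group(name, container_text, host_text):
    """Return the host group name that owns the container's GID for `name`."""
    gid = parse_group(container_text)[name]
    by_gid = {g: n for n, g in parse_group(host_text).items()}
    # None means the host has no group with that numeric ID.
    return by_gid.get(gid)


CONTAINER_GROUP = """\
nova:x:162:nova
libvirt:x:991:nova
kvm:x:109:nova
libvirt-qemu:x:64055:libvirt-qemu
"""

HOST_GROUP = """\
nova:x:162:nova
libvirt:x:991:nova
kvm:x:102:
uuidd:x:109:
"""

if __name__ == "__main__":
    print(host_name_for_container_group("kvm", CONTAINER_GROUP, HOST_GROUP))
```

Because ownership is stored as a number, a chown to "kvm" performed inside the container writes GID 109, which the host then displays as uuidd.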

tags: added: stx.8.0 stx.distro.openstack

Thales Elero Cervi (tcervi) wrote :

On Debian, the libvirt and qemu users/groups setup changed a bit and seems to be easier to maintain.
Previously, there was no mismatch between the container kvm GID and the host kvm GID because both had our qemu rpm installed, which forced the GID to 36 [1].

After a bit of digging through the history of debian/qemu, I found the following timeline:

* The kvm group was first created by the qemu-system.postinst script [2]
* Then, its creation was moved to the qemu-system-common.postinst script [3]
* Finally, it was removed entirely [4], relying now on the debian/systemd patch [5] that already creates the group and sets the device (/dev/kvm) permissions and ownership accordingly.

Since on Debian systemd already creates the kvm group and handles the /dev/kvm permissions and ownership, we can remove this step from our libvirt container setup script (libvirt.sh [6]) and rely on the Debian installation defaults. I will create an openstack-helm-infra patch for it.

[1] https://opendev.org/starlingx/integ/src/branch/master/virt/qemu/centos/qemu-kvm.spec#L722
[2] https://salsa.debian.org/qemu-team/qemu/-/commit/dbb34ed82d28a07afc24ecbf62ecdd0dfc34b741
[3] https://salsa.debian.org/qemu-team/qemu/-/blob/debian/qemu_2.1+dfsg-12+deb8u6/debian/qemu-system-common.postinst
[4] https://salsa.debian.org/qemu-team/qemu/-/commit/cb8737ef48a37eddf12ac199b46f9034273ba6d3
[5] https://salsa.debian.org/systemd-team/systemd/-/commit/4fc3fa53bfa6e16ceb6cd312f49003839b56144a
[6] https://github.com/openstack/openstack-helm-infra/blob/master/libvirt/templates/bin/_libvirt.sh.tpl#L36

The only thing I still need to align is the addition of the container nova user/group to the kvm group. I need to understand how it should align with the users/groups on the host.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/868206

Changed in starlingx:
status: New → In Progress

OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (master)

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → High

OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/868209
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/39f75382fa7b59d9433bfd1c812d9c6b31f762b2
Submitter: "Zuul (22348)"
Branch: master

commit 39f75382fa7b59d9433bfd1c812d9c6b31f762b2
Author: Thales Elero Cervi <email address hidden>
Date: Tue Dec 20 09:47:32 2022 -0300

    Add patch to libvirt setup script

    On Debian this libvirt and qemu users/groups setup changed and it
    seems to be easier to maintain now, so we can drop a libvirt script
    setup step.

    Previously, on CentOS, there was no mismatch between the container kvm
    GID and the host kvm GID because both had our qemu rpm installed and
    it was forcing the GID to 36 [1]. On Debian it was removed at all [2],
    relying now on the debian/systemd patch [3][4] that already creates
    the group and sets the device (/dev/kvm) permissions and ownership
    accordingly.

    Since on Debian sytemd is already creating the kvm group and handling
    the /dev/kvm permission and ownership, we can remove this step from our
    libvirt container setup script and rely on Debian installation defaults.

    [1] https://opendev.org/starlingx/integ/src/branch/master/virt/qemu/centos/qemu-kvm.spec#L722
    [2] https://salsa.debian.org/qemu-team/qemu/-/commit/cb8737ef48a37eddf12ac199b46f9034273ba6d3
    [3] https://salsa.debian.org/systemd-team/systemd/-/commit/4fc3fa53bfa6e16ceb6cd312f49003839b56144a
    [4] https://bugs.launchpad.net/ubuntu/+source/gnome-boxes/+bug/1767302/comments/18

    Test Plan:
    PASS - Build openstack-helm-infra
    PASS - Build stx-openstack-fluxcd package
    PASS - Build stx-openstack helm charts
    PASS - Upload/Apply/Remove the application
    PASS - Check that the script skipped the kvm device permission set
    PASS - Check that the host kvm device has the correct permissions and
           ownership.
    PASS - Check the container and host users and groups

    Partial-Bug: 1999445

    Signed-off-by: Thales Elero Cervi <email address hidden>
    Change-Id: I47e5be5f34989f932902d2b7f97ef23bedac3260

OpenStack Infra (hudson-openstack) wrote : Change abandoned on integ (master)

Change abandoned by "Thales Elero Cervi <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/868206
Reason: There is no need to add any extra libvirt package to a Debian platform. Everything is already placed inside the container.

Thales Elero Cervi (tcervi) wrote :

Yesterday I compared a CentOS platform and a Debian platform with regard to the libvirt packages installed on the host.

<TL;DR> Conclusion: There is no need to add any extra libvirt package to a Debian platform.

The confusion came from the fact that on CentOS a meta package, "libvirt", was installed, which pulled in all the other libvirt-related packages.
On Debian, by contrast, all packages are separate, so we can install only what is really required for it to work as per our requirements.
Checking on CentOS, we do not have any ENABLED systemd service related to libvirt.
The daemons are started from within the libvirt container (in chroot), which has privileged mode enabled.

On CentOS, even after the application is applied and the libvirt container is started and running, there is no systemd service required on the host. Again, it runs inside the container in privileged mode. So the only change now is that there is a libvirtd.pid, created when the container started the daemon.
The conclusion, then, is that on Debian there is no need to add any further libvirt packages to the host, since everything required is already in place inside the container.
Sanity tests showed that, on Debian, the libvirt container started successfully and VMs were launched and pingable.

Also, with the already merged change to the application, the /dev/kvm device was checked and has the correct user/group and access.

The second code change will be abandoned and this Launchpad can be closed now.

Changed in starlingx:
status: In Progress → Fix Released