nova-next job fail on Ubuntu Bionic

Bug #1819794 reported by Ghanshyam Mann
This bug affects 1 person
Affects                   Status        Importance  Assigned to   Milestone
OpenStack Compute (nova)  Fix Released  Medium      melanie witt
devstack                  Fix Released  Undecided   melanie witt

Bug Description

OpenStack CI is moving to Ubuntu Bionic. All zuulv3 native jobs based on the devstack job are already running on Bionic, but legacy jobs are still running on Xenial.

While migrating the legacy jobs to Bionic, the nova-next job started failing - https://review.openstack.org/#/c/639017

The problem seems to be on the TLS console proxy side. This is the only legacy job which enables the tls-proxy service.

libvirt.libvirtError: internal error: process exited while connecting to monitor: 2019-03-08T02:23:53.402326Z qemu-system-x86_64: warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
Mar 08 02:23:53.452557 ubuntu-bionic-rax-iad-0003574603 nova-compute[31993]: ERROR nova.virt.libvirt.driver [None req-a3dbdf7a-29cc-4e36-a760-2d92da2d13f0 tempest-DeleteServersAdminTestJSON-2130806780 tempest-DeleteServersAdminTestJSON-2130806780] [instance: 3b8927dd-fd26-4050-9428-d0db99856d60] Failed to start libvirt guest: libvirt.libvirtError: internal error: process exited while connecting to monitor: 2019-03-08T02:23:53.402326Z qemu-system-x86_64: warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]

Complete logs

http://logs.openstack.org/17/639017/4/check/nova-next/566ea7a/logs/screen-n-cpu.txt.gz#_Mar_08_02_23_53_452557

http://logs.openstack.org/17/639017/4/check/nova-next/566ea7a/logs/screen-n-cond-cell1.txt.gz#_Mar_08_02_23_54_553579

Revision history for this message
Ghanshyam Mann (ghanshyammann) wrote :

Trying with the tls-proxy disabled:

 https://review.openstack.org/#/c/639017/

Revision history for this message
melanie witt (melwitt) wrote :

I see that the attempt at disabling tls-proxy failed with:

2019-03-13 01:47:24.187 | + lib/tls:deploy_int_CA:356 : local ca_target_file=/etc/pki/nova-novnc/ca-cert.pem
2019-03-13 01:47:24.189 | + lib/tls:deploy_int_CA:358 : sudo cp /opt/stack/data/CA/int-ca/ca-chain.pem /etc/pki/nova-novnc/ca-cert.pem
2019-03-13 01:47:24.195 | cp: cannot stat '/opt/stack/data/CA/int-ca/ca-chain.pem': No such file or directory
2019-03-13 01:47:24.197 | + lib/tls:deploy_int_CA:1 : exit_trap

which makes sense, because that's the reason we needed to include tls-proxy in the first place. So, I don't think the presence of tls-proxy is the problem.

From the original failure with tls-proxy included, I see that the relevant error in screen-n-cpu.txt is actually this:

Feb 25 15:08:10.760376 ubuntu-bionic-rax-iad-0002993189 nova-compute[31732]: 2019-02-25T15:08:10.650607Z qemu-system-x86_64: -vnc 127.0.0.1:0,tls,x509verify=/etc/pki/libvirt-vnc: Failed to start VNC server: Cannot load certificate '/etc/pki/libvirt-vnc/server-cert.pem' & key '/etc/pki/libvirt-vnc/server-key.pem': Error while reading file.

This is the root cause of the problem.

I'm wondering if it's because there's been a change in the user:group we need for the certs directory. Currently, we're using libvirt-qemu:libvirt-qemu:

https://github.com/openstack-dev/devstack/blob/94ca9f6756e7b677b1ee3fd2e32b555447e950dd/lib/nova_plugins/functions-libvirt#L158

Revision history for this message
Ghanshyam Mann (ghanshyammann) wrote :

nova-next passed with tls-proxy disabled: https://review.openstack.org/#/c/639017/7.

I missed disabling the NOVA_CONSOLE_PROXY_COMPUTE_TLS var in PS6 and it failed with the cert error again.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/643129

Revision history for this message
Matt Riedemann (mriedem) wrote :

As a workaround the tls-proxy part of nova-next is being disabled.

Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/643129
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d74a5b23a59e552159b57a0a42a23ff08215941b
Submitter: Zuul
Branch: master

commit d74a5b23a59e552159b57a0a42a23ff08215941b
Author: ghanshyam <email address hidden>
Date: Wed Mar 13 18:12:41 2019 +0000

    Disable the tls-proxy in nova-next & fix nova-tox-functional-py35 parent

    While moving the legacy job nova-next on bionic, tls-proxy
    did not work and leads to nova-next job fail.

    To proceed further on Bionic migration which is blocked by nova-next failure,
    this commit temporary disable the tls-proxy service until bug#1819794 is fixed.

    Also this updates the parent of nova-tox-functional-py35 from openstack-tox
    to openstack-tox-functional-py35 in order to handle the upcoming change
    of the infra CI default node type from ubuntu-xenial to ubuntu-bionic.

    The python3.5 binary is not provided on ubuntu-bionic and the shared
    "py35" job definitions in the openstack-zuul-jobs repository have been
    patched to force them to run on ubuntu-xenial [1]. We should inherit
    from one of these jobs for jobs that rely on python3.5.

    [1] http://lists.openstack.org/pipermail/openstack-discuss/2019-March/003746.html

    Related-Bug: #1819794

    Change-Id: Ie46311fa9195b8f359bfc3f61514fc7f70d78084

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

There are two things here:

(1) Mel already debugged this above: "Cannot load certificate ... & key ..." for tls-proxy. The "Error while reading file" (how "helpful") seems like something that is coming from GnuTLS or whichever library is being used by Ubuntu.

(2) Warnings about the "VMX" flag -- these can be ignored, but let's find the root cause as it seems very bizarre, because:

  - There is no '-cpu' parameter in the offending guest[*] QEMU
    command-line requesting 'vmx'
  - Current upstream QEMU does _not_ support VMX for TCG (plain emulation)

So the second problem (which is warnings only) seems to be somehow specific to Ubuntu -- did they change something in their QEMU package?

[*] instance-00000002
------------------------------------------------------------
2019-03-08 02:23:56.064+0000: starting up libvirt version: 4.0.0, package: 1ubuntu8.6 (Christian Ehrhardt <email address hidden> Fri, 09 Nov 2018 07:42:01 +0100), qemu version: 2.11.1(Debian 1:2.11+dfsg-1ubuntu7.10), hostname: ubuntu-bionic-rax-iad-0003574603
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin QEMU_AUDIO_DRV=none /usr/bin/qemu-system-x86_64 -name guest=instance-00000002,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-2-instance-00000002/master-key.aes -machine pc-i440fx-bionic,accel=tcg,usb=off,dump-guest-core=off -m 64 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid ae55dba3-ea1c-43f1-a537-1ef9d6c55d40 -smbios 'type=1,manufacturer=OpenStack Foundation,product=OpenStack Nova,version=18.1.0,serial=ae55dba3-ea1c-43f1-a537-1ef9d6c55d40,uuid=ae55dba3-ea1c-43f1-a537-1ef9d6c55d40,family=Virtual Machine' -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-2-instance-00000002/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/opt/stack/data/nova/instances/ae55dba3-ea1c-43f1-a537-1ef9d6c55d40/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -add-fd set=0,fd=29 -chardev pty,id=charserial0,logfile=/dev/fdset/0,logappend=on -device isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:0,tls,x509verify=/etc/pki/libvirt-vnc -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on
2019-03-08T02:23:56.124629Z qemu-system-x86_64: -chardev pty,id=charserial0,logfile=/dev/fdset/0,logappend=on: char device redirected to /dev/pts/0 (label charserial0)
2019-03-08T02:23:56.135964Z qemu-system-x86_64: warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
2019-03-08T02:23:56.169501Z qemu-system-x86_64: -vnc 127.0.0.1:0,tls,x509verify=/etc/pki/libvirt-vnc: Failed to start VNC server: Cannot load certificate '/etc/pki/libvirt-vnc/server-cert.pem' & key '/etc/pki/libvirt-vnc/server-key.pem': Error while reading file.
2019-03-08 02:23:56.174+0000: shutting down, reason=failed
--------------------------------------------...
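The two observations about the command line above (no '-cpu' argument, and plain TCG emulation) can be checked mechanically. This is only a rough sketch: the helper name qemu_cpu_and_accel is made up for illustration and is not part of nova, libvirt, or QEMU. It assumes the command line is a single shell-style string as it appears in the libvirt domain log.

```python
import shlex


def qemu_cpu_and_accel(cmdline):
    """Return (cpu_model, accelerator) found on a QEMU command line.

    Hypothetical helper for illustration only. Either value is None
    when the corresponding option is absent.
    """
    args = shlex.split(cmdline)
    cpu = None
    accel = None
    # Options of interest take their value as the next argument.
    for i, arg in enumerate(args[:-1]):
        value = args[i + 1]
        if arg == "-cpu":
            cpu = value
        elif arg == "-machine":
            # e.g. "pc-i440fx-bionic,accel=tcg,usb=off"
            for opt in value.split(","):
                if opt.startswith("accel="):
                    accel = opt[len("accel="):]
    return cpu, accel
```

Fed an excerpt of the instance-00000002 command line, this should report no CPU model and "tcg" as the accelerator, which matches the point that nothing on the command line asks for 'vmx'.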


Revision history for this message
melanie witt (melwitt) wrote :

I was able to reproduce this locally with devstack and ran the qemu command for VNC through strace:

  $ LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin QEMU_AUDIO_DRV=none strace /usr/bin/qemu-system-x86_64 -vnc 127.0.0.1:0,tls,x509verify=/etc/pki/libvirt-vnc

and got the result:

  openat(AT_FDCWD, "/etc/pki/libvirt-vnc/server-key.pem", O_RDONLY) = -1 EACCES (Permission denied)
write(2, "qemu-system-x86_64: -vnc 127.0.0"..., 236qemu-system-x86_64: -vnc 127.0.0.1:0,tls,x509verify=/etc/pki/libvirt-vnc: Failed to start VNC server: Cannot load certificate '/etc/pki/libvirt-vnc/server-cert.pem' & key '/etc/pki/libvirt-vnc/server-key.pem': Error while reading file.
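When the strace output is long, the denied file access can be picked out with a small filter. The following is a sketch, not an existing tool: the name denied_paths and the regular expression are assumptions made for illustration, and the pattern only covers the simple line format shown above (a syscall with one quoted path that returns -1 EACCES).

```python
import re

# Matches a failed syscall line such as:
#   openat(AT_FDCWD, "/some/path", O_RDONLY) = -1 EACCES (Permission denied)
DENIED = re.compile(
    r'^\s*(?P<syscall>\w+)\(.*?"(?P<path>[^"]+)".*\)\s*=\s*-1\s+EACCES\b'
)


def denied_paths(strace_output):
    """Return (syscall, path) pairs for calls that failed with EACCES."""
    hits = []
    for line in strace_output.splitlines():
        match = DENIED.match(line)
        if match:
            hits.append((match.group("syscall"), match.group("path")))
    return hits
```

Applied to the output above, this should single out the openat() call on /etc/pki/libvirt-vnc/server-key.pem.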

Looking at the file permissions on xenial (OpenSSL 1.0.2):

  $ ll /etc/pki/libvirt-vnc/
  total 20
  drwxr-xr-x 2 libvirt-qemu libvirt-qemu 4096 Mar 22 17:30 ./
  drwxr-xr-x 4 root root 4096 Mar 22 17:30 ../
  -rw-r--r-- 1 root root 2554 Mar 22 17:30 ca-cert.pem
  -rw-r--r-- 1 root root 1367 Mar 22 17:30 server-cert.pem
  -rw-r--r-- 1 root root 1704 Mar 22 17:30 server-key.pem

The server-key.pem file is readable to all.

Looking at the files on bionic (OpenSSL 1.1.0):

 $ ll /etc/pki/libvirt-vnc/
 total 20
 drwxr-xr-x 2 libvirt-qemu libvirt-qemu 4096 Mar 22 23:48 ./
 drwxr-xr-x 4 root root 4096 Mar 22 23:48 ../
 -rw-r--r-- 1 root root 2554 Mar 22 23:48 ca-cert.pem
 -rw-r--r-- 1 root root 1367 Mar 22 23:48 server-cert.pem
 -rw------- 1 root root 1704 Mar 22 23:48 server-key.pem

The server-key.pem file is readable only for root. This is the root cause, no pun intended.
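The difference between the two listings comes down to the classic Unix permission check. Here is a simplified sketch of that check, assuming plain mode bits only (no ACLs, no capabilities beyond root, and no check of the parent directories). The function name can_read and the uid/gid 64055 used in the example are arbitrary values chosen for illustration, not taken from the logs.

```python
import stat


def can_read(mode, owner_uid, owner_gid, uid, gids):
    """Simplified Unix read-permission check based on mode bits only.

    mode      -- permission bits of the file, e.g. 0o600
    owner_uid -- uid that owns the file
    owner_gid -- gid that owns the file
    uid, gids -- identity of the process trying to read the file
    """
    if uid == 0:
        # root bypasses the read permission bits
        return True
    if uid == owner_uid:
        return bool(mode & stat.S_IRUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


# A non-root user (64055 is an arbitrary example uid/gid) reading a
# root:root key file in the two modes seen in the listings.
readable_with_0644 = can_read(0o644, 0, 0, 64055, {64055})
readable_with_0600 = can_read(0o600, 0, 0, 64055, {64055})
```

With mode 0644 the "other" read bit lets any user read the key, while with mode 0600 only the owner (root) can, which is consistent with the EACCES seen in strace.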

So, I think we need to set the ownership of the files under the /etc/pki/<console> directories to the user:group intended to read them.
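Before or after changing ownership, a quick audit can confirm which files in a certificate directory are not owned by the intended user:group. This is a sketch under stated assumptions: wrong_owner is a hypothetical helper, not devstack code, and it takes numeric ids rather than names.

```python
import os


def wrong_owner(directory, uid, gid):
    """Return sorted names of regular files in directory not owned by uid:gid."""
    mismatched = []
    for entry in os.scandir(directory):
        # Only look at regular files; skip subdirectories and symlinks.
        if not entry.is_file(follow_symlinks=False):
            continue
        st = entry.stat(follow_symlinks=False)
        if st.st_uid != uid or st.st_gid != gid:
            mismatched.append(entry.name)
    return sorted(mismatched)
```

For example, calling wrong_owner("/etc/pki/libvirt-vnc", uid, gid) with the numeric ids of the libvirt-qemu user and group would be expected to list all three .pem files in the bionic listing above, since they are owned by root:root.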

Revision history for this message
melanie witt (melwitt) wrote :

Devstack patch proposed here: https://review.openstack.org/643045

Nova patch to re-enable console with TLS proposed here: https://review.openstack.org/645432

Changed in nova:
assignee: nobody → melanie witt (melwitt)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/645432
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=872a9b4e7cfd155ac990c939a3f99b0f3dce26c0
Submitter: Zuul
Branch: master

commit 872a9b4e7cfd155ac990c939a3f99b0f3dce26c0
Author: melanie witt <email address hidden>
Date: Fri Mar 22 01:27:45 2019 +0000

    Re-enable testing of console with TLS in nova-next job

    This is a partial revert of Ie46311fa9195b8f359bfc3f61514fc7f70d78084.

    Depends-On: https://review.openstack.org/643045

    Related-Bug: #1819794

    Change-Id: I1bf37edb4dc3bdb6f23d077eae32e81ef48bdcdc

Revision history for this message
melanie witt (melwitt) wrote :

The devstack change merged and console with TLS testing has been re-enabled in the nova-next job.

Changed in devstack:
assignee: nobody → melanie witt (melwitt)
status: New → Fix Released
Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to devstack (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.openstack.org/648442

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on devstack (stable/stein)

Change abandoned by "Dr. Jens Harbott <email address hidden>" on branch: stable/stein
Review: https://review.opendev.org/c/openstack/devstack/+/648442
Reason: Doesn't look like this is needed anymore, feel free to reopen if I'm wrong
