neutron-tempest-plugin jobs timing out on nested-virt nodes

Bug #1999249 reported by Slawek Kaplonski
Affects   Status         Importance   Assigned to   Milestone
neutron   Fix Released   Critical     Unassigned

Bug Description

It seems that since we moved our nested-virt jobs to Ubuntu 22.04 with patch https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/857031, jobs like neutron-tempest-plugin-{openvswitch,linuxbridge,ovn} have been timing out often.
When such a timeout happens, (probably) all tests that need to boot a VM and SSH into it fail, because the VMs do not seem to become ready within the allotted time, which is pretty long (e.g. 900 seconds in some cases).

Builds can be found at https://zuul.opendev.org/t/openstack/builds?job_name=neutron-tempest-plugin-openvswitch&job_name=neutron-tempest-plugin-linuxbridge&job_name=neutron-tempest-plugin-openvswitch-iptables_hybrid&job_name=neutron-tempest-plugin-ovn&result=TIMED_OUT&skip=0

yatin (yatinkarel) wrote :

I looked into it and was able to narrow it down further:-

- Happens randomly on jammy nested-virt nodes from provider vexxhost-ca-ymq-1 (69 out of 150 runs TIMED_OUT). No failures seen yet on other providers: ovh-gra1 (0/65), ovh-bhs1 (0/78) [3]
- Guest VMs do not boot properly when the issue happens; per the console logs they are just stuck. In one of the logs the VM booted up to the login prompt [1] but SSH timed out; maybe it was just slow
- On the affected nodes, guest VMs boot fine when running with qemu (virt_type=qemu) [2]
- Running libguestfs-test-tool on the affected node gets stuck at different steps [4], so the issue is outside of OpenStack/Nova
- Running libguestfs-test-tool with LIBGUESTFS_BACKEND_SETTINGS=force_tcg passes on the affected node.
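
The two libguestfs-test-tool runs above can be compared side by side. The following is a minimal sketch, assuming a POSIX shell and the coreutils `timeout` command. The function names, the `TEST_TOOL` and `LIMIT` variables, and the 300-second limit are hypothetical choices for illustration; they are not part of the original investigation, which ran the tool under sudo.

```shell
#!/bin/sh
# Sketch only: compare the libguestfs self-test with the default backend
# settings (KVM where available) and with TCG forced.
# TEST_TOOL and LIMIT are hypothetical knobs introduced for this example.
TEST_TOOL="${TEST_TOOL:-libguestfs-test-tool}"
LIMIT="${LIMIT:-300}"

# check_backend LABEL [VAR=VALUE ...]
# Runs the tool with the extra environment assignments and prints whether
# it finished successfully within LIMIT seconds.
check_backend() {
    label="$1"
    shift
    if env "$@" timeout "$LIMIT" $TEST_TOOL >/dev/null 2>&1; then
        echo "$label: ok"
    else
        echo "$label: failed or timed out"
    fi
}

# Run both variants, one after the other.
check_nested_virt() {
    check_backend "default"
    check_backend "force_tcg" LIBGUESTFS_BACKEND_SETTINGS=force_tcg
}
```

On a node hit by the problem described above, calling `check_nested_virt` would be expected to report a failure or timeout for the default run and "ok" for the force_tcg run.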

The issue may be limited to some compute nodes in the vexxhost-ca-ymq-1 provider, since it is not always seen; we need help from infra to figure out the root cause.
If more data is required from the L1 host, we can get a node put on hold by running [0].

Until the root cause is known on the provider side, we can temporarily do [2].

[0] https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/867609
[1] https://860bd4f2227d1daddb95-ca0266261d8d95f33f1974de7a62fd54.ssl.cf5.rackcdn.com/866489/5/gate/neutron-tempest-plugin-linuxbridge/8decd6e/tmp74ty0f_b
[2] https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/867320/2
[3]
for log in $(curl 'https://zuul.openstack.org/api/builds?job_name=neutron-tempest-plugin-openvswitch&job_name=neutron-tempest-plugin-linuxbridge&job_name=neutron-tempest-plugin-openvswitch-iptables_hybrid&job_name=neutron-tempest-plugin-ovn&result=SUCCESS&limit=225' 2>/dev/null | jq -r '.[].log_url'); do
    curl -L "${log}zuul-info/inventory.yaml" 2>/dev/null | zgrep provider:
done

for log in $(curl 'https://zuul.openstack.org/api/builds?job_name=neutron-tempest-plugin-openvswitch&job_name=neutron-tempest-plugin-linuxbridge&job_name=neutron-tempest-plugin-openvswitch-iptables_hybrid&job_name=neutron-tempest-plugin-ovn&result=TIMED_OUT&limit=70' 2>/dev/null | jq -r '.[].log_url'); do
    curl -L "${log}zuul-info/inventory.yaml" 2>/dev/null | zgrep provider:
done
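
The loops above print one `provider:` line per build, so per-provider figures like the ones quoted earlier come from counting those lines. A minimal sketch of that counting step, assuming the loop output is piped in on stdin (the function name `tally_providers` is hypothetical):

```shell
#!/bin/sh
# Sketch only: count "provider: NAME" lines read from stdin.
# Prints "COUNT NAME", highest count first, ties sorted by name.
tally_providers() {
    awk '$1 == "provider:" { count[$2]++ }
         END { for (p in count) print count[p], p }' |
        sort -k1,1nr -k2,2
}
```

Appending `| tally_providers` after `done` in each loop would give the SUCCESS and TIMED_OUT counts per provider.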

[4]
$ sudo LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 libguestfs-test-tool
     ************************************************************
     * IMPORTANT NOTICE
     *
     * When reporting bugs, include the COMPLETE, UNEDITED
     * output below in your bug report.
     *
     ************************************************************
libguestfs: trace: set_verbose true
libguestfs: trace: set_verbose = 0
libguestfs: trace: set_verbose true
libguestfs: trace: set_verbose = 0
LIBGUESTFS_DEBUG=1
LIBGUESTFS_TRACE=1
PATH=/sbin:/usr/sbin:/usr/bin:/bin:/usr/local/sbin:/usr/local/bin
SELinux: sh: 1: getenforce: not found
libguestfs: trace: add_drive_scratch 104857600
libguestfs: trace: get_tmpdir
libguestfs: trace: get_tmpdir = "/tmp"
libguestfs: trace: disk_create "/tmp/libguestfsJuHDwY/scratch1.img" "raw" 104857600
libguestfs: trace: disk_create = 0
libguestfs: trace: add_drive "/tmp/libguestfsJuHDwY/scratch1.img" "format:raw" "cachemode:unsafe"
libguestfs: trace: add_drive = 0
libguestfs: trace: add_drive...

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron-tempest-plugin (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/867922

yatin (yatinkarel) wrote :

SUCCESS jobs from vexxhost-ca-ymq-1 have:-
        host_id: 35ec78e50e5355d3d5e2c145bdb214aa31904a36c75f9ee9d3f19071
        host_id: 56d325dc9ce506ac024b732f7652c8a8db94f9164cb903855ffa3e95
        host_id: 5813883c83caccaa3a0c0599b6f207a149486a836bfa8ad77e4019e3
        host_id: 5c6a3996d7e7cd2385d0c544b614bab0287c9b61ab643c448d71edef
        host_id: 68e528f249a5306a71b361e81a43662de02fb75378667d5ec1627b3c
        host_id: 770b45a8f95a61fea7214ec11663c0476f7ae8997f84c0f3b5111fd3
        host_id: 91a5387c25afab79da993031133460fc8fcdfb274710fbd4ae39719d
        host_id: f3366849cf781cc54584a9ff1c0363d98c5e0d36dfcd203e83161288

TIMED_OUT jobs from vexxhost-ca-ymq-1 have:-
        host_id: 56d325dc9ce506ac024b732f7652c8a8db94f9164cb903855ffa3e95
        host_id: 6cc97bc57f540569368fcc47255180c5d21ed00a22cad83eeb600cec
        host_id: 70670f45d0dc4eaae28e6553525eec409dfb6f80e8d6c8dcef7d7bf5
        host_id: 86c687840cd74cd63cbb095748afa5c9cd0f6fcea898d90aa030cc68
        host_id: 8926aa5796637312bf5e46a0671a88021c208235fafdfcf22931eb01
        host_id: 94cd367e7821f5d74cf44c5ebafd9af18d2b6dff64a9bee067337cf6
        host_id: 9b04887c7eaa35b9ce51ba54d51ec13e6f618a6f7afc52cb421a0078
        host_id: c984fb897502bc826ccaf0e258b6071e76c29b305bc5b31b301de76a
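
Comparing the two lists shows whether any host_id appears in both. A minimal sketch, assuming each list has been saved to a file of `host_id: HASH` lines (the function name `common_host_ids` and its file arguments are hypothetical):

```shell
#!/bin/sh
# Sketch only: print host_id values present in both files.
# $1 and $2 are files with "host_id: HASH" lines (leading spaces allowed).
common_host_ids() {
    awk '$1 == "host_id:" {
             if (FNR == NR) seen[$2] = 1        # first file: remember id
             else if ($2 in seen) print $2      # second file: print matches
         }' "$1" "$2" | sort -u
}
```

For the two lists above, the only value present in both is 56d325dc9ce506ac024b732f7652c8a8db94f9164cb903855ffa3e95.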

Guilherme Steinmuller Pimentel (guilhermesp) wrote :

Hi there,

We have historically had issues with nested virt on hypervisors running bionic.

All failing hostIDs are being scheduled to these bionic hypervisors; meanwhile, the successful ones are going to focal hypervisors.

We have converted most of the hypervisors in vexxhost-ca-ymq-1; however, we need to keep upgrading the rest that still run bionic.

Be mindful that some workloads could still be scheduled to bionic hypervisors and therefore fail, but as we progress with the upgrades, we hope you will see the failures decrease.

Thanks

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron-tempest-plugin (master)

Reviewed: https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/867922
Committed: https://opendev.org/openstack/neutron-tempest-plugin/commit/3c30984a53005ed2d7a6a2d37f304bbd631be62d
Submitter: "Zuul (22348)"
Branch: master

commit 3c30984a53005ed2d7a6a2d37f304bbd631be62d
Author: yatin <email address hidden>
Date: Fri Dec 16 15:59:14 2022 +0000

    Revert "Update nested-virt testing for the 2023.1 cycle"

    This reverts commit f0d7d3ee057a8a95c48cf8c343474fe96233bb5d.

    Reason for revert: the vexxhost node provider is having issues with
    jammy nodes, as guest VMs are not booting on 40% of scenario jobs, leading to failures as mentioned in #1999249.
    Also, guest VMs started to take too much memory (1GiB+) on jammy[1], so it is not possible to run multiple guest VMs together like we do in our tests. Using swap makes VM boot too slow (200+ sec) on those systems, and without swap the guests just get OOM-killed.

    Reverting this switch until the vexxhost node provider supports
    jammy hosts or we are able to run our tests on non-nested-virt
    providers[2]. This may need some job splits or, ideally, a fix for
    [1] in nova or in some global libvirt/qemu config[3].

    [1] https://bugs.launchpad.net/nova/+bug/1949606
    [2] https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/867934
    [3] https://listman.redhat.com/archives/libvirt-users/2022-December/013844.html

    Change-Id: Iad827b4bd04534bf19e189cebb2839ebe4d3837e
    Related-Bug: #1999249

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :
Changed in neutron:
status: Confirmed → Fix Released
yatin (yatinkarel) wrote :

Just an update: as per the last update[1], all pending hypervisors in vexxhost-ca-ymq-1 have been upgraded, but a new issue[2] was noticed on the upgraded hypervisors, which resulted in vexxhost-ca-ymq-1 being disabled[3]. Based on this, we can switch back to jammy nodes.

[1] Apr 27 18:00:22 <guilhermesp_____> at this point 100% of hv in ca-ymq-1 should be fixed in regards to that
[2] https://bugs.launchpad.net/neutron/+bug/2017992
[3] https://review.opendev.org/c/openstack/project-config/+/881810

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron-tempest-plugin (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/882719

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron-tempest-plugin (master)

Reviewed: https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/882719
Committed: https://opendev.org/openstack/neutron-tempest-plugin/commit/667393b5297b53ca457b73a08952874f684c5142
Submitter: "Zuul (22348)"
Branch: master

commit 667393b5297b53ca457b73a08952874f684c5142
Author: yatinkarel <email address hidden>
Date: Tue May 9 18:31:55 2023 +0530

    Revert "Revert "Update nested-virt testing for the 2023.1 cycle""

    This reverts commit 3c30984a53005ed2d7a6a2d37f304bbd631be62d.

    All the pending hypervisors are upgraded in
    vexxhost-ca-ymq-1 and that fixes the nested-virt
    issue.
    There is currently a mirror issue[1] which is being
    investigated, but since the vexxhost provider is
    disabled[2] we can switch jobs to jammy.

    Re-enabling the vexxhost provider once the mirror
    issue is resolved shouldn't impact our jobs.

    [1] https://bugs.launchpad.net/neutron/+bug/2017992
    [2] https://review.opendev.org/c/openstack/project-config/+/881810

    Related-Bug: #1999249
    Change-Id: I60b7d94da0774558b35794fde522fa82c0259422

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron-tempest-plugin (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/867609
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.
