Scenario test test_mac_learning_vms_on_same_network fails intermittently in the ovn job

Bug #1952066 reported by Slawek Kaplonski
Affects: neutron
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

Failure examples:

https://b30211aa4f809fc4a91b-baf4f807d40559415da582760ebf9456.ssl.cf2.rackcdn.com/817525/7/check/neutron-tempest-plugin-scenario-ovn/c356679/testr_results.html
https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_b75/815962/5/check/neutron-tempest-plugin-scenario-ovn/b75474f/testr_results.html

Stacktrace:

Traceback (most recent call last):
  File "/opt/stack/tempest/tempest/lib/common/ssh.py", line 107, in _get_ssh_connection
    ssh.connect(self.host, port=self.port, username=self.username,
  File "/opt/stack/tempest/.tox/tempest/lib/python3.8/site-packages/paramiko/client.py", line 368, in connect
    raise NoValidConnectionsError(errors)
paramiko.ssh_exception.NoValidConnectionsError: [Errno None] Unable to connect to port 22 on 172.24.5.220

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/stack/tempest/.tox/tempest/lib/python3.8/site-packages/neutron_tempest_plugin/scenario/test_mac_learning.py", line 166, in test_mac_learning_vms_on_same_network
    self._prepare_listener(non_receiver, 2)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.8/site-packages/neutron_tempest_plugin/scenario/test_mac_learning.py", line 138, in _prepare_listener
    self._check_cmd_installed_on_server(server['ssh_client'], server['id'],
  File "/opt/stack/tempest/.tox/tempest/lib/python3.8/site-packages/neutron_tempest_plugin/scenario/test_mac_learning.py", line 121, in _check_cmd_installed_on_server
    ssh_client.execute_script('which %s' % cmd)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.8/site-packages/neutron_tempest_plugin/common/ssh.py", line 224, in execute_script
    channel = self.open_session()
  File "/opt/stack/tempest/.tox/tempest/lib/python3.8/site-packages/neutron_tempest_plugin/common/ssh.py", line 149, in open_session
    client = self.connect()
  File "/opt/stack/tempest/.tox/tempest/lib/python3.8/site-packages/neutron_tempest_plugin/common/ssh.py", line 137, in connect
    return super(Client, self)._get_ssh_connection(*args, **kwargs)
  File "/opt/stack/tempest/tempest/lib/common/ssh.py", line 126, in _get_ssh_connection
    raise exceptions.SSHTimeout(host=self.host,
tempest.lib.exceptions.SSHTimeout: Connection to the 172.24.5.220 via SSH timed out.
User: ubuntu, Password: None

We need to check why this specific test is failing in the ovn job more often than other tests.
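
For context, the SSHTimeout above is only raised after tempest has retried the connection for its configured timeout. A minimal standalone sketch of that kind of retry loop (not the actual tempest code; host, user and key path are placeholders) looks like this:

import time

import paramiko


def wait_for_ssh(host, username, key_filename, timeout=300, interval=5):
    """Keep retrying an SSH connection until `timeout` seconds have passed."""
    deadline = time.time() + timeout
    while True:
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        try:
            client.connect(host, username=username,
                           key_filename=key_filename, timeout=10)
            return client  # connection succeeded, caller can open sessions
        except (paramiko.ssh_exception.NoValidConnectionsError,
                paramiko.ssh_exception.SSHException, OSError):
            client.close()
            if time.time() > deadline:
                raise  # give up, similar to tempest raising SSHTimeout
            time.sleep(interval)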

Revision history for this message
yatin (yatinkarel) wrote :

In both job failures linked here and another one[1] I see "No correct output in console of server <id> found. Guest operating system status can't be checked.". The console output is also not collected in these tests, so it's hard to say what's going wrong.

This test creates 3 servers, and in the three linked failures a different server (sender, receiver and non_receiver) failed preparation each time, so the issue looks more generic.

I also saw a failure in a different test (test_multiple_ports_portrange_remote) in the same job[2], which likewise creates 3 servers; since the console log was collected there, it showed that the guest hit a kernel panic. test_mac_learning_vms_on_same_network may be facing a similar issue, and collecting the console log in it would give a better idea.

[1] https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_296/818443/7/check/neutron-tempest-plugin-scenario-ovn/2969549/testr_results.html
[2] https://9470f97960e4f689dfc7-d2f445d35e133f70dec3633372bebbba.ssl.cf2.rackcdn.com/818911/2/check/neutron-tempest-plugin-scenario-ovn/c2ecf34/testr_results.html

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron-tempest-plugin (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/819410

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron-tempest-plugin (master)

Reviewed: https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/819410
Committed: https://opendev.org/openstack/neutron-tempest-plugin/commit/c4597e696de6ff48bbd76112059894e299e14718
Submitter: "Zuul (22348)"
Branch: master

commit c4597e696de6ff48bbd76112059894e299e14718
Author: yatinkarel <email address hidden>
Date: Fri Nov 26 14:09:18 2021 +0530

    Log console output for mac_learning and multicast tests

    Would be useful to debug ssh failures in test vms.

    Also for trunk_tests move check_connectivity method call to
    _configure_vlan_subport as there SSH is attempted and that can
    fail.

    Related-Bug: #1952066
    Change-Id: I64a1fd8118c9db1f337b7bf97bb9a77f974149b9
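
A rough sketch of the kind of helper such a change adds (not the actual patch; the client and logger wiring below are assumptions) is simply to fetch the nova console output and log it when the SSH checks fail:

from oslo_log import log

LOG = log.getLogger(__name__)


def log_console_output(servers_client, server_id):
    """Dump the guest console so boot failures show up in the test results."""
    # tempest's compute ServersClient exposes get_console_output(); the
    # response body carries the raw console text under 'output'.
    output = servers_client.get_console_output(server_id)['output']
    LOG.debug('Console output for server %s:\n%s', server_id, output)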

Revision history for this message
yatin (yatinkarel) wrote :

Console logs are now collected[1]. SSH failed because the VM failed to boot:

(1min 33s / 1min 32s) [ TIME ] Timed out waiting for device dev-disk-by\x2dlabel-UEFI.device.
[DEPEND] Dependency failed for /boot/efi.
[DEPEND] Dependency failed for Local File Systems.
[DEPEND] Dependency failed for File System Check on /dev/disk/by-label/UEFI.
[ OK ] Started Set console font and keymap.
You are in emergency mode. After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" or "exit"
to boot into default mode.

This can happen on slow systems. I noticed multiple ubuntu test VMs running at the time of the failure and the system was under high load; switching to nested-virt nodes[2] should also help avoid this issue.

[1] https://aa215ec0c297538f736c-5e9e014bbb914c03e8d5277ea7a5cf3c.ssl.cf1.rackcdn.com/819410/3/gate/neutron-tempest-plugin-scenario-ovn/58c5d56/testr_results.html
[2] https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/821067
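
For reference, whether a test node actually provides nested KVM can be checked from the standard kvm_intel/kvm_amd module parameters; a small sketch (assuming the usual sysfs paths) is:

import os


def nested_kvm_enabled():
    """Return True if the kvm_intel or kvm_amd 'nested' parameter is on."""
    for param in ('/sys/module/kvm_intel/parameters/nested',
                  '/sys/module/kvm_amd/parameters/nested'):
        if os.path.exists(param):
            with open(param) as f:
                # newer kernels report 'Y'/'N', older ones '1'/'0'
                return f.read().strip() in ('Y', '1')
    return False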

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/821067
Committed: https://opendev.org/openstack/neutron-tempest-plugin/commit/165e40922470ae26f9bd99fb28e51abf832ee798
Submitter: "Zuul (22348)"
Branch: master

commit 165e40922470ae26f9bd99fb28e51abf832ee798
Author: yatinkarel <email address hidden>
Date: Wed Dec 8 19:19:34 2021 +0530

    Switch scenario jobs to nested-virt nodes

    To avoid intermittent failures and job timeouts
    in scenario jobs, move these jobs to nested virt nodes.

    Only moving scenario jobs as those are mostly affected
    by intermittent issues(SSH Tempest failures) and job
    timeouts. Also nested virt nodes are provided only by a
    few nodepool providers.

    Also switching non-scenario jobs to use cirros uec
    image to avoid kernel panic issue.

    Initial tests were done with [2] and issues/improvements
    are being tracked at [3].

    [1] https://bugs.launchpad.net/nova/+bug/1939108
    [2] https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/819590
    [3] https://etherpad.opendev.org/p/neutron-ci-improvements

    Related-Bug: #1952066
    Related-Bug: #1953479
    Change-Id: I2f78f9de1ad1dc8c34688951c6bb2d5648d5dc3f

Changed in neutron:
status: Confirmed → Fix Released