compute and network tempest tests failing on fs35 train

Bug #1991093 reported by Ananya Banerjee
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned
Milestone:

Bug Description

compute and network tempest tests are failing on fs35 train. The list of failing tests is not consistent, but the errors are similar:

    2022-09-25 10:31:07,950 339886 ERROR [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@10.0.0.170 after 4 attempts. Proxy client: no proxy client
    2022-09-25 10:31:07.950 339886 ERROR tempest.lib.common.ssh Traceback (most recent call last):
    2022-09-25 10:31:07.950 339886 ERROR tempest.lib.common.ssh File "/usr/lib/python3.6/site-packages/tempest/lib/common/ssh.py", line 113, in _get_ssh_connection
    2022-09-25 10:31:07.950 339886 ERROR tempest.lib.common.ssh sock=proxy_chan)
    2022-09-25 10:31:07.950 339886 ERROR tempest.lib.common.ssh File "/usr/lib/python3.6/site-packages/paramiko/client.py", line 349, in connect
    2022-09-25 10:31:07.950 339886 ERROR tempest.lib.common.ssh retry_on_signal(lambda: sock.connect(addr))
    2022-09-25 10:31:07.950 339886 ERROR tempest.lib.common.ssh File "/usr/lib/python3.6/site-packages/paramiko/util.py", line 283, in retry_on_signal
    2022-09-25 10:31:07.950 339886 ERROR tempest.lib.common.ssh return function()
    2022-09-25 10:31:07.950 339886 ERROR tempest.lib.common.ssh File "/usr/lib/python3.6/site-packages/paramiko/client.py", line 349, in <lambda>
    2022-09-25 10:31:07.950 339886 ERROR tempest.lib.common.ssh retry_on_signal(lambda: sock.connect(addr))
    2022-09-25 10:31:07.950 339886 ERROR tempest.lib.common.ssh socket.timeout: timed out
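
    A minimal sketch to retry the same connection by hand from the undercloud, outside tempest. The floating IP and user come from the log above; the key path, attempt count and timeouts are assumptions, not values taken from the job:

        # Sketch only: manually reproduce the SSH connection attempt tempest makes.
        # HOST and USER are from the log above; KEY_FILE, ATTEMPTS and the timeout
        # are assumed placeholders.
        import socket
        import time

        import paramiko

        HOST = "10.0.0.170"            # floating IP from the failed test
        USER = "cirros"
        KEY_FILE = "/tmp/tempest_key"  # hypothetical path to the test keypair
        ATTEMPTS = 4
        CONNECT_TIMEOUT = 60           # seconds per attempt

        for attempt in range(1, ATTEMPTS + 1):
            client = paramiko.SSHClient()
            client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
            try:
                client.connect(HOST, username=USER, key_filename=KEY_FILE,
                               timeout=CONNECT_TIMEOUT, look_for_keys=False,
                               allow_agent=False)
                print("attempt %d: connected" % attempt)
                client.close()
                break
            except (socket.timeout, paramiko.SSHException) as exc:
                print("attempt %d failed: %s" % (attempt, exc))
                time.sleep(attempt + 1)  # back off 2s, 3s, 4s, as in the log

    If this also times out from the undercloud, the guest or its network path is the problem rather than anything tempest-specific.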

https://logserver.rdoproject.org/openstack-periodic-integration-stable4/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-train/e4797bb/logs/undercloud/var/log/tempest/tempest_run.log.txt.gz
https://logserver.rdoproject.org/openstack-periodic-integration-stable4/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-train/abc147a/logs/undercloud/var/log/tempest/tempest_run.log.txt.gz
https://logserver.rdoproject.org/openstack-periodic-integration-stable4/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-train/ad82654/logs/undercloud/var/log/tempest/tempest_run.log.txt.gz

Changed in tripleo:
status: New → Triaged
importance: Undecided → Critical
milestone: none → zed-1
tags: added: ci promotion-blocker
Revision history for this message
Soniya Murlidhar Vyas (svyas) wrote :

I am not sure whether this finding is relevant or not, but looking at nova-compute.log, the following traceback is commonly observed.

----
2022-09-30 09:12:47.756 7 DEBUG nova.virt.ironic.driver [req-4987498c-9b47-4c3f-ab65-af8bfccf8b6b - - - - -] Returning 4 available node(s) get_available_nodes /usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py:899
2022-09-30 09:12:47.764 7 ERROR nova.compute.manager [req-4987498c-9b47-4c3f-ab65-af8bfccf8b6b - - - - -] No compute node record for host undercloud.localdomain: nova.exception_Remote.ComputeHostNotFound_Remote: Compute host undercloud.localdomain could not be found.
2022-09-30 09:12:47.764 7 DEBUG nova.compute.resource_tracker [req-4987498c-9b47-4c3f-ab65-af8bfccf8b6b - - - - -] Auditing locally available compute resources for undercloud.localdomain (node: aed2b0f6-8506-48db-9fea-3df1801301d4) update_available_resource /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:870
2022-09-30 09:12:47.765 7 DEBUG oslo_concurrency.lockutils [req-4987498c-9b47-4c3f-ab65-af8bfccf8b6b - - - - -] Lock "ironic-node-aed2b0f6-8506-48db-9fea-3df1801301d4" acquired by "nova.virt.ironic.driver.IronicDriver._node_from_cache.<locals>._sync_node_from_cache" :: waited 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:327
2022-09-30 09:12:47.765 7 DEBUG nova.virt.ironic.driver [req-4987498c-9b47-4c3f-ab65-af8bfccf8b6b - - - - -] Using cache for node aed2b0f6-8506-48db-9fea-3df1801301d4, age: 0.009539365768432617 _sync_node_from_cache /usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py:1018
2022-09-30 09:12:47.765 7 DEBUG oslo_concurrency.lockutils [req-4987498c-9b47-4c3f-ab65-af8bfccf8b6b - - - - -] Lock "ironic-node-aed2b0f6-8506-48db-9fea-3df1801301d4" released by "nova.virt.ironic.driver.IronicDriver._node_from_cache.<locals>._sync_node_from_cache" :: held 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:339
2022-09-30 09:12:47.766 7 DEBUG nova.compute.resource_tracker [req-4987498c-9b47-4c3f-ab65-af8bfccf8b6b - - - - -] Hypervisor/Node resource view: name=aed2b0f6-8506-48db-9fea-3df1801301d4 free_ram=0MB free_disk=0GB free_vcpus=unknown pci_devices=None _report_hypervisor_resource_view /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:1023
2022-09-30 09:12:47.766 7 DEBUG oslo_concurrency.lockutils [req-4987498c-9b47-4c3f-ab65-af8bfccf8b6b - - - - -] Lock "compute_resources" acquired by "nova.compute.resource_tracker.ResourceTracker._update_available_resource" :: waited 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:327
2022-09-30 09:12:47.774 7 WARNING nova.compute.resource_tracker [req-4987498c-9b47-4c3f-ab65-af8bfccf8b6b - - - - -]
No compute node record for undercloud.localdomain:aed2b0f6-8506-48db-9fea-3df1801301d4: nova.exception_Remote.ComputeHostNotFound_Remote: Compute host undercloud.localdomain could not be found

Following are the logs:
1. https://logserver.rdoproject.org/openstack-periodic-integration-stable4/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-train/e8133d7/logs/undercloud/var/log/containers/nova/nova-compute.log.txt...
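
To double-check whether the ComputeHostNotFound warning matters here, the compute service and hypervisor records on the undercloud can be listed directly. A sketch with openstacksdk; the cloud name "undercloud" is an assumption and depends on the local clouds.yaml:

    # Sketch: list nova's compute services and hypervisor records to see
    # which host/node entries actually exist. The cloud name is an assumption.
    import openstack

    conn = openstack.connect(cloud="undercloud")

    print("compute services:")
    for svc in conn.compute.services():
        print("  %s on %s: status=%s state=%s"
              % (svc.binary, svc.host, svc.status, svc.state))

    print("hypervisors:")
    for hyp in conn.compute.hypervisors():
        print("  %s (state=%s)" % (hyp.name, hyp.state))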


Revision history for this message
Soniya Murlidhar Vyas (svyas) wrote :

The above traceback is also observed in a green run, so it is probably not the root cause.

Revision history for this message
yatin (yatinkarel) wrote :

<<< Seems neutron openvswitch-agent not starting[1]

<<< [1] https://logserver.rdoproject.org/openstack-periodic-integration-stable4/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-train/e8133d7/logs/undercloud/var/log/extra/podman/containers/neutron_ovs_agent/stdout.log.txt.gz

This is from the undercloud node and is unrelated to the tempest failures.

I see randomly different tempest tests failing with the traceback below:
2022-10-03 10:13:04,796 321780 INFO [tempest.lib.common.ssh] Creating ssh connection to '10.0.0.160:22' as 'cirros' with public key authentication
2022-10-03 10:14:04,855 321780 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@10.0.0.160 (timed out). Number attempts: 1. Retry after 2 seconds.
2022-10-03 10:15:07,413 321780 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@10.0.0.160 (timed out). Number attempts: 2. Retry after 3 seconds.
2022-10-03 10:16:10,972 321780 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@10.0.0.160 (timed out). Number attempts: 3. Retry after 4 seconds.
2022-10-03 10:17:15,531 321780 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@10.0.0.160 (timed out). Number attempts: 4. Retry after 5 seconds.
2022-10-03 10:18:21,092 321780 ERROR [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@10.0.0.160 after 4 attempts. Proxy client: no proxy client
2022-10-03 10:18:21.092 321780 ERROR tempest.lib.common.ssh Traceback (most recent call last):
2022-10-03 10:18:21.092 321780 ERROR tempest.lib.common.ssh File "/usr/lib/python3.6/site-packages/tempest/lib/common/ssh.py", line 113, in _get_ssh_connection
2022-10-03 10:18:21.092 321780 ERROR tempest.lib.common.ssh sock=proxy_chan)
2022-10-03 10:18:21.092 321780 ERROR tempest.lib.common.ssh File "/usr/lib/python3.6/site-packages/paramiko/client.py", line 349, in connect
2022-10-03 10:18:21.092 321780 ERROR tempest.lib.common.ssh retry_on_signal(lambda: sock.connect(addr))
2022-10-03 10:18:21.092 321780 ERROR tempest.lib.common.ssh File "/usr/lib/python3.6/site-packages/paramiko/util.py", line 283, in retry_on_signal
2022-10-03 10:18:21.092 321780 ERROR tempest.lib.common.ssh return function()
2022-10-03 10:18:21.092 321780 ERROR tempest.lib.common.ssh File "/usr/lib/python3.6/site-packages/paramiko/client.py", line 349, in <lambda>
2022-10-03 10:18:21.092 321780 ERROR tempest.lib.common.ssh retry_on_signal(lambda: sock.connect(addr))
2022-10-03 10:18:21.092 321780 ERROR tempest.lib.common.ssh socket.timeout: timed out
2022-10-03 10:18:21.092 321780 ERROR tempest.lib.common.ssh

This has been happening for a while; I see similar failures since 9th Sept. Before that there were some different kinds of failures, and around the same time I see those jobs were moved to run on the IBM cloud: https://review.rdoproject.org/r/c/rdo-jobs/+/44496.

I couldn't find why it's failing, but considering a similar job is not seeing this issue on the internal cloud, the issue seems to be related to performance on the IBM cloud, goo...
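
If the suspicion is boot/network slowness on the IBM cloud rather than a functional regression, timing how long port 22 on the guest floating IP takes to start answering would make the comparison with the internal cloud concrete. A rough sketch; the IP, deadline and poll interval are placeholders:

    # Sketch: measure how long it takes for TCP port 22 on the guest floating
    # IP to become reachable. HOST, DEADLINE and POLL are placeholders.
    import socket
    import time

    HOST = "10.0.0.160"   # floating IP of the test guest
    DEADLINE = 600        # give up after 10 minutes
    POLL = 5              # seconds between probes

    start = time.time()
    while time.time() - start < DEADLINE:
        try:
            with socket.create_connection((HOST, 22), timeout=5):
                print("port 22 reachable after %.1f s" % (time.time() - start))
                break
        except OSError:
            time.sleep(POLL)
    else:
        print("port 22 still unreachable after %d s" % DEADLINE)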


Changed in tripleo:
status: Triaged → Fix Released