Queens, Rocky, Stein, tempest.scenario.test_network_basic_ops.TestNetworkBasicOps failing. Timeout

Bug #1876087 reported by Amol Kahat
This bug affects 2 people
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Amol Kahat
Milestone: victoria-1

Bug Description

Description
===========

- Network tests are failing on the Queens jobs.

Tests:
- tempest.scenario.test_network_basic_ops.TestNetworkBasicOps

Traceback:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tempest/common/utils/__init__.py", line 88, in wrapper
    return f(*func_args, **func_kwargs)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_basic_ops.py", line 404, in test_network_basic_ops
    self.check_public_network_connectivity(should_connect=True)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_basic_ops.py", line 208, in check_public_network_connectivity
    self.servers, mtu=mtu)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/manager.py", line 585, in check_public_network_connectivity
    mtu=mtu)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/manager.py", line 568, in check_vm_connectivity
    msg=msg)
  File "/usr/lib/python2.7/site-packages/unittest2/case.py", line 678, in assertTrue
    raise self.failureException(msg)
AssertionError: False is not true : Timed out waiting for 192.168.24.105 to become reachable

Pipeline
========
- openstack-periodic-wednesday-weekend

Affected Jobs
=============
- periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset010-queens
- periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset018-queens
- periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset019-queens
- periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset030-queens

Logs
====
- https://logserver.rdoproject.org/openstack-periodic-wednesday-weekend/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset010-queens/998a119/logs/undercloud/home/zuul/tempest/tempest.html.gz
- https://logserver.rdoproject.org/openstack-periodic-wednesday-weekend/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset018-queens/e9236d1/logs/undercloud/home/zuul/tempest/tempest.html.gz
- https://logserver.rdoproject.org/openstack-periodic-wednesday-weekend/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset019-queens/f6e3bc3/logs/undercloud/home/zuul/tempest/tempest.html.gz
- https://logserver.rdoproject.org/openstack-periodic-wednesday-weekend/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset030-queens/069baf9/logs/undercloud/home/zuul/tempest/tempest.html.gz

wes hayutin (weshayutin)
summary: - Network tests are failing on queens jobs
+ Queens, tempest.scenario.test_network_basic_ops.TestNetworkBasicOps
+ failing. Timeout
Changed in tripleo:
milestone: none → victoria-1
Revision history for this message
wes hayutin (weshayutin) wrote : Re: Queens, tempest.scenario.test_network_basic_ops.TestNetworkBasicOps failing. Timeout

This is really hard to identify, but I believe it's network related, more specifically Open vSwitch.
I checked the neutron logs. The requests were executed properly with no failures, but in the Open vSwitch agent log I see the following errors:

2020-04-30 10:06:37.281 80095 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-0274982a-50f5-4940-88e9-9eca4ee53099 - - - - -] Configuring tunnel endpoints to other OVS agents tunnel_sync /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1930
2020-04-30 10:06:37.283 80095 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Bridge name --format=json]: net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
2020-04-30 10:06:37.283 80095 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Bridge name --format=json]: net_mlx5: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx5)
2020-04-30 10:06:37.283 80095 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Bridge name --format=json]: PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
2020-04-30 10:06:37.284 80095 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Bridge name --format=json]: PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)
2020-04-30 10:06:37.285 80095 DEBUG neutron.agent.linux.async_process [-] Output received from [ovsdb-client monitor tcp:127.0.0.1:6640 Bridge name --format=json]: {"data":[["40be3a55-a226-4324-904c-7b085dcd6670","initial","br-ex"],["6cc79323-4dd0-4ad0-8b57-0ea869d51a6c","initial","br-int"],["c24511d9-33c5-445f-ad84-206519bf03f1","initial","br-tun"]],"headings":["row","action","name"]} _read_stdout /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:260

Whether it is relevant to this bug or not, only the OVS folks can tell for sure.

This error happens before the test is executed; you can see that the first request from the test to set up the network is req-2e78b6d1-f293-47cf-9861-9316b92d5a04, but perhaps it is affecting how the iptables rules and routes are handled.

I also found some interesting logs here:

'neutron.api.rpc.agentnotifiers.dhcp_rpc_agent_api.DhcpAgentNotifyAPI._native_event_send_dhcp_notification--9223372036854012965', 'neutron.services.segments.plugin.NovaSegmentNotifier._notify_port_created_or_deleted--9223372036854248080', 'neutron.plugins.ml2.plugin.Ml2Plugin.notify_sg_on_port_change-66733164'] for port, after_create _notify_loop /usr/lib/python2.7/site-packages/neutron_lib/callbacks/manager.py:167
2020-04-30 10:10:31.445 30 DEBUG neutron_lib.callbacks.manager [req-a4c2abfc-c635-4eb6-a547-eb88bba5c952 - - - - -] Notify callbacks ['neutron.services.qos.qos_plugin.QoSPlugin._validate_create_port_callback--9223372036854676145'] for port, precommit_create _notify_loop /usr/lib/pyth...


Revision history for this message
wes hayutin (weshayutin) wrote :

https://logserver.rdoproject.org/openstack-periodic-wednesday-weekend/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset010-queens/998a119/logs/subnode-1/var/log/extra/errors.txt.txt.gz

Is this related?

2020-04-30 10:06:37.256 ERROR /var/log/containers/neutron/openvswitch-agent.log: 80095 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
2020-04-30 10:06:37.257 ERROR /var/log/containers/neutron/openvswitch-agent.log: 80095 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: net_mlx5: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx5)
2020-04-30 10:06:37.257 ERROR /var/log/containers/neutron/openvswitch-agent.log: 80095 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
2020-04-30 10:06:37.257 ERROR /var/log/containers/neutron/openvswitch-agent.log: 80095 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)
2020-04-30 10:06:37.283 ERROR /var/log/containers/neutron/openvswitch-agent.log: 80095 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Bridge name --format=json]: net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
2020-04-30 10:06:37.283 ERROR /var/log/containers/neutron/openvswitch-agent.log: 80095 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Bridge name --format=json]: net_mlx5: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx5)
2020-04-30 10:06:37.283 ERROR /var/log/containers/neutron/openvswitch-agent.log: 80095 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Bridge name --format=json]: PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
2020-04-30 10:06:37.284 ERROR /var/log/containers/neutron/openvswitch-agent.log: 80095 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Bridge name --format=json]: PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)

Revision history for this message
wes hayutin (weshayutin) wrote :

OK, I think the missing runtime deps are a red herring; please ignore.

Revision history for this message
wes hayutin (weshayutin) wrote :

https://logserver.rdoproject.org/openstack-periodic-wednesday-weekend/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset010-queens/998a119/logs/subnode-1/var/log/extra/journal.txt.gz

Apr 30 10:10:34 node-0000074445 dockerd-current[16828]: time="2020-04-30T10:10:34.026636586Z" level=error msg="Handler for POST /v1.26/containers/neutron-dnsmasq-qdhcp-f0ee86ea-b7a5-4678-981c-762af5e39a9b/stop returned error: No such container: neutron-dnsmasq-qdhcp-f0ee86ea-b7a5-4678-981c-762af5e39a9b"

https://logserver.rdoproject.org/openstack-periodic-wednesday-weekend/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset010-queens/998a119/logs/subnode-1/var/log/containers/neutron/dhcp-agent.log.txt.gz

 Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-f0ee86ea-b7a5-4678-981c-762af5e39a9b', 'dnsmasq', '--no-hosts', '--no-resolv', '--pid-file=/var/lib/neutron/dhcp/f0ee86ea-b7a5-4678-981c-762af5e39a9b/pid', '--dhcp-hostsfile=/var/lib/neutron/dhcp/f0ee86ea-b7a5-4678-981c-762af5e39a9b/host', '--addn-hosts=/var/lib/neutron/dhcp/f0ee86ea-b7a5-4678-981c-762af5e39a9b/addn_hosts', '--dhcp-optsfile=/var/lib/neutron/dhcp/f0ee86ea-b7a5-4678-981c-762af5e39a9b/opts', '--dhcp-leasefile=/var/lib/neutron/dhcp/f0ee86ea-b7a5-4678-981c-762af5e39a9b/leases', '--dhcp-match=set:ipxe,175', '--local-service', '--bind-dynamic', '--dhcp-range=set:subnet-b82d7dd3-d080-4f2d-9252-1ab8d371b0e5,10.100.0.0,static,255.255.255.240,86400s', '--dhcp-option-force=option:mtu,1300', '--dhcp-lease-max=16', '--conf-file=', '--domain=openstacklocal'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2020-04-30 10:10:34.380 79296 DEBUG neutron.agent.linux.utils [req-a4c2abfc-c635-4eb6-a547-eb88bba5c952 - - - - -] Unable to access /var/lib/neutron/dhcp/f0ee86ea-b7a5-4678-981c-762af5e39a9b/pid get_value_from_file /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:250
2020-04-30 10:10:35.383 79296 DEBUG neutron.agent.linux.utils [req-a4c2abfc-c635-4eb6-a547-eb88bba5c952 - - - - -] Unable to access /var/lib/neutron/external/pids/f0ee86ea-b7a5-4678-981c-762af5e39a9b.pid get_value_from_file /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:250
2020-04-30 10:10:35.383 79296 DEBUG neutron.agent.linux.utils [req-a4c2abfc-c635-4eb6-a547-eb88bba5c952 - - - - -] Unable to access

Revision history for this message
Terry Wilson (otherwiseguy) wrote :

I see https://bugzilla.redhat.com/show_bug.cgi?id=1658141, which mentions the missing libibverbs.so.1 being required by Mellanox cards?

Revision history for this message
Amol Kahat (amolkahat) wrote :

The libibverbs.so.1 and libmlx5 packages are required by Open vSwitch. After installing those packages there are no warnings.

wes hayutin (weshayutin)
summary: - Queens, tempest.scenario.test_network_basic_ops.TestNetworkBasicOps
- failing. Timeout
+ Queens, Rocky, Stein,
+ tempest.scenario.test_network_basic_ops.TestNetworkBasicOps failing.
+ Timeout
Revision history for this message
yatin (yatinkarel) wrote :

I think this is happening due to a corrupted cached cirros image in the nodepool images. To confirm it I logged into a running undercloud and found the following.

We hit similar issues in the past; I don't have that bug handy right now, though.

[zuul@undercloud ~]$ ls -lh /opt/cache/files/
total 32K
-rw-r--r--. 1 root root 273 May 7 09:52 cirros-0.3.4-x86_64-disk.img
-rw-r--r--. 1 root root 273 May 7 09:52 cirros-0.3.5-x86_64-disk.img
-rw-r--r--. 1 root root 273 May 7 09:52 cirros-0.4.0-x86_64-disk.img
-rw-r--r--. 1 root root 275 May 7 09:52 cirros-0.4.0-x86_64-uec.tar.gz
-rw-r--r--. 1 root root 274 May 7 09:52 cirros-0.5.1-aarch64-disk.img
-rw-r--r--. 1 root root 273 May 7 09:52 cirros-0.5.1-x86_64-disk.img
-rw-r--r--. 1 root root 275 May 7 09:52 cirros-0.5.1-x86_64-uec.tar.gz
-rw-r--r--. 1 root root 1.6K May 7 09:52 source-repository-images
[zuul@undercloud ~]$ cat /opt/cache/files/cirros-0.4.0-x86_64-disk.img
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="https://github.com/cirros-dev/cirros/releases/download/0.4.0/cirros-0.4.0-x86_64-disk.img">here</a>.</p>
</body></html>
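
For reference, a minimal sketch (not part of the bug or the CI tooling) of how one could flag these HTML redirect stubs among the cached images; the /opt/cache/files path is taken from the ls output above, while the size threshold and the helper name are hypothetical:

# Hypothetical helper: flag cached "images" that are really HTML redirect stubs.
# Assumes the /opt/cache/files layout shown in the ls output above.
import glob
import os

CACHE_DIR = "/opt/cache/files"
MIN_IMAGE_SIZE = 1024 * 1024  # a real cirros image is several MB; the stubs are ~273 bytes

def suspect_cache_entries(cache_dir=CACHE_DIR):
    suspects = []
    for path in sorted(glob.glob(os.path.join(cache_dir, "cirros-*"))):
        size = os.path.getsize(path)
        with open(path, "rb") as f:
            head = f.read(64).lstrip().lower()
        # The broken entries start with an HTML doctype instead of image data.
        if size < MIN_IMAGE_SIZE or head.startswith((b"<!doctype", b"<html")):
            suspects.append((path, size))
    return suspects

if __name__ == "__main__":
    for path, size in suspect_cache_entries():
        print("suspect cache entry: %s (%d bytes)" % (path, size))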

The reason it's not happening upstream is that cirros images are cached differently there: the cirros URL is redirected, and without -L the download saves the redirect text instead of the image. https://softwarefactory-project.io/r/gitweb?p=config.git;a=blob;f=nodepool/elements/upstream-image-needs/post-install.d/89-get_cirros;h=5217aa06fa76788446d40b1715ddd4f6ba650b64;hb=HEAD#l27 is used to cache the cirros images. I will push a patch to fix this and get the nodepool images updated so we quickly get the results.
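
To illustrate the redirect behaviour, a sketch only (the actual fix adds redirect following, i.e. -L, to the curl call in the 89-get_cirros element; the URL below is the one shown in the cached redirect stub above, and the destination path is hypothetical):

# Illustrative sketch: fetch the cirros image following redirects and
# sanity-check that we did not save an HTML redirect page by mistake.
import urllib.request

URL = ("https://github.com/cirros-dev/cirros/releases/download/"
       "0.4.0/cirros-0.4.0-x86_64-disk.img")
DEST = "/tmp/cirros-0.4.0-x86_64-disk.img"  # hypothetical destination

# urllib.request follows HTTP redirects by default (the equivalent of curl -L);
# a fetch that does not follow them is what left the ~273-byte HTML stub behind.
with urllib.request.urlopen(URL) as resp:
    data = resp.read()

if data.lstrip().lower().startswith((b"<!doctype", b"<html")):
    raise RuntimeError("got an HTML redirect page instead of an image")

with open(DEST, "wb") as out:
    out.write(data)
print("wrote %d bytes to %s" % (len(data), DEST))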

The reason it's not happening in Train+ is that os_tempest is used there, which downloads the cirros image instead of using the cached one: https://opendev.org/openstack/tripleo-quickstart-extras/src/branch/master/playbooks/tempest.yml#L68

Revision history for this message
yatin (yatinkarel) wrote :

<< will push a patch to fix this, and get nodepool images updated so we quickly get the results.
Pushed https://softwarefactory-project.io/r/#/c/18156/ to fix it. The issue started after the new nodepool element began to be used: https://softwarefactory-project.io/r/#/c/18156/.

Revision history for this message
yatin (yatinkarel) wrote :

<< will push a patch to fix this, and get nodepool images updated so we quickly get the results.

Both CentOS 7 and CentOS 8 nodepool images have been updated with the fix https://softwarefactory-project.io/r/#/c/18156/, and jobs are passing. I checked the Stein [1] and Rocky [2] jobs; some jobs failed, but that's not related to the cirros image.

[1] https://trunk-primary.rdoproject.org/api-centos-stein/api/civotes_detail.html?commit_hash=edcb7ffc2c558f7cb32ac495cdd098bed50dc8e7&distro_hash=ec0fb22b2c9b310bc07f760c8b283a4f737ca226
[2] https://trunk-primary.rdoproject.org/api-centos-rocky/api/civotes_detail.html?commit_hash=1613ad90a9a52bc66678efe51ed0112b933d20c8&distro_hash=e05032f35f2ba610d55ec016e2b7e435f4b82e39

Revision history for this message
Rafael Folco (rafaelfolco) wrote :

The issue vanished: https://review.rdoproject.org/r/#/c/27349/
Thanks for fixing the cirros image.

Changed in tripleo:
status: Triaged → Fix Released