"test_established_tcp_session_after_re_attachinging_sg" failing in "neutron-tempest-plugin-openvswitch-iptables_hybrid-distributed-dhcp" CI job

Bug #2079047 reported by Rodolfo Alonso
Affects: neutron
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

The test "neutron_tempest_plugin.scenario.test_security_groups.StatefulNetworkSecGroupTest.test_established_tcp_session_after_re_attachinging_sg" is failing 100% of the time in the CI job "neutron-tempest-plugin-openvswitch-iptables_hybrid-distributed-dhcp".

I've tested with and without the WSGI module [1] and it doesn't make any difference; the test fails with both configurations.

[1] https://review.opendev.org/c/openstack/neutron/+/927958
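
For context, a rough sketch of what the scenario exercises, paraphrased from the test name and the traceback below rather than from the plugin's exact code: with a TCP session established through a security group rule that allows it, the allowing group is detached from the server's port and later re-attached, and the failing assertion (should_pass=False) expects the already-established session to stop passing traffic once the group is gone. A manual reproduction along those lines with openstacksdk might look like this; the cloud name, server name and security group names are assumptions for illustration only:

```python
# Rough manual reproduction sketch (assumptions: a clouds.yaml entry called
# "devstack", a booted server "sg-test-vm" and a security group "allow-tcp"
# whose rule permits the TCP port used for the long-lived session).
import openstack

conn = openstack.connect(cloud='devstack')

server = conn.compute.find_server('sg-test-vm')
port = next(conn.network.ports(device_id=server.id))
allow_tcp = conn.network.find_security_group('allow-tcp')
base_sg = conn.network.find_security_group('default')

# 1. With both groups attached, open a TCP session to the server
#    (e.g. with nc) and keep it established.

# 2. Detach the group that allows the session's TCP port ...
conn.network.update_port(port, security_group_ids=[base_sg.id])
#    ... at this point the test expects the established session to stop
#    passing traffic (its conntrack entries should be cleaned up); the CI
#    failure is a 10-second timeout while waiting for exactly that.

# 3. Re-attach the group; traffic should be allowed again.
conn.network.update_port(port, security_group_ids=[base_sg.id, allow_tcp.id])
```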

description: updated
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

To help with later searches, here is the error traceback:

```
Traceback (most recent call last):
  File "/opt/stack/tempest/.tox/tempest/lib/python3.10/site-packages/neutron_tempest_plugin/common/utils.py", line 85, in wait_until_true
    eventlet.sleep(sleep)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.10/site-packages/eventlet/greenthread.py", line 38, in sleep
    hub.switch()
  File "/opt/stack/tempest/.tox/tempest/lib/python3.10/site-packages/eventlet/hubs/hub.py", line 310, in switch
    return self.greenlet.switch()
eventlet.timeout.Timeout: 10 seconds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/stack/tempest/.tox/tempest/lib/python3.10/site-packages/neutron_tempest_plugin/scenario/test_security_groups.py", line 922, in test_established_tcp_session_after_re_attachinging_sg
    con.test_connection(should_pass=False)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.10/site-packages/neutron_tempest_plugin/common/utils.py", line 239, in test_connection
    wait_until_true(self._test_connection,
  File "/opt/stack/tempest/.tox/tempest/lib/python3.10/site-packages/neutron_tempest_plugin/common/utils.py", line 90, in wait_until_true
    raise WaitTimeout("Timed out after %d seconds" % timeout)
neutron_tempest_plugin.common.utils.WaitTimeout: Timed out after 10 seconds
```
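
For what it's worth, the WaitTimeout above only means that the condition passed to wait_until_true (here, via con.test_connection(should_pass=False): "the established session no longer passes traffic") never became true within the 10-second window. A minimal sketch of how such a wait helper behaves, assuming a plain time-based loop rather than the plugin's eventlet-based implementation:

```python
import time


class WaitTimeout(Exception):
    """Raised when the awaited condition never became true."""


def wait_until_true(predicate, timeout=10, sleep=1):
    """Poll predicate until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            raise WaitTimeout("Timed out after %d seconds" % timeout)
        time.sleep(sleep)
```

So the failure does not appear to be an error in the helper itself; it indicates the established TCP session kept passing traffic for the whole 10 seconds after the security group change.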

The first failure in OpenSearch is on Aug 23: https://opensearch.logs.openstack.org/_dashboards/app/data-explorer/discover?security_tenant=global#?_a=(discover:(columns:!(_source),isDirty:!f,sort:!()),metadata:(indexPattern:'94869730-aea8-11ec-9e6a-83741af3fdcd',view:discover))&_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:'2024-08-23T04:00:00.000Z',to:'2024-08-24T04:00:00.000Z'))&_q=(filters:!(),query:(language:kuery,query:'%22in%20test_established_tcp_session_after_re_attachinging_sg%22'))

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

The first failure for the job is on Aug 26.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

It may be significant that the non-iptables_hybrid flavor of the job, neutron-tempest-plugin-openvswitch-distributed-dhcp, doesn't seem to fail the same way.

I wonder, though, whether the OpenSearch logs are enough to pinpoint the failure in time, since the job is in the experimental queue and its results are therefore collected on demand and sporadically. I was triggering a lot of experimental runs lately while working on nested SNAT support for OVN; maybe that's why it started to pop up in the logs around that time. It would of course help if we ran the job somewhere else as well, e.g. in the periodic queue.

Changed in neutron:
status: New → Confirmed
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Testing patch: https://review.opendev.org/c/openstack/neutron/+/927958

I initially thought this could be related to the eventlet/WSGI migration, but the job fails with both configurations.

Revision history for this message
Brian Haley (brian-haley) wrote :

@Liu - can you look at this and see if it's related to the distributed DHCP code? Thanks

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

As discussed at the team meeting: 1) OpenSearch only carries ~10 days of data, so the fact that failures started to spike there doesn't indicate when the job actually broke; and 2) the job is in both the periodic and experimental pipelines (which share job definitions), so we should have at least some periodic runs in the logs, and any spike cannot be explained away by a change in activity on the experimental queue, as I suggested in the previous comment.
