Compute tests are failing with "failed to reach ACTIVE status and task state 'None' within the required time".

Bug #1964940 reported by chandan kumar
This bug affects 3 people
Affects   Status         Importance   Assigned to   Milestone
neutron   Fix Released   Critical     yatin
tripleo   Invalid        Critical     Unassigned

Bug Description

On Fs001 (CentOS Stream 9, Wallaby), multiple compute server tempest tests are failing with the following error [1][2]:
```
{1} tempest.api.compute.images.test_images.ImagesTestJSON.test_create_image_from_paused_server [335.060967s] ... FAILED

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "/usr/lib/python3.9/site-packages/tempest/api/compute/images/test_images.py", line 99, in test_create_image_from_paused_server
        server = self.create_test_server(wait_until='ACTIVE')
      File "/usr/lib/python3.9/site-packages/tempest/api/compute/base.py", line 270, in create_test_server
        body, servers = compute.create_test_server(
      File "/usr/lib/python3.9/site-packages/tempest/common/compute.py", line 267, in create_test_server
        LOG.exception('Server %s failed to delete in time',
      File "/usr/lib/python3.9/site-packages/oslo_utils/excutils.py", line 227, in __exit__
        self.force_reraise()
      File "/usr/lib/python3.9/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
        raise self.value
      File "/usr/lib/python3.9/site-packages/tempest/common/compute.py", line 237, in create_test_server
        waiters.wait_for_server_status(
      File "/usr/lib/python3.9/site-packages/tempest/common/waiters.py", line 100, in wait_for_server_status
        raise lib_exc.TimeoutException(message)
    tempest.lib.exceptions.TimeoutException: Request timed out
    Details: (ImagesTestJSON:test_create_image_from_paused_server) Server 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1 failed to reach ACTIVE status and task state "None" within the required time (300 s). Server boot request ID: req-4930f047-7f5f-4d08-9ebb-8ac99b29ad7b. Current status: BUILD. Current task state: spawning.
```
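For context, the waiter that raises this timeout simply polls the server status until it becomes ACTIVE or the build timeout expires. A minimal sketch of that pattern (simplified, not the actual tempest code; `get_status` is a hypothetical callable returning the current status string):
```
import time


class TimeoutException(Exception):
    pass


def wait_for_server_status(get_status, server_id, expected='ACTIVE',
                           timeout=300, interval=1):
    """Poll the server status until it matches `expected` or we time out."""
    start = time.time()
    while time.time() - start < timeout:
        if get_status(server_id) == expected:
            return
        time.sleep(interval)
    raise TimeoutException(
        'Server %s failed to reach %s status within the required time (%s s)'
        % (server_id, expected, timeout))
```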

Below is the list of other tempest tests failing on the same job [2]:
```
tempest.api.compute.images.test_images.ImagesTestJSON.test_create_image_from_paused_server[id-71bcb732-0261-11e7-9086-fa163e4fa634]
tempest.api.compute.admin.test_volume.AttachSCSIVolumeTestJSON.test_attach_scsi_disk_with_config_drive[id-777e468f-17ca-4da4-b93d-b7dbf56c0494]
tempest.api.compute.servers.test_delete_server.DeleteServersTestJSON.test_delete_server_while_in_attached_volume[id-d0f3f0d6-d9b6-4a32-8da4-23015dcab23c,volume]
tempest.api.compute.servers.test_attach_interfaces.AttachInterfacesV270Test.test_create_get_list_interfaces[id-2853f095-8277-4067-92bd-9f10bd4f8e0c,network]
tempest.api.compute.servers.test_delete_server.DeleteServersTestJSON.test_delete_server_while_in_shelved_state[id-bb0cb402-09dd-4947-b6e5-5e7e1cfa61ad]
setUpClass (tempest.api.compute.images.test_images_oneserver_negative.ImagesOneServerNegativeTestJSON)
tempest.api.compute.servers.test_device_tagging.TaggedBootDevicesTest_v242.test_tagged_boot_devices[id-a2e65a6c-66f1-4442-aaa8-498c31778d96,image,network,slow,volume]
tempest.api.compute.servers.test_delete_server.DeleteServersTestJSON.test_delete_server_while_in_suspended_state[id-1f82ebd3-8253-4f4e-b93f-de9b7df56d8b]
tempest.api.compute.servers.test_attach_interfaces.AttachInterfacesTestJSON.test_create_list_show_delete_interfaces_by_network_port[id-73fe8f02-590d-4bf1-b184-e9ca81065051,network]
setUpClass (tempest.api.compute.servers.test_server_rescue.ServerRescueTestJSONUnderV235)
```

Here is the traceback from the nova-compute logs [3]:
```
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [req-4930f047-7f5f-4d08-9ebb-8ac99b29ad7b d5ea6c724785473b8ea1104d70fb0d14 64c7d31d84284a28bc9aaa4eaad2b9fb - default default] [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1] Instance failed to spawn: nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1] Traceback (most recent call last):
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1] File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py", line 7231, in _create_guest_with_network
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1] guest = self._create_guest(
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1] File "/usr/lib64/python3.9/contextlib.py", line 126, in __exit__
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1] next(self.gen)
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1] File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 479, in wait_for_instance_event
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1] actual_event = event.wait()
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1] File "/usr/lib/python3.9/site-packages/eventlet/event.py", line 125, in wait
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1] result = hub.switch()
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1] File "/usr/lib/python3.9/site-packages/eventlet/hubs/hub.py", line 313, in switch
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1] return self.greenlet.switch()
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1] eventlet.timeout.Timeout: 300 seconds
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1]
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1] During handling of the above exception, another exception occurred:
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1]
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1] Traceback (most recent call last):
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1] File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 2640, in _build_resources
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1] yield resources
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1] File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 2409, in _build_and_run_instance
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1] self.driver.spawn(context, instance, image_meta,
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1] File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py", line 4193, in spawn
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1] self._create_guest_with_network(
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1] File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py", line 7257, in _create_guest_with_network
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1] raise exception.VirtualInterfaceCreateException()
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1] nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed
2022-03-15 09:05:39.011 2 ERROR nova.compute.manager [instance: 6d1d8906-46fd-42ad-8b4e-0f89adb25ed1]
```
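The traceback shows nova-compute blocking in wait_for_instance_event: after asking Neutron to wire the port, it waits for the network-vif-plugged external event and gives up after vif_plugging_timeout (300 s by default), raising VirtualInterfaceCreateException when vif_plugging_is_fatal is enabled. A simplified sketch of that flow (using a plain threading.Event instead of eventlet; names other than the exception and option names are illustrative):
```
import threading


class VirtualInterfaceCreateException(Exception):
    pass


def create_guest_with_network(plug_vifs, vif_plugged: threading.Event,
                              vif_plugging_timeout=300,
                              vif_plugging_is_fatal=True):
    # Ask Neutron to bind/plug the port; Neutron is expected to send a
    # network-vif-plugged event back to Nova once the port goes ACTIVE.
    plug_vifs()
    if not vif_plugged.wait(vif_plugging_timeout):
        # The event never arrived: in this bug, because Neutron/OVN processed
        # the port "up" notification far too late.
        if vif_plugging_is_fatal:
            raise VirtualInterfaceCreateException(
                'Virtual Interface creation failed')
    # ... continue creating the libvirt guest ...
```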

This job https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-wallaby has been broken since 13th Mar 2022; the earlier bug https://bugs.launchpad.net/tripleo/+bug/1960310 is also seen on it.

Since we have two runs with the same test failures, logging this bug for further investigation.

Logs:

[1]. https://logserver.rdoproject.org/17/40517/1/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-wallaby/94e16ac/logs/undercloud/var/log/tempest/tempest_run.log.txt.gz

[2]. https://logserver.rdoproject.org/40/40440/1/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-wallaby/6ce8796/logs/undercloud/var/log/tempest/failing_tests.log.txt.gz

[3]. https://logserver.rdoproject.org/17/40517/1/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-wallaby/94e16ac/logs/undercloud/var/log/tempest/failing_tests.log.txt.gz

[4]. https://logserver.rdoproject.org/17/40517/1/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-wallaby/94e16ac/logs/overcloud-novacompute-0/var/log/containers/nova/nova-compute.log.1.gz

description: updated
Revision history for this message
Ronelle Landy (rlandy) wrote :

Another example log:

https://logserver.rdoproject.org/54/36254/80/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-wallaby/ad62ee9/logs/undercloud/var/log/tempest/stestr_results.html.gz

tempest.lib.exceptions.TimeoutException: Request timed out
Details: (ServersTestMultiNic:test_verify_duplicate_network_nics) Server cd3eb8e1-d42e-4789-9e66-d7e0be00dbb0 failed to reach ACTIVE status and task state "None" within the required time (300 s). Server boot request ID: req-c2e948e6-b19a-4dc3-99a1-c393675d2334. Current status: BUILD. Current task state: spawning.

Changed in tripleo:
importance: High → Critical
milestone: yoga-2 → yoga-3
Revision history for this message
Ronelle Landy (rlandy) wrote :

https://sf.hosted.upshift.rdu2.redhat.com/logs/75/317875/1/check/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset035-internal-rhos-17/6107ffc/logs/overcloud-novacompute-0/var/log/containers/nova/nova-compute.log.1.gz

also shows:

2022-03-15 11:52:36.856 7 DEBUG oslo_concurrency.lockutils [req-0ee08fbb-e521-49cd-8cec-55b3a6d33069 0200c0761ba548b3b659c37777d87064 03668b49493545f394094cc50dab3e19 - default default] Lock "compute_resources" released by "nova.compute.resource_tracker.ResourceTracker.abort_instance_claim" :: held 0.217s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:371
2022-03-15 11:52:36.857 7 ERROR nova.compute.manager [req-0ee08fbb-e521-49cd-8cec-55b3a6d33069 0200c0761ba548b3b659c37777d87064 03668b49493545f394094cc50dab3e19 - default default] [instance: 2c5636ae-afdf-4c33-bc93-021d701aa3ad] Failed to allocate network(s): nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed
2022-03-15 11:52:36.857 7 ERROR nova.compute.manager [instance: 2c5636ae-afdf-4c33-bc93-021d701aa3ad] Traceback (most recent call last):
2022-03-15 11:52:36.857 7 ERROR nova.compute.manager [instance: 2c5636ae-afdf-4c33-bc93-021d701aa3ad] File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 7243, in _create_guest_with_network
2022-03-15 11:52:36.857 7 ERROR nova.compute.manager [instance: 2c5636ae-afdf-4c33-bc93-021d701aa3ad] post_xml_callback=post_xml_callback)
2022-03-15 11:52:36.857 7 ERROR nova.compute.manager [instance: 2c5636ae-afdf-4c33-bc93-021d701aa3ad] File "/usr/lib64/python3.6/contextlib.py", line 88, in __exit__
2022-03-15 11:52:36.857 7 ERROR nova.compute.manager [instance: 2c5636ae-afdf-4c33-bc93-021d701aa3ad] next(self.gen)
2022-03-15 11:52:36.857 7 ERROR nova.compute.manager [instance: 2c5636ae-afdf-4c33-bc93-021d701aa3ad] File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 479, in wait_for_instance_event
2022-03-15 11:52:36.857 7 ERROR nova.compute.manager [instance: 2c5636ae-afdf-4c33-bc93-021d701aa3ad] actual_event = event.wait()
2022-03-15 11:52:36.857 7 ERROR nova.compute.manager [instance: 2c5636ae-afdf-4c33-bc93-021d701aa3ad] File "/usr/lib/python3.6/site-packages/eventlet/event.py", line 125, in wait
2022-03-15 11:52:36.857 7 ERROR nova.compute.manager [instance: 2c5636ae-afdf-4c33-bc93-021d701aa3ad] result = hub.switch()
2022-03-15 11:52:36.857 7 ERROR nova.compute.manager [instance: 2c5636ae-afdf-4c33-bc93-021d701aa3ad] File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 313, in switch
2022-03-15 11:52:36.857 7 ERROR nova.compute.manager [instance: 2c5636ae-afdf-4c33-bc93-021d701aa3ad] return self.greenlet.switch()
2022-03-15 11:52:36.857 7 ERROR nova.compute.manager [instance: 2c5636ae-afdf-4c33-bc93-021d701aa3ad] eventlet.timeout.Timeout: 300 seconds

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

It seems like some Neutron (or OVN) issue here. I took a look at the logs from https://logserver.rdoproject.org/54/36254/80/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-wallaby/ad62ee9/logs/undercloud/var/log/tempest/stestr_results.html.gz and checked the VM created in the test tempest.api.compute.security_groups.test_security_groups.SecurityGroupsTestJSON.test_server_security_groups (UUID of the VM: 2b66e196-c206-4f0c-bfab-5a2eefe112ad).

The VM was created by Nova around 16:42, and Nova updated the port in Neutron at that time:

2022-03-15 16:42:07.030 2 DEBUG nova.network.neutron [req-ca324b7c-1fa7-4463-9fd3-1873ad272a60 33dd76647428416bade2e9761ac1c966 7845cd695c68496f8586c041a92a25b9 - default default] [instance: 2b66e196-c206-4f0c-bfab-5a2eefe112ad] Successfully updated port: 6a712e97-bc61-49a0-aee6-66d4fcd7b72d _update_port /usr/lib/python3.9/site-packages/nova/network/neutron.py:585

And about 300 seconds later it timed out waiting for the vif-plugging event to be sent by Neutron:

2022-03-15 16:47:08.554 2 ERROR nova.compute.manager [req-ca324b7c-1fa7-4463-9fd3-1873ad272a60 33dd76647428416bade2e9761ac1c966 7845cd695c68496f8586c041a92a25b9 - default default] [instance: 2b66e196-c206-4f0c-bfab-5a2eefe112ad] Instance failed to spawn: nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed
2022-03-15 16:47:08.554 2 ERROR nova.compute.manager [instance: 2b66e196-c206-4f0c-bfab-5a2eefe112ad] Traceback (most recent call last):

Now, in Neutron logs, the only things related to the port provisioning are:

- transition to ACTIVE not triggered due to a provisioning block by the L2 entity:
2022-03-15 16:42:06.338 15 DEBUG neutron.db.provisioning_blocks [req-f30cecff-57fd-4141-b963-a8e3a8dbd1f0 662983d3036b4e7a9cd55bfd01284d06 bcbe34bb1407478089c9e4bb99594495 - default default] Transition to ACTIVE for port object 6a712e97-bc61-49a0-aee6-66d4fcd7b72d will not be triggered until provisioned by entity L2. add_provisioning_component /usr/lib/python3.9/site-packages/neutron/db/provisioning_blocks.py:73

- port binding:
2022-03-15 16:42:06.724 15 DEBUG neutron.plugins.ml2.managers [req-f30cecff-57fd-4141-b963-a8e3a8dbd1f0 662983d3036b4e7a9cd55bfd01284d06 bcbe34bb1407478089c9e4bb99594495 - default default] Bound port: 6a712e97-bc61-49a0-aee6-66d4fcd7b72d, host: overcloud-novacompute-0.localdomain, vif_type: ovs, vif_details: {"port_filter": true, "connectivity": "l2"}, binding_levels: [{'bound_driver': 'ovn', 'bound_segment': {'id': '743a70a6-b626-4738-8845-0b904456acdc', 'network_type': 'geneve', 'physical_network': None, 'segmentation_id': 49632, 'network_id': 'e00f2b59-05a7-4921-8b34-4ee494c04abc'}}] _bind_port_level /usr/lib/python3.9/site-packages/neutron/plugins/ml2/managers.py:928

- and OVN reports status UP, but only long after the VM was already deleted:

2022-03-15 16:50:31.218 15 INFO neutron.plugins.ml2.drivers.ovn.mech_driver.mech_driver [req-dbbfd0fb-bec7-4a80-83af-c863ca531175 - - - - -] OVN reports status up for port: 6a712e97-bc61-49a0-aee6-66d4fcd7b72d
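
For readers unfamiliar with the provisioning-blocks mechanism in the first log line: a port is only transitioned to ACTIVE (and the vif-plugged notification sent to Nova) once every registered entity has reported completion; here the L2 entity (the OVN mechanism driver) only reports completion when it sees the port come up in OVN. A simplified sketch of the idea (illustrative names, not Neutron's actual API):
```
from collections import defaultdict

# port_id -> set of entities that still have to finish provisioning
_provisioning_blocks = defaultdict(set)


def add_provisioning_component(port_id, entity):
    _provisioning_blocks[port_id].add(entity)


def provisioning_complete(port_id, entity, set_port_active):
    """Called by an entity (e.g. 'L2') when it has finished its part."""
    _provisioning_blocks[port_id].discard(entity)
    if not _provisioning_blocks[port_id]:
        # No blocks left: the port can go ACTIVE and Nova gets notified.
        set_port_active(port_id)
```
In the failure above, the L2 completion (driven by OVN reporting the port up) arrives minutes late, so the port never goes ACTIVE while Nova is still waiting.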

Revision history for this message
Marios Andreou (marios-b) wrote :

I wonder if this may be related to https://bugs.launchpad.net/neutron/+bug/1961184?

The fix for that merged in master (https://review.opendev.org/c/openstack/neutron/+/830624), but we don't have it on Wallaby.

Revision history for this message
yatin (yatinkarel) wrote :

I looked into this and agree with Slawek that something is wrong on the Neutron OVN side. Adding my findings below:

Some data points:
- The issue is random, as jobs succeed some of the time [1], so it is likely a race or events being missed somehow.
- The issue is not specific to Wallaby; I can see similar failures in all releases (Train onwards) [2].
- The issue is not specific to a distro; it is seen in C8, RHEL8 and C9 jobs [2].
- The issue has been happening for a long time; I could see failures a month back, and before that logs are not persisted. Adding a reference to last month's logs [2].
- The issue is also seen in jobs running with 1 controller [3]; only a few occurrences were found, and I looked only in Wallaby and Train.

<< - and OVN reports status UP, but it's way to long after vm was already deleted:
<< 2022-03-15 16:50:31.218 15 INFO neutron.plugins.ml2.drivers.ovn.mech_driver.mech_driver [req-dbbfd0fb-bec7-4a80-83af-c863ca531175 - - - - -] OVN reports status up for port: 6a712e97-bc61-49a0-aee6-66d4fcd7b72d

It seems ^ was triggered instead by the maintenance task (the PortBindingUpdateUpEvent was missed somehow): Fixing resource 6a712e97-bc61-49a0-aee6-66d4fcd7b72d (type: ports) at create/update check_for_inconsistencies /usr/lib/python3.9/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/maintenance.py:358

One thing I noticed in the logs I checked is that "OVN reports status up for port" is not logged after the PortBindingUpdateUpEvent event. But I haven't figured out what can cause that, as I see it is the first statement to be executed when the event is handled [4][5].

Considering the related event, [6] looked suspicious; it is backported down to Ussuri. But since the issue is also seen in Train, maybe [6] just increased reproducibility, or it is some general issue with event processing. Will involve Luis (the author of [6]) and someone with a better understanding of this area.

[1]
https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-wallaby
https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-wallaby
[2]
https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/66b141d/logs/undercloud/var/log/tempest/stestr_results.html.gz
https://logserver.rdoproject.org/openstack-periodic-integration-stable1-cs9/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-wallaby/6b04066/logs/undercloud/var/log/tempest/stestr_results.html.gz
https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-wallaby/ee71b17/logs/undercloud/var/log/tempest/stestr_results.html.gz
https://logserver.rdoproject.org/openstack-periodic-integration-stable2/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-victoria/ef11bd8/logs/undercloud/var/log/tempest/stestr_results.html.gz
https://logserver.rdoproject.org/openstack-periodic-integration-stable3/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-ussuri/da5136f/logs/u...


Revision history for this message
yatin (yatinkarel) wrote :

Forgot to add: I was also trying to reproduce it in https://review.rdoproject.org/r/c/testproject/+/40609, but haven't succeeded yet as it fails early due to other issues.

Revision history for this message
yatin (yatinkarel) wrote :

<< Considering the related event, [6] looked suspicious which is backported till Ussuri
It is backported to Train [1] as well, but since it is in networking-ovn the change ID is different.

[1] https://review.opendev.org/c/openstack/networking-ovn/+/823279

Revision history for this message
Marios Andreou (marios-b) wrote :

So we had a couple of green runs here on the 23rd [1] and 24th [2] in the periodic pipeline [3].

The last two failures, on 26/27, are unrelated to this bug (they didn't even reach tempest).

I think we may be clear of this. Not sure if we can move this to done, or is there any more investigation from the Neutron side, @Yatin or anyone else working here?

[1] https://review.rdoproject.org/zuul/build/8510cc1f851944b8ac8426fa311452c7

[2] https://review.rdoproject.org/zuul/build/98701b0132d14d81b399bb07fea25e2a

[3] https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-wallaby&project=openstack/tripleo-ci

Revision history for this message
yatin (yatinkarel) wrote :

@Marios, as commented previously, the issue is random and affects multiple releases, so a few successes are not enough to conclude that the issue is cleared.

I was checking with @slawek and @ltomasbo on this, but it is not yet clear what exactly is causing the issue and whether https://review.opendev.org/q/Ib071889271f4e4d6acd83b219bf908a9ae80ce5c can cause it.

Also, I tested the revert of https://review.opendev.org/c/openstack/neutron/+/823269 with OSP 17 jobs and the issue did not occur in multiple runs, but again, the issue is random, so I can't be sure whether the revert patch helped. I can do similar tests in Wallaby jobs to rule out whether this patch is related or not.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

According to Sean, there is a workaround to unblock this by disregarding test coverage for the vif-plugged events feature completely (since it is just unstable with OVN today):

We have the option to turn it off entirely or only for live migration.
We also have other workaround options on master for some edge cases; see the configuration sketch after the links below.
https://docs.openstack.org/nova/latest/configuration/config.html#workarounds.wait_for_vif_plugged_event_during_hard_reboot
https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.vif_plugging_is_fatal
https://docs.openstack.org/nova/latest/configuration/config.html#compute.live_migration_wait_for_vif_plug
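
For illustration only, these options would go into nova.conf roughly as below; this is a sketch, not a recommendation, and option availability differs per release (the hard-reboot workaround exists only on newer branches):
```
[DEFAULT]
# Boot no longer fails if the network-vif-plugged event never arrives.
vif_plugging_is_fatal = False

[compute]
# Do not wait for vif-plugged events during live migration.
live_migration_wait_for_vif_plug = False

[workarounds]
# Newer releases only: restrict which VIF types are waited for on hard reboot.
# wait_for_vif_plugged_event_during_hard_reboot = []
```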

Changed in tripleo:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by "Bogdan Dobrelya <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/837341
Reason: as we discussed with the team, we should keep this feature tested

Revision history for this message
yatin (yatinkarel) wrote :

Just to update, this is also hit downstream (https://bugzilla.redhat.com/show_bug.cgi?id=2081631); we are currently debugging it on an env where it reproduced.

One thing we did was to test with a revert of the suspected patch https://review.opendev.org/c/openstack/networking-ovn/+/823279, and we are not seeing the failure with the revert.

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

This issue could be related to how "ovsdbapp" attends to the OVN DB events. The events are caught by the Neutron server (as we can see in the Neutron logs) and are processed in ``RowEventHandler.notify``. If the specific event's ``match_fn`` returns True, the event is added to the ``RowEventHandler.notifications`` queue.

The problem seems to be in ``RowEventHandler.notify_loop``, a method executed in a separate thread that processes the events added to the ``RowEventHandler.notifications`` queue. When "ovsdbapp" is imported from the Neutron server, the "threading" library has been patched by "eventlet"; instead of using OS threads we are using user threads. If sometimes we don't process a stored event, it is because we fail to yield to this thread.
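
A minimal sketch of the producer/consumer pattern described above (names simplified; this is not the actual ovsdbapp implementation):
```
import queue
import threading


class RowEventHandler:
    def __init__(self):
        self.notifications = queue.Queue()
        # Under eventlet monkey-patching this "thread" becomes a green
        # thread; if it is never yielded to, queued events are not handled.
        threading.Thread(target=self.notify_loop, daemon=True).start()

    def notify(self, event, row):
        # Called for every DB change seen by the IDL; only matching events
        # are queued for later processing.
        if event.match_fn(row):
            self.notifications.put((event, row))

    def notify_loop(self):
        while True:
            event, row = self.notifications.get()
            event.run(row)   # e.g. mark the Neutron port as up
```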

I've proposed a solution using OS threads [1]. It is now running in the environment deployed by Yatin. So far it has executed 400+ tests without any error.

Regards.

[1]https://review.opendev.org/c/openstack/ovsdbapp/+/841238

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :

We have tried a cherry-pick of the ongoing patch [0] with the tripleo-ci Wallaby jobs (where we are hitting this bug [2]) in a testproject [3].

With this patch, neutron-openvswitch-agent on the undercloud is not starting correctly.

The neutron_ovs_agent container is unhealthy:

https://logserver.rdoproject.org/40/36140/49/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-wallaby/aceb20b/logs/undercloud/var/log/extra/podman/podman_allinfo.log.txt.gz

~~~
6a4a8ba8ef7e 192.168.24.1:8787/tripleowallabycentos9/openstack-neutron-openvswitch-agent:f83994cb11ef42e4c94ee49af211c863-updated-20220513064107 kolla_start 45 minutes ago Up 45 minutes ago (unhealthy) neutron_ovs_agent
~~~

There is no openvswitch entry in "openstack network agent list":
~~~
+--------------------------------------+----------------+--------------------------------------+-------------------+-------+-------+----------------------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary |
+--------------------------------------+----------------+--------------------------------------+-------------------+-------+-------+----------------------+
| 385461b2-c04d-4bb3-8f5e-cf6d2e3bd524 | Baremetal Node | f2c6882b-285f-494a-86b7-01c24fbe1206 | None | :-) | UP | ironic-neutron-agent |
| 72bb93d0-8465-4eff-a1a3-18983fa72328 | Baremetal Node | 820119f8-6432-4883-a171-363c99f6fc53 | None | :-) | UP | ironic-neutron-agent |
| 82f73f9b-31be-42c0-98e0-2e52c7e15472 | Baremetal Node | 4022af37-9502-412f-9df2-e94af29a93aa | None | :-) | UP | ironic-neutron-agent |
| b78283ec-a9a2-41fb-a8fa-248499eee23f | Baremetal Node | 6c8c2707-5d5f-49f8-b3b4-b3b4719321c9 | None | :-) | UP | ironic-neutron-agent |
| e5dbf620-8d9f-40c3-9319-3d2c9848f17b | DHCP agent | undercloud.localdomain | nova | :-) | UP | neutron-dhcp-agent |
| fb42b407-25ff-456a-b772-c279a0d8e362 | L3 agent | undercloud.localdomain | nova | :-) | UP | neutron-l3-agent |
+--------------------------------------+----------------+--------------------------------------+-------------------+-------+-------+----------------------+
~~~

openvswitch-agent.log at [4]

[0] https://review.opendev.org/c/openstack/ovsdbapp/+/841238
[1] https://review.opendev.org/c/openstack/ovsdbapp/+/841429
[2] https://bugs.launchpad.net/tripleo/+bug/1964940
[3] https://review.rdoproject.org/r/c/testproject/+/36140
[4] https://logserver.rdoproject.org/40/36140/49/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-wallaby/aceb20b/logs/undercloud/var/log/containers/neutron/openvswitch-agent.log.txt.gz

Changed in neutron:
assignee: nobody → yatin (yatinkarel)
status: New → In Progress
importance: Undecided → Critical
Revision history for this message
yatin (yatinkarel) wrote :

Pushed a revert to neutron master: https://review.opendev.org/c/openstack/neutron/+/843426; it will also be backported to other branches.

Pasting the investigation results from https://bugzilla.redhat.com/show_bug.cgi?id=2081631#c6 here for reference:

So we further debugged this and below are the findings.

When the issue reproduces for a server/port:
- PortBindingUpdateUpEvent is received and put into the queue; at this point the self.notifications queue is large, with 250+ entries seen.
- The queue gets filled with PortBindingChassisEvent entries for the chassisredirect port in just 2-3 seconds.
- All the PortBindingChassisEvent entries are for the same port just switching chassis [1]; the log below is just a snippet, there were 274 entries in total for this particular case, and 350+ seen for some other cases.
- The same can be seen in the ovn-controller log [2]; only a snippet was added, and there were 134 entries in total on one controller and 135 on the other.

And this runs into a known, old, unfixed OVN bug: https://bugzilla.redhat.com/show_bug.cgi?id=1974898. So until that is fixed it seems we need to revert https://review.opendev.org/c/openstack/networking-ovn/+/823279, which is likely making the issue occur more often, as it switched monitoring to the SB DB instead of the NB DB; the NB and SB event queues are different, so NB events are not impacted by a large SB event queue.

[1] 2022-05-25 09:11:04.511 15 DEBUG networking_ovn.ovsdb.ovsdb_monitor [-] Hash Ring: Node a3570719-1079-4d61-a0c8-f3171fb07f85 (host: controller-2.redhat.local) handling event "update" for row 3831cbcf-fc7c-4b55-8af4-12e3a3dc21c2 (table: Port_Binding) notify /usr/lib/python3.6/site-packages/networking_ovn/ovsdb/ovsdb_monitor.py:742
2022-05-25 09:11:04.513 15 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: PortBindingChassisEvent(events=('update',), table='Port_Binding', conditions=(('type', '=', 'chassisredirect'),), old_conditions=None) to row=Port_Binding(parent_port=[], chassis=[<ovs.db.idl.Row object at 0x7fb4a760e710>], mac=['fa:16:3e:70:a1:12 10.0.0.220/24 2620:52:0:13b8::1000:21/64'], options={'always-redirect': 'true', 'distributed-port': 'lrp-b0858034-b5e1-475e-a59e-f19ce3191155'}, ha_chassis_group=[], type=chassisredirect, tag=[], requested_chassis=[], tunnel_key=2, up=[True], logical_port=cr-lrp-b0858034-b5e1-475e-a59e-f19ce3191155, gateway_chassis=[], encap=[], external_ids={}, virtual_parent=[], nat_addresses=[], datapath=75657e9e-7e7d-4cb5-95bc-97f0e3a37d9a) old=Port_Binding(chassis=[], up=[False]) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
2022-05-25 09:11:04.554 15 DEBUG networking_ovn.ovsdb.ovsdb_monitor [-] Hash Ring: Node a3570719-1079-4d61-a0c8-f3171fb07f85 (host: controller-2.redhat.local) handling event "update" for row 3831cbcf-fc7c-4b55-8af4-12e3a3dc21c2 (table: Port_Binding) notify /usr/lib/python3.6/site-packages/networking_ovn/ovsdb/ovsdb_monitor.py:742
2022-05-25 09:11:04.557 15 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: PortBindingChassisEvent(events=('update',), table='Port_Binding', conditions=(('type', '=', 'chassisredirect'),), old_conditions=None) to row=Port_Binding(parent_port=[], chassis=[<ovs.db.idl.Row object at 0x7fb4a75b2198>], mac=['fa:16:3e:70:a1:12 10.0.0.220/24...
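
To illustrate the effect described above (a sketch only, with made-up event names and a hypothetical per-event handling time): because the notifications queue is drained in FIFO order by a single consumer, a burst of chassisredirect PortBindingChassisEvent entries sitting ahead of a PortBindingUpdateUpEvent delays the latter by roughly the number of queued events times the per-event handling time, which is how a port "up" notification can arrive minutes after Nova has already given up:
```
import queue

notifications = queue.Queue()

# A burst of a few hundred chassis-flapping events arrives first ...
for _ in range(300):
    notifications.put(('PortBindingChassisEvent', 'cr-lrp-port'))
# ... and only then the event Nova is actually waiting on.
notifications.put(('PortBindingUpdateUpEvent', 'vm-port'))

PER_EVENT_SECONDS = 1.0            # hypothetical average handling time
backlog = 0.0
while not notifications.empty():
    event, port = notifications.get()
    backlog += PER_EVENT_SECONDS
    if event == 'PortBindingUpdateUpEvent':
        break
print('port-up event handled after ~%.0f s of queue backlog' % backlog)
```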


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/843426
Committed: https://opendev.org/openstack/neutron/commit/e6d27be4747eb4573dcc5c0e1e7ac7550d20f951
Submitter: "Zuul (22348)"
Branch: master

commit e6d27be4747eb4573dcc5c0e1e7ac7550d20f951
Author: yatinkarel <email address hidden>
Date: Thu May 26 14:57:48 2022 +0530

    Revert "Use Port_Binding up column to set Neutron port status"

    This reverts commit 37d4195b516f12b683b774f0561561b172dd15c6.
    Conflicts:
            neutron/common/ovn/constants.py
            neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py

    Also revert below 2 commits which were added on
    top of the parent commit:-

    Revert "Ensure subports transition to DOWN"
    This reverts commit 5e036a6b281e4331f396473e299b26b2537d5322.

    Revert "Ensure only the right events are processed"
    This reverts commit 553f462656c2b7ee1e9be6b1e4e7c446c12cc9aa.

    Reason for revert: These patches have caused couple of issues[1][2][3].
    [1][2] are same issue just one is seen in c8/c9-stream and other in
    rhel8 and both contains much info about the issue.
    [3] is currently happening only in rhel8/rhel9 as this issue is visible
    only with the patch in revert and ovn-2021>=21.12.0-55(fix of [4]) which
    is not yet available in c8/c9-stream.

    [1][2] happens randomly as the patch under revert has moved the
    events to SB DB which made a known OVN issue[5] occur more often as in
    that issue SB DB Event queue floods with too many events of
    PortBindingChassisEvent making other events like PortBindingUpdateUpEvent
    to wait much longer and hence triggering VirtualInterfaceCreateException.

    NB DB Event queue is different and hence with revert we are trying to
    lower the side effect of the OVN issue[5].

    This patch can be re reverted once [3] and [5] are fixed.

    [1] https://bugs.launchpad.net/tripleo/+bug/1964940/
    [2] https://bugzilla.redhat.com/show_bug.cgi?id=2081631
    [3] https://bugzilla.redhat.com/show_bug.cgi?id=2090604
    [4] https://bugzilla.redhat.com/show_bug.cgi?id=2037433
    [5] https://bugzilla.redhat.com/show_bug.cgi?id=1974898

    Closes-Bug: #1964940
    Closes-Bug: rhbz#2081631
    Closes-Bug: rhbz#2090604
    Related-Bug: rhbz#2037433
    Related-Bug: rhbz#1974898
    Change-Id: I159460be27f2c5f105be4b2865ef84aeb9a00094

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/neutron/+/843760

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/neutron/+/843761

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/843763

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/neutron/+/843771

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/neutron/+/843778

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/843760
Committed: https://opendev.org/openstack/neutron/commit/3a11ab74402cde842461569515c39c441555f348
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 3a11ab74402cde842461569515c39c441555f348
Author: yatinkarel <email address hidden>
Date: Thu May 26 14:57:48 2022 +0530

    Revert "Use Port_Binding up column to set Neutron port status"

    This reverts commit 37d4195b516f12b683b774f0561561b172dd15c6.
    Conflicts:
            neutron/common/ovn/constants.py
            neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py

    Also revert below 2 commits which were added on
    top of the parent commit:-

    Revert "Ensure subports transition to DOWN"
    This reverts commit 5e036a6b281e4331f396473e299b26b2537d5322.

    Revert "Ensure only the right events are processed"
    This reverts commit 553f462656c2b7ee1e9be6b1e4e7c446c12cc9aa.

    Reason for revert: These patches have caused couple of issues[1][2][3].
    [1][2] are same issue just one is seen in c8/c9-stream and other in
    rhel8 and both contains much info about the issue.
    [3] is currently happening only in rhel8/rhel9 as this issue is visible
    only with the patch in revert and ovn-2021>=21.12.0-55(fix of [4]) which
    is not yet available in c8/c9-stream.

    [1][2] happens randomly as the patch under revert has moved the
    events to SB DB which made a known OVN issue[5] occur more often as in
    that issue SB DB Event queue floods with too many events of
    PortBindingChassisEvent making other events like PortBindingUpdateUpEvent
    to wait much longer and hence triggering VirtualInterfaceCreateException.

    NB DB Event queue is different and hence with revert we are trying to
    lower the side effect of the OVN issue[5].

    This patch can be re reverted once [3] and [5] are fixed.

    [1] https://bugs.launchpad.net/tripleo/+bug/1964940/
    [2] https://bugzilla.redhat.com/show_bug.cgi?id=2081631
    [3] https://bugzilla.redhat.com/show_bug.cgi?id=2090604
    [4] https://bugzilla.redhat.com/show_bug.cgi?id=2037433
    [5] https://bugzilla.redhat.com/show_bug.cgi?id=1974898

    Closes-Bug: #1964940
    Closes-Bug: rhbz#2081631
    Closes-Bug: rhbz#2090604
    Related-Bug: rhbz#2037433
    Related-Bug: rhbz#1974898
    Change-Id: I159460be27f2c5f105be4b2865ef84aeb9a00094
    (cherry picked from commit e6d27be4747eb4573dcc5c0e1e7ac7550d20f951)

tags: added: in-stable-yoga
tags: added: in-stable-xena
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/843761
Committed: https://opendev.org/openstack/neutron/commit/4c5bd288466f8f72e95deca204d08246d9aa9549
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 4c5bd288466f8f72e95deca204d08246d9aa9549
Author: yatinkarel <email address hidden>
Date: Thu May 26 14:57:48 2022 +0530

    Revert "Use Port_Binding up column to set Neutron port status"

    This reverts commit 37d4195b516f12b683b774f0561561b172dd15c6.
    Conflicts:
            neutron/common/ovn/constants.py
            neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py

    Also revert below 2 commits which were added on
    top of the parent commit:-

    Revert "Ensure subports transition to DOWN"
    This reverts commit 5e036a6b281e4331f396473e299b26b2537d5322.

    Revert "Ensure only the right events are processed"
    This reverts commit 553f462656c2b7ee1e9be6b1e4e7c446c12cc9aa.

    Reason for revert: These patches have caused couple of issues[1][2][3].
    [1][2] are same issue just one is seen in c8/c9-stream and other in
    rhel8 and both contains much info about the issue.
    [3] is currently happening only in rhel8/rhel9 as this issue is visible
    only with the patch in revert and ovn-2021>=21.12.0-55(fix of [4]) which
    is not yet available in c8/c9-stream.

    [1][2] happens randomly as the patch under revert has moved the
    events to SB DB which made a known OVN issue[5] occur more often as in
    that issue SB DB Event queue floods with too many events of
    PortBindingChassisEvent making other events like PortBindingUpdateUpEvent
    to wait much longer and hence triggering VirtualInterfaceCreateException.

    NB DB Event queue is different and hence with revert we are trying to
    lower the side effect of the OVN issue[5].

    This patch can be re reverted once [3] and [5] are fixed.

    [1] https://bugs.launchpad.net/tripleo/+bug/1964940/
    [2] https://bugzilla.redhat.com/show_bug.cgi?id=2081631
    [3] https://bugzilla.redhat.com/show_bug.cgi?id=2090604
    [4] https://bugzilla.redhat.com/show_bug.cgi?id=2037433
    [5] https://bugzilla.redhat.com/show_bug.cgi?id=1974898

    Closes-Bug: #1964940
    Closes-Bug: rhbz#2081631
    Closes-Bug: rhbz#2090604
    Related-Bug: rhbz#2037433
    Related-Bug: rhbz#1974898
    Change-Id: I159460be27f2c5f105be4b2865ef84aeb9a00094
    (cherry picked from commit e6d27be4747eb4573dcc5c0e1e7ac7550d20f951)
    (cherry picked from commit 3a11ab74402cde842461569515c39c441555f348)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/843763
Committed: https://opendev.org/openstack/neutron/commit/1648d40fe3e43f4ea2ce515e73c6fc891e248ef6
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 1648d40fe3e43f4ea2ce515e73c6fc891e248ef6
Author: yatinkarel <email address hidden>
Date: Thu May 26 14:57:48 2022 +0530

    Revert "Use Port_Binding up column to set Neutron port status"

    This reverts commit 37d4195b516f12b683b774f0561561b172dd15c6.
    Conflicts:
            neutron/common/ovn/constants.py
            neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py

    Also revert below 2 commits which were added on
    top of the parent commit:-

    Revert "Ensure subports transition to DOWN"
    This reverts commit 5e036a6b281e4331f396473e299b26b2537d5322.

    Revert "Ensure only the right events are processed"
    This reverts commit 553f462656c2b7ee1e9be6b1e4e7c446c12cc9aa.

    Reason for revert: These patches have caused couple of issues[1][2][3].
    [1][2] are same issue just one is seen in c8/c9-stream and other in
    rhel8 and both contains much info about the issue.
    [3] is currently happening only in rhel8/rhel9 as this issue is visible
    only with the patch in revert and ovn-2021>=21.12.0-55(fix of [4]) which
    is not yet available in c8/c9-stream.

    [1][2] happens randomly as the patch under revert has moved the
    events to SB DB which made a known OVN issue[5] occur more often as in
    that issue SB DB Event queue floods with too many events of
    PortBindingChassisEvent making other events like PortBindingUpdateUpEvent
    to wait much longer and hence triggering VirtualInterfaceCreateException.

    NB DB Event queue is different and hence with revert we are trying to
    lower the side effect of the OVN issue[5].

    This patch can be re reverted once [3] and [5] are fixed.

    [1] https://bugs.launchpad.net/tripleo/+bug/1964940/
    [2] https://bugzilla.redhat.com/show_bug.cgi?id=2081631
    [3] https://bugzilla.redhat.com/show_bug.cgi?id=2090604
    [4] https://bugzilla.redhat.com/show_bug.cgi?id=2037433
    [5] https://bugzilla.redhat.com/show_bug.cgi?id=1974898

    Closes-Bug: #1964940
    Closes-Bug: rhbz#2081631
    Closes-Bug: rhbz#2090604
    Related-Bug: rhbz#2037433
    Related-Bug: rhbz#1974898
    Change-Id: I159460be27f2c5f105be4b2865ef84aeb9a00094
    (cherry picked from commit e6d27be4747eb4573dcc5c0e1e7ac7550d20f951)
    (cherry picked from commit 3a11ab74402cde842461569515c39c441555f348)
    (cherry picked from commit 4c5bd288466f8f72e95deca204d08246d9aa9549)
    Conflicts: neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py
            neutron/tests/functional/plugins/ml2/drivers/ovn/mech_driver/ovsdb/test_ovsdb_monitor.py

tags: added: in-stable-wallaby
Revision history for this message
Marios Andreou (marios-b) wrote :

Waiting for these to merge: https://review.opendev.org/q/I159460be27f2c5f105be4b2865ef84aeb9a00094

The master one merged a couple of days ago, so it must already be available to the periodic jobs, but I have to dig a bit and confirm it has made it through the network component already.

In any case we'll need to keep this open for a few more days to verify that the fixes are good.

Thank you ykarel !

Revision history for this message
Marios Andreou (marios-b) wrote (last edit ):

I checked that the patches from https://review.opendev.org/q/topic:bug%252F1964940 are already in the promoted network components for master/wallaby/train; notes & links at https://gist.github.com/marios/94314e2a7e72d05c690682b66514181b

So we need to watch the integration lines over the next few days to see if the issue is addressed.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/843771
Committed: https://opendev.org/openstack/neutron/commit/3034b0403d95a163f1d31274bd7327cce6dd71bd
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 3034b0403d95a163f1d31274bd7327cce6dd71bd
Author: yatinkarel <email address hidden>
Date: Thu May 26 14:57:48 2022 +0530

    Revert "Use Port_Binding up column to set Neutron port status"

    This reverts commit 37d4195b516f12b683b774f0561561b172dd15c6.
    Conflicts:
            neutron/common/ovn/constants.py
            neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py

    Also revert below 2 commits which were added on
    top of the parent commit:-

    Revert "Ensure subports transition to DOWN"
    This reverts commit 5e036a6b281e4331f396473e299b26b2537d5322.

    Revert "Ensure only the right events are processed"
    This reverts commit 553f462656c2b7ee1e9be6b1e4e7c446c12cc9aa.

    Reason for revert: These patches have caused couple of issues[1][2][3].
    [1][2] are same issue just one is seen in c8/c9-stream and other in
    rhel8 and both contains much info about the issue.
    [3] is currently happening only in rhel8/rhel9 as this issue is visible
    only with the patch in revert and ovn-2021>=21.12.0-55(fix of [4]) which
    is not yet available in c8/c9-stream.

    [1][2] happens randomly as the patch under revert has moved the
    events to SB DB which made a known OVN issue[5] occur more often as in
    that issue SB DB Event queue floods with too many events of
    PortBindingChassisEvent making other events like PortBindingUpdateUpEvent
    to wait much longer and hence triggering VirtualInterfaceCreateException.

    NB DB Event queue is different and hence with revert we are trying to
    lower the side effect of the OVN issue[5].

    This patch can be re reverted once [3] and [5] are fixed.

    [1] https://bugs.launchpad.net/tripleo/+bug/1964940/
    [2] https://bugzilla.redhat.com/show_bug.cgi?id=2081631
    [3] https://bugzilla.redhat.com/show_bug.cgi?id=2090604
    [4] https://bugzilla.redhat.com/show_bug.cgi?id=2037433
    [5] https://bugzilla.redhat.com/show_bug.cgi?id=1974898

    Closes-Bug: #1964940
    Closes-Bug: rhbz#2081631
    Closes-Bug: rhbz#2090604
    Related-Bug: rhbz#2037433
    Related-Bug: rhbz#1974898
    Change-Id: I159460be27f2c5f105be4b2865ef84aeb9a00094
    (cherry picked from commit e6d27be4747eb4573dcc5c0e1e7ac7550d20f951)
    (cherry picked from commit 3a11ab74402cde842461569515c39c441555f348)
    (cherry picked from commit 4c5bd288466f8f72e95deca204d08246d9aa9549)
    Conflicts: neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py
            neutron/tests/functional/plugins/ml2/drivers/ovn/mech_driver/ovsdb/test_ovsdb_monitor.py
    (cherry picked from commit 1648d40fe3e43f4ea2ce515e73c6fc891e248ef6)
    Conflicts: neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py
            neutron/tests/functional/plugins/ml2/drivers/ovn/mech_driver/ovsdb/test_ovsdb_monitor.py
            neutron/tests/unit/plugins/ml2/drivers/ovn/mech_driver/ovsdb/test_ovsdb_monitor.py

tags: added: in-stable-victoria
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/843778
Committed: https://opendev.org/openstack/neutron/commit/63cdd1a5b9d982dcaee0d632225b3f7ff1fbf88c
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 63cdd1a5b9d982dcaee0d632225b3f7ff1fbf88c
Author: yatinkarel <email address hidden>
Date: Thu May 26 14:57:48 2022 +0530

    Revert "Use Port_Binding up column to set Neutron port status"

    This reverts commit 37d4195b516f12b683b774f0561561b172dd15c6.
    Conflicts:
            neutron/common/ovn/constants.py
            neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py

    Also revert below 2 commits which were added on
    top of the parent commit:-

    Revert "Ensure subports transition to DOWN"
    This reverts commit 5e036a6b281e4331f396473e299b26b2537d5322.

    Revert "Ensure only the right events are processed"
    This reverts commit 553f462656c2b7ee1e9be6b1e4e7c446c12cc9aa.

    Reason for revert: These patches have caused couple of issues[1][2][3].
    [1][2] are same issue just one is seen in c8/c9-stream and other in
    rhel8 and both contains much info about the issue.
    [3] is currently happening only in rhel8/rhel9 as this issue is visible
    only with the patch in revert and ovn-2021>=21.12.0-55(fix of [4]) which
    is not yet available in c8/c9-stream.

    [1][2] happens randomly as the patch under revert has moved the
    events to SB DB which made a known OVN issue[5] occur more often as in
    that issue SB DB Event queue floods with too many events of
    PortBindingChassisEvent making other events like PortBindingUpdateUpEvent
    to wait much longer and hence triggering VirtualInterfaceCreateException.

    NB DB Event queue is different and hence with revert we are trying to
    lower the side effect of the OVN issue[5].

    This patch can be re reverted once [3] and [5] are fixed.

    [1] https://bugs.launchpad.net/tripleo/+bug/1964940/
    [2] https://bugzilla.redhat.com/show_bug.cgi?id=2081631
    [3] https://bugzilla.redhat.com/show_bug.cgi?id=2090604
    [4] https://bugzilla.redhat.com/show_bug.cgi?id=2037433
    [5] https://bugzilla.redhat.com/show_bug.cgi?id=1974898

    Closes-Bug: #1964940
    Closes-Bug: rhbz#2081631
    Closes-Bug: rhbz#2090604
    Related-Bug: rhbz#2037433
    Related-Bug: rhbz#1974898
    Change-Id: I159460be27f2c5f105be4b2865ef84aeb9a00094
    (cherry picked from commit e6d27be4747eb4573dcc5c0e1e7ac7550d20f951)
    (cherry picked from commit 3a11ab74402cde842461569515c39c441555f348)
    (cherry picked from commit 4c5bd288466f8f72e95deca204d08246d9aa9549)
    Conflicts: neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py
            neutron/tests/functional/plugins/ml2/drivers/ovn/mech_driver/ovsdb/test_ovsdb_monitor.py
    (cherry picked from commit 1648d40fe3e43f4ea2ce515e73c6fc891e248ef6)
    Conflicts: neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py
            neutron/tests/functional/plugins/ml2/drivers/ovn/mech_driver/ovsdb/test_ovsdb_monitor.py
            neutron/tests/unit/plugins/ml2/drivers/ovn/mech_driver/ovsdb/test_ovsdb_monitor.py
    (cherry picked from...


tags: added: in-stable-ussuri
Revision history for this message
Soniya Murlidhar Vyas (svyas) wrote :

fs01 Wallaby is still failing [1]. The following is the traceback observed:

2022-06-12 23:31:15.746 ERROR /var/log/tempest/stestr_results.html: 217460 ERROR tempest.api.compute.base Traceback (most recent call last):
2022-06-12 23:31:15.746 ERROR /var/log/tempest/stestr_results.html: 217460 ERROR tempest.api.compute.base File "/usr/lib/python3.9/site-packages/tempest/api/compute/base.py", line 453, in delete_server
2022-06-12 23:31:15.746 ERROR /var/log/tempest/stestr_results.html: 217460 ERROR tempest.api.compute.base cls.servers_client.delete_server(server_id)
2022-06-12 23:31:15.746 ERROR /var/log/tempest/stestr_results.html: 217460 ERROR tempest.api.compute.base File "/usr/lib/python3.9/site-packages/tempest/lib/services/compute/servers_client.py", line 170, in delete_server
2022-06-12 23:31:15.746 ERROR /var/log/tempest/stestr_results.html: 217460 ERROR tempest.api.compute.base resp, body = self.delete("servers/%s" % server_id)
2022-06-12 23:31:15.746 ERROR /var/log/tempest/stestr_results.html: 217460 ERROR tempest.api.compute.base File "/usr/lib/python3.9/site-packages/tempest/lib/common/rest_client.py", line 330, in delete
2022-06-12 23:31:15.746 ERROR /var/log/tempest/stestr_results.html: 217460 ERROR tempest.api.compute.base return self.request('DELETE', url, extra_headers, headers, body)
2022-06-12 23:31:15.746 ERROR /var/log/tempest/stestr_results.html: 217460 ERROR tempest.api.compute.base File "/usr/lib/python3.9/site-packages/tempest/lib/services/compute/base_compute_client.py", line 47, in request
2022-06-12 23:31:15.746 ERROR /var/log/tempest/stestr_results.html: 217460 ERROR tempest.api.compute.base resp, resp_body = super(BaseComputeClient, self).request(
2022-06-12 23:31:15.746 ERROR /var/log/tempest/stestr_results.html: 217460 ERROR tempest.api.compute.base File "/usr/lib/python3.9/site-packages/tempest/lib/common/rest_client.py", line 703, in request
2022-06-12 23:31:15.746 ERROR /var/log/tempest/stestr_results.html: 217460 ERROR tempest.api.compute.base self._error_checker(resp, resp_body)
2022-06-12 23:31:15.746 ERROR /var/log/tempest/stestr_results.html: 217460 ERROR tempest.api.compute.base File "/usr/lib/python3.9/site-packages/tempest/lib/common/rest_client.py", line 879, in _error_checker
2022-06-12 23:31:15.746 ERROR /var/log/tempest/stestr_results.html: 217460 ERROR tempest.api.compute.base raise exceptions.ServerFault(resp_body, resp=resp,
2022-06-12 23:31:15.746 ERROR /var/log/tempest/stestr_results.html: 217460 ERROR tempest.api.compute.base tempest.lib.exceptions.ServerFault: Got server fault
2022-06-12 23:31:15.746 ERROR /var/log/tempest/stestr_results.html: 217460 ERROR tempest.api.compute.base Details: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
2022-06-12 23:31:15.746 ERROR /var/log/tempest/stestr_results.html: 217460 ERROR tempest.api.compute.base <class 'oslo_db.exception.DBConnectionError'>
2022-06-12 23:31:15.746 ERROR /var/log/tempest/stestr_results.html: 217460 ERROR tempest.api.compute.base

[1]https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tri...


Revision history for this message
yatin (yatinkarel) wrote :

<< fs01 wallaby still failing[1]. Following is the traceback observed:
@soniya That is failing with a different issue, so it can be tracked separately.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 18.5.0

This issue was fixed in the openstack/neutron 18.5.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 19.4.0

This issue was fixed in the openstack/neutron 19.4.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 20.2.0

This issue was fixed in the openstack/neutron 20.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 21.0.0.0rc1

This issue was fixed in the openstack/neutron 21.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/networking-ovn train-eol

This issue was fixed in the openstack/networking-ovn train-eol release.

Revision history for this message
Alan Pevec (apevec) wrote :

Closing old promotion-blocker; fixed in Neutron.

Changed in tripleo:
status: In Progress → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron ussuri-eol

This issue was fixed in the openstack/neutron ussuri-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron victoria-eom

This issue was fixed in the openstack/neutron victoria-eom release.
