[OVN] Frequent DB leader changes cause 'VIF creation failed' on nova side

Bug #1969592 reported by Oleg Konobeev
Affects: OpenStack Compute (nova) | Status: Incomplete | Importance: Undecided | Assigned to: Unassigned
Affects: neutron | Status: Fix Released | Importance: Undecided | Assigned to: Unassigned

Bug Description

Hello guys.

As I found, in 2021 there was a commit to OVS [in OVS v2.16, or 2.15 with that backport] that changed behavior during OVN DB snapshotting.
Now, before the leader creates a snapshot, it transfers leadership to another node.

We've run tests with rally and tempest, and it looks like there is now a problem in the interaction between nova and neutron.
For example, a simple rally test like 'create {network,router,subnet} -> add interface to router' looks okay even with 256 concurrent threads running the same test. But something like 'neutron.create_subnet -> nova.boot_server -> nova.attach_interface' fails whenever a leadership transfer happens.
Since this happens so often [every 10m + rand(10)], we get a lot of errors.
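
For reference, a rough openstacksdk sketch of the failing sequence (the actual tests were rally/tempest scenarios; the cloud name, image and flavor IDs below are placeholders):

import openstack

conn = openstack.connect(cloud="mycloud")  # assumes a clouds.yaml entry named "mycloud"

net = conn.network.create_network(name="repro-net")
conn.network.create_subnet(network_id=net.id, ip_version=4,
                           cidr="192.0.2.0/24", name="repro-subnet")

server = conn.compute.create_server(
    name="repro-vm", image_id="IMAGE_ID", flavor_id="FLAVOR_ID",
    networks=[{"uuid": net.id}])
server = conn.compute.wait_for_server(server)

# Either the boot above or this attach can time out waiting for
# network-vif-plugged when a leadership transfer happens mid-operation.
conn.compute.create_server_interface(server, net_id=net.id)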

This problem can be observed on all versions where OVS 2.16 or higher [or the backport] is involved :)

Some tracing from logs [neutron, nova, ovn-sb-db]:

CONTROL-NODES:

ctl01-ovn-sb-db.log:2022-04-19T12:30:03.089Z|01002|raft|INFO|Transferring leadership to write a snapshot.
ctl01-ovn-sb-db.log:2022-04-19T12:30:03.099Z|01003|raft|INFO|server 1c5f is leader for term 42

ctl03-ovn-sb-db.log:2022-04-19T12:30:03.090Z|00938|raft|INFO|received leadership transfer from 1f46 in term 41
ctl03-ovn-sb-db.log:2022-04-19T12:30:03.092Z|00940|raft|INFO|term 42: elected leader by 2+ of 3 servers
ctl03-ovn-sb-db.log:2022-04-19T12:30:10.941Z|00941|jsonrpc|WARN|tcp:xx.yy.zz.26:41882: send error: Connection reset by peer
ctl03-ovn-sb-db.log:2022-04-19T12:30:27.324Z|00943|jsonrpc|WARN|tcp:xx.yy.zz.26:41896: send error: Connection reset by peer
ctl03-ovn-sb-db.log:2022-04-19T12:30:27.325Z|00945|jsonrpc|WARN|tcp:xx.yy.zz.26:41880: send error: Connection reset by peer
ctl03-ovn-sb-db.log:2022-04-19T12:30:27.325Z|00947|jsonrpc|WARN|tcp:xx.yy.zz.26:41892: send error: Connection reset by peer
ctl03-ovn-sb-db.log:2022-04-19T12:30:27.327Z|00949|jsonrpc|WARN|tcp:xx.yy.zz.26:41884: send error: Connection reset by peer
ctl03-ovn-sb-db.log:2022-04-19T12:31:49.244Z|00951|jsonrpc|WARN|tcp:xx.yy.zz.25:40260: send error: Connection timed out
ctl03-ovn-sb-db.log:2022-04-19T12:31:49.244Z|00953|jsonrpc|WARN|tcp:xx.yy.zz.25:40264: send error: Connection timed out
ctl03-ovn-sb-db.log:2022-04-19T12:31:49.244Z|00955|jsonrpc|WARN|tcp:xx.yy.zz.24:37440: send error: Connection timed out
ctl03-ovn-sb-db.log:2022-04-19T12:31:49.244Z|00957|jsonrpc|WARN|tcp:xx.yy.zz.24:37442: send error: Connection timed out
ctl03-ovn-sb-db.log:2022-04-19T12:31:49.245Z|00959|jsonrpc|WARN|tcp:xx.yy.zz.24:37446: send error: Connection timed out
ctl03-ovn-sb-db.log:2022-04-19T12:32:01.533Z|01001|jsonrpc|WARN|tcp:xx.yy.zz.67:57586: send error: Connection timed out

2022-04-19 12:30:08.898 27 INFO neutron.db.ovn_revision_numbers_db [req-7fcfdd74-482d-46b2-9f76-07190669d76d ff1516be452b4b939314bf3864a63f35 9d3ae9a7b121488285203b0fdeabc3a3 - default default] Successfully bumped revision number for resource be178a9a-26d7-4bf0-a4e8-d206a6965205 (type: ports) to 1

2022-04-19 12:30:09.644 27 INFO neutron.db.ovn_revision_numbers_db [req-a8278418-3ad9-450c-89bb-e7a5c1c0a06d a9864cd890224c079051b3f56021be64 72db34087b9b401d842b66643b647e16 - default default] Successfully bumped revision number for resource be178a9a-26d7-4bf0-a4e8-d206a6965205 (type: ports) to 2

2022-04-19 12:30:10.235 27 INFO neutron.wsgi [req-571b53cc-ca04-46f7-89f9-fdf8e5931f4c a9864cd890224c079051b3f56021be64 72db34087b9b401d842b66643b647e16 - default default] xx.yy.zz.68,xx.yy.zz.26 "GET /v2.0/ports?tenant_id=9d3ae9a7b121488285203b0fdeabc3a3&device_id=7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34 HTTP/1.1" status: 200 len: 1081 time: 0.0363808

2022-04-19 12:30:10.395 25 INFO nova.api.openstack.compute.server_external_events [req-d144458a-ca21-4e92-b5be-4f589e118d8c 4688fc44551b45c6a761c9c0351e76a8 72db34087b9b401d842b66643b647e16 - default default] Creating event network-changed:be178a9a-26d7-4bf0-a4e8-d206a6965205 for instance 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34 on cmp-ucs-01-05.poc.example.com

2022-04-19 12:30:10.402 27 INFO neutron.notifiers.nova [-] Nova event response: {'name': 'network-changed', 'server_uuid': '7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34', 'tag': 'be178a9a-26d7-4bf0-a4e8-d206a6965205', 'status': 'completed', 'code': 200}

2022-04-19 12:33:01.497 34 INFO neutron.db.ovn_revision_numbers_db [req-82f3d4fe-1276-4918-86d3-55cc42ca2816 - - - - -] Successfully bumped revision number for resource be178a9a-26d7-4bf0-a4e8-d206a6965205 (type: ports) to 3

COMPUTE-NODES:

ovs-vswitchd.log:2022-04-19T12:30:10.743Z|08415|bridge|INFO|bridge br-int: added interface tapbe178a9a-26 on port 874
ovn-controller.log:2022-04-19T12:30:11.251Z|02356|binding|INFO|Claiming lport be178a9a-26d7-4bf0-a4e8-d206a6965205 for this chassis.
ovn-controller.log:2022-04-19T12:30:11.251Z|02357|binding|INFO|be178a9a-26d7-4bf0-a4e8-d206a6965205: Claiming fa:16:3e:8f:3b:47 1.80.1.254

nova-compute.log:2022-04-19 12:30:10.672 8 INFO os_vif [req-2c8b5f82-ce4a-4aa7-9f0a-d346ee210ffb ff1516be452b4b939314bf3864a63f35 9d3ae9a7b121488285203b0fdeabc3a3 - default default] Successfully plugged vif VIFOpenVSwitch(active=False,address=fa:16:3e:8f:3b:47,bridge_name='br-int',has_traffic_filtering=True,id=be178a9a-26d7-4bf0-a4e8-d206a6965205,network=Network(5dd2512b-56f8-4e04-8623-971a633a76aa),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tapbe178a9a-26')

ovs-vswitchd.log:2022-04-19T12:35:11.302Z|08448|bridge|INFO|bridge br-int: deleted interface tapbe178a9a-26 on port 874
ovn-controller.log:2022-04-19T12:35:11.307Z|02380|binding|INFO|Releasing lport be178a9a-26d7-4bf0-a4e8-d206a6965205 from this chassis.

2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [req-2c8b5f82-ce4a-4aa7-9f0a-d346ee210ffb ff1516be452b4b939314bf3864a63f35 9d3ae9a7b121488285203b0fdeabc3a3 - default default] [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] Failed to allocate network(s): nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] Traceback (most recent call last):
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 7260, in _create_guest_with_network
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] post_xml_callback=post_xml_callback)
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] File "/usr/lib64/python3.6/contextlib.py", line 88, in __exit__
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] next(self.gen)
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py", line 479, in wait_for_instance_event
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] actual_event = event.wait()
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] File "/var/lib/kolla/venv/lib/python3.6/site-packages/eventlet/event.py", line 125, in wait
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] result = hub.switch()
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] File "/var/lib/kolla/venv/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 313, in switch
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] return self.greenlet.switch()
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] eventlet.timeout.Timeout: 300 seconds
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34]
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] During handling of the above exception, another exception occurred:
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34]
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] Traceback (most recent call last):
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py", line 2402, in _build_and_run_instance
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] accel_info=accel_info)
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 4225, in spawn
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] cleanup_instance_disks=created_disks)
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 7283, in _create_guest_with_network
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] raise exception.VirtualInterfaceCreateException()
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed
2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34]
2022-04-19 12:35:11.844 8 ERROR nova.compute.manager [req-2c8b5f82-ce4a-4aa7-9f0a-d346ee210ffb ff1516be452b4b939314bf3864a63f35 9d3ae9a7b121488285203b0fdeabc3a3 - default default] [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] Build of instance 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34 aborted: Failed to allocate the network(s), not rescheduling.: nova.exception.BuildAbortException: Build of instance 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34 aborted: Failed to allocate the network(s), not rescheduling.

CONTROL:

2022-04-19 12:35:11.330 25 INFO neutron.plugins.ml2.drivers.ovn.mech_driver.mech_driver [req-2b4d10b1-3a7a-4683-97df-990259705b46 - - - - -] OVN reports status down for port: be178a9a-26d7-4bf0-a4e8-d206a6965205

2022-04-19 12:35:12.524 23 WARNING neutron.notifiers.nova [-] Nova event: {'server_uuid': '7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34', 'name': 'network-vif-deleted', 'tag': 'be178a9a-26d7-4bf0-a4e8-d206a6965205', 'status': 'failed', 'code': 422} returned with failed status

/var/log/kolla/neutron/neutron-server.log:2022-04-19 12:35:13.016 25 WARNING neutron.notifiers.nova [-] Nova event: {'server_uuid': '7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34', 'name': 'network-vif-plugged', 'status': 'failed', 'tag': 'be178a9a-26d7-4bf0-a4e8-d206a6965205', 'code': 422} returned with failed status

Tags: neutron ovn
Balazs Gibizer (balazs-gibizer) wrote :

Nova compute waited 300 seconds for a network-vif-plugged event and then timed out:

2022-04-19 12:35:11.842 8 ERROR nova.compute.manager [instance: 7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34] eventlet.timeout.Timeout: 300 seconds
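
(For illustration, a minimal self-contained sketch of the wait-with-timeout pattern shown in the traceback; the names and the 5-second timeout are made up and this is not nova's actual code. Nova's default vif_plugging_timeout is 300 seconds.)

import eventlet
from eventlet import event

vif_plugged = event.Event()

# Uncomment to simulate neutron delivering the event in time:
# eventlet.spawn_after(1, vif_plugged.send, "network-vif-plugged")

try:
    with eventlet.Timeout(5):  # nova waits CONF.vif_plugging_timeout (default 300 s)
        print("got event:", vif_plugged.wait())
except eventlet.Timeout:
    # If the event never arrives (e.g. the OVN port stays down after the
    # leadership transfer), nova raises VirtualInterfaceCreateException.
    print("timed out waiting for network-vif-plugged")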

Do you have a bit more logs from the nova-compute service? If you have debug logging enabled, then from the nova-compute logs we can piece together which ports nova got the plugged event for and which port the event was missing for.

I'm not sure, but this might mean that neutron never sent the plugged event because the port was down in the OVN backend:
2022-04-19 12:35:11.330 25 INFO neutron.plugins.ml2.drivers.ovn.mech_driver.mech_driver [req-2b4d10b1-3a7a-4683-97df-990259705b46 - - - - -] OVN reports status down for port: be178a9a-26d7-4bf0-a4e8-d206a6965205

Somebody with neutron + OVN knowledge should look at this.

Changed in nova:
status: New → Incomplete
Terry Wilson (otherwiseguy) wrote :

Basically, you can't successfully use RAFT unless you have python-ovs 2.17 (which hasn't been released to PyPI yet because it accidentally broke Neutron/ovsdbapp).

This https://patchwork.ozlabs<email address hidden>/ added monitor-cond-since support to python-ovs (it had existed in the C IDL library for as long as RAFT had), which allows reconnecting and receiving only the transactions that happened since you disconnected. This is important because during ovsdb-server snapshotting, which happens every 100 txns, ovsdb-server currently does a leadership change that breaks the connection if you are connected to the leader. Without monitor-cond-since, the entire DB is downloaded again on reconnect, which completely destroys performance.
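
(To illustrate the mechanics, a rough python-ovs IDL sketch, not the ovsdbapp/neutron code path; the remote address and schema path below are assumptions. With monitor-cond-since, a reconnect after the snapshot-triggered leadership transfer only replays the transactions missed while disconnected, instead of re-downloading the whole DB.)

from ovs import poller as ovs_poller
from ovs.db import idl

SB_REMOTE = "tcp:127.0.0.1:6642"               # assumed southbound endpoint
SB_SCHEMA = "/usr/share/ovn/ovn-sb.ovsschema"  # assumed schema location

helper = idl.SchemaHelper(location=SB_SCHEMA)
helper.register_table("Port_Binding")          # monitor only the tables we need

# leader_only=True pins the session to the RAFT leader, so every
# "Transferring leadership to write a snapshot" drops and re-establishes it.
conn = idl.Idl(SB_REMOTE, helper, leader_only=True)

while True:
    p = ovs_poller.Poller()
    conn.run()    # processes reconnects and incremental updates
    conn.wait(p)
    p.block()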

I need to convince the OVS team to backport monitor-cond-since (and related fixes) to previous OVS versions, but right now ovs 2.17 (really master until the release is made) and https://review.opendev.org/c/openstack/ovsdbapp/+/834729 and https://review.opendev.org/c/openstack/neutron/+/834613 are needed to use RAFT.

Terry Wilson (otherwiseguy) wrote :

Once the backports for the above ovsdbapp/neutron fixes are done, I'll release python-ovs 2.17 (I also have https://patchwork.ozlabs<email address hidden>/, which would resolve the API breakage in 2.17 from the OVS side; it has a couple of acks and needs to be merged).

Oleg Konobeev (choooze) wrote :

Hey Balazs.

I've attached nova-compute logs [debug severity] collected while the problem occurred.
These are only the logs for one instance, unrelated to the previous logs, but all the symptoms and the other logs are the same.

Oleg Konobeev (choooze) wrote :

Terry Wilson, thank you for this info.

Does that mean only performance is suffering? We are currently observing degradation of basic functionality [attaching an interface to an instance], not performance issues.

Anyway, we will wait for the backports or python-ovs 2.17.

Balazs Gibizer (balazs-gibizer) wrote (last edit ):

@Oleg: thanks for the logs. Here is the relevant part of them:

$ egrep -e 'Starting instance..|Successfully created port:|Successfully updated port|Preparing to wait for external event network-vif-plugged-|Received event|nova.virt.libvirt.vif.* plug|Timeout waiting for' nova-compute-debug-log
2022-04-27 22:01:03.119 7 DEBUG nova.compute.manager [req-be79d0ca-7aaf-4514-96b5-ec553777a322 f08d31bec4e648e0a70a7a6df4eeea0e 1c168227db7d4053abb49960a6033124 - default default] [instance: 71971196-9aea-42bb-9a33-7f0dcff07c90] Starting instance... _do_build_and_run_instance /var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py:2197
2022-04-27 22:01:04.215 7 DEBUG nova.network.neutron [req-be79d0ca-7aaf-4514-96b5-ec553777a322 f08d31bec4e648e0a70a7a6df4eeea0e 1c168227db7d4053abb49960a6033124 - default default] [instance: 71971196-9aea-42bb-9a33-7f0dcff07c90] Successfully created port: 13c6ac02-b1fa-4d8a-90e4-bdb3a4dd8a36 _create_port_minimal /var/lib/kolla/venv/lib/python3.6/site-packages/nova/network/neutron.py:533
2022-04-27 22:01:05.291 7 DEBUG nova.network.neutron [req-be79d0ca-7aaf-4514-96b5-ec553777a322 f08d31bec4e648e0a70a7a6df4eeea0e 1c168227db7d4053abb49960a6033124 - default default] [instance: 71971196-9aea-42bb-9a33-7f0dcff07c90] Successfully updated port: 13c6ac02-b1fa-4d8a-90e4-bdb3a4dd8a36 _update_port /var/lib/kolla/venv/lib/python3.6/site-packages/nova/network/neutron.py:571
2022-04-27 22:01:05.792 7 DEBUG nova.compute.manager [req-be79d0ca-7aaf-4514-96b5-ec553777a322 f08d31bec4e648e0a70a7a6df4eeea0e 1c168227db7d4053abb49960a6033124 - default default] [instance: 71971196-9aea-42bb-9a33-7f0dcff07c90] Preparing to wait for external event network-vif-plugged-13c6ac02-b1fa-4d8a-90e4-bdb3a4dd8a36 prepare_for_instance_event /var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py:281
2022-04-27 22:01:05.795 7 DEBUG nova.virt.libvirt.vif [req-be79d0ca-7aaf-4514-96b5-ec553777a322 f08d31bec4e648e0a70a7a6df4eeea0e 1c168227db7d4053abb49960a6033124 - default default] vif_type=ovs instance=Instance(access_ip_v4=None,access_ip_v6=None,architecture=None,auto_disk_config=False,availability_zone='nova',cell_name=None,cleaned=False,config_drive='',created_at=2022-04-27T19:01:02Z,default_ephemeral_device=None,default_swap_device=None,deleted=False,deleted_at=None,device_metadata=None,disable_terminate=False,display_description='s_rally_b673429f_y9geXNk8',display_name='s_rally_b673429f_y9geXNk8',ec2_ids=EC2Ids,ephemeral_gb=0,ephemeral_key_uuid=None,fault=<?>,flavor=Flavor(525),hidden=False,host='cmp01.dev.example.com',hostname='s-rally-b673429f-y9gexnk8',id=4143,image_ref='d0842e59-243a-4a8b-b092-7f3ff67b862e',info_cache=InstanceInfoCache,instance_type_id=525,kernel_id='',key_data=None,key_name=None,keypairs=KeyPairList,launch_index=0,launched_at=None,launched_on='cmp01.dev.example.com',locked=False,locked_by=None,memory_mb=1024,metadata={},migration_context=None,new_flavor=None,node='cmp01.dev.example.com',numa_topology=None,old_flavor=None,os_type='linux',pci_devices=PciDeviceList,pci_requests=InstancePCIRequests,power_state=0,progress=0,project_id='1c168227db7d4053abb49960a6033124',ramdisk_id='',reservation_id='r-9i0hurwj'...


yatin (yatinkarel) wrote :

There have been many fixes related to the reported issue in neutron and OVS since this bug report; some that I quickly caught are:
- https://patchwork.ozlabs.org<email address hidden>/
- https://review.opendev.org/c/openstack/ovsdbapp/+/856200
- https://review.opendev.org/c/openstack/ovsdbapp/+/862524
- https://review.opendev.org/c/openstack/neutron/+/857775
- https://review.opendev.org/c/openstack/neutron/+/871825

One issue that I know is still open, but that happens very rarely: https://bugzilla.redhat.com/show_bug.cgi?id=2214289

Closing this based on the above and comments #2 and #3. If the issues are still seen with python-ovs >= 2.17 and the above fixes included, please feel free to reopen the issue along with neutron and nova debug logs.

Changed in neutron:
status: New → Fix Released