neutron_ovn_metadata_agent doesn't work after reconnect to new OVN Southbound leader DB

Bug #1953510 reported by olmy0414
Affects: neutron
Status: New
Importance: Medium
Assigned to: Rodolfo Alonso
Milestone: (none)

Bug Description

After a restart of the current OVN SB_DB leader, neutron_ovn_metadata_agent reconnects to the new SB_DB leader, according to the agent's logs.
But then, when new instances are spawned, they do not obtain metadata from Nova.
The metadata agent does not create any OVN namespaces or related ports, and there are no errors or other relevant messages in the agent's log.
As a workaround, the issue can be resolved by restarting neutron_ovn_metadata_agent.
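For example, with this kolla-ansible deployment the workaround is just a container restart (assuming the default container name used by kolla-ansible):

# restart the metadata agent container on the affected compute node
docker restart neutron_ovn_metadata_agent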

How to reproduce it:
Restart the OVN SB_DB leader and try to spawn a new OpenStack instance.
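A minimal way to do that, assuming the kolla-ansible defaults (ovn_sb_db container name, ovsdb-server control socket at /var/run/ovn/ovnsb_db.ctl):

# find out which node currently holds the SB leadership
docker exec ovn_sb_db ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
# restart the container on the leader node to force a new election
docker restart ovn_sb_db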

Environment:
OpenStack is deployed with kolla-ansible (v12.2.0)
Docker - 20.10.8
Container types - ubuntu/binary

Version:
* OS: Ubuntu 20.04.3 LTS
* Kernel: 5.4.0-89-generic
* OpenStack version: Wallaby
* Neutron: 18.1.1
* OVS: 2.15.0
* OVN: 20.12.0

Tags: ovn
Bartosz Bezak (bbezak) wrote:

I can observe similar behaviour on Wallaby with OVN 21.06 and OVS 2.16 on CentOS 8 Stream.
All neutron-ovn-metadata-agent instances are listed as XXX in openstack network agent list after a leader election occurs.

neutron-ovn-metadata-agent works fine again after a restart.
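For reference, this can be checked with the following (the output itself is omitted here):

openstack network agent list
# the Alive column shows XXX for every neutron-ovn-metadata-agent until the agent is restarted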

summary:
- neutron_ovn_metadata_agent don't work after reconnect to new OVN Southbound leader DB
+ neutron_ovn_metadata_agent doesn't work after reconnect to new OVN Southbound leader DB
description: updated
Bartosz Bezak (bbezak) wrote (last edit):

It looks like this will also have a higher impact with OVS 2.16, which performs an automatic leadership transfer when writing DB snapshots - https://github.com/openvswitch/ovs/commit/3c2d6274bceecb65ec8f2f93f2aac26897a7ddfe

OVN SB LOG:
2021-12-13T21:27:01.252Z|00099|raft|INFO|Transferring leadership to write a snapshot.

NEUTRON OVN METADATA LOG:
2021-12-13 21:27:01.326 7 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:192.168.2.11:6642: connection closed by client
2021-12-13 21:27:01.326 7 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:192.168.2.11:6642: waiting 4 seconds before reconnect
2021-12-13 21:27:01.348 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:192.168.2.12:6642: clustered database server is not cluster leader; trying another server
2021-12-13 21:27:01.349 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:192.168.2.12:6642: connection closed by client
2021-12-13 21:27:02.322 22 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:192.168.2.12:6642: connecting...
2021-12-13 21:27:02.322 22 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:192.168.2.12:6642: connected
2021-12-13 21:27:02.344 22 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:192.168.2.12:6642: clustered database server is not cluster leader; trying another server
2021-12-13 21:27:02.346 22 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:192.168.2.12:6642: connection closed by client
2021-12-13 21:27:02.346 22 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:192.168.2.12:6642: waiting 2 seconds before reconnect
2021-12-13 21:27:02.350 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:192.168.2.10:6642: connecting...
2021-12-13 21:27:02.351 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:192.168.2.10:6642: connected
2021-12-13 21:27:04.348 22 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:192.168.2.10:6642: connecting...
2021-12-13 21:27:04.348 22 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:192.168.2.10:6642: connected
2021-12-13 21:27:05.327 7 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:192.168.2.10:6642: connecting...
2021-12-13 21:27:05.328 7 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:192.168.2.10:6642: connected

Bartosz Bezak (bbezak) wrote:

Some debug logs from the metadata agent when the leader change occurs:

https://paste.opendev.org/show/811649/

Akihiro Motoki (amotoki)
tags: added: ovn
Changed in neutron:
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote:

Hello olmy0414, Bartosz:

I think you are hitting [0]. When the metadata agent IDL reconnects to the SB DB, it receives the "ChassisPrivateCreateEvent", as seen in the logs you provided [1]. At this point the metadata agent should add the metadata ID to the Chassis or Chassis_Private table, as seen in [2] (my dev environment). This is not happening in your logs, and that is why I suspect the LP bug reported in [0].
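One way to verify whether the agent re-registered after the reconnection is to inspect the Chassis_Private table; the external_ids key name below is the one used by the Wallaby-era agent, so please double-check it in your release:

ovn-sbctl list Chassis_Private
# each chassis should have a "neutron:ovn-metadata-id" entry in external_ids;
# if it is missing (or stale) after the leader change, the agent did not re-register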

I've pushed backports up to Ussuri [3].

One piece of advice here: I don't know whether Kolla is setting the chassis ID. By default, this value is retrieved from the OVS system-id. If Kolla is overriding the default value, please be aware that **the OVS system-id must be a UUID-formatted string** [4].
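For example, to check and, if needed, set it on a compute node (just a sketch; changing the ID on a host that already has a registered Chassis has side effects on the existing records):

# check the current value (must be a UUID)
ovs-vsctl get Open_vSwitch . external_ids:system-id
# set a UUID-formatted value
ovs-vsctl set Open_vSwitch . external_ids:system-id="$(uuidgen)"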

Please check the fix and report the result.

Regards.

[0] https://bugs.launchpad.net/neutron/+bug/1952550
[1] https://paste.opendev.org/show/811649/
[2] https://paste.opendev.org/show/811687/
[3] https://review.opendev.org/q/Iad2b07f6e40dcbf690889d3b69bc00bb2ed0c05c
[4] https://docs.openvswitch.org/en/latest/ref/ovs-ctl.8/

Changed in neutron:
importance: Undecided → Medium
no longer affects: kolla-ansible
Bartosz Bezak (bbezak) wrote:

I can confirm that after changing the system-id to UUID format, the metadata agent handles the OVN SB leader change correctly.

However, the procedure to change the system-id to a UUID is bumpy. After the change, ovn-sb produced errors:

2021-12-15T12:47:15.341Z|00015|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Encap\" table to have identical values (geneve and \"192.168.1.19\") for index on columns \"type\" and \"ip\". First row, with UUID 1477d77e-7dc0-4700-a1a8-aded58b438cb, existed in the database before this transaction and was not modified by the transaction. Second row, with UUID ef46bc4e-a293-4340-acb1-ed9a8e9be8d3, was inserted by this transaction.","error":"constraint violation"}

To fix that, one needs to remove both the Chassis and Chassis_Private records (as the change breaks listing of existing agents in Neutron), as described here:

https://bugzilla.redhat.com/show_bug.cgi?id=1948472

If you have a better solution for changing the system-id to UUID format, I'd be grateful. Thank you.

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote:

Hello Bartosz:

Changing the system ID to UUID format is not a workaround but the only way the OVS system-id can be configured. If this ID was set incorrectly initially, that was a configuration issue. OVS does not fail when this ID is not in UUID format, but it should. And now the OVN metadata agent will exit with an exception.

As reported in that BZ, you need to stop the affected ovn-controllers, delete the Chassis and Chassis_Private records, delete the OVN agents in Neutron (if present), and restart ovn-controller again.
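Roughly, and only as a sketch (the container names are the kolla-ansible defaults, <old-system-id> and <agent-uuid> are placeholders, and ovn-sbctl must be run wherever the SB DB is reachable):

# on the affected compute node
docker stop ovn_controller
# on an SB DB node: remove the stale chassis; if the matching Chassis_Private row
# is not removed automatically in your OVN version, destroy it explicitly
ovn-sbctl chassis-del <old-system-id>
ovn-sbctl destroy Chassis_Private <old-system-id>
# remove the dead OVN agent rows in Neutron, if present
openstack network agent delete <agent-uuid>
# back on the compute node
docker start ovn_controller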

I'll mark this bug as a duplicate of LP#1952550.

Regards.
