Neutron picking incorrect ovn records

Bug #2012104 reported by Peter Sabaini
This bug affects 2 people
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: Rodolfo Alonso

Bug Description

For one of our compute machines I'm seeing two network agents that appear unhealthy:

```
$ os network agent list | fgrep "register deleted"
| compute1 | OVN Controller agent | ("Chassis" register deleted) | | XXX | UP | ovn-controller |
| c085d57a-3a2b-4f97-8250-23d3f914b078 | OVN Metadata agent | ("Chassis" register deleted) | | XXX | UP | neutron-ovn-metadata-agent |
```

The ("Chassis" register deleted) message appears to come from the fix for this: https://bugs.launchpad.net/neutron/+bug/1951149

Searching for that external ID, I can find this Chassis_Private record, and its chassis column indeed seems empty:

```
$ sudo ovn-sbctl find chassis-private | grep -A 5 e621e0fb-83d3-4a18-82b3-c842996548ed
_uuid : e621e0fb-83d3-4a18-82b3-c842996548ed
chassis : []
external_ids : {"neutron:liveness_check_at"="2022-06-17T08:43:33.393639+00:00", "neutron:metadata_liveness_check_at"="2022-06-17T02:27:21.309718+00:00", "neutron:ovn-metadata-id"="c085d57a-3a2b-4f97-8250-23d3f914b078", "neutron:ovn-metadata-sb-cfg"="150397"}
name : compute1
nb_cfg : 150397
nb_cfg_timestamp : 1657729945956
```

But there's also:

```
$ sudo ovn-sbctl find chassis hostname=compute1.stack
_uuid : 164cb56b-1a3c-4401-bc52-6fa5e58d8f2a
encaps : [c442312a-9dfa-4ffe-9db7-afe5f9055962]
external_ids : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", "neutron:ovn-metadata-sb-cfg"="250161", ovn-bridge-mappings="", ovn-chassis-mac-mappings="", ovn-cms-options="", ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-timeout-ms="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
hostname : compute1.stack
name : compute1.stack
nb_cfg : 0
other_config : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="", ovn-chassis-mac-mappings="", ovn-cms-options="", ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-timeout-ms="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
transport_zones : []
vtep_logical_switches: []

$ sudo ovn-sbctl find chassis-private chassis=164cb56b-1a3c-4401-bc52-6fa5e58d8f2a
_uuid : cbec617d-19dc-481c-ba99-b4132244773c
chassis : 164cb56b-1a3c-4401-bc52-6fa5e58d8f2a
external_ids : {"neutron:ovn-metadata-id"="3328a0c7-081b-58a9-9e91-baf5c8c259cd", "neutron:ovn-metadata-sb-cfg"="312321"}
name : compute1.stack
nb_cfg : 312321
nb_cfg_timestamp : 1679042105359
```

Which seems to be a correct entry -- should neutron not pick up this entry rather than the one with "chassis : []"?

Software versions:

ii neutron-server 2:20.2.0-0ubuntu1~cloud0 all Neutron is a virtual network service for Openstack - server

ii ovn-central 22.03.0-0ubuntu1~cloud0 amd64 OVN central components

Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal

Please let me know if I can provide more diagnostics.

Tags: ovn
description: updated
tags: added: ovn
Brian Haley (brian-haley) wrote :

From the info you provided I believe you're running stable/yoga code, which should have the fix you mentioned.

Did you try restarting neutron-server to see if that changed anything?

Changed in neutron:
status: New → Incomplete
Peter Sabaini (peter-sabaini) wrote :

Right, the patch https://review.opendev.org/c/openstack/neutron/+/839027 is deployed. AIUI, that patch introduced the message in the first place.

I have indeed done a `sudo systemctl restart neutron-server.service` across the board, but still see `("Chassis" register deleted)`.

From reading the bug and patch, there seems to be an expectation that operators clean up manually? It would of course be nice if this could be done automatically, but failing that, maybe this is just a matter of spelling it out more clearly and documenting a procedure.

Changed in neutron:
status: Incomplete → New
Changed in neutron:
status: New → Confirmed
importance: Undecided → Medium
Changed in neutron:
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Peter:

That seems to be a problem during an upgrade or compute node re-deployment. The "ovn-controller" process, which runs on every node, is in charge of, among other operations, updating the "Chassis"/"Chassis_Private" records in the OVN SB database. When the process starts, it first checks whether these records exist; when the process is **gracefully** stopped, it deletes them. I've highlighted gracefully because this is important: if the process is killed, the "Chassis"/"Chassis_Private" records (or just one of them) will remain in the OVN SB database.

In your case I guess that "Chassis_Private" is still present but not "Chassis". I see the host name changed from "compute1" to "compute1.stack"; this is why I'm guessing you did some kind of upgrade/update.

If you can now see the "Chassis"/"Chassis_Private" records for "compute1.stack", this is because you have started "ovn-controller" again. But that won't delete any orphaned or leftover record. It is the admin's responsibility to check for those when starting the "ovn-controller" service again.

Neutron is not responsible for the OVN SB database and can't delete these orphaned records. The patch mentioned [1] is a fix for the Neutron server in this kind of situation. Now, instead of raising an exception, it shows the message "("Chassis" register deleted)", which should prompt the admin to check the status of the OVN database.
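
The check described above can be sketched as a small shell filter. This is only an illustration and not part of the Neutron patch: the function name `find_orphaned_chassis_private` is made up for this example, and it assumes the `column : value` layout with blank lines between records, as in the `ovn-sbctl` output quoted in the bug description. It prints the `_uuid` and `name` of every Chassis_Private record whose `chassis` column is `[]`.

```shell
# Hypothetical helper: reads "ovn-sbctl list Chassis_Private" style output
# on stdin and prints "<_uuid> <name>" for each record with an empty
# "chassis" column.
find_orphaned_chassis_private() {
    awk '
        function emit_record() {
            if (uuid != "" && chassis == "[]") print uuid, name
            uuid = ""; name = ""; chassis = ""
        }
        # A blank line ends the current record.
        /^[ \t]*$/ { emit_record(); next }
        {
            # Split "column : value" at the first colon only, because
            # values such as external_ids contain colons themselves.
            pos = index($0, ":")
            if (pos == 0) next
            key = substr($0, 1, pos - 1)
            val = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (key == "_uuid") uuid = val
            else if (key == "name") name = val
            else if (key == "chassis") chassis = val
        }
        # The last record is usually not followed by a blank line.
        END { emit_record() }
    '
}
```

Assuming the southbound database is reachable from the host, the filter could be fed with `sudo ovn-sbctl list Chassis_Private | find_orphaned_chassis_private`.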

One more question to confirm: when executing the "os network agent list" command, are you now seeing two entries for "compute1" (the deleted ones) and two entries for "compute1.stack"?

Regards.

[1] https://review.opendev.org/c/openstack/neutron/+/839027

Peter Sabaini (peter-sabaini) wrote :

Hi Rodolfo,

Yes, this node has had crashes; possibly this contributed to the stale entries.

When doing `os network agent list` I did not see any entries mentioning `compute1` or `compute1.stack` besides the ones listed as `("Chassis" register deleted)` (I have since deleted and resurrected the agents).

With regard to the admin's responsibility to clean up orphaned entries: I suspected as much after reading the patch from https://review.opendev.org/c/openstack/neutron/+/839027 . It would be good to have some docs around that though.

Thanks,
peter.

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

This will be addressed in https://bugs.launchpad.net/neutron/+bug/2016158. If Neutron finds duplicated Chassis/Chassis_Private records (same host), we'll decide there what to do.

For your case, where the host name changed from "compute1" to "compute1.stack", we'll update the docs but Neutron cannot proactively execute any action in this case.
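
In the meantime, a manual cleanup along the lines discussed in this thread might look like the sketch below. It is a dry run that only prints commands for review; `print_orphan_cleanup` is a hypothetical helper name. `ovn-sbctl destroy` and `openstack network agent delete` are existing commands, but whether both steps are needed in a given deployment is an assumption here, so check the updated docs before running anything.

```shell
# Hypothetical dry-run helper: prints (does not execute) cleanup commands.
# $1      _uuid of the orphaned Chassis_Private row in the OVN SB database
# $2...   IDs of the Neutron agents that reference it
print_orphan_cleanup() {
    sb_uuid=$1
    shift
    # Remove the leftover row from the southbound database.
    printf 'ovn-sbctl destroy Chassis_Private %s\n' "$sb_uuid"
    # Remove the stale agents from the Neutron agent list.
    for agent_id in "$@"; do
        printf 'openstack network agent delete %s\n' "$agent_id"
    done
}
```

For the record reported here, that would be `print_orphan_cleanup e621e0fb-83d3-4a18-82b3-c842996548ed compute1 c085d57a-3a2b-4f97-8250-23d3f914b078`, using the two agent IDs from the `os network agent list` output in the bug description.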

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/881204

Changed in neutron:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/881204
Committed: https://opendev.org/openstack/neutron/commit/b31453af477bc7421f81886fc1f4cba230dad425
Submitter: "Zuul (22348)"
Branch: master

commit b31453af477bc7421f81886fc1f4cba230dad425
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Thu Apr 20 20:52:33 2023 +0200

    [OVN] Admin procedure for duplicated or deleted OVN agents

    This patch documents how to detect that the system has duplicated
    "Chassis" and "Chassis_Private" registers or when a "Chassis_Private"
    register is orphaned, and how to proceed to health the OVN Southbound
    database.

    Closes-Bug: #2012104
    Change-Id: I926e6b9fe5fbad2968fc92e65082b7bb0d8571a9

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 23.0.0.0b2

This issue was fixed in the openstack/neutron 23.0.0.0b2 development milestone.
