[ovn] OVN agents showing as dead until neutron services restarted

Bug #1955503 reported by Paul Goins
Affects:      neutron
Status:       Expired
Importance:   Medium
Assigned to:  Unassigned
Milestone:

Bug Description

My apologies if this is already a resolved issue; I couldn't readily find an existing bug but I recognize my software versions are somewhat behind here.

High-level description: I had an issue today where "openstack network agent list" was frequently showing all OVN agents as offline. I root-caused this to two of the neutron-servers consistently returning alive=false for all OVN network agents, while one neutron-server consistently returned alive=true. Upon restarting neutron (pause/resume via the neutron-api charm actions), the affected neutron-servers started returning alive=true.
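
To make the flapping easier to spot from the client side, here is a minimal polling sketch using openstacksdk (the cloud name and interval are placeholders, and which neutron-server answers a given call depends on the API load balancer, so the reported liveness can flap between polls):

    import time
    import openstack

    # Assumes a "mycloud" entry in clouds.yaml; adjust to your environment.
    conn = openstack.connect(cloud="mycloud")

    while True:
        # Each request may be served by a different neutron-server behind the
        # load balancer; if only some servers hold a stale view of the OVN
        # agents, the reported liveness flaps between polls.
        for agent in conn.network.agents():
            if "ovn" in (agent.agent_type or "").lower():
                print(agent.host, agent.agent_type, agent.is_alive)
        time.sleep(30)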

Workaround: Restarting neutron services appears to resolve the issue; "openstack network agent list" now consistently shows all OVN agents as alive.

Relevant software versions in use:
* OpenStack series: Ussuri
* Neutron version: 16.4.0 (e.g. neutron-common package at 2:16.4.0-0ubuntu3~cloud0)
* Charm versions:
  * neutron-api: cs:neutron-api-288
  * neutron-api-plugin-ovn: cs:neutron-api-plugin-ovn-1

Perceived severity: Not a blocker, since there's a workaround, but when it occurs it causes very scary-looking alerts in Nagios because all of OVN appears to be offline.

My apologies for this being perhaps somewhat scarce on details; I need to move on to debugging another issue, but I wanted to ensure at least something is filed here. Thank you.

Tags: ovn
Revision history for this message
Bence Romsics (bence-romsics) wrote :

Hi,

Thanks for the report!

We have an AgentCache for ovn:

https://opendev.org/openstack/neutron/src/commit/dddf93cd2b85131a68352255874409bfef74eff7/neutron/plugins/ml2/drivers/ovn/agent/neutron_agent.py#L197

And as we know, caching is hard, so I wouldn't be surprised to see such a bug. However, without more information this can be hard or impossible to fix. Do you have an idea what conditions trigger this error? If you are monitoring it, you may have some indication of when it is happening. What else is going on in your system at that time?
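
To illustrate the failure mode only (a toy sketch, not neutron's actual implementation): if each API worker keeps its own timestamp-based cache and one worker stops receiving the events that refresh it, that worker keeps answering alive=False while the others answer alive=True.

    import time

    AGENT_DOWN_TIME = 75  # seconds; analogous to neutron's agent_down_time option

    class ToyAgentCache:
        """Per-worker cache: every API worker holds its own copy."""

        def __init__(self):
            self._last_seen = {}  # agent_id -> timestamp of last observed update

        def on_agent_event(self, agent_id):
            # Refreshed when this worker observes an update (e.g. an OVSDB
            # event). If those events stop arriving, this worker's view of the
            # agent goes stale even though the agent itself is fine.
            self._last_seen[agent_id] = time.time()

        def is_alive(self, agent_id):
            last = self._last_seen.get(agent_id, 0)
            return (time.time() - last) < AGENT_DOWN_TIME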

Do you have neutron-server logs from that time (preferably at debug level)? Do you see any errors in those logs?

In your workaround is it enough to restart the problematic neutron-server instance or do you have to restart something else too? If yes, which component?

When you have the time, please try to come up with reproduction steps, because that would greatly help in getting this fixed.

Cheers,
Bence

tags: added: ovn
Changed in neutron:
importance: Undecided → Medium
status: New → Incomplete
Revision history for this message
Paul Goins (vultaire) wrote :

Unfortunately, neutron logging was set to verbose=False and debug=False at the time.

I do see OVN-related errors in the logs, but they don't seem directly related to this issue, e.g.:

2021-12-22 17:40:50.432 1513960 ERROR neutron.pecan_wsgi.hooks.translation oslo_db.exception.DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, "Duplicate entry '<REDACTED-UUID>-<REDACTED-HOST>' for key 'PRIMARY'")
2021-12-22 17:40:50.432 1513960 ERROR neutron.pecan_wsgi.hooks.translation [SQL: UPDATE ml2_port_bindings SET host=%(host)s, vif_type=%(vif_type)s, vif_details=%(vif_details)s WHERE ml2_port_bindings.port_id = %(ml2_port_bindings_port_id)s AND ml2_port_bindings.host = %(ml2_port_bindings_host)s]
2021-12-22 17:40:50.432 1513960 ERROR neutron.pecan_wsgi.hooks.translation [parameters: {'host': '<REDACTED-HOST>', 'vif_type': 'unbound', 'vif_details': '', 'ml2_port_bindings_port_id': '<REDACTED-UUID>', 'ml2_port_bindings_host': '<DIFFERENT-REDACTED-HOST>'}]

Generally we don't run with debug/verbose logging enabled due to the sheer amount of log data it produces. I would turn on debug logging if this recurs, but of course restarting the services (which enabling it would require) appears to resolve the issue.
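
For reference, the standard oslo.log switch lives in neutron.conf; if the charm exposes a debug option (as most OpenStack charms do), "juju config neutron-api debug=true" would presumably render the same setting:

    # /etc/neutron/neutron.conf (charm-managed, so prefer the charm option
    # where available rather than editing the file directly)
    [DEFAULT]
    debug = True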

If I see this recur, I'll try to provide any additional information that may help root-cause it. I understand that with the information currently provided there's not much that can be done, aside from this bug serving as something for others to find in the future if they hit a similar issue.

There's one bit I can provide a little extra context on:

> In your workaround is it enough to restart the problematic neutron-server instance or do you have to restart something else too? If yes, which component?

I generally try to operate in terms of charms, so I paused and resumed the neutron-api charm. neutron-server would have been restarted by doing that; I'm unsure whether any other services would have been touched. For what it's worth, "systemctl list-units --all | grep neutron | grep -ve jujud" shows only neutron-server.service as present.

Best Regards,
Paul Goins

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

Just a heads-up: there is an ongoing patch [1] to change how the OVN agents are monitored and how their configuration is stored (using the Neutron DB). Of course, that will probably land in master only.

Regards.

[1] https://review.opendev.org/c/openstack/neutron/+/818850

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired