[ovn] OVN agents showing as dead until neutron services restarted

Bug #1955503 reported by Paul Goins on 2021-12-21

Affects    Status     Importance    Assigned to    Milestone
neutron    Expired    Medium        Unassigned

Bug Description

My apologies if this is already a resolved issue; I couldn't readily find an existing bug but I recognize my software versions are somewhat behind here.

High level description: Had an issue today where "openstack network agent list" was frequently showing all OVN agents as offline. I root-caused this to 2 of the neutron-servers consistently returning alive=false for all OVN network agents while 1 of the neutron servers consistently returned alive=true. Upon restarting neutron (pause/resume via neutron-api charm action), the affected neutron servers started returning alive=true.
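For anyone repeating this check, here is a minimal sketch of the comparison described above. It assumes you have already queried each neutron-server directly (bypassing the load balancer) and saved the parsed JSON body of its `GET /v2.0/agents` reply. The function name, the server labels and the host names are hypothetical and used only for illustration; matching OVN agents by an "OVN" prefix on `agent_type` is also an assumption.

```python
from typing import Dict, List


def servers_reporting_dead_agents(
    responses: Dict[str, dict],
) -> Dict[str, List[str]]:
    """Map each server label to the hosts of OVN agents it reports as dead.

    `responses` maps a server label (hypothetical, e.g. a unit name) to the
    parsed JSON body of that server's `GET /v2.0/agents` reply.
    Servers that report every OVN agent as alive are left out of the result.
    """
    dead: Dict[str, List[str]] = {}
    for server, body in responses.items():
        hosts = sorted(
            agent["host"]
            for agent in body.get("agents", [])
            # Assumption: OVN agent types all start with "OVN".
            if agent.get("agent_type", "").startswith("OVN")
            and not agent.get("alive", False)
        )
        if hosts:
            dead[server] = hosts
    return dead


# Illustrative data only: one healthy server and one reporting alive=false.
example = {
    "neutron-api/0": {"agents": [
        {"agent_type": "OVN Controller agent", "host": "compute-1", "alive": True},
    ]},
    "neutron-api/1": {"agents": [
        {"agent_type": "OVN Controller agent", "host": "compute-2", "alive": False},
        {"agent_type": "OVN Metadata agent", "host": "compute-1", "alive": False},
    ]},
}
print(servers_reporting_dead_agents(example))
```

If the result lists some servers but not others for the same agents, the disagreement is between the neutron-server instances rather than in the agents themselves, which matches what was observed here.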

Workaround: Restarting neutron services appears to resolve the issue; "openstack network agent list" now consistently shows all OVN agents as alive.

Relevant software versions in use:
* OpenStack series: Ussuri
* Neutron version: 16.4.0 (e.g. neutron-common package at 2:16.4.0-0ubuntu3~cloud0)
* Charm versions:
  * neutron-api: cs:neutron-api-288
  * neutron-api-plugin-ovn: cs:neutron-api-plugin-ovn-1

Perceived severity: Not a blocker since there's a workaround, but when it occurs, it causes very scary looking alerts in Nagios due to all of OVN appearing offline.

My apologies for this being somewhat short on details; I need to jump to debugging another issue, but wanted to ensure at least something is filed here. Thank you.

Bence Romsics (bence-romsics) wrote on 2021-12-22:

Hi,

Thanks for the report!

We have an AgentCache for ovn:

https://opendev.org/openstack/neutron/src/commit/dddf93cd2b85131a68352255874409bfef74eff7/neutron/plugins/ml2/drivers/ovn/agent/neutron_agent.py#L197

And, as we know, caching is hard, so I wouldn't be surprised to see such a bug. However, without more information this can be hard or impossible to fix. Do you have an idea what conditions trigger this error? If you are monitoring it, you may have some indication of when it is happening. What else is going on in your system at that time?

Do you have neutron-server logs from that time (preferably at debug level)? Do you see any errors in those logs?

In your workaround is it enough to restart the problematic neutron-server instance or do you have to restart something else too? If yes, which component?

When you have the time please try to come up with reproduction steps, because that would help the fix to a great extent.

Cheers,
Bence

tags:	added: ovn
Changed in neutron:
importance:	Undecided → Medium
status:	New → Incomplete

Paul Goins (vultaire) wrote on 2021-12-22:

Unfortunately, neutron logging was at verbose=False debug=False at the time of this.

I do see OVN errors, but they at least don't seem directly related, e.g.

2021-12-22 17:40:50.432 1513960 ERROR neutron.pecan_wsgi.hooks.translation oslo_db.exception.DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, "Duplicate entry '<REDACTED-UUID>-<REDACTED-HOST>' for key 'PRIMARY'")
2021-12-22 17:40:50.432 1513960 ERROR neutron.pecan_wsgi.hooks.translation [SQL: UPDATE ml2_port_bindings SET host=%(host)s, vif_type=%(vif_type)s, vif_details=%(vif_details)s WHERE ml2_port_bindings.port_id = %(ml2_port_bindings_port_id)s AND ml2_port_bindings.host = %(ml2_port_bindings_host)s]
2021-12-22 17:40:50.432 1513960 ERROR neutron.pecan_wsgi.hooks.translation [parameters: {'host': '<REDACTED-HOST>', 'vif_type': 'unbound', 'vif_details': '', 'ml2_port_bindings_port_id': '<REDACTED-UUID>', 'ml2_port_bindings_host': '<DIFFERENT-REDACTED-HOST>'}]

Generally we don't run with debug/verbose logging enabled due to the sheer amount of log data it produces. I'd say I'd turn on logging if this recurs, but of course restarting the services appears to resolve the issue.

If I see this recur, I'll try to provide any additional information which may help to root-cause this. I understand that with the information currently provided there's not much that can be done, aside from this bug being here for others to find in the future if they hit a similar issue.

There's one bit I can provide a little extra context on:

> In your workaround is it enough to restart the problematic neutron-server instance or do you have to restart something else too? If yes, which component?

I generally try to operate in terms of charms, so I paused and resumed the neutron-api charm. neutron-server would have been restarted by doing that; I'm unsure if other services would have been touched or not. For what it's worth, "systemctl list-units --all | grep neutron | grep -ve jujud" only shows neutron-server.service as being present.

Best Regards,
Paul Goins

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote on 2022-01-11:

Hello:

Just a heads-up. There is an ongoing patch [1] to change how the OVN agents are monitored and the config stored (using the Neutron DB). Of course, that will probably land in master only.

Regards.

[1] https://review.opendev.org/c/openstack/neutron/+/818850