[ovn] OVN agents showing as dead until neutron services restarted
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
neutron | Expired | Medium | Unassigned | | |
Bug Description
My apologies if this is already a resolved issue; I couldn't readily find an existing bug but I recognize my software versions are somewhat behind here.
High level description: Had an issue today where "openstack network agent list" was frequently showing all OVN agents as offline. I root-caused this to 2 of the neutron-servers consistently returning alive=false for all OVN network agents while 1 of the neutron servers consistently returned alive=true. Upon restarting neutron (pause/resume via neutron-api charm action), the affected neutron servers started returning alive=true.
Workaround: Restarting neutron services appears to resolve the issue; "openstack network agent list" now consistently shows all OVN agents as alive.
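For anyone trying to confirm the same pattern, here is a minimal sketch of the per-server comparison described above. It assumes the agent list has already been fetched separately from each neutron-server (for example by querying each backend directly rather than through the load-balanced endpoint), and that each agent record carries `agent_type` and `alive` fields as in the Neutron agents API. The server names, the sample data, and the "agent_type starts with OVN" filter are assumptions for illustration only.

```python
from typing import Dict, List, Tuple


def summarize_ovn_liveness(
    agents_by_server: Dict[str, List[dict]]
) -> Dict[str, Tuple[int, int]]:
    """Return {server: (alive_ovn_agents, total_ovn_agents)}."""
    summary = {}
    for server, agents in agents_by_server.items():
        # Assumption: OVN agents can be recognised by their agent_type prefix.
        ovn = [a for a in agents if a.get("agent_type", "").startswith("OVN")]
        alive = sum(1 for a in ovn if a.get("alive"))
        summary[server] = (alive, len(ovn))
    return summary


def servers_reporting_all_dead(
    agents_by_server: Dict[str, List[dict]]
) -> List[str]:
    """Servers that see at least one OVN agent and report none as alive."""
    summary = summarize_ovn_liveness(agents_by_server)
    return sorted(
        server
        for server, (alive, total) in summary.items()
        if total > 0 and alive == 0
    )


# Hypothetical sample data: one agent list per neutron-server instance.
sample = {
    "neutron-server-0": [
        {"agent_type": "OVN Controller agent", "host": "compute-1", "alive": False},
        {"agent_type": "OVN Metadata agent", "host": "compute-1", "alive": False},
    ],
    "neutron-server-1": [
        {"agent_type": "OVN Controller agent", "host": "compute-1", "alive": False},
        {"agent_type": "OVN Metadata agent", "host": "compute-1", "alive": False},
    ],
    "neutron-server-2": [
        {"agent_type": "OVN Controller agent", "host": "compute-1", "alive": True},
        {"agent_type": "OVN Metadata agent", "host": "compute-1", "alive": True},
    ],
}

suspect_servers = servers_reporting_all_dead(sample)
```

With the sample above, `suspect_servers` lists the two servers whose answers disagree with the third, which mirrors the 2-of-3 split I observed.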
Relevant software versions in use:
* OpenStack series: Ussuri
* Neutron version: 16.4.0 (e.g. neutron-common package at 2:16.4.
* Charm versions:
* neutron-api: cs:neutron-api-288
* neutron-
Perceived severity: Not a blocker since there's a workaround, but when it occurs, it causes very scary looking alerts in Nagios due to all of OVN appearing offline.
My apologies for this being perhaps somewhat scarce on details; I need to jump to debug another issue, but wanted to ensure at least something is filed here. Thank you.
Hi,
Thanks for the report!
We have an AgentCache for ovn:
https://opendev.org/openstack/neutron/src/commit/dddf93cd2b85131a68352255874409bfef74eff7/neutron/plugins/ml2/drivers/ovn/agent/neutron_agent.py#L197
And as we know, caching is hard, so I wouldn't be surprised to see such a bug. However, without more information this can be hard or impossible to fix. Do you have an idea what conditions trigger this error? If you are monitoring it, you may have some indication of when it is happening. What else is going on in your system at that time?
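To illustrate why a per-process cache could, in principle, produce this kind of split between servers, here is a toy sketch. It is not Neutron's actual AgentCache; the class, its methods, and the timestamp-based liveness rule are all hypothetical and only show the general failure mode of one process missing an update that another process received.

```python
class ToyAgentCache:
    """Toy per-process cache: an agent is alive if it was seen recently.

    Hypothetical model for illustration; not Neutron's AgentCache.
    """

    def __init__(self, max_age: float) -> None:
        self.max_age = max_age
        self._last_seen = {}  # agent_id -> timestamp of last update

    def record_update(self, agent_id: str, now: float) -> None:
        self._last_seen[agent_id] = now

    def is_alive(self, agent_id: str, now: float) -> bool:
        last = self._last_seen.get(agent_id)
        return last is not None and now - last <= self.max_age


# Two server processes, each holding its own cache.
server_a = ToyAgentCache(max_age=75)
server_b = ToyAgentCache(max_age=75)

# Both see the first heartbeat at t=0.
server_a.record_update("ovn-controller@compute-1", now=0)
server_b.record_update("ovn-controller@compute-1", now=0)

# Only server_a receives the heartbeat at t=60; server_b misses it.
server_a.record_update("ovn-controller@compute-1", now=60)

# At t=100 the two processes disagree about the same agent.
a_view = server_a.is_alive("ovn-controller@compute-1", now=100)
b_view = server_b.is_alive("ovn-controller@compute-1", now=100)

# A restart rebuilds the cache from the current state, modelled here
# by a fresh cache that is populated with the latest heartbeat.
server_b_restarted = ToyAgentCache(max_age=75)
server_b_restarted.record_update("ovn-controller@compute-1", now=60)
b_view_after_restart = server_b_restarted.is_alive(
    "ovn-controller@compute-1", now=100
)
```

In this toy model the stale process keeps answering "dead" until its cache is rebuilt, which would be consistent with a restart clearing the symptom. Whether anything like this happens in the real code is exactly what the logs would need to show.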
Do you have neutron-server logs from that time (preferably at debug level)? Do you see any errors in those logs?
In your workaround, is it enough to restart the problematic neutron-server instance, or do you have to restart something else too? If so, which component?
When you have the time, please try to come up with reproduction steps, because that would help greatly in finding a fix.
Cheers,
Bence