ovs is dead but ovs agent is up

Bug #1910946 reported by norman shen
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: norman shen

Bug Description

We are using openstack-neutron Rocky with Open vSwitch 2.10.0.

We are running Ubuntu 18.04, which ships with a libc6 bug reported here: https://github.com/openvswitch/ovs-issues/issues/175

My concern is that when this bug is triggered, the OVS agent stops working and its log reports OVS as dead,
but it still sends heartbeats to neutron-server. This is problematic because users who only look at the agent service state will be unaware that ovs-agent is no longer working.

Revision history for this message
Miguel Lavalle (minsel) wrote :

Are you suggesting that the agent should report itself dead when the ovs deadlock occurs?

1) How would the agent determine OVS is deadlocked?
2) Isn't it easier to adopt the workaround indicated in https://github.com/openvswitch/ovs-issues/issues/175 and just move on?

Revision history for this message
norman shen (jshen28) wrote :

Thank you for the reply. Actually, OVS might have other problems that make commands like
`ovs-ofctl dump-flows` hang; when this happens, subsequent operations that modify the flow
tables time out and fail.
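
A minimal sketch of how such a hang could be detected from the outside, assuming it is enough to run a command with a deadline and treat a timeout as "OVS is not responding". The helper name `ovs_responds`, the 5 second default, and the bridge name `br-int` in the usage note are illustrative assumptions, not what the neutron agent actually does.

```python
import subprocess


def ovs_responds(cmd, timeout=5.0):
    """Return True if `cmd` exits with status 0 within `timeout` seconds.

    Hypothetical helper: a hung command (for example a deadlocked
    ovs-vswitchd that never answers) is reported as False instead of
    blocking the caller forever.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # the child is killed when the deadline passes
        )
    except subprocess.TimeoutExpired:
        return False  # command hung: treat OVS as unresponsive
    except OSError:
        return False  # command could not be started at all
    return result.returncode == 0
```

For example, `ovs_responds(["ovs-ofctl", "dump-flows", "br-int"])` would return False if the dump does not finish within 5 seconds.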

And since we can already detect OVS problems early, why not stop sending heartbeats and fail quickly
rather than leave the cloud operator to figure out why live migration or spawning a new instance timed out?
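
The idea of withholding heartbeats can be sketched as follows. This is only an illustration of the proposal, not the fix that was merged: the class `HeartbeatReporter` and its `probe` and `send` callables are hypothetical names. The assumption is that neutron-server eventually marks an agent as down once its heartbeats stop arriving.

```python
from typing import Callable


class HeartbeatReporter:
    """Send a heartbeat only while the health probe succeeds (hypothetical)."""

    def __init__(self, probe: Callable[[], bool], send: Callable[[], None]):
        self.probe = probe    # returns True while OVS looks healthy
        self.send = send      # reports agent state to the server
        self.skipped = 0      # consecutive heartbeats withheld

    def tick(self) -> bool:
        """Run one reporting cycle; return True if a heartbeat was sent."""
        if not self.probe():
            # OVS looks dead: stay silent so the server notices the agent
            # is not usable instead of seeing a healthy-looking service.
            self.skipped += 1
            return False
        self.send()
        self.skipped = 0
        return True
```

Calling `tick()` once per report interval keeps the service state visible to operators in line with what the agent itself has detected.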

As for replacing libc, we actually did it in some environments, but since Ubuntu 18.04 isn't going to fix the problem,
we took the risk of using libc6 from 20.04, which could be risky. Not to mention that upgrading is pretty painful because we have to live migrate instances first. But this hang can be easily resolved by something like a gdb attach, so I think it might be beneficial to expose OVS as down as soon as the OVS agent detects it.

Anyway, my point is that we might need to treat OVS and the OVS agent as a single unit.

Revision history for this message
norman shen (jshen28) wrote :

Ah sorry for the duplicate report ...

Revision history for this message
Miguel Lavalle (minsel) wrote :

But we would still be doing this enhancement for this specific ovs / libc issue. Wouldn't we? Or how do you see the proposal to be more generally applicable?

Revision history for this message
norman shen (jshen28) wrote :

Right, we have actually updated libc on several production sites already and have seen no issues so far. But we still need some time before drawing a final conclusion.

Anyway, getting alerts from the OVS agent once it has detected a problem would be good, at least in terms of fast problem identification, IMO. At least operators would get some alerts instead of scratching their heads trying to find out why a server failed to spawn.

Revision history for this message
Oleg Bondarev (obondarev) wrote :
Changed in neutron:
importance: Undecided → Medium
status: New → In Progress
Miguel Lavalle (minsel)
Changed in neutron:
assignee: nobody → norman shen (jshen28)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 17.1.0

This issue was fixed in the openstack/neutron 17.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 18.0.0.0rc1

This issue was fixed in the openstack/neutron 18.0.0.0rc1 release candidate.

Revision history for this message
ammarun (ammarun) wrote :

Which version of libc did you update to?
We are also using Ubuntu 18.04 with OpenStack Rocky and hit this `ovs is dead` issue.

Revision history for this message
Brian Haley (brian-haley) wrote :

This fix is only in Victoria and later, so Rocky would still have the issue. Closing since the fix is available in other branches.

Changed in neutron:
status: In Progress → Fix Released