ovs is dead but ovs agent is up

Bug #1910946 reported by norman shen on 2021-01-11
This bug affects 1 person
Assigned to: norman shen

Bug Description

We are using OpenStack Neutron (Rocky) with Open vSwitch 2.10.0.

We are on Ubuntu 18.04, which shipped with a libc6 bug, reported here: https://github.com/openvswitch/ovs-issues/issues/175.

When this bug happens, the OVS agent stops working and its log reports OVS as dead,
but it still sends heartbeats to neutron-server. This is problematic because users who only look at the agent service state will be unaware that the ovs-agent is no longer working.

Miguel Lavalle (minsel) wrote :

Are you suggesting that the agent should report itself dead when the ovs deadlock occurs?

1) How would the agent determine OVS is deadlocked?
2) Isn't it easier to adopt the workaround indicated in https://github.com/openvswitch/ovs-issues/issues/175 and just move on?

norman shen (jshen28) wrote :

Thank you for the reply. Actually, OVS might have other problems that make commands like
`ovs-ofctl dump-flows` hang, and when this happens, subsequent operations that modify flow
tables will time out and fail.
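
One way to notice such a hang is to run the command with a deadline and treat a timeout as "OVS is not responding". The following is only a minimal sketch, not what the Neutron agent does: the helper name `command_responds`, the 5-second default, and the bridge name `br-int` in the usage comment are assumptions made for illustration.

```python
import subprocess


def command_responds(cmd, timeout=5.0):
    """Return True if cmd exits with status 0 within timeout seconds.

    A hang (timeout), a non-zero exit status, or a missing binary are
    all treated as "not responding".
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # the child is killed if it runs past the deadline
            check=False,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


# Hypothetical usage on a compute node (the bridge name is an assumption):
# ovs_ok = command_responds(["ovs-ofctl", "dump-flows", "br-int"])
```

A probe like this cannot tell a deadlocked OVS from a slow one, so the timeout has to be chosen with the normal flow-table size in mind.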

And since we can already detect the OVS problem early, why not stop sending heartbeats and fail fast,
rather than leave the cloud operator to figure out why live migration or spawning a new instance times out?

As for replacing libc, we actually did that in some environments, but since Ubuntu 18.04 isn't going to fix the problem,
we took the risk of using libc6 from 20.04, which could be risky. Not to mention that upgrading is pretty painful, because we have to live-migrate instances first. But this hang can easily be resolved by something like a gdb attach, so I think it would be beneficial to expose OVS being down as soon as the OVS agent detects it.

Anyway, my point is that we might need to treat OVS and the OVS agent as a single unit.
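
Treating the two as one unit could look roughly like the sketch below: the heartbeat is only sent while an OVS liveness probe keeps succeeding. This is a hypothetical illustration, not the change that was merged into Neutron; the class `HeartbeatGate`, its `probe` and `send` callables, and the `max_failures` threshold are all invented names and parameters.

```python
from typing import Callable


class HeartbeatGate:
    """Send a heartbeat only while the OVS probe keeps succeeding."""

    def __init__(self, probe: Callable[[], bool], send: Callable[[], None],
                 max_failures: int = 3):
        self.probe = probe                # returns True if OVS answered in time
        self.send = send                  # reports agent state to the server
        self.max_failures = max_failures  # consecutive failures tolerated
        self.failures = 0

    def tick(self) -> bool:
        """Run one reporting cycle; return True if a heartbeat was sent."""
        if self.probe():
            self.failures = 0
        else:
            self.failures += 1
        if self.failures >= self.max_failures:
            # Withhold the heartbeat so the server eventually marks the agent down.
            return False
        self.send()
        return True
```

Tolerating a few consecutive failures before going silent avoids flapping the agent state on a single slow probe, and a successful probe resets the counter, so heartbeats resume once OVS recovers.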

norman shen (jshen28) wrote :

Ah sorry for the duplicate report ...

Miguel Lavalle (minsel) wrote :

But we would still be doing this enhancement for this specific ovs / libc issue. Wouldn't we? Or how do you see the proposal to be more generally applicable?

norman shen (jshen28) wrote :

Right, we have actually updated libc on several production sites already and have seen no issues so far. But we still need some time before drawing a final conclusion.

Anyway, getting alerts from the OVS agent when it has already detected a problem would be good, at least in terms of fast problem identification, IMO. At least operators would get some alerts instead of scratching their heads trying to find out why a server failed to spawn.

Oleg Bondarev (obondarev) wrote :
Changed in neutron:
importance: Undecided → Medium
status: New → In Progress
Miguel Lavalle (minsel) on 2021-01-14
Changed in neutron:
assignee: nobody → norman shen (jshen28)

This issue was fixed in the openstack/neutron 17.1.0 release.

This issue was fixed in the openstack/neutron release candidate.
