Long enough network partitioning puts neutron-ovs-agent in unmanaged state

Bug #1504322 reported by Victor Denisov
Affects                          Status     Importance  Assigned to
Fuel for OpenStack               Won't Fix  Medium      Fuel Library (Deprecated)
6.1.x                            Invalid    Medium      MOS Maintenance
7.0.x                            Invalid    Medium      MOS Maintenance
Future                           Invalid    Medium      Fuel Library (Deprecated)

Bug Description

Steps to reproduce:

1. Introduce a network partition between the controller nodes in Fuel.
2. Wait until the neutron-ovs-agents go into the unmanaged state in crm.
3. Remove the network partition and try restarting the neutron-ovs-agents using crm commands.

The agent stays in the unmanaged state.

If these steps do not reproduce the issue, then after the services have entered the unmanaged state,
delete the neutron-ovs-agent pid file /var/run/resource-agents/ocf-neutron-ovs-agent/ocf-neutron-ovs-agent.pid
and then try restarting the services using crm commands.
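The steps above can be sketched on one controller node; this is a hedged outline only, assuming an iptables-based partition of corosync traffic on its default UDP ports (5404/5405) and the clone resource name used in Fuel deployments of that era (verify both against your cluster with `crm configure show`):

```shell
# Sketch, not a definitive procedure. Ports and the resource name
# p_neutron-plugin-openvswitch-agent are assumptions about this deployment.

# 1. Partition this controller from its peers (drop corosync traffic).
iptables -I INPUT  -p udp -m multiport --dports 5404,5405 -j DROP
iptables -I OUTPUT -p udp -m multiport --dports 5404,5405 -j DROP

# 2. Watch until the agent resource is reported as unmanaged.
crm_mon -1 | grep -i unmanaged

# 3. Heal the partition, then try to recover the agent via crm.
iptables -D INPUT  -p udp -m multiport --dports 5404,5405 -j DROP
iptables -D OUTPUT -p udp -m multiport --dports 5404,5405 -j DROP
crm resource cleanup p_neutron-plugin-openvswitch-agent
crm resource restart p_neutron-plugin-openvswitch-agent
```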

Dmitry Klenov (dklenov)
Changed in fuel:
assignee: nobody → MOS Neutron (mos-neutron)
importance: Undecided → High
milestone: none → 8.0
Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

We will be moving the neutron agents out from under pcs control, so this should fundamentally solve it.

Changed in fuel:
status: New → Confirmed
tags: added: customer-found
Changed in fuel:
assignee: MOS Neutron (mos-neutron) → Sergey Kolekonov (skolekonov)
status: Confirmed → Triaged
Dmitry Pyzhov (dpyzhov)
tags: added: area-mos
Revision history for this message
Roman Rufanov (rrufanov) wrote :

Customer found this on 6.1; we need a fix there and in all subsequent versions, please. Thanks!

tags: added: support
Revision history for this message
Sergey Kolekonov (skolekonov) wrote :

Per discussion with the Fuel Library folks, we should not remove the ovs agent from Pacemaker as a permanent solution.
The problem with network partitioning should be solved globally, as there's nearly nothing to be done from the ovs agent side.
I've filed a bug that describes one possible solution to this problem, though it's more of a Fuel feature: https://bugs.launchpad.net/fuel/+bug/1515894

Reassigning the bug to Fuel Library, as there's nothing to do here from the MOS side.

Changed in fuel:
status: Triaged → Confirmed
Changed in fuel:
assignee: Sergey Kolekonov (skolekonov) → Fuel Library Team (fuel-library)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

"Wait till neutron-ovs-agents go into unmanaged state in crm."
This has nothing to do with network partitions; it is a flaw in the OCF script.
A resource may only become unmanaged if it failed to stop AND stonith is not enabled in the Pacemaker cluster.
Hence, we must check the OCF code thoroughly and fix the cases where a resource might fail to stop.
There was the same type of issue in the mysql OCF, btw.
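The two conditions described above can be checked directly on a controller; a minimal sketch, assuming crmsh is installed and the Fuel-era resource name (an assumption; substitute your own):

```shell
# A resource goes unmanaged after a failed stop only when fencing is off,
# so first check whether STONITH is enabled:
crm configure show cib-bootstrap-options | grep stonith-enabled

# Then look for failed stop operations and the resource's fail count:
crm_mon -1 --show-detail | grep -i failed
crm resource failcount p_neutron-plugin-openvswitch-agent show $(hostname)

# If the stop failure was transient, clearing the resource's operation
# history lets Pacemaker manage it again:
crm resource cleanup p_neutron-plugin-openvswitch-agent
```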

tags: added: area-library
removed: area-mos
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I would set this to Critical, as failing stop actions are blockers for operations.

Changed in fuel:
importance: High → Critical
Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Sergey Vasilenko (xenolog) wrote :

I see another solution for this.
We should make an empty OCF script that does nothing except provide a monitor action that checks network consistency.

All agent resources should be colocated with this script. If the network breaks, the agents go down, and they come back up only after network connectivity is restored.
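That idea corresponds roughly to a dummy OCF resource agent whose monitor action only verifies connectivity; a minimal sketch, where the agent name, the peer_ip parameter, and the ping-based check are all hypothetical choices, not anything that shipped in Fuel:

```shell
#!/bin/sh
# Hypothetical "network-checker" OCF agent sketch: start/stop manage
# nothing; monitor fails when a peer is unreachable, so resources
# colocated with this one stop during a partition and start again
# once connectivity is restored.

: ${OCF_ROOT=/usr/lib/ocf}
. ${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs

PEER="${OCF_RESKEY_peer_ip}"   # hypothetical parameter: a peer address to probe

case "$1" in
    start|stop)
        exit $OCF_SUCCESS ;;                 # nothing to actually manage
    monitor)
        if ping -c 1 -W 2 "$PEER" >/dev/null 2>&1; then
            exit $OCF_SUCCESS                # connectivity is fine
        else
            exit $OCF_ERR_GENERIC            # partition detected
        fi ;;
    meta-data)
        exit $OCF_SUCCESS ;;                 # metadata omitted in this sketch
    *)
        exit $OCF_ERR_UNIMPLEMENTED ;;
esac
```

The agent clones would then be constrained with colocation and order rules so that each neutron agent runs only where the checker's monitor is passing.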

Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

Please update the ticket with a detailed plan of how to create the 'network partitioning'. Which networks were included? What quorum policy did you have? Where can I find the logs? Where can I find the exact Fuel build number? Please follow

https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Test_and_report_bugs

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Matthew Mosesohn (raytrac3r)
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

I'm okay with Sergey's approach that we should ensure end-to-end connectivity on all required ports and protocols, but that's quite a large open-ended solution. What we need here first is a concrete way to reproduce the issue.

I tried blocking corosync and rabbitmq traffic so that the neutron agents all died and their corosync resources went down, waited 10 minutes, then removed the iptables rules to block them. All services recovered as expected. PID files were removed as expected, too.

I also tried simply removing the pidfile and restarting ovs agent. It restarted with no issues.

This was in 8.0. We can't leave this bug in critical state if we can't reproduce this and it isn't actually blocking any 8.0 deployments. I'll move this to critical for 6.1 and high for 8.0. I will try now to reproduce this bug in 6.1.

Changed in fuel:
importance: Critical → High
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Medium on 6.1 too. I couldn't reproduce it in a way that makes the service reach the unmanaged state.

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Moving back to fuel-library

Changed in fuel:
assignee: Matthew Mosesohn (raytrac3r) → Fuel Library Team (fuel-library)
tags: added: team-bugfix
Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Setting this as Invalid for 6.1-updates and 7.0-updates, since per comment #9 the issue was not reproduced.

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Moving to 9.0. We can't reproduce this on 8.0 and it's not high priority.

Changed in fuel:
importance: High → Medium
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Won't Fix for 8.0 because it's Medium and past SCF.

Changed in fuel:
status: Incomplete → Won't Fix
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
tags: added: wontfix-low