Long enough network partitioning puts neutron-ovs-agent in unmanaged state

Bug #1504322 reported by Victor Denisov
Affects                          Status     Importance  Assigned to
Fuel for OpenStack               Won't Fix  Medium      Fuel Library (Deprecated)
6.1.x                            Invalid    Medium      MOS Maintenance
7.0.x                            Invalid    Medium      MOS Maintenance
Future                           Invalid    Medium      Fuel Library (Deprecated)

Bug Description

Steps to reproduce:

1. Introduce a network partition between the controller nodes in Fuel.
2. Wait until the neutron-ovs-agents go into the unmanaged state in crm.
3. Remove the network partition and try restarting the neutron-ovs-agents using crm commands.

The agent stays in the unmanaged state.

If these steps do not reproduce the issue, then after the services have entered the unmanaged state,
delete the neutron-ovs-agent pid file /var/run/resource-agents/ocf-neutron-ovs-agent/ocf-neutron-ovs-agent.pid
and then try restarting the services using crm commands.
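The steps above can be sketched on one controller node; this is a hedged outline only, assuming an iptables-based partition of corosync traffic on its default UDP ports (5404/5405) and the clone resource name used in Fuel deployments of that era (verify both against your cluster with `crm configure show`):

```shell
# Sketch, not a definitive procedure. Ports and the resource name
# p_neutron-plugin-openvswitch-agent are assumptions about this deployment.

# 1. Partition this controller from its peers (drop corosync traffic).
iptables -I INPUT  -p udp -m multiport --dports 5404,5405 -j DROP
iptables -I OUTPUT -p udp -m multiport --dports 5404,5405 -j DROP

# 2. Watch until the agent resource is reported as unmanaged.
crm_mon -1 | grep -i unmanaged

# 3. Heal the partition, then try to recover the agent via crm.
iptables -D INPUT  -p udp -m multiport --dports 5404,5405 -j DROP
iptables -D OUTPUT -p udp -m multiport --dports 5404,5405 -j DROP
crm resource cleanup p_neutron-plugin-openvswitch-agent
crm resource restart p_neutron-plugin-openvswitch-agent
```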

Dmitry Klenov (dklenov)
Changed in fuel:
assignee: nobody → MOS Neutron (mos-neutron)
importance: Undecided → High
milestone: none → 8.0
Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

We will be moving the neutron agents out from under pcs control, so this should fundamentally solve it.

Changed in fuel:
status: New → Confirmed
tags: added: customer-found
Changed in fuel:
assignee: MOS Neutron (mos-neutron) → Sergey Kolekonov (skolekonov)
status: Confirmed → Triaged
Dmitry Pyzhov (dpyzhov)
tags: added: area-mos
Revision history for this message
Roman Rufanov (rrufanov) wrote :

Customer found this on 6.1; we need a fix there and in all subsequent versions, please. Thanks!

tags: added: support
Revision history for this message
Sergey Kolekonov (skolekonov) wrote :

Per discussion with the Fuel Library folks, we should not remove the ovs agent from Pacemaker as a permanent solution.
The problem with network partitioning should be solved globally, as there's nearly nothing to be done from the ovs agent side.
I've filed a bug that describes one possible solution to this problem, though it's more of a Fuel feature: https://bugs.launchpad.net/fuel/+bug/1515894

Reassigning the bug to Fuel Library, as there's nothing to do here from the MOS side.

Changed in fuel:
status: Triaged → Confirmed
Changed in fuel:
assignee: Sergey Kolekonov (skolekonov) → Fuel Library Team (fuel-library)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

"Wait till neutron-ovs-agents go into unmanaged state in crm."
This has nothing to do with network partitions; it is a flaw in the OCF script.
A resource may only become unmanaged if it failed to stop AND stonith is not enabled in the Pacemaker cluster.
Hence, we must check the OCF code thoroughly and fix the cases where a resource might fail to stop.
There was the same type of issue in the mysql OCF, btw.
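The two conditions described above can be checked directly on a controller; a minimal sketch, assuming crmsh is installed and the Fuel-era resource name (an assumption; substitute your own):

```shell
# A resource goes unmanaged after a failed stop only when fencing is off,
# so first check whether STONITH is enabled:
crm configure show cib-bootstrap-options | grep stonith-enabled

# Then look for failed stop operations and the resource's fail count:
crm_mon -1 --show-detail | grep -i failed
crm resource failcount p_neutron-plugin-openvswitch-agent show $(hostname)

# If the stop failure was transient, clearing the resource's operation
# history lets Pacemaker manage it again:
crm resource cleanup p_neutron-plugin-openvswitch-agent
```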

tags: added: area-library
removed: area-mos
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I would set this to Critical, as failing stop actions are blockers for operations.

Changed in fuel:
importance: High → Critical
Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Sergey Vasilenko (xenolog) wrote :

I see another solution for this.
We should make an empty OCF script that does nothing except provide a monitor action that checks network consistency.

All agent resources should be colocated with this script. If the network breaks, the agents go down, and they come back up only after network connectivity is restored.
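That idea corresponds roughly to a dummy OCF resource agent whose monitor action only verifies connectivity; a minimal sketch, where the agent name, the peer_ip parameter, and the ping-based check are all hypothetical choices, not anything that shipped in Fuel:

```shell
#!/bin/sh
# Hypothetical "network-checker" OCF agent sketch: start/stop manage
# nothing; monitor fails when a peer is unreachable, so resources
# colocated with this one stop during a partition and start again
# once connectivity is restored.

: ${OCF_ROOT=/usr/lib/ocf}
. ${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs

PEER="${OCF_RESKEY_peer_ip}"   # hypothetical parameter: a peer address to probe

case "$1" in
    start|stop)
        exit $OCF_SUCCESS ;;                 # nothing to actually manage
    monitor)
        if ping -c 1 -W 2 "$PEER" >/dev/null 2>&1; then
            exit $OCF_SUCCESS                # connectivity is fine
        else
            exit $OCF_ERR_GENERIC            # partition detected
        fi ;;
    meta-data)
        exit $OCF_SUCCESS ;;                 # metadata omitted in this sketch
    *)
        exit $OCF_ERR_UNIMPLEMENTED ;;
esac
```

The agent clones would then be constrained with colocation and order rules so that each neutron agent runs only where the checker's monitor is passing.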

Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

Please update the ticket with a detailed plan of how to create the 'network partitioning'. Which networks were included? What quorum policy did you have? Where can I find the logs? Where can I find the exact Fuel build number? Please follow

https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Test_and_report_bugs

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Matthew Mosesohn (raytrac3r)
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

I'm okay with Sergey's approach that we should ensure end-to-end connectivity on all required ports and protocols, but that's quite a large open-ended solution. What we need here first is a concrete way to reproduce the issue.

I tried blocking corosync and rabbitmq traffic so that the neutron agents all died and their corosync resources went down, waited 10 minutes, then removed the iptables rules to block them. All services recovered as expected. PID files were removed as expected, too.

I also tried simply removing the pidfile and restarting ovs agent. It restarted with no issues.

This was in 8.0. We can't leave this bug in critical state if we can't reproduce this and it isn't actually blocking any 8.0 deployments. I'll move this to critical for 6.1 and high for 8.0. I will try now to reproduce this bug in 6.1.

Changed in fuel:
importance: Critical → High
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Medium on 6.1 too. I couldn't reproduce it in a way that makes the service reach the unmanaged state.

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Moving back to fuel-library

Changed in fuel:
assignee: Matthew Mosesohn (raytrac3r) → Fuel Library Team (fuel-library)
tags: added: team-bugfix
Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Setting this as Invalid for 6.1-updates and 7.0-updates, since per comment #9 the issue was not reproduced.

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Moving to 9.0. We can't reproduce this on 8.0 and it's not high priority.

Changed in fuel:
importance: High → Medium
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Won't Fix for 8.0 because it's Medium and past SCF.

Changed in fuel:
status: Incomplete → Won't Fix
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
tags: added: wontfix-low