Pacemaker neutron agent scripts start/stop/migration will fail if management vip moved recently

Bug #1287716 reported by Matthew Mosesohn on 2014-03-04
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
4.1.x
High
Fuel Library (Deprecated)

Bug Description

{"build_id": "2014-03-04_12-31-13", "mirantis": "yes", "build_number": "112", "nailgun_sha": "d98b61e073d32c45c98099a11ff263a68b7ba205", "ostf_sha": "dc54d99ddff2f497b131ad1a42362515f2a61afa", "fuelmain_sha": "16637e2ea0ae6fe9a773aceb9d76c6e3a75f6c3b", "astute_sha": "f15f5615249c59c826ea05d26707f062c88db32a", "release": "4.1", "fuellib_sha": "15a55ccff0f59929b32d087679d19e896bde8e0d"}

Steps to reproduce:
1 - Deploy Ubuntu HA (Cinder LVM backend, Swift glance backend, Neutron with GRE segmentation) 3 computes - 1 controller - 1 storage
2 - Log into the first controller and run crm_resource -r vip__management_old --move --node node-3 (NOTE: replace node-3 with the name of a non-primary controller)
3 - Wait ~60s for keystone and other services to recover
4 - Run neutron agent-list

Results:

# neutron agent-list
+--------------------------------------+--------------------+--------+-------+----------------+
| id                                   | agent_type         | host   | alive | admin_state_up |
+--------------------------------------+--------------------+--------+-------+----------------+
| 09699e60-aa51-4a66-bf0f-bb8eeab49da5 | L3 agent           | node-3 | xxx   | True           |
| 12236192-8980-4068-8ed8-adc94eb1f681 | Open vSwitch agent | node-1 | :-)   | True           |
| 2c0ec06d-087c-4e45-b066-403ce6a97f51 | Open vSwitch agent | node-2 | :-)   | True           |
| ad4c9181-6a26-4b4c-be22-214c3df2514e | DHCP agent         | node-1 | xxx   | True           |
| bd893993-5768-4182-ac4a-ff71e7905a64 | Open vSwitch agent | node-3 | :-)   | True           |
| f7451cfd-600a-444b-8d36-2af7b21714c3 | Open vSwitch agent | node-4 | :-)   | True           |
+--------------------------------------+--------------------+--------+-------+----------------+

# crm resource show | egrep 'l3|dhcp'
 p_neutron-dhcp-agent (ocf::mirantis:neutron-agent-dhcp): Started (unmanaged) FAILED
 p_neutron-l3-agent (ocf::mirantis:neutron-agent-l3): Started (unmanaged) FAILED

From l3 agent logs:
p_neutron-l3-agent_start_0:4166:stderr [ Traceback (most recent call last): ]
p_neutron-l3-agent_start_0:4166:stderr [ File "/usr/bin/q-agent-cleanup.py", line 525, in <module> ]
p_neutron-l3-agent_start_0:4166:stderr [ cleaner = NeutronCleaner(get_authconfig(args.authconf), options=vars(args), log=LOG) ]
p_neutron-l3-agent_start_0:4166:stderr [ File "/usr/bin/q-agent-cleanup.py", line 106, in __init__ ]
p_neutron-l3-agent_start_0:4166:stderr [ raise e ]
p_neutron-l3-agent_start_0:4166:stderr [ keystoneclient.apiclient.exceptions.AuthorizationFailure: Authorization Failed: An unexpected erro
(2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0") None None (HTTP 500) ]
p_neutron-dhcp-agent_start_0:4153:stderr [ Traceback (most recent call last): ]
p_neutron-dhcp-agent_start_0:4153:stderr [ File "/usr/bin/q-agent-cleanup.py", line 525, in <module> ]
p_neutron-dhcp-agent_start_0:4153:stderr [ cleaner = NeutronCleaner(get_authconfig(args.authconf), options=vars(args), log=LOG) ]
p_neutron-dhcp-agent_start_0:4153:stderr [ File "/usr/bin/q-agent-cleanup.py", line 106, in __init__ ]
p_neutron-dhcp-agent_start_0:4153:stderr [ raise e ]

We should tune OCF scripts and/or q-agent-cleanup.py to be more tolerant of keystone being unavailable for up to 2 minutes.
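
One way to get that tolerance is to wrap the authentication call in a retry loop with a deadline. The following is only a sketch under assumptions, not the code from q-agent-cleanup.py: retry_until and its parameters are hypothetical names, the 120 second default simply mirrors the "2 minutes" above, and it assumes Python 3 (time.monotonic).

```python
import time


def retry_until(func, timeout=120.0, interval=5.0, retry_on=(Exception,),
                sleep=time.sleep, clock=time.monotonic):
    """Call func() until it succeeds or `timeout` seconds have elapsed.

    Hypothetical helper for illustration; not part of q-agent-cleanup.py.
    `sleep` and `clock` are injectable so the loop can be tested
    without actually waiting.
    """
    deadline = clock() + timeout
    while True:
        try:
            return func()
        except retry_on:
            # Give up (re-raising the last error) if another wait
            # would take us past the deadline.
            if clock() + interval > deadline:
                raise
            sleep(interval)
```

A caller would pass the operation that currently fails during failover, for example a function that builds the keystone client, and narrow retry_on to the exception types that indicate a temporary outage.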

Matthew Mosesohn (raytrac3r) wrote :
tags: added: library neutron

Fix proposed to branch: master
Review: https://review.openstack.org/77895

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Sergey Vasilenko (xenolog)
status: New → In Progress

Fix proposed to branch: stable/4.1
Review: https://review.openstack.org/78067

Changed in fuel:
assignee: Sergey Vasilenko (xenolog) → Dmitry Borodaenko (dborodaenko)

Reviewed: https://review.openstack.org/77895
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=48ced96753378e49883cdade8957160ef1b29899
Submitter: Jenkins
Branch: master

commit 48ced96753378e49883cdade8957160ef1b29899
Author: Sergey Vasilenko <email address hidden>
Date: Tue Mar 4 18:42:20 2014 +0400

    Make Neutron L3/DHCP agents OCF script more tolerant

    to mysql and keystone temporary fails.

    Change-Id: Iaf5d5b49932c1dc4db6bca0563607972150f4cf4
    Closes-bug: #1287716

Changed in fuel:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/78067
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=73313007c0914e602246ea41fa5e8ca2dfead9f8
Submitter: Jenkins
Branch: stable/4.1

commit 73313007c0914e602246ea41fa5e8ca2dfead9f8
Author: Sergey Vasilenko <email address hidden>
Date: Tue Mar 4 18:42:20 2014 +0400

    Make Neutron L3/DHCP agents OCF script more tolerant

    to mysql and keystone temporary fails.

    Change-Id: Iaf5d5b49932c1dc4db6bca0563607972150f4cf4
    Closes-bug: #1287716

Reviewed: https://review.openstack.org/78178
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=70e813d5b6b26dba0cd763ce24eab27747f4b573
Submitter: Jenkins
Branch: master

commit 70e813d5b6b26dba0cd763ce24eab27747f4b573
Author: Sergey Vasilenko <email address hidden>
Date: Wed Mar 5 15:40:04 2014 +0400

    Make Neutron L3/DHCP agents OCF script more tolerant to mysql and keystone temporary fails.

    In this implementation cleanup-script does not get information from Neutron API.
    Script inspects network namespaces on this node for given agent type and removes
    found ports from integration bridge.

    Closes-bug: #1287716
    Partial-bug: #1285929
    Change-Id: I2dfb31f240dca652341c4623f237f6a143414448
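
The approach described in this commit message (inspect local namespaces, then remove the ports found from the integration bridge, without calling the Neutron API) could look roughly like the sketch below. This is not the committed script. It assumes the usual Neutron namespace prefixes (qrouter- for L3, qdhcp- for DHCP), an integration bridge named br-int, and Python 3; all function names are hypothetical.

```python
import subprocess

# Assumed namespace prefixes per agent type.
NS_PREFIX = {"l3": "qrouter-", "dhcp": "qdhcp-"}


def agent_namespaces(netns_output, agent):
    """Pick the namespaces for one agent type out of `ip netns list` output."""
    prefix = NS_PREFIX[agent]
    names = (line.split()[0] for line in netns_output.splitlines() if line.strip())
    return [name for name in names if name.startswith(prefix)]


def port_names(link_output):
    """Extract interface names (except lo) from `ip -o link show` output."""
    ports = []
    for line in link_output.splitlines():
        fields = line.split(": ", 2)
        if len(fields) < 2:
            continue
        name = fields[1].split("@")[0]  # drop a veth peer suffix such as "@if4"
        if name != "lo":
            ports.append(name)
    return ports


def run(cmd):
    """Run a command and return its stdout as text."""
    return subprocess.check_output(cmd, universal_newlines=True)


def cleanup(agent, bridge="br-int", run=run):
    """Remove every port found in this node's namespaces for `agent`.

    Sketch only: a port attached to a different bridge is not handled
    here and may make ovs-vsctl exit with an error.
    """
    for ns in agent_namespaces(run(["ip", "netns", "list"]), agent):
        links = run(["ip", "netns", "exec", ns, "ip", "-o", "link", "show"])
        for port in port_names(links):
            run(["ovs-vsctl", "--if-exists", "del-port", bridge, port])
```

Because nothing here talks to keystone or MySQL, a cleanup of this kind keeps working while the management VIP is still failing over; running it for real requires root, for example cleanup("l3").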

tags: added: in progress

verified on fuel_5_0_iso#29

Changed in fuel:
status: Fix Committed → Fix Released
tags: removed: in progress
tags: added: backports-4.1.1
Andrew Woodward (xarses) on 2014-04-04
tags: added: ha
Andrew Woodward (xarses) wrote :

At a glance, it appears that commit 70e813d5b6b26dba0cd763ce24eab27747f4b573 was not backported.

Changed in fuel:
status: Fix Released → Triaged
Changed in fuel:
assignee: Dmitry Borodaenko (dborodaenko) → Sergey Vasilenko (xenolog)
Andrew Woodward (xarses) on 2014-05-07
summary: - Neutron L3/DHCP agents fail when VIP fails over
+ Pacemaker neutron agent scripts start/stop/migration will fail if
+ management vip moved recently
no longer affects: fuel/5.0.x
no longer affects: fuel
Meg McRoberts (dreidellhasa) wrote :

Documented in 4.1.1 Release Notes
