Sometimes dhcp-agent has status 'unmanaged' during execution of function 'stop'

Bug #1377906 reported by Anastasia Palkina
This bug affects 2 people
Affects              Status         Importance   Assigned to       Milestone
Fuel for OpenStack   Fix Released   High         Stanislav Makar
5.1.x                Fix Released   High         Stanislav Makar
6.0.x                Fix Released   High         Stanislav Makar

Bug Description

I found it on release 5.1 ISO + patch fuel-5.1_neutron_fix_20141001.patch

1. Create new environment (CentOS, HA mode)
2. Choose VLAN neutron
3. Add 3 controllers, 1 compute, 1 cinder
4. Start deployment. It was successful
5. Create 2 new tenants, configure networks and routers for them
6. Create instances for all 3 tenants and start 'ping 8.8.8.8' on them
7. Pause primary controller
8. Wait some time
9. p_neutron-l3-agent migrated to third controller
10. But p_neutron-dhcp-agent has status "unmanaged", and instances report "Network is unreachable"

[root@node-3 ~]# pcs status
Cluster name:
Last updated: Mon Oct 6 11:30:58 2014
Last change: Mon Oct 6 11:30:19 2014 via crm_attribute on node-2.domain.tld
Stack: classic openais (with plugin)
Current DC: node-2.domain.tld - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
3 Nodes configured, 3 expected votes
20 Resources configured

Online: [ node-2.domain.tld node-3.domain.tld ]
OFFLINE: [ node-1.domain.tld ]

Full list of resources:

 vip__management_old (ocf::mirantis:ns_IPaddr2): Started node-2.domain.tld
 vip__public_old (ocf::mirantis:ns_IPaddr2): Started node-3.domain.tld
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-2.domain.tld node-3.domain.tld ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-3.domain.tld ]
     Slaves: [ node-2.domain.tld ]
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-2.domain.tld node-3.domain.tld ]
 p_openstack-heat-engine (ocf::mirantis:openstack-heat-engine): Started node-2.domain.tld
 Clone Set: clone_p_neutron-openvswitch-agent [p_neutron-openvswitch-agent]
     Started: [ node-2.domain.tld node-3.domain.tld ]
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     Started: [ node-2.domain.tld node-3.domain.tld ]
 p_neutron-dhcp-agent (ocf::mirantis:neutron-agent-dhcp): FAILED node-3.domain.tld (unmanaged)
 p_neutron-l3-agent (ocf::mirantis:neutron-agent-l3): Started node-3.domain.tld

Failed actions:
    p_neutron-dhcp-agent_stop_0 on node-3.domain.tld 'unknown error' (1): call=250, status=Timed Out, last-rc-change='Mon Oct 6 09:39:51 2014', queued=60038ms, exec=0ms
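
For anyone triaging similar output, here is a minimal Python sketch that pulls the fields out of a "Failed actions" entry. It is not part of the report or the fix. The regular expression and the function name parse_failed_action are illustrative assumptions, and they only target the line format shown above; other Pacemaker versions may print this entry differently.

```python
import re

# Matches the single-line "Failed actions" entry format shown in this report.
FAILED_ACTION = re.compile(
    r"(?P<resource>\S+)_(?P<op>[a-z]+)_(?P<interval>\d+) on (?P<node>\S+) "
    r"'(?P<reason>[^']*)' \((?P<rc>\d+)\): call=(?P<call>\d+), "
    r"status=(?P<status>[^,]+), .*?queued=(?P<queued_ms>\d+)ms, "
    r"exec=(?P<exec_ms>\d+)ms"
)


def parse_failed_action(line):
    """Return the fields of one failed-action line as a dict, or None."""
    match = FAILED_ACTION.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be compared with timeouts.
    for key in ("interval", "rc", "call", "queued_ms", "exec_ms"):
        fields[key] = int(fields[key])
    return fields
```

Applied to the entry above, this yields the resource name, the operation (stop), the node, the status (Timed Out), and the queued time in milliseconds, which can then be compared with the configured operation timeout.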

Logs are here: https://drive.google.com/a/mirantis.com/file/d/0B6SjzarTGFxaTGZkRmJNRGlVVmM/view?usp=sharing

Tags: neutron
Stanislav Makar (smakar)
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Stanislav Makar (smakar)
Stanislav Makar (smakar)
Changed in fuel:
status: New → In Progress
Stanislav Makar (smakar) wrote :
description: updated
Stanislav Makar (smakar) wrote :

I've got the same, but with the l3 agent:
p_neutron-l3-agent (ocf::mirantis:neutron-agent-l3): FAILED node-2.test.domain.local (unmanaged)

Looks like they are similar.

Detailed output below:

Last updated: Tue Oct 7 12:29:20 2014
Last change: Tue Oct 7 12:28:57 2014 via crm_attribute on node-3.test.domain.local
Stack: classic openais (with plugin)
Current DC: node-1.test.domain.local - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
3 Nodes configured, 3 expected votes
20 Resources configured

Online: [ node-1.test.domain.local node-2.test.domain.local node-3.test.domain.local ]

vip__management_old (ocf::mirantis:ns_IPaddr2): Started node-1.test.domain.local
vip__public_old (ocf::mirantis:ns_IPaddr2): Started node-1.test.domain.local
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-3.test.domain.local ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-1.test.domain.local ]
     Slaves: [ node-2.test.domain.local node-3.test.domain.local ]
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-3.test.domain.local ]
p_openstack-heat-engine (ocf::mirantis:openstack-heat-engine): Started node-1.test.domain.local
 Clone Set: clone_p_neutron-openvswitch-agent [p_neutron-openvswitch-agent]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-3.test.domain.local ]
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-3.test.domain.local ]
p_neutron-dhcp-agent (ocf::mirantis:neutron-agent-dhcp): Started node-1.test.domain.local
p_neutron-l3-agent (ocf::mirantis:neutron-agent-l3): FAILED node-2.test.domain.local (unmanaged)

Failed actions:
    p_neutron-l3-agent_stop_0 on node-2.test.domain.local 'unknown error' (1): call=244, status=Timed Out, last-rc-change='Tue Oct 7 12:09:59 2014', queued=60000ms, exec=0ms

Stanislav Makar (smakar) wrote :

On a clean 5.1, HA doesn't work at all:
routers and dhcp namespaces are not migrated.

I have reproduced the issue on 5.1 + patch.
The error is:
<27>Oct 8 12:54:37 node-5 crmd[20306]: error: process_lrm_event: LRM operation p_neutron-dhcp-agent_stop_0 (247) Timed Out (timeout=60000ms)

Stanislav Makar (smakar) wrote :

It's an intermittent problem and not easy to reproduce; it looks like it appears only after installation.
Meanwhile I have found another bug: https://bugs.launchpad.net/fuel/+bug/1379272

I am still investigating the problem and trying to debug it.

I have some ideas on how to fix it:
1. Increase the timeout of the stop operation for resource p_neutron-dhcp-agent
2. Add failure-timeout for resource p_neutron-dhcp-agent
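
The two ideas above could be tried with commands along these lines. This is only a sketch under assumptions: it assumes the pcs 0.9-style syntax (`pcs resource update <id> op stop timeout=...` and `pcs resource meta <id> failure-timeout=...`), which should be checked against the installed pcs version. The helper names and the values 120 and 180 are placeholders and do not come from this report. The script only prints the commands; it does not run them.

```python
import shlex

RESOURCE = "p_neutron-dhcp-agent"  # resource name from the pcs status output


def stop_timeout_argv(resource, timeout_s):
    """Idea 1: raise the timeout of the resource's stop operation."""
    return ["pcs", "resource", "update", resource,
            "op", "stop", f"timeout={timeout_s}s"]


def failure_timeout_argv(resource, failure_timeout_s):
    """Idea 2: set the failure-timeout meta attribute on the resource."""
    return ["pcs", "resource", "meta", resource,
            f"failure-timeout={failure_timeout_s}s"]


if __name__ == "__main__":
    # 120 and 180 are placeholder values, not numbers from this report.
    for argv in (stop_timeout_argv(RESOURCE, 120),
                 failure_timeout_argv(RESOURCE, 180)):
        print(shlex.join(argv))  # print only; nothing is executed
```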

tags: added: fuel-lib-neutron
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/5.1)

Fix proposed to branch: stable/5.1
Review: https://review.openstack.org/127919

Mike Scherbakov (mihgen)
tags: added: neutron
removed: fuel-lib-neutron
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/130764

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/130764
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=fdb5a3d4a230fee3cc0abff4ff8bf0ebc3d7bc72
Submitter: Jenkins
Branch: master

commit fdb5a3d4a230fee3cc0abff4ff8bf0ebc3d7bc72
Author: Stanislav Makar <email address hidden>
Date: Fri Oct 24 15:02:16 2014 +0300

    Refactor the stop operation for neutron-dhcp-agent

    * Combine the stopping of neutron-dhcp-agent and the stopping of dnsmasq
    processes inside namespaces into one time period (it was only for stopping
    of ONE neutron dhcp process).
    * Clean up duplicate code .

    Change-Id: Iaaa2b6f7c8e3b0fa870715c69fd39fa45e5dc526
    Closes-Bug: 1377906
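
The idea in the commit message is to put several stop steps under one time period. A minimal Python sketch of that idea follows. It is not the code from the commit: the actual change is in the fuel-library resource agent, and the name stop_all, the stopper callables, and the budget value are hypothetical. Each step receives only the time left in the shared budget, so the whole stop cannot run past the deadline that the cluster manager enforces.

```python
import time


def stop_all(stoppers, budget_s, clock=time.monotonic):
    """Run each stop step with what remains of ONE shared time budget.

    Each stopper is a callable that takes the remaining seconds and
    returns True once its process is gone. The result is True only if
    every step succeeded before the shared deadline.
    """
    deadline = clock() + budget_s
    for stop in stoppers:
        remaining = deadline - clock()
        if remaining <= 0:
            return False  # budget used up by earlier steps
        if not stop(remaining):
            return False  # this step failed or ran out of time
    return True
```

In this sketch, one stopper would stand for the neutron-dhcp-agent process and further stoppers for the dnsmasq processes inside the namespaces. With a per-step timeout instead, the total time could add up to more than the 60000 ms stop timeout seen in the logs.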

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/5.1)

Fix proposed to branch: stable/5.1
Review: https://review.openstack.org/132527

OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (stable/5.1)

Change abandoned by Stanislav Makar (<email address hidden>) on branch: stable/5.1
Review: https://review.openstack.org/127919
Reason: moved to https://review.openstack.org/#/c/132527/

Anastasia Palkina (apalkina) wrote :

Verified on ISO #75

"build_id": "2014-11-04_16-38-46", "ostf_sha": "9c6fadca272427bb933bc459e14bb1bad7f614aa", "build_number": "75", "auth_required": true, "api": "1.0", "nailgun_sha": "35946b1f225c984f11915ba8e985584160f0b129", "production": "docker", "fuelmain_sha": "d498d9153494b412cc75900ab8a1f4e18bc26c13", "astute_sha": "c72dac7b31646fbedbfc56a2a87676c6d5713fcf", "feature_groups": ["mirantis"], "release": "6.0", "release_versions": {"2014.2-6.0": {"VERSION": {"build_id": "2014-11-04_16-38-46", "ostf_sha": "9c6fadca272427bb933bc459e14bb1bad7f614aa", "build_number": "75", "api": "1.0", "nailgun_sha": "35946b1f225c984f11915ba8e985584160f0b129", "production": "docker", "fuelmain_sha": "d498d9153494b412cc75900ab8a1f4e18bc26c13", "astute_sha": "c72dac7b31646fbedbfc56a2a87676c6d5713fcf", "feature_groups": ["mirantis"], "release": "6.0", "fuellib_sha": "2a314f9d14ea045b4b917d01d6c8f9a732ca1d7f"}}}, "fuellib_sha": "2a314f9d14ea045b4b917d01d6c8f9a732ca1d7f"

Changed in fuel:
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/5.1)

Reviewed: https://review.openstack.org/132527
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=745626ca7fc17ea5106f16aca1ecdaffbf653320
Submitter: Jenkins
Branch: stable/5.1

commit 745626ca7fc17ea5106f16aca1ecdaffbf653320
Author: Stanislav Makar <email address hidden>
Date: Mon Oct 13 12:32:27 2014 +0300

    Refactor the stop operation for neutron-dhcp-agent.

    * Combine the stopping of neutron-dhcp-agent and the stopping of dnsmasq
    processes inside namespaces into one time period (it was only for stopping of
    ONE neutron dhcp process).
    * Clean up duplicate code.

    Change-Id: Iaaa2b6f7c8e3b0fa870715c69fd39fa45e5dc526
    Closes-Bug: 1377906

Anastasia Palkina (apalkina) wrote :

Verified on ISO #17

"build_id": "2014-11-16_21-00-23", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "17", "auth_required": true, "api": "1.0", "nailgun_sha": "2fc6fc4261092a591779a8fb7e3fb1623c6abb85", "production": "docker", "fuelmain_sha": "b118fa4475833ce031ef189ce280772c676fa1c9", "astute_sha": "702af3db6f5bca92525bc8322d7d5d7675ec857e", "feature_groups": ["mirantis"], "release": "5.1.1", "release_versions": {"2014.1.3-5.1.1": {"VERSION": {"build_id": "2014-11-16_21-00-23", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "17", "api": "1.0", "nailgun_sha": "2fc6fc4261092a591779a8fb7e3fb1623c6abb85", "production": "docker", "fuelmain_sha": "b118fa4475833ce031ef189ce280772c676fa1c9", "astute_sha": "702af3db6f5bca92525bc8322d7d5d7675ec857e", "feature_groups": ["mirantis"], "release": "5.1.1", "fuellib_sha": "0d3909b9a291880af28dbe48b9c7d25215aa98ea"}}}, "fuellib_sha": "0d3909b9a291880af28dbe48b9c7d25215aa98ea"
