neutron metadata agent has failed after shutdown of primary controller

Bug #1371561 reported by Vadim Rovachev
This bug affects 2 people
Affects             Status        Importance  Assigned to       Milestone
Fuel for OpenStack  Fix Released  Critical    Sergey Vasilenko
5.0.x               Won't Fix     Critical    Sergey Vasilenko
5.1.x               Fix Released  Critical    Vladimir Kuklin
6.0.x               Fix Released  Critical    Sergey Vasilenko

Bug Description

{"build_id": "2014-09-17_21-40-34", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "11", "auth_required": true, "api": "1.0", "nailgun_sha": "eb8f2b358ea4bb7eb0b2a0075e7ad3d3a905db0d", "production": "docker", "fuelmain_sha": "8ef433e939425eabd1034c0b70e90bdf888b69fd", "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13", "feature_groups": ["mirantis"], "release": "5.1", "release_versions": {"2014.1.1-5.1": {"VERSION": {"build_id": "2014-09-17_21-40-34", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "11", "api": "1.0", "nailgun_sha": "eb8f2b358ea4bb7eb0b2a0075e7ad3d3a905db0d", "production": "docker", "fuelmain_sha": "8ef433e939425eabd1034c0b70e90bdf888b69fd", "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13", "feature_groups": ["mirantis"], "release": "5.1", "fuellib_sha": "d9b16846e54f76c8ebe7764d2b5b8231d6b25079"}}}, "fuellib_sha": "d9b16846e54f76c8ebe7764d2b5b8231d6b25079"}

Precondition steps:
Fuel master node installed

Steps to reproduce:
1. Create and deploy env with params:
   KVM hypervisor
   Neutron GRE network
   CentOS HA mode
   3 KVM machines with roles: controller + mongo
   1 Supermicro with roles: compute + ceph-osd

2. Run ostf Sanity and Functional tests.
3. Wait until all tests pass.
4. Destroy the primary controller.
5. Run ostf Sanity and Functional tests.
Expected result: all tests pass.
Actual result: the test "Check network connectivity from instance via floating IP" failed.

##########################################################################

ostf log:
http://paste.openstack.org/show/113233/

##########################################################################

crm resource list
 vip__management_old (ocf::mirantis:ns_IPaddr2): Started
 vip__public_old (ocf::mirantis:ns_IPaddr2): Started
 p_openstack-ceilometer-central (ocf::mirantis:ceilometer-agent-central): Started
 p_openstack-ceilometer-alarm-evaluator (ocf::mirantis:ceilometer-alarm-evaluator): Started
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-7.domain.tld node-8.domain.tld ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-7.domain.tld ]
     Slaves: [ node-8.domain.tld ]
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-7.domain.tld node-8.domain.tld ]
 p_openstack-heat-engine (ocf::mirantis:openstack-heat-engine): Started
 Clone Set: clone_p_neutron-openvswitch-agent [p_neutron-openvswitch-agent]
     Started: [ node-7.domain.tld node-8.domain.tld ]
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     Started: [ node-7.domain.tld node-8.domain.tld ]
 p_neutron-dhcp-agent (ocf::mirantis:neutron-agent-dhcp): Started
 p_neutron-l3-agent (ocf::mirantis:neutron-agent-l3): Started

##########################################################################

ubuntu instance start log:
http://paste.openstack.org/show/113234/

##########################################################################

cirros instance start log:
http://paste.openstack.org/show/113244/

##########################################################################
neutron agent-list
+--------------------------------------+--------------------+-------------------+-------+----------------+
| id                                   | agent_type         | host              | alive | admin_state_up |
+--------------------------------------+--------------------+-------------------+-------+----------------+
| 0a5f55cb-0372-4030-a43f-e6fb8f5643bb | L3 agent           | node-6.domain.tld | xxx   | True           |
| 2213382a-a759-4d51-b680-29190969eac9 | Open vSwitch agent | node-8.domain.tld | :-)   | True           |
| 37531782-3c9c-4610-bdf5-764a9d676d59 | Metadata agent     | node-6.domain.tld | xxx   | True           |
| 4056b707-a12d-4fed-a726-6e4ad87efe17 | Metadata agent     | node-8.domain.tld | :-)   | True           |
| 954f2aa7-ecfa-4eb9-ac85-e5ac830b0f48 | DHCP agent         | node-7.domain.tld | :-)   | True           |
| aed4e34f-af53-4a4a-847e-c08e446e832a | L3 agent           | node-8.domain.tld | :-)   | True           |
| af60e54c-33f5-4a06-a7ef-e5b7961e9426 | Open vSwitch agent | node-5.domain.tld | :-)   | True           |
| b498ef39-2734-403d-a8dc-447f8a79a861 | Open vSwitch agent | node-7.domain.tld | :-)   | True           |
| d234f93e-66a0-4511-bc0a-1c2f8037f24f | Open vSwitch agent | node-6.domain.tld | xxx   | True           |
| edf61da4-eca1-4b1c-aa6a-79d6f7269985 | Metadata agent     | node-7.domain.tld | :-)   | True           |
+--------------------------------------+--------------------+-------------------+-------+----------------+
##########################################################################

Vadim Rovachev (vrovachev) wrote :
description: updated
Changed in fuel:
importance: Undecided → Critical
status: New → Confirmed
milestone: none → 5.1
assignee: nobody → Fuel Core Team (fuel-core)
description: updated
Changed in fuel:
assignee: Fuel Core Team (fuel-core) → Sergey Vasilenko (xenolog)
Tatyanka (tatyana-leontovich) wrote :

The same result also occurs on Ubuntu:

{"build_id": "2014-09-17_21-40-34", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "11", "auth_required": true, "api": "1.0", "nailgun_sha": "eb8f2b358ea4bb7eb0b2a0075e7ad3d3a905db0d", "production": "docker", "fuelmain_sha": "8ef433e939425eabd1034c0b70e90bdf888b69fd", "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13", "feature_groups": ["mirantis"], "release": "5.1", "release_versions": {"2014.1.1-5.1": {"VERSION": {"build_id": "2014-09-17_21-40-34", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "11", "api": "1.0", "nailgun_sha": "eb8f2b358ea4bb7eb0b2a0075e7ad3d3a905db0d", "production": "docker", "fuelmain_sha": "8ef433e939425eabd1034c0b70e90bdf888b69fd", "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13", "feature_groups": ["mirantis"], "release": "5.1", "fuellib_sha": "d9b16846e54f76c8ebe7764d2b5b8231d6b25079"}}}, "fuellib_sha": "d9b16846e54f76c8ebe7764d2b5b8231d6b25079"}

Eugene Nikanorov (enikanorov) wrote :

If we look at the state of the environment at the point when the issue occurred, we see that no L3 agent is running on any of the alive controllers.

For some reason, rescheduling of the L3 agent from the dead controller either was not started or failed.

As a consequence, the metadata proxy process for the tenant network was not started, and hence VMs can't access the metadata server.

The failure to reschedule the L3 agent also indicates that all external connectivity is broken.

As a first step in resolving this issue, I suggest improving the logging in the q-agent-cleanup.py script.
Currently its logs are confusing and don't give enough information about what happened.

Changed in fuel:
status: Confirmed → Triaged
Sergey Vasilenko (xenolog) wrote :

This happens because, when the rescheduling script was started, no alive agents were registered in the Neutron database.

2014-09-19 12:17:10,750 - INFO - Started: /usr/bin/q-agent-cleanup.py --agent=l3 --reschedule --remove-dead --admin-auth-url=http://10.108.3.2:35357/v2.0 --auth-token=XXX
2014-09-19 12:17:37,944 - INFO - found dead L3 agent: 5ec7beca-4814-4ce6-b265-a32a08ca40f7
2014-09-19 13:35:58,582 - INFO - Started: /usr/bin/q-agent-cleanup.py --agent=l3 --reschedule --remove-dead --admin-auth-url=http://10.108.3.2:35357/v2.0 --auth-token=XXX

The log for a normal rescheduling operation looks like this:

2014-09-19 13:35:58,582 - INFO - Started: /usr/bin/q-agent-cleanup.py --agent=l3 --reschedule --remove-dead --admin-auth-url=http://10.108.3.2:35357/v2.0 --auth-token=XXX
2014-09-19 13:36:02,799 - INFO - found dead L3 agent: 5ec7beca-4814-4ce6-b265-a32a08ca40f7
2014-09-19 13:36:02,821 - INFO - found alive L3 agent: fbb97e82-f64c-46d3-af40-33d321810593
2014-09-19 13:36:02,832 - INFO - remove dead L3 agent: 5ec7beca-4814-4ce6-b265-a32a08ca40f7
2014-09-19 13:36:02,870 - INFO - schedule router cf2cc5b3-9fa3-4cb1-afae-fcbc835a0555 to L3 agent fbb97e82-f64c-46d3-af40-33d321810593
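
The difference between the two logs can be illustrated with a minimal sketch. This is not the real q-agent-cleanup.py code; the function name `plan_l3_reschedule`, the dict layout of an agent, and the `routers_by_agent` mapping are hypothetical and chosen only for illustration. It shows why a run that finds a dead agent but no alive one stops right after the "found dead" step.

```python
def plan_l3_reschedule(agents, routers_by_agent):
    """Return the ordered steps a rescheduling run would take.

    agents: list of dicts with "id", "agent_type" and "alive" keys
            (hypothetical layout, not the Neutron API response).
    routers_by_agent: dict mapping an agent id to the router ids it hosts.
    """
    l3_agents = [a for a in agents if a["agent_type"] == "L3 agent"]
    dead = [a["id"] for a in l3_agents if not a["alive"]]
    alive = [a["id"] for a in l3_agents if a["alive"]]

    plan = [("found dead", agent_id) for agent_id in dead]
    if not dead or not alive:
        # No alive L3 agent registered yet: there is nowhere to move
        # the routers, so the run ends after reporting the dead agent.
        return plan

    plan += [("found alive", agent_id) for agent_id in alive]
    target = alive[0]  # simplest possible choice of target agent
    for agent_id in dead:
        plan.append(("remove dead", agent_id))
        for router_id in routers_by_agent.get(agent_id, []):
            plan.append(("schedule", router_id, target))
    return plan


if __name__ == "__main__":
    dead_only = [{"id": "dead-agent", "agent_type": "L3 agent", "alive": False}]
    for step in plan_l3_reschedule(dead_only, {"dead-agent": ["router-1"]}):
        print(step)
```

With only a dead agent in the input, the plan contains a single "found dead" step, which mirrors the failed log; once an alive agent is present, the steps follow the order seen in the normal log.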

Sergey Vasilenko (xenolog) wrote :

As I see in the logs:

The Neutron L3 agent starts at 2014-09-19 12:16:38.237 and connects to the RabbitMQ server.
2014-09-19 12:16:38.314 12997 INFO neutron.openstack.common.rpc.common [req-1c2fba2b-f013-4152-b462-2a4a712b711d None] Connected to AMQP server on 127.0.0.1:5673

However, I guess the messaging system misbehaved, and at 2014-09-19 12:17:55.567 an exception was raised:
http://paste.openstack.org/show/113338/

After this, the L3 agent registered in the database successfully (see the first date):

| fbb97e82-f64c-46d3-af40-33d321810593 | L3 agent | neutron-l3-agent | l3_agent | node-12 | 1 | 2014-09-19 12:18:41 | 2014-09-19 13:35:26 | 2014-09-19 14:07:36 | NULL | {"router_id": "", "gateway_external_network_id": "", "handle_internal_only_routers": false, "use_namespaces": true, "routers": 1, "interfaces": 1, "floating_ips": 1, "interface_driver": "neutron.agent.linux.interface.OVSInterfaceDriver", "ex_gw_ports": 1} |

This looks like an oslo.messaging and RabbitMQ issue.
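
To put the timestamps quoted in this thread side by side, here is a small worked calculation. It assumes that the agent log, the rescheduling script log, and the database row all use the same clock and timezone, which the report does not state explicitly; the helper name `parse` is just for this sketch.

```python
from datetime import datetime


def parse(ts):
    """Parse a timestamp with or without fractional seconds."""
    fmt = "%Y-%m-%d %H:%M:%S.%f" if "." in ts else "%Y-%m-%d %H:%M:%S"
    return datetime.strptime(ts, fmt)


# Timestamps copied from the logs and the database row quoted in this report.
agent_started = parse("2014-09-19 12:16:38.237")     # L3 agent start
script_started = parse("2014-09-19 12:17:10.750")    # rescheduling script start
agent_registered = parse("2014-09-19 12:18:41")      # first date in the DB row

# Delay between the agent start and the script start.
start_gap = (script_started - agent_started).total_seconds()
# How long after the script start the agent appeared in the database.
registration_lag = (agent_registered - script_started).total_seconds()

print("script started %.3f s after the agent" % start_gap)
print("agent registered %.2f s after the script started" % registration_lag)
```

Under that assumption, the script ran roughly half a minute after the agent started, while the agent only showed up in the database about a minute and a half after the script had begun.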

As a workaround I propose:
* increase the sleep between starting the l3-agent and running the rescheduling script (currently 33 sec.)
* call the rescheduling script in a loop
* additionally run the rescheduling script periodically via cron
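
The loop idea from the list above could be sketched roughly as follows. This is only an illustration, not the fix that was merged; `reschedule_with_retries`, `has_alive_agent`, and `run_reschedule` are hypothetical names, and the attempt count and delay are arbitrary example values.

```python
import time


def reschedule_with_retries(has_alive_agent, run_reschedule,
                            attempts=5, delay=10.0, sleep=time.sleep):
    """Run rescheduling once an alive agent is visible, retrying if not.

    has_alive_agent: callable returning True when an alive agent is registered.
    run_reschedule:  callable that performs the actual rescheduling.
    Returns the attempt number that succeeded, or None if all attempts
    ended without an alive agent being found.
    """
    for attempt in range(1, attempts + 1):
        if has_alive_agent():
            run_reschedule()
            return attempt
        if attempt < attempts:
            # Give the agent more time to register before the next check.
            sleep(delay)
    return None


if __name__ == "__main__":
    # Simulated environment: the agent becomes visible on the third check.
    checks = iter([False, False, True])
    result = reschedule_with_retries(
        lambda: next(checks),
        lambda: print("rescheduling routers"),
        delay=0,
    )
    print("succeeded on attempt", result)
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real waiting; a cron entry calling the same script periodically would cover the case where every attempt in the loop is exhausted.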

summary: - neutron metadata agent has failed after shutdown primary controller
+ rescheduling of any neutron agents may be failed after shutdown
+ controller with corresponded agent.
summary: - rescheduling of any neutron agents may be failed after shutdown
- controller with corresponded agent.
+ neutron metadata agent has failed after shutdown of primary controller
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/5.1)

Fix proposed to branch: stable/5.1
Review: https://review.openstack.org/123098

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/123217

Changed in fuel:
assignee: Sergey Vasilenko (xenolog) → Vladimir Kuklin (vkuklin)
status: Triaged → In Progress
Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Sergey Vasilenko (xenolog)
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/123217
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=cbe6de69f2e0d55ed4ebb47a1691e642e8c8787e
Submitter: Jenkins
Branch: master

commit cbe6de69f2e0d55ed4ebb47a1691e642e8c8787e
Author: Sergey Vasilenko <email address hidden>
Date: Mon Sep 22 15:36:42 2014 +0400

    multiple start rescheduling after migration L3/DHCP agent

    This workaround is safe, because starting rescheduling on alive agent do nothing

    Change-Id: I5b4ea431d6907e4d57d0f753d3720a470583c77f
    Closes-bug: #1371561

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/5.1)

Reviewed: https://review.openstack.org/123098
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=342d0b800eed8f496355d19b83240f17a74327fb
Submitter: Jenkins
Branch: stable/5.1

commit 342d0b800eed8f496355d19b83240f17a74327fb
Author: Sergey Vasilenko <email address hidden>
Date: Mon Sep 22 15:36:42 2014 +0400

    multiple start rescheduling after migration L3/DHCP agent

    This workaround is safe, because starting rescheduling on alive agent do nothing

    Change-Id: I5b4ea431d6907e4d57d0f753d3720a470583c77f
    Closes-bug: #1371561

Tatyanka (tatyana-leontovich) wrote :

verified {"build_id": "2014-12-03_01-07-36", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "48", "auth_required": true, "api": "1.0", "nailgun_sha": "500e36d08a45dbb389bf2bd97673d9bff48ee84d", "production": "docker", "fuelmain_sha": "7626c5aeedcde77ad22fc081c25768944697d404", "astute_sha": "ef8aa0fd0e3ce20709612906f1f0551b5682a6ce", "feature_groups": ["mirantis"], "release": "5.1.1", "release_versions": {"2014.1.3-5.1.1": {"VERSION": {"build_id": "2014-12-03_01-07-36", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "48", "api": "1.0", "nailgun_sha": "500e36d08a45dbb389bf2bd97673d9bff48ee84d", "production": "docker", "fuelmain_sha": "7626c5aeedcde77ad22fc081c25768944697d404", "astute_sha": "ef8aa0fd0e3ce20709612906f1f0551b5682a6ce", "feature_groups": ["mirantis"], "release": "5.1.1", "fuellib_sha": "a3043477337b4a0a8fd166dc83d6cd5d504f5da8"}}}, "fuellib_sha": "a3043477337b4a0a8fd166dc83d6cd5d504f5da8"}

Stanislav Makar (smakar) wrote :

verifying on 6.0

tags: added: verifying
Stanislav Makar (smakar) wrote :

verified
{"build_id": "2014-12-09_22-41-06", "ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4", "build_number": "49", "auth_required": true, "api": "1.0", "nailgun_sha": "22bd43b89a17843f9199f92d61fc86cb0f8772f1", "production": "docker", "fuelmain_sha": "3aab16667f47dd8384904e27f70f7a87ba15f4ee", "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91", "feature_groups": ["mirantis"], "release": "6.0", "release_versions": {"2014.2-6.0": {"VERSION": {"build_id": "2014-12-09_22-41-06", "ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4", "build_number": "49", "api": "1.0", "nailgun_sha": "22bd43b89a17843f9199f92d61fc86cb0f8772f1", "production": "docker", "fuelmain_sha": "3aab16667f47dd8384904e27f70f7a87ba15f4ee", "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91", "feature_groups": ["mirantis"], "release": "6.0", "fuellib_sha": "2c99931072d951301d395ebd5bf45c8d401301bb"}}}, "fuellib_sha": "2c99931072d951301d395ebd5bf45c8d401301bb"}

tags: removed: verifying
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/5.0)

Fix proposed to branch: stable/5.0
Review: https://review.openstack.org/162988

OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (stable/5.0)

Change abandoned by Sergey Kolekonov (<email address hidden>) on branch: stable/5.0
Review: https://review.openstack.org/162988
