[library] vip__management recovered with error

Bug #1323277 reported by Egor Kotko
This bug affects 4 people
Affects              Status        Importance  Assigned to         Milestone
Fuel for OpenStack   Fix Released  Critical    Sergey Vasilenko
  4.1.x              Won't Fix     Critical    Aleksandr Didenko
  5.0.x              Won't Fix     Critical    Aleksandr Didenko

Bug Description

{"build_id": "2014-05-25_23-01-31", "mirantis": "yes", "build_number": "22", "ostf_sha": "1f020d69acbf50be00c12c29564f65440971bafe", "nailgun_sha": "bd09f89ef56176f64ad5decd4128933c96cb20f4", "production": "docker", "api": "1.0", "fuelmain_sha": "db2d153e62cb2b3034d33359d7e3db9d4742c811", "astute_sha": "a7eac46348dc77fc2723c6fcc3dbc66cc1a83152", "release": "5.0", "fuellib_sha": "b9985e42159187853edec82c406fdbc38dc5a6d0"}

Steps to reproduce:
1. Deploy HA on CentOS with Neutron VLAN and 3 controllers.
2. When the deployment finishes successfully, SSH to a controller and check where the VIPs are running (crm status).
3. SSH to the node where vip__management is running.
4. Shut down br-mgmt.
5. Using crm, check whether vip__management is recovered (a command sketch is given below).

Expected result:
crm moves vip__management after the interface is shut down and the cluster remains operational.

Actual result:
crm moves vip__management but the cluster is not operational.
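
A minimal command-level sketch of the steps above (the interface, resource, and command names are taken from this report; the exact way the interface is shut down is an assumption, either variant should do):

    # on any controller: find the node currently hosting vip__management
    crm status | grep vip__management

    # on that node: bring the management bridge down
    ip link set br-mgmt down    # or: ifdown br-mgmt

    # from one of the remaining controllers: check whether the VIP was recovered
    crm_mon -1 | grep vip__management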

The output of
#nova --debug service-list
http://paste.openstack.org/show/81510/

[root@node-2 ~]# nova-manage service list
Binary Host Zone Status State Updated_At
nova-cert node-1.domain.tld internal enabled XXX 2014-05-26 11:00:56
nova-consoleauth node-1.domain.tld internal enabled XXX 2014-05-26 11:01:00
nova-scheduler node-1.domain.tld internal enabled XXX 2014-05-26 11:00:56
nova-conductor node-1.domain.tld internal enabled XXX 2014-05-26 11:00:56
nova-cert node-2.domain.tld internal enabled :-) 2014-05-26 12:26:59
nova-consoleauth node-2.domain.tld internal enabled :-) 2014-05-26 12:26:59
nova-scheduler node-2.domain.tld internal enabled :-) 2014-05-26 12:26:58
nova-conductor node-2.domain.tld internal enabled :-) 2014-05-26 12:26:59
nova-cert node-3.domain.tld internal enabled :-) 2014-05-26 12:26:54
nova-consoleauth node-3.domain.tld internal enabled :-) 2014-05-26 12:26:54
nova-scheduler node-3.domain.tld internal enabled :-) 2014-05-26 12:26:54
nova-conductor node-3.domain.tld internal enabled :-) 2014-05-26 12:26:54
nova-compute node-4.domain.tld nova enabled XXX 2014-05-26 12:24:29

Revision history for this message
Egor Kotko (ykotko) wrote :
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Dmitry Borodaenko (dborodaenko)
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Egor, please use unique descriptive names when you upload log bundles.

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

I tried the proposed test case with 4.1.1 and the outcome was even worse than described in the bug: vip__management didn't even recover after bringing br-mgmt down.

Changed in fuel:
importance: Medium → Critical
tags: removed: icehouse
tags: added: ha
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

This looks like the same problem with the public VIP:
https://bugs.launchpad.net/fuel/+bug/1324951

Revision history for this message
Ryan Moe (rmoe) wrote :

I tested this on ISO 344 on Ubuntu and everything worked correctly. I tried both "ip link set ... down" and "ifdown", and both worked.

Revision history for this message
Ryan Moe (rmoe) wrote :

I was able to reproduce this on 4.1.1 ISO 356 with slightly different results. After bringing the interface down corosync correctly moves the management vip. However, after that happens nova-compute services can no longer communicate with the conductor. Even restarting the nova-compute services didn't fix the issue. I'm not sure what the root cause was.

Changed in fuel:
status: New → Incomplete
assignee: Dmitry Borodaenko (dborodaenko) → Alexander Didenko (adidenko)
status: Incomplete → New
Revision history for this message
Mike Scherbakov (mihgen) wrote :

If it is confirmed for 5.0.1, then it is likely confirmed for 5.1 too. Is that the case?

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Tested on
{
    "api": "1.0",
    "astute_sha": "17b1afa5f0dc8f4fca5ed4eb03ec566fbfb5ed19",
    "build_id": "2014-06-10_19-37-13",
    "build_number": "95",
    "fuellib_sha": "90310d10d8b0fdcebe9b89016ff2364f79eb8b50",
    "fuelmain_sha": "e59cb325d99fba49085ed5b0b6cfb0d9a83d2fc8",
    "mirantis": "no",
    "nailgun_sha": "f851199db57d9561b51c429956c056b463ad3f0b",
    "ostf_sha": "67b61ed3788297fa5d985afec32498d8c0f812db",
    "production": "docker",
    "release": "5.1"
}

Can't reproduce. The "node-1" services are marked as "XXX", which is expected behavior, since we've turned off node-1's management network. But the VIPs are migrated just fine:

Online: [ node-3.test.domain.local node-4.test.domain.local ]
OFFLINE: [ node-1.test.domain.local ]

 vip__management_old (ocf::mirantis:ns_IPaddr2): Started node-3.test.domain.local
 vip__public_old (ocf::mirantis:ns_IPaddr2): Started node-4.test.domain.local
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-3.test.domain.local node-4.test.domain.local ]
     Stopped: [ node-1.test.domain.local ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-3.test.domain.local node-4.test.domain.local ]
     Stopped: [ node-1.test.domain.local ]
 Clone Set: clone_p_neutron-openvswitch-agent [p_neutron-openvswitch-agent]
     Started: [ node-3.test.domain.local node-4.test.domain.local ]
     Stopped: [ node-1.test.domain.local ]
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     Started: [ node-3.test.domain.local node-4.test.domain.local ]
     Stopped: [ node-1.test.domain.local ]
 p_neutron-dhcp-agent (ocf::mirantis:neutron-agent-dhcp): Started node-4.test.domain.local
 p_neutron-l3-agent (ocf::mirantis:neutron-agent-l3): Started node-3.test.domain.local
 openstack-heat-engine (ocf::mirantis:openstack-heat-engine): Started node-3.test.domain.local

Also, I've tested functionality via Horizon: everything seems to be OK, I can run instances, etc. OSTF also passes (except the "Some nova services have not been started.." alert, of course).

Changed in fuel:
status: New → Invalid
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Alex, please clarify one thing: was your ISO customized? The master ISO is currently at #260 and the ISO from stable/5.0 is only at #60.

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Yep, it was fuel-gerrit-95-2014-06-10_19-37-13.iso.
I'm going to check on the 5.0 release ISO and on 4.1.1 as well. Then I'll re-check on the latest 5.1 from master.

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Tested on 5.0
{
    "api": "1.0",
    "astute_sha": "a7eac46348dc77fc2723c6fcc3dbc66cc1a83152",
    "build_id": "2014-05-27_05-51-41",
    "build_number": "26",
    "fuellib_sha": "2f79c0415159651fc1978d99bd791079d1ae4a06",
    "fuelmain_sha": "505741e4f431f85a8d0252fc42754d10c0326c1a",
    "mirantis": "yes",
    "nailgun_sha": "bd09f89ef56176f64ad5decd4128933c96cb20f4",
    "ostf_sha": "a8b7660082a6f152794c610d6abe30d360fd577d",
    "production": "docker",
    "release": "5.0"
}

VIPs moved OK:

Online: [ node-2.test.domain.local node-3.test.domain.local ]
OFFLINE: [ node-1.test.domain.local ]
 vip__management_old (ocf::mirantis:ns_IPaddr2): Started node-2.test.domain.local
 vip__public_old (ocf::mirantis:ns_IPaddr2): Started node-3.test.domain.local

Env is operational, OSTF is ok (except alerts about one controller down).

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Tested on 4.1
{
    "astute_sha": "55df06b2e84fa5d71a1cc0e78dbccab5db29d968",
    "build_id": "2014-06-18_00-41-14",
    "build_number": "367",
    "fuellib_sha": "245b99f3733fa1b24208c2d732236d404e79abd1",
    "fuelmain_sha": "6baad98fc95334b6a7c1f4b69ba0cfa3a266481c",
    "mirantis": "yes",
    "nailgun_sha": "bdc747cd70721874d1e8af36181fd4179cce4fce",
    "ostf_sha": "2b7b39e4b6ea89751b65171f24a8e80b5cac56aa",
    "release": "4.1.1"
}

VIPs moved OK:

Online: [ node-2.test.domain.local node-3.test.domain.local ]
OFFLINE: [ node-1.test.domain.local ]

 vip__management_old (ocf::mirantis:ns_IPaddr2): Started node-2.test.domain.local
 vip__public_old (ocf::mirantis:ns_IPaddr2): Started node-3.test.domain.local

Env is operational, OSTF is ok, I can launch instances, browse Horizon, etc.

Changed in fuel:
status: Invalid → Incomplete
Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
Aleksandr Didenko (adidenko) wrote :

I finally managed to reproduce it on the latest 5.1 ISO (#263)
{
    "api": "1.0",
    "astute_sha": "694b5a55695e01e1c42185bfac9cc7a641a9bd48",
    "build_id": "2014-06-23_00-31-14",
    "build_number": "265",
    "fuellib_sha": "dc2713b3ba20ccff2816cf61e8481fe2f17ed69b",
    "fuelmain_sha": "4394ca9be6540d652cc3919556633d9381e0db64",
    "mirantis": "yes",
    "nailgun_sha": "eaabb2c9bbe8e921aaa231960dcda74a7bc86213",
    "ostf_sha": "429c373fb79b1073aa336bc62c6aad45a8f93af6",
    "production": "docker",
    "release": "5.1"
}

The problem is caused by a RabbitMQ glitch on one of the remaining controller nodes (in my case node-2, after bringing down br-mgmt on node-1). Here is an example from the nova-compute log:

2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db Traceback (most recent call last):
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/nova/servicegroup/drivers/db.py", line 95, in _report_state
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db service.service_ref, state_catalog)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/nova/conductor/api.py", line 218, in service_update
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db return self._manager.service_update(context, service, values)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/nova/conductor/rpcapi.py", line 330, in service_update
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db service=service_p, values=values)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/rpc/client.py", line 150, in call
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db wait_for_reply=True, timeout=timeout)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/transport.py", line 90, in _send
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db timeout=timeout)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 409, in send
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db return self._send(target, ctxt, message, wait_for_reply, timeout)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 400, in _send
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db result = self._waiter.wait(msg_id, timeout)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 267, in wait
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db reply, ending = self._poll_connection(msg_id, timeout)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 217, in _poll_connection
2014...

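A quick way to check the RabbitMQ side on the remaining controllers (a sketch; it assumes the standard rabbitmqctl tooling available on Fuel controller nodes):

    # confirm which nodes are still members of the RabbitMQ cluster
    rabbitmqctl cluster_status
    # check that the local broker itself is up and responding
    rabbitmqctl status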

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

[root@node-1 ~]# crm_mon -1
Last updated: Fri Jul 4 14:10:45 2014
Last change: Fri Jul 4 13:40:10 2014 via cibadmin on node-4.test.domain.local
Stack: classic openais (with plugin)
Current DC: node-1.test.domain.local - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
3 Nodes configured, 3 expected votes
17 Resources configured

Online: [ node-1.test.domain.local node-2.test.domain.local node-4.test.domain.local ]

 vip__management_old (ocf::mirantis:ns_IPaddr2): FAILED node-1.test.domain.local
 vip__public_old (ocf::mirantis:ns_IPaddr2): Started node-1.test.domain.local
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-4.test.domain.local ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-2.test.domain.local node-4.test.domain.local ]
     Stopped: [ node-1.test.domain.local ]
 Clone Set: clone_p_neutron-openvswitch-agent [p_neutron-openvswitch-agent]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-4.test.domain.local ]
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-4.test.domain.local ]
 p_neutron-dhcp-agent (ocf::mirantis:neutron-agent-dhcp): Started node-2.test.domain.local
 p_neutron-l3-agent (ocf::mirantis:neutron-agent-l3): Started node-1.test.domain.local
 openstack-heat-engine (ocf::mirantis:openstack-heat-engine): Started node-4.test.domain.local

Failed actions:
    p_mysql_monitor_60000 on node-1.test.domain.local 'not running' (7): call=1000, status=complete, last-rc-change='Fri Jul 4 14:07:45 2014', queued=0ms, exec=0ms
    vip__management_old_monitor_2000 on node-1.test.domain.local 'not running' (7): call=1071, status=complete, last-rc-change='Fri Jul 4 14:07:50 2014', queued=17ms, exec=0ms

Reproduced on {"build_id": "2014-07-04_13-44-50", "mirantis": "yes", "build_number": "97", "ostf_sha": "09b6bccf7d476771ac859bb3c76c9ebec9da9e1f", "nailgun_sha": "d01b4efc0fc4af9d0e316b9dfc7974f16975f822", "production": "docker", "api": "1.0", "fuelmain_sha": "e312e03dbe29d3436958f7ac024402b1c468e2e4", "astute_sha": "644d279970df3daa5f5a2d2ccf8b4d22d53386ff", "release": "5.0.1", "fuellib_sha": "8a7d86a033b82520abe611bc2c286a10eae42d93"}

Deploy Neutron VLAN -> ifdown br-mgmt on the primary controller -> the VIP failed to migrate. As a result, all OpenStack services such as Glance and Keystone are not operational.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Errors from the Pacemaker logs:
http://paste.openstack.org/show/85505/

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Revision history for this message
Aleksandr Didenko (adidenko) wrote :

When you put br-mgmt down, you effectively remove the controller from the cluster, because it can't communicate with any other node via the management network anymore. So commands like "crm_mon -1" on node-1 unfortunately provide no useful info, and neither do its Pacemaker logs. All the corosync checks, "nova service-list", etc. should be performed on any of the remaining controllers.
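
For example, a minimal sketch of those checks, run on a surviving controller such as node-2 rather than on the isolated node (the /root/openrc credentials file is an assumption about a standard Fuel controller):

    # VIP placement as seen by the surviving partition
    crm_mon -1
    # nova service state, reached via the management VIP
    . /root/openrc
    nova service-list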

I've checked the snapshot and I see the following records in node-2 crmd.log:

2014-07-04T13:53:35.982239+00:00 warning: warning: reap_dead_nodes: Our DC node (node-1.test.domain.local) left the cluster
2014-07-04T13:53:38.941189+00:00 notice: notice: te_rsc_command: Initiating action 15: start vip__management_old_start_0 on node-2.test.domain.local (local)
2014-07-04T13:53:38.941189+00:00 notice: notice: te_rsc_command: Initiating action 17: start vip__public_old_start_0 on node-4.test.domain.local
2014-07-04T13:53:40.111458+00:00 notice: notice: process_lrm_event: LRM operation vip__management_old_start_0 (call=159, rc=0, cib-update=108, confirmed=true) ok

node-4 crmd.log:

2014-07-04T13:53:35.982245+00:00 warning: warning: reap_dead_nodes: Our DC node (node-1.test.domain.local) left the cluster
2014-07-04T13:53:39.954750+00:00 notice: notice: process_lrm_event: LRM operation vip__public_old_start_0 (call=155, rc=0, cib-update=80, confirmed=true) ok

Also I can see the following in node-4 netstat output (management_vip: 10.108.2.2):

tcp 0 0 10.108.2.6:46448 10.108.2.2:3306 ESTABLISHED 29702/python
tcp 0 0 10.108.2.6:46451 10.108.2.2:3306 ESTABLISHED 4500/python
tcp 0 0 10.108.2.6:39112 10.108.2.2:3306 ESTABLISHED 2058/python
tcp 0 0 10.108.2.6:46879 10.108.2.2:3306 ESTABLISHED 30656/python

So it looks like both VIPs were successfully migrated to other controllers, and node-4 was even able to connect to MySQL via the management VIP.

If by any chance you still have this env around, or you're able to reproduce the issue, please let me know so I can check it on the live env.

Mike Scherbakov (mihgen)
tags: added: release-notes
Revision history for this message
Meg McRoberts (dreidellhasa) wrote :

Documented in "Known Issues" for 5.0.1.

Dmitry Ilyin (idv1985)
summary: - vip__management recovered with error
+ [puppet] vip__management recovered with error
Dmitry Ilyin (idv1985)
summary: - [puppet] vip__management recovered with error
+ [library] vip__management recovered with error
Revision history for this message
Aleksandr Didenko (adidenko) wrote :
Changed in fuel:
status: Confirmed → Fix Committed
Changed in fuel:
assignee: Aleksandr Didenko (adidenko) → Fuel Library Team (fuel-library)
Revision history for this message
Sergey Vasilenko (xenolog) wrote :

It should be fixed automatically when
https://bugs.launchpad.net/fuel/+bug/1311749
is fixed.

Changed in fuel:
status: Fix Committed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/108439

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Sergey Vasilenko (xenolog)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/108439
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=40a2e36e37e122cb1f6a55c92d99afca4b001dc1
Submitter: Jenkins
Branch: master

commit 40a2e36e37e122cb1f6a55c92d99afca4b001dc1
Author: Sergey Vasilenko <email address hidden>
Date: Mon Jul 21 21:09:51 2014 +0400

    add checking for interface state to the ns_IPaddr2

    Change-Id: I5bd493ac1b661e6c7bf98b3c5659045de1e97c3e
    Closes-bug: #1311749
    Closes-bug: #1323277
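
For context, a rough sketch of what an interface-state check in an OCF resource agent's monitor action could look like (illustrative only; the actual ns_IPaddr2 change is in the review linked above, and the interface name and return code used here are assumptions):

    # inside the agent's monitor() function
    iface="br-mgmt"
    if [ "$(cat /sys/class/net/$iface/operstate 2>/dev/null)" != "up" ]; then
        # report the resource as not running so Pacemaker relocates the VIP
        return $OCF_NOT_RUNNING
    fi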

Changed in fuel:
status: In Progress → Fix Committed
Tom Fifield (fifieldt)
Changed in fuel:
status: Fix Committed → Fix Released