[library] vip__management recovered with error

Bug #1323277 reported by Egor Kotko
Affects              Importance   Assigned to
Fuel for OpenStack   Critical     Sergey Vasilenko
4.1.x                Critical     Aleksandr Didenko
5.0.x                Critical     Aleksandr Didenko

Bug Description

{"build_id": "2014-05-25_23-01-31", "mirantis": "yes", "build_number": "22", "ostf_sha": "1f020d69acbf50be00c12c29564f65440971bafe", "nailgun_sha": "bd09f89ef56176f64ad5decd4128933c96cb20f4", "production": "docker", "api": "1.0", "fuelmain_sha": "db2d153e62cb2b3034d33359d7e3db9d4742c811", "astute_sha": "a7eac46348dc77fc2723c6fcc3dbc66cc1a83152", "release": "5.0", "fuellib_sha": "b9985e42159187853edec82c406fdbc38dc5a6d0"}

Steps to reproduce (example commands below):
1. Deploy HA on CentOS with Neutron VLAN and 3 controllers.
2. When the deployment finishes successfully, ssh to a controller and check where the VIPs are running (crm status).
3. ssh to the node where vip__management is running.
4. Shut down br-mgmt.
5. Using crm, check whether vip__management recovers.
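
For reference, the steps above map roughly to the following commands (node selection and the exact invocations are illustrative, not taken from the original report):

# 2. On any controller, see where the VIPs are running:
crm status | grep vip__

# 3./4. On the node hosting vip__management, take the management bridge down:
ip link set dev br-mgmt down    # or: ifdown br-mgmt

# 5. From another controller, check whether the VIP was restarted elsewhere:
crm_mon -1 | grep vip__management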

Expected:
crm moves vip__management after the interface is shut down, and the cluster remains operational.

Actual result:
crm moves vip__management, but the cluster is not operational.

The output of "nova --debug service-list":
http://paste.openstack.org/show/81510/

[root@node-2 ~]# nova-manage service list
Binary Host Zone Status State Updated_At
nova-cert node-1.domain.tld internal enabled XXX 2014-05-26 11:00:56
nova-consoleauth node-1.domain.tld internal enabled XXX 2014-05-26 11:01:00
nova-scheduler node-1.domain.tld internal enabled XXX 2014-05-26 11:00:56
nova-conductor node-1.domain.tld internal enabled XXX 2014-05-26 11:00:56
nova-cert node-2.domain.tld internal enabled :-) 2014-05-26 12:26:59
nova-consoleauth node-2.domain.tld internal enabled :-) 2014-05-26 12:26:59
nova-scheduler node-2.domain.tld internal enabled :-) 2014-05-26 12:26:58
nova-conductor node-2.domain.tld internal enabled :-) 2014-05-26 12:26:59
nova-cert node-3.domain.tld internal enabled :-) 2014-05-26 12:26:54
nova-consoleauth node-3.domain.tld internal enabled :-) 2014-05-26 12:26:54
nova-scheduler node-3.domain.tld internal enabled :-) 2014-05-26 12:26:54
nova-conductor node-3.domain.tld internal enabled :-) 2014-05-26 12:26:54
nova-compute node-4.domain.tld nova enabled XXX 2014-05-26 12:24:29

Revision history for this message
Egor Kotko (ykotko) wrote :
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Dmitry Borodaenko (dborodaenko)
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Egor, please use unique descriptive names when you upload log bundles.

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

I tried the proposed test case with 4.1.1 and the outcome was even worse than described in the bug: vip__management didn't even recover after bringing br-mgmt down.
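
For reference, whether the resource recovered can be checked from a surviving controller with standard Pacemaker tooling (the exact commands are illustrative, not part of this comment):

crm_mon -1 -f                                      # one-shot cluster status, including fail counts
crm_resource --resource vip__management --locate   # which node, if any, currently runs the VIP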

Changed in fuel:
importance: Medium → Critical
tags: removed: icehouse
tags: added: ha
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

This looks like the same problem with the public VIP:
https://bugs.launchpad.net/fuel/+bug/1324951

Revision history for this message
Ryan Moe (rmoe) wrote :

I tested this on ISO 344 on Ubuntu and everything worked correctly. I tried both "ip link set ... down" and "ifdown", and both worked (see the commands below).
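
The two ways of taking the interface down referred to above (the interface name is from the report; the rest is standard iproute2/ifupdown usage):

ip link set dev br-mgmt down    # iproute2: set the link administratively down
ifdown br-mgmt                  # distro network scripts: bring the interface down and deconfigure it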

Revision history for this message
Ryan Moe (rmoe) wrote :

I was able to reproduce this on 4.1.1 ISO 356 with slightly different results. After bringing the interface down, corosync correctly moves the management VIP. However, after that happens, the nova-compute services can no longer communicate with the conductor. Even restarting the nova-compute services didn't fix the issue; I'm not sure what the root cause was.
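
For anyone retracing this step, the restart-and-verify part would look roughly like the following (the service name follows the standard openstack-nova packaging on CentOS and is an assumption, not something quoted in this comment):

# on the compute node:
service openstack-nova-compute restart
# then, from a controller:
nova-manage service list | grep nova-compute    # state should return to ":-)" once RPC works again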

Changed in fuel:
status: New → Incomplete
assignee: Dmitry Borodaenko (dborodaenko) → Alexander Didenko (adidenko)
status: Incomplete → New
Revision history for this message
Mike Scherbakov (mihgen) wrote :

If it is confirmed for 5.0.1, then it is likely confirmed for 5.1 too. Is that the case?

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Tested on
{
    "api": "1.0",
    "astute_sha": "17b1afa5f0dc8f4fca5ed4eb03ec566fbfb5ed19",
    "build_id": "2014-06-10_19-37-13",
    "build_number": "95",
    "fuellib_sha": "90310d10d8b0fdcebe9b89016ff2364f79eb8b50",
    "fuelmain_sha": "e59cb325d99fba49085ed5b0b6cfb0d9a83d2fc8",
    "mirantis": "no",
    "nailgun_sha": "f851199db57d9561b51c429956c056b463ad3f0b",
    "ostf_sha": "67b61ed3788297fa5d985afec32498d8c0f812db",
    "production": "docker",
    "release": "5.1"
}

Can't reproduce. The "node-1" services are marked as "XXX", which is the expected behavior, since we've turned off node-1's management network. But the VIPs are migrated just fine:

Online: [ node-3.test.domain.local node-4.test.domain.local ]
OFFLINE: [ node-1.test.domain.local ]

 vip__management_old (ocf::mirantis:ns_IPaddr2): Started node-3.test.domain.local
 vip__public_old (ocf::mirantis:ns_IPaddr2): Started node-4.test.domain.local
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-3.test.domain.local node-4.test.domain.local ]
     Stopped: [ node-1.test.domain.local ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-3.test.domain.local node-4.test.domain.local ]
     Stopped: [ node-1.test.domain.local ]
 Clone Set: clone_p_neutron-openvswitch-agent [p_neutron-openvswitch-agent]
     Started: [ node-3.test.domain.local node-4.test.domain.local ]
     Stopped: [ node-1.test.domain.local ]
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     Started: [ node-3.test.domain.local node-4.test.domain.local ]
     Stopped: [ node-1.test.domain.local ]
 p_neutron-dhcp-agent (ocf::mirantis:neutron-agent-dhcp): Started node-4.test.domain.local
 p_neutron-l3-agent (ocf::mirantis:neutron-agent-l3): Started node-3.test.domain.local
 openstack-heat-engine (ocf::mirantis:openstack-heat-engine): Started node-3.test.domain.local

Also, I've tested functionality via Horizon - everything seems to be OK, I can run instances, etc. OSTF also passes (except the "Some nova services have not been started.." alert, of course).

Changed in fuel:
status: New → Invalid
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Alex, please clarify one moment: was your ISO customized? The current master ISO is #260, while the ISO from stable/5.0 is only #60.

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Yep, it was fuel-gerrit-95-2014-06-10_19-37-13.iso.
I'm going to check on the 5.0 release ISO and on 4.1.1 as well. Then I'll re-check on the latest 5.1 from master.

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Tested on 5.0
{
    "api": "1.0",
    "astute_sha": "a7eac46348dc77fc2723c6fcc3dbc66cc1a83152",
    "build_id": "2014-05-27_05-51-41",
    "build_number": "26",
    "fuellib_sha": "2f79c0415159651fc1978d99bd791079d1ae4a06",
    "fuelmain_sha": "505741e4f431f85a8d0252fc42754d10c0326c1a",
    "mirantis": "yes",
    "nailgun_sha": "bd09f89ef56176f64ad5decd4128933c96cb20f4",
    "ostf_sha": "a8b7660082a6f152794c610d6abe30d360fd577d",
    "production": "docker",
    "release": "5.0"
}

VIPs moved OK:

Online: [ node-2.test.domain.local node-3.test.domain.local ]
OFFLINE: [ node-1.test.domain.local ]
 vip__management_old (ocf::mirantis:ns_IPaddr2): Started node-2.test.domain.local
 vip__public_old (ocf::mirantis:ns_IPaddr2): Started node-3.test.domain.local

Env is operational, OSTF is ok (except alerts about one controller down).

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Tested on 4.1
{
    "astute_sha": "55df06b2e84fa5d71a1cc0e78dbccab5db29d968",
    "build_id": "2014-06-18_00-41-14",
    "build_number": "367",
    "fuellib_sha": "245b99f3733fa1b24208c2d732236d404e79abd1",
    "fuelmain_sha": "6baad98fc95334b6a7c1f4b69ba0cfa3a266481c",
    "mirantis": "yes",
    "nailgun_sha": "bdc747cd70721874d1e8af36181fd4179cce4fce",
    "ostf_sha": "2b7b39e4b6ea89751b65171f24a8e80b5cac56aa",
    "release": "4.1.1"
}

VIPs moved OK:

Online: [ node-2.test.domain.local node-3.test.domain.local ]
OFFLINE: [ node-1.test.domain.local ]

 vip__management_old (ocf::mirantis:ns_IPaddr2): Started node-2.test.domain.local
 vip__public_old (ocf::mirantis:ns_IPaddr2): Started node-3.test.domain.local

Env is operational, OSTF is ok, I can launch instances, browse Horizon, etc.

Changed in fuel:
status: Invalid → Incomplete
Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
Aleksandr Didenko (adidenko) wrote :

I finally managed to reproduce it on the latest 5.1 ISO (#263)
{
    "api": "1.0",
    "astute_sha": "694b5a55695e01e1c42185bfac9cc7a641a9bd48",
    "build_id": "2014-06-23_00-31-14",
    "build_number": "265",
    "fuellib_sha": "dc2713b3ba20ccff2816cf61e8481fe2f17ed69b",
    "fuelmain_sha": "4394ca9be6540d652cc3919556633d9381e0db64",
    "mirantis": "yes",
    "nailgun_sha": "eaabb2c9bbe8e921aaa231960dcda74a7bc86213",
    "ostf_sha": "429c373fb79b1073aa336bc62c6aad45a8f93af6",
    "production": "docker",
    "release": "5.1"
}

The problem is caused by a RabbitMQ glitch on one of the remaining controller nodes (in my case node-2, after bringing down br-mgmt on node-1). Here is an excerpt from the nova-compute log:

2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db Traceback (most recent call last):
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/nova/servicegroup/drivers/db.py", line 95, in _report_state
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db service.service_ref, state_catalog)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/nova/conductor/api.py", line 218, in service_update
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db return self._manager.service_update(context, service, values)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/nova/conductor/rpcapi.py", line 330, in service_update
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db service=service_p, values=values)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/rpc/client.py", line 150, in call
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db wait_for_reply=True, timeout=timeout)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/transport.py", line 90, in _send
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db timeout=timeout)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 409, in send
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db return self._send(target, ctxt, message, wait_for_reply, timeout)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 400, in _send
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db result = self._waiter.wait(msg_id, timeout)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 267, in wait
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db reply, ending = self._poll_connection(msg_id, timeout)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 217, in _poll_connection
2014...
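
The traceback shows nova-compute timing out on an oslo.messaging RPC call to the conductor, which fits the RabbitMQ-glitch theory above. A quick way to inspect the broker on the surviving controllers would be something along these lines (standard rabbitmqctl commands, not part of the original comment):

rabbitmqctl cluster_status                                          # are the remaining brokers still clustered?
rabbitmqctl list_queues name messages consumers | grep conductor    # is anything consuming the conductor queues?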


Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

[root@node-1 ~]# crm_mon -1
Last updated: Fri Jul 4 14:10:45 2014
Last change: Fri Jul 4 13:40:10 2014 via cibadmin on node-4.test.domain.local
Stack: classic openais (with plugin)
Current DC: node-1.test.domain.local - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
3 Nodes configured, 3 expected votes
17 Resources configured

Online: [ node-1.test.domain.local node-2.test.domain.local node-4.test.domain.local ]

 vip__management_old (ocf::mirantis:ns_IPaddr2): FAILED node-1.test.domain.local
 vip__public_old (ocf::mirantis:ns_IPaddr2): Started node-1.test.domain.local
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-4.test.domain.local ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-2.test.domain.local node-4.test.domain.local ]
     Stopped: [ node-1.test.domain.local ]
 Clone Set: clone_p_neutron-openvswitch-agent [p_neutron-openvswitch-agent]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-4.test.domain.local ]
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-4.test.domain.local ]
 p_neutron-dhcp-agent (ocf::mirantis:neutron-agent-dhcp): Started node-2.test.domain.local
 p_neutron-l3-agent (ocf::mirantis:neutron-agent-l3): Started node-1.test.domain.local
 openstack-heat-engine (ocf::mirantis:openstack-heat-engine): Started node-4.test.domain.local

Failed actions:
    p_mysql_monitor_60000 on node-1.test.domain.local 'not running' (7): call=1000, status=complete, last-rc-change='Fri Jul 4 14:07:45 2014', queued=0ms, exec=0ms
    vip__management_old_monitor_2000 on node-1.test.domain.local 'not running' (7): call=1071, status=complete, last-rc-change='Fri Jul 4 14:07:50 2014', queued=17ms, exec=0ms

Reproduced on {"build_id": "2014-07-04_13-44-50", "mirantis": "yes", "build_number": "97", "ostf_sha": "09b6bccf7d476771ac859bb3c76c9ebec9da9e1f", "nailgun_sha": "d01b4efc0fc4af9d0e316b9dfc7974f16975f822", "production": "docker", "api": "1.0", "fuelmain_sha": "e312e03dbe29d3436958f7ac024402b1c468e2e4", "astute_sha": "644d279970df3daa5f5a2d2ccf8b4d22d53386ff", "release": "5.0.1", "fuellib_sha": "8a7d86a033b82520abe611bc2c286a10eae42d93"}

Deploy Neutron VLAN -> ifdown br-mgmt on the primary controller -> the VIP failed to migrate. As a result, all OpenStack services such as Glance and Keystone are not operational.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Errors from pacemaker logs
http://paste.openstack.org/show/85505/

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Revision history for this message
Aleksandr Didenko (adidenko) wrote :

When you put br-mgmt down, you effectively remove the controller from the cluster, because it can no longer communicate with any other node via the management network. So commands like "crm_mon -1" on node-1 unfortunately provide no useful info, and neither do its pacemaker logs. All the corosync checks, "nova service-list", etc. should be performed on one of the remaining controllers (see the commands below).
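
A minimal sketch of such checks on a surviving controller (node name and the openrc path are illustrative):

ssh node-2
crm_mon -1                                   # cluster view from a node that still has quorum
source /root/openrc && nova service-list     # API/RPC health as seen through the migrated VIP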

I've checked the snapshot and I see the following records in node-2 crmd.log:

2014-07-04T13:53:35.982239+00:00 warning: warning: reap_dead_nodes: Our DC node (node-1.test.domain.local) left the cluster
2014-07-04T13:53:38.941189+00:00 notice: notice: te_rsc_command: Initiating action 15: start vip__management_old_start_0 on node-2.test.domain.local (local)
2014-07-04T13:53:38.941189+00:00 notice: notice: te_rsc_command: Initiating action 17: start vip__public_old_start_0 on node-4.test.domain.local
2014-07-04T13:53:40.111458+00:00 notice: notice: process_lrm_event: LRM operation vip__management_old_start_0 (call=159, rc=0, cib-update=108, confirmed=true) ok

node-4 crmd.log:

2014-07-04T13:53:35.982245+00:00 warning: warning: reap_dead_nodes: Our DC node (node-1.test.domain.local) left the cluster
2014-07-04T13:53:39.954750+00:00 notice: notice: process_lrm_event: LRM operation vip__public_old_start_0 (call=155, rc=0, cib-update=80, confirmed=true) ok

Also, I can see the following in the node-4 netstat output (management VIP: 10.108.2.2):

tcp 0 0 10.108.2.6:46448 10.108.2.2:3306 ESTABLISHED 29702/python
tcp 0 0 10.108.2.6:46451 10.108.2.2:3306 ESTABLISHED 4500/python
tcp 0 0 10.108.2.6:39112 10.108.2.2:3306 ESTABLISHED 2058/python
tcp 0 0 10.108.2.6:46879 10.108.2.2:3306 ESTABLISHED 30656/python
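
Output like the above can be gathered with a filter along these lines (the exact invocation is an assumption, not quoted from the comment):

netstat -tnp | grep '10.108.2.2:3306'    # established TCP connections to MySQL behind the management VIP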

So it looks like both VIPs were successfully migrated to other controllers, and node-4 was even able to connect to MySQL via the management VIP.

If by any chance you still have this env around, or you're able to reproduce the issue, please let me know so I can check it on the live env.

Mike Scherbakov (mihgen)
tags: added: release-notes
Revision history for this message
Meg McRoberts (dreidellhasa) wrote :

Documented in "Known Issues" for 5.0.1.

Dmitry Ilyin (idv1985)
summary: - vip__management recovered with error
+ [puppet] vip__management recovered with error
Dmitry Ilyin (idv1985)
summary: - [puppet] vip__management recovered with error
+ [library] vip__management recovered with error
Revision history for this message
Aleksandr Didenko (adidenko) wrote :
Changed in fuel:
status: Confirmed → Fix Committed
Changed in fuel:
assignee: Aleksandr Didenko (adidenko) → Fuel Library Team (fuel-library)
Revision history for this message
Sergey Vasilenko (xenolog) wrote :

It should be fixed automatically once https://bugs.launchpad.net/fuel/+bug/1311749 is fixed.

Changed in fuel:
status: Fix Committed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/108439

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Sergey Vasilenko (xenolog)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/108439
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=40a2e36e37e122cb1f6a55c92d99afca4b001dc1
Submitter: Jenkins
Branch: master

commit 40a2e36e37e122cb1f6a55c92d99afca4b001dc1
Author: Sergey Vasilenko <email address hidden>
Date: Mon Jul 21 21:09:51 2014 +0400

    add checking for interface state to the ns_IPaddr2

    Change-Id: I5bd493ac1b661e6c7bf98b3c5659045de1e97c3e
    Closes-bug: #1311749
    Closes-bug: #1323277
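
The merged change adds an interface state check to the ns_IPaddr2 resource agent; the review linked above has the actual code. As a rough, illustrative sketch of the idea (report the VIP as not running from the monitor action when the underlying interface is down; function and variable names here are not the merged implementation):

check_nic_state() {
    # NIC that carries the VIP, as passed to the resource (IPaddr2-style "nic" parameter)
    local nic="${OCF_RESKEY_nic}"

    # operstate reads "down" when the interface is administratively or physically down
    if [ "$(cat /sys/class/net/${nic}/operstate 2>/dev/null)" = "down" ]; then
        ocf_log warn "Interface ${nic} is down, reporting the VIP as not running"
        return "$OCF_NOT_RUNNING"
    fi
    return "$OCF_SUCCESS"
}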

Changed in fuel:
status: In Progress → Fix Committed
Tom Fifield (fifieldt)
Changed in fuel:
status: Fix Committed → Fix Released