OpenStack cluster does not work after failover of primary controller

Bug #1322259 reported by Tatyanka
This bug affects 3 people
Affects            | Status       | Importance | Assigned to             | Milestone
Fuel for OpenStack | Fix Released | Critical   | Vladimir Kuklin         |
4.1.x              | Fix Released | Critical   | Registry Administrators |
5.0.x              | Fix Released | Critical   | Vladimir Kuklin         |

Bug Description

{"build_id": "2014-05-21_01-10-31", "mirantis": "yes", "build_number": "214", "ostf_sha": "353f918197ec53a00127fd28b9151f248a2a2d30", "nailgun_sha": "0b6e8eabaccad2aa29519561ce7cde9df9292964", "production": "docker", "api": "1.0", "fuelmain_sha": "910f262f85e94bef08e0e9b9d6230ad890bf139e", "astute_sha": "9a0d86918724c1153b5f70bdae008dea8572fd3e", "release": "5.0", "fuellib_sha": "3d92142a5643af82596f0450e39282550a45e5db"}

Steps to Reproduce:
1. Deploy an environment: 3 controllers + 2 computes on nova VLAN.
2. When deployment finishes successfully, run OSTF to make sure everything works.
3. Run Rally benchmark tests (create/delete instance) and OSTF.
4. While the tests are running, force off the primary controller (node-1 in my deployment).
5. Wait until the VIPs and other HA services have recovered.
6. Run OSTF.

Expected Result:
OpenStack cluster is operational. OSTF passes. The user can successfully create/delete instances in Horizon.

Actual result:
OSTF fails; instances are neither created nor deleted.

Queue status:
http://paste.openstack.org/show/81161/

rabbit cluster status:
[root@node-2 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-2' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-3']}]},
 {running_nodes,['rabbit@node-3','rabbit@node-2']},
 {partitions,[]}]
...done.
[root@node-2 ~]#
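
The cluster_status output above can also be checked mechanically. The following is a minimal Python sketch, assuming the output keeps the format shown in this report; `stopped_rabbit_nodes` is a hypothetical helper name and not part of any RabbitMQ tooling. It compares the configured node list with `running_nodes`.

```python
import re

# Sample copied from the rabbitmqctl output in this report.
SAMPLE = """Cluster status of node 'rabbit@node-2' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-3']}]},
 {running_nodes,['rabbit@node-3','rabbit@node-2']},
 {partitions,[]}]
...done.
"""


def stopped_rabbit_nodes(cluster_status):
    """Return the nodes that are configured but not listed as running."""
    match = re.search(
        r"\{nodes,(.*?)\{running_nodes,(.*?)\{partitions",
        cluster_status,
        re.S,
    )
    if match is None:
        raise ValueError("unrecognised cluster_status output")
    # Node names are the single-quoted atoms in each captured section.
    configured = set(re.findall(r"'([^']+)'", match.group(1)))
    running = set(re.findall(r"'([^']+)'", match.group(2)))
    return configured - running


print(stopped_rabbit_nodes(SAMPLE))
```

For the output pasted here, the only node missing from `running_nodes` is the powered-off primary controller, which is what one would expect after the failover.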

crm:

[root@node-2 ~]# crm_mon -1
Last updated: Thu May 22 15:47:26 2014
Last change: Thu May 22 12:40:37 2014 via cibadmin on node-3.test.domain.local
Stack: classic openais (with plugin)
Current DC: node-2.test.domain.local - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
3 Nodes configured, 3 expected votes
9 Resources configured

Online: [ node-2.test.domain.local node-3.test.domain.local ]
OFFLINE: [ node-1.test.domain.local ]

 vip__management_old (ocf::mirantis:ns_IPaddr2): Started node-2.test.domain.local
 vip__public_old (ocf::mirantis:ns_IPaddr2): Started node-3.test.domain.local
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-2.test.domain.local node-3.test.domain.local ]
     Stopped: [ node-1.test.domain.local ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-2.test.domain.local node-3.test.domain.local ]
     Stopped: [ node-1.test.domain.local ]
 openstack-heat-engine (ocf::mirantis:openstack-heat-engine): Started node-2.test.domain.local
[root@node-2 ~]#

On the compute node I do not see any RabbitMQ connections at all:
-leasefile-ro --domain=novalocal --no-hosts --addn-hosts=/var/lib/nova/networks/nova-br103.hosts
[root@node-4 ~]# lsof -p 21836 | grep IP
nova-comp 21836 nova 20u IPv4 89794 0t0 TCP node-4:43550->node-2:jms (ESTABLISHED)
nova-comp 21836 nova 21u IPv4 94273 0t0 TCP node-4:43597->node-2:jms (ESTABLISHED)
nova-comp 21836 nova 22u IPv4 94275 0t0 TCP node-4:43598->node-2:jms (ESTABLISHED)
nova-comp 21836 nova 23u IPv4 94287 0t0 TCP node-4:43599->node-2:jms (ESTABLISHED)
[root@node-4 ~]#

[root@node-4 ~]# lsof -p 21836 | grep 56714-05-22 15:48:54.561 21836 DEBUG nova.compute.manager [-] Didn't find any instances for network info cache update. _heal_instance_info_cache /usr/lib/python2.6/site-packages/nova/compute/manager.py:4895

Also, there are a lot of errors on the computes:
2014-05-22 15:48:54.561 21836 DEBUG nova.openstack.common.loopingcall [-] Dynamic looping call sleeping for 60.00 seconds _inner /usr/lib/python2.6/site-packages/nova/openstack/common/loopingcall.py:132
2014-05-22 15:49:54.561 21836 ERROR nova.servicegroup.drivers.db [-] model server went away
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db Traceback (most recent call last):
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/nova/servicegroup/drivers/db.py", line 95, in _report_state
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db service.service_ref, state_catalog)
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/nova/conductor/api.py", line 218, in service_update
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db return self._manager.service_update(context, service, values)
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/nova/conductor/rpcapi.py", line 330, in service_update
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db service=service_p, values=values)
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/rpc/client.py", line 150, in call
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db wait_for_reply=True, timeout=timeout)
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/transport.py", line 90, in _send
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db timeout=timeout)
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 409, in send
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db return self._send(target, ctxt, message, wait_for_reply, timeout)
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 400, in _send
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db result = self._waiter.wait(msg_id, timeout)
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 267, in wait
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db reply, ending = self._poll_connection(msg_id, timeout)
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 217, in _poll_connection
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db % msg_id)
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db MessagingTimeout: Timed out waiting for a reply to message ID 23c5b1c2b4e2425b8f1bf555722477bf
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db
3
[root@node-4 ~]#

Tatyanka (tatyana-leontovich) wrote :
Dmitry Borodaenko (angdraug) wrote :

This commit seems to improve recovery significantly:
https://review.openstack.org/95007

Not sure if it will make the described test steps pass (OSTF might still fail if a controller is lost in the middle of a test run), but it does reduce post-failover recovery time considerably.

Changed in fuel:
status: New → Confirmed
Mike Scherbakov (mihgen) wrote :

Do we want to keep this bug as Critical for 5.0? Tatyana, after some time had passed, were you able to run VMs?

Tatyanka (tatyana-leontovich) wrote :

No, I am not able to create a VM; it is stuck in the Building state.
It seems the problem is on the compute node, since it still replies with errors:
2014-05-23 08:53:42.056 21836 ERROR nova.openstack.common.periodic_task [-] Error during ComputeManager.update_available_resource: Timed out waiting for a reply to message ID 2286043347a14011bb1212d48ccbc5aa
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task Traceback (most recent call last):
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/nova/openstack/common/periodic_task.py", line 182, in run_periodic_tasks
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task task(self, context)
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 5446, in update_available_resource
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task compute_nodes_in_db = self._get_compute_nodes_in_db(context)
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 5457, in _get_compute_nodes_in_db
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task context, self.host)
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/nova/conductor/api.py", line 186, in service_get_by_compute_host
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task result = self._manager.service_get_all_by(context, 'compute', host)
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/nova/conductor/rpcapi.py", line 280, in service_get_all_by
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task topic=topic, host=host, binary=binary)
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/oslo/messaging/rpc/client.py", line 150, in call
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task wait_for_reply=True, timeout=timeout)
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/oslo/messaging/transport.py", line 90, in _send
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task timeout=timeout)
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 409, in send
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task return self._send(target, ctxt, message, wait_for_reply, timeout)
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 400, in _send
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task result = self._waiter.wait(msg_id, timeout)
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/oslo/messaging/_driver...


Bogdan Dobrelya (bogdando) wrote :

Well, the rabbit connections are in place (jms is port 5673); just recheck with lsof -P -p instead.

Bogdan Dobrelya (bogdando) wrote :

Tanya, please elaborate: was the result from comment #4 obtained with https://review.openstack.org/95007 applied?

Tatyanka (tatyana-leontovich) wrote :

As far as I can see, https://review.openstack.org/95007 was merged yesterday evening, so this environment does not have the patch. I will try to reproduce it on the 5.0-19 ISO and come back with updates :)

Bogdan Dobrelya (bogdando) wrote :

OK, thank you, waiting for an update then (marking as Incomplete).

Changed in fuel:
status: Confirmed → Incomplete
Tatyanka (tatyana-leontovich) wrote :

{"build_id": "2014-05-23_03-53-39", "mirantis": "yes", "build_number": "19", "ostf_sha": "5c479f04c35127576d35526650ec83b104f9a33d", "nailgun_sha": "bd09f89ef56176f64ad5decd4128933c96cb20f4", "production": "docker", "api": "1.0", "fuelmain_sha": "db2d153e62cb2b3034d33359d7e3db9d4742c811", "astute_sha": "9a0d86918724c1153b5f70bdae008dea8572fd3e", "release": "5.0", "fuellib_sha": "2ed4fbe1e04b85e83f1010ca23be7f5da34bd492"}
The same situation.
Instances are stuck in the Building and Deleting states. On the compute nodes there are errors about message timeouts.
Also, -P helps to see the current connections to rabbit, thanks :)
[root@node-4 log]# lsof -P -p 22985 | grep IPv4
nova-comp 22985 nova 20u IPv4 74323 0t0 TCP node-4:50256->node-2:5673 (ESTABLISHED)
nova-comp 22985 nova 21u IPv4 74329 0t0 TCP node-4:50258->node-2:5673 (ESTABLISHED)
nova-comp 22985 nova 22u IPv4 74596 0t0 TCP node-4:50261->node-2:5673 (ESTABLISHED)
nova-comp 22985 nova 23u IPv4 74600 0t0 TCP node-4:50263->node-2:5673 (ESTABLISHED)

So the issue is reproducible on ISO 19.
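
The difference between the two lsof invocations in this report can be illustrated with a small Python sketch. It assumes the lsof column layout shown in the pasted output; `amqp_peers` is a hypothetical helper name. Without -P the port is printed as the service name jms, so a search for 5673 finds nothing even though the connections exist.

```python
# Two lines copied from the lsof output in this report: the first was
# taken without -P (port shown as "jms"), the second with -P.
WITHOUT_P = "nova-comp 21836 nova 20u IPv4 89794 0t0 TCP node-4:43550->node-2:jms (ESTABLISHED)"
WITH_P = "nova-comp 22985 nova 20u IPv4 74323 0t0 TCP node-4:50256->node-2:5673 (ESTABLISHED)"


def amqp_peers(lsof_output, port=5673):
    """Return remote hosts with ESTABLISHED TCP connections to `port`."""
    peers = []
    for line in lsof_output.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[-1] != "(ESTABLISHED)":
            continue
        if "->" not in fields[-2]:
            continue
        # The NAME column looks like "local:port->remote:port".
        remote = fields[-2].split("->", 1)[1]
        host, _, remote_port = remote.rpartition(":")
        if remote_port == str(port):
            peers.append(host)
    return peers


print(amqp_peers(WITHOUT_P))  # numeric port is hidden behind "jms"
print(amqp_peers(WITH_P))
```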

Changed in fuel:
status: Incomplete → Confirmed
Bogdan Dobrelya (bogdando) wrote :

Reproduced. The fix is to increase kombu_reconnect_delay from 1 to 5 seconds. It looks like 1 second is not enough for environments with poor performance. After raising the delay to 5 seconds, all the issues are gone and instances spawn successfully.
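
To illustrate why a longer delay can help on a slow environment, here is a generic retry loop in Python. It is not the oslo.messaging or kombu implementation; `call_with_reconnect`, `operation`, and `reconnect` are hypothetical names, and the only value taken from this report is the 5-second delay.

```python
import time


def call_with_reconnect(operation, reconnect, delay=5.0, attempts=3,
                        sleep=time.sleep):
    """Run `operation`; on ConnectionError wait `delay` seconds,
    call `reconnect`, and try again, up to `attempts` tries in total."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts:
                raise
            # Give the broker cluster time to finish failing over
            # before the next connection attempt.
            sleep(delay)
            reconnect()
```

With a short delay, a client on a heavily loaded environment may reconnect before the surviving brokers are ready, which would be consistent with the timeouts seen above; this is an interpretation of the report, not a confirmed root cause.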

Changed in fuel:
status: Confirmed → Triaged
Bogdan Dobrelya (bogdando) wrote :
Vladimir Kuklin (vkuklin) wrote :

We need to port kombu_reconnect_delay to all subprojects that do not use oslo.messaging and set it to 5.0 explicitly.
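
Setting the option explicitly could look like the Python sketch below. It is an illustration only: it assumes the option lives in the [DEFAULT] section, `set_reconnect_delay` is a hypothetical helper, and the sample config text is made up. The actual fix in this bug was delivered through fuel-library, not through a script like this.

```python
import configparser
import io


def set_reconnect_delay(conf_text, delay=5.0, section="DEFAULT"):
    """Return `conf_text` with kombu_reconnect_delay set to `delay`.

    Assumption: the option belongs in `section` ([DEFAULT] here).
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    parser.set(section, "kombu_reconnect_delay", str(delay))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Hypothetical minimal config, not taken from a real node.
print(set_reconnect_delay("[DEFAULT]\ndebug = true\n"))
```

Note that configparser drops comments and rejects repeated options in its default strict mode, so this is not a safe way to rewrite a production config file.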

Vladimir Kuklin (vkuklin) wrote :
Vladimir Kuklin (vkuklin) wrote :

Heat and ceilometer to come

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/95205

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Vladimir Kuklin (vkuklin)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/5.0)

Fix proposed to branch: stable/5.0
Review: https://review.openstack.org/95209

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/95210

Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Dmitry Borodaenko (dborodaenko)
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/95205
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=89d9a27331578ee0c2f4f6cd63ec695a5516ac0b
Submitter: Jenkins
Branch: master

commit 89d9a27331578ee0c2f4f6cd63ec695a5516ac0b
Author: Vladimir Kuklin <email address hidden>
Date: Fri May 23 20:16:46 2014 +0400

    Set kombu_reconnect_delay to 5.0

    Set delay to 5.0 to recover channel errors on highly loaded environments.

    Change-Id: Ibec002828b785282221fa6d2827163a2deb0e627
    Partial-Bug: 1322259

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/5.0)

Reviewed: https://review.openstack.org/95209
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=b9985e42159187853edec82c406fdbc38dc5a6d0
Submitter: Jenkins
Branch: stable/5.0

commit b9985e42159187853edec82c406fdbc38dc5a6d0
Author: Vladimir Kuklin <email address hidden>
Date: Fri May 23 20:16:46 2014 +0400

    Set kombu_reconnect_delay to 5.0

    Set delay to 5.0 to recover channel errors on highly loaded environments.

    Change-Id: Ibec002828b785282221fa6d2827163a2deb0e627
    Partial-Bug: 1322259

Mike Scherbakov (mihgen) wrote :

Raised this issue to Critical priority. Reminder: we expect only Critical, release-blocking issues to be fixed with patches into stable/5.0 (after Hard Code Freeze).

Changed in fuel:
assignee: Dmitry Borodaenko (dborodaenko) → Vladimir Kuklin (vkuklin)
Bogdan Dobrelya (bogdando) wrote :

Action items: validate kombu_reconnect_delay for Neutron, Heat, Ceilometer, once ported for MOS packages

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/5.0)

Fix proposed to branch: stable/5.0
Review: https://review.openstack.org/95477

Vladimir Kuklin (vkuklin) wrote :
Vladimir Kuklin (vkuklin) wrote :
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/95210
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=1fd0732e53b4805c18dbca693ab31914a6a73c47
Submitter: Jenkins
Branch: master

commit 1fd0732e53b4805c18dbca693ab31914a6a73c47
Author: Vladimir Kuklin <email address hidden>
Date: Fri May 23 20:23:06 2014 +0400

    Set kombu reconnect delay to 5 seconds

    Change-Id: I0ad9bcfd1f35e5d557a147a2cb0d3b2f2d79c846
    Partial-Bug: #1322259

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/5.0)

Reviewed: https://review.openstack.org/95477
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=f2f2f4d0b0dff2078313507a7508de5ebdee984f
Submitter: Jenkins
Branch: stable/5.0

commit f2f2f4d0b0dff2078313507a7508de5ebdee984f
Author: Vladimir Kuklin <email address hidden>
Date: Fri May 23 20:23:06 2014 +0400

    Set kombu reconnect delay to 5 seconds

    Change-Id: I0ad9bcfd1f35e5d557a147a2cb0d3b2f2d79c846
    Partial-Bug: #1322259

Egor Kotko (ykotko) wrote :

Have the same on:
{"build_id": "2014-05-25_23-01-31", "mirantis": "yes", "build_number": "22", "ostf_sha": "1f020d69acbf50be00c12c29564f65440971bafe", "nailgun_sha": "bd09f89ef56176f64ad5decd4128933c96cb20f4", "production": "docker", "api": "1.0", "fuelmain_sha": "db2d153e62cb2b3034d33359d7e3db9d4742c811", "astute_sha": "a7eac46348dc77fc2723c6fcc3dbc66cc1a83152", "release": "5.0", "fuellib_sha": "b9985e42159187853edec82c406fdbc38dc5a6d0"}

Steps to reproduce:
1. Env configuration: CentOS, 3 controllers, 1 compute, Neutron VLAN
2. Determine the primary controller
3. Destroy the virtual machine hosting the primary controller

Egor Kotko (ykotko) wrote :
Bogdan Dobrelya (bogdando) wrote :

Please clarify which OpenStack component became broken. For now, we have the kombu_reconnect_delay patch only for nova. What exactly was the issue in the case given in #27? Were nova-compute nodes marked as down? Or were there other issues?

Vladimir Kuklin (vkuklin) wrote :

The build did not contain all the required fixes. Also, the description does not say which components did not work. Closing until there is a clearer description.

Dmitry Borodaenko (angdraug) wrote :

We need to add kombu_reconnect_delay parameter to all OpenStack components connecting to RabbitMQ in 4.1/Havana:
oslo.messaging
cinder (uses oslo.messaging in Havana)
nova
neutron
glance
heat
ceilometer

cinder, nova, and neutron are most critical.
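
A quick audit of the components listed above could be sketched in Python as follows. This assumes each component's config text is already loaded into a dict and that the option sits in [DEFAULT]; `missing_reconnect_delay` and the sample configs are hypothetical.

```python
import configparser


def missing_reconnect_delay(configs, expected=5.0):
    """Return component names whose config text does not set
    kombu_reconnect_delay to `expected` in [DEFAULT]."""
    missing = []
    for name, text in configs.items():
        parser = configparser.ConfigParser(interpolation=None, strict=False)
        parser.read_string(text)
        value = parser.defaults().get("kombu_reconnect_delay")
        try:
            ok = value is not None and float(value) == expected
        except ValueError:
            ok = False  # non-numeric value counts as not set correctly
        if not ok:
            missing.append(name)
    return sorted(missing)


# Hypothetical config snippets, not taken from real nodes.
SAMPLE_CONFIGS = {
    "nova": "[DEFAULT]\nkombu_reconnect_delay = 5.0\n",
    "glance": "[DEFAULT]\nkombu_reconnect_delay = 1.0\n",
    "heat": "[DEFAULT]\ndebug = true\n",
}
print(missing_reconnect_delay(SAMPLE_CONFIGS))
```

Components that ignore the option entirely (because the parameter has not been ported yet) would still pass this check once the value is written, so it only verifies configuration, not behaviour.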

Andriy Kurilin (andreykurilin) wrote :
Vladimir Kuklin (vkuklin) wrote :

We also need a glance fix.

Dmitry Borodaenko (angdraug) wrote :

For glance openstack-ci/fuel-4.1.1/2013.2.3: https://gerrit.mirantis.com/16157

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/4.1)

Fix proposed to branch: stable/4.1
Review: https://review.openstack.org/97399

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/4.1
Review: https://review.openstack.org/97402

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/4.1)

Reviewed: https://review.openstack.org/97402
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=0cc0eb42906e45fb056c5d16f394824c76af6b0b
Submitter: Jenkins
Branch: stable/4.1

commit 0cc0eb42906e45fb056c5d16f394824c76af6b0b
Author: Vladimir Kuklin <email address hidden>
Date: Fri May 23 20:16:46 2014 +0400

    Set kombu_reconnect_delay to 5.0

    Set delay to 5.0 to recover channel errors on highly loaded environments.

    depends on:

    https://gerrit.mirantis.com/#/c/16134/
    https://gerrit.mirantis.com/#/c/16135/
    https://gerrit.mirantis.com/#/c/16157/

    but can be safely merged (option will be ignored)

    Change-Id: Ibec002828b785282221fa6d2827163a2deb0e627
    Partial-Bug: 1322259

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/97399
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=f8fdf936dcf27bd127dc4b833a65957e1063da8b
Submitter: Jenkins
Branch: stable/4.1

commit f8fdf936dcf27bd127dc4b833a65957e1063da8b
Author: Vladimir Kuklin <email address hidden>
Date: Fri May 23 20:23:06 2014 +0400

    Set kombu reconnect delay to 5 seconds

    depends on https://gerrit.mirantis.com/#/c/16137/
    but can be safely merged (the option will be ignored)

    Change-Id: I0ad9bcfd1f35e5d557a147a2cb0d3b2f2d79c846
    Partial-Bug: #1322259
    (cherry picked from commit 1fd0732e53b4805c18dbca693ab31914a6a73c47)

tags: added: to-be-covered-by-tests
Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/5.1.x
Changed in fuel:
milestone: 5.0 → 5.1
Tom Fifield (fifieldt)
Changed in fuel:
status: Fix Committed → Fix Released