OpenStack cluster does not work after failover of the primary controller

Bug #1322259 reported by Tatyanka on 2014-05-22
This bug affects 3 people
Affects              Importance   Assigned to
Fuel for OpenStack   Critical     Vladimir Kuklin
4.1.x                Critical     Registry Administrators
5.0.x                Critical     Vladimir Kuklin

Bug Description

{"build_id": "2014-05-21_01-10-31", "mirantis": "yes", "build_number": "214", "ostf_sha": "353f918197ec53a00127fd28b9151f248a2a2d30", "nailgun_sha": "0b6e8eabaccad2aa29519561ce7cde9df9292964", "production": "docker", "api": "1.0", "fuelmain_sha": "910f262f85e94bef08e0e9b9d6230ad890bf139e", "astute_sha": "9a0d86918724c1153b5f70bdae008dea8572fd3e", "release": "5.0", "fuellib_sha": "3d92142a5643af82596f0450e39282550a45e5db"}

Steps to Reproduce:
1. Deploy an environment: 3 controllers + 2 computes, Nova network with VLAN
2. When the deployment finishes successfully, run OSTF to be sure that everything works
3. Run Rally benchmark tests (create/delete instance) and OSTF
4. While the tests are running, force off the primary controller (in my deployment it is node-1)
5. Wait until the VIPs and other HA services have recovered
6. Run OSTF

Expected result:
OpenStack cluster is operational. OSTF passes. The user can successfully create/delete an instance in Horizon.

Actual result:
OSTF fails; instances are not created/deleted.

Queues status:
http://paste.openstack.org/show/81161/

rabbit cluster status:
[root@node-2 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-2' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-3']}]},
 {running_nodes,['rabbit@node-3','rabbit@node-2']},
 {partitions,[]}]
...done.
[root@node-2 ~]#

crm:

[root@node-2 ~]# crm_mon -1
Last updated: Thu May 22 15:47:26 2014
Last change: Thu May 22 12:40:37 2014 via cibadmin on node-3.test.domain.local
Stack: classic openais (with plugin)
Current DC: node-2.test.domain.local - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
3 Nodes configured, 3 expected votes
9 Resources configured

Online: [ node-2.test.domain.local node-3.test.domain.local ]
OFFLINE: [ node-1.test.domain.local ]

 vip__management_old (ocf::mirantis:ns_IPaddr2): Started node-2.test.domain.local
 vip__public_old (ocf::mirantis:ns_IPaddr2): Started node-3.test.domain.local
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-2.test.domain.local node-3.test.domain.local ]
     Stopped: [ node-1.test.domain.local ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-2.test.domain.local node-3.test.domain.local ]
     Stopped: [ node-1.test.domain.local ]
 openstack-heat-engine (ocf::mirantis:openstack-heat-engine): Started node-2.test.domain.local
[root@node-2 ~]#

On the compute node I do not see any RabbitMQ connections at all:
[root@node-4 ~]# lsof -p 21836 | grep IP
nova-comp 21836 nova 20u IPv4 89794 0t0 TCP node-4:43550->node-2:jms (ESTABLISHED)
nova-comp 21836 nova 21u IPv4 94273 0t0 TCP node-4:43597->node-2:jms (ESTABLISHED)
nova-comp 21836 nova 22u IPv4 94275 0t0 TCP node-4:43598->node-2:jms (ESTABLISHED)
nova-comp 21836 nova 23u IPv4 94287 0t0 TCP node-4:43599->node-2:jms (ESTABLISHED)
[root@node-4 ~]#

[root@node-4 ~]# lsof -p 21836 | grep 567
2014-05-22 15:48:54.561 21836 DEBUG nova.compute.manager [-] Didn't find any instances for network info cache update. _heal_instance_info_cache /usr/lib/python2.6/site-packages/nova/compute/manager.py:4895

Also, on the compute nodes there are a lot of errors:
2014-05-22 15:48:54.561 21836 DEBUG nova.openstack.common.loopingcall [-] Dynamic looping call sleeping for 60.00 seconds _inner /usr/lib/python2.6/site-packages/nova/openstack/common/loopingcall.py:132
2014-05-22 15:49:54.561 21836 ERROR nova.servicegroup.drivers.db [-] model server went away
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db Traceback (most recent call last):
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/nova/servicegroup/drivers/db.py", line 95, in _report_state
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db service.service_ref, state_catalog)
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/nova/conductor/api.py", line 218, in service_update
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db return self._manager.service_update(context, service, values)
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/nova/conductor/rpcapi.py", line 330, in service_update
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db service=service_p, values=values)
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/rpc/client.py", line 150, in call
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db wait_for_reply=True, timeout=timeout)
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/transport.py", line 90, in _send
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db timeout=timeout)
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 409, in send
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db return self._send(target, ctxt, message, wait_for_reply, timeout)
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 400, in _send
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db result = self._waiter.wait(msg_id, timeout)
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 267, in wait
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db reply, ending = self._poll_connection(msg_id, timeout)
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 217, in _poll_connection
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db % msg_id)
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db MessagingTimeout: Timed out waiting for a reply to message ID 23c5b1c2b4e2425b8f1bf555722477bf
2014-05-22 15:49:54.561 21836 TRACE nova.servicegroup.drivers.db
[root@node-4 ~]#

Dmitry Borodaenko (angdraug) wrote :

This commit seems to improve recovery significantly:
https://review.openstack.org/95007

Not sure if it will make the described test steps pass (OSTF might still fail if a controller is lost in the middle of a test run), but it does reduce post-failover recovery time considerably.

Changed in fuel:
status: New → Confirmed
Mike Scherbakov (mihgen) wrote :

Do we want to keep this bug as Critical for 5.0? Tatyana, after some time had passed, were you able to run VMs?

Tatyanka (tatyana-leontovich) wrote :

No, I am not able to create a VM; it gets stuck in the Building state.
It seems the problem is on the compute node, as it still replies with errors:
2014-05-23 08:53:42.056 21836 ERROR nova.openstack.common.periodic_task [-] Error during ComputeManager.update_available_resource: Timed out waiting for a reply to message ID 2286043347a14011bb1212d48ccbc5aa
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task Traceback (most recent call last):
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/nova/openstack/common/periodic_task.py", line 182, in run_periodic_tasks
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task task(self, context)
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 5446, in update_available_resource
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task compute_nodes_in_db = self._get_compute_nodes_in_db(context)
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 5457, in _get_compute_nodes_in_db
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task context, self.host)
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/nova/conductor/api.py", line 186, in service_get_by_compute_host
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task result = self._manager.service_get_all_by(context, 'compute', host)
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/nova/conductor/rpcapi.py", line 280, in service_get_all_by
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task topic=topic, host=host, binary=binary)
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/oslo/messaging/rpc/client.py", line 150, in call
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task wait_for_reply=True, timeout=timeout)
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/oslo/messaging/transport.py", line 90, in _send
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task timeout=timeout)
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 409, in send
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task return self._send(target, ctxt, message, wait_for_reply, timeout)
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 400, in _send
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task result = self._waiter.wait(msg_id, timeout)
2014-05-23 08:53:42.056 21836 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/oslo/messaging/_driver...


Bogdan Dobrelya (bogdando) wrote :

Well, the RabbitMQ connections are in place ("jms" is port 5673); just recheck it with lsof -P -p instead.
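For reference, lsof maps port numbers to names via /etc/services, which is why 5673 shows up as "jms"; -P keeps ports numeric. A minimal sketch of the check (the PID placeholder is an assumption, take it from the actual nova-compute process):

# Hedged example: list nova-compute's AMQP connections with numeric ports
# (find the PID e.g. with `pgrep -f nova-compute`)
lsof -nP -p <nova-compute-pid> | grep 5673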

Bogdan Dobrelya (bogdando) wrote :

Tanya, please elaborate: was the result from comment #4 obtained with https://review.openstack.org/95007 applied?

Tatyanka (tatyana-leontovich) wrote :

As far as I can see, https://review.openstack.org/95007 was merged yesterday evening, so this environment is without the patch. I will try to reproduce it on the 5.0-19 ISO and come back with updates :)

Bogdan Dobrelya (bogdando) wrote :

OK, thank you, waiting for an update then (marking as Incomplete).

Changed in fuel:
status: Confirmed → Incomplete
Tatyanka (tatyana-leontovich) wrote :

{"build_id": "2014-05-23_03-53-39", "mirantis": "yes", "build_number": "19", "ostf_sha": "5c479f04c35127576d35526650ec83b104f9a33d", "nailgun_sha": "bd09f89ef56176f64ad5decd4128933c96cb20f4", "production": "docker", "api": "1.0", "fuelmain_sha": "db2d153e62cb2b3034d33359d7e3db9d4742c811", "astute_sha": "9a0d86918724c1153b5f70bdae008dea8572fd3e", "release": "5.0", "fuellib_sha": "2ed4fbe1e04b85e83f1010ca23be7f5da34bd492"}
The same situation: instances are stuck in the Building and Deleting states. On the compute nodes there are errors about message timeouts.
Also, -P helps to see the current connections with RabbitMQ, thanks :)
[root@node-4 log]# lsof -P -p 22985 | grep IPv4
nova-comp 22985 nova 20u IPv4 74323 0t0 TCP node-4:50256->node-2:5673 (ESTABLISHED)
nova-comp 22985 nova 21u IPv4 74329 0t0 TCP node-4:50258->node-2:5673 (ESTABLISHED)
nova-comp 22985 nova 22u IPv4 74596 0t0 TCP node-4:50261->node-2:5673 (ESTABLISHED)
nova-comp 22985 nova 23u IPv4 74600 0t0 TCP node-4:50263->node-2:5673 (ESTABLISHED)

So the issue is reproducible on the build 19 ISO.

Changed in fuel:
status: Incomplete → Confirmed
Bogdan Dobrelya (bogdando) wrote :

Reproduced. The fix is to increase kombu_reconnect_delay from 1 to 5 seconds. It looks like 1 second is not enough for environments with poor performance. After updating the delay to 5 seconds, all issues are gone and instances are able to spawn.
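For illustration only, a minimal sketch of the workaround on a single node, assuming the openstack-utils package (which provides openstack-config) is installed; the actual fix was delivered via fuel-library (see the reviews below):

# Hedged sketch: bump the reconnect delay in nova.conf and restart the service
# (on this release kombu_reconnect_delay lives in [DEFAULT]; file paths and service
#  names are the CentOS defaults and may differ per deployment)
openstack-config --set /etc/nova/nova.conf DEFAULT kombu_reconnect_delay 5.0
service openstack-nova-compute restart
# Resulting nova.conf stanza:
#   [DEFAULT]
#   kombu_reconnect_delay = 5.0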

Changed in fuel:
status: Confirmed → Triaged
Vladimir Kuklin (vkuklin) wrote :

We need to port kombu_reconnect_delay to all subprojects which do not use oslo.messaging and set it to 5.0 explicitly.

Vladimir Kuklin (vkuklin) wrote :

Heat and ceilometer to come

Fix proposed to branch: master
Review: https://review.openstack.org/95205

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Vladimir Kuklin (vkuklin)
status: Triaged → In Progress
Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Dmitry Borodaenko (dborodaenko)

Reviewed: https://review.openstack.org/95205
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=89d9a27331578ee0c2f4f6cd63ec695a5516ac0b
Submitter: Jenkins
Branch: master

commit 89d9a27331578ee0c2f4f6cd63ec695a5516ac0b
Author: Vladimir Kuklin <email address hidden>
Date: Fri May 23 20:16:46 2014 +0400

    Set kombu_reconnect_delay to 5.0

    Set delay to 5.0 to recover channel errors on highly loaded environments.

    Change-Id: Ibec002828b785282221fa6d2827163a2deb0e627
    Partial-Bug: 1322259

Reviewed: https://review.openstack.org/95209
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=b9985e42159187853edec82c406fdbc38dc5a6d0
Submitter: Jenkins
Branch: stable/5.0

commit b9985e42159187853edec82c406fdbc38dc5a6d0
Author: Vladimir Kuklin <email address hidden>
Date: Fri May 23 20:16:46 2014 +0400

    Set kombu_reconnect_delay to 5.0

    Set delay to 5.0 to recover channel errors on highly loaded environments.

    Change-Id: Ibec002828b785282221fa6d2827163a2deb0e627
    Partial-Bug: 1322259

Mike Scherbakov (mihgen) wrote :

Raised this issue to Critical priority. A reminder that we expect only Critical, release-blocking issues to be fixed with a patch to stable/5.0 (after Hard Code Freeze).

Changed in fuel:
assignee: Dmitry Borodaenko (dborodaenko) → Vladimir Kuklin (vkuklin)
Bogdan Dobrelya (bogdando) wrote :

Action items: validate kombu_reconnect_delay for Neutron, Heat and Ceilometer once ported to the MOS packages.
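One simple way to validate this once the packages are ported might be the sketch below (config paths are the usual defaults and an assumption about these nodes):

# Hedged check: confirm the option is present (and set to 5.0) in each service's config
for f in /etc/neutron/neutron.conf /etc/heat/heat.conf /etc/ceilometer/ceilometer.conf; do
    grep -H kombu_reconnect_delay "$f"
done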

Vladimir Kuklin (vkuklin) wrote :

Reviewed: https://review.openstack.org/95210
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=1fd0732e53b4805c18dbca693ab31914a6a73c47
Submitter: Jenkins
Branch: master

commit 1fd0732e53b4805c18dbca693ab31914a6a73c47
Author: Vladimir Kuklin <email address hidden>
Date: Fri May 23 20:23:06 2014 +0400

    Set kombu reconnect delay to 5 seconds

    Change-Id: I0ad9bcfd1f35e5d557a147a2cb0d3b2f2d79c846
    Partial-Bug: #1322259

Reviewed: https://review.openstack.org/95477
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=f2f2f4d0b0dff2078313507a7508de5ebdee984f
Submitter: Jenkins
Branch: stable/5.0

commit f2f2f4d0b0dff2078313507a7508de5ebdee984f
Author: Vladimir Kuklin <email address hidden>
Date: Fri May 23 20:23:06 2014 +0400

    Set kombu reconnect delay to 5 seconds

    Change-Id: I0ad9bcfd1f35e5d557a147a2cb0d3b2f2d79c846
    Partial-Bug: #1322259

Egor Kotko (ykotko) wrote :

I have the same issue on:
{"build_id": "2014-05-25_23-01-31", "mirantis": "yes", "build_number": "22", "ostf_sha": "1f020d69acbf50be00c12c29564f65440971bafe", "nailgun_sha": "bd09f89ef56176f64ad5decd4128933c96cb20f4", "production": "docker", "api": "1.0", "fuelmain_sha": "db2d153e62cb2b3034d33359d7e3db9d4742c811", "astute_sha": "a7eac46348dc77fc2723c6fcc3dbc66cc1a83152", "release": "5.0", "fuellib_sha": "b9985e42159187853edec82c406fdbc38dc5a6d0"}

Steps to reproduce:
1. Env configuration: CentOS, 3 controllers, 1 compute, Neutron VLAN
2. Determine the primary controller
3. Destroy the virtual machine with the primary controller

Bogdan Dobrelya (bogdando) wrote :

Please clarify which OpenStack component became broken. For now, we have patched kombu_reconnect_delay only for Nova. What exactly was the issue in the case given in #27? Were the nova-compute nodes marked as down? Or were there any other issues?
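(For reference, whether nova-compute is marked as down can be checked from a controller with the sketch below; the /root/openrc credentials file is an assumption about the Fuel-deployed controllers.)

# Hedged check: the State column shows whether nova-compute is reported up or down
source /root/openrc
nova service-list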

Vladimir Kuklin (vkuklin) wrote :

The build did not contain all the required fixes. Also, the description does not state which components did not work. Closing until there is a clearer description.

Dmitry Borodaenko (angdraug) wrote :

We need to add the kombu_reconnect_delay parameter to all OpenStack components connecting to RabbitMQ in 4.1/Havana:
oslo.messaging
cinder (uses oslo.messaging in Havana)
nova
neutron
glance
heat
ceilometer

cinder, nova, and neutron are most critical.
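As a rough sketch of what that amounts to on a 4.1/Havana node (assuming openstack-config is available and the services' RPC code has been patched to recognize the option, per the Gerrit backports referenced later in this thread; otherwise the option is simply ignored):

# Hedged sketch: apply the same override to the most critical services' configs
for f in /etc/cinder/cinder.conf /etc/nova/nova.conf /etc/neutron/neutron.conf; do
    openstack-config --set "$f" DEFAULT kombu_reconnect_delay 5.0
done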

Vladimir Kuklin (vkuklin) wrote :

We also need the Glance fix.

Dmitry Borodaenko (angdraug) wrote :

For glance openstack-ci/fuel-4.1.1/2013.2.3: https://gerrit.mirantis.com/16157

Fix proposed to branch: stable/4.1
Review: https://review.openstack.org/97402

Reviewed: https://review.openstack.org/97402
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=0cc0eb42906e45fb056c5d16f394824c76af6b0b
Submitter: Jenkins
Branch: stable/4.1

commit 0cc0eb42906e45fb056c5d16f394824c76af6b0b
Author: Vladimir Kuklin <email address hidden>
Date: Fri May 23 20:16:46 2014 +0400

    Set kombu_reconnect_delay to 5.0

    Set delay to 5.0 to recover channel errors on highly loaded environments.

    depends on:

    https://gerrit.mirantis.com/#/c/16134/
    https://gerrit.mirantis.com/#/c/16135/
    https://gerrit.mirantis.com/#/c/16157/

    but can be safely merged (option will be ignored)

    Change-Id: Ibec002828b785282221fa6d2827163a2deb0e627
    Partial-Bug: 1322259

Reviewed: https://review.openstack.org/97399
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=f8fdf936dcf27bd127dc4b833a65957e1063da8b
Submitter: Jenkins
Branch: stable/4.1

commit f8fdf936dcf27bd127dc4b833a65957e1063da8b
Author: Vladimir Kuklin <email address hidden>
Date: Fri May 23 20:23:06 2014 +0400

    Set kombu reconnect delay to 5 seconds

    depends on https://gerrit.mirantis.com/#/c/16137/
    but can be safely merged (the option will be ignored)

    Change-Id: I0ad9bcfd1f35e5d557a147a2cb0d3b2f2d79c846
    Partial-Bug: #1322259
    (cherry picked from commit 1fd0732e53b4805c18dbca693ab31914a6a73c47)

tags: added: to-be-covered-by-tests
Dmitry Pyzhov (dpyzhov) on 2014-08-13
no longer affects: fuel/5.1.x
Changed in fuel:
milestone: 5.0 → 5.1
Tom Fifield (fifieldt) on 2015-06-11
Changed in fuel:
status: Fix Committed → Fix Released