nova-compute status down after reboot master p_rabbitmq-server

Bug #1454174 reported by Veronica Krayneva
This bug affects 5 people
Affects              Status        Importance  Assigned to     Milestone
Mirantis OpenStack   Fix Released  High        MOS Oslo
  5.1.x              Invalid       High        Alexey Khivin
  6.0.x              Fix Released  High        Alexey Khivin
  6.1.x              Fix Released  High        MOS Oslo

Bug Description

Repeat up to 10 times to reproduce:

Steps:

1. Create a cluster in HA mode
2. Add 3 nodes with controller and mongo roles
3. Add 2 nodes with compute and cinder roles
4. Deploy the cluster

5. Reboot node-3 (the controller running the master of p_rabbitmq-server):
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-3.test.domain.local ]
     Slaves: [ node-1.test.domain.local node-2.test.domain.local ]

6. Check nova service-list

Actual result:

[root@node-3 ~]# nova service-list
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| 37 | nova-compute     | node-5.test.domain.local | nova     | enabled | down  | 2015-05-12T09:00:46.000000 | -               |
| 40 | nova-compute     | node-4.test.domain.local | nova     | enabled | up    | 2015-05-12T09:33:55.000000 | -               |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+

Expected result:
All nova-compute services are up.
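
For reference, the check in step 6 can also be scripted with python-novaclient; a minimal sketch, assuming a Juno-era novaclient and placeholder credentials/endpoint:

# Hedged sketch: poll the nova-compute service state the same way
# "nova service-list" does. Credentials and auth URL are placeholders.
from novaclient import client

nova = client.Client("2", "admin", "admin_password", "admin",
                     "http://192.168.0.2:5000/v2.0")

for svc in nova.services.list(binary="nova-compute"):
    # svc.status is "enabled"/"disabled"; svc.state is "up"/"down"
    print(svc.host, svc.status, svc.state, svc.updated_at)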

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "6.1"
  openstack_version: "2014.2.2-6.1"
  api: "1.0"
  build_number: "388"
  build_id: "2015-05-07_03-59-48"
  nailgun_sha: "7437e500c3e2673b124af4c7634ac04121b1c949"
  python-fuelclient_sha: "af6c9c3799b9ec107bcdc6dbf035cafc034526ce"
  astute_sha: "83b23783edbfacd25aded7a79e0aca17fc883e79"
  fuel-library_sha: "a7a794e68086014f8163f79a3dad1c9a074b1f3a"
  fuel-ostf_sha: "1a09c09ee7f6b7622dfc7e37dce70c4537c0b532"
  fuelmain_sha: "68e6f08e23845516e1995df59e038ebecb26e6b8"

affects: nova → fuel
description: updated
description: updated
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Veronica, can we get a diagnostic snapshot for the issue? It seems there are some problems with nova-compute only on node-4, but it is impossible to investigate the issue without logs.

Changed in fuel:
milestone: none → 6.1
assignee: nobody → Veronica Krayneva (vkrayneva)
importance: Undecided → High
status: New → Incomplete
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

It's hard to tell for sure without logs, but it might be a duplicate of https://bugs.launchpad.net/mos/+bug/1452632

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

... or it might be a rabbitmq split brain and nova-compute/nova-conductor not seeing each other

Revision history for this message
Veronica Krayneva (vkrayneva) wrote :

Unfortunately, yesterday's cluster deployment failed; I will try to deploy a cluster for this issue today. However, I have an equivalent diagnostic snapshot from CIC maintenance-mode testing: we put controllers into maintenance mode (effectively the same as a reboot), 'nova-compute' on one compute node was marked as down, and the OSTF test 'Check that required services are running' failed.

Diagnostic snapshot is attached.

Revision history for this message
Veronica Krayneva (vkrayneva) wrote :
Changed in fuel:
status: Incomplete → New
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Veronica, is this snapshot from the same RabbitMQ bug or not?

Changed in fuel:
status: New → Incomplete
Revision history for this message
Veronica Krayneva (vkrayneva) wrote :

As far as I can tell, yes.
If a controller that does not host the RabbitMQ master is rebooted, this does not happen. However, when the controller hosting the RabbitMQ master is rebooted, the nova-compute service goes down.

Changed in fuel:
status: Incomplete → New
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

According to the environment status, there are no issues with RabbitMQ cluster failover. However, nova-compute services are reported down because rabbit_heartbeat was not configured for the Nova services on any node. This is strange, as IIRC it should be enabled for all OpenStack services except Neutron.

summary: - Nova-compute won't start after reboot master p_rabbitmq-server
+ Nova-compute won't start after reboot master p_rabbitmq-server as there
+ is no rabbit_heartbeat configured
Changed in fuel:
status: New → Triaged
Changed in fuel:
assignee: Veronica Krayneva (vkrayneva) → Bogdan Dobrelya (bogdando)
summary: Nova-compute won't start after reboot master p_rabbitmq-server as there
- is no rabbit_heartbeat configured
+ is in-optimal default value used for the rabbit_heartbeat
Revision history for this message
Bogdan Dobrelya (bogdando) wrote : Re: Nova-compute won't start after reboot master p_rabbitmq-server as there is in-optimal default value used for the rabbit_heartbeat

Updating the rabbit_heartbeat for Nova did not completely resolve the issue. It looks like there are no more errors like:
2015-05-15T14:05:20.322089+00:00 notice: {amqp_error,not_found,
2015-05-15T14:05:20.322089+00:00 notice: "no exchange 'reply_546500363b244e1a8307a61c7dd1a9da' in vhost '/'",
2015-05-15T14:05:20.322089+00:00 notice: 'exchange.declare'}

in the RabbitMQ logs after the node failover, but nova service-list may still show some compute services as down. @Roman, MOS Nova team, please take a closer look at this bug. Perhaps it is indeed a duplicate of https://bugs.launchpad.net/mos/+bug/1452632

Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → MOS Nova (mos-nova)
summary: - Nova-compute won't start after reboot master p_rabbitmq-server as there
- is in-optimal default value used for the rabbit_heartbeat
+ Nova-compute status down after reboot master p_rabbitmq-server
Changed in fuel:
status: Triaged → Confirmed
Changed in fuel:
assignee: MOS Nova (mos-nova) → Roman Podoliaka (rpodolyaka)
status: Confirmed → In Progress
tags: added: nova
Changed in mos:
status: New → Triaged
importance: Undecided → High
assignee: nobody → MOS Oslo (mos-oslo)
milestone: none → 6.1
no longer affects: fuel
no longer affects: mos/7.0.x
summary: - Nova-compute status down after reboot master p_rabbitmq-server
+ nova-compute status down after reboot master p_rabbitmq-server
description: updated
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

OK, so I tried to reproduce this yesterday by enabling the eventlet backdoor port in the Nova services on compute nodes - it effectively allows you to open a Python shell within the context of a running process (https://github.com/openstack/nova/blob/stable/juno/nova/openstack/common/eventlet_backdoor.py#L35-L46), so that you can introspect the state of green/native threads.
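
For context, this is roughly what the eventlet backdoor helper does; a minimal standalone sketch (not the Nova code itself, and the port number is illustrative):

import eventlet
from eventlet import backdoor

# Spawn an in-process Python REPL next to the service's normal green threads,
# so the state of a running process can be inspected without restarting it.
eventlet.spawn(backdoor.backdoor_server, eventlet.listen(("127.0.0.1", 4444)))

# Stand-in for the service's normal event loop; meanwhile
# `telnet 127.0.0.1 4444` drops you into a Python shell inside the process.
while True:
    eventlet.sleep(60)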

This issue is hard to reproduce, so it takes some time, but about 10 reboots of the RabbitMQ 'master' (taken from crm status output) should be enough. After that, nova-compute eventually stops reporting its state to nova-conductor (done here https://github.com/openstack/nova/blob/master/nova/servicegroup/drivers/db.py#L79-L94 every `report_interval` seconds, which is 60s in our setups). There are no errors in the Nova logs; the report-state thread is just stuck.
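
The reporting pattern being described is roughly the following (an illustrative sketch, not the actual nova/servicegroup/drivers/db.py code; the callable name is hypothetical):

import eventlet

def report_state_loop(service_update_rpc, report_interval=60):
    # One green thread per service: every report_interval seconds it makes a
    # blocking RPC call to nova-conductor to refresh the service heartbeat.
    # If that single call never returns, the service soon shows up as "down"
    # even though the rest of the process is healthy.
    while True:
        service_update_rpc()
        eventlet.sleep(report_interval)

# e.g. eventlet.spawn(report_state_loop, conductor_service_update)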

stracing the nova-compute process didn't help much: all TCP connections to RabbitMQ showed activity from time to time; it was just the 'service_update' RPC calls that were missing.

Connecting to the eventlet backdoor port made it possible to get the actual backtrace of the report-state thread: http://paste.openstack.org/show/224750/

Introspecting eventlet showed that the corresponding FD (the one we expected events on when putting the green thread to sleep) was open: it was an active TCP connection transmitting various RPC messages. This effectively means that the report-state thread was actually *scheduled* by the eventlet main loop.

The problem is actually in the oslo.messaging code here: https://github.com/openstack/oslo.messaging/blob/stable/juno/oslo/messaging/_drivers/amqpdriver.py#L292-L297

Due to the peculiar model of receiving replies in the oslo.messaging version we use in 6.1, one of the calling threads becomes the one that receives all replies and puts them into thread-specific in-process message queues before the replies are actually dispatched. The problem with this thread is that if replies (arbitrary ones, for calls made by concurrent threads) keep arriving regularly, a Timeout exception is never raised for this 'receiver' thread.

Being a 'chatty' process, nova-compute may get stuck in its report_state thread, thus appearing down from the nova-api/nova-conductor point of view.
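
A runnable toy simulation of that failure mode (illustrative only; queue.Queue stands in for the shared AMQP reply connection, and none of these names exist in oslo.messaging):

import queue
import threading
import time

shared_replies = queue.Queue()   # shared reply queue, drained by one thread
stashed = {}                     # replies picked up on behalf of other callers

def wait_for_my_reply(my_msg_id, rpc_response_timeout=2):
    while True:
        # The timeout is applied per received message, not to the overall
        # wait -- this is the flaw described above.
        reply = shared_replies.get(timeout=rpc_response_timeout)
        if reply["msg_id"] == my_msg_id:
            return reply
        stashed.setdefault(reply["msg_id"], []).append(reply)

def chatty_traffic(n=10):
    # Replies for *other* RPC calls arrive every second, well under the 2s
    # timeout, so the waiter below is kept alive for ~10s even though its own
    # reply never comes; queue.Empty fires only once this traffic stops.
    for _ in range(n):
        shared_replies.put({"msg_id": "some-other-call"})
        time.sleep(1)

threading.Thread(target=chatty_traffic, daemon=True).start()
try:
    wait_for_my_reply("report_state")
except queue.Empty:
    print("timed out only after the unrelated replies stopped")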

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

This patch should fix the problem: https://review.openstack.org/#/c/161119/

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/oslo.messaging (openstack-ci/fuel-6.1/2014.2)

Fix proposed to branch: openstack-ci/fuel-6.1/2014.2
Change author: Jie Li <email address hidden>
Review: https://review.fuel-infra.org/6763

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Roman, thank you for the brilliant investigation! I believe we should reopen https://bugs.launchpad.net/mos/+bug/1399272 and apply the same steps to that issue as well.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/oslo.messaging (openstack-ci/fuel-6.1/2014.2)

Reviewed: https://review.fuel-infra.org/6763
Submitter: Oleksii Zamiatin <email address hidden>
Branch: openstack-ci/fuel-6.1/2014.2

Commit: 4a35df1025274280aa4820630cda8b66901fbf74
Author: Jie Li <email address hidden>
Date: Sun May 17 08:15:01 2015

Fix _poll_connection not timeout issue

_poll_connection could fall into a loop waiting for a reply message, if
rabbit dies and up. This commit will set up a rpc_response_timeout timer
for one connection polling; so the rpc will finally jump out with a
timeout exception which is expected in such scenario.

Closes-Bug: #1454174

Change-Id: I157dab80cdb4afcf9a5f26fa900f96f0696db502
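
A hedged sketch of the approach the commit message describes, with purely illustrative names (not the actual patch): bound the whole polling loop by one rpc_response_timeout deadline instead of restarting the timeout on every received reply.

import queue
import time

def wait_for_my_reply(shared_replies, stashed, my_msg_id,
                      rpc_response_timeout=60):
    # Same toy setup as the earlier sketch, but the timeout now covers the
    # whole wait rather than each individual poll.
    deadline = time.time() + rpc_response_timeout
    while True:
        remaining = deadline - time.time()
        if remaining <= 0:
            # The call as a whole is bounded now, even if unrelated replies
            # keep arriving, so a stuck RPC finally fails with a timeout.
            raise TimeoutError("no reply for %s within %ss"
                               % (my_msg_id, rpc_response_timeout))
        try:
            reply = shared_replies.get(timeout=remaining)
        except queue.Empty:
            continue  # re-check the deadline, which will raise above
        if reply["msg_id"] == my_msg_id:
            return reply
        stashed.setdefault(reply["msg_id"], []).append(reply)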

Revision history for this message
Veronica Krayneva (vkrayneva) wrote :

In my case, the procedure was repeated 10 times and the issue was not reproduced.

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "6.1"
  openstack_version: "2014.2.2-6.1"
  api: "1.0"
  build_number: "432"
  build_id: "2015-05-18_03-43-53"
  nailgun_sha: "076566b5df37f681c3fd5b139c966d680d81e0a5"
  python-fuelclient_sha: "38765563e1a7f14f45201fd47cf507393ff5d673"
  astute_sha: "cb655a9a9ad26848bcd9d9ace91857b6f4a0ec15"
  fuel-library_sha: "1621cb350af744f497c35f2b3bb889c2041465d8"
  fuel-ostf_sha: "9ce1800749081780b8b2a4a7eab6586583ffaf33"
  fuelmain_sha: "0e970647a83d9a7d336c4cc253606d4dd0d59a60"

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Thanks for the update, Veronica. We did 8 iterations on the same ISO and were unable to reproduce the issue.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/oslo.messaging (openstack-ci/fuel-6.0-updates/2014.2)

Fix proposed to branch: openstack-ci/fuel-6.0-updates/2014.2
Change author: Jie Li <email address hidden>
Review: https://review.fuel-infra.org/6794

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/oslo.messaging (openstack-ci/fuel-6.0.1/2014.2)

Fix proposed to branch: openstack-ci/fuel-6.0.1/2014.2
Change author: Jie Li <email address hidden>
Review: https://review.fuel-infra.org/6795

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/oslo.messaging (openstack-ci/fuel-6.0.1/2014.2)

Reviewed: https://review.fuel-infra.org/6795
Submitter: Oleksii Zamiatin <email address hidden>
Branch: openstack-ci/fuel-6.0.1/2014.2

Commit: 90b4b0f250809c6916043426daee2dba96516ba7
Author: Jie Li <email address hidden>
Date: Tue May 19 12:36:48 2015

Fix _poll_connection not timeout issue

_poll_connection could fall into a loop waiting for a reply message, if
rabbit dies and up. This commit will set up a rpc_response_timeout timer
for one connection polling; so the rpc will finally jump out with a
timeout exception which is expected in such scenario.

Closes-Bug: #1454174

Change-Id: I157dab80cdb4afcf9a5f26fa900f96f0696db502
(cherry picked from commit 4a35df1025274280aa4820630cda8b66901fbf74)

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/oslo.messaging (openstack-ci/fuel-6.0-updates/2014.2)

Reviewed: https://review.fuel-infra.org/6794
Submitter: Alex Khivin <email address hidden>
Branch: openstack-ci/fuel-6.0-updates/2014.2

Commit: f646144e5916734c73bfb8cbaf08b67d0a91f88a
Author: Jie Li <email address hidden>
Date: Tue May 19 12:00:43 2015

Fix _poll_connection not timeout issue

_poll_connection could fall into a loop waiting for a reply message, if
rabbit dies and up. This commit will set up a rpc_response_timeout timer
for one connection polling; so the rpc will finally jump out with a
timeout exception which is expected in such scenario.

Closes-Bug: #1454174

Change-Id: I157dab80cdb4afcf9a5f26fa900f96f0696db502
(cherry picked from commit 4a35df1025274280aa4820630cda8b66901fbf74)

Alexey Khivin (akhivin)
tags: added: oslo.messaging
tags: added: customer-found
Revision history for this message
JohnsonYi (yichengli) wrote :

I copied three files (amqpdriver.py, common.py, impl_rabbit.py) from MOS 6.1 into my Fuel 6.0/2014.2 environment, and this bug was fixed.
When the RabbitMQ primary node is shut down, the nova-compute service on a compute node is marked as down for a short while and then comes back up again. FYI

no longer affects: mos/6.0.1-updates
Revision history for this message
Andrey Grebennikov (agrebennikov) wrote :

Still affects one of the customers' deployments.

Fuel 6.1, with the new oslo.messaging package from MU1 installed.
At some point I noticed that 17 of 60 compute nodes were marked as down.

What happened was: the RabbitMQ server went down, and the compute nodes tried to reconnect to it and got stuck:

root@GGUTTHLDK008:~# date
Tue Sep 29 21:56:05 UTC 2015

<182>Sep 27 10:38:39 GGUTTHLDK008 nova-compute Connected to AMQP server on 192.168.26.71:5673
<179>Sep 27 10:38:42 GGUTTHLDK008 nova-compute (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 935, in _connection_maintenance
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit conn.drain_events(timeout=0.01)
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 624, in drain_events
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit self.connection.drain_events(timeout=timeout)
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 279, in drain_events
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit return self.transport.drain_events(self.connection, **kwargs)
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/transport/pyamqp.py", line 91, in drain_events
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit return connection.drain_events(**kwargs)
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/amqp/connection.py", line 320, in drain_events
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit return amqp_method(channel, args)
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/amqp/connection.py", line 526, in _close
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit (class_id, method_id), ConnectionError)
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit ConnectionForced: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit
<182>Sep 27 10:38:42 GGUTTHLDK008 nova-compute (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'
<182>Sep 27 10:38:42 GGUTTHLDK008 nova-compute Delaying reconnect for 1.0 seconds ...
<182>Sep 27 10:38:43 GGUTTHLDK008 nova-compute Connecting to AMQP server on 192.168.26.71:5673
<182>Sep 27 10:38:44 GGUTTHLDK008 nova-compute Connected to AMQP server on 192.168.26.71:5673

This is the last message in the log file of nova-compute.

Strace output shows that nothing happens in the broken service, while there is...
