nova-compute status down after reboot master p_rabbitmq-server

Bug #1454174 reported by Veronica Krayneva
This bug affects 5 people
Affects              Status        Importance  Assigned to     Milestone
Mirantis OpenStack   Fix Released  High        MOS Oslo
  5.1.x              Invalid       High        Alexey Khivin
  6.0.x              Fix Released  High        Alexey Khivin
  6.1.x              Fix Released  High        MOS Oslo

Bug Description

Repeat up to 10 times to reproduce:

Steps:

1. Create a cluster in HA mode
2. Add 3 nodes with controller and mongo roles
3. Add 2 nodes with compute and cinder roles
4. Deploy the cluster

5. Reboot node-3 (the controller running the master of p_rabbitmq-server):
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-3.test.domain.local ]
     Slaves: [ node-1.test.domain.local node-2.test.domain.local ]

6. Check nova service-list

Actual result:

[root@node-3 ~]# nova service-list
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| 37 | nova-compute     | node-5.test.domain.local | nova     | enabled | down  | 2015-05-12T09:00:46.000000 | -               |
| 40 | nova-compute     | node-4.test.domain.local | nova     | enabled | up    | 2015-05-12T09:33:55.000000 | -               |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+

Expected result:
All nova-compute services are up.
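
For reference, the check in step 6 can also be scripted with python-novaclient; a minimal sketch, assuming a Juno-era novaclient and placeholder credentials/endpoint:

# Hedged sketch: poll the nova-compute service state the same way
# "nova service-list" does. Credentials and auth URL are placeholders.
from novaclient import client

nova = client.Client("2", "admin", "admin_password", "admin",
                     "http://192.168.0.2:5000/v2.0")

for svc in nova.services.list(binary="nova-compute"):
    # svc.status is "enabled"/"disabled"; svc.state is "up"/"down"
    print(svc.host, svc.status, svc.state, svc.updated_at)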

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "6.1"
  openstack_version: "2014.2.2-6.1"
  api: "1.0"
  build_number: "388"
  build_id: "2015-05-07_03-59-48"
  nailgun_sha: "7437e500c3e2673b124af4c7634ac04121b1c949"
  python-fuelclient_sha: "af6c9c3799b9ec107bcdc6dbf035cafc034526ce"
  astute_sha: "83b23783edbfacd25aded7a79e0aca17fc883e79"
  fuel-library_sha: "a7a794e68086014f8163f79a3dad1c9a074b1f3a"
  fuel-ostf_sha: "1a09c09ee7f6b7622dfc7e37dce70c4537c0b532"
  fuelmain_sha: "68e6f08e23845516e1995df59e038ebecb26e6b8"

affects: nova → fuel
description: updated
description: updated
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Veronica, can we get a diagnostic snapshot for the issue? It seems there are some problems with nova-compute only on node-4, but it is impossible to investigate the issue without logs.

Changed in fuel:
milestone: none → 6.1
assignee: nobody → Veronica Krayneva (vkrayneva)
importance: Undecided → High
status: New → Incomplete
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

It's hard to tell for sure without logs, but it might be a duplicate of https://bugs.launchpad.net/mos/+bug/1452632

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

... or it might be a rabbitmq split brain and nova-compute/nova-conductor not seeing each other

Revision history for this message
Veronica Krayneva (vkrayneva) wrote :

Unfortunately, yesterday's cluster deployment failed; I will try to deploy a cluster for this issue today. However, I have an equivalent diagnostic snapshot from CIC maintenance-mode testing: we put controllers into maintenance mode (effectively the same as a reboot), 'nova-compute' on one compute node was marked as down, and the OSTF test 'Check that required services are running' failed.

Diagnostic snapshot is attached.

Revision history for this message
Veronica Krayneva (vkrayneva) wrote :
Changed in fuel:
status: Incomplete → New
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Veronica, is this snapshot from the same RabbitMQ bug or not?

Changed in fuel:
status: New → Incomplete
Revision history for this message
Veronica Krayneva (vkrayneva) wrote :

As far as I can tell, yes.
If a controller that does not host the RabbitMQ master is rebooted, this does not happen. However, when the controller hosting the RabbitMQ master is rebooted, the nova-compute service goes down.

Changed in fuel:
status: Incomplete → New
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

According to the environment status, there are no issues with RabbitMQ cluster failover. However, nova-compute services are reported down because rabbit_heartbeat was not configured for the Nova services on any node. This is strange, as IIRC it should be enabled for all OpenStack services except Neutron.

summary: - Nova-compute won't start after reboot master p_rabbitmq-server
+ Nova-compute won't start after reboot master p_rabbitmq-server as there
+ is no rabbit_heartbeat configured
Changed in fuel:
status: New → Triaged
Changed in fuel:
assignee: Veronica Krayneva (vkrayneva) → Bogdan Dobrelya (bogdando)
summary: Nova-compute won't start after reboot master p_rabbitmq-server as there
- is no rabbit_heartbeat configured
+ is in-optimal default value used for the rabbit_heartbeat
Revision history for this message
Bogdan Dobrelya (bogdando) wrote : Re: Nova-compute won't start after reboot master p_rabbitmq-server as there is in-optimal default value used for the rabbit_heartbeat

Updating the rabbit_heartbeat for Nova did not completely resolve the issue. It looks like there are no more errors like:
2015-05-15T14:05:20.322089+00:00 notice: {amqp_error,not_found,
2015-05-15T14:05:20.322089+00:00 notice: "no exchange 'reply_546500363b244e1a8307a61c7dd1a9da' in vhost '/'",
2015-05-15T14:05:20.322089+00:00 notice: 'exchange.declare'}

in the RabbitMQ logs after the node failover, but nova service-list may still show some compute services as down. @Roman, MOS Nova team, please take a closer look at this bug. Perhaps it is indeed a duplicate of https://bugs.launchpad.net/mos/+bug/1452632

Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → MOS Nova (mos-nova)
summary: - Nova-compute won't start after reboot master p_rabbitmq-server as there
- is in-optimal default value used for the rabbit_heartbeat
+ Nova-compute status down after reboot master p_rabbitmq-server
Changed in fuel:
status: Triaged → Confirmed
Changed in fuel:
assignee: MOS Nova (mos-nova) → Roman Podoliaka (rpodolyaka)
status: Confirmed → In Progress
tags: added: nova
Changed in mos:
status: New → Triaged
importance: Undecided → High
assignee: nobody → MOS Oslo (mos-oslo)
milestone: none → 6.1
no longer affects: fuel
no longer affects: mos/7.0.x
summary: - Nova-compute status down after reboot master p_rabbitmq-server
+ nova-compute status down after reboot master p_rabbitmq-server
description: updated
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

OK, so I tried to reproduce this yesterday by enabling the eventlet backdoor port in the Nova services on compute nodes - it effectively allows you to open a Python shell within the context of a running process (https://github.com/openstack/nova/blob/stable/juno/nova/openstack/common/eventlet_backdoor.py#L35-L46), so that you can introspect the state of green/native threads.
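
For context, this is roughly what the eventlet backdoor helper does; a minimal standalone sketch (not the Nova code itself, and the port number is illustrative):

import eventlet
from eventlet import backdoor

# Spawn an in-process Python REPL next to the service's normal green threads,
# so the state of a running process can be inspected without restarting it.
eventlet.spawn(backdoor.backdoor_server, eventlet.listen(("127.0.0.1", 4444)))

# Stand-in for the service's normal event loop; meanwhile
# `telnet 127.0.0.1 4444` drops you into a Python shell inside the process.
while True:
    eventlet.sleep(60)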

This issue is hard to reproduce, so it takes some time, but about 10 reboots of the RabbitMQ 'master' (taken from crm status output) should be enough. After that, nova-compute eventually stops reporting its state to nova-conductor (done here https://github.com/openstack/nova/blob/master/nova/servicegroup/drivers/db.py#L79-L94 every `report_interval` seconds, which is 60s in our setups). There are no errors in the Nova logs; the report-state thread is just stuck.
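
The reporting pattern being described is roughly the following (an illustrative sketch, not the actual nova/servicegroup/drivers/db.py code; the callable name is hypothetical):

import eventlet

def report_state_loop(service_update_rpc, report_interval=60):
    # One green thread per service: every report_interval seconds it makes a
    # blocking RPC call to nova-conductor to refresh the service heartbeat.
    # If that single call never returns, the service soon shows up as "down"
    # even though the rest of the process is healthy.
    while True:
        service_update_rpc()
        eventlet.sleep(report_interval)

# e.g. eventlet.spawn(report_state_loop, conductor_service_update)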

stracing the nova-compute process didn't help much: all TCP connections to RabbitMQ showed activity from time to time; it was just the 'service_update' RPC calls that were missing.

Connecting to the eventlet backdoor port made it possible to get the actual backtrace of the report-state thread: http://paste.openstack.org/show/224750/

Introspecting eventlet showed that the corresponding FD (the one we expected events on when putting the green thread to sleep) was open: it was an active TCP connection transmitting various RPC messages. This effectively means that the report-state thread was actually *scheduled* by the eventlet main loop.

The problem is actually in the oslo.messaging code here: https://github.com/openstack/oslo.messaging/blob/stable/juno/oslo/messaging/_drivers/amqpdriver.py#L292-L297

Due to the peculiar model of receiving replies in the oslo.messaging version we use in 6.1, one of the calling threads becomes the one that receives all replies and puts them into thread-specific in-process message queues before the replies are actually dispatched. The problem with this thread is that if replies (arbitrary ones, for calls made by concurrent threads) keep arriving regularly, a Timeout exception is never raised for this 'receiver' thread.

Being a 'chatty' process, nova-compute may get stuck in its report_state thread, thus appearing down from the nova-api/nova-conductor point of view.
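
A runnable toy simulation of that failure mode (illustrative only; queue.Queue stands in for the shared AMQP reply connection, and none of these names exist in oslo.messaging):

import queue
import threading
import time

shared_replies = queue.Queue()   # shared reply queue, drained by one thread
stashed = {}                     # replies picked up on behalf of other callers

def wait_for_my_reply(my_msg_id, rpc_response_timeout=2):
    while True:
        # The timeout is applied per received message, not to the overall
        # wait -- this is the flaw described above.
        reply = shared_replies.get(timeout=rpc_response_timeout)
        if reply["msg_id"] == my_msg_id:
            return reply
        stashed.setdefault(reply["msg_id"], []).append(reply)

def chatty_traffic(n=10):
    # Replies for *other* RPC calls arrive every second, well under the 2s
    # timeout, so the waiter below is kept alive for ~10s even though its own
    # reply never comes; queue.Empty fires only once this traffic stops.
    for _ in range(n):
        shared_replies.put({"msg_id": "some-other-call"})
        time.sleep(1)

threading.Thread(target=chatty_traffic, daemon=True).start()
try:
    wait_for_my_reply("report_state")
except queue.Empty:
    print("timed out only after the unrelated replies stopped")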

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

This patch should fix the problem: https://review.openstack.org/#/c/161119/

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/oslo.messaging (openstack-ci/fuel-6.1/2014.2)

Fix proposed to branch: openstack-ci/fuel-6.1/2014.2
Change author: Jie Li <email address hidden>
Review: https://review.fuel-infra.org/6763

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Roman, thank you for the brilliant investigation! I believe we should reopen https://bugs.launchpad.net/mos/+bug/1399272 and apply the same steps to that issue as well.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/oslo.messaging (openstack-ci/fuel-6.1/2014.2)

Reviewed: https://review.fuel-infra.org/6763
Submitter: Oleksii Zamiatin <email address hidden>
Branch: openstack-ci/fuel-6.1/2014.2

Commit: 4a35df1025274280aa4820630cda8b66901fbf74
Author: Jie Li <email address hidden>
Date: Sun May 17 08:15:01 2015

Fix _poll_connection not timeout issue

_poll_connection could fall into a loop waiting for a reply message, if
rabbit dies and up. This commit will set up a rpc_response_timeout timer
for one connection polling; so the rpc will finally jump out with a
timeout exception which is expected in such scenario.

Closes-Bug: #1454174

Change-Id: I157dab80cdb4afcf9a5f26fa900f96f0696db502
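
A hedged sketch of the approach the commit message describes, with purely illustrative names (not the actual patch): bound the whole polling loop by one rpc_response_timeout deadline instead of restarting the timeout on every received reply.

import queue
import time

def wait_for_my_reply(shared_replies, stashed, my_msg_id,
                      rpc_response_timeout=60):
    # Same toy setup as the earlier sketch, but the timeout now covers the
    # whole wait rather than each individual poll.
    deadline = time.time() + rpc_response_timeout
    while True:
        remaining = deadline - time.time()
        if remaining <= 0:
            # The call as a whole is bounded now, even if unrelated replies
            # keep arriving, so a stuck RPC finally fails with a timeout.
            raise TimeoutError("no reply for %s within %ss"
                               % (my_msg_id, rpc_response_timeout))
        try:
            reply = shared_replies.get(timeout=remaining)
        except queue.Empty:
            continue  # re-check the deadline, which will raise above
        if reply["msg_id"] == my_msg_id:
            return reply
        stashed.setdefault(reply["msg_id"], []).append(reply)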

Revision history for this message
Veronica Krayneva (vkrayneva) wrote :

In my case, the procedure was repeated 10 times and the issue was not reproduced.

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "6.1"
  openstack_version: "2014.2.2-6.1"
  api: "1.0"
  build_number: "432"
  build_id: "2015-05-18_03-43-53"
  nailgun_sha: "076566b5df37f681c3fd5b139c966d680d81e0a5"
  python-fuelclient_sha: "38765563e1a7f14f45201fd47cf507393ff5d673"
  astute_sha: "cb655a9a9ad26848bcd9d9ace91857b6f4a0ec15"
  fuel-library_sha: "1621cb350af744f497c35f2b3bb889c2041465d8"
  fuel-ostf_sha: "9ce1800749081780b8b2a4a7eab6586583ffaf33"
  fuelmain_sha: "0e970647a83d9a7d336c4cc253606d4dd0d59a60"

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Thanks for the update, Veronica. We did 8 iterations on the same ISO and were unable to reproduce the issue.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/oslo.messaging (openstack-ci/fuel-6.0-updates/2014.2)

Fix proposed to branch: openstack-ci/fuel-6.0-updates/2014.2
Change author: Jie Li <email address hidden>
Review: https://review.fuel-infra.org/6794

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/oslo.messaging (openstack-ci/fuel-6.0.1/2014.2)

Fix proposed to branch: openstack-ci/fuel-6.0.1/2014.2
Change author: Jie Li <email address hidden>
Review: https://review.fuel-infra.org/6795

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/oslo.messaging (openstack-ci/fuel-6.0.1/2014.2)

Reviewed: https://review.fuel-infra.org/6795
Submitter: Oleksii Zamiatin <email address hidden>
Branch: openstack-ci/fuel-6.0.1/2014.2

Commit: 90b4b0f250809c6916043426daee2dba96516ba7
Author: Jie Li <email address hidden>
Date: Tue May 19 12:36:48 2015

Fix _poll_connection not timeout issue

_poll_connection could fall into a loop waiting for a reply message, if
rabbit dies and up. This commit will set up a rpc_response_timeout timer
for one connection polling; so the rpc will finally jump out with a
timeout exception which is expected in such scenario.

Closes-Bug: #1454174

Change-Id: I157dab80cdb4afcf9a5f26fa900f96f0696db502
(cherry picked from commit 4a35df1025274280aa4820630cda8b66901fbf74)

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/oslo.messaging (openstack-ci/fuel-6.0-updates/2014.2)

Reviewed: https://review.fuel-infra.org/6794
Submitter: Alex Khivin <email address hidden>
Branch: openstack-ci/fuel-6.0-updates/2014.2

Commit: f646144e5916734c73bfb8cbaf08b67d0a91f88a
Author: Jie Li <email address hidden>
Date: Tue May 19 12:00:43 2015

Fix _poll_connection not timeout issue

_poll_connection could fall into a loop waiting for a reply message, if
rabbit dies and up. This commit will set up a rpc_response_timeout timer
for one connection polling; so the rpc will finally jump out with a
timeout exception which is expected in such scenario.

Closes-Bug: #1454174

Change-Id: I157dab80cdb4afcf9a5f26fa900f96f0696db502
(cherry picked from commit 4a35df1025274280aa4820630cda8b66901fbf74)

Alexey Khivin (akhivin)
tags: added: oslo.messaging
tags: added: customer-found
Revision history for this message
JohnsonYi (yichengli) wrote :

I copied three files (amqpdriver.py, common.py, impl_rabbit.py) from MOS 6.1 into my Fuel 6.0/2014.2 environment, and this bug was fixed.
When the RabbitMQ primary node is shut down, the nova-compute service on a compute node is marked as down for a short while and then comes back up again. FYI

no longer affects: mos/6.0.1-updates
Revision history for this message
Andrey Grebennikov (agrebennikov) wrote :

Still affects one of the customers' deployments.

Fuel 6.1, with the new oslo.messaging package from MU1 installed.
At some point I noticed that 17 of 60 compute nodes were marked as down.

What happened was: the RabbitMQ server went down, and the compute nodes tried to reconnect to it and got stuck:

root@GGUTTHLDK008:~# date
Tue Sep 29 21:56:05 UTC 2015

<182>Sep 27 10:38:39 GGUTTHLDK008 nova-compute Connected to AMQP server on 192.168.26.71:5673
<179>Sep 27 10:38:42 GGUTTHLDK008 nova-compute (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 935, in _connection_maintenance
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit conn.drain_events(timeout=0.01)
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 624, in drain_events
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit self.connection.drain_events(timeout=timeout)
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 279, in drain_events
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit return self.transport.drain_events(self.connection, **kwargs)
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/transport/pyamqp.py", line 91, in drain_events
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit return connection.drain_events(**kwargs)
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/amqp/connection.py", line 320, in drain_events
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit return amqp_method(channel, args)
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/amqp/connection.py", line 526, in _close
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit (class_id, method_id), ConnectionError)
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit ConnectionForced: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'
2015-09-27 10:38:42.513 5692 TRACE oslo.messaging._drivers.impl_rabbit
<182>Sep 27 10:38:42 GGUTTHLDK008 nova-compute (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'
<182>Sep 27 10:38:42 GGUTTHLDK008 nova-compute Delaying reconnect for 1.0 seconds ...
<182>Sep 27 10:38:43 GGUTTHLDK008 nova-compute Connecting to AMQP server on 192.168.26.71:5673
<182>Sep 27 10:38:44 GGUTTHLDK008 nova-compute Connected to AMQP server on 192.168.26.71:5673

This is the last message in the log file of nova-compute.

Strace output shows that nothing happens in the broken service, while there is...
