Lost communication between nova-conductor and nova-compute because of lost exchanges

Bug #1579128 reported by Ilya Shakhat
This bug affects 1 person
Affects: Mirantis OpenStack (status tracked in 10.0.x)

  Series   Status         Importance   Assigned to
  10.0.x   Fix Committed  High         Kirill Bespalov
  7.0.x    Fix Released   High         Kirill Bespalov
  8.0.x    Fix Released   High         Kirill Bespalov
  9.x      Fix Released   High         Kirill Bespalov

Bug Description

At 200-node scale, Nova reports that all nova-compute services are down. The issue appeared after some load, but at the moment of observation the system had been idle for more than a day (so it obviously had time to recover).

According to the nova-compute logs, it tries to ping the conductor but fails to get a reply:
2016-05-06 15:06:36.561 23305 ERROR oslo_service.periodic_task [req-0a0cea39-f73d-4be0-8007-3c96997d3df7 - - - - -] Error during ComputeManager.update_available_resource
2016-05-06 15:06:36.561 23305 ERROR oslo_service.periodic_task MessagingTimeout: Timed out waiting for a reply to message ID 3385762122d749509aeaf1b053297a19

The original message is present in the traffic capture and is received by the conductor:
2016-05-06 15:05:36.562 16992 DEBUG oslo_messaging._drivers.amqpdriver [-] received message msg_id: 3385762122d749509aeaf1b053297a19 size: 1232 reply to: reply_ffdd1b7c9c994f919bf3e7e2bde0f3d0 __call__ /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:230

Message body:
{"oslo.message": "{\"_context_domain\": null, \"_msg_id\": \"3385762122d749509aeaf1b053297a19\", \"_context_quota_class\": null, \"_context_read_only\": false, \"_context_request_id\": \"req-0c2026ff-f767-4152-ac05-0bd1a779f702\", \"_context_service_catalog\": [], \"args\": {\"object_versions\": {\"ComputeNodeList\": \"1.14\", \"ComputeNode\": \"1.16\", \"HVSpec\": \"1.2\", \"PciDevicePool\": \"1.1\", \"PciDevicePoolList\": \"1.1\"}, \"objmethod\": \"get_all_by_host\", \"args\": [\"node-63.domain.tld\"], \"objname\": \"ComputeNodeList\", \"kwargs\": {\"use_slave\": true}}, \"_unique_id\": \"608fe18727d24ababe01b47eee4ec360\", \"_context_resource_uuid\": null, \"_context_instance_lock_checked\": false, \"_context_user\": null, \"_context_user_id\": null, \"_context_project_name\": null, \"_context_read_deleted\": \"no\", \"_context_user_identity\": \"- - - - -\", \"_reply_q\": \"reply_ffdd1b7c9c994f919bf3e7e2bde0f3d0\", \"_context_auth_token\": null, \"_context_show_deleted\": false, \"_context_tenant\": null, \"_context_roles\": [], \"_context_is_admin\": true, \"version\": \"3.0\", \"_context_project_id\": null, \"_context_project_domain\": null, \"_context_timestamp\": \"2016-05-06T15:04:36.552980\", \"_context_user_domain\": null, \"_context_user_name\": null, \"method\": \"object_class_action_versions\", \"_context_remote_address\": null}", "oslo.version": "2.0"}

The reply queue is present:
root@node-159:~# rabbitmqctl list_consumers | grep reply_ffdd1b7c9c994f919bf3e7e2bde0f3d0
reply_ffdd1b7c9c994f919bf3e7e2bde0f3d0 <email address hidden> 1 true 0 []

root@node-159:~# rabbitmqctl list_channels name pid connection | grep <email address hidden>
192.168.0.65:53749 -> 192.168.0.88:5673 (1) <email address hidden> <email address hidden>

The nova-compute service is connected to RabbitMQ and the connection is alive:
root@node-63:~# lsof -i 4 | grep 53749
nova-comp 23305 nova 4u IPv4 2422830 0t0 TCP node-63.domain.tld:53749->messaging-node-159.domain.tld:5673 (ESTABLISHED)

However, there is no evidence that the reply was ever sent to the compute service (it is not present in the traffic capture).
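
For context, the timed-out request is an ordinary synchronous oslo.messaging RPC call: the periodic task publishes the request to the conductor topic and then waits on its private reply_* queue for rpc_response_timeout (60 seconds by default). A minimal sketch of the compute-side call is below; this is illustrative rather than Nova's actual code, the transport URL is a placeholder, and the method name and arguments are copied from the captured message body above.

# Illustrative sketch only -- not Nova's code. Transport URL is a placeholder;
# method name and arguments come from the captured message body above.
from oslo_config import cfg
import oslo_messaging

transport = oslo_messaging.get_transport(
    cfg.CONF, 'rabbit://nova:secret@192.168.0.88:5673/')
target = oslo_messaging.Target(topic='conductor', version='3.0')
client = oslo_messaging.RPCClient(transport, target, timeout=60)

try:
    # The request is published to the conductor topic; the caller then blocks
    # on its private reply_* queue. If the reply never arrives (as in this
    # bug), the call raises MessagingTimeout after `timeout` seconds.
    client.call({}, 'object_class_action_versions',
                objname='ComputeNodeList',
                objmethod='get_all_by_host',
                args=['node-63.domain.tld'],
                kwargs={'use_slave': True},
                object_versions={'ComputeNodeList': '1.14'})
except oslo_messaging.MessagingTimeout:
    print('Timed out waiting for a reply -- the error seen in the log above')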

-------------

Restarting one of the nova-conductor services helps, but over time the number of down compute services increases, and after approximately an hour the situation becomes stable again (all services down). The conductor logs are different: according to the conductor, it at least tries to send the reply for about a minute:

2016-05-06 15:18:05.828 36910 DEBUG oslo_messaging._drivers.amqpdriver [-] received message msg_id: b69938a0ca55484babe95e75b8542c66 size: 1736 reply to: reply_3a37becbbdd7416794cf881453d2611a __call__ /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:230

2016-05-06 15:18:06.637 36910 DEBUG oslo_messaging._drivers.amqpdriver [req-abcbd8e1-dd37-4b66-9fd2-d63dcfd55717 - - - - -] REPLY msg_id: b69938a0ca55484babe95e75b8542c66 size: 278 reply queue: reply_3a37becbbdd7416794cf881453d2611a time elapsed: 0.0372149180039s _send_reply /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:96

2016-05-06 15:18:07.208 36910 DEBUG oslo_messaging._drivers.amqpdriver [req-abcbd8e1-dd37-4b66-9fd2-d63dcfd55717 - - - - -] The reply b69938a0ca55484babe95e75b8542c66 cannot be sent reply_3a37becbbdd7416794cf881453d2611a reply queue don't exist, retrying... reply /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:130

2016-05-06 15:19:07.169 36910 WARNING oslo_messaging._drivers.amqpdriver [req-abcbd8e1-dd37-4b66-9fd2-d63dcfd55717 - - - - -] reply_3a37becbbdd7416794cf881453d2611a doesn't exists, drop reply to b69938a0ca55484babe95e75b8542c66

2016-05-06 15:19:07.519 36910 INFO oslo_messaging._drivers.amqpdriver [req-abcbd8e1-dd37-4b66-9fd2-d63dcfd55717 - - - - -] The reply b69938a0ca55484babe95e75b8542c66 cannot be sent reply_3a37becbbdd7416794cf881453d2611a reply queue don't exist after 60 sec abandoning...

In the stable state there are no logs related to replying at all.
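
For clarity, the retry behaviour visible in the conductor log above boils down to roughly the following pattern. This is a simplified sketch, not the actual amqpdriver code; publish_reply() is a hypothetical callable that raises LookupError while the reply queue/exchange is missing, and the 60-second limit matches the "after 60 sec abandoning" message.

# Simplified sketch of the behaviour in the log above -- not the real driver code.
import time

REPLY_RETRY_TIMEOUT = 60   # matches "after 60 sec abandoning..."

def send_reply_with_retry(publish_reply, msg_id, reply_q):
    deadline = time.time() + REPLY_RETRY_TIMEOUT
    while True:
        try:
            publish_reply()
            return True
        except LookupError:
            if time.time() >= deadline:
                # give up, exactly like the WARNING/INFO lines above
                print("%s doesn't exist, drop reply to %s" % (reply_q, msg_id))
                return False
            print('The reply %s cannot be sent, reply queue %s does not '
                  'exist, retrying...' % (msg_id, reply_q))
            time.sleep(0.25)

# Example (hypothetical): a publish that always fails because the queue is gone.
# Running this retries for 60 seconds and then drops the reply.
def failing_publish():
    raise LookupError('no exchange in vhost /')

send_reply_with_retry(failing_publish, 'b69938a0ca55484babe95e75b8542c66',
                      'reply_3a37becbbdd7416794cf881453d2611a')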

Revision history for this message
Ilya Shakhat (shakhat) wrote :

VERSION:
[root@fuel ~]# fuel2 fuel-version
api: '1'
auth_required: true
feature_groups: []
openstack_version: mitaka-9.0
release: '9.0'

DEPLOYMENT:
200 HW nodes, Neutron DVR

tags: added: nova scale
Changed in mos:
milestone: none → 9.0
assignee: nobody → MOS Nova (mos-nova)
Revision history for this message
Bug Checker Bot (bug-checker) wrote : Autochecker

(This check performed automatically)
Please, make sure that bug description contains the following sections filled in with the appropriate data related to the bug you are describing:

actual result

expected result

steps to reproduce

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
Revision history for this message
Ilya Shakhat (shakhat) wrote : Re: Lost communication between nova-conductor and nova-compute

Interestingly, the status of the nova-computes stays "up" even though they cannot receive the reply messages. So it looks like, after some point, nova-conductor doesn't have the resources (threads?) to process incoming messages and update the DB.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

>>> Interesting that the status of nova-computes stays "up" even if it cannot receive the reply message. So it looks like after some point nova-conductor doesn't have resources (threads?) to process incoming messages and update DB.

It does not need to: by the time nova-conductor tries to send a reply, it has already updated the corresponding row in the DB.

My understanding was that nova-conductor actually receives RPC calls, but it just can't send replies, as the reply queues are missing for some reason.
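
To illustrate the ordering, a purely hypothetical sketch with made-up names (not Nova's actual code): the RPC endpoint commits the DB change while handling the request, and only afterwards does the messaging layer publish the result to the caller's reply_* queue, so a lost reply does not undo the "up" heartbeat.

# Hypothetical sketch of the ordering described above -- names are made up.
def db_update_service(context, values):
    # stand-in for the real DB update; in the failing scenario it still succeeds
    return dict(values, updated_at='now')

def service_update_endpoint(context, values):
    row = db_update_service(context, values)   # 1) DB row ("up" heartbeat) updated
    return row                                  # 2) only now does the messaging layer
                                                #    try to publish the result to the
                                                #    caller's reply_* queue

print(service_update_endpoint({}, {'host': 'node-63.domain.tld'}))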

Revision history for this message
Ilya Shakhat (shakhat) wrote :

Investigation shows two issues:

1) The reply queue exists in RabbitMQ, but the corresponding exchange does not (a small diagnostic sketch for checking this follows after point 2). The following error is present in the RabbitMQ log:

=ERROR REPORT==== 3-May-2016::21:36:02 ===
Channel error on connection <0.17745.2> (192.168.0.126:55899 -> 192.168.0.131:5673, vhost: '/', user: 'nova'), channel 1:
operation queue.declare caused a channel exception not_found: "failed to perform operation on queue 'reply_09181cdbdf894afeabcc99be8983f528' in vhost '/' due to timeout"

As a result, all messages published to the queue fail with:

=ERROR REPORT==== 3-May-2016::22:43:21 ===
Channel error on connection <0.9232.15> (192.168.0.131:51606 -> 192.168.0.131:5673, vhost: '/', user: 'nova'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_09181cdbdf894afeabcc99be8983f528' in vhost '/'"

2) All threads inside nova-conductor stall in a read forever; as a result, the service can neither process messages nor reply to them. A separate bug will be filed to track this.
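
Returning to issue 1: one way to check from a client whether the reply exchange still exists is to declare it passively. The sketch below is diagnostic only; the connection URL is a placeholder and the exchange name is taken from the log above. A 404 here corresponds to the "no exchange 'reply_...' in vhost '/'" error that every reply publish hits.

# Diagnostic sketch only -- URL is a placeholder, the name comes from the log above.
import amqp                       # py-amqp, the transport kombu uses for RabbitMQ
from kombu import Connection, Exchange

REPLY = 'reply_09181cdbdf894afeabcc99be8983f528'

with Connection('amqp://nova:secret@192.168.0.131:5673//') as conn:
    channel = conn.channel()
    try:
        # passive=True only checks for existence; it never (re)creates the exchange
        Exchange(REPLY, type='direct', passive=True).bind(channel).declare()
        print('exchange still exists, replies can be routed')
    except amqp.exceptions.NotFound:
        print('exchange is gone: every reply published to it is rejected with '
              'not_found and the caller eventually hits MessagingTimeout')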

Revision history for this message
Ilya Shakhat (shakhat) wrote :
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

So far, this has only been reproduced once on Ilya's environment. Tentatively marking this as Medium - we can always bump the priority to High, if needed.

Changed in mos:
status: New → Won't Fix
importance: Undecided → Medium
tags: added: area-nova
removed: nova
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

So, what we have here: #2 will be fixed for everything but Neutron by enabling built-in heartbeats.

For #1, lost exchanges will be the normal case, as we no longer use an ha-policy for them, so the applications had better be prepared for that!

We could also try to apply an ha-policy to exchanges *only*, which is a pretty funny band-aid, but that requires clear naming rules for exchanges in the OpenStack apps and oslo.

For example, see this snippet http://pastebin.com/NU7747BX

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
summary: - Lost communication between nova-conductor and nova-compute
+ Lost communication between nova-conductor and nova-compute because of
+ lost exchanges
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

It seems the patch fixing #1 is https://review.openstack.org/315700

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Raising to High, as the related missing-exchanges bug (#1) is also High: https://bugs.launchpad.net/mos/+bug/1572085

Changed in mos:
importance: Medium → High
status: Won't Fix → Triaged
no longer affects: mos/9.0.x
Changed in mos:
milestone: 9.0 → 10.0
no longer affects: mos/10.0.x
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

I agree with Bogdan that Kirill's commit will close the bug. Hence reassigning the bug to Kirill.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/oslo.messaging (9.0/mitaka)

Fix proposed to branch: 9.0/mitaka
Change author: Kirill Bespalov <email address hidden>
Review: https://review.fuel-infra.org/21134

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/oslo.messaging (9.0/mitaka)

Reviewed: https://review.fuel-infra.org/21134
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0/mitaka

Commit: f739aa5af33b77e3718b3e594798d02cad2850a5
Author: Kirill Bespalov <email address hidden>
Date: Mon May 23 14:04:31 2016

Fix consuming from missing queues

Redeclare queues on 'Basic.consume: (404) no queue' exception
and enable by default consumer declaration with nowait=False
in order to wait for a broker response.

Change-Id: I99f2bc858dbc7c18a2f328ee26f39105ed17cee3
Closes-Bug: #1579128
Closes-Bug: #1572085

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The fix for 10.0 is merged upstream: https://review.openstack.org/#/c/315700/
The fix for 9.0 was backported; see the comment above.
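
In short, the change declares the consumer with nowait=False so the broker confirms it, and redeclares the reply queue when the broker answers 404 instead of leaving the service without a consumer. Below is a rough kombu sketch of that pattern; it is not the actual oslo.messaging change, the connection URL is a placeholder, and the queue name simply reuses the one from this bug.

# Rough sketch of the redeclare-on-404 pattern -- not the real oslo.messaging code.
import socket

import amqp                        # py-amqp, kombu's RabbitMQ transport
from kombu import Connection, Consumer, Exchange, Queue

REPLY_Q = 'reply_ffdd1b7c9c994f919bf3e7e2bde0f3d0'
reply_queue = Queue(REPLY_Q,
                    exchange=Exchange(REPLY_Q, type='direct'),
                    routing_key=REPLY_Q, auto_delete=True)

def handle_reply(body, message):
    print('got reply:', body)
    message.ack()

conn = Connection('amqp://nova:secret@192.168.0.88:5673//')   # placeholder URL
channel = conn.channel()
consumer = Consumer(channel, queues=[reply_queue.bind(channel)],
                    callbacks=[handle_reply], auto_declare=False)
try:
    # basic.consume is sent with nowait=False, so if the queue is missing the
    # broker answers with a synchronous 404 channel error instead of leaving
    # the service silently without a consumer (the situation in this bug).
    consumer.consume()
except amqp.exceptions.NotFound:
    # The queue was lost (e.g. after a RabbitMQ failover) and the broker
    # closed the channel: redeclare queue + exchange on a fresh channel and
    # consume again, instead of giving up.
    channel = conn.channel()
    reply_queue.bind(channel).declare()
    consumer.revive(channel)
    consumer.consume()

while True:
    try:
        conn.drain_events(timeout=60)
    except socket.timeout:
        pass   # no replies within a minute; keep waiting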

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/oslo.messaging (openstack-ci/fuel-8.0/liberty)

Fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Kirill Bespalov <email address hidden>
Review: https://review.fuel-infra.org/28255

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/oslo.messaging (openstack-ci/fuel-8.0/liberty)

Reviewed: https://review.fuel-infra.org/28255
Submitter: Pkgs Jenkins <email address hidden>
Branch: openstack-ci/fuel-8.0/liberty

Commit: acfb7d187c984b83a3b66716f9390cf92ed54f05
Author: Kirill Bespalov <email address hidden>
Date: Mon Nov 14 11:49:58 2016

Fix consuming from missing queues

Redeclare queues on 'Basic.consume: (404) no queue' exception
and enable by default consumer declaration with nowait=False
in order to wait for a broker response.

Change-Id: I99f2bc858dbc7c18a2f328ee26f39105ed17cee3
Closes-Bug: #1579128
Closes-Bug: #1572085

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/oslo.messaging (openstack-ci/fuel-7.0/2015.1.0)

Fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Kirill Bespalov <email address hidden>
Review: https://review.fuel-infra.org/29003

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/oslo.messaging (openstack-ci/fuel-7.0/2015.1.0)

Reviewed: https://review.fuel-infra.org/29003
Submitter: Denis V. Meltsaykin <email address hidden>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: 250c29274fa05e4810ad470be2bb95f99e7e7524
Author: Kirill Bespalov <email address hidden>
Date: Mon May 22 15:32:27 2017

Fix consuming from missing queues

Redeclare queues on 'Basic.consume: (404) no queue' exception
and enable by default consumer declaration with nowait=False
in order to wait for a broker response.

Change-Id: I99f2bc858dbc7c18a2f328ee26f39105ed17cee3
Closes-Bug: #1579128
Closes-Bug: #1572085

Revision history for this message
TatyanaGladysheva (tgladysheva) wrote :

Verified on 7.0 + MU8 updates.

Revision history for this message
Ilya Bumarskov (ibumarskov) wrote :

Verified on 8.0 MU-5
