[RabbitMQ] nova-compute stuck for a while (AMQP)

Bug #1317488 reported by Bogdan Dobrelya
This bug affects 3 people
Affects            | Status        | Importance | Assigned to     | Milestone
Fuel for OpenStack | Invalid       | High       | Bogdan Dobrelya |
4.1.x              | Fix Committed | High       | Bogdan Dobrelya |

Bug Description

Symptoms:
1)
* From time to time, a random nova-compute service is marked as "XXX" for a while.
* The compute service itself works properly. Its logs record status update reports being sent to the conductor, but nothing is actually sent.
* "netstat" shows all connections to/from RabbitMQ as "ESTABLISHED".
* rabbitmqctl shows that the "compute.node-x" queue is synced to all slaves.
2)
* The computes' queues grow once some time has passed since the last restart of the compute service.

Axe style solution:
/etc/init.d/openstack-nova-compute restart

Summary:
1) Fuel should provide TCP KA (keepalives) for RabbitMQ sessions in HA mode.
These TCP KA should be visible at the app layer as well as at the network stack layer.
Related oslo.messaging issue: https://bugs.launchpad.net/oslo.messaging/+bug/856764
Related fuel-dev ML thread: https://lists.launchpad.net/fuel-dev/msg01024.html

2) Instances on compute nodes should be consistent with their state in the nova DB in order to prevent uncontrolled growth of the computes' queues. The reaping logic update made in Icehouse should be synced as well (running_deleted_instance_action = reap; it was log).
Related zendesk issues: #1663, #1743
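
For point 1 of the summary, TCP keepalives at the network stack layer can be sketched roughly as below. This is only an illustration under stated assumptions, not what Fuel or oslo.messaging actually ships: the helper name enable_tcp_keepalive and the idle/interval/count values are hypothetical, and the per-socket tuning options are platform specific, so each one is applied only when the running platform exposes it.

```python
import socket


def enable_tcp_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive probes for an existing socket.

    Hypothetical helper; the default timings are illustrative only.
    """
    # SO_KEEPALIVE switches keepalive probing on for this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket tuning knobs. Not every platform defines all of them,
    # so each option is set only if the socket module provides it.
    tuning = (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # failed probes before the peer is declared dead
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

With such options set, a peer that silently disappears is detected by the kernel after roughly idle + interval * count seconds, instead of the connection staying "ESTABLISHED" indefinitely as described in the symptoms.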

Perhaps this issue should be fixed in 5.0, but backporting should be considered critical for the 3.2.1, 4.1 and 4.1.1 releases (due to the increasing number of related tickets in zendesk).

Tags: ha
Changed in fuel:
assignee: nobody → Fuel Hardening Team (fuel-hardening)
Ryan Moe (rmoe) wrote :

The support for Rabbit heartbeat was reverted: https://review.openstack.org/#/c/36606/. With kombu you have to call heartbeat_check() once per second. Without a thread calling that function your connections will all die after heartbeat seconds.
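
A minimal sketch of the kind of thread described above, assuming a connection object that exposes heartbeat_check() the way a kombu Connection does. The class name HeartbeatThread and its parameters are hypothetical, and this is not the code from the reverted review. Whether a connection may safely be shared between this thread and the consumer is a separate question that would need checking.

```python
import threading


class HeartbeatThread(threading.Thread):
    """Call connection.heartbeat_check() at a fixed interval.

    Hypothetical helper: `connection` is any object with a
    heartbeat_check() method, such as a kombu Connection.
    """

    def __init__(self, connection, interval=1.0):
        super().__init__(daemon=True)
        self._connection = connection
        self._interval = interval
        self._stop_event = threading.Event()

    def run(self):
        # Event.wait() returns False on timeout and True once stop()
        # has been called, so the loop exits promptly on shutdown.
        while not self._stop_event.wait(self._interval):
            self._connection.heartbeat_check()

    def stop(self):
        self._stop_event.set()
```

Usage would be along the lines of creating the thread right after the connection is established, calling start(), and calling stop() before the connection is closed.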

The kombu reconnect changes here: https://review.openstack.org/#/c/76686/ along with the CCN changes are already in our packages. The config changes to rabbit here: https://bugs.launchpad.net/oslo.messaging/+bug/856764/comments/19 sound helpful though and are worth testing.

Bogdan Dobrelya (bogdando) wrote :

Thank you for clarifying this, Ryan. Any comment on the fate of https://review.openstack.org/77276?

Bogdan Dobrelya (bogdando) wrote :

I don't think this issue is a dup of #1289200. Andrew, why do you think it is?

Bogdan Dobrelya (bogdando) wrote :

This bug isn't a dup because there are additional causes for the computes flopping; see ticket #1743 in zendesk.

Changed in fuel:
milestone: 5.1 → 5.0
assignee: Fuel Hardening Team (fuel-hardening) → Bogdan Dobrelya (bogdando)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/93869

Changed in fuel:
status: Confirmed → In Progress
Bogdan Dobrelya (bogdando) wrote :
description: updated
Bogdan Dobrelya (bogdando) wrote :

The 'reap' patch should be backported to Havana; in Icehouse, reap is already the default.
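
To see which value a given node would end up with, the setting can be read from the nova configuration text. The sketch below assumes the option lives in the [DEFAULT] section of nova.conf and that an unset option falls back to the release default, which is passed in explicitly ('log' for Havana and 'reap' for Icehouse, per this report). The function name running_deleted_action is hypothetical.

```python
import configparser


def running_deleted_action(conf_text, release_default="reap"):
    """Return the effective running_deleted_instance_action value.

    Assumption: the option is set in [DEFAULT]; when it is absent,
    the caller-supplied release default applies.
    """
    # strict=False tolerates duplicated options; interpolation is
    # disabled so '%' characters elsewhere in the file are left alone.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return parser.get(
        "DEFAULT",
        "running_deleted_instance_action",
        fallback=release_default,
    )
```

For a Havana node with no explicit setting, calling it with release_default="log" would report the value that this bug proposes to change.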

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/4.1)

Fix proposed to branch: stable/4.1
Review: https://review.openstack.org/93927

Changed in fuel:
milestone: 5.0 → 4.1.1
Changed in fuel:
milestone: 4.1.1 → 5.0
status: In Progress → Invalid
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/4.1)

Reviewed: https://review.openstack.org/93927
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=f35bab1b7d2997f933824c451daf2a434d7e2445
Submitter: Jenkins
Branch: stable/4.1

commit f35bab1b7d2997f933824c451daf2a434d7e2445
Author: Bogdan Dobrelya <email address hidden>
Date: Tue May 13 13:30:03 2014 +0300

    Sync running_deleted_instance_action for nova

    * Sync running_deleted_instance_action=reap from nova upstream
      in order to make all instaces at compute nodes being consistent
      with the Nova DB state.
    * In Havana release the default value was 'log' and that was a cause
      of the flopping nova-computes and their queues uncontrolled grow
      due to the DB inconsistent state.
    See
    https://github.com/openstack/nova/blob/master/nova/compute/manager.py
    poke ci

    Closes-bug: #1317488

    Change-Id: I6de6898ab2e1c7e327eea0cd79c82f4fa6a94bb0
    Signed-off-by: Bogdan Dobrelya <email address hidden>
