[RabbitMQ] nova-compute stuck for a while (AMQP)

Bug #1317488 reported by Bogdan Dobrelya on 2014-05-08
This bug affects 3 people
Affects | Status | Importance | Assigned to | Milestone
Fuel for OpenStack | | High | Bogdan Dobrelya |
4.1.x | | High | Bogdan Dobrelya |

Bug Description

Symptoms:
1)
* A random nova-compute service is marked as "XXX" for a while from time to time.
* The compute service itself works properly. Its logs record that status update reports are being sent to the conductor, but nothing is actually sent.
* "netstat" shows that all connections to/from rabbit are "ESTABLISHED".
* rabbitmqctl shows that the "compute.node-x" queue is synced to all slaves.
2)
* The computes' queues grow once some time has passed since the last restart of the compute service.

Axe style solution:
/etc/init.d/openstack-nova-compute restart

Summary:
1) Fuel should provide TCP KA (keepalives) for RabbitMQ sessions in HA mode.
These TCP KA should be visible at the app layer as well as at the network stack layer.
related Oslo.messaging issue: https://bugs.launchpad.net/oslo.messaging/+bug/856764
related fuel-dev ML: https://lists.launchpad.net/fuel-dev/msg01024.html
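
For illustration, TCP keepalives at the socket level could be enabled roughly as follows. This is a minimal sketch, not what Fuel or oslo.messaging does: the helper name enable_tcp_keepalive and the timing values (60 s idle, 10 s interval, 5 probes) are assumptions, and the per-socket tuning options are not available on every platform.

```python
import socket


def enable_tcp_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive probes for an existing socket.

    The timing values are illustrative assumptions, not recommendations.
    """
    # Enable keepalive probing on the socket itself.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket tuning knobs; skip any that this platform does not expose.
    tuning = (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # failed probes before the connection drops
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_tcp_keepalive(s)
    print(bool(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))
    s.close()
```

With keepalives on, a peer that silently disappears is eventually detected by the network stack instead of the connection staying "ESTABLISHED" indefinitely.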

2) Instances on compute nodes should be consistent with their state in the Nova DB, to prevent uncontrolled growth of the computes' queues. The reaping logic was updated in Icehouse and should be synced as well (running_deleted_instance_action = reap; it was log).
related zendesk issues: #1663, #1743
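
A quick way to see which value a node would pick up is to read the option from the nova configuration text. The sketch below is only an illustration: the function name running_deleted_action is hypothetical, the option is assumed to live in the [DEFAULT] section, and the "log" fallback reflects the Havana default mentioned in this report.

```python
import configparser


def running_deleted_action(conf_text, fallback="log"):
    """Return running_deleted_instance_action from nova.conf-style text.

    Assumes the option sits in [DEFAULT]; 'log' is used as the fallback
    because this report states it was the Havana default.
    """
    # Disable interpolation so '%' characters in other options do not break parsing.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return parser.defaults().get("running_deleted_instance_action", fallback)


sample = """
[DEFAULT]
running_deleted_instance_action = reap
"""

print(running_deleted_action(sample))  # prints "reap"
```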

Perhaps this issue should be fixed in 5.0, but backporting should be considered critical for the 3.2.1, 4.1, and 4.1.1 releases (due to the increasing number of related tickets in zendesk).

Tags: ha
Changed in fuel:
assignee: nobody → Fuel Hardening Team (fuel-hardening)
Ryan Moe (rmoe) wrote :

The support for Rabbit heartbeat was reverted: https://review.openstack.org/#/c/36606/. With kombu you have to call heartbeat_check() once per second. Without a thread calling that function your connections will all die after heartbeat seconds.
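
The kind of background thread described above might look roughly like this. It is a minimal sketch, not the reverted patch: start_heartbeat_thread is a hypothetical helper, it accepts any object exposing heartbeat_check() (such as a kombu connection opened with a heartbeat), and it does not address whether sharing that connection with other threads is safe.

```python
import threading


def start_heartbeat_thread(connection, interval=1.0):
    """Call connection.heartbeat_check() every `interval` seconds.

    Returns (stop_event, thread); set the event to end the loop.
    """
    stop_event = threading.Event()

    def _loop():
        # Event.wait() returns False on timeout, so the body runs once per interval.
        while not stop_event.wait(interval):
            try:
                connection.heartbeat_check()
            except Exception:
                # Treat any failure as a dead connection and stop;
                # reconnect handling is left to the caller.
                break

    thread = threading.Thread(target=_loop, daemon=True)
    thread.start()
    return stop_event, thread
```

A caller would keep the returned event and set it before closing the connection, then join the thread.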

The kombu reconnect changes here: https://review.openstack.org/#/c/76686/ along with the CCN changes are already in our packages. The config changes to rabbit here: https://bugs.launchpad.net/oslo.messaging/+bug/856764/comments/19 sound helpful though and are worth testing.

Bogdan Dobrelya (bogdando) wrote :

Thank you for clarifying this, Ryan. Any comment on the fate of https://review.openstack.org/77276?

Bogdan Dobrelya (bogdando) wrote :

I don't think this issue is a dup of #1289200. Andrew, why do you think it is?

Bogdan Dobrelya (bogdando) wrote :

This bug isn't a dup, because there are additional causes of computes flopping; see ticket #1743 in zendesk.

Changed in fuel:
milestone: 5.1 → 5.0
assignee: Fuel Hardening Team (fuel-hardening) → Bogdan Dobrelya (bogdando)

Fix proposed to branch: master
Review: https://review.openstack.org/93869

Changed in fuel:
status: Confirmed → In Progress
Bogdan Dobrelya (bogdando) wrote :
description: updated
Bogdan Dobrelya (bogdando) wrote :

The 'reap' patch should be backported to Havana; in Icehouse, 'reap' is already the default.

Changed in fuel:
milestone: 5.0 → 4.1.1
Changed in fuel:
milestone: 4.1.1 → 5.0
status: In Progress → Invalid

Reviewed: https://review.openstack.org/93927
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=f35bab1b7d2997f933824c451daf2a434d7e2445
Submitter: Jenkins
Branch: stable/4.1

commit f35bab1b7d2997f933824c451daf2a434d7e2445
Author: Bogdan Dobrelya <email address hidden>
Date: Tue May 13 13:30:03 2014 +0300

    Sync running_deleted_instance_action for nova

    * Sync running_deleted_instance_action=reap from nova upstream
      in order to make all instaces at compute nodes being consistent
      with the Nova DB state.
    * In Havana release the default value was 'log' and that was a cause
      of the flopping nova-computes and their queues uncontrolled grow
      due to the DB inconsistent state.
    See
    https://github.com/openstack/nova/blob/master/nova/compute/manager.py
    poke ci

    Closes-bug: #1317488

    Change-Id: I6de6898ab2e1c7e327eea0cd79c82f4fa6a94bb0
    Signed-off-by: Bogdan Dobrelya <email address hidden>
