nova-compute permanently goes down for no particular reason

Bug #1682841 reported by Dmitry Mescheryakov
This bug affects 1 person
Affects: Mirantis OpenStack
Status: In Progress
Importance: High
Assigned to: Dmitry Mescheryakov
Milestone: 6.1-updates

Bug Description

Version: 6.1

Steps to reproduce:
1. Install a MOS environment and leave it be.

At some point, some nova-compute services may start showing as down and never come back up.

The symptom is similar to https://bugs.launchpad.net/mos/+bug/1454174, but the underlying cause differs. The GMR report shows that all the needed threads are alive, including the one that reports state to the conductor, yet every thread is stuck trying to get a RabbitMQ connection from the pool to send a message. Another hint is that the logs stop at some point, and the last message is an error stating that the connection was closed due to 'Too many heartbeats missed'.

The GMR report is attached.
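
Purely as an illustration of what the GMR showed (a toy script, not nova code or the actual report), the pattern looks like this: all worker threads are alive but parked inside a blocking get() on an exhausted connection pool, which is exactly what a thread dump exposes.

```python
# Toy reproduction of the GMR observation: threads are alive but all of them
# block forever waiting for a connection the pool will never hand out.
import queue
import sys
import threading
import time
import traceback

pool = queue.Queue()  # empty pool: every "connection" has already been lost


def worker(name):
    # Mirrors a thread trying to send an RPC message: it blocks here forever
    # because the pool believes all connections are still checked out.
    conn = pool.get()
    print(name, "got", conn)


threads = [threading.Thread(target=worker, args=("rpc-%d" % i,), daemon=True)
           for i in range(3)]
for t in threads:
    t.start()
time.sleep(1)

# Rough equivalent of a Guru Meditation Report thread dump.
for tid, frame in sys._current_frames().items():
    print("--- thread", tid, "---")
    traceback.print_stack(frame)
```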

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :
Changed in mos:
milestone: none → 6.1-updates
assignee: nobody → Dmitry Mescheryakov (dmitrymex)
importance: Undecided → High
status: New → Confirmed
tags: added: area-oslo
tags: added: customer-found
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/oslo.messaging (openstack-ci/fuel-6.1/2014.2)

Fix proposed to branch: openstack-ci/fuel-6.1/2014.2
Change author: Dmitry Mescheryakov <email address hidden>
Review: https://review.fuel-infra.org/33131

Changed in mos:
status: Confirmed → In Progress
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote : Re: nova-compute permanently goes down after with no particular reason

The issue can be reproduced with the following patch to py-amqp: http://paste.openstack.org/show/606989/

The patch emulates a situation where RabbitMQ forgets to send heartbeats, so one minute later oslo.messaging drops the connection with the message 'Too many heartbeats missed'. It turns out that our current implementation does not reduce the number of RabbitMQ connections accounted for by the pool when such a connection is dropped. As a result, after about 1.5 hours all 30 connections are 'gone' and the pool cannot allocate more, even though in reality zero connections exist, so nova-compute cannot send any message. Note that listening connections live outside of the pool; it holds only connections used for sending.
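
A minimal sketch of that leak, with assumed names rather than the actual oslo.messaging pool code: the pool only counts how many connections it has created, and a connection dropped after 'Too many heartbeats missed' is discarded by the caller without the count ever being decremented.

```python
# Sketch of the bookkeeping bug described above (hypothetical class, not the
# real oslo.messaging ConnectionPool).
import queue


class LeakyConnectionPool:
    def __init__(self, max_size=30):
        self.max_size = max_size
        self.created = 0               # never decremented -> the bug
        self._items = queue.Queue()

    def get(self, timeout=None):
        try:
            return self._items.get_nowait()
        except queue.Empty:
            if self.created < self.max_size:
                self.created += 1
                return object()        # stands in for a RabbitMQ connection
            # The pool thinks max_size connections are in use, so it blocks,
            # even though every one of them has already been dropped.
            return self._items.get(timeout=timeout)

    def put(self, conn):
        self._items.put(conn)


pool = LeakyConnectionPool(max_size=30)
for _ in range(30):
    conn = pool.get()
    # Heartbeat failure: the broken connection is discarded instead of being
    # returned, so the pool's count stays at its maximum forever.
    del conn

# The next get() would block indefinitely: 0 real connections exist, but the
# pool believes all 30 are still allocated.
print("pool.created =", pool.created)  # -> 30
```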

Obviously, in real life the issue reproduces much more slowly and rarely.

summary: - nova-compute permanently goes down after with no particular reason
+ nova-compute permanently goes down for no particular reason