nova-compute permanently goes down for no particular reason

Bug #1682841 reported by Dmitry Mescheryakov
This bug affects 1 person
Affects: Mirantis OpenStack
Status: In Progress
Importance: High
Assigned to: Dmitry Mescheryakov
Milestone: 6.1-updates

Bug Description

Version: 6.1

Steps to reproduce:
1. Install a MOS environment and leave it be.

At some point, some nova-compute services may start showing as down and never come back up.

The symptom is similar to https://bugs.launchpad.net/mos/+bug/1454174, but the underlying cause differs. The GMR report shows that all the needed threads are alive, including the one that reports state to the conductor, yet every thread is stuck trying to get a RabbitMQ connection from the pool to send a message. Another hint is that the logs stop at some point, and the last message is an error stating that the connection was closed due to 'Too many heartbeats missed'.

The GMR report is attached.
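
Purely as an illustration of what the GMR showed (a toy script, not nova code or the actual report), the pattern looks like this: all worker threads are alive but parked inside a blocking get() on an exhausted connection pool, which is exactly what a thread dump exposes.

```python
# Toy reproduction of the GMR observation: threads are alive but all of them
# block forever waiting for a connection the pool will never hand out.
import queue
import sys
import threading
import time
import traceback

pool = queue.Queue()  # empty pool: every "connection" has already been lost


def worker(name):
    # Mirrors a thread trying to send an RPC message: it blocks here forever
    # because the pool believes all connections are still checked out.
    conn = pool.get()
    print(name, "got", conn)


threads = [threading.Thread(target=worker, args=("rpc-%d" % i,), daemon=True)
           for i in range(3)]
for t in threads:
    t.start()
time.sleep(1)

# Rough equivalent of a Guru Meditation Report thread dump.
for tid, frame in sys._current_frames().items():
    print("--- thread", tid, "---")
    traceback.print_stack(frame)
```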

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :
Changed in mos:
milestone: none → 6.1-updates
assignee: nobody → Dmitry Mescheryakov (dmitrymex)
importance: Undecided → High
status: New → Confirmed
tags: added: area-oslo
tags: added: customer-found
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/oslo.messaging (openstack-ci/fuel-6.1/2014.2)

Fix proposed to branch: openstack-ci/fuel-6.1/2014.2
Change author: Dmitry Mescheryakov <email address hidden>
Review: https://review.fuel-infra.org/33131

Changed in mos:
status: Confirmed → In Progress
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote : Re: nova-compute permanently goes down after with no particular reason

The issue can be reproduced with the following patch to py-amqp: http://paste.openstack.org/show/606989/

The patch emulates a situation where RabbitMQ forgets to send heartbeats, so one minute later oslo.messaging drops the connection with the message 'Too many heartbeats missed'. It turns out that our current implementation does not reduce the number of RabbitMQ connections accounted for by the pool when such a connection is dropped. As a result, after about 1.5 hours all 30 connections are 'gone' and the pool cannot allocate more, even though in reality zero connections exist, so nova-compute cannot send any message. Note that listening connections live outside of the pool; it holds only connections used for sending.
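
A minimal sketch of that leak, with assumed names rather than the actual oslo.messaging pool code: the pool only counts how many connections it has created, and a connection dropped after 'Too many heartbeats missed' is discarded by the caller without the count ever being decremented.

```python
# Sketch of the bookkeeping bug described above (hypothetical class, not the
# real oslo.messaging ConnectionPool).
import queue


class LeakyConnectionPool:
    def __init__(self, max_size=30):
        self.max_size = max_size
        self.created = 0               # never decremented -> the bug
        self._items = queue.Queue()

    def get(self, timeout=None):
        try:
            return self._items.get_nowait()
        except queue.Empty:
            if self.created < self.max_size:
                self.created += 1
                return object()        # stands in for a RabbitMQ connection
            # The pool thinks max_size connections are in use, so it blocks,
            # even though every one of them has already been dropped.
            return self._items.get(timeout=timeout)

    def put(self, conn):
        self._items.put(conn)


pool = LeakyConnectionPool(max_size=30)
for _ in range(30):
    conn = pool.get()
    # Heartbeat failure: the broken connection is discarded instead of being
    # returned, so the pool's count stays at its maximum forever.
    del conn

# The next get() would block indefinitely: 0 real connections exist, but the
# pool believes all 30 are still allocated.
print("pool.created =", pool.created)  # -> 30
```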

Obviously, in real life the issue reproduces much more slowly and rarely.

summary: - nova-compute permanently goes down after with no particular reason
+ nova-compute permanently goes down for no particular reason