Corosync on controller hangs at reboot or takes a long time to shut down

Bug #1407678 reported by Bogdan Dobrelya on 2015-01-05
This bug affects 2 people
Affects: Fuel for OpenStack
Importance: High
Assigned to: Bogdan Dobrelya

Bug Description

Steps to reproduce:
1. Find the management_vip node:
# ssh node-1 'crm status'
2. Reboot that node together with any other node at the same time:
# for i in node-{1,2}; do ssh $i 'nohup reboot &'; done
3. In ~10% of cases one node hangs during the shutdown process.
4. In ~30% of cases one node's shutdown process takes a long time.

It looks like the RabbitMQ RA causes the cluster to hang during reboot.
Updating the shutdown-escalation property did not solve the issue.

Bogdan Dobrelya (bogdando) wrote :

I suggest setting shutdown-escalation to at least 5 minutes; the current 120 seconds looks too low compared to the 20-minute default.
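As a minimal sketch, such a change could be applied with crmsh, assuming shutdown-escalation sits among the usual Pacemaker cluster options (check the current value first; the 5-minute target is the value suggested above, not a verified fix):

```shell
# Show the current cluster options, shutdown-escalation among them
crm configure show cib-bootstrap-options

# Raise shutdown-escalation from the 120s value to 5 minutes (suggested value)
crm configure property shutdown-escalation=5min
```

The property bounds how long Pacemaker waits for graceful resource shutdown before escalating, which is why a too-low value was suspected here.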

Changed in fuel:
status: New → Triaged
milestone: none → 6.1
assignee: nobody → Bogdan Dobrelya (bogdando)
importance: Undecided → Medium

Fix proposed to branch: master
Review: https://review.openstack.org/144985

Changed in fuel:
status: Triaged → In Progress
Bartosz Kupidura (zynzel) wrote :

This fix is not working.

What we know:
* RabbitMQ / the RabbitMQ RA causes the cluster to hang during reboot
* the shutdown-escalation property didn't help

description: updated
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Fuel Library Team (fuel-library)
status: In Progress → Confirmed

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/144985

Sergey Yudin (tsipa740) wrote :

I've done some investigation on this, and it seems to be entirely an internal corosync issue.

With debug enabled, it looks like corosync tries to notify its neighbours but for some reason fails.

There seems to be an internal loop that begins like screen 1 and, after about 20 minutes, ends like screen 2 (unfortunately I wasn't able to capture the exact moment when the loop ends).

I've noticed that corosync loses the token at the moment the network is shut down and the node becomes unpingable. I've tried to modify the initscript with something like
# Required-Stop: $remote_fs $network $syslog $named $local_fs openvswitch-switch ssh
but even with this option networking was shut down before corosync died.
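For context, that Required-Stop line belongs in the initscript's LSB header block. A sketch of how the edited header might look (the surrounding lines are assumed from a typical Debian-style header, not taken from the actual Fuel initscript; the added openvswitch-switch and ssh entries are the attempted change):

```shell
### BEGIN INIT INFO
# Provides:          corosync
# Required-Start:    $remote_fs $network $syslog $named $local_fs
# Required-Stop:     $remote_fs $network $syslog $named $local_fs openvswitch-switch ssh
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: corosync cluster engine
### END INIT INFO
```

The intent is that insserv/update-rc.d order the shutdown so these facilities stay up until corosync has stopped; as noted above, in practice networking still went down first.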

According to
start-stop-daemon --stop --quiet --retry forever/QUIT/1 --pidfile $PIDFILE
corosync must be shut down synchronously, so networking must not be disabled earlier than corosync dies, but for some reason it does not work that way.
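To spell out what that invocation means: the `--retry forever/QUIT/1` schedule tells start-stop-daemon to send SIGQUIT and then re-check the process every second with no time limit, returning only once the process is gone. A sketch of the stop stanza in a typical Debian initscript layout ($PIDFILE as in the script quoted above):

```shell
# Hypothetical stop stanza; do_stop is the conventional Debian helper name
do_stop() {
    # Send SIGQUIT, then poll the pidfile's process every 1s, forever:
    # this call blocks until corosync has actually exited.
    start-stop-daemon --stop --quiet --retry forever/QUIT/1 --pidfile "$PIDFILE"
}
```

Because the call blocks, the init system should not proceed to stopping $network until corosync is dead, which is why the observed early network teardown is surprising.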

Actually, I'm not quite sure the initscript ordering has any relation to the issue, because I've seen a few cases where the token was lost while the node remained pingable for almost 3-5 seconds afterwards.

Sergey Yudin (tsipa740) wrote :

recovery screen:

Vladimir Kuklin (vkuklin) wrote :

We need to recheck if this is an issue with corosync 2.x

Changed in fuel:
importance: Medium → High
Bogdan Dobrelya (bogdando) wrote :

Note that 6.1 has a user maintenance mode feature which relies completely on graceful node reboot, and there were no issues with corosync/pacemaker shutdown while testing it. So hopefully the Corosync 2.3.3 we ship in 6.1 has this issue resolved.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
Bogdan Dobrelya (bogdando) wrote :

This issue was resolved with the Corosync-2 blueprint https://blueprints.launchpad.net/fuel/+spec/corosync-2

Changed in fuel:
status: Confirmed → Fix Committed