Corosync on controller hangs at reboot or takes a long time to shut down

Bug #1407678 reported by Bogdan Dobrelya on 2015-01-05
This bug affects 2 people
Affects: Fuel for OpenStack
Importance: High
Assigned to: Bogdan Dobrelya

Bug Description

Steps to reproduce:
1. Find the management_vip node:
# ssh node-1 'crm status'
2. Reboot that node together with any other node at the same time:
# for i in node-{1,2}; do ssh $i 'nohup reboot &'; done
3. In ~10% of cases one node hangs during the shutdown process.
4. In ~30% of cases one node's shutdown process takes a long time.

It looks like the RabbitMQ RA causes the cluster to hang during reboot.
Updating the shutdown-escalation property did not solve the issue.

Bogdan Dobrelya (bogdando) wrote :

I suggest setting shutdown-escalation to at least 5 minutes; the current 120 seconds looks too low compared to the 20-minute default.
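As a minimal sketch, such a change could be applied with crmsh, assuming shutdown-escalation sits among the usual Pacemaker cluster options (check the current value first; the 5-minute target is the value suggested above, not a verified fix):

```shell
# Show the current cluster options, shutdown-escalation among them
crm configure show cib-bootstrap-options

# Raise shutdown-escalation from the 120s value to 5 minutes (suggested value)
crm configure property shutdown-escalation=5min
```

The property bounds how long Pacemaker waits for graceful resource shutdown before escalating, which is why a too-low value was suspected here.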

Changed in fuel:
status: New → Triaged
milestone: none → 6.1
assignee: nobody → Bogdan Dobrelya (bogdando)
importance: Undecided → Medium

Fix proposed to branch: master
Review: https://review.openstack.org/144985

Changed in fuel:
status: Triaged → In Progress
Bartosz Kupidura (zynzel) wrote :

This fix is not working.

What we know:
* RabbitMQ / the RabbitMQ RA causes the cluster to hang during reboot
* the shutdown-escalation property didn't help

description: updated
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Fuel Library Team (fuel-library)
status: In Progress → Confirmed

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/144985

Sergey Yudin (tsipa740) wrote :

I've done some investigation on this, and it seems to be entirely an internal corosync issue.

With debug enabled, it looks like corosync tries to notify its neighbours but for some reason fails.

There seems to be an internal loop that begins like screen 1 and, after about 20 minutes, ends like screen 2 (unfortunately I wasn't able to capture the exact moment when the loop ends).

I've noticed that corosync loses the token at the moment the network is shut down and the node becomes unpingable. I've tried to modify the initscript with something like
# Required-Stop: $remote_fs $network $syslog $named $local_fs openvswitch-switch ssh
but even with this option networking was shut down before corosync died.
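For context, that Required-Stop line belongs in the initscript's LSB header block. A sketch of how the edited header might look (the surrounding lines are assumed from a typical Debian-style header, not taken from the actual Fuel initscript; the added openvswitch-switch and ssh entries are the attempted change):

```shell
### BEGIN INIT INFO
# Provides:          corosync
# Required-Start:    $remote_fs $network $syslog $named $local_fs
# Required-Stop:     $remote_fs $network $syslog $named $local_fs openvswitch-switch ssh
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: corosync cluster engine
### END INIT INFO
```

The intent is that insserv/update-rc.d order the shutdown so these facilities stay up until corosync has stopped; as noted above, in practice networking still went down first.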

According to
start-stop-daemon --stop --quiet --retry forever/QUIT/1 --pidfile $PIDFILE
corosync must be shut down synchronously, so networking must not be disabled earlier than corosync dies, but for some reason it does not work that way.
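To spell out what that invocation means: the `--retry forever/QUIT/1` schedule tells start-stop-daemon to send SIGQUIT and then re-check the process every second with no time limit, returning only once the process is gone. A sketch of the stop stanza in a typical Debian initscript layout ($PIDFILE as in the script quoted above):

```shell
# Hypothetical stop stanza; do_stop is the conventional Debian helper name
do_stop() {
    # Send SIGQUIT, then poll the pidfile's process every 1s, forever:
    # this call blocks until corosync has actually exited.
    start-stop-daemon --stop --quiet --retry forever/QUIT/1 --pidfile "$PIDFILE"
}
```

Because the call blocks, the init system should not proceed to stopping $network until corosync is dead, which is why the observed early network teardown is surprising.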

Actually, I'm not quite sure the initscript ordering has any relation to the issue, because I've seen a few cases where the token was lost while the node remained pingable for almost 3-5 seconds afterwards.

Sergey Yudin (tsipa740) wrote :

recovery screen:

Vladimir Kuklin (vkuklin) wrote :

We need to recheck if this is an issue with corosync 2.x

Changed in fuel:
importance: Medium → High
Bogdan Dobrelya (bogdando) wrote :

Note that 6.1 has a user maintenance mode feature which relies completely on graceful node reboot, and there were no issues with corosync/pacemaker shutdown while testing it. So hopefully the Corosync 2.3.3 we ship in 6.1 has this issue resolved.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
Bogdan Dobrelya (bogdando) wrote :

This issue was resolved with the Corosync-2 blueprint https://blueprints.launchpad.net/fuel/+spec/corosync-2

Changed in fuel:
status: Confirmed → Fix Committed