Corosync on a controller hangs at reboot or takes a long time to shut down

Bug #1407678 reported by Bogdan Dobrelya
This bug affects 2 people
Affects: Fuel for OpenStack
Status: Fix Committed
Importance: High
Assigned to: Bogdan Dobrelya
Milestone: 6.1

Bug Description

Steps to reproduce:
1. Find the node holding the management VIP (see the grep example below):
# ssh node-1 'crm status'
2. Reboot that node and any other node at the same time:
# for i in node-{1,2}; do ssh $i 'nohup reboot &'; done
3. In ~10% of cases one node hangs during the shutdown process.
4. In ~30% of cases one node's shutdown takes a very long time.
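
For step 1, one way to spot which node is running the management VIP in the crm status output (a sketch; the exact VIP resource name depends on the deployment, so the grep pattern is kept generic):
# ssh node-1 'crm status' | grep -i vip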

It looks like the RabbitMQ RA causes the cluster to hang during reboot.
Updating the shutdown-escalation property does not solve the issue.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I suggest setting shutdown-escalation to at least 5 minutes, as 120 seconds looks too low compared to the default of 20 minutes.
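
A minimal way to apply that suggestion with the crm shell (a sketch; the value follows the 5-minute recommendation above, and Pacemaker accepts interval suffixes such as "min"):
# crm configure property shutdown-escalation=5min
# crm configure show | grep shutdown-escalation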

Changed in fuel:
status: New → Triaged
milestone: none → 6.1
assignee: nobody → Bogdan Dobrelya (bogdando)
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/144985

Changed in fuel:
status: Triaged → In Progress
Revision history for this message
Bartosz Kupidura (zynzel) wrote :

This fix is not working.

What we know:
* RabbitMQ / the RabbitMQ RA causes the cluster to hang during reboot
* the shutdown-escalation property did not help

description: updated
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Fuel Library Team (fuel-library)
status: In Progress → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/144985

Revision history for this message
Sergey Yudin (tsipa740) wrote :

I've done some investigation on this, and it seems to be an entirely internal corosync issue.

With debug enabled, it looks like corosync tries to notify its neighbours but for some reason fails.

There seems to be some internal loop which begins as in screen 1 and ends, after about 20 minutes, as in screen 2 (unfortunately I wasn't able to capture the exact moment when the loop ends).

I've noticed that corosync lost the token at the time the network was shut down and the node became unpingable. I tried to modify the init script with something like
# Required-Stop: $remote_fs $network $syslog $named $local_fs openvswitch-switch ssh
but even with this option networking was shut down before corosync died.
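
For reference, the full LSB header after that change would look roughly like this in /etc/init.d/corosync (a sketch; everything except the Required-Stop line is assumed to match the stock header):
### BEGIN INIT INFO
# Provides:          corosync
# Required-Start:    $network $remote_fs $syslog
# Required-Stop:     $remote_fs $network $syslog $named $local_fs openvswitch-switch ssh
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Corosync cluster engine
### END INIT INFO
Note that the LSB header only influences ordering when dependency-based boot ordering (insserv) is in use; with static rc symlinks it is ignored.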

According to the
start-stop-daemon --stop --quiet --retry forever/QUIT/1 --pidfile $PIDFILE
line, corosync must be shut down synchronously, so networking should not be taken down before corosync dies, but for some reason it does not work that way.
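
One way to check the actual stop ordering on a SysV-style layout is to compare the kill-script symlinks for the halt/reboot runlevels (a sketch; exact script names may differ on this image):
# ls -1 /etc/rc0.d/ /etc/rc6.d/ | grep -Ei 'corosync|pacemaker|networking'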

Actually, I'm not quite sure whether init script ordering has any relation to the issue, because I've seen a few cases where the token was lost while the node was still pingable for another 3-5 seconds.

Revision history for this message
Sergey Yudin (tsipa740) wrote :

recovery screen:

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

We need to recheck if this is an issue with corosync 2.x

Changed in fuel:
importance: Medium → High
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note: in 6.1 there is a user maintenance mode feature which relies entirely on graceful node reboot, and there were no issues with corosync/pacemaker shutdown while testing it. So hopefully the Corosync 2.3.3 we ship in 6.1 has this issue resolved.
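
One way to confirm the corosync version actually installed on a 6.1 node:
# corosync -v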

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This issue was resolved with the Corosync-2 blueprint https://blueprints.launchpad.net/fuel/+spec/corosync-2

Changed in fuel:
status: Confirmed → Fix Committed