Fuel for OpenStack

Bug #1407678
Comment #5

Sergey Yudin (tsipa740) wrote on 2015-04-01:

coros1.png (30.4 KiB, image/png)

I've done some investigation on this, and it seems to be an entirely internal corosync issue.

With debug enabled, it looks like corosync tries to notify its neighbours but fails for some reason.

There seems to be some internal loop which begins as in screenshot 1 and, after about 20 minutes, ends as in screenshot 2 (unfortunately I wasn't able to capture the exact moment when the loop ends).

I noticed that corosync lost the token at the moment the network was shut down and the node became unpingable, so I tried modifying the init script with something like
# Required-Stop: $remote_fs $network $syslog $named $local_fs openvswitch-switch ssh
but even with these options networking was shut down before corosync died.

According to the stop command
start-stop-daemon --stop --quiet --retry forever/QUIT/1 --pidfile $PIDFILE
corosync must be shut down synchronously, so networking should not be disabled before corosync dies, but for some reason it does not work that way.
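
To make the "synchronous" point concrete, here is a rough shell sketch of what a `forever/QUIT/1` retry schedule amounts to: send the signal, wait one second, and repeat until the process is gone. The function name `stop_and_wait` and its arguments are hypothetical and only for illustration; this is an approximation of the behaviour, not how start-stop-daemon is actually implemented.

```shell
#!/bin/sh
# Hypothetical helper, not part of the corosync init script.
# Roughly mimics `--retry forever/QUIT/1`: resend the signal once per
# second until the process is gone, and only then return to the caller.
stop_and_wait() {
    pid=$1
    sig=${2:-QUIT}   # default to QUIT, as in the schedule above
    # kill -0 sends no signal; it only tests whether the PID still exists.
    while kill -0 "$pid" 2>/dev/null; do
        kill -s "$sig" "$pid" 2>/dev/null
        sleep 1
    done
}
```

Called as `stop_and_wait "$(cat "$PIDFILE")"`, it would return only after the process has exited, which is why the stop action itself should block until corosync is dead.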

Actually, I'm not quite sure whether init script ordering is related to the issue at all, because a few times I've seen the token lost while the node was still pingable for another 3-5 seconds afterwards.