Comment 5 for bug 1407678

Sergey Yudin (tsipa740) wrote :

I've done some investigation on this, and it seems to be an entirely internal corosync issue.

With debug logging enabled, it looks like corosync tries to notify its neighbours but fails for some reason.

There seems to be some internal loop which begins as in screen 1 and ends about 20 minutes later as in screen 2 (unfortunately I wasn't able to capture the exact moment when the loop ends).

I noticed that corosync lost the token at the point when the network was shut down and the node became unpingable, so I tried modifying the initscript with something like
# Required-Stop: $remote_fs $network $syslog $named $local_fs openvswitch-switch ssh
but even with these options, networking was shut down before corosync died.

According to the stop command
start-stop-daemon --stop --quiet --retry forever/QUIT/1 --pidfile $PIDFILE
corosync should be shut down synchronously, so networking should not be disabled before corosync has died, but for some reason it doesn't work that way.
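
For reference, as I read the start-stop-daemon schedule syntax, forever/QUIT/1 means: send QUIT, wait up to one second for the process to exit, and repeat until it is gone. A rough shell sketch of that behaviour is below. Note that stop_sync is a made-up helper name, not something from the corosync initscript, and the optional signal argument is my own addition for illustration.

```shell
#!/bin/sh
# Hypothetical helper, NOT part of the corosync initscript.
# Usage: stop_sync PIDFILE [SIGNAL]
# Roughly mimics "start-stop-daemon --stop --retry forever/QUIT/1":
# signal the process, wait one second, repeat until it no longer exists.
stop_sync() {
    pidfile=$1
    signal=${2:-QUIT}

    # No readable pidfile: treat the daemon as already stopped.
    [ -r "$pidfile" ] || return 0
    pid=$(cat "$pidfile")

    # "kill -0" sends no signal; it only checks that the process exists.
    while kill -0 "$pid" 2>/dev/null; do
        kill -s "$signal" "$pid" 2>/dev/null
        sleep 1
    done
    return 0
}
```

The point is that the call only returns once the process has really exited, so whatever runs after the stop action should not be able to start before corosync is dead.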

Actually, I'm not sure that initscript ordering has anything to do with the issue, because a few times I've seen the token lost while the node was still pingable for roughly 3-5 seconds afterwards.