Comment 3 for bug 1614914

Alexei Sheplyakov (asheplyakov) wrote :

> restarting ceph services on all nodes simultaneously is not a good idea

And the test does exactly that:

ceph-mon.node-1.log (@node-1):

2016-08-19 09:47:20.571625 7f57e392d700 -1 mon.node-1@1(peon) e3 *** Got Signal Terminated ***

ceph-mon.node-4.log (@node-4):

2016-08-19 09:47:08.387261 7f6c36390700 -1 mon.node-4@0(leader) e3 *** Got Signal Terminated ***

ceph-mon.node-5.log (@node-5):

2016-08-19 09:47:12.558032 7f68292a0700 -1 mon.node-5@2(peon) e3 *** Got Signal Terminated ***

ceph-osd.3.log (@node-2):

2016-08-19 09:47:09.011077 7f44e91e5700 -1 osd.3 51 *** Got signal Terminated ***

ceph-osd.0.log (@node-2):

2016-08-19 09:47:09.010648 7f348e780700 -1 osd.0 52 *** Got signal Terminated ***

ceph-osd.1.log (@node-3):

2016-08-19 09:47:13.110102 7f068618b700 -1 osd.1 54 *** Got signal Terminated ***

ceph-osd.2.log (@node-3):

2016-08-19 09:47:13.110372 7fdcbe48d700 -1 osd.2 54 *** Got signal Terminated ***

If one wants the (ceph) cluster to be available at all times, restarting ceph services should be
properly orchestrated (see the sketch below):

1) restart monitors one by one, giving each monitor enough time to join the quorum
2) restart OSDs one by one, giving each OSD enough time to perform peering (and possibly recovery).
   Also, it's wise to set the noout, noscrub, and nodeep-scrub OSD flags before restarting OSDs (and clear
   them after all OSDs have been restarted) to prevent unnecessary data migration between OSDs

Otherwise the cluster will be unavailable for some time (eventually it will recover).
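
For illustration, here is a rough sketch of such an orchestration (not taken from the test or the
product code). It assumes passwordless ssh to the storage nodes, systemd-managed ceph-mon@<host> /
ceph-osd@<id> units (older releases use upstart jobs instead), and the monitor/OSD placement shown
in the log excerpts above; the quorum and PG checks are deliberately simple.

#!/usr/bin/env python3
# Rough sketch of an orchestrated ceph restart. Assumptions: passwordless ssh
# to the storage nodes, systemd-managed ceph-mon@<host> / ceph-osd@<id> units,
# and the monitor/OSD layout taken from the log excerpts above.
import json
import subprocess
import time

MON_NODES = ["node-4", "node-1", "node-5"]                        # monitor hosts (from the logs)
OSD_HOSTS = {0: "node-2", 3: "node-2", 1: "node-3", 2: "node-3"}  # osd id -> host (from the logs)


def ceph(*args):
    """Run a ceph CLI command and return its stdout."""
    return subprocess.check_output(("ceph",) + args, universal_newlines=True)


def restart_remote(node, unit):
    """Restart a single ceph unit on a remote node (assumes ssh + systemd)."""
    subprocess.check_call(["ssh", node, "systemctl", "restart", unit])


def wait_for_quorum(expected, timeout=300):
    """Block until all monitors have (re)joined the quorum."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = json.loads(ceph("quorum_status", "--format", "json"))
        if len(status.get("quorum", [])) == expected:
            return
        time.sleep(5)
    raise RuntimeError("monitors did not regain quorum in %ds" % timeout)


def wait_for_clean_pgs(timeout=600):
    """Crude check: wait until 'ceph pg stat' reports no PGs that are still
    peering, degraded, recovering or down."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        stat = ceph("pg", "stat")
        if not any(s in stat for s in ("peering", "degraded", "recovering", "down")):
            return
        time.sleep(10)
    raise RuntimeError("PGs did not become active+clean in %ds" % timeout)


def main():
    # 1) restart monitors one by one, waiting for quorum after each restart
    for node in MON_NODES:
        restart_remote(node, "ceph-mon@%s" % node)
        wait_for_quorum(len(MON_NODES))

    # 2) set the flags, restart OSDs one by one, then clear the flags
    for flag in ("noout", "noscrub", "nodeep-scrub"):
        ceph("osd", "set", flag)
    try:
        for osd_id, node in sorted(OSD_HOSTS.items()):
            restart_remote(node, "ceph-osd@%d" % osd_id)
            time.sleep(10)  # give the cluster a moment to notice the OSD went down
            wait_for_clean_pgs()
    finally:
        for flag in ("noout", "noscrub", "nodeep-scrub"):
            ceph("osd", "unset", flag)


if __name__ == "__main__":
    main()

The important part is the ordering: each monitor gets a chance to rejoin the quorum, and each OSD
gets a chance to finish peering, before the next daemon is restarted.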
