Comment 6 for bug 1424060

Mykola Golub (mgolub) wrote :

Just checking whether an osd daemon is still running is not safe. The daemon could be down while the node still holds valid data that is available only on this node, so removing it would lead to data loss.

I think we should follow the documentation:

http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual

I.e. if the node is going to be removed from the cluster, the first command to run is:

 ceph osd out {osd-num}

which starts migrating placement groups out of the OSD. We should not proceed with the removal until the migration completes. If the osd daemon is down but the data is available from another replica, that replica will be used for rebalancing. If no other replica is available, the migration gets stuck, and in this case manual intervention is necessary.
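The "mark out, then wait" step could be sketched roughly as below. This is only an illustration, not existing code: the function name, the `is_clean` callback, and the timeout values are all hypothetical, and the only real command used is `ceph osd out {osd-num}`. Note that a real check may need a short delay after marking the OSD out, since the PGs might still be reported clean for a moment before remapping starts.

```python
import subprocess
import time


def mark_out_and_wait(osd_num, is_clean, run=subprocess.check_call,
                      timeout=3600, interval=10,
                      sleep=time.sleep, clock=time.monotonic):
    """Mark an OSD out, then block until is_clean() returns True.

    is_clean is a caller-supplied predicate (hypothetical) that reports
    whether the migration has finished. Raises RuntimeError if it has not
    finished within `timeout` seconds, so the caller can abort the removal
    instead of risking data loss.
    """
    # The documented first step of manual OSD removal.
    run(["ceph", "osd", "out", str(osd_num)])

    deadline = clock() + timeout
    while not is_clean():
        if clock() >= deadline:
            raise RuntimeError(
                "osd.%s: migration did not finish, "
                "manual intervention is needed" % osd_num)
        sleep(interval)
```

The timeout is what covers the "stuck" case: when no other replica exists, the loop never sees a clean cluster and the removal is aborted rather than forced.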

There are different ways to check whether the migration is complete; the simplest seems to be to run

  ceph pg stat

and check that all pgs are active+clean. This might give a false positive when the pgs are not clean because of a problem with some other osd, but it might still be safer to just abort in that case?
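A minimal sketch of such a check, assuming the plain-text summary has the form "N pgs: X active+clean, Y <other state>; ..." (the exact output differs between Ceph versions, so the pattern here is an assumption, and the function name is hypothetical). It takes the text of `ceph pg stat` as a string, so it can be tried without a cluster:

```python
import re


def all_pgs_active_clean(pg_stat_text):
    """Return True only if every PG is reported exactly as active+clean."""
    # Assumed summary format: "<total> pgs: <n> <state>, <n> <state>; ..."
    match = re.search(r"(\d+) pgs: ([^;]*)", pg_stat_text)
    if not match:
        raise ValueError("unrecognized `ceph pg stat` output")

    total = int(match.group(1))
    clean = 0
    for part in match.group(2).split(","):
        part = part.strip()
        if not part:
            continue
        count, state = part.split(None, 1)
        # Strict match: states such as active+clean+scrubbing are
        # deliberately treated as "not clean yet".
        if state == "active+clean":
            clean += int(count)
    return total > 0 and clean == total
```

The strict comparison errs on the side of aborting: any PG in another state, whatever the cause, makes the check fail.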

If a user wants a faster way, she should do it at her own risk.