Part of the upgrade process to jewel involves a recursive 'chown' on the files under /var/lib/ceph. If this step takes longer than 10 minutes, the node is assumed dead and the charm moves on to upgrading the next node in the cluster.
Consequently, if enough OSDs are taken down for upgrade at the same time, whole placement groups can become unavailable.
The following log shows the timeout being reached even though the chown was still running on the node.
juju log:
2018-04-09 15:56:27 INFO juju-log node-15 is not finished. Waiting
2018-04-09 15:56:27 INFO config-changed obtained 'osd_node-15_jewel_start'
2018-04-09 15:56:27 DEBUG worker.uniter.jujuc server.go:178 hook context id "ceph-osd/3-config-changed-2244033196891562383"; dir "/var/lib/juju/agents/unit-ceph-osd-3/charm"
2018-04-09 15:56:27 INFO juju-log Waited 10 mins on node node-15. current time: 1523288787.3321168 > previous node start time: 1523288768.8523903 Moving on
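
For reference, here is a minimal sketch of the wait loop implied by the log above. The is_finished helper, the print calls, and the 30-second poll interval are illustrative assumptions, not the charm's actual code; the point is the shape of the deadline check:

import time

TIMEOUT = 10 * 60  # the charm's fixed 10-minute ceiling

def wait_on_previous_node(previous_node, start_time, is_finished):
    # Poll until the previous node marks its upgrade as finished.
    while not is_finished(previous_node):
        print("%s is not finished. Waiting" % previous_node)
        now = time.time()
        # The deadline is measured from when the previous node *started*
        # its upgrade, not from any sign of progress. A recursive chown
        # over a large /var/lib/ceph can easily run longer than 10
        # minutes, so a healthy node is declared dead and the rollout
        # moves on to the next one.
        if now - TIMEOUT > start_time:
            print("Waited 10 mins on node %s. Moving on" % previous_node)
            return
        time.sleep(30)

Whatever the exact implementation, the deadline is static rather than tied to observed progress, so a slow-but-healthy node is misclassified as dead.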
I think general best practice would be to run the 'pause-health' and 'resume-health' actions on one of the ceph-mon units before and after the ceph-osd upgrade, as sketched below. However, there is a second bug here: the charm gives up after only 10 minutes of waiting, which is not long enough for the chown to complete on a node with a large /var/lib/ceph.
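
For illustration, assuming a juju 2.x client and a unit named ceph-mon/0 (substitute your own):

juju run-action ceph-mon/0 pause-health    # sets noout so OSDs taken down for upgrade are not marked out
(perform the ceph-osd upgrade)
juju run-action ceph-mon/0 resume-health   # clears noout once all OSDs have rejoined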