As indicated above, this LP is addressed by two actions:
1) Update the Ceph recovery code to detect OSDs in the "stuck peering" state and restart them.
2) Investigate whether a semantic check is needed to prevent locking a host with OSDs while Ceph is in the "backfilling" state.
The 1st action is higher priority, and its commits are merged into r/stx.3.0.
The 2nd action is a good preventative addition but is not required for r/stx.3.0; it only needs to merge into master for the stx.4.0 release.
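
For illustration only, the two checks could look roughly like the sketch below. This is not the merged recovery code. It assumes the PG stats have already been fetched and parsed into a list of dicts with "state" and "acting" fields (for example from the JSON output of "ceph pg dump_stuck inactive", whose exact layout varies between Ceph releases). The function names are hypothetical.

```python
def osds_with_peering_pgs(pg_stats):
    """Return the sorted OSD ids found in the acting set of any PG
    whose state includes 'peering'.

    pg_stats is assumed to be a list of dicts with a '+'-joined
    'state' string and an 'acting' list of OSD ids. The caller is
    assumed to have already limited the list to PGs that Ceph
    reports as stuck.
    """
    osds = set()
    for pg in pg_stats:
        states = pg.get("state", "").split("+")
        if "peering" in states:
            osds.update(pg.get("acting", []))
    return sorted(osds)


def lock_should_be_rejected(pg_stats):
    """Return True if any PG is currently backfilling.

    Hypothetical semantic check for action 2: a host lock would be
    refused while this returns True.
    """
    return any(
        "backfilling" in pg.get("state", "").split("+")
        for pg in pg_stats
    )
```

The recovery code would restart the OSDs returned by the first function; whether to restart every OSD in the acting set or only the primary is a design choice this sketch does not settle.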