commit 37db1ee792d7ea1eee77b3b134f078a5fca5fdbe
Author: Dan Voiculeasa <email address hidden>
Date: Mon Dec 2 18:47:12 2019 +0200
ceph: Add semantic check on host-lock to avoid data/service loss
Avoid locking nodes that have OSDs in recovery state.
If an OSD that fills others with newer data is stopped (host-lock), then
Ceph doesn't feed the old data to consumers, thus K8s pods get stuck.
Parse `ceph health details` for PGs in `recovery_wait` or `recovering`
state. Identify OSDs acting on those PGs. Identify which nodes own the
OSDs. Deny the lock on those nodes until Ceph has recovered.
Mock `ceph health details` and `ceph osd tree` in tests with a simple
AIO-DX configuration. Controller-0 with OSD.0. Controller-1 with OSD.1.
Example `ceph health details` output:
pg 1.0 is active+recovery_wait+degraded, acting [1,0]
pg 1.1 is active+recovering+degraded, acting [1,0]
Partial-Bug: 1851287
Change-Id: Id644d1de5ba2a0bff51638fb9cb8a4d2732e7278
Signed-off-by: Dan Voiculeasa <email address hidden>
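
The check described in the commit message above can be illustrated with a minimal sketch. This is not the code from the change itself: the function names, the regular expression, and the `osd_to_host` mapping (which the real change would derive from `ceph osd tree`) are all assumptions made for illustration, and the input format is taken only from the two example lines quoted above.

```python
import re

# Matches lines such as "pg 1.0 is active+recovery_wait+degraded, acting [1,0]".
# The pattern is an assumption based on the example output in the commit message.
PG_LINE = re.compile(r"pg\s+(\S+)\s+is\s+(\S+),\s+acting\s+\[([\d,\s]*)\]")
RECOVERY_STATES = {"recovery_wait", "recovering"}


def osds_in_recovery(health_detail):
    """Return the ids of OSDs acting on PGs that are in a recovery state."""
    osds = set()
    for line in health_detail.splitlines():
        match = PG_LINE.search(line)
        if not match:
            continue
        # A PG state is a "+"-joined list, e.g. active+recovering+degraded.
        states = set(match.group(2).split("+"))
        if states & RECOVERY_STATES:
            osds.update(int(i) for i in match.group(3).split(",") if i.strip())
    return osds


def hosts_blocked_from_lock(health_detail, osd_to_host):
    """Return the hosts that own at least one OSD acting on a recovering PG.

    osd_to_host is a hypothetical {osd_id: hostname} mapping.
    """
    blocked = osds_in_recovery(health_detail)
    return {osd_to_host[osd] for osd in blocked if osd in osd_to_host}
```

With the AIO-DX layout from the tests (`{0: "controller-0", 1: "controller-1"}`) and the example output above, both controllers would be reported, so a host-lock on either would be denied until recovery completes.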
Reviewed: https://review.opendev.org/696938
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=37db1ee792d7ea1eee77b3b134f078a5fca5fdbe
Submitter: Zuul
Branch: master