Comment 15 for bug 1851287

OpenStack Infra (hudson-openstack) wrote: Fix merged to config (master)

Reviewed: https://review.opendev.org/696938
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=37db1ee792d7ea1eee77b3b134f078a5fca5fdbe
Submitter: Zuul
Branch: master

commit 37db1ee792d7ea1eee77b3b134f078a5fca5fdbe
Author: Dan Voiculeasa <email address hidden>
Date: Mon Dec 2 18:47:12 2019 +0200

    ceph: Add semantic check on host-lock to avoid data/service loss

    Avoid locking nodes that have OSDs involved in recovery.
    If an OSD that is supplying newer data to other OSDs is stopped
    (host-lock), Ceph will not serve the stale data to consumers, so
    Kubernetes pods get stuck.

    Parse `ceph health detail` for PGs in the `recovery_wait` or
    `recovering` state, identify the OSDs acting on those PGs, and identify
    which nodes own those OSDs. Deny the lock on those nodes until Ceph has
    recovered (see the sketch after the commit message).

    Mock `ceph health detail` and `ceph osd tree` in the tests with a
    simple AIO-DX configuration: controller-0 with osd.0 and controller-1
    with osd.1.

    Example `ceph health detail` output:
    pg 1.0 is active+recovery_wait+degraded, acting [1,0]
    pg 1.1 is active+recovering+degraded, acting [1,0]

    Partial-Bug: 1851287
    Change-Id: Id644d1de5ba2a0bff51638fb9cb8a4d2732e7278
    Signed-off-by: Dan Voiculeasa <email address hidden>
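
For illustration only, here is a rough sketch of the check described in the commit message, assuming the ceph CLI is queried directly; the helper names are invented and this is not the merged sysinv code:

    import json
    import re
    import subprocess

    # PG states that mean data recovery is still in flight.
    RECOVERY_STATES = ('recovery_wait', 'recovering')

    # Matches lines such as:
    #   pg 1.0 is active+recovery_wait+degraded, acting [1,0]
    PG_LINE_RE = re.compile(
        r'^pg\s+\S+\s+is\s+(?P<states>\S+),\s+acting\s+\[(?P<acting>[\d,\s]+)\]')

    def osds_in_recovery(health_detail):
        """Return the ids of OSDs acting on PGs that are still recovering."""
        osds = set()
        for line in health_detail.splitlines():
            m = PG_LINE_RE.match(line.strip())
            if not m:
                continue
            states = m.group('states').split('+')
            if any(s in RECOVERY_STATES for s in states):
                osds.update(int(i) for i in m.group('acting').split(','))
        return osds

    def hosts_owning_osds(osd_tree, osd_ids):
        """Map OSD ids to their owning hosts via `ceph osd tree -f json`."""
        hosts = set()
        for node in osd_tree.get('nodes', []):
            if node.get('type') == 'host':
                if any(child in osd_ids for child in node.get('children', [])):
                    hosts.add(node['name'])
        return hosts

    def check_lock_allowed(hostname):
        """Reject host-lock while the host owns an OSD needed for recovery."""
        health = subprocess.check_output(['ceph', 'health', 'detail']).decode()
        tree = json.loads(subprocess.check_output(
            ['ceph', 'osd', 'tree', '--format', 'json']).decode())
        busy = hosts_owning_osds(tree, osds_in_recovery(health))
        if hostname in busy:
            raise RuntimeError(
                "Rejected lock of %s: ceph is still recovering data hosted "
                "there; retry once recovery completes." % hostname)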
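
Against the example output quoted above, both PGs list acting [1,0], so osd.1 and osd.0 are both involved in recovery; with the AIO-DX mapping used in the tests (controller-0 owning osd.0, controller-1 owning osd.1), a host-lock of either controller would be rejected until recovery completes.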