Better health checks for radosgw

Bug #1946280 reported by James Troup
This bug affects 3 people
Affects: Ceph RADOS Gateway Charm
Status: Fix Released
Importance: Medium
Assigned to: Cornellius Metto

Bug Description

We recently had a situation where one of the 3 radosgw daemons had gotten wedged and apache was returning a stream of 503s. This manifested on the swift client side as long timeouts.

It looks like Ceph has supported /swift/healthcheck since 2016, so...

 a) as long as haproxy is in use, we might as well make use of its check capabilities and ask it to check /swift/healthcheck

 b) we should have an nrpe check of the same

(This was PS5, so Focal/Ussuri, if it matters)

Junien F (axino) wrote :

For what it's worth, "ceph status" was showing only 2 "rgw" daemons out of 3. The wedged one wasn't in the list.

Changed in charm-ceph-radosgw:
status: New → Triaged
importance: Undecided → Medium
Changed in charm-ceph-radosgw:
assignee: nobody → Cornellius Metto (ckmetto)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-radosgw (master)
Changed in charm-ceph-radosgw:
status: Triaged → In Progress
Cornellius Metto (ckmetto) wrote :

For [b] (nrpe check), I found that the check is already provided by charmhelpers (https://github.com/juju/charm-helpers/blob/master/charmhelpers/contrib/openstack/files/check_haproxy.sh). I verified that the check fails when a server is marked 'DOWN'.
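
As an illustration of the kind of check described here, the sketch below parses HAProxy's CSV stats output and reports servers whose status is DOWN, using Nagios-style exit codes (0 for OK, 2 for CRITICAL). It is not a port of check_haproxy.sh; the function names and the sample proxy and server names are hypothetical, and only the `pxname`, `svname` and `status` columns of the stats CSV are assumed.

```python
import csv
import io


def down_servers(stats_csv):
    """Return (proxy, server) pairs whose status starts with DOWN."""
    # The HAProxy stats header line starts with "# "; strip it so that
    # csv.DictReader sees plain column names.
    reader = csv.DictReader(io.StringIO(stats_csv.lstrip("# ")))
    down = []
    for row in reader:
        # FRONTEND and BACKEND rows are aggregates, not individual servers.
        if row["svname"] in ("FRONTEND", "BACKEND"):
            continue
        if (row.get("status") or "").startswith("DOWN"):
            down.append((row["pxname"], row["svname"]))
    return down


def nagios_result(down):
    """Map the list of DOWN servers to a Nagios-style (exit code, message)."""
    if down:
        names = ", ".join(f"{px}/{sv}" for px, sv in down)
        return 2, f"CRITICAL: servers DOWN: {names}"
    return 0, "OK: all servers UP"
```

An NRPE wrapper would fetch the CSV from the HAProxy stats endpoint, print the message, and exit with the returned code.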

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-radosgw (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-radosgw/+/817582
Committed: https://opendev.org/openstack/charm-ceph-radosgw/commit/31a4584169c1c78e59dc505de4163d528d1cae0a
Submitter: "Zuul (22348)"
Branch: master

commit 31a4584169c1c78e59dc505de4163d528d1cae0a
Author: Cornellius Metto <email address hidden>
Date: Thu Nov 11 16:03:13 2021 +0000

    Enable HAProxy HTTP Health Checks

    Ceph radosgw supports [0] the swift health check endpoint
    "/swift/healthcheck". This change adds the haproxy
    configuration [1] necessary to take the response of "GET
    /swift/healthcheck" into account when determining the health
    of a radosgw service.

    For testing, I verified that:
    - HAProxy starts and responds to requests normally with this
      configuration.
    - Servers with status != 2xx or 3xx are removed from the
      backend.
    - Servers that take too long to respond are also removed
      from the backend. The default timeout value is 2s.

    [0] https://tracker.ceph.com/issues/11682
    [1] https://www.haproxy.com/documentation/hapee/2-0r1/onepage/#4.2-option%20httpchk

    Closes-Bug: 1946280
    Change-Id: I82634255ca3423fec3fc15c1e714dcb31db5da7a
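
The behaviour verified in the commit message above (2xx/3xx counts as healthy, anything else or a slow response counts as down) can be approximated outside HAProxy with a small probe. This is a minimal sketch, not the charm's implementation: the function names are hypothetical, and the 2-second default only mirrors the timeout mentioned in the commit message.

```python
import http.client


def is_healthy_status(status):
    """Treat 2xx and 3xx as healthy, matching the behaviour described above."""
    return 200 <= status < 400


def probe(host, port, path="/swift/healthcheck", timeout=2.0):
    """Return True if GET `path` answers with a healthy status in time.

    `timeout` applies to the connect and to each socket read, so it only
    approximates a single overall check timeout.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return is_healthy_status(conn.getresponse().status)
    except (OSError, http.client.HTTPException):
        # Refused connections, timeouts and malformed replies all count as down.
        return False
    finally:
        conn.close()
```

For example, `probe("192.0.2.10", 80)` (a placeholder address) would return False for a wedged radosgw that answers 503 or does not answer within the timeout.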

Changed in charm-ceph-radosgw:
status: In Progress → Fix Committed
Felipe Reyes (freyes)
Changed in charm-ceph-radosgw:
milestone: none → 22.04
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-radosgw (stable/pacific)

Fix proposed to branch: stable/pacific
Review: https://review.opendev.org/c/openstack/charm-ceph-radosgw/+/837065

Alvaro Uria (aluria)
tags: added: bseng-71
Changed in charm-ceph-radosgw:
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-ceph-radosgw (stable/pacific)

Change abandoned by "James Page <email address hidden>" on branch: stable/pacific
Review: https://review.opendev.org/c/openstack/charm-ceph-radosgw/+/837065
Reason: This review is > 12 weeks without comment, and failed testing the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Jorge Merlino (jorge-merlino) wrote :

The purpose of this comment is to clarify the status of this bug and the steps needed for a backport. To fix this bug, two separate changes were created:

1) https://review.opendev.org/c/openstack/charm-ceph-radosgw/+/817582

This patch makes HAProxy query the swift health check endpoint "/swift/healthcheck" to determine the health of a radosgw service. It also adds an https flag to the context when https is in use. This flag is relevant in the context of the next patch.

2) https://github.com/juju/charm-helpers/pull/662

This patch to charm-helpers adds some TLS related parameters to the HAProxy configuration when the https context value is enabled. This patch has to be included in the charm by syncing the charm-helpers code.
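
To make the interaction between the two patches concrete, here is a rough sketch of how an https flag could change the rendered HAProxy backend. The thread does not say which TLS parameters the charm-helpers patch adds, so the `check-ssl verify none` server options below are an assumption chosen for illustration (both are real HAProxy server keywords; `verify none` disables certificate verification). The function name and the backend and server names are hypothetical, and the real charm renders its configuration from templates rather than with code like this.

```python
def render_backend(name, servers, https=False, check_path="/swift/healthcheck"):
    """Render a HAProxy backend stanza with an HTTP health check.

    `servers` is an iterable of (server_name, address, port) tuples.
    """
    lines = [
        f"backend {name}",
        f"    option httpchk GET {check_path}",
    ]
    for srv_name, address, port in servers:
        line = f"    server {srv_name} {address}:{port} check"
        if https:
            # Assumed TLS options: run the health check over TLS and skip
            # certificate verification.
            line += " check-ssl verify none"
        lines.append(line)
    return "\n".join(lines) + "\n"
```

With `https=False` the server lines end in `check`, so the health check is sent over plain HTTP; with `https=True` the check itself is sent over TLS.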

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-radosgw (stable/pacific)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-radosgw/+/837065
Committed: https://opendev.org/openstack/charm-ceph-radosgw/commit/3c7f468b0ca01d509566c670ab0a54a82b25f375
Submitter: "Zuul (22348)"
Branch: stable/pacific

commit 3c7f468b0ca01d509566c670ab0a54a82b25f375
Author: Cornellius Metto <email address hidden>
Date: Thu Nov 11 16:03:13 2021 +0000

    Enable HAProxy HTTP Health Checks

    Ceph radosgw supports [0] the swift health check endpoint
    "/swift/healthcheck". This change adds the haproxy
    configuration [1] necessary to take the response of "GET
    /swift/healthcheck" into account when determining the health
    of a radosgw service.

    For testing, I verified that:
    - HAProxy starts and responds to requests normally with this
      configuration.
    - Servers with status != 2xx or 3xx are removed from the
      backend.
    - Servers that take too long to respond are also removed
      from the backend. The default timeout value is 2s.

    [0] https://tracker.ceph.com/issues/11682
    [1] https://www.haproxy.com/documentation/hapee/2-0r1/onepage/#4.2-option%20httpchk

    Closes-Bug: 1946280
    Change-Id: I82634255ca3423fec3fc15c1e714dcb31db5da7a
    (cherry picked from commit 31a4584169c1c78e59dc505de4163d528d1cae0a)

tags: added: in-stable-pacific
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-radosgw (stable/octopus)

Fix proposed to branch: stable/octopus
Review: https://review.opendev.org/c/openstack/charm-ceph-radosgw/+/866007

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-radosgw (stable/octopus)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-radosgw/+/866007
Committed: https://opendev.org/openstack/charm-ceph-radosgw/commit/6c725f1335833f32b75af2f39559d7d841030014
Submitter: "Zuul (22348)"
Branch: stable/octopus

commit 6c725f1335833f32b75af2f39559d7d841030014
Author: Cornellius Metto <email address hidden>
Date: Thu Nov 11 16:03:13 2021 +0000

    Enable HAProxy HTTP Health Checks

    Ceph radosgw supports [0] the swift health check endpoint
    "/swift/healthcheck". This change adds the haproxy
    configuration [1] necessary to take the response of "GET
    /swift/healthcheck" into account when determining the health
    of a radosgw service.

    For testing, I verified that:
    - HAProxy starts and responds to requests normally with this
      configuration.
    - Servers with status != 2xx or 3xx are removed from the
      backend.
    - Servers that take too long to respond are also removed
      from the backend. The default timeout value is 2s.

    [0] https://tracker.ceph.com/issues/11682
    [1] https://www.haproxy.com/documentation/hapee/2-0r1/onepage/#4.2-option%20httpchk

    Closes-Bug: 1946280
    Change-Id: I82634255ca3423fec3fc15c1e714dcb31db5da7a
    (cherry picked from commit 31a4584169c1c78e59dc505de4163d528d1cae0a)

tags: added: in-stable-octopus