HAProxy healthchecks sometimes fail on a FIPS-enabled control plane

Bug #2020490 reported by Damien Ciabrini
This bug affects 1 person
Affects: tripleo
Status: Confirmed
Importance: Medium
Assigned to: Damien Ciabrini

Bug Description

On a FIPS-enabled HA control plane deployed on VMs, we are seeing a high number of healthcheck failures in the HAProxy logs for all services, though most of them affect the mysql service:

[WARNING] (730596) : Health check for backup server mysql/controller-1.internalapi.redhat.local failed, reason: Socket error, check duration: 1ms, status: 2/3 UP.
[WARNING] (730596) : Health check for backup server mysql/controller-1.internalapi.redhat.local succeeded, reason: Layer7 check passed, code: 200, check duration: 89ms, status: 3/3 UP.
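
To quantify which backends are hit most, the warnings can be tallied per backend and reason. The following is a minimal sketch, not part of the original report: the regular expression is modeled only on the two log lines quoted above, and `count_failures` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern modeled on the two log lines quoted above; other HAProxy
# versions or log formats may need a different expression.
CHECK_RE = re.compile(
    r"Health check for (?:backup )?server (?P<backend>[^/\s]+)/(?P<server>\S+) "
    r"(?P<result>failed|succeeded), reason: (?P<reason>[^,]+)"
)


def count_failures(lines):
    """Count failed health checks per (backend, reason) pair."""
    failures = Counter()
    for line in lines:
        match = CHECK_RE.search(line)
        if match and match.group("result") == "failed":
            failures[(match.group("backend"), match.group("reason"))] += 1
    return failures


sample = [
    "[WARNING] (730596) : Health check for backup server "
    "mysql/controller-1.internalapi.redhat.local failed, reason: Socket error, "
    "check duration: 1ms, status: 2/3 UP.",
    "[WARNING] (730596) : Health check for backup server "
    "mysql/controller-1.internalapi.redhat.local succeeded, reason: Layer7 check "
    "passed, code: 200, check duration: 89ms, status: 3/3 UP.",
]

if __name__ == "__main__":
    for (backend, reason), count in count_failures(sample).most_common():
        print(f"{backend}: {count} x {reason}")
```

Feeding the full HAProxy log through `count_failures` instead of `sample` should show whether `Socket error` on the mysql backend dominates, as described above.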

This seems to be a systemic issue in the environment, which is consuming a lot of sys time (from 10 to 30% sys time in top). In this situation, the galera service itself is working, but the healthcheck result is misinterpreted by HAProxy, due to what seems to be a race condition in socket closure between HAProxy and the healthcheck script `clustercheck`.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to puppet-tripleo (stable/zed)

Related fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/puppet-tripleo/+/884158

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/zed)

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/884163

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/884176

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to puppet-tripleo (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/puppet-tripleo/+/884178

OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (stable/zed)

Change abandoned by "Damien Ciabrini <email address hidden>" on branch: stable/zed
Review: https://review.opendev.org/c/openstack/puppet-tripleo/+/884158

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/zed)

Change abandoned by "Damien Ciabrini <email address hidden>" on branch: stable/zed
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/884163

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/884176
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/f04fdef6662f8117c73ec65069009f02501132dd
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit f04fdef6662f8117c73ec65069009f02501132dd
Author: Damien Ciabrini <email address hidden>
Date: Wed May 24 11:39:25 2023 +0200

    Allow clustercheck to wait before finishing

    Make clustercheck use additional parameter from
    /etc/sysconfig/clustercheck to wait a configured amount of time
    after it returned a HTTP status. This helps on a heavy loaded
    environment to prevent HAProxy from reporting wrong `socket error`
    when it disconnects after clustercheck.

    Closes-Bug: #2020490

    Change-Id: Iab75091c50178e684217840c15ec5d9974d34674
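
The behaviour described in this commit message can be illustrated with a small stand-alone responder. This is only a sketch of the idea, not the actual clustercheck script: the function names, the `post_status_wait` parameter, and the response body are all assumptions made for illustration. The responder sends its HTTP status, then keeps its side of the socket open for a configurable time before closing.

```python
import socket
import threading
import time


def serve_one_check(listener, post_status_wait=0.0):
    """Answer a single health check, then linger before closing."""
    conn, _ = listener.accept()
    with conn:
        conn.recv(4096)  # read the check request (contents ignored here)
        conn.sendall(
            b"HTTP/1.1 200 OK\r\n"
            b"Content-Type: text/plain\r\n"
            b"Connection: close\r\n"
            b"Content-Length: 3\r\n\r\n"
            b"ok\n"
        )
        # Hypothetical post-status wait: keep our side open so that the
        # checker has a chance to close the connection first.
        if post_status_wait > 0:
            time.sleep(post_status_wait)


def run_check(post_status_wait):
    """Run one check against a throwaway local responder.

    Returns (status_line, seconds_until_the_responder_closed).
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # any free local port
    listener.listen(1)
    worker = threading.Thread(
        target=serve_one_check, args=(listener, post_status_wait)
    )
    worker.start()
    with socket.create_connection(listener.getsockname()) as client:
        start = time.monotonic()
        client.sendall(b"GET / HTTP/1.0\r\n\r\n")
        data = b""
        while True:
            chunk = client.recv(4096)
            if not chunk:  # empty read: the responder closed its side
                break
            data += chunk
        elapsed = time.monotonic() - start
    worker.join()
    listener.close()
    return data.split(b"\r\n", 1)[0].decode(), elapsed


if __name__ == "__main__":
    for wait in (0.0, 0.2):
        status, elapsed = run_check(wait)
        print(f"wait={wait}s -> {status!r}, responder closed after {elapsed:.3f}s")
```

The client here reads until end-of-file only to measure how long the responder kept the socket open. A real checker such as HAProxy would close the connection on its own once it has the status, which is the window the extra wait is meant to provide.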

tags: added: in-stable-wallaby
OpenStack Infra (hudson-openstack) wrote : Related fix merged to puppet-tripleo (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/puppet-tripleo/+/884178
Committed: https://opendev.org/openstack/puppet-tripleo/commit/9e0bcc89dc88fe2883cb0ac7187c08d47e103d8f
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 9e0bcc89dc88fe2883cb0ac7187c08d47e103d8f
Author: Damien Ciabrini <email address hidden>
Date: Wed May 24 13:06:52 2023 +0200

    Allow clustercheck to wait before finishing

    Under rare circumstances of heavy loaded environments, HAProxy
    may report 'socket error' when disconnecting from clustercheck if
    the latter closed its side of the connection first.
    Add a configurable post status wait time in clustercheck to fix
    HAProxy error reporting. By default, this post wait is not used.

    Change-Id: I07bf928742f5ff579070f4d1d6248d5b105bce55
    Related-Bug: #2020490
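
Since the post-status wait is configurable and unused by default, a reader of the configuration has to fall back to zero when the setting is absent. The sketch below shows one way to do that for `KEY=VALUE` text such as a sysconfig file. The key name `POST_STATUS_WAIT` and the helper `read_post_status_wait` are hypothetical; the actual parameter name is not given in this report.

```python
import shlex


def read_post_status_wait(text, key="POST_STATUS_WAIT"):
    """Return the wait in seconds from KEY=VALUE text, or 0.0 when unset.

    The key name is a placeholder. The last assignment wins, as it would
    if the file were sourced by a shell.
    """
    wait = 0.0
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() != key:
            continue
        # shlex strips quotes and trailing "# comments" from the value.
        parts = shlex.split(value, comments=True)
        wait = float(parts[0]) if parts else 0.0
    return wait


if __name__ == "__main__":
    print(read_post_status_wait("MYSQL_HOST=localhost\n"))
    print(read_post_status_wait('POST_STATUS_WAIT="0.5"  # seconds\n'))
```

Returning `0.0` for a missing or empty value keeps the default behaviour unchanged, so the wait only applies where an operator explicitly sets it.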
