Insufficient monitoring of p_haproxy and vip__management corosync resources

Bug #1704657 reported by Nadezhda Kabanova
Affects: Fuel for OpenStack
Status: Invalid
Importance: Critical
Assigned to: Aleksey Zvyagintsev

Bug Description

Detailed bug description:
Currently we have the following function to check the status of haproxy; it basically only checks that the haproxy process appears in the ps output:
haproxy_status() {
        get_variables

        # check and make PID file dir
        local PID_DIR="$( dirname ${PIDFILE} )"
        if [ ! -d "${PID_DIR}" ] ; then
                ocf_log debug "Create pid file dir: ${PID_DIR}"
                mkdir -p "${PID_DIR}"
                # no need to chown, root is user for haproxy
                chmod 755 "${PID_DIR}"
        fi

        if [ -n "${PIDFILE}" -a -f "${PIDFILE}" ]; then
                # haproxy is probably running
                # get pid from pidfile
                PID="`cat ${PIDFILE}`"
                if [ -n "${PID}" ]; then
                        # check if process exists
                        if $RUN ps -p "${PID}" | grep -q haproxy; then
                                ocf_log info "haproxy daemon running"
                                return $OCF_SUCCESS
                        else
                                ocf_log warn "haproxy daemon is not running but pid file exists"
                                return $OCF_NOT_RUNNING
                        fi
                else
                        ocf_log err "PID file empty!"
                        return $OCF_ERR_GENERIC
                fi
        fi
        # haproxy is not running
        ocf_log info "haproxy daemon is not running"
        return $OCF_NOT_RUNNING
}

A single check that the process is running is not enough to ensure that haproxy works as expected.
Below are situations in which pcs will not notice that the resource is not functional.

Steps to reproduce:
(1) Simulation of an overloaded system
Assume haproxy has too little or no CPU time. This can be simulated by suspending haproxy: after "kill -STOP $(pidof haproxy)" on the node owning the VIP, the haproxy instance is suspended and no OpenStack commands are possible, yet the process still exists, so CRM cannot recognize that a VIP movement is needed (see the sketch below).
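
For reference, a minimal reproduction sketch (p_haproxy is the resource name from this report; crm_resource is the standard pacemaker CLI):

kill -STOP "$(pidof haproxy)"                  # freeze haproxy on the node holding the VIP
crm_resource --resource p_haproxy --locate     # pacemaker still reports the resource as running here
# OpenStack API calls through the VIP now hang, yet no failover is triggered
kill -CONT "$(pidof haproxy)"                  # resume haproxy to end the experiment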

(2) Simulation of a forwarding error on the network
This can be simulated by applying "iptables -P INPUT DROP" in the haproxy namespace: the haproxy instance becomes unreachable, but CRM cannot recognize that a VIP movement is needed, since haproxy itself is running and the VIP address still responds to ARP probes (see the sketch below).
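
A sketch of this simulation (the namespace name haproxy matches the Fuel default; check it with "ip netns list" on the controller):

ip netns exec haproxy iptables -P INPUT DROP    # drop all inbound IP traffic in the haproxy namespace
# ARP is not filtered by iptables, so the VIP still answers ARP probes,
# while TCP connections to the OpenStack endpoints time out
ip netns exec haproxy iptables -P INPUT ACCEPT  # revert after the test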

Expected results:
The resource should be moved to another node.

Actual result:
pcs reports that the resource is running; no problems are detected.

tags: added: customer-found
Changed in mos:
status: New → Confirmed
Changed in fuel:
status: New → Confirmed
importance: Undecided → Critical
Changed in mos:
importance: Undecided → Critical
Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

Pacemaker's goal is to monitor the service itself, not its ability to work properly.
We can imagine another case where haproxy works but can't reach backends located on other controllers.
We can't demand deep insight from pacemaker scripts in such cases.
Case (2) in particular is completely artificial.

I'm not sure this is even fixable.

Revision history for this message
Zoltan Szeder (zoltan-szeder) wrote :

Hi Eugene,

If I understand correctly, there is a socket configured that provides statistics for the ports haproxy monitors.
I think that, using that socket, it can be verified that haproxy is working properly.
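
To illustrate the idea, a minimal sketch of such a check in the style of the OCF function above, assuming the admin socket is enabled in haproxy.cfg and lives at /var/run/haproxy.sock (the path is an assumption and may differ in Fuel):

check_haproxy_socket() {
        local sock="/var/run/haproxy.sock"   # assumed admin socket path
        # "show stat" prints CSV whose header starts with "# pxname" when haproxy
        # is healthy; a frozen or unreachable haproxy makes the request time out
        if echo "show stat" | timeout 3 socat stdio "UNIX-CONNECT:${sock}" | grep -q '^# pxname'; then
                return $OCF_SUCCESS
        fi
        ocf_log err "haproxy stats socket did not respond"
        return $OCF_ERR_GENERIC
}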

Case (2) is an artificial break in the system to present a fault in monitoring. Although the root cause was not discovered, we have already seen issues that produced the same behavior.

Could you reproduce the same in your environment?

BR: Zoltan Szeder

no longer affects: mos
Changed in fuel:
status: Confirmed → Opinion
Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Let me share my thoughts on the matter.

Although at first glance your proposal seems reasonable, it may be overcomplicated to implement properly. Regarding case #1: in most cases we assume that most of the load happens on the node which holds the VIP at the moment. That is, if we move the VIP from an overloaded node onto a standby one, the standby node will very soon become overloaded too. Does this mean we have to move the VIP again? This cycle never ends. So the solution for this case is workload distribution: you have to reduce the load by moving some heavy services from the controller node to a dedicated server, because if the system is overloaded, moving that load as a whole from one node to another won't help at all.

#2 is even more unrealistic: forwarding errors may occur for dozens of reasons, and most of them are recurring or persistent. For example, if there was a problem with the ARP cache on the controller holding the VIP, what is the probability that this problem occurs on the second controller when we move the VIP there? Almost 100%, because the ARP issue is only a consequence of some other condition, and by moving the VIP we do not solve the root cause of the issue. I can also imagine some misconfiguration, for example if some automation erroneously ran "iptables -P INPUT DROP". I'm not sure that is a valid case, because a misconfiguration should never be worked around in the software.

So I'm not convinced that there is a real issue in haproxy's OCF script; so far the root cause seems to be your particular configuration and the overloaded control plane. Moving to Opinion. Please feel free to share your thoughts.

Revision history for this message
Zoltan Szeder (zoltan-szeder) wrote :

Hi Denis,

The purpose of corosync/pacemaker is to relocate or restart services when they behave abnormally.
In the scenario where haproxy is frozen, it is the fault-management system's duty to restart haproxy or, if the restart fails, to migrate vip_management to a different node.

Pacemaker should try to move the resource around automatically, raising an alarm that the resource is fluctuating, rather than leave it to expensive human intervention to deal with the outage (see the sketch below).
*Immediate* human intervention should only be needed in the unlikely case that the fault reoccurs on all three controllers.
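
For illustration, such behavior is commonly tuned through resource meta attributes; a minimal sketch using the resource name from this report (the threshold values are arbitrary examples, not Fuel defaults):

# ban p_haproxy from a node after 3 consecutive monitor failures and let the
# failure count expire after 120s, so the cluster retries on its own instead
# of waiting for manual cleanup
pcs resource meta p_haproxy migration-threshold=3 failure-timeout=120s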

I accept that the root cause should be handled. There is already a ticket for that in Mirantis' internal ticketing system, and we have been waiting for a resolution of that issue since February.

This ticket is not about the root cause, but about mitigating the previously mentioned and similar issues in the system.

The iptables command mentioned in the ticket is obviously not meant to represent a real configuration, but to simulate the behavior easily.
Alternatively, the vip_management address could also be monitored with ICMP ECHO requests (besides ARP ping) to discover a layer-3 network outage on that interface; a sketch follows.
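
A minimal sketch of such a probe, in the style of the OCF script above (MGMT_VIP is a placeholder for the management VIP address, not an existing variable in the script):

# ARP may still resolve while layer-3 traffic is dropped (case 2 above),
# so an ICMP probe of the VIP catches outages that an ARP check alone misses
if ! ping -c 1 -W 2 "${MGMT_VIP}" > /dev/null 2>&1; then
        ocf_log err "management VIP ${MGMT_VIP} does not answer ICMP"
        return $OCF_ERR_GENERIC
fi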

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/485253

Changed in fuel:
status: Opinion → In Progress
assignee: nobody → Aleksey Zvyagintsev (azvyagintsev)
milestone: 9.x-updates → 9.2-mu-3
Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Moving to Invalid, since the root cause of the issue has been found and it is connected not with insufficient monitoring but with the overloaded hardware.

Changed in fuel:
status: In Progress → Invalid
milestone: 9.2-mu-3 → 9.x-updates
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (stable/mitaka)

Change abandoned by Andreas Jaeger (<email address hidden>) on branch: stable/mitaka
Review: https://review.opendev.org/485253
Reason: This repo is retired now, no further work will get merged.
