healthchecks are mostly unreliable

Bug #1860556 reported by Cédric Jeanneret
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Fix Released
High
Cédric Jeanneret

Bug Description

Hello there,

After some checks and digging with the healthchecks, it appears most of them are unreliable due to the lack of strict error checking, such as "set -o pipefail" and other options.
We probably want to add the following options:
set -eo pipefail
in tripleo-common/healthcheck/common.sh

------

While digging into that issue, I also found out that, apparently, "grep -q -E ..." doesn't return the correct exit code when it does match a piped content - at least in some cases (see bellow).

For instance, in nova_conductor container::

(ss -ntuap; sudo -u nova ss -ntuap) | sort -u | /usr/bin/grep -Eq ":(5672).*,pid=($(pgrep -d '|' nova-conductor))" ; echo $?
141

But if we do it without -q:
(ss -ntuap; sudo -u nova ss -ntuap) | sort -u | /usr/bin/grep -E ":(5672).*,pid=($(pgrep -d '|' nova-conductor))" ; echo $?
tcp ESTAB 0 0 192.168.24.1:54136 192.168.24.1:5672 users:(("nova-conductor",pid=25,fd=9))
tcp ESTAB 0 0 192.168.24.1:54138 192.168.24.1:5672 users:(("nova-conductor",pid=26,fd=9))
tcp ESTAB 0 0 192.168.24.1:54140 192.168.24.1:5672 users:(("nova-conductor",pid=28,fd=9))
tcp ESTAB 0 0 192.168.24.1:54142 192.168.24.1:5672 users:(("nova-conductor",pid=24,fd=9))
tcp ESTAB 0 0 192.168.24.1:54144 192.168.24.1:5672 users:(("nova-conductor",pid=23,fd=9))
tcp ESTAB 0 0 192.168.24.1:54146 192.168.24.1:5672 users:(("nova-conductor",pid=27,fd=9))
tcp ESTAB 0 0 192.168.24.1:54148 192.168.24.1:5672 users:(("nova-conductor",pid=29,fd=9))
tcp ESTAB 0 0 192.168.24.1:54150 192.168.24.1:5672 users:(("nova-conductor",pid=22,fd=9))
tcp ESTAB 0 0 192.168.24.1:57270 192.168.24.1:5672 users:(("nova-conductor",pid=25,fd=10))
tcp ESTAB 0 0 192.168.24.1:57310 192.168.24.1:5672 users:(("nova-conductor",pid=26,fd=10))
tcp ESTAB 0 0 192.168.24.1:57320 192.168.24.1:5672 users:(("nova-conductor",pid=28,fd=10))
tcp ESTAB 0 0 192.168.24.1:57324 192.168.24.1:5672 users:(("nova-conductor",pid=24,fd=10))
tcp ESTAB 0 0 192.168.24.1:57326 192.168.24.1:5672 users:(("nova-conductor",pid=23,fd=10))
tcp ESTAB 0 0 192.168.24.1:57364 192.168.24.1:5672 users:(("nova-conductor",pid=22,fd=10))
tcp ESTAB 8 0 192.168.24.1:57328 192.168.24.1:5672 users:(("nova-conductor",pid=27,fd=10))
tcp ESTAB 8 0 192.168.24.1:57360 192.168.24.1:5672 users:(("nova-conductor",pid=29,fd=10))
0

This unreliable behaviour was detected in a rhel-8 OSP-16 container, while on the rhel-8 host, it was working as expected. There's probably something fishy with the container env at some point, but to be honest, I didn't dig further.

A solution for that last issue is to drop the -q and redirect STDOUT to /dev/null:

(ss -ntuap; sudo -u nova ss -ntuap) | sort -u | /usr/bin/grep -E ":(5672).*,pid=($(pgrep -d '|' nova-conductor))" >/dev/null; echo $?
0

since it will return 0 if nothing is matched, as you can see here:
(ss -ntuap; sudo -u nova ss -ntuap) | sort -u | /usr/bin/grep -E ":(15672).*,pid=($(pgrep -d '|' nova-conductor))" >/dev/null; echo $?
1

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I have a two jokes for this:

- here containers proven to become the "sufficiently advanced technology" by Arthur C. Clarke

- let's rewrite on awk. grep -q as of now considered harmful

Revision history for this message
Michele Baldessari (michele) wrote :

It is not a bug in grep. It's documented although very surprising behaviour:
       -q, --quiet, --silent
              Quiet; do not write anything to standard output. Exit immediately with zero status if any match is found, even if an error was detected. Also see the -s or --no-messages
              option.

Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

Nope - there's one more thing I forget to add:
I tried the very same commands using a temporary script:

(ss -ntuap; sudo -u nova ss -ntuap) | sort -u > tmp; echo $?;
/usr/bin/grep -E ":(15672).*,pid=($(pgrep -d '|' nova-conductor))" tmp; echo $?

there, we get 0....

Apparently, the piped content with that -q makes grep all weird - if we remove that -q, we get the effective 0 exit code (and 1 if it doesn't match anything).

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

@Michele, right there is another issue that "grep -q will return 0 if an error is detected".
But the most astonishing one is that sometimes the piped grep -q returns 141 while 0 is expected

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

@Michele oh you're right, both stems from the same cause https://stackoverflow.com/questions/19120263/why-exit-code-141-with-grep-q

tags: added: train-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.opendev.org/703819

Changed in tripleo:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.opendev.org/703819
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=ba02b0a23582b286b775fd129e3a09970862755b
Submitter: Zuul
Branch: master

commit ba02b0a23582b286b775fd129e3a09970862755b
Author: Cédric Jeanneret <email address hidden>
Date: Wed Jan 22 16:23:24 2020 +0100

    Make healthchecks more strict

    It was discovered that healthchecks aren't really reliable because they
    aren't strict enough.

    The current patch adds the "standard" options in order to ensure we
    actually catch errors soon enough in order to return the actual state of
    the checked element.

    It also requires a small change for the healthcheck_port() function,
    since the "piping" returned a 141 code instead of 0 due SIGPIPE being
    sent at some point[1].

    It also depends on two other changes, in order to ensure we won't get
    any "sudo" issues inside the checks (here again, healthcheck_port is
    tricky).

    [1] https://stackoverflow.com/questions/19120263/why-exit-code-141-with-grep-q
        http://www.tldp.org/LDP/lpg/node20.html

    Depends-On: https://review.opendev.org/703818
    Depends-On: https://review.opendev.org/703816
    Change-Id: Iada9fb49881c8edc9c6ede46a939d1853204f896
    Closes-Bug: #1860556
    Related: https://bugzilla.redhat.com/show_bug.cgi?id=1794044

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/704651

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/train)

Reviewed: https://review.opendev.org/704651
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=dddf584505bb24c53ba9d5fd92396695dc7a540c
Submitter: Zuul
Branch: stable/train

commit dddf584505bb24c53ba9d5fd92396695dc7a540c
Author: Cédric Jeanneret <email address hidden>
Date: Wed Jan 22 16:23:24 2020 +0100

    Make healthchecks more strict

    It was discovered that healthchecks aren't really reliable because they
    aren't strict enough.

    The current patch adds the "standard" options in order to ensure we
    actually catch errors soon enough in order to return the actual state of
    the checked element.

    It also requires a small change for the healthcheck_port() function,
    since the "piping" returned a 141 code instead of 0 due SIGPIPE being
    sent at some point[1].

    It also depends on two other changes, in order to ensure we won't get
    any "sudo" issues inside the checks (here again, healthcheck_port is
    tricky).

    [1] https://stackoverflow.com/questions/19120263/why-exit-code-141-with-grep-q
        http://www.tldp.org/LDP/lpg/node20.html

    Depends-On: https://review.opendev.org/703818
    Depends-On: https://review.opendev.org/703816
    Change-Id: Iada9fb49881c8edc9c6ede46a939d1853204f896
    Closes-Bug: #1860556
    Related: https://bugzilla.redhat.com/show_bug.cgi?id=1794044
    (cherry picked from commit ba02b0a23582b286b775fd129e3a09970862755b)

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.opendev.org/707799

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.opendev.org/707799
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=475368e9b7b6175f4f11d99bf7ce8f5bb89744e7
Submitter: Zuul
Branch: master

commit 475368e9b7b6175f4f11d99bf7ce8f5bb89744e7
Author: Cédric Jeanneret <email address hidden>
Date: Wed Jan 22 16:23:24 2020 +0100

    Make healthchecks more strict

    It was discovered that healthchecks aren't really reliable because they
    aren't strict enough.

    The current patch adds the "standard" options in order to ensure we
    actually catch errors soon enough in order to return the actual state of
    the checked element.

    It also requires a small change for the healthcheck_port() function,
    since the "piping" returned a 141 code instead of 0 due SIGPIPE being
    sent at some point[1].

    [1] https://stackoverflow.com/questions/19120263/why-exit-code-141-with-grep-q
        http://www.tldp.org/LDP/lpg/node20.html

    Change-Id: If13b6ca177d47a0af29ba5e5099e040eea62876c
    Closes-Bug: #1860556
    Related: https://bugzilla.redhat.com/show_bug.cgi?id=1794044

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/708304

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 12.1.0

This issue was fixed in the openstack/tripleo-common 12.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/train)

Reviewed: https://review.opendev.org/708304
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=5a1f5dbdd18712183d44955e9125ff9a805ca97f
Submitter: Zuul
Branch: stable/train

commit 5a1f5dbdd18712183d44955e9125ff9a805ca97f
Author: Cédric Jeanneret <email address hidden>
Date: Wed Jan 22 16:23:24 2020 +0100

    Make healthchecks more strict

    It was discovered that healthchecks aren't really reliable because they
    aren't strict enough.

    The current patch adds the "standard" options in order to ensure we
    actually catch errors soon enough in order to return the actual state of
    the checked element.

    It also requires a small change for the healthcheck_port() function,
    since the "piping" returned a 141 code instead of 0 due SIGPIPE being
    sent at some point[1].

    [1] https://stackoverflow.com/questions/19120263/why-exit-code-141-with-grep-q
        http://www.tldp.org/LDP/lpg/node20.html

    Change-Id: If13b6ca177d47a0af29ba5e5099e040eea62876c
    Closes-Bug: #1860556
    Related: https://bugzilla.redhat.com/show_bug.cgi?id=1794044
    (cherry picked from commit 475368e9b7b6175f4f11d99bf7ce8f5bb89744e7)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 12.2.0

This issue was fixed in the openstack/tripleo-common 12.2.0 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.