systemd timers for podman healthchecks are too high, break AMQP healthchecks

Bug #1826281 reported by Damien Ciabrini on 2019-04-24
This bug affects 1 person
Affects: tripleo
Importance: High
Assigned to: Alex Schultz

Bug Description

With the docker container engine, we used to run the container healthcheck every 30s. This ensured that the mod_wsgi process hosting an OpenStack service would wake up at least every 30s, which guaranteed that any AMQP heartbeat packet sent by rabbitmq would be honoured in time and would keep an idle AMQP connection alive.

Since we switched to podman, paunch now generates an additional systemd timer unit which governs the frequency at which the healthcheck is run in a podman container. The current timer is configured to run every 60s plus a random delay of up to 45s, which is well above the previous 30s interval and has a side effect on AMQP traffic.

If a containerized service doesn't receive much traffic, it may not process the AMQP heartbeats sent by rabbitmq within 60s, which would make rabbitmq terminate the connection:

2019-04-15 16:27:16.677 [error] <0.3388.0> closing AMQP connection <0.3388.0> (10.109.1.2:57610 -> 10.109.1.2:5672 - mod_wsgi:20:544ec298-0b37-4123-87e0-2362416c7183):
missed heartbeats from client, timeout: 60s

... and force the OpenStack service to reconnect eventually:

nova/nova-api.log:2019-04-15 16:27:24.740 20 ERROR oslo.messaging._drivers.impl_rabbit [req-82a5fdc7-5fe8-4314-a31c-394eabdc418d b36fa1be80ae4275bb4511bb74d78eff 89d60bdd36ab49c5b0720eb43fa7d1dd - default default] [f78bdc50-0734-4823-b91b-7c9ac4227fd0] AMQP server on undercloud-0.ctlplane.localdomain:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: OSError: Server unexpectedly closed connection

This needlessly fills logs, consumes resources, and may cause unexpected side effects on some services (we saw some failures in mistral on the undercloud).
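
The timing argument above can be checked with a few lines of arithmetic. This is only a sketch under a simplified model: it assumes the gap between two healthcheck runs is at most the base interval plus the full random delay, and the function names are made up for illustration. The 30s, 60s and 45s values and the 60s heartbeat timeout come from the description and the rabbitmq log.

```python
# Simplified model: the healthcheck is the only thing that wakes an idle
# service, and two runs are at most (interval + random delay) apart.
HEARTBEAT_TIMEOUT_S = 60  # from the rabbitmq log: "timeout: 60s"


def worst_case_gap(interval_s, random_delay_s=0):
    """Upper bound, in seconds, between two healthcheck runs in this model."""
    return interval_s + random_delay_s


def can_miss_heartbeat(interval_s, random_delay_s=0,
                       timeout_s=HEARTBEAT_TIMEOUT_S):
    """True if an idle service may sleep past the heartbeat timeout."""
    return worst_case_gap(interval_s, random_delay_s) > timeout_s


# docker: healthcheck every 30s, no random delay
print("docker:", worst_case_gap(30), "s, at risk:", can_miss_heartbeat(30))
# podman + systemd timer: 60s plus up to 45s of random delay
print("podman:", worst_case_gap(60, 45), "s, at risk:",
      can_miss_heartbeat(60, 45))
```

Under these assumptions the docker setup always wakes the service within the timeout, while the podman timer can leave it idle for longer than rabbitmq tolerates.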

Changed in tripleo:
importance: Undecided → Critical
tags: added: containers
Bogdan Dobrelya (bogdando) wrote :

Ideally, we should instead fix those "unexpected side effects on some services" and allow AMQP connections to fail and re-establish as needed. That becomes especially important for large-scale and/or edge deployments, where the expectation for the "edge site goes offline" failure mode is that AMQP connections to *all* involved services recover without human intervention.

Damien Ciabrini (dciabrin) wrote :

Agreed. I left an intentionally vague comment because I haven't experienced the failure myself lately and I don't know how to trigger it consistently. As soon as it reappears, let's track it in a dedicated bug.

Changed in tripleo:
importance: Critical → High

Fix proposed to branch: master
Review: https://review.opendev.org/656901

Changed in tripleo:
assignee: Damien Ciabrini (dciabrin) → Michele Baldessari (michele)
Changed in tripleo:
assignee: Michele Baldessari (michele) → Herve Beraud (herveberaud)
Changed in tripleo:
assignee: Herve Beraud (herveberaud) → Alex Schultz (alex-schultz)

Reviewed: https://review.opendev.org/656901
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=2e54cff0db213494d07c72ce0f36d60831d2e84d
Submitter: Zuul
Branch: master

commit 2e54cff0db213494d07c72ce0f36d60831d2e84d
Author: Michele Baldessari <email address hidden>
Date: Thu May 2 23:20:48 2019 +0200

    Use oslo_rootwrap subprocess module in order to gain proper eventlet awareness

    Via https://github.com/openstack/oslo.rootwrap/commit/31cfdbd4076bb6556cf9612171ba43fa44475d71
    oslo.rootwrap gained support for eventlet when using subprocess. By moving
    to oslo_rootwrap.subprocess we make sure that with python3 the
    subprocess calls use eventlet.green.subprocess if eventlet is used.
    This worked on python2 because (from above commit):
    """
    On Python 2, it "works" to use directly subprocess: subprocess.Popen
    calls os.pipe() and os.fdopen(fd) which are both monkey-patched. On
    Python 3, it doesn't work because subprocess uses os.pipe() and
    io.open(fd), and the io module is *not* monkey-patched at all.
    """

    By applying this change what happens is that the heartbeat thread is
    able to be scheduled every 15 seconds by default. Without this patch
    what we have been observing with python3 is that while running ansible
    mistral would constantly log error messages like the following:
    2019-05-02 19:14:36.702 8 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: amqp.exceptions.ConnectionForced: Too many heartbeats missed

    With this change we could not reproduce this issue during a deployment
    and no missed heartbeat messages were observed during the deploy.

    Co-Authored-By: Damien Ciabrini <email address hidden>
    Co-Authored-By: Hervé Beraud <email address hidden>

    Closes-Bug: #1826281

    Change-Id: Id22b1465d6d2424d90781983b970aba4545feb8a
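
The idea behind the commit above (use the green subprocess module when eventlet has monkey-patched the process, and the standard one otherwise) can be sketched as follows. This is an illustration, not oslo.rootwrap's actual code: the helper name pick_subprocess_module is hypothetical, and using the "socket" module as the marker for "eventlet is in use" is an assumption. It falls back to the standard library if eventlet is not installed.

```python
def pick_subprocess_module():
    """Return a subprocess-like module suited to the current runtime.

    Hypothetical helper: it illustrates the selection logic only.
    """
    try:
        from eventlet import patcher
    except ImportError:
        # eventlet is not installed, so nothing can be monkey-patched.
        import subprocess
        return subprocess

    # Assumption: a patched socket module means the service runs on eventlet.
    if patcher.is_monkey_patched("socket"):
        # Green variant: waiting on the child process yields to other
        # greenthreads (such as the AMQP heartbeat thread) instead of
        # blocking the whole process.
        from eventlet.green import subprocess as green_subprocess
        return green_subprocess

    import subprocess
    return subprocess


subprocess_mod = pick_subprocess_module()
print("using", subprocess_mod.__name__)
```

With a helper like this, callers keep using the familiar Popen-style API while long-running child processes (ansible, in the case reported here) no longer starve the heartbeat thread on python3.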

Changed in tripleo:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/657090
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=6f55c8917101b12f4600796a681505e170969a14
Submitter: Zuul
Branch: stable/stein

commit 6f55c8917101b12f4600796a681505e170969a14
Author: Michele Baldessari <email address hidden>
Date: Thu May 2 23:20:48 2019 +0200

    Use oslo_rootwrap subprocess module in order to gain proper eventlet awareness

    Via https://github.com/openstack/oslo.rootwrap/commit/31cfdbd4076bb6556cf9612171ba43fa44475d71
    oslo.rootwrap gained support for eventlet when using subprocess. By moving
    to oslo_rootwrap.subprocess we make sure that with python3 the
    subprocess calls use eventlet.green.subprocess if eventlet is used.
    This worked on python2 because (from above commit):
    """
    On Python 2, it "works" to use directly subprocess: subprocess.Popen
    calls os.pipe() and os.fdopen(fd) which are both monkey-patched. On
    Python 3, it doesn't work because subprocess uses os.pipe() and
    io.open(fd), and the io module is *not* monkey-patched at all.
    """

    By applying this change what happens is that the heartbeat thread is
    able to be scheduled every 15 seconds by default. Without this patch
    what we have been observing with python3 is that while running ansible
    mistral would constantly log error messages like the following:
    2019-05-02 19:14:36.702 8 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: amqp.exceptions.ConnectionForced: Too many heartbeats missed

    With this change we could not reproduce this issue during a deployment
    and no missed heartbeat messages were observed during the deploy.

    Co-Authored-By: Damien Ciabrini <email address hidden>
    Co-Authored-By: Hervé Beraud <email address hidden>

    Closes-Bug: #1826281

    Change-Id: Id22b1465d6d2424d90781983b970aba4545feb8a
    (cherry picked from commit 2e54cff0db213494d07c72ce0f36d60831d2e84d)

tags: added: in-stable-stein

Reviewed: https://review.opendev.org/657120
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=f5ef977e1c3a24af70f511ec9d4d7a496e57b2d5
Submitter: Zuul
Branch: stable/stein

commit f5ef977e1c3a24af70f511ec9d4d7a496e57b2d5
Author: Emilien Macchi <email address hidden>
Date: Sat May 4 14:52:22 2019 +0000

    Revert "mistral: configure heartbeat parameters to avoid action timeout"

    This reverts commit 374fafd66afa792ba197403b479dadbfa3055bce.

    The root cause of the timeout has been addressed by:
    Id22b1465d6d2424d90781983b970aba4545feb8a

    We don't need that horrible hack.
    Related-Bug: #1826281

    Change-Id: I5f1c89e7fad7624c2edbf557ec39f5777b089d55
    (cherry picked from commit 738486f10850425a56809f23b830951832712e0b)

Reviewed: https://review.opendev.org/657119
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=738486f10850425a56809f23b830951832712e0b
Submitter: Zuul
Branch: master

commit 738486f10850425a56809f23b830951832712e0b
Author: Emilien Macchi <email address hidden>
Date: Sat May 4 14:52:22 2019 +0000

    Revert "mistral: configure heartbeat parameters to avoid action timeout"

    This reverts commit 374fafd66afa792ba197403b479dadbfa3055bce.

    The root cause of the timeout has been addressed by:
    Id22b1465d6d2424d90781983b970aba4545feb8a

    We don't need that horrible hack.
    Related-Bug: #1826281

    Change-Id: I5f1c89e7fad7624c2edbf557ec39f5777b089d55

Bogdan Dobrelya (bogdando) wrote :

I wonder if we want a similar change in python-tripleoclient, where a containerized undercloud installer invokes ansible as well?

This issue was fixed in the openstack/tripleo-common 11.0.0 release.

This issue was fixed in the openstack/tripleo-common 10.8.0 release.
