systemd timer intervals for podman healthchecks are too long, break AMQP heartbeats
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
tripleo | Fix Released | High | Alex Schultz |
Bug Description
With the docker container engine, we used to run container healthchecks every 30s. This ensured that the mod_wsgi process hosting an OpenStack service would wake up at least every 30s, which guaranteed that any AMQP heartbeat packet sent by rabbitmq would be honoured in time and would keep an idle AMQP connection afloat.
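For context, with docker the engine itself re-ran the container's healthcheck command on a fixed interval (30s is docker's default --health-interval), which is what kept the wsgi process waking up. A minimal sketch of that model, assuming a healthcheck script shipped in the image (the script path is illustrative):

# Docker-era model (sketch): the engine re-runs the check every 30s,
# waking the mod_wsgi process even when it receives no API traffic.
HEALTHCHECK --interval=30s --timeout=10s CMD /openstack/healthcheck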
Since we switched to podman, paunch now generates an additional systemd timer unit which governs the frequency at which the healthcheck is run in a podman container. The current timer is configured to run every 60s plus a random 45s on top, which is well above the previous 30s interval and has side effects on AMQP traffic.
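The generated unit looks roughly like the sketch below (the unit name and exact directives are illustrative, not the literal file paunch writes; the point is the 60s period plus the randomized delay):

# <container>_healthcheck.timer (sketch, using standard systemd timer directives)
[Unit]
Description=Periodic healthcheck for the <container> container

[Timer]
# re-run roughly 60s after the previous healthcheck run...
OnUnitActiveSec=60
# ...plus up to 45s of random extra delay
RandomizedDelaySec=45

[Install]
WantedBy=timers.target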
If a containerized service doesn't receive much traffic, it may not process the AMQP heartbeats sent by rabbitmq within 60s, which would make rabbitmq terminate the connection:
2019-04-15 16:27:16.677 [error] <0.3388.0> closing AMQP connection <0.3388.0> (10.109.1.2:57610 -> 10.109.1.2:5672 - mod_wsgi:
missed heartbeats from client, timeout: 60s
... and force the OpenStack service to reconnect eventually:
nova/nova-
This needlessly fills logs, consumes resources and may cause unexpected side effects on some services (we saw some failures in mistral on the undercloud).
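For reference, the 60s in the log above is the AMQP heartbeat timeout negotiated by the client. In oslo.messaging it is driven by the [oslo_messaging_rabbit] options sketched below; the values shown are the assumed defaults in a service config such as nova.conf, not a proposed change:

[oslo_messaging_rabbit]
# Seconds of silence after which the connection is considered dead; rabbitmq
# logs the same value when it drops the client ("timeout: 60s").
heartbeat_timeout_threshold = 60
# Heartbeats are exchanged at threshold/rate intervals (every 30s with these
# defaults), so the wsgi process has to wake up well within the 60s window.
heartbeat_rate = 2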
Changed in tripleo:
importance: Undecided → Critical
tags: added: containers
Changed in tripleo:
importance: Critical → High
Changed in tripleo:
assignee: Michele Baldessari (michele) → Herve Beraud (herveberaud)
Changed in tripleo:
assignee: Herve Beraud (herveberaud) → Alex Schultz (alex-schultz)
Ideally, we should instead fix those "unexpected side effects on some services" and allow AMQP connections to fail and re-establish as they need to. That becomes especially important for large-scale and/or edge deployments, where the expectation for the "edge site goes offline" failure mode is that AMQP connections to *all* involved services recover without human intervention.