podman: healthcheck services start in failure during deployment (then work fine later)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
tripleo | Fix Released | Medium | Jill Rouleau |
Bug Description
The systemd timers and healthchecks start in failure when paunch creates and enables the systemd unit files:
http://
But later, once the services are up and running, the healthcheck services work fine:
[root@undercloud ~]# service tripleo_
Redirecting to /bin/systemctl status tripleo_
● tripleo_
Loaded: loaded (/etc/systemd/
Active: inactive (dead) since Wed 2018-12-26 22:48:01 UTC; 12s ago
Process: 226721 ExecStart=
Main PID: 226721 (code=exited, status=0/SUCCESS)
Dec 26 22:48:00 undercloud.
Dec 26 22:48:01 undercloud.
Dec 26 22:48:01 undercloud.
Dec 26 22:48:01 undercloud.
Dec 26 22:48:01 undercloud.
Dec 26 22:48:01 undercloud.
Maybe we should reduce the verbosity in paunch to avoid false alarms for operators, or just make the healthchecks work on the first run by paunch.
Changed in tripleo: | |
assignee: | nobody → Jill Rouleau (jillrouleau) |
Changed in tripleo: | |
milestone: | stein-2 → stein-3 |
We technically had this same situation with docker, it just didn't show in the logs. On container start, anything that takes more than 30s to fully pass its healthcheck will show as unhealthy in docker ps before turning healthy. The functionality here is the same, it's just more visible in the logs.
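For illustration, this startup window exists with any image whose healthcheck command only succeeds once the service is actually serving; the endpoint, command, and timings below are made up for the sketch:

```dockerfile
# Hypothetical Containerfile healthcheck: the container reports
# "starting"/"unhealthy" in `podman ps` / `docker ps` until the
# service answers on its health endpoint, then flips to "healthy".
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
    CMD curl -fsS http://localhost:8080/health || exit 1
```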
We could leave healthcheck units disabled on the host and modify the container images to emit notify signals to enable checks when the container is ready, but that's a pretty involved approach.
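The notify-signal idea above could be sketched roughly as follows. This is a minimal, illustrative implementation of the systemd notify protocol (a datagram to the socket named in `NOTIFY_SOCKET`); the function name is made up, and a real image would call it from its entrypoint once startup completes:

```python
import os
import socket


def sd_notify(message: str = "READY=1") -> bool:
    """Send a readiness message over systemd's notify socket.

    Returns False when not running under a notify-aware supervisor
    (i.e. NOTIFY_SOCKET is not set in the environment).
    """
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False
    if addr.startswith("@"):
        # Abstract socket namespace: a leading '@' maps to a NUL byte.
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), addr)
    return True
```

With something like this in the image, the healthcheck unit could stay disabled until the container signals readiness, but as noted, wiring that through every container image is the involved part.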
I've been experimenting with unit file options to try to work out a combination that reliably delays activating the healthcheck by X seconds/minutes on both install and boot, but still properly brings everything up after the delay. I can keep chasing this path, but it's worth asking how much we want to invest in what amounts to a cosmetic error. It should be expected that the container isn't considered healthy until it has completed its startup process.

I don't think we want to suppress these messages, because we would want to know if a container never comes up ok, and I'm not sure the way it is now is really a problem. What about a debug message in the task saying "hey fyi, some containers might error a bit before becoming healthy"? I'll see what I can do with that idea in common_deploy_steps_tasks.yaml.
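The delayed-activation experiment could look roughly like the timer fragment below, using standard systemd.timer options. The unit name and the 2-minute delay are illustrative, not what paunch actually generates:

```ini
# Hypothetical timer for a tripleo healthcheck service.
[Unit]
Description=Delayed healthcheck timer (illustrative)

[Timer]
# OnActiveSec delays the first check after the timer starts
# (covers both initial install and boot), giving the container
# time to finish its startup work.
OnActiveSec=120
# OnUnitActiveSec then re-runs the check periodically.
OnUnitActiveSec=30

[Install]
WantedBy=timers.target
```

The tradeoff is that a fixed delay is a guess: too short and the first runs still fail, too long and a genuinely broken container goes unreported for the whole window.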