podman: healthcheck services start in failure during deployment (then work fine later)

Bug #1809833 reported by Emilien Macchi on 2018-12-26
This bug affects 1 person
Assigned to: Jill Rouleau

Bug Description

The systemd timers and healthchecks start in a failed state when paunch creates and enables the systemd unit files.

Later, once the services are up and running, the healthcheck services work fine:

[root@undercloud ~]# service tripleo_nova_conductor_healthcheck status
Redirecting to /bin/systemctl status tripleo_nova_conductor_healthcheck.service
● tripleo_nova_conductor_healthcheck.service - nova_conductor healthcheck
   Loaded: loaded (/etc/systemd/system/tripleo_nova_conductor_healthcheck.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Wed 2018-12-26 22:48:01 UTC; 12s ago
  Process: 226721 ExecStart=/usr/bin/podman exec nova_conductor /openstack/healthcheck (code=exited, status=0/SUCCESS)
 Main PID: 226721 (code=exited, status=0/SUCCESS)

Dec 26 22:48:00 undercloud.localdomain systemd[1]: Starting nova_conductor healthcheck...
Dec 26 22:48:01 undercloud.localdomain podman[226721]: - users:(("nova-conductor",pid=25,fd=8))
Dec 26 22:48:01 undercloud.localdomain podman[226721]: - users:(("nova-conductor",pid=25,fd=4))
Dec 26 22:48:01 undercloud.localdomain podman[226721]: - users:(("nova-conductor",pid=26,fd=8))
Dec 26 22:48:01 undercloud.localdomain podman[226721]: - users:(("nova-conductor",pid=26,fd=4))
Dec 26 22:48:01 undercloud.localdomain systemd[1]: Started nova_conductor healthcheck.

Maybe we should reduce the verbosity in paunch to avoid false alarms for operators, or just make the healthchecks work on paunch's first run.

Changed in tripleo:
assignee: nobody → Jill Rouleau (jillrouleau)
Jill Rouleau (jillrouleau) wrote :

We technically had this same situation with docker, it just didn't show in the logs. On container start, anything that takes more than 30s to fully pass its healthchecks will show as unhealthy in docker ps before turning healthy. The functionality here is the same, it's just more visible in the logs.

We could leave healthcheck units disabled on the host and modify the container images to emit notify signals to enable checks when the container is ready, but that's a pretty involved approach.

I've been experimenting with unit file options to try to work out a combo that reliably delays activating the healthcheck by X seconds/minutes on both install and boot but still properly brings everything up after the delay. I can keep chasing this path, but it's worth asking how much we want to invest in what amounts to a cosmetic error. It should be expected that the container wouldn't be considered healthy until it's completed its startup processes. I don't think we want to suppress these messages because we would want to know if it never came up ok, and I'm not sure the way it is now is really a problem. What about a debug message in the task that says "hey fyi, some containers might error a bit before being healthy"? I'll see what I can do with that idea in common_deploy_steps_tasks.yaml.

Changed in tripleo:
milestone: stein-2 → stein-3

Fix proposed to branch: master
Review: https://review.openstack.org/631374

Changed in tripleo:
status: Triaged → In Progress

Reviewed: https://review.openstack.org/631374
Committed: https://git.openstack.org/cgit/openstack/paunch/commit/?id=fc5aecd5d1a6b8a3ad9b554eb62c5b8dc4475a68
Submitter: Zuul
Branch: master

commit fc5aecd5d1a6b8a3ad9b554eb62c5b8dc4475a68
Author: Jill Rouleau <email address hidden>
Date: Wed Jan 16 19:15:32 2019 -0700

    Prevent podman healthcheck failures during deploy

    Modify systemd healthcheck units so that check is not executed until
    2min after being enabled, to allow containerized service time to become
    available. The combination of OnActive and OnUnitActive replaces OnCalendar.
    After the OnActive countdown causes the service unit to be first run,
    OnUnitActive will provide a monotonic timer for future executions
    Also, do not explicitly enable the service unit as only the timer should
    control this service.

    Closes-Bug: #1809833

    Change-Id: I6a916d62e7b3eed5190c12ea3b37ba240a9ab184
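
As a rough illustration of the timer settings the commit message describes, the sketch below renders a systemd timer unit that waits 2 minutes after activation before the first healthcheck and then repeats on a monotonic interval. It is not the actual paunch code: the function name render_healthcheck_timer, the 60-second repeat interval, and the unit description text are assumptions made for the example. Only the 2-minute initial delay and the tripleo_<container>_healthcheck naming come from this report. OnActiveSec=, OnUnitActiveSec=, and Unit= are standard systemd.timer options.

```python
def render_healthcheck_timer(container, first_run_delay_sec=120, interval_sec=60):
    """Return the text of a systemd timer unit for a container healthcheck.

    Hypothetical helper, not part of paunch. interval_sec=60 is an assumed
    value; the bug report only states the 2 minute initial delay.
    """
    service = "tripleo_%s_healthcheck.service" % container
    lines = [
        "[Unit]",
        "Description=%s container healthcheck timer" % container,
        "",
        "[Timer]",
        # Service unit this timer activates.
        "Unit=%s" % service,
        # First run: this many seconds after the timer itself is activated,
        # giving the containerized service time to come up.
        "OnActiveSec=%d" % first_run_delay_sec,
        # Later runs: this many seconds after the service was last activated.
        "OnUnitActiveSec=%d" % interval_sec,
        "",
        "[Install]",
        "WantedBy=timers.target",
    ]
    return "\n".join(lines) + "\n"


print(render_healthcheck_timer("nova_conductor"))
```

With a layout like this, only the timer would be enabled (for example with systemctl enable --now tripleo_nova_conductor_healthcheck.timer), and the healthcheck service unit would be left for the timer to start.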

Changed in tripleo:
status: In Progress → Fix Released

This issue was fixed in the openstack/paunch 4.3.0 release.
