paunch systemd units enablement might trigger the rate limiting

Bug #1839841 reported by Bogdan Dobrelya
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Medium
Assigned to: Bogdan Dobrelya

Bug Description

On subsequent overcloud deployments (performing a stack UPDATE), paunch enables services "too fast", which exhausts the start rate limit set by DefaultStartLimitIntervalSec (10s) and DefaultStartLimitBurst (defaults to 5) in systemd-system.conf.

The failed service unit then logs, for example:

/var/log/paunch.log:2019-08-12 10:19:06.502 395435 DEBUG paunch [ ] Creating systemd unit file: /etc/systemd/system/tripleo_memcached-uy0nkwuo.service
/var/log/paunch.log:2019-08-12 10:19:06.869 395435 DEBUG paunch.utils.systemctl [ ] Executing: systemctl enable --now tripleo_memcached-uy0nkwuo
/var/log/paunch.log:subprocess.CalledProcessError: Command '['systemctl', 'enable', '--now', 'tripleo_memcached-uy0nkwuo']' returned non-zero exit status 1.
/var/log/paunch.log:paunch.utils.systemctl.SystemctlException: Command '['systemctl', 'enable', '--now', 'tripleo_memcached-uy0nkwuo']' returned non-zero exit status 1.
/var/log/paunch.log:2019-08-12 10:19:07.555 395435 ERROR paunch [ ] Command '['systemctl', 'enable', '--now', 'tripleo_memcached-uy0nkwuo']' returned non-zero exit status 1.
/var/log/paunch.log:subprocess.CalledProcessError: Command '['systemctl', 'enable', '--now', 'tripleo_memcached-uy0nkwuo']' returned non-zero exit status 1.
/var/log/paunch.log:paunch.utils.systemctl.SystemctlException: Command '['systemctl', 'enable', '--now', 'tripleo_memcached-uy0nkwuo']' returned non-zero exit status 1.
/var/log/messages:Aug 12 10:19:07 controller-2 systemd[1]: tripleo_memcached-uy0nkwuo.service: Can't open PID file /var/run/memcached-uy0nkwuo.pid (yet?) after start: No such file or directory
/var/log/messages:Aug 12 10:19:07 controller-2 systemd[1]: tripleo_memcached-uy0nkwuo.service: Failed with result 'protocol'.
/var/log/messages:Aug 12 10:19:07 controller-2 systemd[1]: tripleo_memcached-uy0nkwuo.service: Service RestartSec=100ms expired, scheduling restart.
/var/log/messages:Aug 12 10:19:07 controller-2 systemd[1]: tripleo_memcached-uy0nkwuo.service: Scheduled restart job, restart counter is at 1.
/var/log/messages:Aug 12 10:19:07 controller-2 systemd[1]: tripleo_memcached-uy0nkwuo.service: Can't open PID file /var/run/memcached-uy0nkwuo.pid (yet?) after start: No such file or directory
...
...controller-2 systemd[1]: tripleo_memcached-uy0nkwuo.service: Scheduled restart job, restart counter is at 5.

The deployment then fails because the service unit can no longer be started.
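
To illustrate why a 100ms restart loop hits the limit so quickly, here is a minimal sketch of a start rate limiter with the default values quoted above. It is a simplified fixed-window model written for this report, not systemd's actual code; the class and method names are hypothetical.

```python
class StartRateLimiter:
    """Toy model of a start rate limit (hypothetical, not systemd code)."""

    def __init__(self, interval_sec=10.0, burst=5):
        self.interval_sec = interval_sec  # like DefaultStartLimitIntervalSec
        self.burst = burst                # like DefaultStartLimitBurst
        self._window_start = None
        self._count = 0

    def try_start(self, now):
        """Return True if a start at time `now` (in seconds) is allowed."""
        # Open a new window on the first start or once the interval elapsed.
        if (self._window_start is None
                or now - self._window_start >= self.interval_sec):
            self._window_start = now
            self._count = 0
        if self._count >= self.burst:
            return False  # limit hit: no further starts in this window
        self._count += 1
        return True

    def reset_failed(self):
        """Model of what `systemctl reset-failed` does to the counter."""
        self._window_start = None
        self._count = 0


limiter = StartRateLimiter()
# One start attempt every 100 ms, as with RestartSec=100ms.
results = [limiter.try_start(i * 0.1) for i in range(6)]
```

In this model the first five attempts are allowed and the sixth, only half a second after the first, is refused until the window expires or the counter is reset.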

Changed in tripleo:
importance: Undecided → High
milestone: none → train-3
tags: added: idempotency
tags: added: containers
OpenStack Infra (hudson-openstack) wrote : Fix proposed to paunch (master)

Fix proposed to branch: master
Review: https://review.opendev.org/675895

Changed in tripleo:
assignee: nobody → Bogdan Dobrelya (bogdando)
status: New → In Progress
Bogdan Dobrelya (bogdando) wrote :

It is possible that the issue is not present in the master branch and only appears after https://review.opendev.org/#/c/675692/1/paunch/builder/base.py

Luke Short (ekultails) wrote :

On a related note, today we noticed that the actual PID file does not match the one referenced by the systemd service file.

```
$ grep PIDFile /etc/systemd/system/tripleo_memcached-4btmx9ju.service
PIDFile=/var/run/memcached-4btmx9ju.pid
$ ls -1 /var/run/memcached-4btmx9ju.pid
ls: cannot access '/var/run/memcached-4btmx9ju.pid': No such file or directory
$ ls -1 /var/run/memcached.pid
/var/run/memcached.pid
$ cat /var/run/memcached.pid
666389
```
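
The same check can be scripted. The following is a rough sketch, assuming a plain line-based parse of the unit file is good enough; the helper names are hypothetical and are not part of paunch.

```python
import os


def pidfile_from_unit(unit_text):
    """Return the PIDFile= value found in unit file text, or None."""
    value = None
    for line in unit_text.splitlines():
        line = line.strip()
        if line.startswith("PIDFile="):
            # Keep the last assignment seen in the file.
            value = line[len("PIDFile="):].strip()
    return value or None


def check_pidfile(unit_path):
    """Return (expected_path, exists) for the unit file at unit_path."""
    with open(unit_path) as unit_file:
        expected = pidfile_from_unit(unit_file.read())
    return expected, expected is not None and os.path.exists(expected)
```

For the unit shown above, `check_pidfile` would report the path `/var/run/memcached-4btmx9ju.pid` together with `False`, since only `/var/run/memcached.pid` exists.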

Changed in tripleo:
importance: High → Medium
Bogdan Dobrelya (bogdando) wrote :
OpenStack Infra (hudson-openstack) wrote : Fix merged to paunch (master)

Reviewed: https://review.opendev.org/675895
Committed: https://git.openstack.org/cgit/openstack/paunch/commit/?id=2eaebe2cd96aa1e121d02f6998e5addd7c52ef86
Submitter: Zuul
Branch: master

commit 2eaebe2cd96aa1e121d02f6998e5addd7c52ef86
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Aug 12 15:24:33 2019 +0200

    Fix systemd service start rate limiting

    The default limit is to allow 5 restarts in a 10sec period. If a
    service goes over that threshold due to the Restart= config option in
    the service definition, it will not attempt to restart any further.

    We should not set StartLimitIntervalSec to 0 to disable any kind of
    rate limiting as that may end up impacting the node load.

    Instead, use tenacity to retry with an exponential backoff, when the
    service unit enablement fails. Before to retry it, reset the unit's
    failure counters with the systemctl wrapper. This is a crash-loop
    approach that provides an efficient feature parity to the classic
    rate limiting, shall we want to implement that for the systemctl
    command wrapper instead.

    Closes-bug: #1839841

    Change-Id: I537fbf9933f2cbe6e1c2f627ba77da645bd55f25
    Signed-off-by: Bogdan Dobrelya <email address hidden>
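
For readers unfamiliar with the approach described in the commit message, a minimal sketch of the retry logic follows. It is not the merged paunch code: the actual fix uses tenacity, whereas this version substitutes a hand-written loop from the standard library, and the function names, attempt count, and delay values are assumptions chosen for illustration.

```python
import subprocess
import time


def run_systemctl(args):
    """Run `systemctl <args>`; raises CalledProcessError on failure."""
    subprocess.run(["systemctl"] + list(args), check=True)


def enable_with_backoff(unit, attempts=5, base_delay=1.0, max_delay=30.0,
                        run=run_systemctl, sleep=time.sleep):
    """Enable and start `unit`, retrying with exponential backoff.

    Returns the number of attempts used. The attempt count and delays
    are illustrative values, not the ones used by paunch.
    """
    for attempt in range(attempts):
        try:
            run(["enable", "--now", unit])
            return attempt + 1
        except subprocess.CalledProcessError:
            if attempt == attempts - 1:
                raise  # out of retries, surface the last failure
            # Clear the failed state and the start rate limit counter
            # so the next attempt is not refused outright.
            run(["reset-failed", unit])
            # Wait 1s, 2s, 4s, ... capped at max_delay.
            sleep(min(base_delay * 2 ** attempt, max_delay))
```

The `run` and `sleep` parameters exist only so the loop can be exercised without touching a real systemd instance.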

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to paunch (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/676749

OpenStack Infra (hudson-openstack) wrote : Fix merged to paunch (stable/stein)

Reviewed: https://review.opendev.org/676749
Committed: https://git.openstack.org/cgit/openstack/paunch/commit/?id=5b5e578cc571437965ff55b9c5a660bfe6310b2e
Submitter: Zuul
Branch: stable/stein

commit 5b5e578cc571437965ff55b9c5a660bfe6310b2e
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Aug 12 15:24:33 2019 +0200

    Fix systemd service start rate limiting

    The default limit is to allow 5 restarts in a 10sec period. If a
    service goes over that threshold due to the Restart= config option in
    the service definition, it will not attempt to restart any further.

    We should not set StartLimitIntervalSec to 0 to disable any kind of
    rate limiting as that may end up impacting the node load.

    Instead, use tenacity to retry with an exponential backoff, when the
    service unit enablement fails. Before to retry it, reset the unit's
    failure counters with the systemctl wrapper. This is a crash-loop
    approach that provides an efficient feature parity to the classic
    rate limiting, shall we want to implement that for the systemctl
    command wrapper instead.

    Closes-bug: #1839841

    Change-Id: I537fbf9933f2cbe6e1c2f627ba77da645bd55f25
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 2eaebe2cd96aa1e121d02f6998e5addd7c52ef86)

tags: added: in-stable-stein
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/paunch 4.5.1

This issue was fixed in the openstack/paunch 4.5.1 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/paunch 5.2.0

This issue was fixed in the openstack/paunch 5.2.0 release.
