tripleo

paunch systemd units enablement might trigger the rate limiting

Bug #1839841 reported by Bogdan Dobrelya on 2019-08-12

This bug affects 1 person

Affects   Status         Importance   Assigned to       Milestone
tripleo   Fix Released   Medium       Bogdan Dobrelya   tripleo train-3

Bug Description

For subsequent overcloud deployments (performing stack UPDATE), paunch enables services "too fast", which exhausts the start rate limit set by DefaultStartLimitIntervalSec (10s) and DefaultStartLimitBurst (defaults to 5) in systemd-system.conf.

The failed service unit then logs, for example:

/var/log/paunch.log:2019-08-12 10:19:06.502 395435 DEBUG paunch [ ] Creating systemd unit file: /etc/systemd/system/tripleo_memcached-uy0nkwuo.service
/var/log/paunch.log:2019-08-12 10:19:06.869 395435 DEBUG paunch.utils.systemctl [ ] Executing: systemctl enable --now tripleo_memcached-uy0nkwuo
/var/log/paunch.log:subprocess.CalledProcessError: Command '['systemctl', 'enable', '--now', 'tripleo_memcached-uy0nkwuo']' returned non-zero exit status 1.
/var/log/paunch.log:paunch.utils.systemctl.SystemctlException: Command '['systemctl', 'enable', '--now', 'tripleo_memcached-uy0nkwuo']' returned non-zero exit status 1.
/var/log/paunch.log:2019-08-12 10:19:07.555 395435 ERROR paunch [ ] Command '['systemctl', 'enable', '--now', 'tripleo_memcached-uy0nkwuo']' returned non-zero exit status 1.
/var/log/paunch.log:subprocess.CalledProcessError: Command '['systemctl', 'enable', '--now', 'tripleo_memcached-uy0nkwuo']' returned non-zero exit status 1.
/var/log/paunch.log:paunch.utils.systemctl.SystemctlException: Command '['systemctl', 'enable', '--now', 'tripleo_memcached-uy0nkwuo']' returned non-zero exit status 1.
/var/log/messages:Aug 12 10:19:07 controller-2 systemd[1]: tripleo_memcached-uy0nkwuo.service: Can't open PID file /var/run/memcached-uy0nkwuo.pid (yet?) after start: No such file or directory
/var/log/messages:Aug 12 10:19:07 controller-2 systemd[1]: tripleo_memcached-uy0nkwuo.service: Failed with result 'protocol'.
/var/log/messages:Aug 12 10:19:07 controller-2 systemd[1]: tripleo_memcached-uy0nkwuo.service: Service RestartSec=100ms expired, scheduling restart.
/var/log/messages:Aug 12 10:19:07 controller-2 systemd[1]: tripleo_memcached-uy0nkwuo.service: Scheduled restart job, restart counter is at 1.
/var/log/messages:Aug 12 10:19:07 controller-2 systemd[1]: tripleo_memcached-uy0nkwuo.service: Can't open PID file /var/run/memcached-uy0nkwuo.pid (yet?) after start: No such file or directory
...
...controller-2 systemd[1]: tripleo_memcached-uy0nkwuo.service: Scheduled restart job, restart counter is at 5.

The deployment then fails because the service unit can no longer be started.
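
As a rough illustration of why quick restarts run into the limit, the sketch below models a start rate limit with the default values quoted above (10s interval, burst of 5). It is a simplified fixed-window model, not systemd's actual code; the names StartRateLimit and starts_before_limit are hypothetical, and the 100ms spacing is taken from the RestartSec=100ms line in the log.

```python
class StartRateLimit:
    """Simplified model of a per-unit start rate limit (not systemd code)."""

    def __init__(self, interval=10.0, burst=5):
        self.interval = interval  # window length in seconds
        self.burst = burst        # starts allowed per window
        self.window_start = None
        self.count = 0

    def allow(self, now):
        """Return True if a start attempt at time `now` (seconds) is allowed."""
        if self.window_start is None or now - self.window_start > self.interval:
            # Open a new window beginning with this attempt.
            self.window_start = now
            self.count = 0
        if self.count < self.burst:
            self.count += 1
            return True
        return False


def starts_before_limit(restart_sec, limit, max_attempts=100):
    """Count immediately failing starts, spaced restart_sec apart, until one is refused.

    Returns None if no attempt is refused within max_attempts.
    """
    for attempt in range(max_attempts):
        if not limit.allow(attempt * restart_sec):
            return attempt
    return None
```

Under these assumptions, `starts_before_limit(0.1, StartRateLimit())` allows five starts and refuses the sixth, all within the first second, while restarts spaced 3 seconds apart never reach the limit in this model.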

Bogdan Dobrelya (bogdando) on 2019-08-12

Changed in tripleo:
importance:	Undecided → High
milestone:	none → train-3
tags:	added: idempotency
tags:	added: containers

OpenStack Infra (hudson-openstack) wrote on 2019-08-12: Fix proposed to paunch (master)

Fix proposed to branch: master
Review: https://review.opendev.org/675895

Changed in tripleo:
assignee:	nobody → Bogdan Dobrelya (bogdando)
status:	New → In Progress

Bogdan Dobrelya (bogdando) wrote on 2019-08-12:

It is possible that the issue is not present in the master branch and only appears once https://review.opendev.org/#/c/675692/1/paunch/builder/base.py is applied.

Luke Short (ekultails) wrote on 2019-08-12:

On a related note, today we noticed that the PID file does not match the one referenced in the systemd service file.

```
$ grep PIDFile /etc/systemd/system/tripleo_memcached-4btmx9ju.service
PIDFile=/var/run/memcached-4btmx9ju.pid
$ ls -1 /var/run/memcached-4btmx9ju.pid
ls: cannot access '/var/run/memcached-4btmx9ju.pid': No such file or directory
$ ls -1 /var/run/memcached.pid
/var/run/memcached.pid
$ cat /var/run/memcached.pid
666389
```
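
The same check can be scripted. This is a minimal sketch, assuming a plain `PIDFile=` line in the unit file; the helper name pidfile_status is hypothetical, and the parsing is deliberately naive (no drop-ins, no line continuations).

```python
import os


def pidfile_status(unit_path):
    """Return (pidfile, exists) for the PIDFile= setting found in a unit file.

    pidfile is None when the unit file has no PIDFile= line.
    """
    pidfile = None
    with open(unit_path) as unit:
        for line in unit:
            key, sep, value = line.strip().partition("=")
            if sep and key == "PIDFile":
                pidfile = value  # keep the last assignment found
    return pidfile, pidfile is not None and os.path.exists(pidfile)
```

For the unit shown above, `pidfile_status("/etc/systemd/system/tripleo_memcached-4btmx9ju.service")` would be expected to report the configured path together with `False`, since that file does not exist.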

Bogdan Dobrelya (bogdando) on 2019-08-13

Changed in tripleo:
importance:	High → Medium

Bogdan Dobrelya (bogdando) wrote on 2019-08-13:

@Luke, right, there is https://bugs.launchpad.net/tripleo/+bug/1839929 for this.

OpenStack Infra (hudson-openstack) wrote on 2019-08-14: Fix merged to paunch (master)

Reviewed: https://review.opendev.org/675895
Committed: https://git.openstack.org/cgit/openstack/paunch/commit/?id=2eaebe2cd96aa1e121d02f6998e5addd7c52ef86
Submitter: Zuul
Branch: master

commit 2eaebe2cd96aa1e121d02f6998e5addd7c52ef86
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Aug 12 15:24:33 2019 +0200

    Fix systemd service start rate limiting

    The default limit is to allow 5 restarts in a 10sec period. If a
    service goes over that threshold due to the Restart= config option in
    the service definition, it will not attempt to restart any further.

    We should not set StartLimitIntervalSec to 0 to disable any kind of
    rate limiting as that may end up impacting the node load.

    Instead, use tenacity to retry with an exponential backoff, when the
    service unit enablement fails. Before to retry it, reset the unit's
    failure counters with the systemctl wrapper. This is a crash-loop
    approach that provides an efficient feature parity to the classic
    rate limiting, shall we want to implement that for the systemctl
    command wrapper instead.

    Closes-bug: #1839841

    Change-Id: I537fbf9933f2cbe6e1c2f627ba77da645bd55f25
    Signed-off-by: Bogdan Dobrelya <email address hidden>
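
The retry approach described in the commit message can be sketched roughly as follows. This is a dependency-free illustration, not the merged paunch code: the actual fix uses tenacity, whereas this sketch swaps in a plain loop. The function enable_with_backoff, its parameters, and the injectable runner and sleep arguments are hypothetical, and SystemctlException here is a local stand-in for the paunch exception seen in the log.

```python
import subprocess
import time


class SystemctlException(Exception):
    """Raised when a systemctl invocation exits non-zero (local stand-in)."""


def systemctl(args, runner=subprocess.run):
    """Run `systemctl <args>` and raise SystemctlException on failure."""
    result = runner(["systemctl", *args])
    if result.returncode != 0:
        raise SystemctlException("systemctl %s failed" % " ".join(args))


def enable_with_backoff(unit, attempts=5, base_delay=1.0, max_delay=10.0,
                        runner=subprocess.run, sleep=time.sleep):
    """Enable and start `unit`, retrying with an exponential backoff.

    Returns the number of the attempt that succeeded; re-raises the last
    SystemctlException once all attempts are used up.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            systemctl(["enable", "--now", unit], runner=runner)
            return attempt
        except SystemctlException:
            if attempt == attempts:
                raise
            # Reset the unit's failed state and start rate counter
            # before waiting and trying again.
            systemctl(["reset-failed", unit], runner=runner)
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

Passing runner and sleep as arguments keeps the sketch testable without touching a real systemd; a caller on a real node would simply use `enable_with_backoff("tripleo_memcached")` with the defaults.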

Changed in tripleo:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2019-08-15: Fix proposed to paunch (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/676749

OpenStack Infra (hudson-openstack) wrote on 2019-08-16: Fix merged to paunch (stable/stein)

Reviewed: https://review.opendev.org/676749
Committed: https://git.openstack.org/cgit/openstack/paunch/commit/?id=5b5e578cc571437965ff55b9c5a660bfe6310b2e
Submitter: Zuul
Branch: stable/stein

commit 5b5e578cc571437965ff55b9c5a660bfe6310b2e
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Aug 12 15:24:33 2019 +0200

    Fix systemd service start rate limiting

    We should not set StartLimitIntervalSec to 0 to disable any kind of
    rate limiting as that may end up impacting the node load.

    Closes-bug: #1839841

    Change-Id: I537fbf9933f2cbe6e1c2f627ba77da645bd55f25
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 2eaebe2cd96aa1e121d02f6998e5addd7c52ef86)