many containers are needlessly restarted on redeploy with no config changes

Bug #1786065 reported by Michele Baldessari
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Fix Released
Critical
Michele Baldessari

Bug Description

Rerunning the overcloud deploy command with no changes restarts a truckload of containers (first seen this via https://bugzilla.redhat.com/show_bug.cgi?id=1612960).
So we really have three separate issues here. Below is the list of all the containers
that may restart needlessly (at least what I have observed in my tests):
A) cron category:
ceilometer_agent_notification
cinder_api
cinder_api_cron
cinder_scheduler
heat_api
heat_api_cfn
heat_api_cron
heat_engine
keystone
keystone_cron
logrotate_crond
nova_api
nova_api_cron
nova_conductor
nova_consoleauth
nova_metadata
nova_scheduler
nova_vnc_proxy
openstack-cinder-volume-docker-0
panko_api

These end up being restarted because in the config volume for the container there is
a cron file and cron files are generated with a timestamp inside:
$ cat /var/lib/config-data/puppet-generated/keystone/var/spool/cron/keystone
# HEADER: This file was autogenerated at 2018-08-07 11:44:57 +0000 by puppet.
# HEADER: While it can still be managed manually, it is definitely not recommended.
# HEADER: Note particularly that the comments starting with 'Puppet Name' should
# HEADER: not be deleted, as doing so could cause duplicate cron jobs.
# Puppet Name: keystone-manage token_flush
PATH=/bin:/usr/bin:/usr/sbin SHELL=/bin/sh
1 * * * * keystone-manage token_flush >>/var/log/keystone/keystone-tokenflush.log 2>&1

The timestamp is unfortunately hard coded into puppet in both the cron provider and the parsedfile
provider:
https://github.com/puppetlabs/puppet/blob/master/lib/puppet/provider/cron/crontab.rb#L127
https://github.com/puppetlabs/puppet/blob/master/lib/puppet/provider/parsedfile.rb#L104

One possible fix would be to change the tar command in docker-puppet.py to be something like
this:
tar -c -f - /var/lib/config-data/${NAME} --mtime='1970-01-01' | tar xO | sed '/#.*HEADER.*/d'

Note that the 'tar xO' is needed in order to force text output and not use the
native tar format (if you don't the sed won't work).

B) swift category:
swift_account_auditor
swift_account_reaper
swift_account_replicator
swift_account_server
swift_container_auditor
swift_container_replicator
swift_container_server
swift_container_updater
swift_object_auditor
swift_object_expirer
swift_object_replicator
swift_object_server
swift_object_updater
swift_proxy
swift_rsync

So the swift containers restart because when recalculating the md5 over the
/var/lib/config-data/puppet-generated/swift folder we also include:
B.1) /etc/swift/backups/... which is a folder which over time collects backup of the ringfiles
B.2) /etc/swift/*.gz it seems that the *.gz files seem to change over time

I see two potential fixes here:
B.3) We simply exclude rings and their backups from the tar commands:
EXCL="--exclude='*/etc/swift/backups/*' --exclude='*/etc/swift/*.gz'"
tar -c -f - /var/lib/config-data/${NAME} --mtime='1970-01-01' $EXCL | tar xO | sed '/^#.*HEADER.*/d'

B.4) We stop storing the rings under /etc/swift and we store them in some other place ?

I lack the necessary swift knowledge to add anything too useful here.

C) libvirt category:
nova_compute
nova_libvirt
nova_migration_target
nova_virtlogd

This one seems to be due to the fact that the /etc/libvirt/passwd.db file contains a timestamp or somehow
is just different after a redeploy:
[root@compute-1 nova_libvirt]# git diff cb2441bb1caf7572ccfd870561dcc29d7819ba04..0c7441f30926b111603ce4d4b60c6000fe49d290 .
diff --git a/puppet-generated/nova_libvirt/etc/libvirt/passwd.db b/puppet-generated/nova_libvirt/etc/libvirt/passwd.db
index 66e2024..6ffa492 100644
Binary files a/puppet-generated/nova_libvirt/etc/libvirt/passwd.db and b/puppet-generated/nova_libvirt/etc/libvirt/passwd.db differ

(I have not exactly understood why yet because in my quick test without tls a simple 'saslpasswd2 -d -a libvirt -u overcloud migration' did not change the md5 for me, but maybe I am missing something)

Excluding this file from the tar command is probably a bad idea because if you do indeed change the password of the migration user then you probably do want the libvirtd container to restart (actually you likely want only that one to restart, as opposed to all nova* containers on the compute which is what happens now)

Just to make sure that the above theories are all correct I used the following tar commands:
EXCL="--exclude='*/etc/swift/backups/*' --exclude='*/etc/swift/*.gz' --exclude '*/etc/libvirt/passwd.db'"
tar -c -f - /var/lib/config-data/${NAME} --mtime='1970-01-01' $EXCL | tar xO | sed '/^#.*HEADER.*/d' | md5sum | awk '{print $1}' > /var/lib/config-data/${NAME}.md5sum
tar -c -f - /var/lib/config-data/puppet-generated/${NAME} --mtime='1970-01-01' $EXCL | tar xO | sed '/^#.*HEADER.*/d' | md5sum | awk '{print $1}' > /var/lib/config-data/puppet-generated/${NAME}.md5sum

And I observed no spurious restarts.

tags: added: idempotency
Changed in tripleo:
importance: High → Critical
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/590008

Revision history for this message
Oliver Walsh (owalsh) wrote :

Perhaps we should restore https://review.openstack.org/562876

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to puppet-tripleo (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/590780

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (master)

Change abandoned by Martin Schuppert (<email address hidden>) on branch: master
Review: https://review.openstack.org/590780

Changed in tripleo:
milestone: rocky-rc1 → rocky-rc2
Changed in tripleo:
assignee: nobody → Michele Baldessari (michele)
Changed in tripleo:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)
Download full text (4.0 KiB)

Reviewed: https://review.openstack.org/590008
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=42c3f180514da6aed5410a26f17db32d6ac1a62f
Submitter: Zuul
Branch: master

commit 42c3f180514da6aed5410a26f17db32d6ac1a62f
Author: Michele Baldessari <email address hidden>
Date: Wed Aug 8 21:04:53 2018 +0200

    Make redeploy idempotent

    Rerunning the overcloud deploy command with no changes restarts a
    truckload of containers (first seen this via
    https://bugzilla.redhat.com/show_bug.cgi?id=1612960). So we really have
    three separate issues here. Below is the list of all the containers that
    may restart needlessly (at least what I have observed in my tests):
    A) cron category:
    ceilometer_agent_notification cinder_api cinder_api_cron cinder_scheduler
    heat_api heat_api_cfn heat_api_cron heat_engine keystone keystone_cron
    logrotate_crond nova_api nova_api_cron nova_conductor nova_consoleauth
    nova_metadata nova_scheduler nova_vnc_proxy openstack-cinder-volume-docker-0
    panko_api

    These end up being restarted because in the config volume for the container there is
    a cron file and cron files are generated with a timestamp inside:
    $ cat /var/lib/config-data/puppet-generated/keystone/var/spool/cron/keystone
    ...
     # HEADER: This file was autogenerated at 2018-08-07 11:44:57 +0000 by puppet.
    ...

    The timestamp is unfortunately hard coded into puppet in both the cron provider and the parsedfile
    provider:
    https://github.com/puppetlabs/puppet/blob/master/lib/puppet/provider/cron/crontab.rb#L127
    https://github.com/puppetlabs/puppet/blob/master/lib/puppet/provider/parsedfile.rb#L104

    We fix this by repiping tar into 'tar xO' and grepping away any line
    that starts with # HEADER.

    B) swift category:
    swift_account_auditor swift_account_reaper swift_account_replicator
    swift_account_server swift_container_auditor swift_container_replicator
    swift_container_server swift_container_updater swift_object_auditor
    swift_object_expirer swift_object_replicator swift_object_server
    swift_object_updater swift_proxy swift_rsync

    So the swift containers restart because when recalculating the md5 over the
    /var/lib/config-data/puppet-generated/swift folder we also include:
    B.1) /etc/swift/backups/... which is a folder which over time collects backup of the ringfiles
    B.2) /etc/swift/*.gz it seems that the *.gz files seem to change over time

    We just add a parameter to the tar command to exclude those files as
    we do not need to trigger a restart if those files change.
    --exclude='*/etc/swift/backups/*' --exclude='*/etc/swift/*.gz'

    C) libvirt category:
    nova_compute nova_libvirt nova_migration_target nova_virtlogd

    This one seems to be due to the fact that the /etc/libvirt/passwd.db file contains a timestamp and
    even when we disable a user and passwd.db does not exist, it gets
    created:
    [root@compute-1 nova_libvirt]# git diff cb2441bb1caf7572ccfd870561dcc29d7819ba04..0c7441f30926b111603ce4d4b60c6000fe49d290 .

    passwd.db changes do not need to trig...

Read more...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/599290

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/599292

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/rocky)
Download full text (4.1 KiB)

Reviewed: https://review.openstack.org/599290
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=2aa664e45878bba930b4c9eecb48532fb19ec7b7
Submitter: Zuul
Branch: stable/rocky

commit 2aa664e45878bba930b4c9eecb48532fb19ec7b7
Author: Michele Baldessari <email address hidden>
Date: Wed Aug 8 21:04:53 2018 +0200

    Make redeploy idempotent

    Rerunning the overcloud deploy command with no changes restarts a
    truckload of containers (first seen this via
    https://bugzilla.redhat.com/show_bug.cgi?id=1612960). So we really have
    three separate issues here. Below is the list of all the containers that
    may restart needlessly (at least what I have observed in my tests):
    A) cron category:
    ceilometer_agent_notification cinder_api cinder_api_cron cinder_scheduler
    heat_api heat_api_cfn heat_api_cron heat_engine keystone keystone_cron
    logrotate_crond nova_api nova_api_cron nova_conductor nova_consoleauth
    nova_metadata nova_scheduler nova_vnc_proxy openstack-cinder-volume-docker-0
    panko_api

    These end up being restarted because in the config volume for the container there is
    a cron file and cron files are generated with a timestamp inside:
    $ cat /var/lib/config-data/puppet-generated/keystone/var/spool/cron/keystone
    ...
     # HEADER: This file was autogenerated at 2018-08-07 11:44:57 +0000 by puppet.
    ...

    The timestamp is unfortunately hard coded into puppet in both the cron provider and the parsedfile
    provider:
    https://github.com/puppetlabs/puppet/blob/master/lib/puppet/provider/cron/crontab.rb#L127
    https://github.com/puppetlabs/puppet/blob/master/lib/puppet/provider/parsedfile.rb#L104

    We fix this by repiping tar into 'tar xO' and grepping away any line
    that starts with # HEADER.

    B) swift category:
    swift_account_auditor swift_account_reaper swift_account_replicator
    swift_account_server swift_container_auditor swift_container_replicator
    swift_container_server swift_container_updater swift_object_auditor
    swift_object_expirer swift_object_replicator swift_object_server
    swift_object_updater swift_proxy swift_rsync

    So the swift containers restart because when recalculating the md5 over the
    /var/lib/config-data/puppet-generated/swift folder we also include:
    B.1) /etc/swift/backups/... which is a folder which over time collects backup of the ringfiles
    B.2) /etc/swift/*.gz it seems that the *.gz files seem to change over time

    We just add a parameter to the tar command to exclude those files as
    we do not need to trigger a restart if those files change.
    --exclude='*/etc/swift/backups/*' --exclude='*/etc/swift/*.gz'

    C) libvirt category:
    nova_compute nova_libvirt nova_migration_target nova_virtlogd

    This one seems to be due to the fact that the /etc/libvirt/passwd.db file contains a timestamp and
    even when we disable a user and passwd.db does not exist, it gets
    created:
    [root@compute-1 nova_libvirt]# git diff cb2441bb1caf7572ccfd870561dcc29d7819ba04..0c7441f30926b111603ce4d4b60c6000fe49d290 .

    passwd.db changes do not need t...

Read more...

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/queens)
Download full text (4.2 KiB)

Reviewed: https://review.openstack.org/599292
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=be05b9bc375aabd1936d238733b7dad09f042e21
Submitter: Zuul
Branch: stable/queens

commit be05b9bc375aabd1936d238733b7dad09f042e21
Author: Michele Baldessari <email address hidden>
Date: Wed Aug 8 21:04:53 2018 +0200

    Make redeploy idempotent

    Rerunning the overcloud deploy command with no changes restarts a
    truckload of containers (first seen this via
    https://bugzilla.redhat.com/show_bug.cgi?id=1612960). So we really have
    three separate issues here. Below is the list of all the containers that
    may restart needlessly (at least what I have observed in my tests):
    A) cron category:
    ceilometer_agent_notification cinder_api cinder_api_cron cinder_scheduler
    heat_api heat_api_cfn heat_api_cron heat_engine keystone keystone_cron
    logrotate_crond nova_api nova_api_cron nova_conductor nova_consoleauth
    nova_metadata nova_scheduler nova_vnc_proxy openstack-cinder-volume-docker-0
    panko_api

    These end up being restarted because in the config volume for the container there is
    a cron file and cron files are generated with a timestamp inside:
    $ cat /var/lib/config-data/puppet-generated/keystone/var/spool/cron/keystone
    ...
     # HEADER: This file was autogenerated at 2018-08-07 11:44:57 +0000 by puppet.
    ...

    The timestamp is unfortunately hard coded into puppet in both the cron provider and the parsedfile
    provider:
    https://github.com/puppetlabs/puppet/blob/master/lib/puppet/provider/cron/crontab.rb#L127
    https://github.com/puppetlabs/puppet/blob/master/lib/puppet/provider/parsedfile.rb#L104

    We fix this by repiping tar into 'tar xO' and grepping away any line
    that starts with # HEADER.

    B) swift category:
    swift_account_auditor swift_account_reaper swift_account_replicator
    swift_account_server swift_container_auditor swift_container_replicator
    swift_container_server swift_container_updater swift_object_auditor
    swift_object_expirer swift_object_replicator swift_object_server
    swift_object_updater swift_proxy swift_rsync

    So the swift containers restart because when recalculating the md5 over the
    /var/lib/config-data/puppet-generated/swift folder we also include:
    B.1) /etc/swift/backups/... which is a folder which over time collects backup of the ringfiles
    B.2) /etc/swift/*.gz it seems that the *.gz files seem to change over time

    We just add a parameter to the tar command to exclude those files as
    we do not need to trigger a restart if those files change.
    --exclude='*/etc/swift/backups/*' --exclude='*/etc/swift/*.gz'

    C) libvirt category:
    nova_compute nova_libvirt nova_migration_target nova_virtlogd

    This one seems to be due to the fact that the /etc/libvirt/passwd.db file contains a timestamp and
    even when we disable a user and passwd.db does not exist, it gets
    created:
    [root@compute-1 nova_libvirt]# git diff cb2441bb1caf7572ccfd870561dcc29d7819ba04..0c7441f30926b111603ce4d4b60c6000fe49d290 .

    passwd.db changes do not need ...

Read more...

tags: added: in-stable-queens
Changed in tripleo:
milestone: rocky-rc2 → stein-1
Changed in tripleo:
milestone: stein-1 → stein-2
Changed in tripleo:
milestone: stein-2 → stein-3
Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/616116
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=b49629f085b5f04be999e0d5697a77e3791f7dcc
Submitter: Zuul
Branch: master

commit b49629f085b5f04be999e0d5697a77e3791f7dcc
Author: Christian Schwede <email address hidden>
Date: Wed Nov 7 08:53:48 2018 +0100

    Do not ignore Swift ring changes to trigger container restart

    Swift containers need to restart if the rings change. In
    non-containerized environments this is not required, because the Swift
    processes will reload the rings on any changes. However, this does not
    work within containers, thus a restart is required.

    This also restarts swift_copy_rings and swift_setup_srv container. This
    will copy the updated ring files and ensure new storage mount points are
    using the right permissions.

    Closes-Bug: 1802066
    Related-Bug: 1786065
    Change-Id: Ie2b9f003dc34f2f02a45293d06d6a40c8d5ed8ff

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/713415

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by Bogdan Dobrelya (bogdando) (<email address hidden>) on branch: master
Review: https://review.opendev.org/713415

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Bogdan Dobrelya (bogdando) (<email address hidden>) on branch: master
Review: https://review.opendev.org/713415
Reason: thanks for testing this!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/713415
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=9bc664090741d43fd49857230238fc345bf8d0c9
Submitter: Zuul
Branch: master

commit 9bc664090741d43fd49857230238fc345bf8d0c9
Author: Bogdan Dobrelya <email address hidden>
Date: Tue Mar 17 12:45:15 2020 +0100

    Fix swift containers idempotency

    The change https://review.opendev.org/#/c/616116 unwinded the swift
    part of the https://review.opendev.org/#/c/590008/ changes. So the
    contents of the /var/lib/config-data/swift_ringbuilder config volume
    used to be managed by container-puppet tool. That made swift containers
    always restarted because the puppet-generated rings are always changing
    on each deployment/update execution.

    Restore that unwinded change back and exclude swift rings from the
    management of container-puppet tooling. Instead make init containers
    swift_copy_rings and swift_setup_srv to be always executed (takes
    the same approach as in https://review.opendev.org/#/c/564798/).

    That also fixes the issue with swift_copy_rings seems never been
    executed - at least there is no traces of it in CI jobs logs for swift
    init containers.

    Change-Id: I23b469057e4c47c42601beb166f815ee71147c14
    Closes-Bug: #1867765
    Related-Bug: #1802066
    Related-Bug: #1786065
    Signed-off-by: Bogdan Dobrelya <email address hidden>

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.