Bug #1795411 “The scenario001(and 2)-multinode-oooq-container jo...” : Bugs : tripleo

wes hayutin (weshayutin) on 2018-10-01

tags:

added: alert promotion-blocker

wes hayutin (weshayutin) on 2018-10-01

Changed in tripleo:
milestone:	none → stein-1

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-10-01: Related fix proposed to tripleo-quickstart-extras (master)

#1

Related fix proposed to branch: master
Review: https://review.openstack.org/607021

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-10-01: Change abandoned on tripleo-quickstart-extras (master)

#2

Change abandoned by Alex Schultz (<email address hidden>) on branch: master
Review: https://review.openstack.org/607021
Reason: too much gets backed out and it won't work, let me try a different approach

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-10-01: Related fix proposed to puppet-tripleo (master)

#3

Related fix proposed to branch: master
Review: https://review.openstack.org/607052

Revision history for this message

Alex Schultz (alex-schultz) wrote on 2018-10-01:

#4

So to explain why the revert, I was reviewing the fail/success from https://review.openstack.org/#/c/560445/. Between Sep 30 5:36 AM and Sep 30 3:33 PM the scenarios broke. In reviewing the rpm differences, I found the following diffs:

GOOD:
ansible-role-container-registry-1.0.1-0.20180907005818.b33f893.el7.noarch
ansible-role-tripleo-modify-image-1.0.1-0.20180928180316.c9d0857.el7.noarch
puppet-tripleo-9.3.1-0.20180920012809.7edd775.el7.noarch
python2-tripleoclient-10.5.1-0.20180929200845.cb7289c.el7.noarch
python2-tripleoclient-heat-installer-10.5.1-0.20180929200845.cb7289c.el7.noarch

BAD:
ansible-role-container-registry-1.0.1-0.20180930082252.119f525.el7.noarch
ansible-role-tripleo-modify-image-1.0.1-0.20180930082746.6895693.el7.noarch
puppet-tripleo-9.3.1-0.20180930084020.013154e.el7.noarch
python2-tripleoclient-10.5.1-0.20180930083140.66b7763.el7.noarch
python2-tripleoclient-heat-installer-10.5.1-0.20180930083140.66b7763.el7.noarch

From these we can determine what merged in between the two runs that might have broken everything.

https://review.openstack.org/#/c/606073/ - upstream CI layout change (no effect)
https://review.openstack.org/#/c/606079/ - upstream CI layout change (no effect)
https://review.openstack.org/#/c/596275/ -
Fix postrotate to notify holders of rotated logs
https://review.openstack.org/#/c/602828/ - Make sure rhel-plugin-push.service is stopped after pacemaker stops
https://review.openstack.org/#/c/604325/ - Add 2h timeout when waiting for websocket messages on package_update

The rhel-plugin-push-service change (https://review.openstack.org/#/c/602828/) is likely not the cause because it affects pacemaker deployment and we're not even there yet.

The timeout change (https://review.openstack.org/#/c/604325/) only affects the update/upgrade workflows which are not being executed here.

That leaves https://review.openstack.org/#/c/596275/ as the possible cause of the issues. My thought is that a runaway signal from log rotation is affecting deployments. The cron runs every hour on the dot (plus a fudge factor). which may explain why we've actually seen some successful runs (like in promotions) mean while the upstream jobs are failing at times.

So to explain why the revert, I was reviewing the fail/success from https://review.openstack.org/#/c/560445/. Between Sep 30 5:36 AM and Sep 30 3:33 PM the scenarios broke.  In reviewing the rpm differences, I found the following diffs:

GOOD:
ansible-role-container-registry-1.0.1-0.20180907005818.b33f893.el7.noarch	
ansible-role-tripleo-modify-image-1.0.1-0.20180928180316.c9d0857.el7.noarch	
puppet-tripleo-9.3.1-0.20180920012809.7edd775.el7.noarch	
python2-tripleoclient-10.5.1-0.20180929200845.cb7289c.el7.noarch
python2-tripleoclient-heat-installer-10.5.1-0.20180929200845.cb7289c.el7.noarch	
	
BAD:
ansible-role-container-registry-1.0.1-0.20180930082252.119f525.el7.noarch
ansible-role-tripleo-modify-image-1.0.1-0.20180930082746.6895693.el7.noarch
puppet-tripleo-9.3.1-0.20180930084020.013154e.el7.noarch
python2-tripleoclient-10.5.1-0.20180930083140.66b7763.el7.noarch
python2-tripleoclient-heat-installer-10.5.1-0.20180930083140.66b7763.el7.noarch

From these we can determine what merged in between the two runs that might have broken everything.

https://review.openstack.org/#/c/606073/ - upstream CI layout change (no effect)
https://review.openstack.org/#/c/606079/ - upstream CI layout change (no effect)
https://review.openstack.org/#/c/596275/ - 
Fix postrotate to notify holders of rotated logs
https://review.openstack.org/#/c/602828/ - Make sure rhel-plugin-push.service is stopped after pacemaker stops
https://review.openstack.org/#/c/604325/ - Add 2h timeout when waiting for websocket messages on package_update

The rhel-plugin-push-service change (https://review.openstack.org/#/c/602828/) is likely not the cause because it affects pacemaker deployment and we're not even there yet.

The timeout change (https://review.openstack.org/#/c/604325/) only affects the update/upgrade workflows which are not being executed here.

That leaves https://review.openstack.org/#/c/596275/ as the possible cause of the issues.  My thought is that a runaway signal from log rotation is affecting deployments. The cron runs every hour on the dot (plus a fudge factor). which may explain why we've actually seen some successful runs (like in promotions) mean while the upstream jobs are failing at times.

Revision history for this message

Alex Schultz (alex-schultz) wrote on 2018-10-01:

#5

The logrotate revert is being tested by https://review.openstack.org/#/c/603275/8

Revision history for this message

Alex Schultz (alex-schultz) wrote on 2018-10-01:

#6

Also see, http://logs.openstack.org/82/559182/6/check/tripleo-ci-centos-7-containers-multinode/c92e168/logs/undercloud/var/lib/mistral/overcloud/ansible.log.txt.gz#_2018-10-01_20_00_50_811 where it fails at 20:00 on the dot as well.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-10-02: Related fix merged to puppet-tripleo (master)

#7

Reviewed: https://review.openstack.org/607052
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=8a1ab7d865bdb44522ce5dceb9e99324f43a6967
Submitter: Zuul
Branch: master

commit 8a1ab7d865bdb44522ce5dceb9e99324f43a6967
Author: Alex Schultz <email address hidden>
Date: Mon Oct 1 20:55:58 2018 +0000

Revert "Fix postrotate to notify holders of rotated logs"

This reverts commit d37c74d6382690a05ba3ca10edd7f2acb0fbbb2e.

Change-Id: Ie9f5d3b6380caa6824ca940ca48ed0fcf6308608
Related-Bug: #1795411

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2018-10-02:

#8

So something CI infra related (like collect logs) is accessing /var/log/containers on CI nodes and is getting into a bad shape once it receives a SUGHUP from logrotate, IIUC.

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2018-10-02:

#9

Or was that a part of the deployment framework, like ansible, running something accessing /var/log/containers and got terminated that way...

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2018-10-02:

#10

I'm trying to reproduce it with http://logs.openstack.org/45/560445/151/check/tripleo-ci-centos-7-scenario002-multinode-oooq-container/cd439d5/logs/reproducer-quickstart.sh

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2018-10-02:

#11

It seems the root cause is mistral engine received HUP upon log rotation in the middle of deployment. F.e. here it happens for executor http://logs.openstack.org/45/560445/151/check/tripleo-ci-centos-7-scenario002-multinode-oooq-container/cd439d5/logs/undercloud/var/log/containers/mistral/executor.log.txt.gz#_2018-10-01_10_00_37_800

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2018-10-02:

#12

We need to check if mistral containers for the failed jobs had the fix https://review.openstack.org/#/c/605633/ (not sure if that fix is relevant tho)

tags:

added: workflows

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2018-10-02:

#13

In the local environment, I can see that logrotation does not well handled by heat engine as well.

(while creating heat stacks)
[zuul@undercloud ~]$ docker exec -it logrotate_crond logrotate -f /etc/logrotate-crond.conf
kill -HUP 49301
kill -HUP 50264
kill -HUP 51066
kill -HUP 51417
kill -HUP 52071
[zuul@undercloud ~]$ ps auxf | grep -e 49301 -e 50264 -e 51066 -e 51417 -e 52071
42436 49301 1.9 1.1 308316 93152 ? Rs 13:41 0:18 | \_ /usr/bin/python2 /usr/bin/nova-scheduler
42418 50264 1.9 0.9 352484 79596 ? Rs 13:41 0:18 | \_ /usr/bin/python2 /usr/bin/heat-engine --config-file /usr/share/heat/heat-dist.conf --config-file /etc/heat/heat.conf
42435 51066 0.7 1.3 354336 109812 ? Ss 13:41 0:07 | \_ /usr/bin/python2 /usr/bin/neutron-server --config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-server --log-file=/var/log/neutron/server.log
42415 51417 1.4 1.3 540096 106932 ? Rs 13:41 0:13 | \_ /usr/bin/python2 /usr/bin/glance-api --config-file /usr/share/glance/glance-api-dist.conf --config-file /etc/glance/glance-api.conf
42430 52071 2.2 1.2 427404 97216 ? Ss 13:41 0:21 | \_ /usr/bin/python2 /usr/bin/mistral-server --config-file=/etc/mistral/mistral.conf --log-file=/var/log/mistral/api.log --server=api

and the deployment log:
2018-10-02 13:56:41 | 2018-10-02 13:56:37Z [overcloud.ControllerServiceChain.ServiceChain.20]: CREATE_IN_PROGRESS state changed
2018-10-02 14:07:19 | 2018-10-02 13:5ERROR: Internal Error
2018-10-02 14:07:19 | END return value: 1

Now trying to catch mistral external deployments with SIGHUP the same way...

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2018-10-02:

#14

(note, the neutron server got HUP as I'm testing with the https://review.openstack.org/607052 revert in place, so it usese the old logrotation config)

Revision history for this message

Alex Schultz (alex-schultz) wrote on 2018-10-02:

#15

revert landed, closing this out for now

Changed in tripleo:
status:	Triaged → Fix Released

Revision history for this message

Thomas Herve (therve) wrote on 2018-10-02:

#16

Created https://review.openstack.org/#/c/607306/ for mistral.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-10-03: Fix proposed to puppet-tripleo (stable/queens)

#17

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/607510

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-10-03: Fix proposed to puppet-tripleo (stable/pike)

#18

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/607512

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2018-10-03:

#19

Related https://review.openstack.org/#/c/607491/, fixes the revert via chosing another rotation strategy

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-10-08: Fix proposed to puppet-tripleo (stable/rocky)

#20

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/608594

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-10-10: Fix merged to puppet-tripleo (master)

#21

Reviewed: https://review.openstack.org/607491
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=2b223de04b26210cc865c3f6061d8d139859f34c
Submitter: Zuul
Branch: master

commit 2b223de04b26210cc865c3f6061d8d139859f34c
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Oct 3 09:53:35 2018 +0200

Copytruncate containerized logrotate configuration

    This reimplements commit 67a7dc70f2885b7db2a42bc28c25ece0bbeba3e4.
    Copytruncate becomes a default for containerized logrotate. The
    solution based on signals processing goes away.

As long as key deployment framework components heat-engine and
mistral-engine do not tolerate SIGHUP copytruncate should be used.

There is more openstack services, like neutron-server, nova-scheduler
that cannot handle SIGHUP nicely yet.

Nor can we fall back to that predates the containerization of services
because of the following reasons:

    * We cannot and should not use the restart command in postrotate as it
      was before containerization of services. For that a container needs
      to be privileged and granted a docker socket bind-mount, which is a
      total security antipattern and defeats the very purpose of
      containerization. Things may change with future adoption of Podman
      and/or kubelet control plane though. If/when that happens, we might
      consider an option for postrotate to terminate a process with
      SIGTERM, to have the process instantly respawned via its systemd
      unit/kubelet restart policies.

    * Individual services' logrotate configs worth nothing, when still
      being handled by a central logrotation container running crond. And
      it needs to remain centralized as individual containers neither do
      run crond nor contain logrotate, nor lightweight containers following
      12-factors apps recommendations should do anything like that. Nor the
      host logrotate/crond can do rotation of logs for containers as we do/
      should not install required packages on the host, but only in
      containers. See also the spec [0] explaining the reasoning better.

    All of that makes copytruncate a global choice for logs rotation of
    containerized services as we just cannont be sure, if a service foo
    *really* does correct processing of SIGHUP. We leave that option for
    future implementation in the hope things get fixed eventually. As well
    as the aforementioned systemd/kublet option, or the option to provide
    stdout only logging [0] and let the logrotate thing go.

[0] https://review.openstack.org/#/c/462900

    Closes-Bug: #1795411
    Related-Bug: #1276694
    Change-Id: Ibdad7859a389d0ff37bbf7bfd9f4c521a05a5ea1
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Reviewed:  https://review.openstack.org/607491
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=2b223de04b26210cc865c3f6061d8d139859f34c
Submitter: Zuul
Branch:    master

commit 2b223de04b26210cc865c3f6061d8d139859f34c
Author: Bogdan Dobrelya <bdobreli@redhat.com>
Date:   Wed Oct 3 09:53:35 2018 +0200

Copytruncate containerized logrotate configuration
    
    This reimplements commit 67a7dc70f2885b7db2a42bc28c25ece0bbeba3e4.
    Copytruncate becomes a default for containerized logrotate. The
    solution based on signals processing goes away.
    
    As long as key deployment framework components heat-engine and
    mistral-engine do not tolerate SIGHUP copytruncate should be used.
    
    There is more openstack services, like neutron-server, nova-scheduler
    that cannot handle SIGHUP nicely yet.
    
    Nor can we fall back to that predates the containerization of services
    because of the following reasons:
    
    * We cannot and should not use the restart command in postrotate as it
      was before containerization of services. For that a container needs
      to be privileged and granted a docker socket bind-mount, which is a
      total security antipattern and defeats the very purpose of
      containerization. Things may change with future adoption of Podman
      and/or kubelet control plane though. If/when that happens, we might
      consider an option for postrotate to terminate a process with
      SIGTERM, to have the process instantly respawned via its systemd
      unit/kubelet restart policies.
    
    * Individual services' logrotate configs worth nothing, when still
      being handled by a central logrotation container running crond. And
      it needs to remain centralized as individual containers neither do
      run crond nor contain logrotate, nor lightweight containers following
      12-factors apps recommendations should do anything like that. Nor the
      host logrotate/crond can do rotation of logs for containers as we do/
      should not install required packages on the host, but only in
      containers. See also the spec [0] explaining the reasoning better.
    
    All of that makes copytruncate a global choice for logs rotation of
    containerized services as we just cannont be sure, if a service foo
    *really* does correct processing of SIGHUP. We leave that option for
    future implementation in the hope things get fixed eventually. As well
    as the aforementioned systemd/kublet option, or the option to provide
    stdout only logging [0] and let the logrotate thing go.
    
    [0] https://review.openstack.org/#/c/462900
    
    Closes-Bug: #1795411
    Related-Bug: #1276694
    Change-Id: Ibdad7859a389d0ff37bbf7bfd9f4c521a05a5ea1
    Signed-off-by: Bogdan Dobrelya <bdobreli@redhat.com>

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-10-14: Fix merged to puppet-tripleo (stable/rocky)

#22

Reviewed: https://review.openstack.org/608594
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=b3d4ca0b6d3c79ae1e97d115df5c48e6881614ca
Submitter: Zuul
Branch: stable/rocky

commit b3d4ca0b6d3c79ae1e97d115df5c48e6881614ca
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Oct 3 09:53:35 2018 +0200

Copytruncate containerized logrotate configuration

    This reimplements commit 67a7dc70f2885b7db2a42bc28c25ece0bbeba3e4.
    Copytruncate becomes a default for containerized logrotate. The
    solution based on signals processing goes away.

As long as key deployment framework components heat-engine and
mistral-engine do not tolerate SIGHUP copytruncate should be used.

There is more openstack services, like neutron-server, nova-scheduler
that cannot handle SIGHUP nicely yet.

Nor can we fall back to that predates the containerization of services
because of the following reasons:

    * We cannot and should not use the restart command in postrotate as it
      was before containerization of services. For that a container needs
      to be privileged and granted a docker socket bind-mount, which is a
      total security antipattern and defeats the very purpose of
      containerization. Things may change with future adoption of Podman
      and/or kubelet control plane though. If/when that happens, we might
      consider an option for postrotate to terminate a process with
      SIGTERM, to have the process instantly respawned via its systemd
      unit/kubelet restart policies.

    * Individual services' logrotate configs worth nothing, when still
      being handled by a central logrotation container running crond. And
      it needs to remain centralized as individual containers neither do
      run crond nor contain logrotate, nor lightweight containers following
      12-factors apps recommendations should do anything like that. Nor the
      host logrotate/crond can do rotation of logs for containers as we do/
      should not install required packages on the host, but only in
      containers. See also the spec [0] explaining the reasoning better.

    All of that makes copytruncate a global choice for logs rotation of
    containerized services as we just cannont be sure, if a service foo
    *really* does correct processing of SIGHUP. We leave that option for
    future implementation in the hope things get fixed eventually. As well
    as the aforementioned systemd/kublet option, or the option to provide
    stdout only logging [0] and let the logrotate thing go.

[0] https://review.openstack.org/#/c/462900

    Closes-Bug: #1795411
    Related-Bug: #1276694
    Change-Id: Ibdad7859a389d0ff37bbf7bfd9f4c521a05a5ea1
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 2b223de04b26210cc865c3f6061d8d139859f34c)

Reviewed:  https://review.openstack.org/608594
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=b3d4ca0b6d3c79ae1e97d115df5c48e6881614ca
Submitter: Zuul
Branch:    stable/rocky

commit b3d4ca0b6d3c79ae1e97d115df5c48e6881614ca
Author: Bogdan Dobrelya <bdobreli@redhat.com>
Date:   Wed Oct 3 09:53:35 2018 +0200

Copytruncate containerized logrotate configuration
    
    This reimplements commit 67a7dc70f2885b7db2a42bc28c25ece0bbeba3e4.
    Copytruncate becomes a default for containerized logrotate. The
    solution based on signals processing goes away.
    
    As long as key deployment framework components heat-engine and
    mistral-engine do not tolerate SIGHUP copytruncate should be used.
    
    There is more openstack services, like neutron-server, nova-scheduler
    that cannot handle SIGHUP nicely yet.
    
    Nor can we fall back to that predates the containerization of services
    because of the following reasons:
    
    * We cannot and should not use the restart command in postrotate as it
      was before containerization of services. For that a container needs
      to be privileged and granted a docker socket bind-mount, which is a
      total security antipattern and defeats the very purpose of
      containerization. Things may change with future adoption of Podman
      and/or kubelet control plane though. If/when that happens, we might
      consider an option for postrotate to terminate a process with
      SIGTERM, to have the process instantly respawned via its systemd
      unit/kubelet restart policies.
    
    * Individual services' logrotate configs worth nothing, when still
      being handled by a central logrotation container running crond. And
      it needs to remain centralized as individual containers neither do
      run crond nor contain logrotate, nor lightweight containers following
      12-factors apps recommendations should do anything like that. Nor the
      host logrotate/crond can do rotation of logs for containers as we do/
      should not install required packages on the host, but only in
      containers. See also the spec [0] explaining the reasoning better.
    
    All of that makes copytruncate a global choice for logs rotation of
    containerized services as we just cannont be sure, if a service foo
    *really* does correct processing of SIGHUP. We leave that option for
    future implementation in the hope things get fixed eventually. As well
    as the aforementioned systemd/kublet option, or the option to provide
    stdout only logging [0] and let the logrotate thing go.
    
    [0] https://review.openstack.org/#/c/462900
    
    Closes-Bug: #1795411
    Related-Bug: #1276694
    Change-Id: Ibdad7859a389d0ff37bbf7bfd9f4c521a05a5ea1
    Signed-off-by: Bogdan Dobrelya <bdobreli@redhat.com>
    (cherry picked from commit 2b223de04b26210cc865c3f6061d8d139859f34c)

tags:

added: in-stable-rocky

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-10-19: Fix merged to puppet-tripleo (stable/queens)

#23

Reviewed: https://review.openstack.org/607510
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=59a1e9a4ce8d23374f578ca0484979b8ea376a80
Submitter: Zuul
Branch: stable/queens

commit 59a1e9a4ce8d23374f578ca0484979b8ea376a80
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Oct 3 09:53:35 2018 +0200

Copytruncate containerized logrotate configuration

    This reimplements commit 67a7dc70f2885b7db2a42bc28c25ece0bbeba3e4.
    Copytruncate becomes a default for containerized logrotate. The
    solution based on signals processing goes away.

As long as key deployment framework components heat-engine and
mistral-engine do not tolerate SIGHUP copytruncate should be used.

There is more openstack services, like neutron-server, nova-scheduler
that cannot handle SIGHUP nicely yet.

Nor can we fall back to that predates the containerization of services
because of the following reasons:

    * We cannot and should not use the restart command in postrotate as it
      was before containerization of services. For that a container needs
      to be privileged and granted a docker socket bind-mount, which is a
      total security antipattern and defeats the very purpose of
      containerization. Things may change with future adoption of Podman
      and/or kubelet control plane though. If/when that happens, we might
      consider an option for postrotate to terminate a process with
      SIGTERM, to have the process instantly respawned via its systemd
      unit/kubelet restart policies.

    * Individual services' logrotate configs worth nothing, when still
      being handled by a central logrotation container running crond. And
      it needs to remain centralized as individual containers neither do
      run crond nor contain logrotate, nor lightweight containers following
      12-factors apps recommendations should do anything like that. Nor the
      host logrotate/crond can do rotation of logs for containers as we do/
      should not install required packages on the host, but only in
      containers. See also the spec [0] explaining the reasoning better.

    All of that makes copytruncate a global choice for logs rotation of
    containerized services as we just cannont be sure, if a service foo
    *really* does correct processing of SIGHUP. We leave that option for
    future implementation in the hope things get fixed eventually. As well
    as the aforementioned systemd/kublet option, or the option to provide
    stdout only logging [0] and let the logrotate thing go.

[0] https://review.openstack.org/#/c/462900

    Closes-Bug: #1795411
    Related-Bug: #1276694
    Change-Id: Ibdad7859a389d0ff37bbf7bfd9f4c521a05a5ea1
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 2b223de04b26210cc865c3f6061d8d139859f34c)

Reviewed:  https://review.openstack.org/607510
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=59a1e9a4ce8d23374f578ca0484979b8ea376a80
Submitter: Zuul
Branch:    stable/queens

commit 59a1e9a4ce8d23374f578ca0484979b8ea376a80
Author: Bogdan Dobrelya <bdobreli@redhat.com>
Date:   Wed Oct 3 09:53:35 2018 +0200

Copytruncate containerized logrotate configuration
    
    This reimplements commit 67a7dc70f2885b7db2a42bc28c25ece0bbeba3e4.
    Copytruncate becomes a default for containerized logrotate. The
    solution based on signals processing goes away.
    
    As long as key deployment framework components heat-engine and
    mistral-engine do not tolerate SIGHUP copytruncate should be used.
    
    There is more openstack services, like neutron-server, nova-scheduler
    that cannot handle SIGHUP nicely yet.
    
    Nor can we fall back to that predates the containerization of services
    because of the following reasons:
    
    * We cannot and should not use the restart command in postrotate as it
      was before containerization of services. For that a container needs
      to be privileged and granted a docker socket bind-mount, which is a
      total security antipattern and defeats the very purpose of
      containerization. Things may change with future adoption of Podman
      and/or kubelet control plane though. If/when that happens, we might
      consider an option for postrotate to terminate a process with
      SIGTERM, to have the process instantly respawned via its systemd
      unit/kubelet restart policies.
    
    * Individual services' logrotate configs worth nothing, when still
      being handled by a central logrotation container running crond. And
      it needs to remain centralized as individual containers neither do
      run crond nor contain logrotate, nor lightweight containers following
      12-factors apps recommendations should do anything like that. Nor the
      host logrotate/crond can do rotation of logs for containers as we do/
      should not install required packages on the host, but only in
      containers. See also the spec [0] explaining the reasoning better.
    
    All of that makes copytruncate a global choice for logs rotation of
    containerized services as we just cannont be sure, if a service foo
    *really* does correct processing of SIGHUP. We leave that option for
    future implementation in the hope things get fixed eventually. As well
    as the aforementioned systemd/kublet option, or the option to provide
    stdout only logging [0] and let the logrotate thing go.
    
    [0] https://review.openstack.org/#/c/462900
    
    Closes-Bug: #1795411
    Related-Bug: #1276694
    Change-Id: Ibdad7859a389d0ff37bbf7bfd9f4c521a05a5ea1
    Signed-off-by: Bogdan Dobrelya <bdobreli@redhat.com>
    (cherry picked from commit 2b223de04b26210cc865c3f6061d8d139859f34c)

tags:

added: in-stable-queens

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-10-20: Fix merged to puppet-tripleo (stable/pike)

#24

Reviewed: https://review.openstack.org/607512
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=f62eb706f26a30e187356dd05d65c4348e007a20
Submitter: Zuul
Branch: stable/pike

commit f62eb706f26a30e187356dd05d65c4348e007a20
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Oct 3 09:53:35 2018 +0200

Copytruncate containerized logrotate configuration

    This reimplements commit 67a7dc70f2885b7db2a42bc28c25ece0bbeba3e4.
    Copytruncate becomes a default for containerized logrotate. The
    solution based on signals processing goes away.

As long as key deployment framework components heat-engine and
mistral-engine do not tolerate SIGHUP copytruncate should be used.

There is more openstack services, like neutron-server, nova-scheduler
that cannot handle SIGHUP nicely yet.

Nor can we fall back to that predates the containerization of services
because of the following reasons:

    * We cannot and should not use the restart command in postrotate as it
      was before containerization of services. For that a container needs
      to be privileged and granted a docker socket bind-mount, which is a
      total security antipattern and defeats the very purpose of
      containerization. Things may change with future adoption of Podman
      and/or kubelet control plane though. If/when that happens, we might
      consider an option for postrotate to terminate a process with
      SIGTERM, to have the process instantly respawned via its systemd
      unit/kubelet restart policies.

    * Individual services' logrotate configs worth nothing, when still
      being handled by a central logrotation container running crond. And
      it needs to remain centralized as individual containers neither do
      run crond nor contain logrotate, nor lightweight containers following
      12-factors apps recommendations should do anything like that. Nor the
      host logrotate/crond can do rotation of logs for containers as we do/
      should not install required packages on the host, but only in
      containers. See also the spec [0] explaining the reasoning better.

    All of that makes copytruncate a global choice for logs rotation of
    containerized services as we just cannont be sure, if a service foo
    *really* does correct processing of SIGHUP. We leave that option for
    future implementation in the hope things get fixed eventually. As well
    as the aforementioned systemd/kublet option, or the option to provide
    stdout only logging [0] and let the logrotate thing go.

[0] https://review.openstack.org/#/c/462900

    Closes-Bug: #1795411
    Related-Bug: #1276694
    Change-Id: Ibdad7859a389d0ff37bbf7bfd9f4c521a05a5ea1
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 2b223de04b26210cc865c3f6061d8d139859f34c)

Reviewed:  https://review.openstack.org/607512
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=f62eb706f26a30e187356dd05d65c4348e007a20
Submitter: Zuul
Branch:    stable/pike

commit f62eb706f26a30e187356dd05d65c4348e007a20
Author: Bogdan Dobrelya <bdobreli@redhat.com>
Date:   Wed Oct 3 09:53:35 2018 +0200

Copytruncate containerized logrotate configuration
    
    This reimplements commit 67a7dc70f2885b7db2a42bc28c25ece0bbeba3e4.
    Copytruncate becomes a default for containerized logrotate. The
    solution based on signals processing goes away.
    
    As long as key deployment framework components heat-engine and
    mistral-engine do not tolerate SIGHUP copytruncate should be used.
    
    There is more openstack services, like neutron-server, nova-scheduler
    that cannot handle SIGHUP nicely yet.
    
    Nor can we fall back to that predates the containerization of services
    because of the following reasons:
    
    * We cannot and should not use the restart command in postrotate as it
      was before containerization of services. For that a container needs
      to be privileged and granted a docker socket bind-mount, which is a
      total security antipattern and defeats the very purpose of
      containerization. Things may change with future adoption of Podman
      and/or kubelet control plane though. If/when that happens, we might
      consider an option for postrotate to terminate a process with
      SIGTERM, to have the process instantly respawned via its systemd
      unit/kubelet restart policies.
    
    * Individual services' logrotate configs worth nothing, when still
      being handled by a central logrotation container running crond. And
      it needs to remain centralized as individual containers neither do
      run crond nor contain logrotate, nor lightweight containers following
      12-factors apps recommendations should do anything like that. Nor the
      host logrotate/crond can do rotation of logs for containers as we do/
      should not install required packages on the host, but only in
      containers. See also the spec [0] explaining the reasoning better.
    
    All of that makes copytruncate a global choice for logs rotation of
    containerized services as we just cannont be sure, if a service foo
    *really* does correct processing of SIGHUP. We leave that option for
    future implementation in the hope things get fixed eventually. As well
    as the aforementioned systemd/kublet option, or the option to provide
    stdout only logging [0] and let the logrotate thing go.
    
    [0] https://review.openstack.org/#/c/462900
    
    Closes-Bug: #1795411
    Related-Bug: #1276694
    Change-Id: Ibdad7859a389d0ff37bbf7bfd9f4c521a05a5ea1
    Signed-off-by: Bogdan Dobrelya <bdobreli@redhat.com>
    (cherry picked from commit 2b223de04b26210cc865c3f6061d8d139859f34c)

tags:

added: in-stable-pike

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-11-28: Fix included in openstack/puppet-tripleo 10.1.0

#25

This issue was fixed in the openstack/puppet-tripleo 10.1.0 release.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-02-15: Fix included in openstack/puppet-tripleo 8.4.0

#26

This issue was fixed in the openstack/puppet-tripleo 8.4.0 release.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-03-14: Fix included in openstack/puppet-tripleo 9.4.0

#27

This issue was fixed in the openstack/puppet-tripleo 9.4.0 release.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-05-14: Fix included in openstack/puppet-tripleo 7.4.17

#28

This issue was fixed in the openstack/puppet-tripleo 7.4.17 release.

tripleo

The scenario001(and 2)-multinode-oooq-container job is failing during the overcloud deploy for "[Run tripleo-container-image-prepare]"

Bug Description

Duplicates of this bug

Other bug subscribers

Remote bug watches