The scenario001(and 2)-multinode-oooq-container job is failing during the overcloud deploy for "[Run tripleo-container-image-prepare]"

Bug #1795411 reported by Marios Andreou
18
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Fix Released
Critical
Marios Andreou

Bug Description

The scenario001(and 2)-multinode-oooq-container job is failing during the overcloud deploy for "[Run tripleo-container-image-prepare]" , the trace looks like:

  2018-10-01 03:03:47 | TASK [Write role data file] ****************************************************
  2018-10-01 03:03:47 | task path: /var/lib/mistral/overcloud/external_deploy_steps_tasks.yaml:109
  2018-10-01 03:03:47 | Monday 01 October 2018 02:55:48 +0000 (0:00:00.785) 0:07:52.951 ********
  2018-10-01 03:03:47 | changed: [undercloud] => {"changed": true, "checksum": "66782a847fd0aeae7e3d2aeb7d66765aba8fd4de", "dest": "/tmp/ansible.PpI6o3-role-data", "gid": 42430, "group": "mistral", "md5sum": "5d99c6ab2e46d8bd2c9957f19c1a99f4", "mode": "0600", "owner": "mistral", "size": 155, "src": "/tmp/ansible-/ansible-tmp-1538362548.46-114656717379781/source", "state": "file", "uid": 42430}
  2018-10-01 03:03:47 |
  2018-10-01 03:03:47 | TASK [Run tripleo-container-image-prepare] *************************************
  2018-10-01 03:03:47 | task path: /var/lib/mistral/overcloud/external_deploy_steps_tasks.yaml:124
  2018-10-01 03:03:47 | Monday 01 October 2018 02:55:49 +0000 (0:00:00.816) 0:07:53.769 ********
  2018-10-01 03:03:47 |
  2018-10-01 03:03:47 | Ansible failed, check log at /var/lib/mistral/overcloud/ansible.log.
  2018-10-01 03:03:48 | + status_code=1
  2018-10-01 03:03:48 | + openstack stack list
  2018-10-01 03:03:48 | + grep -q overcloud
  2018-10-01 03:03:56 | + openstack stack list
  2018-10-01 03:03:56 | + grep -Eq '(CREATE|UPDATE)_COMPLETE'
  2018-10-01 03:04:01 | + openstack overcloud status
  2018-10-01 03:04:01 | + grep -Eq DEPLOY_SUCCESS
  2018-10-01 03:04:09 | + openstack overcloud failures
  2018-10-01 03:04:19 | Exception getting deployment failures: Ansible errors file not found at /var/lib/mistral/overcloud/ansible-errors.json
  2018-10-01 03:04:19 | + true
  2018-10-01 03:04:19 | + exit 1

Examples are scen1 @ [1,2] and scen2@ [3]. The external_deploy_steps_tasks:124 being pointed to in the trace looks like [4]

[1] http://logs.openstack.org/90/605090/6/gate/tripleo-ci-centos-7-scenario001-multinode-oooq-container/e5a2a81/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
[2] http://logs.openstack.org/79/606179/1/check/tripleo-ci-centos-7-scenario001-multinode-oooq-container/a7d1911/logs/undercloud/home/zuul/overcloud_deploy.log.gz
[3] http://logs.openstack.org/45/560445/151/check/tripleo-ci-centos-7-scenario002-multinode-oooq-container/cd439d5/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2018-10-01_10_03_05
[4] http://logs.openstack.org/45/560445/151/check/tripleo-ci-centos-7-scenario002-multinode-oooq-container/cd439d5/logs/undercloud/var/lib/mistral/overcloud/external_deploy_steps_tasks.yaml.txt.gz

wes hayutin (weshayutin)
tags: added: alert promotion-blocker
wes hayutin (weshayutin)
Changed in tripleo:
milestone: none → stein-1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/607021

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart-extras (master)

Change abandoned by Alex Schultz (<email address hidden>) on branch: master
Review: https://review.openstack.org/607021
Reason: too much gets backed out and it won't work, let me try a different approach

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to puppet-tripleo (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/607052

Revision history for this message
Alex Schultz (alex-schultz) wrote :

So to explain why the revert, I was reviewing the fail/success from https://review.openstack.org/#/c/560445/. Between Sep 30 5:36 AM and Sep 30 3:33 PM the scenarios broke. In reviewing the rpm differences, I found the following diffs:

GOOD:
ansible-role-container-registry-1.0.1-0.20180907005818.b33f893.el7.noarch
ansible-role-tripleo-modify-image-1.0.1-0.20180928180316.c9d0857.el7.noarch
puppet-tripleo-9.3.1-0.20180920012809.7edd775.el7.noarch
python2-tripleoclient-10.5.1-0.20180929200845.cb7289c.el7.noarch
python2-tripleoclient-heat-installer-10.5.1-0.20180929200845.cb7289c.el7.noarch

BAD:
ansible-role-container-registry-1.0.1-0.20180930082252.119f525.el7.noarch
ansible-role-tripleo-modify-image-1.0.1-0.20180930082746.6895693.el7.noarch
puppet-tripleo-9.3.1-0.20180930084020.013154e.el7.noarch
python2-tripleoclient-10.5.1-0.20180930083140.66b7763.el7.noarch
python2-tripleoclient-heat-installer-10.5.1-0.20180930083140.66b7763.el7.noarch

From these we can determine what merged in between the two runs that might have broken everything.

https://review.openstack.org/#/c/606073/ - upstream CI layout change (no effect)
https://review.openstack.org/#/c/606079/ - upstream CI layout change (no effect)
https://review.openstack.org/#/c/596275/ -
Fix postrotate to notify holders of rotated logs
https://review.openstack.org/#/c/602828/ - Make sure rhel-plugin-push.service is stopped after pacemaker stops
https://review.openstack.org/#/c/604325/ - Add 2h timeout when waiting for websocket messages on package_update

The rhel-plugin-push-service change (https://review.openstack.org/#/c/602828/) is likely not the cause because it affects pacemaker deployment and we're not even there yet.

The timeout change (https://review.openstack.org/#/c/604325/) only affects the update/upgrade workflows which are not being executed here.

That leaves https://review.openstack.org/#/c/596275/ as the possible cause of the issues. My thought is that a runaway signal from log rotation is affecting deployments. The cron runs every hour on the dot (plus a fudge factor). which may explain why we've actually seen some successful runs (like in promotions) mean while the upstream jobs are failing at times.

Revision history for this message
Alex Schultz (alex-schultz) wrote :

The logrotate revert is being tested by https://review.openstack.org/#/c/603275/8

Revision history for this message
Alex Schultz (alex-schultz) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/607052
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=8a1ab7d865bdb44522ce5dceb9e99324f43a6967
Submitter: Zuul
Branch: master

commit 8a1ab7d865bdb44522ce5dceb9e99324f43a6967
Author: Alex Schultz <email address hidden>
Date: Mon Oct 1 20:55:58 2018 +0000

    Revert "Fix postrotate to notify holders of rotated logs"

    This reverts commit d37c74d6382690a05ba3ca10edd7f2acb0fbbb2e.

    Change-Id: Ie9f5d3b6380caa6824ca940ca48ed0fcf6308608
    Related-Bug: #1795411

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

So something CI infra related (like collect logs) is accessing /var/log/containers on CI nodes and is getting into a bad shape once it receives a SUGHUP from logrotate, IIUC.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Or was that a part of the deployment framework, like ansible, running something accessing /var/log/containers and got terminated that way...

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

It seems the root cause is mistral engine received HUP upon log rotation in the middle of deployment. F.e. here it happens for executor http://logs.openstack.org/45/560445/151/check/tripleo-ci-centos-7-scenario002-multinode-oooq-container/cd439d5/logs/undercloud/var/log/containers/mistral/executor.log.txt.gz#_2018-10-01_10_00_37_800

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

We need to check if mistral containers for the failed jobs had the fix https://review.openstack.org/#/c/605633/ (not sure if that fix is relevant tho)

tags: added: workflows
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

In the local environment, I can see that logrotation does not well handled by heat engine as well.

(while creating heat stacks)
[zuul@undercloud ~]$ docker exec -it logrotate_crond logrotate -f /etc/logrotate-crond.conf
 kill -HUP 49301
 kill -HUP 50264
 kill -HUP 51066
 kill -HUP 51417
 kill -HUP 52071
[zuul@undercloud ~]$ ps auxf | grep -e 49301 -e 50264 -e 51066 -e 51417 -e 52071
42436 49301 1.9 1.1 308316 93152 ? Rs 13:41 0:18 | \_ /usr/bin/python2 /usr/bin/nova-scheduler
42418 50264 1.9 0.9 352484 79596 ? Rs 13:41 0:18 | \_ /usr/bin/python2 /usr/bin/heat-engine --config-file /usr/share/heat/heat-dist.conf --config-file /etc/heat/heat.conf
42435 51066 0.7 1.3 354336 109812 ? Ss 13:41 0:07 | \_ /usr/bin/python2 /usr/bin/neutron-server --config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-server --log-file=/var/log/neutron/server.log
42415 51417 1.4 1.3 540096 106932 ? Rs 13:41 0:13 | \_ /usr/bin/python2 /usr/bin/glance-api --config-file /usr/share/glance/glance-api-dist.conf --config-file /etc/glance/glance-api.conf
42430 52071 2.2 1.2 427404 97216 ? Ss 13:41 0:21 | \_ /usr/bin/python2 /usr/bin/mistral-server --config-file=/etc/mistral/mistral.conf --log-file=/var/log/mistral/api.log --server=api

and the deployment log:
2018-10-02 13:56:41 | 2018-10-02 13:56:37Z [overcloud.ControllerServiceChain.ServiceChain.20]: CREATE_IN_PROGRESS state changed
2018-10-02 14:07:19 | 2018-10-02 13:5ERROR: Internal Error
2018-10-02 14:07:19 | END return value: 1

Now trying to catch mistral external deployments with SIGHUP the same way...

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

(note, the neutron server got HUP as I'm testing with the https://review.openstack.org/607052 revert in place, so it usese the old logrotation config)

Revision history for this message
Alex Schultz (alex-schultz) wrote :

revert landed, closing this out for now

Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
Thomas Herve (therve) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/607510

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/607512

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Related https://review.openstack.org/#/c/607491/, fixes the revert via chosing another rotation strategy

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/608594

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/607491
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=2b223de04b26210cc865c3f6061d8d139859f34c
Submitter: Zuul
Branch: master

commit 2b223de04b26210cc865c3f6061d8d139859f34c
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Oct 3 09:53:35 2018 +0200

    Copytruncate containerized logrotate configuration

    This reimplements commit 67a7dc70f2885b7db2a42bc28c25ece0bbeba3e4.
    Copytruncate becomes a default for containerized logrotate. The
    solution based on signals processing goes away.

    As long as key deployment framework components heat-engine and
    mistral-engine do not tolerate SIGHUP copytruncate should be used.

    There is more openstack services, like neutron-server, nova-scheduler
    that cannot handle SIGHUP nicely yet.

    Nor can we fall back to that predates the containerization of services
    because of the following reasons:

    * We cannot and should not use the restart command in postrotate as it
      was before containerization of services. For that a container needs
      to be privileged and granted a docker socket bind-mount, which is a
      total security antipattern and defeats the very purpose of
      containerization. Things may change with future adoption of Podman
      and/or kubelet control plane though. If/when that happens, we might
      consider an option for postrotate to terminate a process with
      SIGTERM, to have the process instantly respawned via its systemd
      unit/kubelet restart policies.

    * Individual services' logrotate configs worth nothing, when still
      being handled by a central logrotation container running crond. And
      it needs to remain centralized as individual containers neither do
      run crond nor contain logrotate, nor lightweight containers following
      12-factors apps recommendations should do anything like that. Nor the
      host logrotate/crond can do rotation of logs for containers as we do/
      should not install required packages on the host, but only in
      containers. See also the spec [0] explaining the reasoning better.

    All of that makes copytruncate a global choice for logs rotation of
    containerized services as we just cannont be sure, if a service foo
    *really* does correct processing of SIGHUP. We leave that option for
    future implementation in the hope things get fixed eventually. As well
    as the aforementioned systemd/kublet option, or the option to provide
    stdout only logging [0] and let the logrotate thing go.

    [0] https://review.openstack.org/#/c/462900

    Closes-Bug: #1795411
    Related-Bug: #1276694
    Change-Id: Ibdad7859a389d0ff37bbf7bfd9f4c521a05a5ea1
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/rocky)

Reviewed: https://review.openstack.org/608594
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=b3d4ca0b6d3c79ae1e97d115df5c48e6881614ca
Submitter: Zuul
Branch: stable/rocky

commit b3d4ca0b6d3c79ae1e97d115df5c48e6881614ca
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Oct 3 09:53:35 2018 +0200

    Copytruncate containerized logrotate configuration

    This reimplements commit 67a7dc70f2885b7db2a42bc28c25ece0bbeba3e4.
    Copytruncate becomes a default for containerized logrotate. The
    solution based on signals processing goes away.

    As long as key deployment framework components heat-engine and
    mistral-engine do not tolerate SIGHUP copytruncate should be used.

    There is more openstack services, like neutron-server, nova-scheduler
    that cannot handle SIGHUP nicely yet.

    Nor can we fall back to that predates the containerization of services
    because of the following reasons:

    * We cannot and should not use the restart command in postrotate as it
      was before containerization of services. For that a container needs
      to be privileged and granted a docker socket bind-mount, which is a
      total security antipattern and defeats the very purpose of
      containerization. Things may change with future adoption of Podman
      and/or kubelet control plane though. If/when that happens, we might
      consider an option for postrotate to terminate a process with
      SIGTERM, to have the process instantly respawned via its systemd
      unit/kubelet restart policies.

    * Individual services' logrotate configs worth nothing, when still
      being handled by a central logrotation container running crond. And
      it needs to remain centralized as individual containers neither do
      run crond nor contain logrotate, nor lightweight containers following
      12-factors apps recommendations should do anything like that. Nor the
      host logrotate/crond can do rotation of logs for containers as we do/
      should not install required packages on the host, but only in
      containers. See also the spec [0] explaining the reasoning better.

    All of that makes copytruncate a global choice for logs rotation of
    containerized services as we just cannont be sure, if a service foo
    *really* does correct processing of SIGHUP. We leave that option for
    future implementation in the hope things get fixed eventually. As well
    as the aforementioned systemd/kublet option, or the option to provide
    stdout only logging [0] and let the logrotate thing go.

    [0] https://review.openstack.org/#/c/462900

    Closes-Bug: #1795411
    Related-Bug: #1276694
    Change-Id: Ibdad7859a389d0ff37bbf7bfd9f4c521a05a5ea1
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 2b223de04b26210cc865c3f6061d8d139859f34c)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/queens)

Reviewed: https://review.openstack.org/607510
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=59a1e9a4ce8d23374f578ca0484979b8ea376a80
Submitter: Zuul
Branch: stable/queens

commit 59a1e9a4ce8d23374f578ca0484979b8ea376a80
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Oct 3 09:53:35 2018 +0200

    Copytruncate containerized logrotate configuration

    This reimplements commit 67a7dc70f2885b7db2a42bc28c25ece0bbeba3e4.
    Copytruncate becomes a default for containerized logrotate. The
    solution based on signals processing goes away.

    As long as key deployment framework components heat-engine and
    mistral-engine do not tolerate SIGHUP copytruncate should be used.

    There is more openstack services, like neutron-server, nova-scheduler
    that cannot handle SIGHUP nicely yet.

    Nor can we fall back to that predates the containerization of services
    because of the following reasons:

    * We cannot and should not use the restart command in postrotate as it
      was before containerization of services. For that a container needs
      to be privileged and granted a docker socket bind-mount, which is a
      total security antipattern and defeats the very purpose of
      containerization. Things may change with future adoption of Podman
      and/or kubelet control plane though. If/when that happens, we might
      consider an option for postrotate to terminate a process with
      SIGTERM, to have the process instantly respawned via its systemd
      unit/kubelet restart policies.

    * Individual services' logrotate configs worth nothing, when still
      being handled by a central logrotation container running crond. And
      it needs to remain centralized as individual containers neither do
      run crond nor contain logrotate, nor lightweight containers following
      12-factors apps recommendations should do anything like that. Nor the
      host logrotate/crond can do rotation of logs for containers as we do/
      should not install required packages on the host, but only in
      containers. See also the spec [0] explaining the reasoning better.

    All of that makes copytruncate a global choice for logs rotation of
    containerized services as we just cannont be sure, if a service foo
    *really* does correct processing of SIGHUP. We leave that option for
    future implementation in the hope things get fixed eventually. As well
    as the aforementioned systemd/kublet option, or the option to provide
    stdout only logging [0] and let the logrotate thing go.

    [0] https://review.openstack.org/#/c/462900

    Closes-Bug: #1795411
    Related-Bug: #1276694
    Change-Id: Ibdad7859a389d0ff37bbf7bfd9f4c521a05a5ea1
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 2b223de04b26210cc865c3f6061d8d139859f34c)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/pike)

Reviewed: https://review.openstack.org/607512
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=f62eb706f26a30e187356dd05d65c4348e007a20
Submitter: Zuul
Branch: stable/pike

commit f62eb706f26a30e187356dd05d65c4348e007a20
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Oct 3 09:53:35 2018 +0200

    Copytruncate containerized logrotate configuration

    This reimplements commit 67a7dc70f2885b7db2a42bc28c25ece0bbeba3e4.
    Copytruncate becomes a default for containerized logrotate. The
    solution based on signals processing goes away.

    As long as key deployment framework components heat-engine and
    mistral-engine do not tolerate SIGHUP copytruncate should be used.

    There is more openstack services, like neutron-server, nova-scheduler
    that cannot handle SIGHUP nicely yet.

    Nor can we fall back to that predates the containerization of services
    because of the following reasons:

    * We cannot and should not use the restart command in postrotate as it
      was before containerization of services. For that a container needs
      to be privileged and granted a docker socket bind-mount, which is a
      total security antipattern and defeats the very purpose of
      containerization. Things may change with future adoption of Podman
      and/or kubelet control plane though. If/when that happens, we might
      consider an option for postrotate to terminate a process with
      SIGTERM, to have the process instantly respawned via its systemd
      unit/kubelet restart policies.

    * Individual services' logrotate configs worth nothing, when still
      being handled by a central logrotation container running crond. And
      it needs to remain centralized as individual containers neither do
      run crond nor contain logrotate, nor lightweight containers following
      12-factors apps recommendations should do anything like that. Nor the
      host logrotate/crond can do rotation of logs for containers as we do/
      should not install required packages on the host, but only in
      containers. See also the spec [0] explaining the reasoning better.

    All of that makes copytruncate a global choice for logs rotation of
    containerized services as we just cannont be sure, if a service foo
    *really* does correct processing of SIGHUP. We leave that option for
    future implementation in the hope things get fixed eventually. As well
    as the aforementioned systemd/kublet option, or the option to provide
    stdout only logging [0] and let the logrotate thing go.

    [0] https://review.openstack.org/#/c/462900

    Closes-Bug: #1795411
    Related-Bug: #1276694
    Change-Id: Ibdad7859a389d0ff37bbf7bfd9f4c521a05a5ea1
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 2b223de04b26210cc865c3f6061d8d139859f34c)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 10.1.0

This issue was fixed in the openstack/puppet-tripleo 10.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 8.4.0

This issue was fixed in the openstack/puppet-tripleo 8.4.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 9.4.0

This issue was fixed in the openstack/puppet-tripleo 9.4.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 7.4.17

This issue was fixed in the openstack/puppet-tripleo 7.4.17 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.