tripleomaster/centos-binary-collectd:current-tripleo-updated-20180730001257 \"kolla_start\" Restarting

Bug #1784307 reported by Quique Llorente
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
tripleo | Fix Released | Critical | Gabriele Cerami |

Bug Description

It looks like the collectd container keeps restarting during the overcloud deployment on a no-op change, for scenario001:

http://logs.openstack.org/45/560445/99/check/tripleo-ci-centos-7-scenario001-multinode-oooq-container/bf6c481/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2018-07-30_02_16_20

2018-07-30 02:16:20 | TASK [Check for unhealthy containers after step 5] *****************************
2018-07-30 02:16:20 | task path: /var/lib/mistral/overcloud/common_deploy_steps_tasks.yaml:231
2018-07-30 02:16:20 | Monday 30 July 2018 02:11:13 +0000 (0:00:00.203) 0:42:35.574 ***********
2018-07-30 02:16:20 | ok: [centos-7-ovh-gra1-0001043637] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2018-07-30 02:16:20 |
2018-07-30 02:16:20 |
2018-07-30 02:16:20 | TASK [Debug output for task which failed: Check for unhealthy containers after step 5] ***
2018-07-30 02:16:20 | task path: /var/lib/mistral/overcloud/common_deploy_steps_tasks.yaml:272
2018-07-30 02:16:20 | Monday 30 July 2018 02:16:16 +0000 (0:05:03.627) 0:47:39.202 ***********
2018-07-30 02:16:20 | fatal: [centos-7-ovh-gra1-0001043637]: FAILED! => {
2018-07-30 02:16:20 | "failed_when_result": true,
2018-07-30 02:16:20 | "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": [
2018-07-30 02:16:20 | "6dfca3fff517 192.168.24.1:8787/tripleomaster/centos-binary-collectd:current-tripleo-updated-20180730001257 \"kolla_start\" 7 minutes ago Restarting (1) 16 seconds ago collectd"
2018-07-30 02:16:20 | ]
2018-07-30 02:16:20 | }
2018-07-30 02:16:20 |
2018-07-30 02:16:20 | NO MORE HOSTS LEFT *************************************************************
2018-07-30 02:16:20 |
2018-07-30 02:16:20 | PLAY RECAP *********************************************************************
2018-07-30 02:16:20 | centos-7-ovh-gra1-0001043637 : ok=211 changed=96 unreachable=0 failed=1
2018-07-30 02:16:20 | undercloud : ok=23 changed=12 unreachable=0 failed=0
2018-07-30 02:16:20 |
2018-07-30 02:16:20 | Monday 30 July 2018 02:16:16 +0000 (0:00:00.106) 0:47:39.309 ***********
2018-07-30 02:16:20 | ===============================================================================
2018-07-30 02:16:20 |
2018-07-30 02:16:20 | Ansible failed, check log at /var/lib/mistral/overcloud/ansible.log.
2018-07-30 02:16:20 | + status_code=1
2018-07-30 02:1
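
The failing task reports a `docker ps` line whose status column reads `Restarting (1)`. As a rough illustration of how such lines can be picked out of `docker ps` output, here is a minimal sketch. The function name `unstable_containers` and the regular expression are assumptions made for this example; this is not the filter the deploy-steps task actually uses.

```python
import re

# Status fragments treated as "not stable" in this sketch (an assumption,
# not the exact filter used by the deploy-steps task).
UNSTABLE = re.compile(r"\bRestarting \(\d+\)|\(unhealthy\)")


def unstable_containers(ps_lines):
    """Return (name, status) pairs for lines that look restarting or unhealthy."""
    found = []
    for line in ps_lines:
        match = UNSTABLE.search(line)
        if match:
            # In default `docker ps` output the container name is the last column.
            found.append((line.split()[-1], match.group(0)))
    return found


# The line reported by the failed task in the log above.
sample = [
    '6dfca3fff517 192.168.24.1:8787/tripleomaster/centos-binary-collectd:'
    'current-tripleo-updated-20180730001257 "kolla_start" 7 minutes ago '
    'Restarting (1) 16 seconds ago collectd',
]
print(unstable_containers(sample))
```

Run against the line from the log, this reports the `collectd` container with status `Restarting (1)`.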

Also, from the startup log of the collectd container on the overcloud:
http://logs.openstack.org/45/560445/99/check/tripleo-ci-centos-7-scenario001-multinode-oooq-container/bf6c481/logs/subnode-2/var/log/extra/docker/containers/collectd/stdout.log.txt.gz

++ cat /run_command
+ CMD='/usr/sbin/collectd -f'
+ ARGS=
+ [[ ! -n '' ]]
+ . kolla_extend_start
++ [[ ! -d /var/log/kolla/collectd ]]
+++ stat -c %a /var/log/kolla/collectd
++ [[ 2755 != \7\5\5 ]]
++ chmod 755 /var/log/kolla/collectd
+ echo 'Running command: '\''/usr/sbin/collectd -f'\'''
+ exec /usr/sbin/collectd -f
Running command: '/usr/sbin/collectd -f'
plugin_load: plugin "python" successfully loaded.
[2018-07-30 02:09:01] plugin_load: plugin "logfile" successfully loaded.
Error: Reading the config file failed!
Read the logs for details.
+ sudo -E kolla_set_configs
INFO:__main__:Loading config file at /var/lib/kolla/config_files/config.json
INFO:__main__:Validating config file
INFO:__main__:Kolla config strategy set to: COPY_ALWAYS
INFO:__main__:Copying service configuration files
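
In the trace above, `kolla_extend_start` compares the mode of `/var/log/kolla/collectd` (2755) with 755 and runs `chmod 755` before collectd starts. That step succeeds; the failure itself is the later `Error: Reading the config file failed!`. For readers unfamiliar with the shell idiom, a rough Python equivalent of the permission check might look like the sketch below. The helper name `ensure_mode` is hypothetical, and the sketch leaves out the directory-existence test that precedes the check in the trace.

```python
import os
import stat


def ensure_mode(path, wanted=0o755):
    """Set `path` to `wanted` if its permission bits differ; return True if changed."""
    # S_IMODE keeps the setuid/setgid/sticky bits, so 0o2755 differs from 0o755,
    # mirroring the `2755 != 755` comparison in the trace.
    current = stat.S_IMODE(os.stat(path).st_mode)
    if current != wanted:
        os.chmod(path, wanted)
        return True
    return False
```

Calling `ensure_mode("/var/log/kolla/collectd")` would correspond to the `stat -c %a` and `chmod 755` pair in the trace.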

Tags: alert ci
Changed in tripleo:
importance: Undecided → Critical
description: updated
Quique Llorente (quiquell) wrote :

It looks like the health check is fairly new: https://github.com/openstack/tripleo-heat-templates/commit/bd1d5d72caf25010e373f1ad2ed6ebc5aee96914

Maybe it needs some tuning.

Changed in tripleo:
assignee: Rafael Folco (rafaelfolco) → Gabriele Cerami (gcerami)
Quique Llorente (quiquell) wrote :
Changed in tripleo:
status: New → Triaged
Cédric Jeanneret (cjeanner) wrote :
Quique Llorente (quiquell) wrote :

Review that reverts the health check, in case it is needed:

https://review.openstack.org/#/c/587006/

Changed in tripleo:
status: Triaged → In Progress
Bogdan Dobrelya (bogdando) wrote :

The health check failure is correct, and I hope the revert will itself be reverted once the collectd container gets fixed in kolla.

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/587006
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=2f44dbd93809faa25cd774aee278e3723da1c2a1
Submitter: Zuul
Branch: master

commit 2f44dbd93809faa25cd774aee278e3723da1c2a1
Author: Quique Llorente <email address hidden>
Date: Mon Jul 30 13:27:43 2018 +0200

    Revert "Fix deploy health checks"

    This reverts commit bd1d5d72caf25010e373f1ad2ed6ebc5aee96914.

    Closes-Bug: #1784307
    Change-Id: Ia2c12d7455564b6297c5f0934812b10fabbdc914

Changed in tripleo:
status: In Progress → Fix Released
Changed in tripleo:
milestone: stein-1 → rocky-rc1
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 9.0.0.0rc1

This issue was fixed in the openstack/tripleo-heat-templates 9.0.0.0rc1 release candidate.
