container health check fails in step 5 on centos-binary-nova-api

Bug #1782598 reported by wes hayutin
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
tripleo | Fix Released | Critical | Oliver Walsh |

Bug Description

http://logs.openstack.org/55/573255/4/gate/tripleo-ci-centos-7-containers-multinode/91fb9e6/logs/undercloud/home/zuul/undercloud_install.log.txt.gz#_2018-07-19_14_42_21

2018-07-19 14:42:20 | TASK [Check for unhealthy containers after step 5] *****************************
2018-07-19 14:42:21 | ok: [undercloud]
2018-07-19 14:42:21 |
2018-07-19 14:42:21 | TASK [Debug output for task which failed: Check for unhealthy containers after step 5] ***
2018-07-19 14:42:21 | fatal: [undercloud]: FAILED! => {
2018-07-19 14:42:21 | "failed_when_result": true,
2018-07-19 14:42:21 | "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": [
2018-07-19 14:42:21 | "5b15953d850a 192.168.24.1:8787/tripleomaster/centos-binary-nova-api:current-tripleo-updated-20180719132959 \"kolla_start\" 2 minutes ago Up 2 minutes (unhealthy) nova_api"
2018-07-19 14:42:21 | ]
2018-07-19 14:42:21 | }
2018-07-19 14:42:21 |
2018-07-19 14:42:21 | NO MORE HOSTS LEFT *************************************************************
2018-07-19 14:42:21 |
2018-07-19 14:42:21 | PLAY RECAP *********************************************************************
2018-07-19 14:42:21 | undercloud : ok=215 changed=70 unreachable=0 failed=1

wes hayutin (weshayutin) wrote :

Elastic Recheck Query

build_name: *tripleo-ci* AND build_status: FAILURE AND message: "Up 2 minutes (unhealthy)" AND message: "centos-binary-nova-api:current-tripleo-updated"

Changed in tripleo:
assignee: nobody → Oliver Walsh (owalsh)
tags: added: alert
tags: added: promotion-blocker

Oliver Walsh (owalsh) wrote :

I don't see any of the healthcheck requests actually failing - http://logs.openstack.org/55/573255/4/gate/tripleo-ci-centos-7-containers-multinode/91fb9e6/logs/undercloud/var/log/containers/httpd/nova-api/nova_api_wsgi_access.log.txt.gz

Defaults for health checks are:
interval=30s
timeout=30s
retries=3

But the health check requests are irregularly spaced for the first few minutes, roughly 40s apart rather than the expected 30s:

192.168.24.1 - - [19/Jul/2018:14:40:09 +0000] "GET / HTTP/1.1" 200 417 "-" "curl-healthcheck"
192.168.24.1 - - [19/Jul/2018:14:40:49 +0000] "GET / HTTP/1.1" 200 417 "-" "curl-healthcheck"
192.168.24.1 - - [19/Jul/2018:14:41:30 +0000] "GET / HTTP/1.1" 200 417 "-" "curl-healthcheck"
192.168.24.1 - - [19/Jul/2018:14:42:07 +0000] "GET /v2.1/os-services HTTP/1.1" 200 206 "-" "python-novaclient"
192.168.24.1 - - [19/Jul/2018:14:42:10 +0000] "GET / HTTP/1.1" 200 417 "-" "curl-healthcheck"
192.168.24.1 - - [19/Jul/2018:14:42:50 +0000] "GET / HTTP/1.1" 200 417 "-" "curl-healthcheck"
192.168.24.1 - - [19/Jul/2018:14:43:20 +0000] "GET / HTTP/1.1" 200 417 "-" "curl-healthcheck"
192.168.24.1 - - [19/Jul/2018:14:43:50 +0000] "GET / HTTP/1.1" 200 417 "-" "curl-healthcheck"
192.168.24.1 - - [19/Jul/2018:14:44:20 +0000] "GET / HTTP/1.1" 200 417 "-" "curl-healthcheck"
192.168.24.1 - - [19/Jul/2018:14:44:50 +0000] "GET / HTTP/1.1" 200 417 "-" "curl-healthcheck"
192.168.24.1 - - [19/Jul/2018:14:45:21 +0000] "GET / HTTP/1.1" 200 417 "-" "curl-healthcheck"
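To make the spacing concrete, here is a minimal Python sketch that computes the gaps between consecutive health check requests in the excerpt above. The function name `healthcheck_gaps` is hypothetical and not part of any TripleO tooling; the sample lines are copied from the access log as quoted.

```python
import re
from datetime import datetime

# Access log lines copied from the excerpt in this comment.
LOG = '''
192.168.24.1 - - [19/Jul/2018:14:40:09 +0000] "GET / HTTP/1.1" 200 417 "-" "curl-healthcheck"
192.168.24.1 - - [19/Jul/2018:14:40:49 +0000] "GET / HTTP/1.1" 200 417 "-" "curl-healthcheck"
192.168.24.1 - - [19/Jul/2018:14:41:30 +0000] "GET / HTTP/1.1" 200 417 "-" "curl-healthcheck"
192.168.24.1 - - [19/Jul/2018:14:42:07 +0000] "GET /v2.1/os-services HTTP/1.1" 200 206 "-" "python-novaclient"
192.168.24.1 - - [19/Jul/2018:14:42:10 +0000] "GET / HTTP/1.1" 200 417 "-" "curl-healthcheck"
192.168.24.1 - - [19/Jul/2018:14:42:50 +0000] "GET / HTTP/1.1" 200 417 "-" "curl-healthcheck"
192.168.24.1 - - [19/Jul/2018:14:43:20 +0000] "GET / HTTP/1.1" 200 417 "-" "curl-healthcheck"
192.168.24.1 - - [19/Jul/2018:14:43:50 +0000] "GET / HTTP/1.1" 200 417 "-" "curl-healthcheck"
192.168.24.1 - - [19/Jul/2018:14:44:20 +0000] "GET / HTTP/1.1" 200 417 "-" "curl-healthcheck"
192.168.24.1 - - [19/Jul/2018:14:44:50 +0000] "GET / HTTP/1.1" 200 417 "-" "curl-healthcheck"
192.168.24.1 - - [19/Jul/2018:14:45:21 +0000] "GET / HTTP/1.1" 200 417 "-" "curl-healthcheck"
'''

# The request timestamp is the bracketed field of each log line.
TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def healthcheck_gaps(lines, agent="curl-healthcheck"):
    """Return the seconds between consecutive requests from `agent`."""
    times = []
    for line in lines:
        # Keep only requests whose user agent (last quoted field) matches.
        if not line.rstrip().endswith(f'"{agent}"'):
            continue
        match = TIMESTAMP.search(line)
        if match:
            times.append(
                datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
            )
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    print(healthcheck_gaps(LOG.splitlines()))
```

On this excerpt the first four gaps are 40 or 41 seconds and the remaining ones are 30 or 31 seconds. That would be consistent with Docker scheduling the next check `interval` seconds after the previous check completes, with early checks taking around 10 seconds each, but the access log alone cannot confirm that.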

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/584119

Changed in tripleo:
status: Triaged → In Progress

chandan kumar (chkumar246) wrote :

We are also hitting similar errors in overcloud deploy:
The multinode scenario 001, 002 and 004 jobs on master are failing repeatedly in check and gate at the overcloud deploy step.
http://logs.openstack.org/88/584088/1/gate/tripleo-ci-centos-7-scenario001-multinode-oooq-container/e490512/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2018-07-20_02_35_08

2018-07-20 02:35:08 | " Last run: 1532054Overcloud configuration failed.
2018-07-20 02:35:08 | END return value: 1
2018-07-20 02:35:08 | 092",
2018-07-20 02:35:08 | " Pcmk property: 18.41",
2018-07-20 02:35:08 | " Config retrieval: 2.89",
2018-07-20 02:35:08 | " Pcmk bundle: 37.33",
2018-07-20 02:35:08 | " Total: 58.65",
2018-07-20 02:35:08 | " Config: 1532054033",
2018-07-20 02:35:08 | "+ CONFIG='include ::tripleo::profile::base::pacemaker;include ::tripleo::profile::pacemaker::cinder::backup_bundle'",
2018-07-20 02:35:08 | "+ puppet apply --verbose --detailed-exitcodes --summarize --color=false --modulepath /etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules --tags file,file_line,concat,augeas,pacemaker::resource::bundle,pacemaker::property,pacemaker::constraint::location -e 'include ::tripleo::profile::base::pacemaker;include ::tripleo::profile::pacemaker::cinder::backup_bundle'",
2018-07-20 02:35:08 | " with Stdlib::Compat::Bool. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/cinder/manifests/backup.pp\", 63]:[\"/etc/puppet/modules/tripleo/manifests/profile/base/cinder/backup.pp\", 33]",
2018-07-20 02:35:08 | "Warning: Unknown variable: 'ensure'. at /etc/puppet/modules/cinder/manifests/backup.pp:83:18"
2018-07-20 02:35:08 | ]
2018-07-20 02:35:08 | }
2018-07-20 02:35:08 |
2018-07-20 02:35:08 | TASK [Check for unhealthy containers after step 5] *****************************
2018-07-20 02:35:08 | task path: /var/lib/mistral/164d547d-f31b-4431-aa8d-617ed1f56e98/common_deploy_steps_tasks.yaml:227
2018-07-20 02:35:08 | Friday 20 July 2018 02:35:06 +0000 (0:00:00.253) 0:46:21.495 ***********
2018-07-20 02:35:08 | ok: [centos-7-inap-mtl01-0000842070] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2018-07-20 02:35:08 |
2018-07-20 02:35:08 | TASK [Debug output for task which failed: Check for unhealthy containers after step 5] ***
2018-07-20 02:35:08 | task path: /var/lib/mistral/164d547d-f31b-4431-aa8d-617ed1f56e98/common_deploy_steps_tasks.yaml:242
2018-07-20 02:35:08 | Friday 20 July 2018 02:35:06 +0000 (0:00:00.427) 0:46:21.923 ***********
2018-07-20 02:35:08 | fatal: [centos-7-inap-mtl01-0000842070]: FAILED! => {
2018-07-20 02:35:08 | "failed_when_result": true,
2018-07-20 02:35:08 | "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": [
2018-07-20 02:35:08 | "97054cb023f3 192.168.24.1:8787/tripleomaster/centos-binary-gnocchi-statsd:current-tripleo-updated-20180720003910 \"kolla_start\" About a minute ago Up A...

chandan kumar (chkumar246) wrote :

https://review.openstack.org/569153 added a container health check as part of the deploy. I am reverting it and testing the revert here: https://review.openstack.org/#/c/584284/

Bogdan Dobrelya (bogdando) wrote :

Alex Schultz (alex-schultz) wrote :
tags: removed: alert
Changed in tripleo:
milestone: rocky-3 → rocky-rc1

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/584119
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=bd1d5d72caf25010e373f1ad2ed6ebc5aee96914
Submitter: Zuul
Branch: master

commit bd1d5d72caf25010e373f1ad2ed6ebc5aee96914
Author: Oliver Walsh <email address hidden>
Date: Thu Jul 19 22:20:53 2018 +0100

    Fix deploy health checks

    Allow up to 5 minutes for unhealthy and restarting containers to stabilise.

    Change-Id: Icb0ef7648920e77fe368409f07612cdcba83e4cf
    Related-Bug: 1782598
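
For illustration only, the idea of giving containers up to 5 minutes to stabilise can be sketched in Python as a poll-until-healthy loop. This is not the change from review 584119, which is a tripleo-heat-templates change; the function names, the 10 second delay and the injectable `list_fn`, `sleep` and `clock` parameters are assumptions made for this sketch. It relies on the Docker CLI `health=unhealthy` filter for `docker ps` and does not cover restarting containers.

```python
import subprocess
import time


def list_unhealthy():
    """Return the names of containers Docker currently reports as unhealthy."""
    out = subprocess.run(
        ["docker", "ps", "--filter", "health=unhealthy",
         "--format", "{{.Names}}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [name for name in out.splitlines() if name]


def wait_until_healthy(list_fn=list_unhealthy, timeout=300, delay=10,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until no container is unhealthy or `timeout` seconds elapse.

    Returns an empty list on success, or the names that were still
    unhealthy when the deadline passed.
    """
    deadline = clock() + timeout
    while True:
        unhealthy = list_fn()
        if not unhealthy:
            return []
        if clock() >= deadline:
            return unhealthy
        sleep(delay)


if __name__ == "__main__":
    remaining = wait_until_healthy()
    if remaining:
        raise SystemExit("still unhealthy: " + ", ".join(remaining))
```

Failing only after the deadline, rather than on the first unhealthy report, tolerates containers that are still settling shortly after start.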

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/586536

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/queens)

Change abandoned by Emilien Macchi (<email address hidden>) on branch: stable/queens
Review: https://review.openstack.org/586536

wes hayutin (weshayutin)
Changed in tripleo:
status: In Progress → Fix Released