platform-integ-apps apply failed

Bug #1830290 reported by Juan Carlos Alonso
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Bob Church

Bug Description

Brief Description
-----------------
Using the Ansible bootstrap playbook configuration, the 'platform-integ-apps' application failed during apply.
It tries to re-apply, but the operation is aborted with status 'apply-failed'.

Severity
--------
Critical: System/Feature is not usable due to the defect

Steps to Reproduce
------------------
Installation of AIO Simplex system

Expected Behavior
------------------
Status of 'platform-integ-apps': applied
Provisioning succeeds

Actual Behavior
----------------
Status of 'platform-integ-apps': apply-failed
Provisioning failed

Reproducibility
---------------
Reproducible: 100%

System Configuration
--------------------
AIO Simplex
BUILD_ID="20190523T013000Z"

Last Pass
---------
ISO: 20190522T013000Z

Timestamp/Logs
--------------
/var/log/sysinv.log attached
List of changes in this ISO attached

Test Activity
-------------
Test deployment with Ansible

Revision history for this message
Ghada Khalil (gkhalil) wrote :

The following commit will make the application apply more deterministic on All-in-one systems:
https://review.opendev.org/#/c/660918/

This was merged on May 23. Please re-test with the CENGN May 24 build.

Changed in starlingx:
status: New → Incomplete
Revision history for this message
Juan Carlos Alonso (juancarlosa) wrote :

Tested with the May 24 build. The issue is still present.

[wrsroot@controller-0 ~(keystone_admin)]$ cat /etc/build.info
###
### StarlingX
### Built from master
###

OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190524T013000Z"

JOB="STX_build_master_master"
BUILD_BY="<email address hidden>"
BUILD_NUMBER="114"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-05-24 01:30:00 +0000"

[wrsroot@controller-0 ~(keystone_admin)]$ system application-list
+---------------------+---------+-------------------------------+---------------+---------------+------------------------------------------+
| application         | version | manifest name                 | manifest file | status        | progress                                 |
+---------------------+---------+-------------------------------+---------------+---------------+------------------------------------------+
| platform-integ-apps | 1.0-5 | platform-integration-manifest | manifest.yaml | upload-failed | operation aborted, check logs for detail |
+---------------------+---------+-------------------------------+---------------+---------------+------------------------------------------+

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Are the two occurrences the same? The first failure you mention is an apply-failed. The second failure is upload-failed. Is the upload-failure consistent/reproducible with the new load? Do you have logs from the second failure?

summary: - platform-integ-apps apply failed
+ Simplex: platform-integ-apps apply failed
Revision history for this message
Al Bailey (albailey1974) wrote : Re: Simplex: platform-integ-apps apply failed

Please provide full collect logs for the second failure.
The sysinv logs from the 23rd indicate that the service running on port 9001 (local docker registry?) may not have been running or configured properly.

2019-05-23 22:02:14.120 99896 INFO sysinv.conductor.kube_app [-] Application overrides generated.
2019-05-23 22:02:14.165 99896 INFO sysinv.conductor.kube_app [-] Armada manifest file has no img tags for chart helm-toolkit
2019-05-23 22:02:14.183 99896 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/quay.io/external_storage/rbd-provisioner:v2.1.1-k8s1.11 download started from local registry
2019-05-23 22:02:14.212 99896 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/port/ceph-config-helper:v1.10.3 download started from local registry
2019-05-23 22:02:14.384 92020 INFO sysinv.agent.manager [req-9ba7d1b3-c924-4da3-9f92-535efb16040a admin None] Runtime manifest apply completed for classes [u'openstack::keystone::endpoint::runtime', u'platform::firewall::runtime', u'platform::sysinv::runtime'].
2019-05-23 22:02:14.385 92020 INFO sysinv.agent.manager [req-9ba7d1b3-c924-4da3-9f92-535efb16040a admin None] Agent config applied 79e3d68c-85bc-4f58-854e-aa0dc51bf3fa
2019-05-23 22:02:14.409 99896 INFO sysinv.conductor.manager [req-9ba7d1b3-c924-4da3-9f92-535efb16040a admin None] SYS_I Clear system config alarm: controller-0 target config 79e3d68c-85bc-4f58-854e-aa0dc51bf3fa
2019-05-23 22:02:24.469 99896 ERROR sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/port/ceph-config-helper:v1.10.3 download failed from local registry: 500 Server Error: Internal Server Error ("Get https://192.168.204.2:9001/v2/: net/http: TLS handshake timeout")
2019-05-23 22:02:24.480 99896 ERROR sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/quay.io/external_storage/rbd-provisioner:v2.1.1-k8s1.11 download failed from local registry: 500 Server Error: Internal Server Error ("Get https://192.168.204.2:9001/v2/: net/http: TLS handshake timeout")
2019-05-23 22:02:24.480 99896 ERROR sysinv.conductor.kube_app [-] Deployment of application platform-integ-apps (1.0-5) failed: failed to download one or more image(s).
2019-05-23 22:02:24.480 99896 TRACE sysinv.conductor.kube_app Traceback (most recent call last):
2019-05-23 22:02:24.480 99896 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 1197, in perform_app_apply
2019-05-23 22:02:24.480 99896 TRACE sysinv.conductor.kube_app self._download_images(app)
2019-05-23 22:02:24.480 99896 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 534, in _download_images
2019-05-23 22:02:24.480 99896 TRACE sysinv.conductor.kube_app reason="failed to download one or more image(s).")
2019-05-23 22:02:24.480 99896 TRACE sysinv.conductor.kube_app KubeAppApplyFailure: Deployment of application platform-integ-apps (1.0-5) failed: failed to download one or more image(s).
2019-05-23 22:02:24.480 99896 TRACE sysinv.conductor.kube_app
2019-05-23 22:02:24.491 99896 ERROR sysinv.conductor.kube_app [-] Application apply aborted!.
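
One quick way to check that hypothesis is to probe the registry's /v2/ endpoint directly, both honouring and ignoring the proxy environment. A minimal sketch, assuming the requests library is available on the controller and using the registry address from the log above (verify=False only sidesteps certificate trust for a connectivity check):

# Illustrative check only, not StarlingX code: see whether the TLS handshake to
# the local registry times out, and whether the proxy settings make a difference.
import requests

REGISTRY_URL = "https://192.168.204.2:9001/v2/"  # address taken from the log above

def probe(label, session):
    try:
        resp = session.get(REGISTRY_URL, verify=False, timeout=10)
        print("%s: HTTP %d" % (label, resp.status_code))
    except requests.exceptions.RequestException as exc:
        print("%s: %s" % (label, exc))

with_proxy = requests.Session()   # honours http(s)_proxy from the environment
direct = requests.Session()
direct.trust_env = False          # ignore proxy environment variables entirely

probe("via environment proxies", with_proxy)
probe("proxy environment ignored", direct)

If the direct probe answers promptly while the proxied one times out, the proxy path rather than the registry itself is the likely culprit.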

Revision history for this message
Juan Carlos Alonso (juancarlosa) wrote :

First failure: 'apply-failed', using proxy.
Second failure: 'upload-failed', using local registry.

After discussing with the team, using a proxy is the best option in virtual environments.

Re-tested with the May 24 ISO; the first failure was encountered again.

[wrsroot@controller-0 ~(keystone_admin)]$ cat /etc/build.info
###
### StarlingX
### Built from master
###

OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190524T013000Z"

JOB="STX_build_master_master"
BUILD_BY="<email address hidden>"
BUILD_NUMBER="114"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-05-24 01:30:00 +0000"

Revision history for this message
Al Bailey (albailey1974) wrote :

The collect logs show the same TLS Handshake timeout.

I see bugs reported against many other projects related to Go's 'net/http: TLS handshake timeout' error.

One suggested an MTU mismatch (I am uncertain how to check this)

Another bug raised against Kubernetes indicates that https_proxy is causing the problem:
https://github.com/kubernetes/kubernetes/issues/13382
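
For the MTU suggestion, a rough sketch of how the interface MTUs on the controller could be compared side by side; this only reads standard Linux sysfs and assumes nothing StarlingX-specific:

# Hedged sketch: list the MTU of every network interface as the kernel sees it,
# so a mismatch (e.g. management interface vs. the docker0 bridge) can be
# spotted at a glance.
import os

for iface in sorted(os.listdir("/sys/class/net")):
    try:
        with open("/sys/class/net/%s/mtu" % iface) as f:
            print("%-15s mtu=%s" % (iface, f.read().strip()))
    except IOError:
        pass  # interface disappeared or is not readable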

Revision history for this message
Jerry Sun (jerry-sun-u) wrote :

Do you have both an http and an https proxy configured for the registry? We believe that could be causing issues (see https://bugs.launchpad.net/starlingx/+bug/1830436).

Revision history for this message
Juan Carlos Alonso (juancarlosa) wrote :

Yes, I am using:

docker_http_proxy: http://proxy-chain.intel.com:911
docker_https_proxy: http://proxy-chain.intel.com:912
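
If those proxies also intercept traffic to the cluster-internal registry address (192.168.204.2), exempting that address is the usual remedy. A small illustration of how the standard no_proxy environment variable controls the bypass, using only the Python 3 standard library; whether the Ansible bootstrap overrides expose an equivalent setting is not assumed here:

# Illustration only: show how no_proxy decides whether the local registry
# address from the logs (192.168.204.2) bypasses the configured proxy.
import os
import urllib.request

os.environ["https_proxy"] = "http://proxy-chain.intel.com:912"

for no_proxy in ("", "192.168.204.2"):
    os.environ["no_proxy"] = no_proxy
    bypassed = bool(urllib.request.proxy_bypass("192.168.204.2"))
    print("no_proxy=%r -> bypass proxy for the registry: %s" % (no_proxy, bypassed))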

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Hi Juan, Is there a reason you need two proxies? Please retest with only one (either http or https proxy) and let us know the results.

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Working on bare metal, Standard (2+2), using 20190527T233000Z and Ansible (with the corresponding workarounds), platform-integ-apps remains in 'uploaded' status.

If we try to apply it manually (system application-apply platform-integ-apps), it fails:

+---------------------+---------+-------------------------------+---------------+--------------+------------------------------------------+
| application         | version | manifest name                 | manifest file | status       | progress                                 |
+---------------------+---------+-------------------------------+---------------+--------------+------------------------------------------+
| platform-integ-apps | 1.0-5 | platform-integration-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail |
+---------------------+---------+-------------------------------+---------------+--------------+------------------------------------------+

The first error encountered in /var/log/sysinv.log is as follows:

2019-05-28 12:43:28.431 106547 ERROR sysinv.conductor.kube_app [-] Received a false positive response from Docker/Armada. Failed to apply application manifest /manifests/platform-integ-apps/1.0-5/platform-integ-apps-manifest.yaml: 2019-05-28 12:37:19.800 42 DEBUG armada.handlers.document [-] Resolving reference /manifests/platform-integ-apps/1.0-5/platform-integ-apps-manifest.yaml. resolve_reference /usr/local/lib/python3.6/dist-packages/armada/handlers/document.py:49

A full collect is attached.

Revision history for this message
Bart Wensley (bartwensley) wrote :

This is also failing for me in a 2+2 VirtualBox configuration. This is a non-OpenStack install; I am not applying the OpenStack-related labels to any of the hosts.

The platform-integ-apps application repeatedly fails to apply. The signature is slightly different from the one in the collect from Cristopher. I will attach a collect.

Here are the armada logs for the failed apply:

2019-05-28 19:50:27.869 41 DEBUG armada.handlers.document [-] Resolving reference /manifests/platform-integ-apps/1.0-5/platform-integ-apps-manifest.yaml. resolve_reference /usr/local/lib/python3.6/dist-packages/armada/handlers/document.py:49
2019-05-28 19:50:27.904 41 DEBUG armada.handlers.tiller [-] Using Tiller namespace: kube-system _get_tiller_namespace /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:174
2019-05-28 19:50:27.963 41 DEBUG armada.handlers.tiller [-] Found at least one Running Tiller pod. _get_tiller_pod /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:150
2019-05-28 19:50:27.963 41 DEBUG armada.handlers.tiller [-] Using Tiller pod IP: 192.168.204.3 _get_tiller_ip /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:165
2019-05-28 19:50:27.963 41 DEBUG armada.handlers.tiller [-] Using Tiller host port: 44134 _get_tiller_port /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:170
2019-05-28 19:50:27.964 41 DEBUG armada.handlers.tiller [-] Tiller getting gRPC insecure channel at 192.168.204.3:44134 with options: [grpc.max_send_message_length=429496729, grpc.max_receive_message_length=429496729] get_channel /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:124
2019-05-28 19:50:27.999 41 DEBUG armada.handlers.tiller [-] Armada is using Tiller at: None:44134, namespace=kube-system, timeout=300 __init__ /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:104
2019-05-28 19:50:28.010 41 INFO armada.handlers.lock [-] Acquiring lock
2019-05-28 19:50:28.024 41 INFO armada.handlers.lock [-] Lock Custom Resource Definition not found, creating now
2019-05-28 19:50:28.047 41 DEBUG armada.handlers.lock [-] Encountered known issue while creating CRD, continuing create_definition /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:297
2019-05-28 19:50:28.050 41 INFO armada.handlers.lock [-] Lock Custom Resource Definition not found, creating now
2019-05-28 19:50:28.109 41 DEBUG armada.utils.validate [-] Validating document [armada/Chart/v1] helm-toolkit validate_armada_document /usr/local/lib/python3.6/dist-packages/armada/utils/validate.py:152
2019-05-28 19:50:28.111 41 DEBUG armada.utils.validate [-] Validating document [armada/Chart/v1] kube-system-rbd-provisioner validate_armada_document /usr/local/lib/python3.6/dist-packages/armada/utils/validate.py:152
2019-05-28 19:50:28.111 41 DEBUG armada.utils.validate [-] Validating document [armada/Chart/v1] kube-system-ceph-pools-audit validate_armada_document /usr/local/lib/python3.6/dist-packages/armada/utils/validate.py:152
2019-05-28 19:50:28.112 41 DEBUG armada.utils.validate [-] Validating document [armada/ChartGroup/v1] starlingx-ceph-charts validate_armada_document /usr/local/lib/python3.6/dist-packages/armada/utils/valida...

summary: - Simplex: platform-integ-apps apply failed
+ platform-integ-apps apply failed
Revision history for this message
Bart Wensley (bartwensley) wrote :

A couple more things. This is happening in a designer load built on May 28:
SW_VERSION="19.01"
BUILD_TARGET="Unknown"
BUILD_TYPE="Informal"
BUILD_ID="n/a"
JOB="n/a"
BUILD_BY="bwensley"
BUILD_NUMBER="n/a"
BUILD_HOST="yow-bwensley-lx-vm2"
BUILD_DATE="2019-05-28 06:45:22 -0500"
BUILD_DIR="/"
WRS_SRC_DIR="/localdisk/designer/bwensley/starlingx-1/cgcs-root"
WRS_GIT_BRANCH="HEAD"
CGCS_SRC_DIR="/localdisk/designer/bwensley/starlingx-1/cgcs-root/stx"
CGCS_GIT_BRANCH="HEAD"

Also, this happened twice today. I have not been able to do a successful installation.

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Just a note: the collect and logs that I uploaded correspond to a bare-metal server using the mirror (local) registry. This bug was initially reported for the proxy case. As suggested, I created a new Launchpad bug to track the registry issue separately: https://bugs.launchpad.net/starlingx/+bug/1830826

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Bob Church (rchurch)
importance: Undecided → High
status: Incomplete → Confirmed
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; impacts container deployment.

tags: added: stx.2.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/662075

Changed in starlingx:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/662075
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=12ff7c16f8850355fc9e0afa7a406083b4d42deb
Submitter: Zuul
Branch: master

commit 12ff7c16f8850355fc9e0afa7a406083b4d42deb
Author: Robert Church <email address hidden>
Date: Wed May 29 02:04:35 2019 -0400

    Update rbd-provisioner replicas based on installed controllers

    Currently the number of rbd-provisioner replicas is driven by the
    stx-openstack application's 'openstack-control-plane' labels.

    On systems where this label has not been applied to the controllers,
    this will result in zero provisioners being installed.

    Break the dependency on the stx-openstack app and set the number of
    replicas based on the number of installed controllers as the
    rbd-provisioner node selector will install in k8s masters (i.e.
    controllers).

    Also update the provisioner's storage-init pod to align with the same
    node selection criteria as the rbd-provisioner pod.

    Change-Id: Ida180fd12a4923c8cdd5bccf25a1a1e2af4f8a90
    Closes-Bug: #1830290
    Signed-off-by: Robert Church <email address hidden>
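
A hedged sketch of the idea in this commit, not the actual sysinv code; the host fields and override key path below are illustrative. The point is to count the installed controllers and drive the rbd-provisioner replica count from that, instead of from hosts carrying the stx-openstack 'openstack-control-plane' label:

# Illustrative only: derive rbd-provisioner replicas from installed controllers.
def rbd_provisioner_overrides(hosts):
    # 'hosts' is a list of dicts with a 'personality' field (illustrative shape).
    controllers = [h for h in hosts if h.get("personality") == "controller"]
    replicas = max(1, len(controllers))  # at least one replica, even mid-install
    return {"pods": {"replicas": {"rbd_provisioner": replicas}}}

# AIO Simplex -> 1 replica; Standard 2+2 -> 2 replicas.
print(rbd_provisioner_overrides([{"personality": "controller"}]))
print(rbd_provisioner_overrides([{"personality": "controller"},
                                 {"personality": "controller"},
                                 {"personality": "worker"},
                                 {"personality": "worker"}]))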

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Juan Carlos Alonso (juancarlosa) wrote :

Just for your information: after the switch to Ansible, I was able to reproduce this issue in all configurations.
All hosts can be unlocked, enabled, and available, yet provisioning fails at the same point.

Revision history for this message
Cristopher Lemus (cjlemusc) wrote : Kubernetes cheat sheet
Revision history for this message
Cristopher Lemus (cjlemusc) wrote : Sanity logs

I'll send you the external one on virtual in a bit.

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Somehow, a Slack comment managed to update this bug; please disregard.
