Application-apply failed due to error copying secret ceph-pool-kube-rbd

Bug #1828896 reported by Maria Guadalupe Perez Ibara
Affects: StarlingX
Status: Fix Released
Importance: Critical
Assigned to: Bob Church

Bug Description

Brief Description
-----------------
Application-apply failed due to error copying secret ceph-pool-kube-rbd

Severity
--------
Critical

Steps to Reproduce
------------------
1. Have a Standard 2+2 or 2+2+2 deployment ready
2. Execute application apply
  $ system application-apply stx-openstack

Expected Behavior
------------------
Application apply should be completed successfully

Actual Behavior
----------------
Application-apply failed

Reproducibility
---------------
100% reproducible.

System Configuration
--------------------
Multi-node system, Dedicated storage BM

Branch/Pull Time/Commit
-----------------------
OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190512T233000Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="99"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-05-12 23:30:00 +0000"

Timestamp/Logs
--------------
The following error is logged in /var/log/sysinv.log:

2019-05-13 11:59:08.623 96691 ERROR sysinv.common.kubernetes [req-c06222cc-5e13-4269-86d0-97ecaba9b21d admin admin] Failed to copy Secret ceph-pool-kube-rbd from Namespace kube-system to Namespace openstack: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Date': 'Mon, 13 May 2019 11:59:08 GMT', 'Content-Length': '210', 'Content-Type': 'application/json'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"secrets \"ceph-pool-kube-rbd\" not found","reason":"NotFound","details":{"name":"ceph-pool-kube-rbd","kind":"secrets"},"code":404}
2019-05-13 11:59:08.623 96691 ERROR sysinv.conductor.kube_app [-] (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Date': 'Mon, 13 May 2019 11:59:08 GMT', 'Content-Length': '210', 'Content-Type': 'application/json'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"secrets \"ceph-pool-kube-rbd\" not found","reason":"NotFound","details":{"name":"ceph-pool-kube-rbd","kind":"secrets"},"code":404}
2019-05-13 11:59:08.623 96691 ERROR sysinv.conductor.kube_app [-] (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Date': 'Mon, 13 May 2019 11:59:08 GMT', 'Content-Length': '210', 'Content-Type': 'application/json'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"secrets \"ceph-pool-kube-rbd\" not found","reason":"NotFound","details":{"name":"ceph-pool-kube-rbd","kind":"secrets"},"code":404}

2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app Traceback (most recent call last):
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 1150, in perform_app_apply
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app self._create_storage_provisioner_secrets(app.name)
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 697, in _create_storage_provisioner_secrets
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app pool_secret, common.HELM_NS_STORAGE_PROVISIONER, ns)
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/common/kubernetes.py", line 133, in kube_copy_secret
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app body = c.read_namespaced_secret(name, src_namespace, export=True)
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app File "/usr/lib/python2.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 19486, in read_namespaced_secret
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app (data) = self.read_namespaced_secret_with_http_info(name, namespace, **kwargs)
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app File "/usr/lib/python2.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 19577, in read_namespaced_secret_with_http_info
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app collection_formats=collection_formats)
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app File "/usr/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 321, in call_api
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app _return_http_data_only, collection_formats, _preload_content, _request_timeout)
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app File "/usr/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 155, in __call_api
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app _request_timeout=_request_timeout)
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app File "/usr/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 342, in request
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app headers=headers)
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app File "/usr/lib/python2.7/site-packages/kubernetes/client/rest.py", line 231, in GET
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app query_params=query_params)
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app File "/usr/lib/python2.7/site-packages/kubernetes/client/rest.py", line 222, in request
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app raise ApiException(http_resp=r)
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app ApiException: (404)
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app Reason: Not Found
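
The traceback above ends in a bare 404 from the secret read. As a minimal sketch (not the sysinv code), the copy step could translate that 404 into an error that names the missing source secret. ApiException below is a stand-in for the kubernetes client's exception so the example runs without a cluster; SecretSourceMissing, copy_secret and FakeClient are hypothetical names introduced only for illustration.

```python
class ApiException(Exception):
    """Stand-in for the kubernetes client's ApiException (only .status is used)."""

    def __init__(self, status):
        super().__init__("(%s)" % status)
        self.status = status


class SecretSourceMissing(Exception):
    """Hypothetical error: the secret to be copied does not exist yet."""


def copy_secret(client, name, src_namespace, dst_namespace):
    """Copy a Secret between namespaces, turning a 404 into a clearer error."""
    try:
        body = client.read_namespaced_secret(name, src_namespace)
    except ApiException as e:
        if e.status == 404:
            raise SecretSourceMissing(
                "Secret %s not found in namespace %s; has the application "
                "that creates it been applied?" % (name, src_namespace))
        raise
    # A real implementation would also have to reset namespace-specific
    # metadata on the body before creating it; that is omitted here.
    return client.create_namespaced_secret(dst_namespace, body)


class FakeClient:
    """Hypothetical in-memory client used only to exercise copy_secret."""

    def __init__(self, secrets):
        self.secrets = secrets  # {(namespace, name): body}

    def read_namespaced_secret(self, name, namespace):
        try:
            return self.secrets[(namespace, name)]
        except KeyError:
            raise ApiException(404)

    def create_namespaced_secret(self, namespace, body):
        self.secrets[(namespace, body["name"])] = body
        return body
```

With an empty FakeClient, copying ceph-pool-kube-rbd from kube-system to openstack raises SecretSourceMissing instead of a generic "Not Found".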

Test Activity
-------------
Sanity

Revision history for this message
Maria Guadalupe Perez Ibara (maria-gp) wrote :
Erich Cordoba (ericho)
summary: - Application-apply failed.
+ Application-apply failed due to error copying secret ceph-pool-kube-rbd
description: updated
Ghada Khalil (gkhalil)
tags: added: stx.storage
Changed in starlingx:
importance: Undecided → Critical
Ghada Khalil (gkhalil) wrote :

Suspect this is related to recent rbd-provisioner de-coupling; assigning to Bob to triage

Changed in starlingx:
assignee: nobody → Bob Church (rchurch)
Bob Church (rchurch) wrote :

Looks like the storage-init job for the rbd-provisioner failed. Because of the failure, the secret is not created.

For the 2+2 and 2+2+2 configurations, we provision the OSDs (which loads the crushmap) much later, because we first need to establish a quorum (2 of 3 monitors). The platform-integ-apps application will apply successfully early during provisioning, but the provisioner's storage-init job will fail because the quorum has not been established and/or the crushmap has not been loaded in the cluster.

https://review.opendev.org/#/c/658942/ will ensure that platform-integ-apps will not be applied until the required Ceph cluster dependencies are available.
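
The kind of gate described here can be sketched as follows. This is an illustration under stated assumptions, not the code in the review: the function name and its parameters are hypothetical, and "quorum" is taken to mean a strict majority of the configured monitors (2 of 3, as above).

```python
def ceph_ready_for_platform_apps(monitors_in_quorum, monitors_configured,
                                 crushmap_loaded):
    """Return True when a monitor majority is in quorum and the crushmap is loaded.

    All three inputs are assumed to be supplied by the caller; how they are
    obtained from the Ceph cluster is outside the scope of this sketch.
    """
    has_quorum = monitors_in_quorum > monitors_configured // 2
    return has_quorum and crushmap_loaded


# Early in provisioning: one monitor up, crushmap not yet loaded.
print(ceph_ready_for_platform_apps(1, 3, False))
# Fully provisioned: two of three monitors in quorum, crushmap loaded.
print(ceph_ready_for_platform_apps(2, 3, True))
```

The first call prints False and the second prints True, so the apply of platform-integ-apps would be deferred until both conditions hold.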

Bob Church (rchurch)
Changed in starlingx:
status: New → In Progress
Bob Church (rchurch) wrote :

The workaround for this issue is to run the following after the system is fully provisioned and the Ceph cluster is operational.

$ system application-remove platform-integ-apps
$ system application-apply platform-integ-apps

After this you can upload and apply the stx-openstack application.
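
For automation, the two workaround commands could be wrapped as in the sketch below. The command lines are the ones given above; run_workaround and its run parameter are hypothetical. The sketch assumes each command has finished its work by the time it returns; if the CLI only queues the operation, a wait between the two steps would be needed and is not shown.

```python
import subprocess

# The two commands from the workaround, in the order they must run.
WORKAROUND_COMMANDS = [
    ["system", "application-remove", "platform-integ-apps"],
    ["system", "application-apply", "platform-integ-apps"],
]


def run_workaround(run=subprocess.check_call):
    """Run the remove/apply pair in order, stopping at the first failure.

    `run` defaults to subprocess.check_call, which raises CalledProcessError
    on a non-zero exit status; it is injectable so the sequence can be
    exercised without a StarlingX system.
    """
    executed = []
    for cmd in WORKAROUND_COMMANDS:
        run(cmd)
        executed.append(cmd)
    return executed
```

Passing a recording function as run (for example a list's append method) shows the exact sequence that would be executed without touching the system.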

Ghada Khalil (gkhalil) wrote :

Marking as release gating; critical priority as this is causing a red sanity.

tags: added: stx.2.0 stx.sanity
OpenStack Infra (hudson-openstack) wrote : Related fix merged to config (master)

Reviewed: https://review.opendev.org/658942
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=a5a0619ebccea603fc85df39fbc6e190ddae0f93
Submitter: Zuul
Branch: master

commit a5a0619ebccea603fc85df39fbc6e190ddae0f93
Author: Robert Church <email address hidden>
Date: Mon May 13 04:28:02 2019 -0400

    Add application apply prerequisites for platform managed apps

    Add an application-apply dependency for the platform integration
    application which launches the Ceph related charts. This dependency will
    require that a quorum has been established and the crushmap has been
    loaded prior to launching the application.

    This will ensure that the charts have the Ceph connectivity required for
    a successful chart release.

    Change-Id: I56528200d16c68d129bc092e3dcc9af135cff16a
    Story: 2005424
    Task: 30977
    Related-Bug: #1828896
    Signed-off-by: Robert Church <email address hidden>

Ghada Khalil (gkhalil)
Changed in starlingx:
status: In Progress → Fix Released
Peng Peng (ppeng) wrote :

Still saw same issue on SM-2 load: 2019-05-18_06-36-50

2019-05-21 13:16:46.547 99992 ERROR sysinv.common.kubernetes [req-0b94ce00-104d-4d3f-9aa1-9acf4c75201a admin admin] Failed to copy Secret ceph-pool-kube-rbd from Namespace kube-system to Namespace openstack: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Date': 'Tue, 21 May 2019 13:16:46 GMT', 'Content-Length': '210', 'Content-Type': 'application/json'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"secrets \"ceph-pool-kube-rbd\" not found","reason":"NotFound","details":{"name":"ceph-pool-kube-rbd","kind":"secrets"},"code":404}
2019-05-21 13:16:46.547 99992 ERROR sysinv.conductor.kube_app [-] (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Date': 'Tue, 21 May 2019 13:16:46 GMT', 'Content-Length': '210', 'Content-Type': 'application/json'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"secrets \"ceph-pool-kube-rbd\" not found","reason":"NotFound","details":{"name":"ceph-pool-kube-rbd","kind":"secrets"},"code":404}
2019-05-21 13:16:46.547 99992 ERROR sysinv.conductor.kube_app [-] (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Date': 'Tue, 21 May 2019 13:16:46 GMT', 'Content-Length': '210', 'Content-Type': 'application/json'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"secrets \"ceph-pool-kube-rbd\" not found","reason":"NotFound","details":{"name":"ceph-pool-kube-rbd","kind":"secrets"},"code":404}

2019-05-21 13:16:46.547 99992 TRACE sysinv.conductor.kube_app Traceback (most recent call last):
2019-05-21 13:16:46.547 99992 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 1149, in perform_app_apply
2019-05-21 13:16:46.547 99992 TRACE sysinv.conductor.kube_app self._create_storage_provisioner_secrets(app.name)
2019-05-21 13:16:46.547 99992 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 696, in _create_storage_provisioner_secrets
2019-05-21 13:16:46.547 99992 TRACE sysinv.conductor.kube_app pool_secret, common.HELM_NS_STORAGE_PROVISIONER, ns)
2019-05-21 13:16:46.547 99992 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/common/kubernetes.py", line 133, in kube_copy_secret
2019-05-21 13:16:46.547 99992 TRACE sysinv.conductor.kube_app body = c.read_namespaced_secret(name, src_namespace, export=True)
2019-05-21 13:16:46.547 99992 TRACE sysinv.conductor.kube_app File "/usr/lib/python2.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 19486, in read_namespaced_secret
2019-05-21 13:16:46.547 99992 TRACE sysinv.conductor.kube_app (data) = self.read_namespaced_secret_with_http_info(name, namespace, **kwargs)
2019-05-21 13:16:46.547 99992 TRACE sysinv.conductor.kube_app File "/usr/lib/python2.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 19577, in read_namespaced_secret_with_http_info
2019-05-21 13:16:46.547 99992 TRACE sysinv.conductor.kube_app collection_formats=collection_formats)
2019-...


Peng Peng (ppeng) wrote :
Bob Church (rchurch) wrote :

Issues observed here:

1) The stx-openstack app is being applied prior to the completion of the platform-integ-apps
   - Per: http://lists.starlingx.io/pipermail/starlingx-discuss/2019-May/004447.html

     Prior to running any additional user applications (including the stx-openstack
     application), you will want to make sure that the platform application has been
     applied [4] to ensure that persistent volume claims will be serviced. Other than
     this check, no other additional changes are required from an automation
     perspective to launch the stx-openstack application.

   - https://opendev.org/starlingx/config/commit/4758cdfbd864826d46e6e06571d40693dd040b14 will prevent this apply if attempted too soon

2) The stx-openstack apply aborts because the secret that platform-integ-apps creates does not exist yet
   - Update for #1 will avoid this

3) platform-integ-apps overrides are being overwritten for the helm toolkit when the stx-openstack upload/apply occurs. This seems to cause the abort of platform-integ-apps
   - We need to land https://review.opendev.org/#/c/660498/ to isolate the app overrides. The difference is shown here; the toolkit is present in both helm repos because it is required by both apps

     [wrsroot@controller-0 19.05(keystone_admin)]$ diff helm-toolkit-helm-toolkit.yaml ~/openstack-save/helm-toolkit-helm-toolkit.yaml
     3c3
     < location: http://controller:8080/helm_charts/stx-platform/helm-toolkit-0.1.0.tgz
     ---
     > location: http://controller:8080/helm_charts/starlingx/helm-toolkit-0.1.0.tgz
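
The collision in issue 3 can be illustrated with a small sketch. The file name is the one from the diff above; the base directory and the two path helpers are hypothetical and do not describe how sysinv actually lays out override files.

```python
import posixpath

BASE_DIR = "/overrides"  # hypothetical location, for illustration only
FILENAME = "helm-toolkit-helm-toolkit.yaml"  # file name taken from the diff


def shared_path(app_name, filename, base_dir=BASE_DIR):
    """Shared layout: app_name is ignored, so every app writes the same file."""
    return posixpath.join(base_dir, filename)


def isolated_path(app_name, filename, base_dir=BASE_DIR):
    """Isolated layout: each app gets its own subdirectory."""
    return posixpath.join(base_dir, app_name, filename)


apps = ["platform-integ-apps", "stx-openstack"]
shared = {shared_path(a, FILENAME) for a in apps}
isolated = {isolated_path(a, FILENAME) for a in apps}
print(len(shared), len(isolated))  # prints "1 2"
```

With the shared layout both applications resolve to one path, so the later writer replaces the earlier one's overrides; with per-application directories the two files no longer collide.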

In summary, we have a sequencing issue here which can no longer happen based on the inter_app dependency code that I added in https://opendev.org/starlingx/config/commit/4758cdfbd864826d46e6e06571d40693dd040b14
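
An inter-application dependency check of the kind mentioned in the summary could look like the sketch below. It is not the code from the referenced commit: the dependency table, the ApplyTooSoon exception, and the "applied" status string are assumptions made for illustration.

```python
class ApplyTooSoon(Exception):
    """Hypothetical error raised when a prerequisite app is not applied yet."""


# Assumed dependency table: stx-openstack needs platform-integ-apps first.
APP_DEPENDENCIES = {"stx-openstack": ["platform-integ-apps"]}


def check_apply_allowed(app_name, app_status):
    """Raise ApplyTooSoon unless every prerequisite of app_name is applied.

    app_status maps application name to its current status string; the
    value "applied" is an assumed name for the successful end state.
    """
    missing = [dep for dep in APP_DEPENDENCIES.get(app_name, [])
               if app_status.get(dep) != "applied"]
    if missing:
        raise ApplyTooSoon(
            "%s requires %s to be applied first" % (app_name, ", ".join(missing)))
    return True
```

Called with platform-integ-apps still in progress, the check rejects the stx-openstack apply up front, rather than letting it fail later on the missing ceph-pool-kube-rbd secret.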

Timeline:
---------------------------------------------
# Ceph client is accessible

2019-05-21 07:13:06.530 98217 INFO ceph_client [-] Request params: url=https://controller-0:5001/request?wait=1, json={'prefix': 'fsid', 'format': 'text'}
2019-05-21 07:13:06.546 98217 INFO ceph_client [-] Result: {u'waiting': [], u'has_failed': False, u'state': u'success', u'is_waiting': False, u'running': [], u'failed': [], u'finished': [{u'outb': u'326ed215-c644-4855-b5f9-eaeb0328ff73\n', u'outs': u'', u'command': u'fsid format=text'}], u'is_finished': True, u'id': u'140310473308432'}

# Audit task triggers creation/upload of platform-integ-apps

2019-05-21 07:13:20.082 99992 INFO sysinv.conductor.manager [-] Platform managed application platform-integ-apps: Creating...
2019-05-21 07:13:21.428 99992 INFO sysinv.conductor.manager [-] Platform managed application platform-integ-apps: Uploading...
2019-05-21 07:13:21.430 99992 INFO sysinv.conductor.kube_app [-] Application (platform-integ-apps) upload started.
2019-05-21 07:13:23.633 99992 INFO sysinv.conductor.kube_app [-] Manifest file /manifests/platform-integ-apps-manifest.yaml was successfully validated.
2019-05-21 07:13:24.178 99992 INFO sysinv.conductor.kube_app [-] Application platform-integ-apps will load charts to chart repo stx-platform
2019-05-21 07:13:27.362 99992 INFO sysinv.conductor.kube_app [-] Generating application overrides...


Saul Wold (sgw-starlingx) wrote :

As suggested by Al, I might be seeing a manifestation of this issue in a virtual environment. When I follow the existing test-suite scripts used for sanity testing, the first apply always fails and the second apply succeeds.

http://lists.starlingx.io/pipermail/starlingx-discuss/2019-May/004677.html
