Application-apply failed due to error copying secret ceph-pool-kube-rbd

Bug #1828896 reported by Maria Guadalupe Perez Ibara
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Critical
Assigned to: Bob Church

Bug Description

Brief Description
-----------------
Application-apply failed due to error copying secret ceph-pool-kube-rbd

Severity
--------
Critical

Steps to Reproduce
------------------
1. Have a Standard 2+2 or 2+2+2 deployment ready
2. Execute application apply
  $ system application-apply stx-openstack

Expected Behavior
------------------
Application apply should be completed successfully

Actual Behavior
----------------
Application-apply failed

Reproducibility
---------------
100% reproducible.

System Configuration
--------------------
Multi-node system, Dedicated storage BM

Branch/Pull Time/Commit
-----------------------
OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190512T233000Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="99"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-05-12 23:30:00 +0000"

Timestamp/Logs
--------------
The following error is logged in /var/log/sysinv.log:

2019-05-13 11:59:08.623 96691 ERROR sysinv.common.kubernetes [req-c06222cc-5e13-4269-86d0-97ecaba9b21d admin admin] Failed to copy Secret ceph-pool-kube-rbd from Namespace kube-system to Namespace openstack: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Date': 'Mon, 13 May 2019 11:59:08 GMT', 'Content-Length': '210', 'Content-Type': 'application/json'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"secrets \"ceph-pool-kube-rbd\" not found","reason":"NotFound","details":{"name":"ceph-pool-kube-rbd","kind":"secrets"},"code":404}
2019-05-13 11:59:08.623 96691 ERROR sysinv.conductor.kube_app [-] (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Date': 'Mon, 13 May 2019 11:59:08 GMT', 'Content-Length': '210', 'Content-Type': 'application/json'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"secrets \"ceph-pool-kube-rbd\" not found","reason":"NotFound","details":{"name":"ceph-pool-kube-rbd","kind":"secrets"},"code":404}
2019-05-13 11:59:08.623 96691 ERROR sysinv.conductor.kube_app [-] (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Date': 'Mon, 13 May 2019 11:59:08 GMT', 'Content-Length': '210', 'Content-Type': 'application/json'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"secrets \"ceph-pool-kube-rbd\" not found","reason":"NotFound","details":{"name":"ceph-pool-kube-rbd","kind":"secrets"},"code":404}

2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app Traceback (most recent call last):
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 1150, in perform_app_apply
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app self._create_storage_provisioner_secrets(app.name)
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 697, in _create_storage_provisioner_secrets
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app pool_secret, common.HELM_NS_STORAGE_PROVISIONER, ns)
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/common/kubernetes.py", line 133, in kube_copy_secret
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app body = c.read_namespaced_secret(name, src_namespace, export=True)
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app File "/usr/lib/python2.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 19486, in read_namespaced_secret
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app (data) = self.read_namespaced_secret_with_http_info(name, namespace, **kwargs)
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app File "/usr/lib/python2.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 19577, in read_namespaced_secret_with_http_info
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app collection_formats=collection_formats)
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app File "/usr/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 321, in call_api
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app _return_http_data_only, collection_formats, _preload_content, _request_timeout)
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app File "/usr/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 155, in __call_api
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app _request_timeout=_request_timeout)
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app File "/usr/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 342, in request
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app headers=headers)
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app File "/usr/lib/python2.7/site-packages/kubernetes/client/rest.py", line 231, in GET
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app query_params=query_params)
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app File "/usr/lib/python2.7/site-packages/kubernetes/client/rest.py", line 222, in request
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app raise ApiException(http_resp=r)
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app ApiException: (404)
2019-05-13 11:59:08.623 96691 TRACE sysinv.conductor.kube_app Reason: Not Found
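
The traceback above shows the 404 from read_namespaced_secret propagating out of kube_copy_secret and aborting the apply. As a minimal sketch only (not the sysinv implementation), a copy helper could treat a missing source secret as a distinct, recoverable condition. The function name copy_secret_if_present is hypothetical, and `client` is assumed to behave like kubernetes.client.CoreV1Api, whose ApiException exposes the HTTP code as `.status`.

```python
def copy_secret_if_present(client, name, src_namespace, dst_namespace):
    """Copy a Secret between namespaces; return False if the source is missing.

    Hypothetical helper. `client` is assumed to offer the same
    read_namespaced_secret/create_namespaced_secret methods as
    kubernetes.client.CoreV1Api.
    """
    try:
        body = client.read_namespaced_secret(name, src_namespace)
    except Exception as exc:
        # The Kubernetes ApiException carries the HTTP code in `.status`;
        # only a 404 (source secret absent) is treated as "not yet there".
        if getattr(exc, "status", None) == 404:
            return False
        raise
    # Clear fields tied to the source object so the copy can be created.
    body.metadata.namespace = dst_namespace
    body.metadata.resource_version = None
    body.metadata.uid = None
    client.create_namespaced_secret(dst_namespace, body)
    return True
```

A caller could use the False return to report that the storage provisioner has not created ceph-pool-kube-rbd yet, rather than surfacing a bare 404.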

Test Activity
-------------
Sanity

Revision history for this message
Maria Guadalupe Perez Ibara (maria-gp) wrote :
Erich Cordoba (ericho)
summary: - Application-apply failed.
+ Application-apply failed due to error copying secret ceph-pool-kube-rbd
description: updated
Ghada Khalil (gkhalil)
tags: added: stx.storage
Changed in starlingx:
importance: Undecided → Critical
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Suspect this is related to recent rbd-provisioner de-coupling; assigning to Bob to triage

Changed in starlingx:
assignee: nobody → Bob Church (rchurch)
Revision history for this message
Bob Church (rchurch) wrote :

Looks like the storage init job for the rbd-provisioner failed. Because of the failure, the secret is not created.

For the 2+2 and the 2+2+2 configurations we provision the OSDs (which loads the crushmap) much later, because we first need to establish a quorum (2 of 3 monitors). The platform-integ-apps application applies successfully early during provisioning, but the provisioner storage-init job fails because the quorum has not been established and/or the crushmap has not been loaded in the cluster.

https://review.opendev.org/#/c/658942/ will ensure that platform-integ-apps will not be applied until the required Ceph cluster dependencies are available.
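
The dependency described above (a monitor quorum plus a loaded crushmap) can be sketched as a simple gate. This is an illustrative assumption, not the code in the review: the function name and its three parameters are hypothetical, and how the monitor counts and crushmap state are obtained from Ceph is left out.

```python
def ceph_ready_for_apps(monitors_in_quorum, monitors_configured, crushmap_loaded):
    """Return True when a monitor quorum exists and the crushmap is loaded.

    Hypothetical helper; the inputs are assumed to come from the
    platform's own Ceph status queries.
    """
    # A quorum is a strict majority of the configured monitors (e.g. 2 of 3).
    has_quorum = monitors_in_quorum > monitors_configured // 2
    return has_quorum and crushmap_loaded
```

With three configured monitors, two in quorum and a loaded crushmap satisfy the gate; one monitor, or a missing crushmap, does not.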

Bob Church (rchurch)
Changed in starlingx:
status: New → In Progress
Revision history for this message
Bob Church (rchurch) wrote :

The workaround for this issue is to run the following after the system is fully provisioned and the Ceph cluster is operational.

$ system application-remove platform-integ-apps
$ system application-apply platform-integ-apps

After this you can upload and apply the stx-openstack application.
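
For automation, the two workaround commands could be wrapped as in the sketch below. It issues them in the order given above and nothing more: it does not wait for the remove to finish before applying, which a real script would likely need to handle. The function name and the injectable `run` parameter are hypothetical.

```python
import subprocess

PLATFORM_APP = "platform-integ-apps"


def reapply_platform_app(run=subprocess.check_call):
    """Issue the remove/apply workaround commands in order.

    Hypothetical wrapper. `run` defaults to subprocess.check_call, which
    raises CalledProcessError if a command exits with a non-zero status.
    """
    commands = [
        ["system", "application-remove", PLATFORM_APP],
        ["system", "application-apply", PLATFORM_APP],
    ]
    for command in commands:
        run(command)
    return commands
```

Passing a different `run` callable (for example, one that only logs) allows a dry run without touching the system.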

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; critical priority as this is causing a red sanity.

tags: added: stx.2.0 stx.sanity
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to config (master)

Reviewed: https://review.opendev.org/658942
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=a5a0619ebccea603fc85df39fbc6e190ddae0f93
Submitter: Zuul
Branch: master

commit a5a0619ebccea603fc85df39fbc6e190ddae0f93
Author: Robert Church <email address hidden>
Date: Mon May 13 04:28:02 2019 -0400

    Add application apply prerequisites for platform managed apps

    Add an application-apply dependency for the platform integration
    application which launches the Ceph related charts. This dependency will
    require that a quorum has been established and the crushmap has been
    loaded prior to launching the application.

    This will ensure that the charts have the Ceph connectivity required for
    a successful chart release.

    Change-Id: I56528200d16c68d129bc092e3dcc9af135cff16a
    Story: 2005424
    Task: 30977
    Related-Bug: #1828896
    Signed-off-by: Robert Church <email address hidden>

Ghada Khalil (gkhalil)
Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

Still saw same issue on SM-2 load: 2019-05-18_06-36-50

2019-05-21 13:16:46.547 99992 ERROR sysinv.common.kubernetes [req-0b94ce00-104d-4d3f-9aa1-9acf4c75201a admin admin] Failed to copy Secret ceph-pool-kube-rbd from Namespace kube-system to Namespace openstack: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Date': 'Tue, 21 May 2019 13:16:46 GMT', 'Content-Length': '210', 'Content-Type': 'application/json'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"secrets \"ceph-pool-kube-rbd\" not found","reason":"NotFound","details":{"name":"ceph-pool-kube-rbd","kind":"secrets"},"code":404}
2019-05-21 13:16:46.547 99992 ERROR sysinv.conductor.kube_app [-] (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Date': 'Tue, 21 May 2019 13:16:46 GMT', 'Content-Length': '210', 'Content-Type': 'application/json'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"secrets \"ceph-pool-kube-rbd\" not found","reason":"NotFound","details":{"name":"ceph-pool-kube-rbd","kind":"secrets"},"code":404}
2019-05-21 13:16:46.547 99992 ERROR sysinv.conductor.kube_app [-] (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Date': 'Tue, 21 May 2019 13:16:46 GMT', 'Content-Length': '210', 'Content-Type': 'application/json'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"secrets \"ceph-pool-kube-rbd\" not found","reason":"NotFound","details":{"name":"ceph-pool-kube-rbd","kind":"secrets"},"code":404}

2019-05-21 13:16:46.547 99992 TRACE sysinv.conductor.kube_app Traceback (most recent call last):
2019-05-21 13:16:46.547 99992 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 1149, in perform_app_apply
2019-05-21 13:16:46.547 99992 TRACE sysinv.conductor.kube_app self._create_storage_provisioner_secrets(app.name)
2019-05-21 13:16:46.547 99992 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 696, in _create_storage_provisioner_secrets
2019-05-21 13:16:46.547 99992 TRACE sysinv.conductor.kube_app pool_secret, common.HELM_NS_STORAGE_PROVISIONER, ns)
2019-05-21 13:16:46.547 99992 TRACE sysinv.conductor.kube_app File "/usr/lib64/python2.7/site-packages/sysinv/common/kubernetes.py", line 133, in kube_copy_secret
2019-05-21 13:16:46.547 99992 TRACE sysinv.conductor.kube_app body = c.read_namespaced_secret(name, src_namespace, export=True)
2019-05-21 13:16:46.547 99992 TRACE sysinv.conductor.kube_app File "/usr/lib/python2.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 19486, in read_namespaced_secret
2019-05-21 13:16:46.547 99992 TRACE sysinv.conductor.kube_app (data) = self.read_namespaced_secret_with_http_info(name, namespace, **kwargs)
2019-05-21 13:16:46.547 99992 TRACE sysinv.conductor.kube_app File "/usr/lib/python2.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 19577, in read_namespaced_secret_with_http_info
2019-05-21 13:16:46.547 99992 TRACE sysinv.conductor.kube_app collection_formats=collection_formats)
2019-...


Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Bob Church (rchurch) wrote :

Issues observed here:

1) The stx-openstack app is being applied prior to the completion of the platform-integ-apps
   - Per: http://lists.starlingx.io/pipermail/starlingx-discuss/2019-May/004447.html

     Prior to running any additional user applications (including the stx-openstack
     application), you will want to make sure that the platform application has been
     applied [4] to ensure that persistent volume claims will be serviced. Other than
     this check, no other additional changes are required from an automation
     perspective to launch the stx-openstack application.

   - https://opendev.org/starlingx/config/commit/4758cdfbd864826d46e6e06571d40693dd040b14 will prevent this apply if attempted too soon

2) The stx-openstack apply aborts because the secret created by platform-integ-apps does not exist yet
   - Update for #1 will avoid this

3) platform-integ-apps overrides are being overwritten for the helm toolkit when the stx-openstack upload/apply occurs. This seems to cause the abort of platform-integ-apps
   - We need to land https://review.opendev.org/#/c/660498/ to isolate the app overrides. The difference is shown here as the toolkit is present in both helm repos as they are required by both apps

     [wrsroot@controller-0 19.05(keystone_admin)]$ diff helm-toolkit-helm-toolkit.yaml ~/openstack-save/helm-toolkit-helm-toolkit.yaml
     3c3
     < location: http://controller:8080/helm_charts/stx-platform/helm-toolkit-0.1.0.tgz
     ---
     > location: http://controller:8080/helm_charts/starlingx/helm-toolkit-0.1.0.tgz

In summary, we have a sequencing issue here which can no longer happen based on the inter_app dependency code that I added in https://opendev.org/starlingx/config/commit/4758cdfbd864826d46e6e06571d40693dd040b14
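
The sequencing rule summarized above can be illustrated with a small prerequisite check. This is a sketch under stated assumptions, not the code from the referenced commit: the APP_PREREQUISITES table, the function name, and the use of the string "applied" as the status of a completed application are all hypothetical.

```python
# Hypothetical table: application name -> applications that must be applied first.
APP_PREREQUISITES = {"stx-openstack": ["platform-integ-apps"]}


def missing_prerequisites(app_name, app_status):
    """Return the prerequisites of `app_name` that are not yet applied.

    `app_status` maps application names to a status string; "applied" is
    assumed to mark a completed apply.
    """
    return [
        dep
        for dep in APP_PREREQUISITES.get(app_name, [])
        if app_status.get(dep) != "applied"
    ]
```

An apply request would be rejected while the returned list is non-empty, which covers both issue 1 and issue 2 above.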

Timeline:
---------------------------------------------
# Ceph client is accessible

2019-05-21 07:13:06.530 98217 INFO ceph_client [-] Request params: url=https://controller-0:5001/request?wait=1, json={'prefix': 'fsid', 'format': 'text'}
2019-05-21 07:13:06.546 98217 INFO ceph_client [-] Result: {u'waiting': [], u'has_failed': False, u'state': u'success', u'is_waiting': False, u'running': [], u'failed': [], u'finished': [{u'outb': u'326ed215-c644-4855-b5f9-eaeb0328ff73\n', u'outs': u'', u'command': u'fsid format=text'}], u'is_finished': True, u'id': u'140310473308432'}

# Audit task triggers creation/upload of platform-integ-apps

2019-05-21 07:13:20.082 99992 INFO sysinv.conductor.manager [-] Platform managed application platform-integ-apps: Creating...
2019-05-21 07:13:21.428 99992 INFO sysinv.conductor.manager [-] Platform managed application platform-integ-apps: Uploading...
2019-05-21 07:13:21.430 99992 INFO sysinv.conductor.kube_app [-] Application (platform-integ-apps) upload started.
2019-05-21 07:13:23.633 99992 INFO sysinv.conductor.kube_app [-] Manifest file /manifests/platform-integ-apps-manifest.yaml was successfully validated.
2019-05-21 07:13:24.178 99992 INFO sysinv.conductor.kube_app [-] Application platform-integ-apps will load charts to chart repo stx-platform
2019-05-21 07:13:27.362 99992 INFO sysinv.conductor.kube_app [-] Generating application overrides...


Revision history for this message
Saul Wold (sgw-starlingx) wrote :

As suggested by Al, I might be seeing a manifestation of this issue in a virtual environment. When I follow the existing test-suite scripts used for sanity testing, the first apply always fails and the second apply succeeds.

http://lists.starlingx.io/pipermail/starlingx-discuss/2019-May/004677.html
