App platform-integ-apps failed to apply (error get configmaps)

Bug #1856078 reported by Cristopher Lemus
This bug affects 4 people
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Bob Church

Bug Description

Brief Description
-----------------
During provision, platform-integ-apps failed to apply. Armada log reports:

details = "the server could not find the requested resource (get configmaps)"

Severity
--------
Critical: Not possible to complete the setup.

Steps to Reproduce
------------------
Follow the documentation to install and configure StarlingX Duplex on bare metal.

Expected Behavior
------------------
platform-integ-apps should apply automatically

Actual Behavior
----------------
platform-integ-apps failed to apply

Reproducibility
---------------
Seen once - Will update if we face the issue on another try.

System Configuration
--------------------
Duplex Baremetal

Branch/Pull Time/Commit
-----------------------
MASTER - BUILD_ID="20191211T023000Z"

Last Pass
---------
Passed with build from one day before.

Timestamp/Logs
--------------
Full collect attached.
Additional details:
http://paste.openstack.org/show/787471/

Test Activity
-------------
Sanity

Ghada Khalil (gkhalil) wrote :

Cristopher, is this an issue on loads from the r/stx.3.0 branch, or is it isolated to master?

Cristopher Lemus (cjlemusc) wrote :

Hi Ghada, this is isolated to master. For r/stx.3.0, none of the configurations (virtual or baremetal) had this issue; the r/stx.3.0 sanity is still green.

Al Bailey (albailey1974) wrote :

I don't know if this is it or not, but there is a warning that indicates that it plans to treat requests as anonymous.

{"log":"I1211 11:48:13.020219 1 serving.go:319] Generated self-signed cert in-memory\n","stream":"stderr","time":"2019-12-11T11:48:13.020503599Z"}
{"log":"W1211 11:48:23.557229 1 authentication.go:199] Error looking up in-cluster authentication configuration: Get https://192.168.206.1:6443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: net/http: TLS handshake timeout\n","stream":"stderr","time":"2019-12-11T11:48:23.55747706Z"}
{"log":"W1211 11:48:23.557253 1 authentication.go:200] Continuing without authentication configuration. This may treat all requests as anonymous.\n","stream":"stderr","time":"2019-12-11T11:48:23.557517181Z"}
{"log":"W1211 11:48:23.557258 1 authentication.go:201] To require authentication configuration lookup to succeed, set --authentication-tolerate-lookup-failure=false\n","stream":"stderr","time":"2019-12-11T11:48:23.557528497Z"}

Afterwards tiller logs show:
{"log":"[storage/driver] 2019/12/11 11:52:03 list: failed to list: the server could not find the requested resource (get configmaps)\n","stream":"stderr","time":"2019-12-11T11:52:03.395419187Z"}
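As an illustration of how container logs in this JSON-per-line format could be scanned for the failure signature, here is a minimal sketch. It assumes each line is a JSON object with "log" and "time" keys, as in the excerpts above; the function name find_configmap_errors is hypothetical and not part of any StarlingX tool.

```python
import json

# Failure signature quoted in the tiller and Armada logs of this bug.
ERROR_MARKER = "could not find the requested resource (get configmaps)"


def find_configmap_errors(lines):
    """Return (timestamp, message) pairs for log records containing the marker.

    Each input line is expected to be one JSON object with "log" and "time"
    keys. Lines that are empty or not valid JSON objects are skipped.
    """
    hits = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not a JSON log record
        if not isinstance(entry, dict):
            continue
        message = entry.get("log", "")
        if ERROR_MARKER in message:
            hits.append((entry.get("time", ""), message.rstrip("\n")))
    return hits


# Two records copied from the excerpts above.
sample = [
    '{"log":"I1211 11:48:13.020219 1 serving.go:319] Generated self-signed cert in-memory\\n","stream":"stderr","time":"2019-12-11T11:48:13.020503599Z"}',
    '{"log":"[storage/driver] 2019/12/11 11:52:03 list: failed to list: the server could not find the requested resource (get configmaps)\\n","stream":"stderr","time":"2019-12-11T11:52:03.395419187Z"}',
]

for timestamp, message in find_configmap_errors(sample):
    print(timestamp, message)
```

Feeding the function the lines of a tiller container log file would list every occurrence of the error together with its timestamp, which helps correlate it with the earlier authentication warnings.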

Ghada Khalil (gkhalil)
tags: added: stx.containers

Ghada Khalil (gkhalil) wrote :

Hi Cristopher, please let us know if this is seen on the newer load from master and/or r/stx.3.0. Thanks.

Cristopher Lemus (cjlemusc) wrote :

Hi Ghada, sorry, yesterday was a holiday at my location. However, I took a look at both sanity runs, from yesterday (20191212T031052Z) and today (20191213T023000Z). The error did not reproduce on either baremetal or virtual environments.

Cristopher Lemus (cjlemusc) wrote :

Verified against master branch again, latest build: http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20191213T023000Z/

The issue did not reproduce.

Ghada Khalil (gkhalil) wrote :

Thanks, Cristopher. This appears to be an intermittent issue. We'll leave it open for now pending further investigation.

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Bob Church (rchurch)

Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - the issue appears to be intermittent. The priority can be raised if it becomes more frequent.

tags: added: stx.3.0
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
tags: added: stx.4.0
removed: stx.3.0
Changed in starlingx:
importance: High → Medium

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/699306

Changed in starlingx:
status: Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/699307

Cristopher Lemus (cjlemusc) wrote : Re: App platform-integ-apps failed to apply

This behavior reproduced with master ISO BUILD_ID="20191220T023000Z". Attached is a full collect. I see that the review is still not closed. Could you please help me confirm whether this is the same issue? Standard External storage is complaining about configmaps.

I got these details from the logs: http://paste.openstack.org/show/787826/

If this is a different issue, I'll proceed to create a new bug. Thanks in advance.

Ghada Khalil (gkhalil)
summary: - App platform-integ-apps failed to apply
+ App platform-integ-apps failed to apply (error get configmaps)

Anujeyan Manokeran (anujeyan) wrote :

This issue was reproduced during the regression test on wcp71-75 in load 2020-01-31_00-10-00.

Peng Peng (ppeng) wrote :

Issue was reproduced on

Lab: WCP_71_75
Load: 2020-02-03_00-10-00

Log @
https://files.starlingx.kube.cengn.ca/launchpad/1856078

Peng Peng (ppeng) wrote :

Issue was reproduced when the system applied both the stx-monitor and hello-kitty apps.
Lab: WCP_112
Load: 2020-02-06_04-10-00
Log added @
https://files.starlingx.kube.cengn.ca/launchpad/1856078

Wendy Mitchell (wmitchellwr) wrote :

Issue was apparent on clean install
Lab (IPV6): WP-3-7
2020-02-22_11-13-25

see sysinv.log (controller-0 log attached)

sysinv 2020-02-24 16:02:15.565 274704 INFO sysinv.conductor.kube_app [-] Armada service started!
sysinv 2020-02-24 16:02:15.566 274704 INFO sysinv.conductor.kube_app [-] Armada apply command = /bin/bash -c 'set -o pipefail; armada apply --enable-chart-cleanup --debug /manifests/platform-integ-apps/1.0-8/platform-integ-apps-manifest.yaml --values /overrides/platform-integ-apps/1.0-8/kube-system-rbd-provisioner.yaml --values /overrides/platform-integ-apps/1.0-8/kube-system-ceph-pools-audit.yaml --values /overrides/platform-integ-apps/1.0-8/helm-toolkit-helm-toolkit.yaml --tiller-host tiller-deploy.kube-system.svc.cluster.local | tee /logs/platform-integ-apps-apply_2020-02-24-16-02-15.log'
sysinv 2020-02-24 16:02:16.376 274704 INFO sysinv.conductor.kube_app [-] Starting progress monitoring thread for app platform-integ-apps
sysinv 2020-02-24 16:02:17.513 274704 ERROR sysinv.conductor.kube_app [-] Failed to apply application manifest /manifests/platform-integ-apps/1.0-8/platform-integ-apps-manifest.yaml. See /var/log/armada/platform-integ-apps-apply_2020-02-24-16-02-15.log for details.

see also armada log.

 get_results /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:215
2020-02-24 16:02:17.327 16 INFO armada.handlers.lock [-] Releasing lock
2020-02-24 16:02:17.332 16 ERROR armada.cli [-] Caught unexpected exception: grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
 status = StatusCode.UNKNOWN
 details = "the server could not find the requested resource (get configmaps)"
 debug_error_string = "{"created":"@1582560136.998050126","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"the server could not find the requested resource (get configmaps)","grpc_status":2}"
>

2020-02-24 16:02:17.332 16 ERROR armada.cli Traceback (most recent call last):
2020-02-24 16:02:17.332 16 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/__init__.py", line 38, in safe_invoke
2020-02-24 16:02:17.332 16 ERROR armada.cli self.invoke()
2020-02-24 16:02:17.332 16 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 213, in invoke
2020-02-24 16:02:17.332 16 ERROR armada.cli resp = self.handle(documents, tiller)
2020-02-24 16:02:17.332 16 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py", line 81, in func_wrapper
2020-02-24 16:02:17.332 16 ERROR armada.cli return future.result()
2020-02-24 16:02:17.332 16 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
2020-02-24 16:02:17.332 16 ERROR armada.cli return self.__get_result()
2020-02-24 16:02:17.332 16 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
2020-02-24 16:02:17.332 16 ERROR armada.cli raise self._exception
2020-02-24 16:02:17.332 16 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
2020-02-24 16:02:17.332 16 ERROR armada.cli...

OpenStack Infra (hudson-openstack) wrote : Change abandoned on ansible-playbooks (master)

Change abandoned by Bob Church (<email address hidden>) on branch: master
Review: https://review.opendev.org/699307
Reason: We will live with this bug until we move to Helm V3, which is tillerless and will eliminate this race condition.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (master)

Change abandoned by Bob Church (<email address hidden>) on branch: master
Review: https://review.opendev.org/699306
Reason: We will live with this bug until we move to Helm V3, which is tillerless and will eliminate this race condition.

Cristopher Lemus (cjlemusc) wrote :

This error was found on Baremetal Duplex, fresh install, using build:

Additional info: http://paste.openstack.org/show/790090/

Collect uploaded: https://files.starlingx.kube.cengn.ca/download_file/51

I understand that this will be randomly appearing until the implementation of Helm V3. Should we keep updating the bug with new findings?

Frank Miller (sensfan22) wrote :

The root cause of this issue is in the upstream tiller project. There is a planned stx.4.0 activity to upversion helm to v3 which will remove tiller and prevent this issue from occurring: https://storyboard.openstack.org/#!/story/2007000

Until that story is implemented please use this workaround:

If helm commands fail with the following error then a restart of the tiller pod will be required:
controller-0:~$ helm ls -a
Error: the server could not find the requested resource (get configmaps)
To restart the tiller pod use the following:
controller-0:~$ kubectl get pods --all-namespaces -o wide|grep tiller-deploy
kube-system tiller-deploy-d6b59fcb-g676l 1/1 Running 1 18h fd01:1::3 controller-0 <none> <none>

controller-0:~$ kubectl delete pods -n kube-system tiller-deploy-d6b59fcb-g676l

Confirm tiller was restarted:
controller-0:~$ kubectl get pods --all-namespaces -o wide|grep tiller-deploy
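The manual steps above could be scripted roughly as follows. This is a minimal sketch, not an official StarlingX tool: it assumes kubectl is on the PATH and already configured for the cluster, and the helper names find_tiller_pods and restart_tiller are hypothetical. It relies on the same name match ("tiller-deploy") that the grep in the workaround uses.

```python
import subprocess


def find_tiller_pods(get_pods_output):
    """Extract (namespace, pod_name) pairs for tiller pods.

    The input is the text printed by
    `kubectl get pods --all-namespaces -o wide`, where the first column is
    the namespace and the second column is the pod name.
    """
    pods = []
    for line in get_pods_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1].startswith("tiller-deploy"):
            pods.append((fields[0], fields[1]))
    return pods


def restart_tiller():
    """Delete every tiller pod so that its deployment recreates it."""
    listing = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "wide"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    for namespace, name in find_tiller_pods(listing):
        # Same command as the manual workaround, with the pod name filled in.
        subprocess.run(
            ["kubectl", "delete", "pods", "-n", namespace, name],
            check=True,
        )
```

Calling restart_tiller() on the controller performs the lookup and delete in one step; rerunning the kubectl get pods command afterwards should show a new tiller-deploy pod with a fresh age.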

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/721171

OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/699307
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=abbf21f7fcef00e90e75d393f638a73d58b41adb
Submitter: Zuul
Branch: master

commit abbf21f7fcef00e90e75d393f638a73d58b41adb
Author: Robert Church <email address hidden>
Date: Mon Dec 16 12:53:10 2019 -0500

    Patch tiller deployment to provide environment validation

    There appears to be a race condition between when kubelet sees a pod and
    when kubelet sees a service. Due to this race, the environment variables
    that tiller requires to function properly are missing.

    See the comment at
    https://github.com/kubernetes/kubernetes/blob/v1.18.1/pkg/kubelet/kubelet_pods.go#L566

    This change patches the tiller deployment to make sure the four classes
    of environment variables are present prior to starting tiller. If any
    class of variables are not present in the environment, then exit. This
    will recreate the pod and will populate the correct environment for
    tiller to function.

    Since the upgrade to v1.18.1, this has been seen in simplex and duplex
    controller configurations.

    This will cover patching during initial provisioning via ansible and
    will be reverted once StarlingX moves to helm v3.

    Change-Id: I78e43459fedab611a67b8d9b6b3121b78ef048a6
    Partial-Bug: #1856078
    Signed-off-by: Robert Church <email address hidden>
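The environment check described in this commit message can be pictured with the sketch below. It is an illustration only, not the merged patch: the commit message does not name the four classes of variables, so the prefixes used here (the Kubernetes API service, its port variables, kube-dns, and the tiller-deploy service) are assumptions based on how Kubernetes names service environment variables, and the function names are hypothetical.

```python
import os
import sys

# ASSUMPTION: these four prefixes stand in for the "four classes" of
# environment variables; the commit message does not list them.
REQUIRED_PREFIXES = (
    "KUBERNETES_SERVICE",
    "KUBERNETES_PORT",
    "KUBE_DNS",
    "TILLER_DEPLOY",
)


def missing_env_classes(environ, prefixes=REQUIRED_PREFIXES):
    """Return the prefixes for which no variable exists in environ."""
    return [
        prefix
        for prefix in prefixes
        if not any(name.startswith(prefix) for name in environ)
    ]


def ensure_environment(environ=None):
    """Exit with status 1 if any class of variables is missing.

    Exiting lets the pod be recreated, at which point kubelet is expected
    to inject the complete set of service variables.
    """
    environ = os.environ if environ is None else environ
    missing = missing_env_classes(environ)
    if missing:
        print(
            "missing service environment classes: " + ", ".join(missing),
            file=sys.stderr,
        )
        sys.exit(1)
```

In a real deployment the check would run before the tiller binary starts, so that a pod created before kubelet saw the services exits and is replaced rather than running with an incomplete environment.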

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/721171
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=24a0284e3d182faac2b613ddb9f9f36c5ba3995a
Submitter: Zuul
Branch: master

commit 24a0284e3d182faac2b613ddb9f9f36c5ba3995a
Author: Robert Church <email address hidden>
Date: Sun Apr 19 06:22:50 2020 -0400

    Patch Tiller deployment to ensure self-recovery

    On node startup, there appears to be a race condition between when
    kubelet sees a pod and when kubelet sees a service. Due to this race,
    the environment variables that tiller requires to function properly
    are missing.

    See the comment at
    https://github.com/kubernetes/kubernetes/blob/v1.18.1/pkg/kubelet/kubelet_pods.go#L566

    This change patches the tiller deployment to make sure the four classes
    of environment variables are present prior to starting tiller. If any
    class of variables are not present in the environment, then exit. This
    will recreate the pod and will populate the correct environment for
    tiller to function.

    Since the upgrade to v1.18.1, this has been seen in simplex and duplex
    controller configurations.

    Review https://review.opendev.org/#/c/699307/ will cover patching during
    initial provisioning via ansible. This change will check that tiller is
    patched every time the conductor starts as part of the tiller upgrade
    logic. This will cover scenarios where tiller is manually removed from
    the cluster and reinstalled via helm.

    This change should be reverted once StarlingX moves to helm v3.

    Also removed dead code: get_k8s_secret()

    Change-Id: Icd199ec1b1e10840094c0eae59d53838f32ffd6f
    Closes-Bug: #1856078
    Signed-off-by: Robert Church <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729809

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729812

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (f/centos8)

Reviewed: https://review.opendev.org/729812
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=539d476456277c22d0dcbc3cbbc832e623242264
Submitter: Zuul
Branch: f/centos8

commit 320cc40de8518787c2be234d7fdf88ec0a462df2
Author: Don Penney <email address hidden>
Date: Wed May 13 13:06:11 2020 -0400

    Add auto-versioning to starlingx/config packages

    This update makes use of the PKG_GITREVCOUNT variable to auto-version
    the packages in this repo.

    Change-Id: I3a2c8caeb4b4647608978b1f2ccfcf0661508803
    Depends-On: https://review.opendev.org/727837
    Story: 2006166
    Task: 39766
    Signed-off-by: Don Penney <email address hidden>

commit d9f2aea0fb228ed69eb9c9262e29041eedabc15d
Author: Sharath Kumar K <email address hidden>
Date: Wed Apr 22 16:22:22 2020 +0200

    De-branding in starlingx/config: CGCS -> StarlingX

    1. Rename CGCS to StarlingX for .spec files

    Test:
    After the de-brand change, bootimage.iso has been built in the flock
    Layer and installed on the dev machine to validate the changes.

    Please note, doing de-brand changes in batches, this is batch9 changes.

    Story: 2006387
    Task: 39524

    Change-Id: Ia1fe0f2baafb78c974551100f16e6a7d99882f15
    Signed-off-by: Sharath Kumar K <email address hidden>

    De-branding in starlingx/config: CGCS -> StarlingX

    1. Rename CGCS to StarlingX for .spec file
    2. Rename TIS to StarlingX for .service files

    Test:
    After the de-brand change, bootimage.iso has been built in the flock
    Layer and installed on the dev machine to validate the changes.

    Please note, doing de-brand changes in batches, this is batch10 changes.

    Story: 2006387
    Task: 36202

    Change-Id: I404ce0da2621495175ad31489e9ad6f7b0211e26
    Signed-off-by: Sharath Kumar K <email address hidden>

commit d141e954fa6bbf688929ec90d1b6604a97792c43
Author: Teresa Ho <email address hidden>
Date: Tue Mar 31 10:08:57 2020 -0400

    Sysinv extensions for FPGA support

    This update adds cli and restapi to support FPGA device
    programming.

    CLI commands:
    system device-image-apply
    system device-image-create
    system device-image-delete
    system device-image-list
    system device-image-remove
    system device-image-show
    system device-image-state-list
    system device-label-list
    system host-device-image-update
    system host-device-image-update-abort
    system host-device-label-assign
    system host-device-label-list
    system host-device-label-remove

    Story: 2006740
    Task: 39498

    Change-Id: I556c2e7a51b3931b5a66ab27b67f51e3a8aebd9f
    Signed-off-by: Teresa Ho <email address hidden>

commit 491cca42ed854d2cb3ee3646b93c56a4f45f563c
Author: Elena Taivan <email address hidden>
Date: Wed Apr 29 11:25:26 2020 +0000

    Qcow2 conversion to raw can be done using 'image-conversion' filesystem

    1. Conversion filesystem can be added before/after
       stx-openstack is applied
    2. If conversion filesystem is added after stx-openstack
       is applied, changes to stx-openstack will only take effec...

tags: added: in-f-centos8

OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (f/centos8)

Reviewed: https://review.opendev.org/729809
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=73027425d4501a6b7785e91024c9e8ddbc03115d
Submitter: Zuul
Branch: f/centos8

commit 55c9afd075194f7669fa2a87e546f61034679b04
Author: Dan Voiculeasa <email address hidden>
Date: Wed May 13 14:19:52 2020 +0300

    Restore: disconnect etcd from ceph

    At the moment etcd is restored only if ceph data is kept.
    Etcd should be restored regardless if ceph data is kept or wiped.

    Story: 2006770
    Task 39751
    Change-Id: I9dfb1be0a83c3fdc5f1b29cbb974c5e0e2236ad3
    Signed-off-by: Dan Voiculeasa <email address hidden>

commit 003ddff574c74adf11cf8e4758e93ba0eed45a6a
Author: Don Penney <email address hidden>
Date: Fri May 8 11:35:58 2020 -0400

    Add playbook for updating static images

    This commit introduces a new playbook, upgrade-static-images.yml, used
    for downloading updating images and pushing to the local registry.

    Change-Id: I8884440261a5a4e27b40398e5a75c9d03b09d4ba
    Story: 2006781
    Task: 39706
    Signed-off-by: Don Penney <email address hidden>

commit 26fd273cf5175ba4bdd31d6b6b777814f1a6c860
Author: Matt Peters <email address hidden>
Date: Thu May 7 14:29:02 2020 -0500

    Add kube-apiserver port to calico failsafe rules

    An invalid GlobalNetworkPolicy or NetworkPolicy may prevent
    calico-node from communicating with the kube-apiserver.
    Once the communication is broken, calico-node is no longer
    able to update the policies since it cannot communicate to
    read the updated policies. It can also prevent the pod
    from starting since the policies will prevent it from
    reading the configuration.

    To ensure that this scenario does not happen, the kube-apiserver
    port is being added to the failsafe rules to ensure communication
    is always possible, regardless of the network policy configuration.

    Change-Id: I1b065a74e7ad0ba9b1fdba4b63136b97efbe98ce
    Closes-Bug: 1877166
    Related-Bug: 1877383
    Signed-off-by: Matt Peters <email address hidden>

commit bd0f14a7dfb206ccaa3ce0f5e7d9034703b3403c
Author: Robert Church <email address hidden>
Date: Tue May 5 15:11:15 2020 -0400

    Provide an update strategy for Tiller deployment

    In the case of a simplex controller configuration the current patching
    strategy for the Tiller environment will fail as the tiller ports will
    be in use when the new deployment is attempted to be applied. The
    resulting tiller pod will be stuck in a Pending state.

    This will be observed if the node becomes ready after 'helm init'
    installs the initial deployment and before the deployment is patched for
    environment checks.

    The deployment strategy provided by 'helm init' is unspecified. This
    change will allow one additional pod (current + new) and one unavailable
    pod (current) during an update. The maxUnavailable setting allows the
    tiller pod to be deleted which will release its ports, thus allowing the
    patch deployment to spin up an new pod to a Running state.

    Change-Id: I83c43c52a77...
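The update strategy described in the last commit message above (one additional pod and one unavailable pod during an update) maps onto the rolling-update fields of a Kubernetes Deployment. The following is a minimal sketch of such a patch body, assuming the standard Deployment fields maxSurge and maxUnavailable; the exact patch applied by the commit is not shown in this report, and the function name tiller_strategy_patch is hypothetical.

```python
import json


def tiller_strategy_patch(max_surge=1, max_unavailable=1):
    """Build a Deployment patch body that sets a rolling-update strategy.

    max_surge=1 allows one pod above the desired count (current + new);
    max_unavailable=1 allows the current pod to be deleted during the
    update, which releases its ports for the replacement pod.
    """
    return {
        "spec": {
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {
                    "maxSurge": max_surge,
                    "maxUnavailable": max_unavailable,
                },
            }
        }
    }


# JSON text suitable for passing to a patch command or API call.
patch_json = json.dumps(tiller_strategy_patch())
print(patch_json)
```

The resulting JSON could, for example, be supplied as the -p argument of kubectl patch deployment tiller-deploy -n kube-system, using the deployment name and namespace that appear earlier in this report.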
