Simplex - application-apply aborts on ceilometer

Bug #1820928 reported by Cristopher Lemus on 2019-03-20
Affects: StarlingX
Importance: Critical
Assigned to: Bart Wensley

Bug Description

Title
-----
Simplex - application-apply fails at 95% - ceilometer pod

Brief Description
-----------------
On simplex virtual and bare metal, system application-apply aborts at 95% progress, during “processing chart: osh-openstack-ceilometer, overall completion: 95.0%”.

Severity
--------
Critical on Simplex.
Duplex and Standard controller are not affected.

Steps to Reproduce
------------------
Follow the steps in https://wiki.openstack.org/wiki/StarlingX/Containers/Installation; the failure is observed during system application-apply.
This has been reproduced on two different bare metal servers and in a virtual environment.

Expected Behavior
------------------
The system application-apply should complete.

Actual Behavior
----------------
System application-apply aborts at 95%.

Reproducibility
---------------
100% reproducible on simplex.

System Configuration
--------------------
Simplex BareMetal and virtual.

Branch/Pull Time/Commit
-----------------------
CENGN Master ISO: http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190318T233000Z/outputs/iso/

Timestamp/Logs
--------------
[wrsroot@controller-0 ~(keystone_admin)]$ system application-list
+---------------+-----------------+---------------+--------------+------------------------------------------+
| application | manifest name | manifest file | status | progress |
+---------------+-----------------+---------------+--------------+------------------------------------------+
| stx-openstack | armada-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail |
+---------------+-----------------+---------------+--------------+------------------------------------------+

Status of ceilometer pods:
$ kubectl get pods --all-namespaces -o wide |grep ceil
openstack ceilometer-central-6c8bf6d7df-rbxdf 0/1 Init:0/1 0 78m 172.16.0.136 controller-0 <none>
openstack ceilometer-compute-cpb4w 0/1 Init:0/1 0 78m 192.168.204.3 controller-0 <none>
openstack ceilometer-ks-service-jzp8j 0/1 Completed 0 78m 172.16.0.138 controller-0 <none>
openstack ceilometer-ks-user-rzjjn 0/1 Completed 0 78m 172.16.0.137 controller-0 <none>
openstack ceilometer-notification-f5ff4657c-sp5bd 0/1 Init:0/1 0 78m 172.16.0.134 controller-0 <none>
openstack ceilometer-rabbit-init-dfvmf 0/1 Completed 0 78m 172.16.0.139 controller-0 <none>

Full “collect” attached.

Ghada Khalil (gkhalil) wrote :

Marking as release gating; issue affects simplex deployments and was reported from community sanity.

Changed in starlingx:
importance: Undecided → High
tags: added: stx.2019.05 stx.containers
Changed in starlingx:
status: New → Triaged
assignee: nobody → Angie Wang (angiewang)
Angie Wang (angiewang) wrote :

This is not specific to the ceilometer chart.

We observed keystone auth token issues during and after application apply.
All openstack commands are broken because of the following error:
Failed to discover available identity versions when contacting http://keystone.openstack.svc.cluster.local/v3. Attempting to parse version from URL.
Unable to establish connection to http://keystone.openstack.svc.cluster.local/v3/auth/tokens: HTTPConnectionPool(host='keystone.openstack.svc.cluster.local', port=80): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fc90f088cd0>: Failed to establish a new connection: [Errno -2] Name or service not known',))

In this case, the keystone auth token issue happened during the ceilometer chart installation. The chart failed to install because the ceilometer-db-sync job failed. That job creates ceilometer resources through the gnocchi REST API, which in turn requires requesting a keystone token.

We have seen a similar error in internal sanity.

Ghada Khalil (gkhalil) on 2019-03-22
Changed in starlingx:
status: Triaged → In Progress
assignee: Angie Wang (angiewang) → Al Bailey (albailey1974)
Ghada Khalil (gkhalil) on 2019-03-22
Changed in starlingx:
importance: High → Critical
Frank Miller (sensfan22) wrote :

Bart has taken over this investigation, so assigning to Bart.

Changed in starlingx:
assignee: Al Bailey (albailey1974) → Bart Wensley (bartwensley)
Bart Wensley (bartwensley) wrote :

I believe we are hitting the following bug: https://github.com/kubernetes/kubernetes/issues/74412

The kubelet is hitting the limit of 250 http2 streams in a single connection. I have tested both of the mitigations described in the following comment and they both work:
https://github.com/kubernetes/kubernetes/issues/74412#issuecomment-468437599
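For illustration, the two mitigations look roughly like the following. This is a sketch only: the file paths are assumptions and vary by deployment, and the stream-limit value of 1000 is an example, not a value taken from this bug.

```yaml
# Mitigation 1 (kubelet side): switch configmap/secret change detection
# from Watch to Cache so the kubelet stops opening one http2 stream per
# watched configmap/secret. KubeletConfiguration fragment, e.g. in an
# assumed /var/lib/kubelet/config.yaml:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
configMapAndSecretChangeDetectionStrategy: Cache  # default is Watch

# Mitigation 2 (apiserver side): raise the per-connection stream limit
# above the 250 default by adding a flag to the kube-apiserver command
# line, e.g. in an assumed /etc/kubernetes/manifests/kube-apiserver.yaml:
#   - --http2-max-streams-per-connection=1000
```

The fix merged for this bug took the first approach, changing the kubelet's configMapAndSecretChangeDetectionStrategy to Cache.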

Reviewed: https://review.openstack.org/649289
Committed: https://git.openstack.org/cgit/openstack/stx-config/commit/?id=0d61ade5b9eff7d4f9c61f43c25cdf9a7043f8c0
Submitter: Zuul
Branch: master

commit 0d61ade5b9eff7d4f9c61f43c25cdf9a7043f8c0
Author: Bart Wensley <email address hidden>
Date: Tue Apr 2 06:54:43 2019 -0500

    Fix application-apply of stx-openstack on simplex

    The application-apply of the stx-openstack application on
    simplex configurations has been failing since the barbican
    chart was added to the application. The failure was due
    to lost node status messages from the kubelet to the
    kube-apiserver, which causes the node to be marked
    NotReady and endpoints to be removed.

    The root cause is the kubernetes bug here:
    https://github.com/kubernetes/kubernetes/issues/74412

    In short, the addition of the barbican chart added enough
    new secrets/configmaps that the kubelet hit the limit of
    http2-max-streams-per-connection. As done upstream, the
    fix is to change the following kubelet config:
    configMapAndSecretChangeDetectionStrategy (from Watch to
    Cache).

    Change-Id: Ic816a91984c4fb82546e4f43b5c83061222c7d05
    Closes-bug: 1820928
    Signed-off-by: Bart Wensley <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Cristopher Lemus (cjlemusc) wrote :

Downloaded the latest CENGN ISO (20190403T013000Z) and deployed it on virtual and bare metal environments. Confirmed that, on both environments, the issue is fixed: application-apply completed without any issues.

Sanity test passed on the virtual environment. For bare metal it is in progress; a detailed report should be sent between today and tomorrow.

Ken Young (kenyis) on 2019-04-05
tags: added: stx.2.0
removed: stx.2019.05
Ghada Khalil (gkhalil) on 2019-04-09
tags: added: stx.retestneeded
Ghada Khalil (gkhalil) on 2019-04-09
tags: removed: stx.retestneeded
Peng Peng (ppeng) wrote :

Verified
Lab: SM_2
Load: 20190410T013000Z
