After force reboot of active controller keystone-api pod sometimes hangs

Bug #1823375 reported by Bart Wensley
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: yong hu

Bug Description

Brief Description
-----------------
After force reboot of active controller keystone-api pod sometimes hangs

Severity
--------
Minor - the problem has a major impact when it occurs, but it happens only rarely

Steps to Reproduce
------------------
1. Install stx-openstack application on a system with two controllers (I saw this on a 2+2 config).
2. Force reboot the active controller.
I saw the issue on 2 out of about 20 reboots of the active controller.
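
A minimal sketch (not from the original report) of one way to drive step 2 and observe the result, assuming console access to the active controller and that magic-sysrq is enabled; the exact reboot method used in the original test is not recorded here:

# On the active controller (console access assumed): force an immediate,
# unclean reboot via sysrq. This is one possible method only.
echo b | sudo tee /proc/sysrq-trigger

# On the surviving controller: watch the stx-openstack pods while the peer reboots.
kubectl -n openstack get pods -o wide -w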

Expected Behavior
------------------
All of the stx-openstack application's pods should return to service in a reasonable amount of time.

Actual Behavior
----------------
Sometimes the readiness probe fails in the keystone-api pod (on the non-rebooting controller) and it stops functioning. This means keystone is unavailable until the rebooting controller finishes its reboot.

Reproducibility
---------------
Intermittent (seen on 2 out of about 20 active controller reboots)

System Configuration
--------------------
Seen on a 2+2 system.
Lab was windriver ironpass-1-4

Branch/Pull Time/Commit
-----------------------
SW_VERSION="19.01"
BUILD_TARGET="Unknown"
BUILD_TYPE="Informal"
BUILD_ID="n/a"

JOB="n/a"
BUILD_BY="bwensley"
BUILD_NUMBER="n/a"
BUILD_HOST="yow-bwensley-lx-vm2"
BUILD_DATE="2019-03-20 08:05:38 -0500"

BUILD_DIR="/"
WRS_SRC_DIR="/localdisk/designer/bwensley/starlingx-0/cgcs-root"
WRS_GIT_BRANCH="HEAD"
CGCS_SRC_DIR="/localdisk/designer/bwensley/starlingx-0/cgcs-root/stx"
CGCS_GIT_BRANCH="HEAD"

Last Pass
---------
This is new testing on the containers deployment.

Timestamp/Logs
--------------

Here are the stx-openstack pods while controller-0 was rebooting, showing the keystone-api pod on controller-1 has 0/1 ready:

[root@controller-1 ~(keystone_admin)]# kubectl -n openstack get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
aodh-api-56fc6554df-lcplh 1/1 Running 0 83m 172.16.1.194 controller-1 <none>
aodh-api-56fc6554df-mvnj6 1/1 Unknown 0 20h 172.16.0.65 controller-0 <none>
aodh-api-56fc6554df-rjlqc 0/1 Pending 0 3m24s <none> <none> <none>
aodh-evaluator-9cc59d7f6-5sr7k 1/1 Unknown 0 20h 172.16.0.41 controller-0 <none>
aodh-evaluator-9cc59d7f6-d7qtj 0/1 Pending 0 3m24s <none> <none> <none>
aodh-evaluator-9cc59d7f6-gj6tr 1/1 Running 0 83m 172.16.1.166 controller-1 <none>
aodh-listener-65df7d6bbf-6bfpk 0/1 Pending 0 3m24s <none> <none> <none>
aodh-listener-65df7d6bbf-8zbnf 1/1 Unknown 0 20h 172.16.0.66 controller-0 <none>
aodh-listener-65df7d6bbf-t6s8m 1/1 Running 0 83m 172.16.1.187 controller-1 <none>
aodh-notifier-9bfd474f4-cx7rd 1/1 Unknown 0 20h 172.16.0.57 controller-0 <none>
aodh-notifier-9bfd474f4-tnt7f 0/1 Pending 0 3m24s <none> <none> <none>
aodh-notifier-9bfd474f4-wbt72 1/1 Running 0 83m 172.16.1.176 controller-1 <none>
barbican-api-54f7966ccf-m7zb8 1/1 Unknown 0 20h 172.16.0.49 controller-0 <none>
barbican-api-54f7966ccf-wmvdt 0/1 Pending 0 3m24s <none> <none> <none>
barbican-api-54f7966ccf-xmfh9 1/1 Running 0 83m 172.16.1.189 controller-1 <none>
ceilometer-central-84f98b7d77-l4nxl 0/1 Pending 0 3m24s <none> <none> <none>
ceilometer-central-84f98b7d77-qp24p 1/1 Running 0 83m 172.16.1.168 controller-1 <none>
ceilometer-central-84f98b7d77-vppxv 1/1 Unknown 0 20h 172.16.0.54 controller-0 <none>
ceilometer-compute-2g8tj 1/1 Running 0 7d20h 192.168.204.72 compute-0 <none>
ceilometer-compute-df27r 1/1 Running 0 7d20h 192.168.204.39 compute-1 <none>
ceilometer-notification-c4dc7d9c7-hpp6v 1/1 Unknown 0 8h 172.16.0.74 controller-0 <none>
ceilometer-notification-c4dc7d9c7-rdpv4 1/1 Running 0 3m24s 172.16.1.244 controller-1 <none>
ceph-pools-audit-1554393900-2hcl2 0/1 Completed 0 11m 172.16.1.236 controller-1 <none>
ceph-pools-audit-1554394200-tp7pg 0/1 Completed 0 6m21s 172.16.1.239 controller-1 <none>
ceph-pools-audit-1554394500-qqxh6 0/1 Completed 0 87s 172.16.1.251 controller-1 <none>
cinder-api-55948fcbf-4xb82 0/1 Pending 0 3m24s <none> <none> <none>
cinder-api-55948fcbf-88kk7 1/1 Unknown 0 20h 172.16.0.50 controller-0 <none>
cinder-api-55948fcbf-k46g2 1/1 Running 0 83m 172.16.1.195 controller-1 <none>
cinder-backup-788646f4b6-5cfz2 0/1 Pending 0 3m24s <none> <none> <none>
cinder-backup-788646f4b6-5wgf9 1/1 Unknown 0 20h 172.16.0.62 controller-0 <none>
cinder-backup-788646f4b6-h2hst 1/1 Running 0 83m 172.16.1.197 controller-1 <none>
cinder-scheduler-77f7c847fc-kzksb 1/1 Running 0 83m 172.16.1.196 controller-1 <none>
cinder-scheduler-77f7c847fc-tcj5s 1/1 Unknown 0 20h 172.16.0.60 controller-0 <none>
cinder-scheduler-77f7c847fc-xc8kv 0/1 Pending 0 3m24s <none> <none> <none>
cinder-volume-5874867c6d-8qmd4 1/1 Unknown 0 20h 172.16.0.58 controller-0 <none>
cinder-volume-5874867c6d-h7vgh 0/1 Pending 0 3m24s <none> <none> <none>
cinder-volume-5874867c6d-qrt2h 1/1 Running 0 83m 172.16.1.190 controller-1 <none>
cinder-volume-usage-audit-1554393600-f48lk 0/1 Completed 0 16m 172.16.1.234 controller-1 <none>
cinder-volume-usage-audit-1554393900-lcrzl 0/1 Completed 0 11m 172.16.1.237 controller-1 <none>
cinder-volume-usage-audit-1554394200-c2pwk 0/1 Completed 0 6m21s 172.16.1.240 controller-1 <none>
cinder-volume-usage-audit-1554394500-865ms 0/1 Init:0/1 0 87s 172.16.1.252 controller-1 <none>
glance-api-5476f44c75-hw9ff 1/1 Running 0 83m 172.16.1.191 controller-1 <none>
glance-api-5476f44c75-m42qc 0/1 Pending 0 3m24s <none> <none> <none>
glance-api-5476f44c75-vsbkb 1/1 Unknown 0 20h 172.16.0.52 controller-0 <none>
gnocchi-api-6bcc9d9975-dgp57 1/1 Running 0 83m 172.16.1.172 controller-1 <none>
gnocchi-api-6bcc9d9975-jfrfv 0/1 Pending 0 3m23s <none> <none> <none>
gnocchi-api-6bcc9d9975-kmchm 1/1 Unknown 0 20h 172.16.0.53 controller-0 <none>
gnocchi-metricd-hxd5b 1/1 NodeLost 7 7d20h 172.16.0.30 controller-0 <none>
gnocchi-metricd-t5fcc 1/1 Running 0 74m 172.16.1.170 controller-1 <none>
heat-api-cdd6fb9f4-hl6jv 0/1 Pending 0 3m23s <none> <none> <none>
heat-api-cdd6fb9f4-s4zhn 1/1 Unknown 0 20h 172.16.0.46 controller-0 <none>
heat-api-cdd6fb9f4-xqgj7 1/1 Running 0 83m 172.16.1.178 controller-1 <none>
heat-cfn-6fddc4f786-2vbwm 1/1 Running 0 83m 172.16.1.184 controller-1 <none>
heat-cfn-6fddc4f786-6vbn6 1/1 Unknown 0 20h 172.16.0.47 controller-0 <none>
heat-cfn-6fddc4f786-vx96p 0/1 Pending 0 3m23s <none> <none> <none>
heat-engine-546cf796d7-6zr4d 1/1 Running 0 83m 172.16.1.173 controller-1 <none>
heat-engine-546cf796d7-7prqj 0/1 Pending 0 3m23s <none> <none> <none>
heat-engine-546cf796d7-ld7kq 1/1 Unknown 0 20h 172.16.0.33 controller-0 <none>
heat-engine-cleaner-1554393600-27z2l 0/1 Completed 0 16m 172.16.1.235 controller-1 <none>
heat-engine-cleaner-1554393900-szh24 0/1 Completed 0 11m 172.16.1.238 controller-1 <none>
heat-engine-cleaner-1554394200-qsjnd 0/1 Completed 0 6m21s 172.16.1.241 controller-1 <none>
heat-engine-cleaner-1554394500-dbzz2 0/1 Init:0/1 0 87s 172.16.1.253 controller-1 <none>
horizon-6b786bc8df-9fkft 1/1 Running 0 74m 172.16.1.181 controller-1 <none>
ingress-56dd5dbcd4-rzrp2 0/1 Pending 0 3m23s <none> <none> <none>
ingress-56dd5dbcd4-v5h95 1/1 Unknown 0 20h 172.16.0.55 controller-0 <none>
ingress-56dd5dbcd4-xsztr 1/1 Running 0 83m 172.16.1.180 controller-1 <none>
ingress-error-pages-6f58cd4f8f-chqtc 1/1 Running 0 83m 172.16.1.179 controller-1 <none>
ingress-error-pages-6f58cd4f8f-rvm5p 1/1 Unknown 0 20h 172.16.0.59 controller-0 <none>
ingress-error-pages-6f58cd4f8f-rxmtr 0/1 Pending 0 3m22s <none> <none> <none>
keystone-api-56b76565d8-4w6n2 0/1 Pending 0 3m22s <none> <none> <none>
keystone-api-56b76565d8-65nns 1/1 Unknown 0 20h 172.16.0.43 controller-0 <none>
keystone-api-56b76565d8-mm2qb 0/1 Running 0 83m 172.16.1.193 controller-1 <none>
libvirt-qjdpp 1/1 Running 0 7d20h 192.168.204.72 compute-0 <none>
libvirt-t7sh8 1/1 Running 0 7d20h 192.168.204.39 compute-1 <none>
mariadb-ingress-5b8584d855-q2t47 0/1 Pending 0 3m22s <none> <none> <none>
mariadb-ingress-5b8584d855-wbhd4 1/1 Running 0 83m 172.16.1.175 controller-1 <none>
mariadb-ingress-5b8584d855-x5shd 1/1 Unknown 0 20h 172.16.0.51 controller-0 <none>
mariadb-ingress-error-pages-66f4f495b9-bvffq 1/1 Running 0 3m22s 172.16.1.245 controller-1 <none>
mariadb-ingress-error-pages-66f4f495b9-l2r9n 1/1 Unknown 0 83m 172.16.0.9 controller-0 <none>
mariadb-server-0 1/1 Unknown 3 20h 172.16.0.63 controller-0 <none>
mariadb-server-1 1/1 Running 3 83m 172.16.1.199 controller-1 <none>
neutron-dhcp-agent-compute-0-75ea0372-xsml8 1/1 Running 0 7d20h 192.168.204.72 compute-0 <none>
neutron-dhcp-agent-compute-1-eae26dba-dhcwg 1/1 Running 0 7d20h 192.168.204.39 compute-1 <none>
neutron-l3-agent-compute-0-31b0f4b3-rqn6b 1/1 Running 0 7d20h 192.168.204.72 compute-0 <none>
neutron-l3-agent-compute-1-eae26dba-rszqz 1/1 Running 0 7d20h 192.168.204.39 compute-1 <none>
neutron-metadata-agent-compute-0-75ea0372-r4vsm 1/1 Running 0 7d20h 192.168.204.72 compute-0 <none>
neutron-metadata-agent-compute-1-3dfb81d6-dksr7 1/1 Running 0 7d20h 192.168.204.39 compute-1 <none>
neutron-ovs-agent-compute-0-75ea0372-8mw9x 1/1 Running 0 7d20h 192.168.204.72 compute-0 <none>
neutron-ovs-agent-compute-1-eae26dba-mqv5d 1/1 Running 0 7d20h 192.168.204.39 compute-1 <none>
neutron-server-85dd569cbf-4s8vh 1/1 Running 0 83m 172.16.1.183 controller-1 <none>
neutron-server-85dd569cbf-7hsp7 0/1 Pending 0 3m22s <none> <none> <none>
neutron-server-85dd569cbf-fb8zx 1/1 Unknown 0 20h 172.16.0.35 controller-0 <none>
neutron-sriov-agent-compute-0-75ea0372-8rw6t 1/1 Running 0 7d20h 192.168.204.72 compute-0 <none>
neutron-sriov-agent-compute-1-eae26dba-fmmz7 1/1 Running 0 7d20h 192.168.204.39 compute-1 <none>
nova-api-metadata-8579687b7b-lnpg2 1/1 Unknown 0 20h 172.16.0.44 controller-0 <none>
nova-api-metadata-8579687b7b-v7rcx 0/1 Pending 0 3m22s <none> <none> <none>
nova-api-metadata-8579687b7b-v87nh 1/1 Running 0 83m 172.16.1.169 controller-1 <none>
nova-api-osapi-bf7cc6977-mjxgz 0/1 Pending 0 3m22s <none> <none> <none>
nova-api-osapi-bf7cc6977-nh982 1/1 Running 0 83m 172.16.1.192 controller-1 <none>
nova-api-osapi-bf7cc6977-zqpgw 1/1 Unknown 0 20h 172.16.0.36 controller-0 <none>
nova-api-proxy-5bfcc78cdd-7ngz5 1/1 Unknown 0 8h 172.16.0.78 controller-0 <none>
nova-api-proxy-5bfcc78cdd-v7vg8 1/1 Running 0 3m21s 172.16.1.246 controller-1 <none>
nova-compute-compute-0-75ea0372-7mj62 2/2 Running 0 7d20h 192.168.204.72 compute-0 <none>
nova-compute-compute-1-3dfb81d6-2lwwn 2/2 Running 0 7d20h 192.168.204.39 compute-1 <none>
nova-conductor-5675c54cfd-5dzfj 1/1 Running 0 83m 172.16.1.177 controller-1 <none>
nova-conductor-5675c54cfd-8cngs 1/1 Unknown 0 20h 172.16.0.48 controller-0 <none>
nova-conductor-5675c54cfd-ngh7h 0/1 Pending 0 3m21s <none> <none> <none>
nova-consoleauth-b5fb8679d-nh82k 1/1 Unknown 0 20h 172.16.0.56 controller-0 <none>
nova-consoleauth-b5fb8679d-t6dhk 1/1 Running 0 83m 172.16.1.188 controller-1 <none>
nova-consoleauth-b5fb8679d-zhwhv 0/1 Pending 0 3m21s <none> <none> <none>
nova-novncproxy-75444ff54b-6hfch 1/1 Unknown 0 20h 192.168.204.3 controller-0 <none>
nova-novncproxy-75444ff54b-q5l86 1/1 Running 0 83m 192.168.204.4 controller-1 <none>
nova-novncproxy-75444ff54b-zbnvs 0/1 Pending 0 3m21s <none> <none> <none>
nova-placement-api-5d8c995b48-wb9k8 1/1 Running 0 83m 172.16.1.167 controller-1 <none>
nova-placement-api-5d8c995b48-wt8lt 1/1 Unknown 0 20h 172.16.0.34 controller-0 <none>
nova-placement-api-5d8c995b48-xjjrh 0/1 Pending 0 3m21s <none> <none> <none>
nova-scheduler-64f854d9f8-4glrn 1/1 Running 0 83m 172.16.1.186 controller-1 <none>
nova-scheduler-64f854d9f8-9f5wf 1/1 Unknown 0 20h 172.16.0.31 controller-0 <none>
nova-scheduler-64f854d9f8-j8nrk 0/1 Pending 0 3m21s <none> <none> <none>
osh-openstack-garbd-garbd-85677d9d98-42fjk 1/1 Running 0 7d20h 172.16.2.2 compute-1 <none>
osh-openstack-memcached-memcached-d8866584f-995kt 1/1 Unknown 0 83m 172.16.0.10 controller-0 <none>
osh-openstack-memcached-memcached-d8866584f-lw59f 1/1 Running 0 3m20s 172.16.1.247 controller-1 <none>
osh-openstack-memcached-memcached-exporter-7c86db4f64-bjkhm 1/1 Unknown 0 83m 172.16.0.11 controller-0 <none>
osh-openstack-memcached-memcached-exporter-7c86db4f64-bpwzm 1/1 Running 0 3m20s 172.16.1.248 controller-1 <none>
osh-openstack-rabbitmq-rabbitmq-0 1/1 Running 0 83m 172.16.1.198 controller-1 <none>
osh-openstack-rabbitmq-rabbitmq-1 1/1 Unknown 0 20h 172.16.0.45 controller-0 <none>
osh-openstack-rabbitmq-rabbitmq-exporter-7748d8c78b-84dfl 1/1 Unknown 0 83m 172.16.0.12 controller-0 <none>
osh-openstack-rabbitmq-rabbitmq-exporter-7748d8c78b-h2rqq 1/1 Running 0 3m20s 172.16.1.249 controller-1 <none>
panko-api-75bf686957-bwb5x 0/1 Pending 0 3m20s <none> <none> <none>
panko-api-75bf686957-jww2q 1/1 Unknown 0 20h 172.16.0.37 controller-0 <none>
panko-api-75bf686957-v9w4h 1/1 Running 0 83m 172.16.1.174 controller-1 <none>
panko-events-cleaner-1554390600-d7fnr 0/1 Completed 0 66m 172.16.1.205 controller-1 <none>
panko-events-cleaner-1554394200-r6hn7 0/1 Completed 0 6m21s 172.16.1.242 controller-1 <none>
prometheus-mysql-exporter-5fb97d88c6-2jj6t 1/1 Unknown 0 83m 172.16.0.13 controller-0 <none>
prometheus-mysql-exporter-5fb97d88c6-l9d9n 1/1 Running 0 3m20s 172.16.1.250 controller-1 <none>
rbd-provisioner-64f89b8c69-5ncgq 1/1 Running 0 83m 172.16.1.171 controller-1 <none>
rbd-provisioner-64f89b8c69-cncwb 1/1 Unknown 0 20h 172.16.0.64 controller-0 <none>
rbd-provisioner-64f89b8c69-rbvf8 0/1 Pending 0 3m19s <none> <none> <none>

Here is the pod information showing the readiness probe has failed:

[root@controller-1 ~(keystone_admin)]# kubectl -n openstack describe pod keystone-api-56b76565d8-mm2qb
Name: keystone-api-56b76565d8-mm2qb
Namespace: openstack
Priority: 0
PriorityClassName: <none>
Node: controller-1/192.168.204.4
Start Time: Thu, 04 Apr 2019 15:01:36 +0000
Labels: application=keystone
                    component=api
                    pod-template-hash=56b76565d8
                    release_group=osh-openstack-keystone
Annotations: cni.projectcalico.org/podIP: 172.16.1.193/32
                    configmap-bin-hash: 3680c591230b30a54bc86b1a039f3405131d642f3523aca50eceb78733481280
                    configmap-etc-hash: bdac7cd8a260f3f95025abc4f68496e11f5c4580e4992f0062787ed0464cc807
Status: Running
IP: 172.16.1.193
Controlled By: ReplicaSet/keystone-api-56b76565d8
Init Containers:
  init:
    Container ID: docker://44e64a3f55bc7fd6136c8fd314f917f6a8cceb0ab73cd1edf674a03b10f88891
    Image: 192.168.204.2:9001/quay.io/stackanetes/kubernetes-entrypoint:v0.3.1
    Image ID: docker-pullable://192.168.204.2:9001/quay.io/stackanetes/kubernetes-entrypoint@sha256:2cad635d9079ee130f7f288b2ffb9c65c90fc7711364dcf438973ddf579b7fed
    Port: <none>
    Host Port: <none>
    Command:
      kubernetes-entrypoint
    State: Terminated
      Reason: Completed
      Exit Code: 0
      Started: Thu, 04 Apr 2019 15:01:55 +0000
      Finished: Thu, 04 Apr 2019 15:01:57 +0000
    Ready: True
    Restart Count: 0
    Environment:
      POD_NAME: keystone-api-56b76565d8-mm2qb (v1:metadata.name)
      NAMESPACE: openstack (v1:metadata.namespace)
      INTERFACE_NAME: eth0
      PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/
      DEPENDENCY_SERVICE: openstack:memcached,openstack:mariadb
      DEPENDENCY_JOBS: keystone-db-sync,keystone-credential-setup,keystone-fernet-setup,keystone-rabbit-init
      DEPENDENCY_DAEMONSET:
      DEPENDENCY_CONTAINER:
      DEPENDENCY_POD_JSON:
      COMMAND: echo done
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from keystone-api-token-5zkkq (ro)
Containers:
  keystone-api:
    Container ID: docker://f46b82b140c05bac21e5df5e26737de162833bdcae91e617914867030aceb9f4
    Image: 192.168.204.2:9001/starlingx/stx-keystone:dev-centos-master-latest
    Image ID: docker-pullable://192.168.204.2:9001/starlingx/stx-keystone@sha256:f465fa9d7f0947d2a6634b7706a2b19eb3fc3290faa0f5f037f97fe378c1c08c
    Port: 5000/TCP
    Host Port: 0/TCP
    Command:
      /tmp/keystone-api.sh
      start
    State: Running
      Started: Thu, 04 Apr 2019 15:02:11 +0000
    Ready: False
    Restart Count: 0
    Readiness: http-get http://:5000/ delay=15s timeout=1s period=10s #success=1 #failure=3
    Environment: <none>
    Mounts:
      /etc/apache2/conf-enabled/wsgi-keystone.conf from keystone-etc (ro)
      /etc/apache2/mods-available/mpm_event.conf from keystone-etc (ro)
      /etc/apache2/ports.conf from keystone-etc (ro)
      /etc/keystone from etckeystone (rw)
      /etc/keystone/credential-keys/ from keystone-credential-keys (rw)
      /etc/keystone/fernet-keys/ from keystone-fernet-keys (rw)
      /etc/keystone/keystone-paste.ini from keystone-etc (ro)
      /etc/keystone/keystone.conf from keystone-etc (ro)
      /etc/keystone/logging.conf from keystone-etc (ro)
      /etc/keystone/policy.json from keystone-etc (ro)
      /etc/keystone/sso_callback_template.html from keystone-etc (ro)
      /tmp/keystone-api.sh from keystone-bin (ro)
      /var/log/apache2 from logs-apache (rw)
      /var/run/apache2 from run-apache (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from keystone-api-token-5zkkq (ro)
      /var/www/cgi-bin/keystone from wsgi-keystone (rw)
Conditions:
  Type Status
  Initialized True
  Ready False
  ContainersReady False
  PodScheduled True
Volumes:
  etckeystone:
    Type: EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  wsgi-keystone:
    Type: EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  logs-apache:
    Type: EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  run-apache:
    Type: EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  keystone-etc:
    Type: Secret (a volume populated by a Secret)
    SecretName: keystone-etc
    Optional: false
  keystone-bin:
    Type: ConfigMap (a volume populated by a ConfigMap)
    Name: keystone-bin
    Optional: false
  keystone-fernet-keys:
    Type: Secret (a volume populated by a Secret)
    SecretName: keystone-fernet-keys
    Optional: false
  keystone-credential-keys:
    Type: Secret (a volume populated by a Secret)
    SecretName: keystone-credential-keys
    Optional: false
  keystone-api-token-5zkkq:
    Type: Secret (a volume populated by a Secret)
    SecretName: keystone-api-token-5zkkq
    Optional: false
QoS Class: BestEffort
Node-Selectors: openstack-control-plane=enabled
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Warning Unhealthy 3s (x21 over 72m) kubelet, controller-1 Readiness probe failed: Get http://172.16.1.193:5000/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
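
The probe in the describe output above is an HTTP GET on port 5000 with a 1-second timeout. As a sanity check (not part of the original report), the same request the kubelet makes can be reproduced by hand from the node hosting the pod, assuming curl is available; the pod IP and port are taken from the output above:

# Reproduce the kubelet readiness probe manually; -m 1 matches the probe's 1s timeout.
curl -m 1 -sS -o /dev/null -w '%{http_code}\n' http://172.16.1.193:5000/

# Recent probe failures also appear in the pod's events.
kubectl -n openstack describe pod keystone-api-56b76565d8-mm2qb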

The following logs are repeating forever in the keystone-api pod while the problem is occurring:

2019-04-04 16:13:20.903261 Script timed out before returning headers: keystone-wsgi-public
128.224.151.193 - - [04/Apr/2019:16:12:20 +0000] "GET / HTTP/1.1" 500 527 "-" "kube-probe/1.12"
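
For reference, these lines can be followed live with a generic kubectl invocation (not from the original report), assuming the Apache/WSGI output is emitted to the container's stdout; otherwise the files under the pod's /var/log/apache2 mount can be inspected directly:

# Follow the keystone-api container log on the surviving controller.
kubectl -n openstack logs -f keystone-api-56b76565d8-mm2qb

# Or list the Apache log files inside the pod and tail the relevant one.
kubectl -n openstack exec keystone-api-56b76565d8-mm2qb -- ls /var/log/apache2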

When the keystone-api pod eventually recovered (on its own), I saw the following logs:

128.224.151.193 - - [04/Apr/2019:16:20:20 +0000] "GET / HTTP/1.1" 500 527 "-" "kube-probe/1.12"
2019-04-04 16:21:25.474504 2019-04-04 16:21:25,474.474 10 WARNING keystone.common.fernet_utils [req-200f6682-2c23-43b2-850d-3f21aadddaf2 - - - - -] key_repository is world readable: /etc/keystone/fernet-keys/
2019-04-04 16:21:30.908476 Script timed out before returning headers: keystone-wsgi-public
128.224.151.193 - - [04/Apr/2019:16:20:30 +0000] "GET / HTTP/1.1" 500 527 "-" "kube-probe/1.12"
128.224.151.193 - - [04/Apr/2019:16:11:51 +0000] "POST /v3/auth/tokens HTTP/1.1" 201 10176 "-" "openstacksdk/0.25.0 keystoneauth1/0.0.0 python-requests/2.14.2 CPython/2.7.5"
2019-04-04 16:21:37.606921 2019-04-04 16:21:37,606.606 10 WARNING keystone.common.fernet_utils [req-16ae6f28-0cc4-4ac7-8b86-85611d5ac019 - - - - -] key_repository is world readable: /etc/keystone/fernet-keys/: NeedRegenerationException
2019-04-04 16:21:37.824104 2019-04-04 16:21:37,823.823 10 WARNING keystone.common.fernet_utils [req-5c4e7de0-1379-48cb-9b98-9f696ea42662 46f26a09dbaa48e4a89e3be5e5e95019 788a279244f44f39bdff6cb636798ea7 - 82e9626778234640b09d12423097edb9 82e9626778234640b09d12423097edb9] key_repository is world readable: /etc/keystone/fernet-keys/: NeedRegenerationException
128.224.151.193 - - [04/Apr/2019:16:20:40 +0000] "GET / HTTP/1.1" 300 267 "-" "kube-probe/1.12"
128.224.151.193 - - [04/Apr/2019:16:20:50 +0000] "GET / HTTP/1.1" 300 267 "-" "kube-probe/1.12"

It is very suspicious that one of the logs (the POST) has a timestamp almost 9 minutes earlier than the surrounding logs. I wonder if this POST request had hung and was somehow blocking the readiness probes.

Test Activity
-------------
Developer Testing

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; serious issue, but not seen frequently. Needs further investigation as it's a robustness issue.

tags: added: stx.containers
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.2.0
Changed in starlingx:
assignee: nobody → Tao Liu (tliu88)
Revision history for this message
Bill Zvonar (billzvonar) wrote :

Assigned to Bruce for re-assignment.

Changed in starlingx:
assignee: Tao Liu (tliu88) → Bruce Jones (brucej)
Bruce Jones (brucej)
Changed in starlingx:
assignee: Bruce Jones (brucej) → Cindy Xie (xxie1)
Revision history for this message
yong hu (yhu6) wrote :

The issue was observed from an old dev-build:
 - BUILD_HOST="yow-bwensley-lx-vm2"
 - BUILD_DATE="2019-03-20 08:05:38 -0500"

@Bart, please update whether or not we can still see this kind of issue.

tags: added: stx.retestneeded
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Bart Wensley (bartwensley)
Revision history for this message
Bart Wensley (bartwensley) wrote :

Yong - as I said in the description, the problem only happened in about 10% of active controller reboots. I have not been testing this recently, so I don't know if the problem still occurs. I suggest that you configure a 2+2 system and then attempt to reproduce this yourself.

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: Bart Wensley (bartwensley) → yong hu (yhu6)
Revision history for this message
yong hu (yhu6) wrote :

With the 0718 CENGN build, I tried the test (brutally restarting the active controller) over 15 times on a multi-node (2+2) system, and this issue was NOT seen once. The restart process for the pods also makes sense (the transitions are captured in the listings below, and a watch-command sketch follows them):
step #1: terminate the pod that was running on the rebooted active controller. In the test, it was pod keystone-api-5db97d877d-9zkbk.
step #2: generate another pod and leave it pending; here it was pod keystone-api-5db97d877d-7v6nd.
step #3: bring up the pod "keystone-api-5db97d877d-7v6nd" once the rebooted controller (which became the standby) was ready.

keystone-api-5db97d877d-5pwrx 1/1 Running 0 20m
keystone-api-5db97d877d-7v6nd 0/1 Pending 0 4m55s
keystone-api-5db97d877d-9zkbk 0/1 Terminating 0 87m

keystone-api-5db97d877d-5pwrx 1/1 Running 0 20m
keystone-api-5db97d877d-7v6nd 0/1 Pending 0 5m5s
keystone-api-5db97d877d-9zkbk 0/1 Terminating 0 87m

keystone-api-5db97d877d-5pwrx 1/1 Running 0 20m
keystone-api-5db97d877d-7v6nd 0/1 Pending 0 5m16s
keystone-api-5db97d877d-9zkbk 0/1 Terminating 0 88m

keystone-api-5db97d877d-5pwrx 1/1 Running 0 20m
keystone-api-5db97d877d-7v6nd 0/1 Pending 0 5m26s
keystone-api-5db97d877d-9zkbk 0/1 Terminating 0 88m

keystone-api-5db97d877d-5pwrx 1/1 Running 0 21m
keystone-api-5db97d877d-7v6nd 0/1 Pending 0 5m36s

keystone-api-5db97d877d-5pwrx 1/1 Running 0 21m
keystone-api-5db97d877d-7v6nd 0/1 Pending 0 5m46s

keystone-api-5db97d877d-5pwrx 1/1 Running 0 22m
keystone-api-5db97d877d-7v6nd 1/1 Running 0 6m59s
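
The Terminating → Pending → Running transitions above can also be observed live with a label-selector watch; the application=keystone label is taken from the pod labels in the describe output earlier, the rest is a generic sketch:

# Watch only the keystone pods while the controller reboots.
kubectl -n openstack get pods -l application=keystone -o wide -w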

[sysadmin@controller-1 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | compute-0 | worker | unlocked | enabled | available |
| 4 | compute-1 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

[sysadmin@controller-1 ~(keystone_admin)]$ date
Mon Jul 22 15:38:23 UTC 2019

[sysadmin@controller-1 ~(keystone_admin)]$ cat /etc/build.info | grep BUILD_ID
BUILD_ID="20190718T013000Z"

Additionally, in my test setup with 30+GB for controllers, it norm...


Revision history for this message
yong hu (yhu6) wrote :

Kept running the test over 30 times, and the 2 controllers ran steadily with the expected correct behavior, on BUILD_ID="20190718T013000Z".

Revision history for this message
yong hu (yhu6) wrote :

With the recent 0802 CENGN build, the same test was performed with 27+ reboots across both controllers and the system worked well. See the attachments.

In addition, after each reboot it did take time (about 8~10 minutes) for the OpenStack pods to recover.
The keystone-api pods were not the ones that took the longest; the mariadb-server pods took the longest to recover, and the nova pods also took longer than most.

So, if *every time* we wait long enough for all OpenStack services/pods to return to a normal running state, the system works normally and is ready for the next round of reboot testing without negative impacts.
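
A minimal sketch of such a wait loop (not part of the original test; it simply polls kubectl until every openstack pod is either Completed or fully Ready):

# Poll until every openstack pod is Completed or fully Ready (READY column x/x).
# Note: a pod can be Running but not Ready, as in this bug, so checking the
# READY column is stricter than checking STATUS alone.
while kubectl -n openstack get pods --no-headers \
      | awk '{ split($2, r, "/"); if ($3 != "Completed" && r[1] != r[2]) print }' \
      | grep -q . ; do
  echo "waiting for openstack pods to recover..."
  sleep 30
done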

Revision history for this message
yong hu (yhu6) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
Revision history for this message
yong hu (yhu6) wrote :

I will run another round of stress testing with the RC branch. If it still functions correctly, I will close this issue, on the rationale that the recent enhancement patches for "application-apply"/"-reapply" are taking good effect.

The original issue was not specific to the keystone-api pod; rather, other pods such as mariadb-server were failing to run.

Revision history for this message
yong hu (yhu6) wrote :

After running the reboot-active-controller test over 20 times, the K8S pods still kept running normally.
See the attachments (screenshots).

So I will close this LP, with the belief that the fixes for a few corner cases did bring improvements to StarlingX reliability.

Revision history for this message
yong hu (yhu6) wrote :
Revision history for this message
yong hu (yhu6) wrote :
Changed in starlingx:
status: Triaged → Fix Released
Revision history for this message
yong hu (yhu6) wrote :

One more thing: the testing was done on the 2.0 release candidate image.

Revision history for this message
Jose Perez Carranza (jgperezc) wrote :

Tested on a build with Train; issue NOT reproduced.

###
### StarlingX
### Built from master
###

OS="centos"
SW_VERSION="19.09"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20191122T023000Z"

JOB="STX_build_master_master"
BUILD_BY="<email address hidden>"
BUILD_NUMBER="327"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-11-22 02:30:00 +0000"

Revision history for this message
Ghada Khalil (gkhalil) wrote :

LP was verified a long time ago; removing the stx.retestneeded label

tags: removed: stx.retestneeded