Brief Description
-----------------
After a force reboot of the active controller, the keystone-api pod sometimes hangs.
Severity
--------
Minor - the problem has a major impact, but only happens rarely
Steps to Reproduce
------------------
1. Install stx-openstack application on a system with two controllers (I saw this on a 2+2 config).
2. Force reboot the active controller.
I saw the issue on 2 out of about 20 reboots of the active controller.
Expected Behavior
------------------
All of the stx-openstack application's pods should return to service in a reasonable amount of time.
Actual Behavior
----------------
Sometimes the readiness probe fails in the keystone-api pod on the non-rebooting controller and the pod stops serving requests. This means keystone is unavailable until the rebooting controller finishes its reboot.
Reproducibility
---------------
Intermittent (seen on 2 out of about 20 active controller reboots)
System Configuration
--------------------
Seen on a 2+2 system.
Lab was windriver ironpass-1-4
Branch/Pull Time/Commit
-----------------------
SW_VERSION="19.01"
BUILD_TARGET="Unknown"
BUILD_TYPE="Informal"
BUILD_ID="n/a"
JOB="n/a"
BUILD_BY="bwensley"
BUILD_NUMBER="n/a"
BUILD_HOST="yow-bwensley-lx-vm2"
BUILD_DATE="2019-03-20 08:05:38 -0500"
BUILD_DIR="/"
WRS_SRC_DIR="/localdisk/designer/bwensley/starlingx-0/cgcs-root"
WRS_GIT_BRANCH="HEAD"
CGCS_SRC_DIR="/localdisk/designer/bwensley/starlingx-0/cgcs-root/stx"
CGCS_GIT_BRANCH="HEAD"
Last Pass
---------
This is new testing on the containers deployment.
Timestamp/Logs
--------------
Here are the stx-openstack pods while controller-0 was rebooting, showing that the keystone-api pod on controller-1 is 0/1 (not ready):
[root@controller-1 ~(keystone_admin)]# kubectl -n openstack get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
aodh-api-56fc6554df-lcplh 1/1 Running 0 83m 172.16.1.194 controller-1 <none>
aodh-api-56fc6554df-mvnj6 1/1 Unknown 0 20h 172.16.0.65 controller-0 <none>
aodh-api-56fc6554df-rjlqc 0/1 Pending 0 3m24s <none> <none> <none>
aodh-evaluator-9cc59d7f6-5sr7k 1/1 Unknown 0 20h 172.16.0.41 controller-0 <none>
aodh-evaluator-9cc59d7f6-d7qtj 0/1 Pending 0 3m24s <none> <none> <none>
aodh-evaluator-9cc59d7f6-gj6tr 1/1 Running 0 83m 172.16.1.166 controller-1 <none>
aodh-listener-65df7d6bbf-6bfpk 0/1 Pending 0 3m24s <none> <none> <none>
aodh-listener-65df7d6bbf-8zbnf 1/1 Unknown 0 20h 172.16.0.66 controller-0 <none>
aodh-listener-65df7d6bbf-t6s8m 1/1 Running 0 83m 172.16.1.187 controller-1 <none>
aodh-notifier-9bfd474f4-cx7rd 1/1 Unknown 0 20h 172.16.0.57 controller-0 <none>
aodh-notifier-9bfd474f4-tnt7f 0/1 Pending 0 3m24s <none> <none> <none>
aodh-notifier-9bfd474f4-wbt72 1/1 Running 0 83m 172.16.1.176 controller-1 <none>
barbican-api-54f7966ccf-m7zb8 1/1 Unknown 0 20h 172.16.0.49 controller-0 <none>
barbican-api-54f7966ccf-wmvdt 0/1 Pending 0 3m24s <none> <none> <none>
barbican-api-54f7966ccf-xmfh9 1/1 Running 0 83m 172.16.1.189 controller-1 <none>
ceilometer-central-84f98b7d77-l4nxl 0/1 Pending 0 3m24s <none> <none> <none>
ceilometer-central-84f98b7d77-qp24p 1/1 Running 0 83m 172.16.1.168 controller-1 <none>
ceilometer-central-84f98b7d77-vppxv 1/1 Unknown 0 20h 172.16.0.54 controller-0 <none>
ceilometer-compute-2g8tj 1/1 Running 0 7d20h 192.168.204.72 compute-0 <none>
ceilometer-compute-df27r 1/1 Running 0 7d20h 192.168.204.39 compute-1 <none>
ceilometer-notification-c4dc7d9c7-hpp6v 1/1 Unknown 0 8h 172.16.0.74 controller-0 <none>
ceilometer-notification-c4dc7d9c7-rdpv4 1/1 Running 0 3m24s 172.16.1.244 controller-1 <none>
ceph-pools-audit-1554393900-2hcl2 0/1 Completed 0 11m 172.16.1.236 controller-1 <none>
ceph-pools-audit-1554394200-tp7pg 0/1 Completed 0 6m21s 172.16.1.239 controller-1 <none>
ceph-pools-audit-1554394500-qqxh6 0/1 Completed 0 87s 172.16.1.251 controller-1 <none>
cinder-api-55948fcbf-4xb82 0/1 Pending 0 3m24s <none> <none> <none>
cinder-api-55948fcbf-88kk7 1/1 Unknown 0 20h 172.16.0.50 controller-0 <none>
cinder-api-55948fcbf-k46g2 1/1 Running 0 83m 172.16.1.195 controller-1 <none>
cinder-backup-788646f4b6-5cfz2 0/1 Pending 0 3m24s <none> <none> <none>
cinder-backup-788646f4b6-5wgf9 1/1 Unknown 0 20h 172.16.0.62 controller-0 <none>
cinder-backup-788646f4b6-h2hst 1/1 Running 0 83m 172.16.1.197 controller-1 <none>
cinder-scheduler-77f7c847fc-kzksb 1/1 Running 0 83m 172.16.1.196 controller-1 <none>
cinder-scheduler-77f7c847fc-tcj5s 1/1 Unknown 0 20h 172.16.0.60 controller-0 <none>
cinder-scheduler-77f7c847fc-xc8kv 0/1 Pending 0 3m24s <none> <none> <none>
cinder-volume-5874867c6d-8qmd4 1/1 Unknown 0 20h 172.16.0.58 controller-0 <none>
cinder-volume-5874867c6d-h7vgh 0/1 Pending 0 3m24s <none> <none> <none>
cinder-volume-5874867c6d-qrt2h 1/1 Running 0 83m 172.16.1.190 controller-1 <none>
cinder-volume-usage-audit-1554393600-f48lk 0/1 Completed 0 16m 172.16.1.234 controller-1 <none>
cinder-volume-usage-audit-1554393900-lcrzl 0/1 Completed 0 11m 172.16.1.237 controller-1 <none>
cinder-volume-usage-audit-1554394200-c2pwk 0/1 Completed 0 6m21s 172.16.1.240 controller-1 <none>
cinder-volume-usage-audit-1554394500-865ms 0/1 Init:0/1 0 87s 172.16.1.252 controller-1 <none>
glance-api-5476f44c75-hw9ff 1/1 Running 0 83m 172.16.1.191 controller-1 <none>
glance-api-5476f44c75-m42qc 0/1 Pending 0 3m24s <none> <none> <none>
glance-api-5476f44c75-vsbkb 1/1 Unknown 0 20h 172.16.0.52 controller-0 <none>
gnocchi-api-6bcc9d9975-dgp57 1/1 Running 0 83m 172.16.1.172 controller-1 <none>
gnocchi-api-6bcc9d9975-jfrfv 0/1 Pending 0 3m23s <none> <none> <none>
gnocchi-api-6bcc9d9975-kmchm 1/1 Unknown 0 20h 172.16.0.53 controller-0 <none>
gnocchi-metricd-hxd5b 1/1 NodeLost 7 7d20h 172.16.0.30 controller-0 <none>
gnocchi-metricd-t5fcc 1/1 Running 0 74m 172.16.1.170 controller-1 <none>
heat-api-cdd6fb9f4-hl6jv 0/1 Pending 0 3m23s <none> <none> <none>
heat-api-cdd6fb9f4-s4zhn 1/1 Unknown 0 20h 172.16.0.46 controller-0 <none>
heat-api-cdd6fb9f4-xqgj7 1/1 Running 0 83m 172.16.1.178 controller-1 <none>
heat-cfn-6fddc4f786-2vbwm 1/1 Running 0 83m 172.16.1.184 controller-1 <none>
heat-cfn-6fddc4f786-6vbn6 1/1 Unknown 0 20h 172.16.0.47 controller-0 <none>
heat-cfn-6fddc4f786-vx96p 0/1 Pending 0 3m23s <none> <none> <none>
heat-engine-546cf796d7-6zr4d 1/1 Running 0 83m 172.16.1.173 controller-1 <none>
heat-engine-546cf796d7-7prqj 0/1 Pending 0 3m23s <none> <none> <none>
heat-engine-546cf796d7-ld7kq 1/1 Unknown 0 20h 172.16.0.33 controller-0 <none>
heat-engine-cleaner-1554393600-27z2l 0/1 Completed 0 16m 172.16.1.235 controller-1 <none>
heat-engine-cleaner-1554393900-szh24 0/1 Completed 0 11m 172.16.1.238 controller-1 <none>
heat-engine-cleaner-1554394200-qsjnd 0/1 Completed 0 6m21s 172.16.1.241 controller-1 <none>
heat-engine-cleaner-1554394500-dbzz2 0/1 Init:0/1 0 87s 172.16.1.253 controller-1 <none>
horizon-6b786bc8df-9fkft 1/1 Running 0 74m 172.16.1.181 controller-1 <none>
ingress-56dd5dbcd4-rzrp2 0/1 Pending 0 3m23s <none> <none> <none>
ingress-56dd5dbcd4-v5h95 1/1 Unknown 0 20h 172.16.0.55 controller-0 <none>
ingress-56dd5dbcd4-xsztr 1/1 Running 0 83m 172.16.1.180 controller-1 <none>
ingress-error-pages-6f58cd4f8f-chqtc 1/1 Running 0 83m 172.16.1.179 controller-1 <none>
ingress-error-pages-6f58cd4f8f-rvm5p 1/1 Unknown 0 20h 172.16.0.59 controller-0 <none>
ingress-error-pages-6f58cd4f8f-rxmtr 0/1 Pending 0 3m22s <none> <none> <none>
keystone-api-56b76565d8-4w6n2 0/1 Pending 0 3m22s <none> <none> <none>
keystone-api-56b76565d8-65nns 1/1 Unknown 0 20h 172.16.0.43 controller-0 <none>
keystone-api-56b76565d8-mm2qb 0/1 Running 0 83m 172.16.1.193 controller-1 <none>
libvirt-qjdpp 1/1 Running 0 7d20h 192.168.204.72 compute-0 <none>
libvirt-t7sh8 1/1 Running 0 7d20h 192.168.204.39 compute-1 <none>
mariadb-ingress-5b8584d855-q2t47 0/1 Pending 0 3m22s <none> <none> <none>
mariadb-ingress-5b8584d855-wbhd4 1/1 Running 0 83m 172.16.1.175 controller-1 <none>
mariadb-ingress-5b8584d855-x5shd 1/1 Unknown 0 20h 172.16.0.51 controller-0 <none>
mariadb-ingress-error-pages-66f4f495b9-bvffq 1/1 Running 0 3m22s 172.16.1.245 controller-1 <none>
mariadb-ingress-error-pages-66f4f495b9-l2r9n 1/1 Unknown 0 83m 172.16.0.9 controller-0 <none>
mariadb-server-0 1/1 Unknown 3 20h 172.16.0.63 controller-0 <none>
mariadb-server-1 1/1 Running 3 83m 172.16.1.199 controller-1 <none>
neutron-dhcp-agent-compute-0-75ea0372-xsml8 1/1 Running 0 7d20h 192.168.204.72 compute-0 <none>
neutron-dhcp-agent-compute-1-eae26dba-dhcwg 1/1 Running 0 7d20h 192.168.204.39 compute-1 <none>
neutron-l3-agent-compute-0-31b0f4b3-rqn6b 1/1 Running 0 7d20h 192.168.204.72 compute-0 <none>
neutron-l3-agent-compute-1-eae26dba-rszqz 1/1 Running 0 7d20h 192.168.204.39 compute-1 <none>
neutron-metadata-agent-compute-0-75ea0372-r4vsm 1/1 Running 0 7d20h 192.168.204.72 compute-0 <none>
neutron-metadata-agent-compute-1-3dfb81d6-dksr7 1/1 Running 0 7d20h 192.168.204.39 compute-1 <none>
neutron-ovs-agent-compute-0-75ea0372-8mw9x 1/1 Running 0 7d20h 192.168.204.72 compute-0 <none>
neutron-ovs-agent-compute-1-eae26dba-mqv5d 1/1 Running 0 7d20h 192.168.204.39 compute-1 <none>
neutron-server-85dd569cbf-4s8vh 1/1 Running 0 83m 172.16.1.183 controller-1 <none>
neutron-server-85dd569cbf-7hsp7 0/1 Pending 0 3m22s <none> <none> <none>
neutron-server-85dd569cbf-fb8zx 1/1 Unknown 0 20h 172.16.0.35 controller-0 <none>
neutron-sriov-agent-compute-0-75ea0372-8rw6t 1/1 Running 0 7d20h 192.168.204.72 compute-0 <none>
neutron-sriov-agent-compute-1-eae26dba-fmmz7 1/1 Running 0 7d20h 192.168.204.39 compute-1 <none>
nova-api-metadata-8579687b7b-lnpg2 1/1 Unknown 0 20h 172.16.0.44 controller-0 <none>
nova-api-metadata-8579687b7b-v7rcx 0/1 Pending 0 3m22s <none> <none> <none>
nova-api-metadata-8579687b7b-v87nh 1/1 Running 0 83m 172.16.1.169 controller-1 <none>
nova-api-osapi-bf7cc6977-mjxgz 0/1 Pending 0 3m22s <none> <none> <none>
nova-api-osapi-bf7cc6977-nh982 1/1 Running 0 83m 172.16.1.192 controller-1 <none>
nova-api-osapi-bf7cc6977-zqpgw 1/1 Unknown 0 20h 172.16.0.36 controller-0 <none>
nova-api-proxy-5bfcc78cdd-7ngz5 1/1 Unknown 0 8h 172.16.0.78 controller-0 <none>
nova-api-proxy-5bfcc78cdd-v7vg8 1/1 Running 0 3m21s 172.16.1.246 controller-1 <none>
nova-compute-compute-0-75ea0372-7mj62 2/2 Running 0 7d20h 192.168.204.72 compute-0 <none>
nova-compute-compute-1-3dfb81d6-2lwwn 2/2 Running 0 7d20h 192.168.204.39 compute-1 <none>
nova-conductor-5675c54cfd-5dzfj 1/1 Running 0 83m 172.16.1.177 controller-1 <none>
nova-conductor-5675c54cfd-8cngs 1/1 Unknown 0 20h 172.16.0.48 controller-0 <none>
nova-conductor-5675c54cfd-ngh7h 0/1 Pending 0 3m21s <none> <none> <none>
nova-consoleauth-b5fb8679d-nh82k 1/1 Unknown 0 20h 172.16.0.56 controller-0 <none>
nova-consoleauth-b5fb8679d-t6dhk 1/1 Running 0 83m 172.16.1.188 controller-1 <none>
nova-consoleauth-b5fb8679d-zhwhv 0/1 Pending 0 3m21s <none> <none> <none>
nova-novncproxy-75444ff54b-6hfch 1/1 Unknown 0 20h 192.168.204.3 controller-0 <none>
nova-novncproxy-75444ff54b-q5l86 1/1 Running 0 83m 192.168.204.4 controller-1 <none>
nova-novncproxy-75444ff54b-zbnvs 0/1 Pending 0 3m21s <none> <none> <none>
nova-placement-api-5d8c995b48-wb9k8 1/1 Running 0 83m 172.16.1.167 controller-1 <none>
nova-placement-api-5d8c995b48-wt8lt 1/1 Unknown 0 20h 172.16.0.34 controller-0 <none>
nova-placement-api-5d8c995b48-xjjrh 0/1 Pending 0 3m21s <none> <none> <none>
nova-scheduler-64f854d9f8-4glrn 1/1 Running 0 83m 172.16.1.186 controller-1 <none>
nova-scheduler-64f854d9f8-9f5wf 1/1 Unknown 0 20h 172.16.0.31 controller-0 <none>
nova-scheduler-64f854d9f8-j8nrk 0/1 Pending 0 3m21s <none> <none> <none>
osh-openstack-garbd-garbd-85677d9d98-42fjk 1/1 Running 0 7d20h 172.16.2.2 compute-1 <none>
osh-openstack-memcached-memcached-d8866584f-995kt 1/1 Unknown 0 83m 172.16.0.10 controller-0 <none>
osh-openstack-memcached-memcached-d8866584f-lw59f 1/1 Running 0 3m20s 172.16.1.247 controller-1 <none>
osh-openstack-memcached-memcached-exporter-7c86db4f64-bjkhm 1/1 Unknown 0 83m 172.16.0.11 controller-0 <none>
osh-openstack-memcached-memcached-exporter-7c86db4f64-bpwzm 1/1 Running 0 3m20s 172.16.1.248 controller-1 <none>
osh-openstack-rabbitmq-rabbitmq-0 1/1 Running 0 83m 172.16.1.198 controller-1 <none>
osh-openstack-rabbitmq-rabbitmq-1 1/1 Unknown 0 20h 172.16.0.45 controller-0 <none>
osh-openstack-rabbitmq-rabbitmq-exporter-7748d8c78b-84dfl 1/1 Unknown 0 83m 172.16.0.12 controller-0 <none>
osh-openstack-rabbitmq-rabbitmq-exporter-7748d8c78b-h2rqq 1/1 Running 0 3m20s 172.16.1.249 controller-1 <none>
panko-api-75bf686957-bwb5x 0/1 Pending 0 3m20s <none> <none> <none>
panko-api-75bf686957-jww2q 1/1 Unknown 0 20h 172.16.0.37 controller-0 <none>
panko-api-75bf686957-v9w4h 1/1 Running 0 83m 172.16.1.174 controller-1 <none>
panko-events-cleaner-1554390600-d7fnr 0/1 Completed 0 66m 172.16.1.205 controller-1 <none>
panko-events-cleaner-1554394200-r6hn7 0/1 Completed 0 6m21s 172.16.1.242 controller-1 <none>
prometheus-mysql-exporter-5fb97d88c6-2jj6t 1/1 Unknown 0 83m 172.16.0.13 controller-0 <none>
prometheus-mysql-exporter-5fb97d88c6-l9d9n 1/1 Running 0 3m20s 172.16.1.250 controller-1 <none>
rbd-provisioner-64f89b8c69-5ncgq 1/1 Running 0 83m 172.16.1.171 controller-1 <none>
rbd-provisioner-64f89b8c69-cncwb 1/1 Unknown 0 20h 172.16.0.64 controller-0 <none>
rbd-provisioner-64f89b8c69-rbvf8 0/1 Pending 0 3m19s <none> <none> <none>
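For triage, a quick helper of my own (hypothetical tooling, not anything shipped with the product) that picks out the pods of interest from `kubectl get pods` text output - those whose READY count is below spec, skipping Completed jobs:

```python
# Hypothetical triage helper: parse `kubectl get pods` text output and list
# pods that are not fully ready (READY column "x/y" with x < y), skipping
# Completed job pods.
def not_ready(kubectl_output):
    rows = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip header row
        cols = line.split()
        name, ready, status = cols[0], cols[1], cols[2]
        if status == "Completed":
            continue
        have, want = (int(n) for n in ready.split("/"))
        if have < want:
            rows.append((name, ready, status))
    return rows

# Abbreviated sample based on the listing above.
sample = """\
NAME READY STATUS RESTARTS AGE
keystone-api-56b76565d8-mm2qb 0/1 Running 0 83m
glance-api-5476f44c75-hw9ff 1/1 Running 0 83m
heat-engine-cleaner-1554394500-dbzz2 0/1 Init:0/1 0 87s
ceph-pools-audit-1554393900-2hcl2 0/1 Completed 0 11m
"""
for name, ready, status in not_ready(sample):
    print(name, ready, status)
```

Against the listing above, this surfaces the keystone-api pod stuck at 0/1 Running among the expected transients (Pending replacements, Init containers).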
Here is the pod information showing the readiness probe has failed:
[root@controller-1 ~(keystone_admin)]# kubectl -n openstack describe pod keystone-api-56b76565d8-mm2qb
Name: keystone-api-56b76565d8-mm2qb
Namespace: openstack
Priority: 0
PriorityClassName: <none>
Node: controller-1/192.168.204.4
Start Time: Thu, 04 Apr 2019 15:01:36 +0000
Labels: application=keystone
component=api
pod-template-hash=56b76565d8
release_group=osh-openstack-keystone
Annotations: cni.projectcalico.org/podIP: 172.16.1.193/32
configmap-bin-hash: 3680c591230b30a54bc86b1a039f3405131d642f3523aca50eceb78733481280
configmap-etc-hash: bdac7cd8a260f3f95025abc4f68496e11f5c4580e4992f0062787ed0464cc807
Status: Running
IP: 172.16.1.193
Controlled By: ReplicaSet/keystone-api-56b76565d8
Init Containers:
init:
Container ID: docker://44e64a3f55bc7fd6136c8fd314f917f6a8cceb0ab73cd1edf674a03b10f88891
Image: 192.168.204.2:9001/quay.io/stackanetes/kubernetes-entrypoint:v0.3.1
Image ID: docker-pullable://192.168.204.2:9001/quay.io/stackanetes/kubernetes-entrypoint@sha256:2cad635d9079ee130f7f288b2ffb9c65c90fc7711364dcf438973ddf579b7fed
Port: <none>
Host Port: <none>
Command:
kubernetes-entrypoint
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 04 Apr 2019 15:01:55 +0000
Finished: Thu, 04 Apr 2019 15:01:57 +0000
Ready: True
Restart Count: 0
Environment:
POD_NAME: keystone-api-56b76565d8-mm2qb (v1:metadata.name)
NAMESPACE: openstack (v1:metadata.namespace)
INTERFACE_NAME: eth0
PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/
DEPENDENCY_SERVICE: openstack:memcached,openstack:mariadb
DEPENDENCY_JOBS: keystone-db-sync,keystone-credential-setup,keystone-fernet-setup,keystone-rabbit-init
DEPENDENCY_DAEMONSET:
DEPENDENCY_CONTAINER:
DEPENDENCY_POD_JSON:
COMMAND: echo done
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from keystone-api-token-5zkkq (ro)
Containers:
keystone-api:
Container ID: docker://f46b82b140c05bac21e5df5e26737de162833bdcae91e617914867030aceb9f4
Image: 192.168.204.2:9001/starlingx/stx-keystone:dev-centos-master-latest
Image ID: docker-pullable://192.168.204.2:9001/starlingx/stx-keystone@sha256:f465fa9d7f0947d2a6634b7706a2b19eb3fc3290faa0f5f037f97fe378c1c08c
Port: 5000/TCP
Host Port: 0/TCP
Command:
/tmp/keystone-api.sh
start
State: Running
Started: Thu, 04 Apr 2019 15:02:11 +0000
Ready: False
Restart Count: 0
Readiness: http-get http://:5000/ delay=15s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/apache2/conf-enabled/wsgi-keystone.conf from keystone-etc (ro)
/etc/apache2/mods-available/mpm_event.conf from keystone-etc (ro)
/etc/apache2/ports.conf from keystone-etc (ro)
/etc/keystone from etckeystone (rw)
/etc/keystone/credential-keys/ from keystone-credential-keys (rw)
/etc/keystone/fernet-keys/ from keystone-fernet-keys (rw)
/etc/keystone/keystone-paste.ini from keystone-etc (ro)
/etc/keystone/keystone.conf from keystone-etc (ro)
/etc/keystone/logging.conf from keystone-etc (ro)
/etc/keystone/policy.json from keystone-etc (ro)
/etc/keystone/sso_callback_template.html from keystone-etc (ro)
/tmp/keystone-api.sh from keystone-bin (ro)
/var/log/apache2 from logs-apache (rw)
/var/run/apache2 from run-apache (rw)
/var/run/secrets/kubernetes.io/serviceaccount from keystone-api-token-5zkkq (ro)
/var/www/cgi-bin/keystone from wsgi-keystone (rw)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
etckeystone:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
wsgi-keystone:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
logs-apache:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
run-apache:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
keystone-etc:
Type: Secret (a volume populated by a Secret)
SecretName: keystone-etc
Optional: false
keystone-bin:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: keystone-bin
Optional: false
keystone-fernet-keys:
Type: Secret (a volume populated by a Secret)
SecretName: keystone-fernet-keys
Optional: false
keystone-credential-keys:
Type: Secret (a volume populated by a Secret)
SecretName: keystone-credential-keys
Optional: false
keystone-api-token-5zkkq:
Type: Secret (a volume populated by a Secret)
SecretName: keystone-api-token-5zkkq
Optional: false
QoS Class: BestEffort
Node-Selectors: openstack-control-plane=enabled
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 3s (x21 over 72m) kubelet, controller-1 Readiness probe failed: Get http://172.16.1.193:5000/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
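The probe spec above (http-get on port 5000, timeout=1s, period=10s, #success=1 #failure=3) means the kubelet only marks the pod 0/1 after three consecutive timed-out GETs, and one passing probe restores it. A minimal Python sketch of that threshold logic (an illustration, not kubelet code):

```python
# Sketch of readiness-probe threshold behavior, mirroring the keystone-api
# probe settings: successThreshold=1, failureThreshold=3.
class ReadinessTracker:
    def __init__(self, success_threshold=1, failure_threshold=3):
        self.success_threshold = success_threshold
        self.failure_threshold = failure_threshold
        self.successes = 0
        self.failures = 0
        self.ready = False

    def record(self, probe_ok):
        # Consecutive counts: a success resets the failure streak and
        # vice versa, as the kubelet does.
        if probe_ok:
            self.successes += 1
            self.failures = 0
            if self.successes >= self.success_threshold:
                self.ready = True
        else:
            self.failures += 1
            self.successes = 0
            if self.failures >= self.failure_threshold:
                self.ready = False
        return self.ready

tracker = ReadinessTracker()
print(tracker.record(True))    # True: one success is enough to be Ready
print(tracker.record(False))   # True: first timeout tolerated
print(tracker.record(False))   # True: second timeout tolerated
print(tracker.record(False))   # False: third consecutive failure -> 0/1
```

With period=10s this gives roughly a 30-second window of failed probes before the pod drops out of the keystone service endpoints; the "x21 over 72m" in the event above shows the probe kept failing long past that.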
The following logs repeat continuously in the keystone-api pod while the problem is occurring:
2019-04-04 16:13:20.903261 Script timed out before returning headers: keystone-wsgi-public
128.224.151.193 - - [04/Apr/2019:16:12:20 +0000] "GET / HTTP/1.1" 500 527 "-" "kube-probe/1.12"
When the keystone-api pod eventually recovered (on its own), I saw the following logs:
128.224.151.193 - - [04/Apr/2019:16:20:20 +0000] "GET / HTTP/1.1" 500 527 "-" "kube-probe/1.12"
2019-04-04 16:21:25.474504 2019-04-04 16:21:25,474.474 10 WARNING keystone.common.fernet_utils [req-200f6682-2c23-43b2-850d-3f21aadddaf2 - - - - -] key_repository is world readable: /etc/keystone/fernet-keys/
2019-04-04 16:21:30.908476 Script timed out before returning headers: keystone-wsgi-public
128.224.151.193 - - [04/Apr/2019:16:20:30 +0000] "GET / HTTP/1.1" 500 527 "-" "kube-probe/1.12"
128.224.151.193 - - [04/Apr/2019:16:11:51 +0000] "POST /v3/auth/tokens HTTP/1.1" 201 10176 "-" "openstacksdk/0.25.0 keystoneauth1/0.0.0 python-requests/2.14.2 CPython/2.7.5"
2019-04-04 16:21:37.606921 2019-04-04 16:21:37,606.606 10 WARNING keystone.common.fernet_utils [req-16ae6f28-0cc4-4ac7-8b86-85611d5ac019 - - - - -] key_repository is world readable: /etc/keystone/fernet-keys/: NeedRegenerationException
2019-04-04 16:21:37.824104 2019-04-04 16:21:37,823.823 10 WARNING keystone.common.fernet_utils [req-5c4e7de0-1379-48cb-9b98-9f696ea42662 46f26a09dbaa48e4a89e3be5e5e95019 788a279244f44f39bdff6cb636798ea7 - 82e9626778234640b09d12423097edb9 82e9626778234640b09d12423097edb9] key_repository is world readable: /etc/keystone/fernet-keys/: NeedRegenerationException
128.224.151.193 - - [04/Apr/2019:16:20:40 +0000] "GET / HTTP/1.1" 300 267 "-" "kube-probe/1.12"
128.224.151.193 - - [04/Apr/2019:16:20:50 +0000] "GET / HTTP/1.1" 300 267 "-" "kube-probe/1.12"
It is very suspicious that one of the logs (the POST) has a timestamp almost 9 minutes earlier than the surrounding logs. I wonder if this POST request hung and somehow blocked the readiness probes.
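Supporting that suspicion: Apache stamps each access-log line with the request's start time but only writes the line when the request completes, so an entry that appears ~9 minutes "late" is exactly what a long-hung request would look like. A small hypothetical helper (my own sketch, not part of any existing tooling) that flags such backwards timestamp jumps in an access log:

```python
# Hypothetical log-scan helper: flag access-log lines whose timestamp jumps
# backwards relative to the previous line by more than a threshold. Since
# Apache logs a request on completion but stamps it with its start time, a
# large backwards jump suggests a request that hung for that long.
import re
from datetime import datetime

TS_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) [+-]\d{4}\]")

def backwards_jumps(log_lines, threshold_s=60):
    """Return lines whose timestamp is more than threshold_s seconds
    earlier than the previous parsed line's timestamp."""
    suspects = []
    prev_ts = None
    for line in log_lines:
        m = TS_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S")
        if prev_ts is not None and (prev_ts - ts).total_seconds() > threshold_s:
            suspects.append(line)
        prev_ts = ts
    return suspects

# Abbreviated lines from the recovery logs above.
logs = [
    '128.224.151.193 - - [04/Apr/2019:16:20:20 +0000] "GET / HTTP/1.1" 500 527',
    '128.224.151.193 - - [04/Apr/2019:16:20:30 +0000] "GET / HTTP/1.1" 500 527',
    '128.224.151.193 - - [04/Apr/2019:16:11:51 +0000] "POST /v3/auth/tokens HTTP/1.1" 201 10176',
    '128.224.151.193 - - [04/Apr/2019:16:20:40 +0000] "GET / HTTP/1.1" 300 267',
]
for line in backwards_jumps(logs):
    print("suspect:", line)
```

Run against the recovery logs, only the POST to /v3/auth/tokens is flagged, consistent with a request that started at 16:11:51 and did not complete until the pod recovered around 16:21.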
Test Activity
-------------
Developer Testing
Marking as release gating: this is a serious robustness issue that needs further investigation, even though it is not seen frequently.