After force reboot of active controller keystone-api pod sometimes hangs

Bug #1823375 reported by Bart Wensley
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: yong hu

Bug Description

Brief Description
-----------------
After force reboot of active controller keystone-api pod sometimes hangs

Severity
--------
Minor - the problem has a major impact when it occurs, but it happens only rarely

Steps to Reproduce
------------------
1. Install stx-openstack application on a system with two controllers (I saw this on a 2+2 config).
2. Force reboot the active controller.
I saw the issue on 2 out of about 20 reboots of the active controller.
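
A minimal sketch (not from the original report) of one way to drive step 2 and observe the result, assuming console access to the active controller and that magic-sysrq is enabled; the exact reboot method used in the original test is not recorded here:

# On the active controller (console access assumed): force an immediate,
# unclean reboot via sysrq. This is one possible method only.
echo b | sudo tee /proc/sysrq-trigger

# On the surviving controller: watch the stx-openstack pods while the peer reboots.
kubectl -n openstack get pods -o wide -w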

Expected Behavior
------------------
All of the stx-openstack application's pods should return to service in a reasonable amount of time.

Actual Behavior
----------------
Sometimes the readiness probe fails in the keystone-api pod (on the non-rebooting controller) and it stops functioning. This means keystone is unavailable until the rebooting controller finishes its reboot.

Reproducibility
---------------
Intermittent (seen on 2 out of about 20 active controller reboots)

System Configuration
--------------------
Seen on a 2+2 system.
Lab was windriver ironpass-1-4

Branch/Pull Time/Commit
-----------------------
SW_VERSION="19.01"
BUILD_TARGET="Unknown"
BUILD_TYPE="Informal"
BUILD_ID="n/a"

JOB="n/a"
BUILD_BY="bwensley"
BUILD_NUMBER="n/a"
BUILD_HOST="yow-bwensley-lx-vm2"
BUILD_DATE="2019-03-20 08:05:38 -0500"

BUILD_DIR="/"
WRS_SRC_DIR="/localdisk/designer/bwensley/starlingx-0/cgcs-root"
WRS_GIT_BRANCH="HEAD"
CGCS_SRC_DIR="/localdisk/designer/bwensley/starlingx-0/cgcs-root/stx"
CGCS_GIT_BRANCH="HEAD"

Last Pass
---------
This is new testing on the containers deployment.

Timestamp/Logs
--------------

Here are the stx-openstack pods while controller-0 was rebooting, showing the keystone-api pod on controller-1 has 0/1 ready:

[root@controller-1 ~(keystone_admin)]# kubectl -n openstack get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
aodh-api-56fc6554df-lcplh 1/1 Running 0 83m 172.16.1.194 controller-1 <none>
aodh-api-56fc6554df-mvnj6 1/1 Unknown 0 20h 172.16.0.65 controller-0 <none>
aodh-api-56fc6554df-rjlqc 0/1 Pending 0 3m24s <none> <none> <none>
aodh-evaluator-9cc59d7f6-5sr7k 1/1 Unknown 0 20h 172.16.0.41 controller-0 <none>
aodh-evaluator-9cc59d7f6-d7qtj 0/1 Pending 0 3m24s <none> <none> <none>
aodh-evaluator-9cc59d7f6-gj6tr 1/1 Running 0 83m 172.16.1.166 controller-1 <none>
aodh-listener-65df7d6bbf-6bfpk 0/1 Pending 0 3m24s <none> <none> <none>
aodh-listener-65df7d6bbf-8zbnf 1/1 Unknown 0 20h 172.16.0.66 controller-0 <none>
aodh-listener-65df7d6bbf-t6s8m 1/1 Running 0 83m 172.16.1.187 controller-1 <none>
aodh-notifier-9bfd474f4-cx7rd 1/1 Unknown 0 20h 172.16.0.57 controller-0 <none>
aodh-notifier-9bfd474f4-tnt7f 0/1 Pending 0 3m24s <none> <none> <none>
aodh-notifier-9bfd474f4-wbt72 1/1 Running 0 83m 172.16.1.176 controller-1 <none>
barbican-api-54f7966ccf-m7zb8 1/1 Unknown 0 20h 172.16.0.49 controller-0 <none>
barbican-api-54f7966ccf-wmvdt 0/1 Pending 0 3m24s <none> <none> <none>
barbican-api-54f7966ccf-xmfh9 1/1 Running 0 83m 172.16.1.189 controller-1 <none>
ceilometer-central-84f98b7d77-l4nxl 0/1 Pending 0 3m24s <none> <none> <none>
ceilometer-central-84f98b7d77-qp24p 1/1 Running 0 83m 172.16.1.168 controller-1 <none>
ceilometer-central-84f98b7d77-vppxv 1/1 Unknown 0 20h 172.16.0.54 controller-0 <none>
ceilometer-compute-2g8tj 1/1 Running 0 7d20h 192.168.204.72 compute-0 <none>
ceilometer-compute-df27r 1/1 Running 0 7d20h 192.168.204.39 compute-1 <none>
ceilometer-notification-c4dc7d9c7-hpp6v 1/1 Unknown 0 8h 172.16.0.74 controller-0 <none>
ceilometer-notification-c4dc7d9c7-rdpv4 1/1 Running 0 3m24s 172.16.1.244 controller-1 <none>
ceph-pools-audit-1554393900-2hcl2 0/1 Completed 0 11m 172.16.1.236 controller-1 <none>
ceph-pools-audit-1554394200-tp7pg 0/1 Completed 0 6m21s 172.16.1.239 controller-1 <none>
ceph-pools-audit-1554394500-qqxh6 0/1 Completed 0 87s 172.16.1.251 controller-1 <none>
cinder-api-55948fcbf-4xb82 0/1 Pending 0 3m24s <none> <none> <none>
cinder-api-55948fcbf-88kk7 1/1 Unknown 0 20h 172.16.0.50 controller-0 <none>
cinder-api-55948fcbf-k46g2 1/1 Running 0 83m 172.16.1.195 controller-1 <none>
cinder-backup-788646f4b6-5cfz2 0/1 Pending 0 3m24s <none> <none> <none>
cinder-backup-788646f4b6-5wgf9 1/1 Unknown 0 20h 172.16.0.62 controller-0 <none>
cinder-backup-788646f4b6-h2hst 1/1 Running 0 83m 172.16.1.197 controller-1 <none>
cinder-scheduler-77f7c847fc-kzksb 1/1 Running 0 83m 172.16.1.196 controller-1 <none>
cinder-scheduler-77f7c847fc-tcj5s 1/1 Unknown 0 20h 172.16.0.60 controller-0 <none>
cinder-scheduler-77f7c847fc-xc8kv 0/1 Pending 0 3m24s <none> <none> <none>
cinder-volume-5874867c6d-8qmd4 1/1 Unknown 0 20h 172.16.0.58 controller-0 <none>
cinder-volume-5874867c6d-h7vgh 0/1 Pending 0 3m24s <none> <none> <none>
cinder-volume-5874867c6d-qrt2h 1/1 Running 0 83m 172.16.1.190 controller-1 <none>
cinder-volume-usage-audit-1554393600-f48lk 0/1 Completed 0 16m 172.16.1.234 controller-1 <none>
cinder-volume-usage-audit-1554393900-lcrzl 0/1 Completed 0 11m 172.16.1.237 controller-1 <none>
cinder-volume-usage-audit-1554394200-c2pwk 0/1 Completed 0 6m21s 172.16.1.240 controller-1 <none>
cinder-volume-usage-audit-1554394500-865ms 0/1 Init:0/1 0 87s 172.16.1.252 controller-1 <none>
glance-api-5476f44c75-hw9ff 1/1 Running 0 83m 172.16.1.191 controller-1 <none>
glance-api-5476f44c75-m42qc 0/1 Pending 0 3m24s <none> <none> <none>
glance-api-5476f44c75-vsbkb 1/1 Unknown 0 20h 172.16.0.52 controller-0 <none>
gnocchi-api-6bcc9d9975-dgp57 1/1 Running 0 83m 172.16.1.172 controller-1 <none>
gnocchi-api-6bcc9d9975-jfrfv 0/1 Pending 0 3m23s <none> <none> <none>
gnocchi-api-6bcc9d9975-kmchm 1/1 Unknown 0 20h 172.16.0.53 controller-0 <none>
gnocchi-metricd-hxd5b 1/1 NodeLost 7 7d20h 172.16.0.30 controller-0 <none>
gnocchi-metricd-t5fcc 1/1 Running 0 74m 172.16.1.170 controller-1 <none>
heat-api-cdd6fb9f4-hl6jv 0/1 Pending 0 3m23s <none> <none> <none>
heat-api-cdd6fb9f4-s4zhn 1/1 Unknown 0 20h 172.16.0.46 controller-0 <none>
heat-api-cdd6fb9f4-xqgj7 1/1 Running 0 83m 172.16.1.178 controller-1 <none>
heat-cfn-6fddc4f786-2vbwm 1/1 Running 0 83m 172.16.1.184 controller-1 <none>
heat-cfn-6fddc4f786-6vbn6 1/1 Unknown 0 20h 172.16.0.47 controller-0 <none>
heat-cfn-6fddc4f786-vx96p 0/1 Pending 0 3m23s <none> <none> <none>
heat-engine-546cf796d7-6zr4d 1/1 Running 0 83m 172.16.1.173 controller-1 <none>
heat-engine-546cf796d7-7prqj 0/1 Pending 0 3m23s <none> <none> <none>
heat-engine-546cf796d7-ld7kq 1/1 Unknown 0 20h 172.16.0.33 controller-0 <none>
heat-engine-cleaner-1554393600-27z2l 0/1 Completed 0 16m 172.16.1.235 controller-1 <none>
heat-engine-cleaner-1554393900-szh24 0/1 Completed 0 11m 172.16.1.238 controller-1 <none>
heat-engine-cleaner-1554394200-qsjnd 0/1 Completed 0 6m21s 172.16.1.241 controller-1 <none>
heat-engine-cleaner-1554394500-dbzz2 0/1 Init:0/1 0 87s 172.16.1.253 controller-1 <none>
horizon-6b786bc8df-9fkft 1/1 Running 0 74m 172.16.1.181 controller-1 <none>
ingress-56dd5dbcd4-rzrp2 0/1 Pending 0 3m23s <none> <none> <none>
ingress-56dd5dbcd4-v5h95 1/1 Unknown 0 20h 172.16.0.55 controller-0 <none>
ingress-56dd5dbcd4-xsztr 1/1 Running 0 83m 172.16.1.180 controller-1 <none>
ingress-error-pages-6f58cd4f8f-chqtc 1/1 Running 0 83m 172.16.1.179 controller-1 <none>
ingress-error-pages-6f58cd4f8f-rvm5p 1/1 Unknown 0 20h 172.16.0.59 controller-0 <none>
ingress-error-pages-6f58cd4f8f-rxmtr 0/1 Pending 0 3m22s <none> <none> <none>
keystone-api-56b76565d8-4w6n2 0/1 Pending 0 3m22s <none> <none> <none>
keystone-api-56b76565d8-65nns 1/1 Unknown 0 20h 172.16.0.43 controller-0 <none>
keystone-api-56b76565d8-mm2qb 0/1 Running 0 83m 172.16.1.193 controller-1 <none>
libvirt-qjdpp 1/1 Running 0 7d20h 192.168.204.72 compute-0 <none>
libvirt-t7sh8 1/1 Running 0 7d20h 192.168.204.39 compute-1 <none>
mariadb-ingress-5b8584d855-q2t47 0/1 Pending 0 3m22s <none> <none> <none>
mariadb-ingress-5b8584d855-wbhd4 1/1 Running 0 83m 172.16.1.175 controller-1 <none>
mariadb-ingress-5b8584d855-x5shd 1/1 Unknown 0 20h 172.16.0.51 controller-0 <none>
mariadb-ingress-error-pages-66f4f495b9-bvffq 1/1 Running 0 3m22s 172.16.1.245 controller-1 <none>
mariadb-ingress-error-pages-66f4f495b9-l2r9n 1/1 Unknown 0 83m 172.16.0.9 controller-0 <none>
mariadb-server-0 1/1 Unknown 3 20h 172.16.0.63 controller-0 <none>
mariadb-server-1 1/1 Running 3 83m 172.16.1.199 controller-1 <none>
neutron-dhcp-agent-compute-0-75ea0372-xsml8 1/1 Running 0 7d20h 192.168.204.72 compute-0 <none>
neutron-dhcp-agent-compute-1-eae26dba-dhcwg 1/1 Running 0 7d20h 192.168.204.39 compute-1 <none>
neutron-l3-agent-compute-0-31b0f4b3-rqn6b 1/1 Running 0 7d20h 192.168.204.72 compute-0 <none>
neutron-l3-agent-compute-1-eae26dba-rszqz 1/1 Running 0 7d20h 192.168.204.39 compute-1 <none>
neutron-metadata-agent-compute-0-75ea0372-r4vsm 1/1 Running 0 7d20h 192.168.204.72 compute-0 <none>
neutron-metadata-agent-compute-1-3dfb81d6-dksr7 1/1 Running 0 7d20h 192.168.204.39 compute-1 <none>
neutron-ovs-agent-compute-0-75ea0372-8mw9x 1/1 Running 0 7d20h 192.168.204.72 compute-0 <none>
neutron-ovs-agent-compute-1-eae26dba-mqv5d 1/1 Running 0 7d20h 192.168.204.39 compute-1 <none>
neutron-server-85dd569cbf-4s8vh 1/1 Running 0 83m 172.16.1.183 controller-1 <none>
neutron-server-85dd569cbf-7hsp7 0/1 Pending 0 3m22s <none> <none> <none>
neutron-server-85dd569cbf-fb8zx 1/1 Unknown 0 20h 172.16.0.35 controller-0 <none>
neutron-sriov-agent-compute-0-75ea0372-8rw6t 1/1 Running 0 7d20h 192.168.204.72 compute-0 <none>
neutron-sriov-agent-compute-1-eae26dba-fmmz7 1/1 Running 0 7d20h 192.168.204.39 compute-1 <none>
nova-api-metadata-8579687b7b-lnpg2 1/1 Unknown 0 20h 172.16.0.44 controller-0 <none>
nova-api-metadata-8579687b7b-v7rcx 0/1 Pending 0 3m22s <none> <none> <none>
nova-api-metadata-8579687b7b-v87nh 1/1 Running 0 83m 172.16.1.169 controller-1 <none>
nova-api-osapi-bf7cc6977-mjxgz 0/1 Pending 0 3m22s <none> <none> <none>
nova-api-osapi-bf7cc6977-nh982 1/1 Running 0 83m 172.16.1.192 controller-1 <none>
nova-api-osapi-bf7cc6977-zqpgw 1/1 Unknown 0 20h 172.16.0.36 controller-0 <none>
nova-api-proxy-5bfcc78cdd-7ngz5 1/1 Unknown 0 8h 172.16.0.78 controller-0 <none>
nova-api-proxy-5bfcc78cdd-v7vg8 1/1 Running 0 3m21s 172.16.1.246 controller-1 <none>
nova-compute-compute-0-75ea0372-7mj62 2/2 Running 0 7d20h 192.168.204.72 compute-0 <none>
nova-compute-compute-1-3dfb81d6-2lwwn 2/2 Running 0 7d20h 192.168.204.39 compute-1 <none>
nova-conductor-5675c54cfd-5dzfj 1/1 Running 0 83m 172.16.1.177 controller-1 <none>
nova-conductor-5675c54cfd-8cngs 1/1 Unknown 0 20h 172.16.0.48 controller-0 <none>
nova-conductor-5675c54cfd-ngh7h 0/1 Pending 0 3m21s <none> <none> <none>
nova-consoleauth-b5fb8679d-nh82k 1/1 Unknown 0 20h 172.16.0.56 controller-0 <none>
nova-consoleauth-b5fb8679d-t6dhk 1/1 Running 0 83m 172.16.1.188 controller-1 <none>
nova-consoleauth-b5fb8679d-zhwhv 0/1 Pending 0 3m21s <none> <none> <none>
nova-novncproxy-75444ff54b-6hfch 1/1 Unknown 0 20h 192.168.204.3 controller-0 <none>
nova-novncproxy-75444ff54b-q5l86 1/1 Running 0 83m 192.168.204.4 controller-1 <none>
nova-novncproxy-75444ff54b-zbnvs 0/1 Pending 0 3m21s <none> <none> <none>
nova-placement-api-5d8c995b48-wb9k8 1/1 Running 0 83m 172.16.1.167 controller-1 <none>
nova-placement-api-5d8c995b48-wt8lt 1/1 Unknown 0 20h 172.16.0.34 controller-0 <none>
nova-placement-api-5d8c995b48-xjjrh 0/1 Pending 0 3m21s <none> <none> <none>
nova-scheduler-64f854d9f8-4glrn 1/1 Running 0 83m 172.16.1.186 controller-1 <none>
nova-scheduler-64f854d9f8-9f5wf 1/1 Unknown 0 20h 172.16.0.31 controller-0 <none>
nova-scheduler-64f854d9f8-j8nrk 0/1 Pending 0 3m21s <none> <none> <none>
osh-openstack-garbd-garbd-85677d9d98-42fjk 1/1 Running 0 7d20h 172.16.2.2 compute-1 <none>
osh-openstack-memcached-memcached-d8866584f-995kt 1/1 Unknown 0 83m 172.16.0.10 controller-0 <none>
osh-openstack-memcached-memcached-d8866584f-lw59f 1/1 Running 0 3m20s 172.16.1.247 controller-1 <none>
osh-openstack-memcached-memcached-exporter-7c86db4f64-bjkhm 1/1 Unknown 0 83m 172.16.0.11 controller-0 <none>
osh-openstack-memcached-memcached-exporter-7c86db4f64-bpwzm 1/1 Running 0 3m20s 172.16.1.248 controller-1 <none>
osh-openstack-rabbitmq-rabbitmq-0 1/1 Running 0 83m 172.16.1.198 controller-1 <none>
osh-openstack-rabbitmq-rabbitmq-1 1/1 Unknown 0 20h 172.16.0.45 controller-0 <none>
osh-openstack-rabbitmq-rabbitmq-exporter-7748d8c78b-84dfl 1/1 Unknown 0 83m 172.16.0.12 controller-0 <none>
osh-openstack-rabbitmq-rabbitmq-exporter-7748d8c78b-h2rqq 1/1 Running 0 3m20s 172.16.1.249 controller-1 <none>
panko-api-75bf686957-bwb5x 0/1 Pending 0 3m20s <none> <none> <none>
panko-api-75bf686957-jww2q 1/1 Unknown 0 20h 172.16.0.37 controller-0 <none>
panko-api-75bf686957-v9w4h 1/1 Running 0 83m 172.16.1.174 controller-1 <none>
panko-events-cleaner-1554390600-d7fnr 0/1 Completed 0 66m 172.16.1.205 controller-1 <none>
panko-events-cleaner-1554394200-r6hn7 0/1 Completed 0 6m21s 172.16.1.242 controller-1 <none>
prometheus-mysql-exporter-5fb97d88c6-2jj6t 1/1 Unknown 0 83m 172.16.0.13 controller-0 <none>
prometheus-mysql-exporter-5fb97d88c6-l9d9n 1/1 Running 0 3m20s 172.16.1.250 controller-1 <none>
rbd-provisioner-64f89b8c69-5ncgq 1/1 Running 0 83m 172.16.1.171 controller-1 <none>
rbd-provisioner-64f89b8c69-cncwb 1/1 Unknown 0 20h 172.16.0.64 controller-0 <none>
rbd-provisioner-64f89b8c69-rbvf8 0/1 Pending 0 3m19s <none> <none> <none>

Here is the pod information showing the readiness probe has failed:

[root@controller-1 ~(keystone_admin)]# kubectl -n openstack describe pod keystone-api-56b76565d8-mm2qb
Name: keystone-api-56b76565d8-mm2qb
Namespace: openstack
Priority: 0
PriorityClassName: <none>
Node: controller-1/192.168.204.4
Start Time: Thu, 04 Apr 2019 15:01:36 +0000
Labels: application=keystone
                    component=api
                    pod-template-hash=56b76565d8
                    release_group=osh-openstack-keystone
Annotations: cni.projectcalico.org/podIP: 172.16.1.193/32
                    configmap-bin-hash: 3680c591230b30a54bc86b1a039f3405131d642f3523aca50eceb78733481280
                    configmap-etc-hash: bdac7cd8a260f3f95025abc4f68496e11f5c4580e4992f0062787ed0464cc807
Status: Running
IP: 172.16.1.193
Controlled By: ReplicaSet/keystone-api-56b76565d8
Init Containers:
  init:
    Container ID: docker://44e64a3f55bc7fd6136c8fd314f917f6a8cceb0ab73cd1edf674a03b10f88891
    Image: 192.168.204.2:9001/quay.io/stackanetes/kubernetes-entrypoint:v0.3.1
    Image ID: docker-pullable://192.168.204.2:9001/quay.io/stackanetes/kubernetes-entrypoint@sha256:2cad635d9079ee130f7f288b2ffb9c65c90fc7711364dcf438973ddf579b7fed
    Port: <none>
    Host Port: <none>
    Command:
      kubernetes-entrypoint
    State: Terminated
      Reason: Completed
      Exit Code: 0
      Started: Thu, 04 Apr 2019 15:01:55 +0000
      Finished: Thu, 04 Apr 2019 15:01:57 +0000
    Ready: True
    Restart Count: 0
    Environment:
      POD_NAME: keystone-api-56b76565d8-mm2qb (v1:metadata.name)
      NAMESPACE: openstack (v1:metadata.namespace)
      INTERFACE_NAME: eth0
      PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/
      DEPENDENCY_SERVICE: openstack:memcached,openstack:mariadb
      DEPENDENCY_JOBS: keystone-db-sync,keystone-credential-setup,keystone-fernet-setup,keystone-rabbit-init
      DEPENDENCY_DAEMONSET:
      DEPENDENCY_CONTAINER:
      DEPENDENCY_POD_JSON:
      COMMAND: echo done
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from keystone-api-token-5zkkq (ro)
Containers:
  keystone-api:
    Container ID: docker://f46b82b140c05bac21e5df5e26737de162833bdcae91e617914867030aceb9f4
    Image: 192.168.204.2:9001/starlingx/stx-keystone:dev-centos-master-latest
    Image ID: docker-pullable://192.168.204.2:9001/starlingx/stx-keystone@sha256:f465fa9d7f0947d2a6634b7706a2b19eb3fc3290faa0f5f037f97fe378c1c08c
    Port: 5000/TCP
    Host Port: 0/TCP
    Command:
      /tmp/keystone-api.sh
      start
    State: Running
      Started: Thu, 04 Apr 2019 15:02:11 +0000
    Ready: False
    Restart Count: 0
    Readiness: http-get http://:5000/ delay=15s timeout=1s period=10s #success=1 #failure=3
    Environment: <none>
    Mounts:
      /etc/apache2/conf-enabled/wsgi-keystone.conf from keystone-etc (ro)
      /etc/apache2/mods-available/mpm_event.conf from keystone-etc (ro)
      /etc/apache2/ports.conf from keystone-etc (ro)
      /etc/keystone from etckeystone (rw)
      /etc/keystone/credential-keys/ from keystone-credential-keys (rw)
      /etc/keystone/fernet-keys/ from keystone-fernet-keys (rw)
      /etc/keystone/keystone-paste.ini from keystone-etc (ro)
      /etc/keystone/keystone.conf from keystone-etc (ro)
      /etc/keystone/logging.conf from keystone-etc (ro)
      /etc/keystone/policy.json from keystone-etc (ro)
      /etc/keystone/sso_callback_template.html from keystone-etc (ro)
      /tmp/keystone-api.sh from keystone-bin (ro)
      /var/log/apache2 from logs-apache (rw)
      /var/run/apache2 from run-apache (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from keystone-api-token-5zkkq (ro)
      /var/www/cgi-bin/keystone from wsgi-keystone (rw)
Conditions:
  Type Status
  Initialized True
  Ready False
  ContainersReady False
  PodScheduled True
Volumes:
  etckeystone:
    Type: EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  wsgi-keystone:
    Type: EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  logs-apache:
    Type: EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  run-apache:
    Type: EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  keystone-etc:
    Type: Secret (a volume populated by a Secret)
    SecretName: keystone-etc
    Optional: false
  keystone-bin:
    Type: ConfigMap (a volume populated by a ConfigMap)
    Name: keystone-bin
    Optional: false
  keystone-fernet-keys:
    Type: Secret (a volume populated by a Secret)
    SecretName: keystone-fernet-keys
    Optional: false
  keystone-credential-keys:
    Type: Secret (a volume populated by a Secret)
    SecretName: keystone-credential-keys
    Optional: false
  keystone-api-token-5zkkq:
    Type: Secret (a volume populated by a Secret)
    SecretName: keystone-api-token-5zkkq
    Optional: false
QoS Class: BestEffort
Node-Selectors: openstack-control-plane=enabled
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Warning Unhealthy 3s (x21 over 72m) kubelet, controller-1 Readiness probe failed: Get http://172.16.1.193:5000/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
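
The probe in the describe output above is an HTTP GET on port 5000 with a 1-second timeout. As a sanity check (not part of the original report), the same request the kubelet makes can be reproduced by hand from the node hosting the pod, assuming curl is available; the pod IP and port are taken from the output above:

# Reproduce the kubelet readiness probe manually; -m 1 matches the probe's 1s timeout.
curl -m 1 -sS -o /dev/null -w '%{http_code}\n' http://172.16.1.193:5000/

# Recent probe failures also appear in the pod's events.
kubectl -n openstack describe pod keystone-api-56b76565d8-mm2qb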

The following logs are repeating forever in the keystone-api pod while the problem is occurring:

2019-04-04 16:13:20.903261 Script timed out before returning headers: keystone-wsgi-public
128.224.151.193 - - [04/Apr/2019:16:12:20 +0000] "GET / HTTP/1.1" 500 527 "-" "kube-probe/1.12"
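
For reference, these lines can be followed live with a generic kubectl invocation (not from the original report), assuming the Apache/WSGI output is emitted to the container's stdout; otherwise the files under the pod's /var/log/apache2 mount can be inspected directly:

# Follow the keystone-api container log on the surviving controller.
kubectl -n openstack logs -f keystone-api-56b76565d8-mm2qb

# Or list the Apache log files inside the pod and tail the relevant one.
kubectl -n openstack exec keystone-api-56b76565d8-mm2qb -- ls /var/log/apache2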

When the keystone-api pod eventually recovered (on its own), I saw the following logs:

128.224.151.193 - - [04/Apr/2019:16:20:20 +0000] "GET / HTTP/1.1" 500 527 "-" "kube-probe/1.12"
2019-04-04 16:21:25.474504 2019-04-04 16:21:25,474.474 10 WARNING keystone.common.fernet_utils [req-200f6682-2c23-43b2-850d-3f21aadddaf2 - - - - -] key_repository is world readable: /etc/keystone/fernet-keys/
2019-04-04 16:21:30.908476 Script timed out before returning headers: keystone-wsgi-public
128.224.151.193 - - [04/Apr/2019:16:20:30 +0000] "GET / HTTP/1.1" 500 527 "-" "kube-probe/1.12"
128.224.151.193 - - [04/Apr/2019:16:11:51 +0000] "POST /v3/auth/tokens HTTP/1.1" 201 10176 "-" "openstacksdk/0.25.0 keystoneauth1/0.0.0 python-requests/2.14.2 CPython/2.7.5"
2019-04-04 16:21:37.606921 2019-04-04 16:21:37,606.606 10 WARNING keystone.common.fernet_utils [req-16ae6f28-0cc4-4ac7-8b86-85611d5ac019 - - - - -] key_repository is world readable: /etc/keystone/fernet-keys/: NeedRegenerationException
2019-04-04 16:21:37.824104 2019-04-04 16:21:37,823.823 10 WARNING keystone.common.fernet_utils [req-5c4e7de0-1379-48cb-9b98-9f696ea42662 46f26a09dbaa48e4a89e3be5e5e95019 788a279244f44f39bdff6cb636798ea7 - 82e9626778234640b09d12423097edb9 82e9626778234640b09d12423097edb9] key_repository is world readable: /etc/keystone/fernet-keys/: NeedRegenerationException
128.224.151.193 - - [04/Apr/2019:16:20:40 +0000] "GET / HTTP/1.1" 300 267 "-" "kube-probe/1.12"
128.224.151.193 - - [04/Apr/2019:16:20:50 +0000] "GET / HTTP/1.1" 300 267 "-" "kube-probe/1.12"

It is very suspicious that one of the logs (the POST) has a timestamp almost 9 minutes earlier than the surrounding logs. I wonder if this POST request had hung and was somehow blocking the readiness probes.

Test Activity
-------------
Developer Testing

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; serious issue, but not seen frequently. Needs further investigation as it's a robustness issue.

tags: added: stx.containers
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.2.0
Changed in starlingx:
assignee: nobody → Tao Liu (tliu88)
Revision history for this message
Bill Zvonar (billzvonar) wrote :

Assigned to Bruce for re-assignment.

Changed in starlingx:
assignee: Tao Liu (tliu88) → Bruce Jones (brucej)
Bruce Jones (brucej)
Changed in starlingx:
assignee: Bruce Jones (brucej) → Cindy Xie (xxie1)
Revision history for this message
yong hu (yhu6) wrote :

The issue was observed from an old dev-build:
 - BUILD_HOST="yow-bwensley-lx-vm2"
 - BUILD_DATE="2019-03-20 08:05:38 -0500"

@Bart, please update whether or not we can still see this kind of issue.

tags: added: stx.retestneeded
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Bart Wensley (bartwensley)
Revision history for this message
Bart Wensley (bartwensley) wrote :

Yong - as I said in the description, the problem only happened in about 10% of active controller reboots. I have not been testing this recently, so I don't know if the problem still occurs. I suggest that you configure a 2+2 system and then attempt to reproduce this yourself.

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: Bart Wensley (bartwensley) → yong hu (yhu6)
Revision history for this message
yong hu (yhu6) wrote :

With the 0718 CENGN build, I tried the test (brutally restarting the active controller) over 15 times on a multi-node (2+2) system, and this issue was NOT seen once. The restart process for the pods also makes sense (the transitions are captured in the listings below, and a watch-command sketch follows them):
step #1: terminate the pod that was running on the rebooted active controller. In the test, it was pod keystone-api-5db97d877d-9zkbk.
step #2: generate another pod and leave it pending; here it was pod keystone-api-5db97d877d-7v6nd.
step #3: bring up the pod "keystone-api-5db97d877d-7v6nd" once the rebooted controller (which became the standby) was ready.

keystone-api-5db97d877d-5pwrx 1/1 Running 0 20m
keystone-api-5db97d877d-7v6nd 0/1 Pending 0 4m55s
keystone-api-5db97d877d-9zkbk 0/1 Terminating 0 87m

keystone-api-5db97d877d-5pwrx 1/1 Running 0 20m
keystone-api-5db97d877d-7v6nd 0/1 Pending 0 5m5s
keystone-api-5db97d877d-9zkbk 0/1 Terminating 0 87m

keystone-api-5db97d877d-5pwrx 1/1 Running 0 20m
keystone-api-5db97d877d-7v6nd 0/1 Pending 0 5m16s
keystone-api-5db97d877d-9zkbk 0/1 Terminating 0 88m

keystone-api-5db97d877d-5pwrx 1/1 Running 0 20m
keystone-api-5db97d877d-7v6nd 0/1 Pending 0 5m26s
keystone-api-5db97d877d-9zkbk 0/1 Terminating 0 88m

keystone-api-5db97d877d-5pwrx 1/1 Running 0 21m
keystone-api-5db97d877d-7v6nd 0/1 Pending 0 5m36s

keystone-api-5db97d877d-5pwrx 1/1 Running 0 21m
keystone-api-5db97d877d-7v6nd 0/1 Pending 0 5m46s

keystone-api-5db97d877d-5pwrx 1/1 Running 0 22m
keystone-api-5db97d877d-7v6nd 1/1 Running 0 6m59s
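
The Terminating → Pending → Running transitions above can also be observed live with a label-selector watch; the application=keystone label is taken from the pod labels in the describe output earlier, the rest is a generic sketch:

# Watch only the keystone pods while the controller reboots.
kubectl -n openstack get pods -l application=keystone -o wide -w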

[sysadmin@controller-1 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | compute-0 | worker | unlocked | enabled | available |
| 4 | compute-1 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

[sysadmin@controller-1 ~(keystone_admin)]$ date
Mon Jul 22 15:38:23 UTC 2019

[sysadmin@controller-1 ~(keystone_admin)]$ cat /etc/build.info | grep BUILD_ID
BUILD_ID="20190718T013000Z"

Additionally, in my test setup with 30+GB for controllers, it norm...


Revision history for this message
yong hu (yhu6) wrote :

Kept running the test over 30 times, and the 2 controllers ran steadily with the expected correct behavior, on BUILD_ID="20190718T013000Z".

Revision history for this message
yong hu (yhu6) wrote :

With the recent 0802 CENGN build, the same test was performed with 27+ reboots across both controllers and the system worked well. See the attachments.

In addition, after each reboot it did take time (about 8~10 minutes) for the OpenStack pods to recover.
The keystone-api pods were not the ones that took the longest; the mariadb-server pods took the longest to recover, and the nova pods also took longer than most.

So, if *every time* we wait long enough for all OpenStack services/pods to return to a normal running state, the system works normally and is ready for the next round of reboot testing without negative impacts.
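
A minimal sketch of such a wait loop (not part of the original test; it simply polls kubectl until every openstack pod is either Completed or fully Ready):

# Poll until every openstack pod is Completed or fully Ready (READY column x/x).
# Note: a pod can be Running but not Ready, as in this bug, so checking the
# READY column is stricter than checking STATUS alone.
while kubectl -n openstack get pods --no-headers \
      | awk '{ split($2, r, "/"); if ($3 != "Completed" && r[1] != r[2]) print }' \
      | grep -q . ; do
  echo "waiting for openstack pods to recover..."
  sleep 30
done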

Revision history for this message
yong hu (yhu6) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
Revision history for this message
yong hu (yhu6) wrote :

I will run another round of stress testing with the RC branch. If it still functions correctly, I will close this issue, on the rationale that the recent enhancement patches for "application-apply"/"-reapply" are taking good effect.

The original issue was not specific to the keystone-api pod; rather, other pods such as mariadb-server were failing to run.

Revision history for this message
yong hu (yhu6) wrote :

After running the reboot-active-controller test over 20 times, the K8S pods still kept running normally.
See the attachments (screenshots).

So I will close this LP, with the belief that the fixes for a few corner cases did bring improvements to StarlingX reliability.

Revision history for this message
yong hu (yhu6) wrote :
Revision history for this message
yong hu (yhu6) wrote :
Changed in starlingx:
status: Triaged → Fix Released
Revision history for this message
yong hu (yhu6) wrote :

One more thing: the testing was done on the 2.0 release candidate image.

Revision history for this message
Jose Perez Carranza (jgperezc) wrote :

Tested on a build with Train; issue NOT reproduced.

###
### StarlingX
### Built from master
###

OS="centos"
SW_VERSION="19.09"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20191122T023000Z"

JOB="STX_build_master_master"
BUILD_BY="<email address hidden>"
BUILD_NUMBER="327"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-11-22 02:30:00 +0000"

Revision history for this message
Ghada Khalil (gkhalil) wrote :

LP was verified a long time ago; removing the stx.retestneeded label

tags: removed: stx.retestneeded