Standby controller rebooted and became available, but many pods remain not-ready/unreachable

Bug #1840688 reported by Anujeyan Manokeran
This bug affects 2 people
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Alexander Kozyrev

Bug Description

Brief Description
-----------------
Four minutes after the standby controller rebooted and became available/online again, many pods were still in Pending status because the node.kubernetes.io not-ready/unreachable taints were still in place. The detailed pod status output captured after the standby controller became available is shown below.
[2019-08-18 05:52:32,829] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2019-08-18 05:52:34,457] 423 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-0 | worker | unlocked | enabled | available |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 4 | compute-2 | worker | unlocked | enabled | available |
| 5 | compute-3 | worker | unlocked | enabled | available |
| 6 | compute-4 | worker | unlocked | enabled | available |
| 7 | controller-1 | controller | unlocked | enabled | available |
| 8 | storage-0 | storage | unlocked | enabled | available |
| 9 | storage-1 | storage | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
+---------------------+--------------------------------+-------------------------------+--------------------+---------+-----------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+--------------------------------+-------------------------------+--------------------+---------+-----------+
| platform-integ-apps | 1.0-7 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-17-centos-stable-versioned | armada-manifest | stx-openstack.yaml | applied | completed |
+---------------------+--------------------------------+-------------------------------+--------------------+---------+-----------+
 [2019-08-18 05:55:00,088] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-08-18 05:55:00,088] 301 DEBUG MainThread ssh.send :: Send 'kubectl get pod --field-selector=status.phase!=Running,status.phase!=Succeeded --all-namespaces -o=wide'
[2019-08-18 05:55:00,323] 423 DEBUG MainThread ssh.expect :: Output:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system coredns-5c4849b47c-v2tml 0/1 Pending 0 7m46s <none> controller-1 <none> <none>
kube-system ingress-6c47p 0/1 Init:0/1 0 2m34s 192.168.222.4 controller-1 <none> <none>
kube-system ingress-error-pages-84d44558cf-t5n5q 0/1 Pending 0 7m46s <none> controller-1 <none> <none>
kube-system kube-multus-ds-amd64-phfhk 0/1 Pending 0 55s <none> controller-1 <none> <none>
kube-system rbd-provisioner-557fcf8c7d-r2dzp 0/1 Pending 0 7m45s <none> controller-1 <none> <none>
openstack cinder-api-69bbb7fcdb-vvjc2 0/1 Pending 0 7m46s <none> controller-1 <none> <none>
openstack cinder-backup-6459859855-tcrnq 0/1 Init:0/4 0 7m46s <none> controller-1 <none> <none>
openstack cinder-scheduler-77758958d7-jcmmq 0/1 Init:0/2 0 7m46s <none> controller-1 <none> <none>
openstack cinder-volume-758b9d8d98-5gp2v 0/1 Pending 0 7m46s <none> controller-1 <none> <none>
openstack glance-api-5b6dc8869b-x7zdb 0/1 Init:0/3 0 7m46s <none> controller-1 <none> <none>
openstack heat-api-5955d8c6bb-vwdzm 0/1 Init:0/1 0 7m46s <none> controller-1 <none> <none>
openstack ingress-error-pages-69cb7f954c-t24rh 0/1 Pending 0 7m46s <none> controller-1 <none> <none>
openstack keystone-api-79f86f86cf-g5lzl 0/1 Pending 0 7m46s <none> controller-1 <none> <none>
openstack neutron-server-54667f9684-dgrbc 0/1 Pending 0 7m46s <none> controller-1 <none> <none>
openstack nova-api-metadata-7c769f9986-94tmx 0/1 Pending 0 7m44s <none> controller-1 <none> <none>
openstack nova-api-osapi-77fcdbf4b5-rm5xw 0/1 Pending 0 7m46s <none> controller-1 <none> <none>
openstack nova-api-proxy-b84549cf-2w4j9 0/1 Init:0/1 0 7m46s <none> controller-1 <none> <none>
openstack nova-conductor-6dfd99bb6-dcs44 0/1 Pending 0 7m44s <none> controller-1 <none> <none>
openstack nova-novncproxy-75bf9b895d-6cv4v 0/1 Pending 0 7m44s <none> controller-1 <none> <none>
openstack nova-scheduler-5897b7f7dc-f4zwk 0/1 Pending 0 7m46s <none> controller-1 <none> <none>
openstack osh-openstack-rabbitmq-rabbitmq-1 0/1 Pending 0 101s <none> controller-1 <none> <none>
openstack placement-api-7f9985946f-z9xgx 0/1 Pending 0 7m45s <none> controller-1 <none> <none>
[sysadmin@controller-0 ~(keystone_admin)]$
[2019-08-18 05:55:00,323] 301 DEBUG MainThread ssh.send :: Send 'echo $?'
[2019-08-18 05:55:00,426] 423 DEBUG MainThread ssh.expect :: Output:
0
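
The pods listed above are gated on the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints that Kubernetes applies to a node while it is down and removes once the node reports Ready again. As a minimal check sketch (not part of the original test logs; the node name is taken from the pod listing above), the taint and readiness state of the rebooted controller could be inspected directly:

# Hypothetical verification commands, not from the original report
kubectl get node controller-1 -o jsonpath='{.spec.taints}'
kubectl describe node controller-1 | grep -E 'Taints|Ready'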

Severity
--------
Major

Steps to Reproduce
------------------
1. Make sure the system is installed and in good health, with no alarms.
2. Force reboot the standby controller.
3. Check pod status (a command-level sketch follows below).
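
A minimal command-level sketch of these steps, assuming ssh access from the active controller and the same admin credentials shown in the logs above (the exact reboot method used in the test is not recorded in this report):

# Hypothetical reproduction sketch; hostnames match the host-list output above
fm alarm-list                                  # confirm there are no active alarms
ssh sysadmin@controller-1 'sudo reboot -f'     # force reboot the standby controller
system host-list                               # wait until controller-1 is unlocked/enabled/available
kubectl get pod --all-namespaces -o wide \
    --field-selector=status.phase!=Running,status.phase!=Succeeded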

Expected Behavior
------------------
All pods recover shortly after the standby controller becomes available.

Actual Behavior
----------------
Many pods remain in Pending status.

Reproducibility
---------------
Seen once

System Configuration
--------------------
Multi-node system
Lab-name: WCP_113-121

Branch/Pull Time/Commit
-----------------------
2019-08-16_20-59-00

Last Pass
---------
2019-08-09_20-59-00

Timestamp/Logs
--------------
[2019-08-18 05:52:34,561]

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Appears to be reporting the same issue as: https://bugs.launchpad.net/starlingx/+bug/1836787

tags: added: stx.containers
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 given the issue is intermittent. Expect that the pods eventually recover, but they just take longer.

tags: added: stx.3.0
Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Alex Kozyrev (akozyrev)
Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :
Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
Alexander Kozyrev (akozyrev) wrote :

Unable to reproduce. Have you seen this issue recently?

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Closing as the issue has not been reproduced recently. There has also been a Kubernetes (k8s) upversion, so the suggestion is to re-test and provide new data if the issue is seen again.

Changed in starlingx:
status: Triaged → Invalid
Yang Liu (yliu12)
tags: removed: stx.retestneeded