Standby controller rebooted and became available, but many pods remain not-ready/unreachable

Bug #1840688 reported by Anujeyan Manokeran
This bug affects 2 people
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Alexander Kozyrev

Bug Description

Brief Description
-----------------
Four minutes after the standby controller rebooted and became available/online again, many pods were still in Pending status because the node.kubernetes.io not-ready/unreachable taints were still in place. The detailed pod status output captured after the standby controller became available is shown below.
[2019-08-18 05:52:32,829] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2019-08-18 05:52:34,457] 423 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-0 | worker | unlocked | enabled | available |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 4 | compute-2 | worker | unlocked | enabled | available |
| 5 | compute-3 | worker | unlocked | enabled | available |
| 6 | compute-4 | worker | unlocked | enabled | available |
| 7 | controller-1 | controller | unlocked | enabled | available |
| 8 | storage-0 | storage | unlocked | enabled | available |
| 9 | storage-1 | storage | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
+---------------------+--------------------------------+-------------------------------+--------------------+---------+-----------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+--------------------------------+-------------------------------+--------------------+---------+-----------+
| platform-integ-apps | 1.0-7 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-17-centos-stable-versioned | armada-manifest | stx-openstack.yaml | applied | completed |
+---------------------+--------------------------------+-------------------------------+--------------------+---------+-----------+
 [2019-08-18 05:55:00,088] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-08-18 05:55:00,088] 301 DEBUG MainThread ssh.send :: Send 'kubectl get pod --field-selector=status.phase!=Running,status.phase!=Succeeded --all-namespaces -o=wide'
[2019-08-18 05:55:00,323] 423 DEBUG MainThread ssh.expect :: Output:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system coredns-5c4849b47c-v2tml 0/1 Pending 0 7m46s <none> controller-1 <none> <none>
kube-system ingress-6c47p 0/1 Init:0/1 0 2m34s 192.168.222.4 controller-1 <none> <none>
kube-system ingress-error-pages-84d44558cf-t5n5q 0/1 Pending 0 7m46s <none> controller-1 <none> <none>
kube-system kube-multus-ds-amd64-phfhk 0/1 Pending 0 55s <none> controller-1 <none> <none>
kube-system rbd-provisioner-557fcf8c7d-r2dzp 0/1 Pending 0 7m45s <none> controller-1 <none> <none>
openstack cinder-api-69bbb7fcdb-vvjc2 0/1 Pending 0 7m46s <none> controller-1 <none> <none>
openstack cinder-backup-6459859855-tcrnq 0/1 Init:0/4 0 7m46s <none> controller-1 <none> <none>
openstack cinder-scheduler-77758958d7-jcmmq 0/1 Init:0/2 0 7m46s <none> controller-1 <none> <none>
openstack cinder-volume-758b9d8d98-5gp2v 0/1 Pending 0 7m46s <none> controller-1 <none> <none>
openstack glance-api-5b6dc8869b-x7zdb 0/1 Init:0/3 0 7m46s <none> controller-1 <none> <none>
openstack heat-api-5955d8c6bb-vwdzm 0/1 Init:0/1 0 7m46s <none> controller-1 <none> <none>
openstack ingress-error-pages-69cb7f954c-t24rh 0/1 Pending 0 7m46s <none> controller-1 <none> <none>
openstack keystone-api-79f86f86cf-g5lzl 0/1 Pending 0 7m46s <none> controller-1 <none> <none>
openstack neutron-server-54667f9684-dgrbc 0/1 Pending 0 7m46s <none> controller-1 <none> <none>
openstack nova-api-metadata-7c769f9986-94tmx 0/1 Pending 0 7m44s <none> controller-1 <none> <none>
openstack nova-api-osapi-77fcdbf4b5-rm5xw 0/1 Pending 0 7m46s <none> controller-1 <none> <none>
openstack nova-api-proxy-b84549cf-2w4j9 0/1 Init:0/1 0 7m46s <none> controller-1 <none> <none>
openstack nova-conductor-6dfd99bb6-dcs44 0/1 Pending 0 7m44s <none> controller-1 <none> <none>
openstack nova-novncproxy-75bf9b895d-6cv4v 0/1 Pending 0 7m44s <none> controller-1 <none> <none>
openstack nova-scheduler-5897b7f7dc-f4zwk 0/1 Pending 0 7m46s <none> controller-1 <none> <none>
openstack osh-openstack-rabbitmq-rabbitmq-1 0/1 Pending 0 101s <none> controller-1 <none> <none>
openstack placement-api-7f9985946f-z9xgx 0/1 Pending 0 7m45s <none> controller-1 <none> <none>
[sysadmin@controller-0 ~(keystone_admin)]$
[2019-08-18 05:55:00,323] 301 DEBUG MainThread ssh.send :: Send 'echo $?'
[2019-08-18 05:55:00,426] 423 DEBUG MainThread ssh.expect :: Output:
0
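
The pods listed above are gated on the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints that Kubernetes applies to a node while it is down and removes once the node reports Ready again. As a minimal check sketch (not part of the original test logs; the node name is taken from the pod listing above), the taint and readiness state of the rebooted controller could be inspected directly:

# Hypothetical verification commands, not from the original report
kubectl get node controller-1 -o jsonpath='{.spec.taints}'
kubectl describe node controller-1 | grep -E 'Taints|Ready'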

Severity
--------
Major

Steps to Reproduce
------------------
1. Make sure the system is installed and in good health, with no alarms.
2. Force reboot the standby controller.
3. Check pod status (a command-level sketch follows below).
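
A minimal command-level sketch of these steps, assuming ssh access from the active controller and the same admin credentials shown in the logs above (the exact reboot method used in the test is not recorded in this report):

# Hypothetical reproduction sketch; hostnames match the host-list output above
fm alarm-list                                  # confirm there are no active alarms
ssh sysadmin@controller-1 'sudo reboot -f'     # force reboot the standby controller
system host-list                               # wait until controller-1 is unlocked/enabled/available
kubectl get pod --all-namespaces -o wide \
    --field-selector=status.phase!=Running,status.phase!=Succeeded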

Expected Behavior
------------------
All pods recover shortly after the standby controller becomes available.

Actual Behavior
----------------
Many pods remain in Pending status.

Reproducibility
---------------
Seen once

System Configuration
--------------------
Multi-node system
Lab-name: WCP_113-121

Branch/Pull Time/Commit
-----------------------
2019-08-16_20-59-00

Last Pass
---------
2019-08-09_20-59-00

Timestamp/Logs
--------------
[2019-08-18 05:52:34,561]

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Appears to be reporting the same issue as: https://bugs.launchpad.net/starlingx/+bug/1836787

tags: added: stx.containers
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 given the issue is intermittent. Expect that the pods eventually recover, but they just take longer.

tags: added: stx.3.0
Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Alex Kozyrev (akozyrev)
Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :
Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
Alexander Kozyrev (akozyrev) wrote :

Unable to reproduce. Have you seen this issue recently?

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Closing as the issue has not been reproduced recently. There has also been a Kubernetes (k8s) upversion, so the suggestion is to re-test and provide new data if the issue is seen again.

Changed in starlingx:
status: Triaged → Invalid
Yang Liu (yliu12)
tags: removed: stx.retestneeded