After doing a second review of this LP with Frank, here is the debug history we had for another scenario.
The system setup is:
[sysadmin@controller-0 ~(keystone_admin)]$ cat /etc/build.info
###
### StarlingX
### Release 19.09
###
SW_VERSION="19.09"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2019-08-21_20-18-00"
SRC_BUILD_ID="433"
JOB="StarlingX_Upstream_build"
BUILD_BY="jenkins"
BUILD_NUMBER="433"
BUILD_HOST="yow-cgts1-lx"
BUILD_DATE="2019-08-21 20:20:12 -0400"
Controller-1 is active and controller-0 is standby.
We check the reboot time:
reboot -f:
Thu Aug 29 13:54:50 UTC 2019
openstack command started working at:
Thu Aug 29 13:55:57 UTC 2019
In this case it takes 1 minute 7 seconds from the moment we reboot to the moment all the openstack services are ready.
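For reference, a minimal sketch of how this window can be measured; the endpoint probe and the 2-second poll interval are assumptions, not the exact commands used above:
# Hedged sketch: timestamp the forced reboot on the active controller,
# then poll from the surviving controller until openstack answers.
# The 'openstack endpoint list' probe and 2s interval are assumptions.
date -u; sudo reboot -f          # on the active controller
# ... then on the surviving controller:
while ! openstack endpoint list >/dev/null 2>&1; do sleep 2; done
date -u                          # all openstack services ready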
Then we analyze the logs from the SM:
sm-customer.log:
| 2019-08-29T13:54:54.038 | 406 | neighbor-scn      | controller-1        | full      | down   | Neighbor dead
| 2019-08-29T13:55:05.524 | 539 | service-group-scn | controller-services | go-active | active |
SM takes 15 seconds to bring up all services on the newly active controller (controller-0).
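These state-change records can be pulled straight from the log on a running system (a sketch; the path matches the log quoted above):
# Hedged sketch: extract the neighbor and service-group transitions
# from the SM customer log.
grep -E 'neighbor-scn|service-group-scn' /var/log/sm-customer.log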
The next level of debug is k8s:
# kube-controller-manager was leader on controller-0 before the reboot
We found that it took 20s to declare controller-1 as NodeNotReady (--node-monitor-grace-period=20s)
I0829 13:55:09.664106 1 event.go:221] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"controller-1", UID:"8b224632-c8f7-11e9-b005-90e2ba39b2fc", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node controller-1 status is now: NodeNotReady
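To confirm which detection timeouts are actually in effect, they can be read off the running kube-controller-manager pod (a sketch; the component label is an assumption about how the static pod is labelled):
# Hedged sketch: read the active timeouts from the pod spec.
kubectl -n kube-system get pods -l component=kube-controller-manager -o yaml \
    | grep -E 'node-monitor-grace-period|pod-eviction-timeout'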
One hypothesis that Bart had is that since memcached cannot be active on both controllers, it takes time to detect the failure and bring it up on the newly active controller:
# memcached pod is deleted (was running on controller-1) (--pod-eviction-timeout=30s)
I0829 13:55:44.058062 1 taint_manager.go:102] NoExecuteTaintManager is deleting Pod: openstack/osh-openstack-memcached-memcached-b56979599-sqlmc
# memcached pod is created on controller-0
I0829 13:55:45.368259 1 event.go:221] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"openstack", Name:"osh-openstack-memcached-memcached-b56979599", UID:"74939a3f-c8fc-11e9-b005-90e2ba39b2fc", APIVersion:"apps/v1", ResourceVersion:"568559", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: osh-openstack-memcached-memcached-b56979599-sjdf7
# memcached init container
Started: Thu, 29 Aug 2019 13:55:52 +0000
Finished: Thu, 29 Aug 2019 13:55:52 +0000
# memcached main container
Started: Thu, 29 Aug 2019 13:55:53 +0000
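This eviction/reschedule timeline can be rebuilt on a live system from the namespace events (a sketch; the grep just narrows the output to the memcached pods named above):
# Hedged sketch: rebuild the pod eviction/reschedule timeline.
kubectl -n openstack get events --sort-by=.lastTimestamp | grep memcached
# Confirm which node the replacement pod landed on:
kubectl -n openstack get pods -o wide | grep memcached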
The next steps:
1) Investigate much more about memcached: https://www.linuxjournal.com/article/7451
2) Try to get a graphical tool to see the logs, something like: http://www.bootchart.org/images/bootchart.initial.png (see the sketch after this list)
3) Have fun!
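For step 2, until a proper tool is in place, a minimal sketch that turns the sm-customer.log entries into a plottable timestamp,event file; the pipe-separated field positions are an assumption based on the lines quoted above:
# Hedged sketch for step 2: timestamp + event type as CSV for plotting.
# Field positions ($2 = timestamp, $4 = scn type) are assumptions.
awk -F'|' '/-scn/ {gsub(/ /,"",$2); gsub(/ /,"",$4); print $2","$4}' \
    /var/log/sm-customer.log > sm-timeline.csv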