After doing a second review of this LP with Frank, here is the debug history we had for another scenario.
The system setup is:
[sysadmin@controller-0 ~(keystone_admin)]$ cat /etc/build.info
###
### StarlingX
### Release 19.09
###
SW_VERSION="19.09"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2019-08-21_20-18-00"
SRC_BUILD_ID="433"
JOB="StarlingX_Upstream_build"
BUILD_BY="jenkins"
BUILD_NUMBER="433"
BUILD_HOST="yow-cgts1-lx"
BUILD_DATE="2019-08-21 20:20:12 -0400"
Controller-1 is active and controller-0 is standby.
We check the reboot time:
reboot -f:
Thu Aug 29 13:54:50 UTC 2019
openstack command started working at:
Thu Aug 29 13:55:57 UTC 2019
In this case it takes 1 minute 7 seconds from the moment we reboot to the moment all the openstack services are ready.
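For reference, a minimal sketch of how this window can be measured; the endpoint probe and the 2-second poll interval are assumptions, not the exact commands used above:
# Hedged sketch: timestamp the forced reboot on the active controller,
# then poll from the surviving controller until openstack answers.
# The 'openstack endpoint list' probe and 2s interval are assumptions.
date -u; sudo reboot -f          # on the active controller
# ... then on the surviving controller:
while ! openstack endpoint list >/dev/null 2>&1; do sleep 2; done
date -u                          # all openstack services ready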
Then we analyze the logs from the SM:
sm-customer.log:
| 2019-08-29T13:54:54.038 | 406 | neighbor-scn      | controller-1        | full      | down   | Neighbor dead
| 2019-08-29T13:55:05.524 | 539 | service-group-scn | controller-services | go-active | active |
SM takes 15 seconds to bring up all services on the newly active controller (controller-0).
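These state-change records can be pulled straight from the log on a running system (a sketch; the path matches the log quoted above):
# Hedged sketch: extract the neighbor and service-group transitions
# from the SM customer log.
grep -E 'neighbor-scn|service-group-scn' /var/log/sm-customer.log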
The next level of debug is k8s:
# kube-controller-manager was leader on controller-0 before the reboot
We found that it took 20s to declare controller-1 as NodeNotReady (--node-monitor-grace-period=20s)
I0829 13:55:09.664106 1 event.go:221] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"controller-1", UID:"8b224632-c8f7-11e9-b005-90e2ba39b2fc", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node controller-1 status is now: NodeNotReady
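To confirm which detection timeouts are actually in effect, they can be read off the running kube-controller-manager pod (a sketch; the component label is an assumption about how the static pod is labelled):
# Hedged sketch: read the active timeouts from the pod spec.
kubectl -n kube-system get pods -l component=kube-controller-manager -o yaml \
    | grep -E 'node-monitor-grace-period|pod-eviction-timeout'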
One hypothesis that Bart had is that since memcached cannot be active on both controllers, it takes time to detect the failure and bring it up on the newly active controller:
# memcached pod is deleted (was running on controller-1) (--pod-eviction-timeout=30s)
I0829 13:55:44.058062 1 taint_manager.go:102] NoExecuteTaintManager is deleting Pod: openstack/osh-openstack-memcached-memcached-b56979599-sqlmc
# memcached pod is created on controller-0
I0829 13:55:45.368259 1 event.go:221] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"openstack", Name:"osh-openstack-memcached-memcached-b56979599", UID:"74939a3f-c8fc-11e9-b005-90e2ba39b2fc", APIVersion:"apps/v1", ResourceVersion:"568559", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: osh-openstack-memcached-memcached-b56979599-sjdf7
# memcached init container
Started: Thu, 29 Aug 2019 13:55:52 +0000
Finished: Thu, 29 Aug 2019 13:55:52 +0000
# memcached main container
Started: Thu, 29 Aug 2019 13:55:53 +0000
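This eviction/reschedule timeline can be rebuilt on a live system from the namespace events (a sketch; the grep just narrows the output to the memcached pods named above):
# Hedged sketch: rebuild the pod eviction/reschedule timeline.
kubectl -n openstack get events --sort-by=.lastTimestamp | grep memcached
# Confirm which node the replacement pod landed on:
kubectl -n openstack get pods -o wide | grep memcached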
The next steps:
1) Investigate much more about memcached: https://www.linuxjournal.com/article/7451
2) Try to get a graphical tool to see the logs, something like: http://www.bootchart.org/images/bootchart.initial.png (see the sketch after this list)
3) Have fun!
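For step 2, until a proper tool is in place, a minimal sketch that turns the sm-customer.log entries into a plottable timestamp,event file; the pipe-separated field positions are an assumption based on the lines quoted above:
# Hedged sketch for step 2: timestamp + event type as CSV for plotting.
# Field positions ($2 = timestamp, $4 = scn type) are assumptions.
awk -F'|' '/-scn/ {gsub(/ /,"",$2); gsub(/ /,"",$4); print $2","$4}' \
    /var/log/sm-customer.log > sm-timeline.csv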