Uncontrolled swact on standard system is slow

Bug #1838411 reported by Yang Liu on 2019-07-30
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Medium
Bart Wensley

Bug Description

Brief Description
-----------------
It takes about 1.5 minutes for 'openstack server list' to recover after a uncontrolled swact on a standard system

Severity
--------
Major

Steps to Reproduce
------------------
Test pre-conditions:
- ensure stx-openstack application is applied via system application-list
- helm list –a shows all releases (eg: garbd, cinder, nova, etc) are in status = DEPLOYED state
- kubectl get pods -n openstack -o wide -w |grep -iEv "running|completed|cinder-volume-usage-audit|heat-engine-cleaner" shows no pods
- openstack server list --all-projects command works

Test steps:
# On active controller (it was controller-0 for the test):
$ sudo su
$ date;reboot -f
Fri Jul 26 15:38:56 UTC 2019
# On controller-1:
$ while true; do date; timeout 10s openstack server list --all-projects; done
Fri Jul 26 15:39:33 UTC 2019
Fri Jul 26 15:39:43 UTC 2019
Fri Jul 26 15:39:53 UTC 2019
Fri Jul 26 15:40:03 UTC 2019
Fri Jul 26 15:40:13 UTC 2019
Fri Jul 26 15:40:23 UTC 2019
+--------------------------------------+-----------------+--------+-----------------------------------------------------------------------------------------+-------+--------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+-----------------+--------+-----------------------------------------------------------------------------------------+-------+--------+
| d1aa58d2-7238-4781-a7d3-aef712222980 | tenant1-virtio1 | ACTIVE | internal0-net0-2=10.2.2.207; tenant1-mgmt-net=192.168.116.53; tenant1-net2=172.16.2.220 | | small |
+--------------------------------------+-----------------+--------+-----------------------------------------------------------------------------------------+-------+--------+
Fri Jul 26 15:40:25 UTC 2019

TC-name:
KPI - uncontrolled swact

Expected Behavior
------------------
Less than 1 minute

Actual Behavior
----------------
Total time for 'openstack server list' to recover after uncontrolled swact: ~1m29s

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Standard system
Lab-name:
ironpass-33-36

Branch/Pull Time/Commit
-----------------------
stx master as of 20190725T013000Z

Last Pass
---------
STX R1

Timestamp/Logs
--------------
See Test Steps for timestamps

Test Activity
-------------
Regression Testing

Yang Liu (yliu12) wrote :
description: updated
tags: added: stx.regression
Numan Waheed (nwaheed) on 2019-07-31
tags: added: stx.retestneeded
Ghada Khalil (gkhalil) wrote :

Marking as stx.2.0 for now / medium priority given the swact does complete although it's slow.

tags: added: stx.2.0 stx.containers
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Tao Liu (tliu88)
Frank Miller (sensfan22) on 2019-08-01
Changed in starlingx:
assignee: Tao Liu (tliu88) → Daniel Badea (daniel.badea)
Daniel Badea (daniel.badea) wrote :

The test procedure was:
1. run openstack service list in a loop
2. run system host-swact $(hostname)
3. monitor: kubernetes pods, openstack ingress docker logs, DNS availability,
   sm-customer.log

What I found so far:
1. controlled swact causes a 20-25s disruption in openstack rest api
   availability. Here's an example (with comments):
   +00.000s controller-1 status: active go-standby
   +00.653s keystone.openstack.svc.cluster.local status: no DNS (rest api
            unavailabile: host name does not resolve to an ip address)
   +02.523s controller-1 status: go-standby disabling
   +14.108s controller-1 status: disabling disabled
   +14.416s controller-0 status: standby go-active
   +17.625s controller-1 status: disabled standby
   +19.134s controller-1 status: standby standby-degraded
   +21.407s controller-1 status: standby-degraded standby
   +21.810s I0806 38 leaderelection.go:205] attempting to acquire leader lease
            openstack/osh-openstack-ingress-nginx...
            (ingress pod fails leader election)
   +21.810s I0806 38 leaderelection.go:249] failed to renew lease
            openstack/osh-openstack-ingress-nginx: failed to tryAcquireOrRenew
            context deadline exceeded
            (ingress pod fails leader election)
   +21.862s keystone.openstack.svc.cluster.local status: 10.100.157.117
            (dns available)
   +21.872s 10.100.157.117 http status: UP
   +21.880s openstack server list: OK (rest api available)
   +22.127s 192.168.206.3 - [192.168.206.3] "GET /v3 HTTP/1.1" 200 222 "-"
            "openstacksdk/0.25.0 keystoneauth1/0.0.0 python-requests/2.21.0
            CPython/2.7.5" 236 0.003 [openstack-keystone-api-ks-pub]
            172.16.193.49:5000 271 0.002 200 d8d497e2622420863667621c52607d23
            (ingress forwards rest api request to keystone)
Note:
  Time stamps related to rest api and dns are captured when the command is
  issued, not when it returns. This explains why apparently "openstack server
  list" is OK before the request is handled by ingress container.

2. kubernetes is not running in HA configuration:
   a. there is only one etcd process. Its data directory is drbd synchronized
      between controllers
   b. etcd process is stopped on the source controller and then started on
      the target controller after swact
   c. kube-controller-manager is restarted on the controller that becomes active
   d. kube-scheduler is restarted on the controller that becomes standby

3. openstack ingress containers are not restarted but they are using leader
   election and are impacted by kubernetes cluster glitch caused by swact

My guess for what happens in a controlled swact with an openstack
rest api request:
1. dns is unavailable while resolver floating ip moves between controllers
2. calico networking is already setup on the controller that becomes active
   (no delay caused by container/pod/cluster networking setup)
3. rest api request reaches openstack ingress container but it's not serviced
   because pod detects leader election issue
4. when kubernetes-controller-manager is back ingress resumes forwarding
   requests to openstack keystone and rest api is successful

Daniel Badea (daniel.badea) wrote :

In the next experiment I increased etcd cluster to 3 nodes:
one on controllers (floating) and one on each compute
(this is a 2+2 setup).

With a 3 node etcd cluster in place when changing active
controller from controller-1 to controller-0:

  2019-08-08T23:44:41.228 keystone.openstack.svc.cluster.local status: no DNS
  2019-08-08T23:45:02.547 keystone.openstack.svc.cluster.local status: 10.101.173.208
  2019-08-08T23:45:03.658 10.101.173.208 http status: UP
  2019-08-08T23:45:03.665 openstack server list: OK

it takes ~23 seconds to recover. Etcd nodes installed on
computes report:

  2019-08-08 23:44:40.859360 I | rafthttp: peer 8e9e05c52164694d became inactive
  2019-08-08 23:44:57.875170 I | rafthttp: peer 8e9e05c52164694d became active

that's 17 seconds for etcd to restart on target controller.

controller-1 sm-customer log reports:

  2019-08-08T23:44:38.103 | 334 | node-scn | controller-1 | | swact | issued against host controller-1
  2019-08-08T23:44:40.477 | 393 | service-group-scn | oam-services | go-standby | standby |
  2019-08-08T23:44:56.610 | 529 | service-group-scn | oam-services | go-active | active |

controller-0 sm-customer log reports:

  2019-08-08T23:44:38.104 | 441 | node-scn | controller-0 | | swact | issued against host controller-1 |
  2019-08-08T23:44:55.121 | 442 | service-group-scn | vim-services | standby | go-active | |

with a 15 seconds pause before it goes active that can be partially
explained by looking at drbd activity in kern.log (lines filtered):

  2019-08-08T23:44:44.527 drbd drbd-dockerdistribution: conn( TearDown -> Unconnected )
  2019-08-08T23:44:44.594 drbd drbd-etcd: conn( TearDown -> Unconnected )
  2019-08-08T23:44:44.661 drbd drbd-extension: conn( TearDown -> Unconnected )
  2019-08-08T23:44:45.531 drbd drbd-cgcs: conn( TearDown -> Unconnected )
  2019-08-08T23:44:51.553 drbd drbd-platform: conn( TearDown -> Unconnected )
  2019-08-08T23:44:52.055 drbd drbd-pgsql: conn( TearDown -> Unconnected )
  2019-08-08T23:44:54.072 drbd drbd-rabbit: conn( TearDown -> Unconnected )
  2019-08-08T23:44:55.425 block drbd8: role( Secondary -> Primary )
  2019-08-08T23:44:55.924 block drbd5: role( Secondary -> Primary )
  2019-08-08T23:44:55.924 block drbd7: role( Secondary -> Primary )
  2019-08-08T23:44:55.925 block drbd1: role( Secondary -> Primary )
  2019-08-08T23:44:55.925 block drbd3: role( Secondary -> Primary )
  2019-08-08T23:44:55.925 block drbd0: role( Secondary -> Primary )
  2019-08-08T23:44:55.925 block drbd2: role( Secondary -> Primary )

that's 11 seconds in switching drbd roles.

To recap, for a controlled swact on a 2+2 with 3-node etcd cluster, we have:
 + 4s 23:44:40 - 24:44:44 unaccounted
      from oam-services go-standby/standby to drbd TearDown on controller-0
 +11s 23:44:44 - 23:44:55 drbd primary role
 + 1s 23:44:55 - 23:44:56 oam-services go active
 + 1s 23:44:56 - 23:44:57 etcd peer reported active
 + 6s 23:44:57 - 23:45:03 openstack server list starts working again

Frank Miller (sensfan22) wrote :

Assigning to Bart to take over the investigation as Daniel is now on vacation.

Changed in starlingx:
assignee: Daniel Badea (daniel.badea) → Bart Wensley (bartwensley)
Bart Wensley (bartwensley) wrote :

Here is a summary of the time taken to recover from an active controller reset (force reboot). Some notes:
- I did my testing on a 2+2 lab (IP_33-36 in Ottawa) with the 20190811T053000Z load and the stx-openstack application installed.
- Recovery time was measured using “openstack server list”, with a timeout of 15s.

Recovery times range from about 90s to 120s, with an average of about 105s. Here is a rough summary of where I think the time is going (times can be +/- 5s or so):

0s - force reboot
4s - SM begins enabling services
15s - all SM services enabled
25s - kube-controller-manager becomes leader (leader election time is 15s but requires etcd to be running)
40s - kube-controller-manager finishes recovering system state and starts timing for NodeNotReady
60s - kube-controller-manager declares that formerly active controller is NodeNotReady (20s node-monitor-grace-period) and removes the endpoints for pods running on that node

At this point (roughly 60s) the flock services and kubernetes components should be available and have an up-to-date view of the system state. However, it takes another 30-60s for the “openstack server list” command to start functioning again. I’m not 100% sure about this but I think it is related to the endpoints for some openstack pods not being available yet. I see two pods in particular that have issues:
- The osh-openstack-memcached-memcached pod is not replicated, so it takes time (30s or longer) for this pod to be re-created on the newly active controller. Some notes:
  - Without this pod, none of the openstack APIs are working (I confirmed this by removing the pod by setting the replicas to 0).
  - By default, kubernetes waits 30s to delete the pod and re-create it on the newly active controller (due to the “node.kubernetes.io/unreachable:NoExecute for 30s” toleration). I tried to set this toleration to 5s, but that didn’t improve things (not sure why).
  - I also tried to set the replicas to 2 for this pod and it also didn’t help. I’m not sure this pod supports more than 1 replica - see https://blueprints.launchpad.net/openstack-helm/+spec/memcached-resiliency for a proposed change to support memcached replication.
- The keystone-api pod often fails its liveness probe on the newly active controller, which causes the container to be restarted and the keystone-api endpoint to disappear for a significant period of time. I don’t know why this is happening.

Frank Miller (sensfan22) wrote :

Changed the gate to stx.3.0. The swacts are functional for stx.2.0 and options to reduce the downtime over a swact will be looked at for stx.3.0

tags: added: stx.3.0
removed: stx.2.0

After doing a second review of this LP with Frank, here is the debug history we had for other scenario

The system set up is

[sysadmin@controller-0 ~(keystone_admin)]$ cat /etc/build.info
###
### StarlingX
### Release 19.09
###

SW_VERSION="19.09"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2019-08-21_20-18-00"
SRC_BUILD_ID="433"

JOB="StarlingX_Upstream_build"
BUILD_BY="jenkins"
BUILD_NUMBER="433"
BUILD_HOST="yow-cgts1-lx"
BUILD_DATE="2019-08-21 20:20:12 -0400"

Controller-1 active and controller-0 is standby

We check the reboot time:

reboot -f:
Thu Aug 29 13:54:50 UTC 2019

openstack command started working at:
Thu Aug 29 13:55:57 UTC 2019

In this case it takes:

1 minute 7 seconds

From the moment that we reboot to the comment we have all the openstack services ready

The we analyze the logs from the SM:

Logs:
sm-customer.log:
| 2019-08-29T13:54:54.038 | 406 | neighbor-scn | controller-1 | full | down | Neighbor dead
| 2019-08-29T13:55:05.524 | 539 | service-group-scn | controller-services | go-active | active |

SM takes 15 seconds to bring up all services on newly active controller (controller-0)

The next level of debug is k8:

# kube-controller-manager was leader on controller-0 before the reboot

We found that took 20s to declare controller-1 as NodeNotReady (--node-monitor-grace-period=20s)

I0829 13:55:09.664106 1 event.go:221] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"controller-1", UID:"8b224632-c8f7-11e9-b005-90e2ba39b2fc", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node controller-1 status is now: NodeNotReady

One hypothesis that Bart had is that since memcached can not be active on the 2 controllers it is taking time to detect and bring it on the new active controllers:

# memcached pod is deleted (was running on controller-1) (--pod-eviction-timeout=30s)
I0829 13:55:44.058062 1 taint_manager.go:102] NoExecuteTaintManager is deleting Pod: openstack/osh-openstack-memcached-memcached-b56979599-sqlmc

# memcached pod is created on controller-0
I0829 13:55:45.368259 1 event.go:221] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"openstack", Name:"osh-openstack-memcached-memcached-b56979599", UID:"74939a3f-c8fc-11e9-b005-90e2ba39b2fc", APIVersion:"apps/v1", ResourceVersion:"568559", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: osh-openstack-memcached-memcached-b56979599-sjdf7

# memcached init container
      Started: Thu, 29 Aug 2019 13:55:52 +0000
      Finished: Thu, 29 Aug 2019 13:55:52 +0000

# memcached main container
      Started: Thu, 29 Aug 2019 13:55:53 +0000

The next steps:

1) Investigate way more about memcached https://www.linuxjournal.com/article/7451
2) Try to get a graphical tool to see the logs , things like : http://www.bootchart.org/images/bootchart.initial.png
3) have fun !

Elio Martinez (elio1979) wrote :

Trying to get some swact metrics on:
OS="centos"
SW_VERSION="19.08"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="r/stx.2.0"

JOB="STX_BUILD_2.0"
<email address hidden>"
BUILD_NUMBER="40"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-08-26 23:30:00 +0000"

I found that "openstack server list" still working on both controllers for a long time until the system recognizes the inactive controller. This issue is random.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers