Comment 4 for bug 1838411

Daniel Badea (daniel.badea) wrote :

In the next experiment I increased the etcd cluster to 3 nodes:
one on the controllers (floating) and one on each compute
(this is a 2+2 setup).
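
For reference, the cluster membership can be checked with etcdctl;
a minimal sketch in Python, assuming the v3 API and a local client
endpoint (both are assumptions, not taken from this setup):

  import os
  import subprocess

  # Assumed endpoint; adjust to the actual etcd client address/port.
  env = dict(os.environ, ETCDCTL_API="3")
  members = subprocess.run(
      ["etcdctl", "--endpoints=http://127.0.0.1:2379", "member", "list"],
      env=env, capture_output=True, text=True, check=True)
  # Expect three members: the floating controller instance plus one per compute.
  print(members.stdout)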

With the 3-node etcd cluster in place, when changing the active
controller from controller-1 to controller-0:

  2019-08-08T23:44:41.228 keystone.openstack.svc.cluster.local status: no DNS
  2019-08-08T23:45:02.547 keystone.openstack.svc.cluster.local status: 10.101.173.208
  2019-08-08T23:45:03.658 10.101.173.208 http status: UP
  2019-08-08T23:45:03.665 openstack server list: OK
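
The timestamps above come from a simple availability probe; a rough
sketch of the approach (not the exact script used; the keystone port
and the poll interval are assumptions):

  import socket
  import subprocess
  import time
  import urllib.error
  import urllib.request
  from datetime import datetime

  HOST = "keystone.openstack.svc.cluster.local"
  PORT = 5000  # assumed keystone port, not confirmed from the logs

  def stamp():
      return datetime.now().isoformat(timespec="milliseconds")

  while True:
      try:
          addr = socket.gethostbyname(HOST)
          print(f"{stamp()} {HOST} status: {addr}")
      except OSError:
          print(f"{stamp()} {HOST} status: no DNS")
          time.sleep(1)
          continue
      try:
          urllib.request.urlopen(f"http://{addr}:{PORT}/", timeout=5)
          up = True
      except urllib.error.HTTPError:
          up = True   # any HTTP response means the endpoint is reachable
      except OSError:
          up = False
      print(f"{stamp()} {addr} http status: {'UP' if up else 'DOWN'}")
      rc = subprocess.run(["openstack", "server", "list"],
                          capture_output=True).returncode
      print(f"{stamp()} openstack server list: {'OK' if rc == 0 else 'FAIL'}")
      time.sleep(1)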

Based on the log above, it takes ~23 seconds to recover. The etcd
nodes installed on the computes report:

  2019-08-08 23:44:40.859360 I | rafthttp: peer 8e9e05c52164694d became inactive
  2019-08-08 23:44:57.875170 I | rafthttp: peer 8e9e05c52164694d became active

that's 17 seconds for etcd to restart on the target controller.

controller-1 sm-customer log reports:

  2019-08-08T23:44:38.103 | 334 | node-scn | controller-1 | | swact | issued against host controller-1
  2019-08-08T23:44:40.477 | 393 | service-group-scn | oam-services | go-standby | standby |
  2019-08-08T23:44:56.610 | 529 | service-group-scn | oam-services | go-active | active |

controller-0 sm-customer log reports:

  2019-08-08T23:44:38.104 | 441 | node-scn | controller-0 | | swact | issued against host controller-1 |
  2019-08-08T23:44:55.121 | 442 | service-group-scn | vim-services | standby | go-active | |

with a 15-second pause before it goes active, which can be partially
explained by looking at drbd activity in kern.log (lines filtered;
see the sketch after the excerpt):

  2019-08-08T23:44:44.527 drbd drbd-dockerdistribution: conn( TearDown -> Unconnected )
  2019-08-08T23:44:44.594 drbd drbd-etcd: conn( TearDown -> Unconnected )
  2019-08-08T23:44:44.661 drbd drbd-extension: conn( TearDown -> Unconnected )
  2019-08-08T23:44:45.531 drbd drbd-cgcs: conn( TearDown -> Unconnected )
  2019-08-08T23:44:51.553 drbd drbd-platform: conn( TearDown -> Unconnected )
  2019-08-08T23:44:52.055 drbd drbd-pgsql: conn( TearDown -> Unconnected )
  2019-08-08T23:44:54.072 drbd drbd-rabbit: conn( TearDown -> Unconnected )
  2019-08-08T23:44:55.425 block drbd8: role( Secondary -> Primary )
  2019-08-08T23:44:55.924 block drbd5: role( Secondary -> Primary )
  2019-08-08T23:44:55.924 block drbd7: role( Secondary -> Primary )
  2019-08-08T23:44:55.925 block drbd1: role( Secondary -> Primary )
  2019-08-08T23:44:55.925 block drbd3: role( Secondary -> Primary )
  2019-08-08T23:44:55.925 block drbd0: role( Secondary -> Primary )
  2019-08-08T23:44:55.925 block drbd2: role( Secondary -> Primary )

that's 11 seconds spent switching drbd roles.
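
The filtering above amounts to keeping only the drbd connection and
role transitions from kern.log; roughly something like this (the log
path is an assumption):

  import re

  # Keep only drbd connection-state and role-change lines.
  pattern = re.compile(r"drbd.*(conn\(|role\()")

  with open("/var/log/kern.log") as kern:
      for line in kern:
          if pattern.search(line):
              print(line.rstrip())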

To recap, for a controlled swact on a 2+2 setup with a 3-node etcd cluster, we have:
 + 4s 23:44:40 - 23:44:44 unaccounted
      from oam-services go-standby/standby to drbd TearDown on controller-0
 +11s 23:44:44 - 23:44:55 drbd primary role
 + 1s 23:44:55 - 23:44:56 oam-services go active
 + 1s 23:44:56 - 23:44:57 etcd peer reported active
 + 6s 23:44:57 - 23:45:03 openstack server list starts working again
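
As a quick sanity check that these intervals account for the whole
outage, the deltas can be recomputed from the timestamps (labels are
mine, times are taken from the logs above):

  from datetime import datetime

  # Timestamps from the logs above, at seconds granularity.
  marks = [
      ("oam-services go-standby",        "23:44:40"),
      ("drbd TearDown starts",           "23:44:44"),
      ("drbd roles switched to Primary", "23:44:55"),
      ("oam-services go-active",         "23:44:56"),
      ("etcd peer reported active",      "23:44:57"),
      ("openstack server list OK",       "23:45:03"),
  ]
  times = [datetime.strptime(t, "%H:%M:%S") for _, t in marks]
  for (label, _), t0, t1 in zip(marks[1:], times, times[1:]):
      print(f"{(t1 - t0).seconds:3d}s  {label}")
  print(f"{(times[-1] - times[0]).seconds:3d}s  total")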