In the next experiment I increased the etcd cluster to 3 nodes:
one floating on the controllers (so it moves with the active
controller during a swact) and one on each compute
(this is a 2+2 setup).
With the 3-node etcd cluster in place, when changing the active
controller from controller-1 to controller-0:
2019-08-08T23:44:41.228 keystone.openstack.svc.cluster.local status: no DNS
2019-08-08T23:45:02.547 keystone.openstack.svc.cluster.local status: 10.101.173.208
2019-08-08T23:45:03.658 10.101.173.208 http status: UP
2019-08-08T23:45:03.665 openstack server list: OK
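The status lines above are produced by a polling probe; here is a
minimal sketch of such a loop in Python. Only the FQDN is taken from
the log; the port 5000 guess and the exact checks are my assumptions.

#!/usr/bin/env python3
# Hypothetical probe reproducing the status lines above: resolve the
# keystone FQDN, check HTTP, then exercise the API via the CLI.
import datetime
import socket
import subprocess
import time

import requests  # assumed available; any HTTP client would do

FQDN = "keystone.openstack.svc.cluster.local"  # from the log above

def stamp():
    return datetime.datetime.now().isoformat(timespec="milliseconds")

while True:
    time.sleep(1)
    try:
        ip = socket.gethostbyname(FQDN)                # DNS check
        print(f"{stamp()} {FQDN} status: {ip}")
    except socket.gaierror:
        print(f"{stamp()} {FQDN} status: no DNS")
        continue
    try:
        requests.get(f"http://{ip}:5000/", timeout=2)  # port is a guess
        print(f"{stamp()} {ip} http status: UP")
    except requests.RequestException:
        continue
    if subprocess.run(["openstack", "server", "list"],
                      capture_output=True).returncode == 0:
        print(f"{stamp()} openstack server list: OK")
        break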
it takes ~23 seconds to recover (23:44:40 to 23:45:03). The etcd
nodes installed on the computes report:
2019-08-08 23:44:40.859360 I | rafthttp: peer 8e9e05c52164694d became inactive
2019-08-08 23:44:57.875170 I | rafthttp: peer 8e9e05c52164694d became active
that's 17 seconds for etcd to restart on the target controller.
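The 17 seconds is simply the gap between the two rafthttp lines; as a
sanity check on the arithmetic (timestamps copied verbatim):

from datetime import datetime

inactive = "2019-08-08 23:44:40.859360 I | rafthttp: peer 8e9e05c52164694d became inactive"
active = "2019-08-08 23:44:57.875170 I | rafthttp: peer 8e9e05c52164694d became active"

def ts(line):
    # etcd prefixes each line with "YYYY-MM-DD HH:MM:SS.ffffff"
    return datetime.strptime(" ".join(line.split()[:2]), "%Y-%m-%d %H:%M:%S.%f")

print(ts(active) - ts(inactive))  # 0:00:17.015810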
controller-1 sm-customer log reports:
2019-08-08T23:44:38.103 | 334 | node-scn | controller-1 | | swact | issued against host controller-1
2019-08-08T23:44:40.477 | 393 | service-group-scn | oam-services | go-standby | standby |
2019-08-08T23:44:56.610 | 529 | service-group-scn | oam-services | go-active | active |
controller-0 sm-customer log reports:
2019-08-08T23:44:38.104 | 441 | node-scn | controller-0 | | swact | issued against host controller-1 |
2019-08-08T23:44:55.121 | 442 | service-group-scn | vim-services | standby | go-active | |
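sm-customer lines are pipe-delimited, so the gaps are easy to pull
out; a small sketch using one service-group line from each log above
(my reading of the pause: from controller-1 releasing oam-services to
controller-0 starting on vim-services):

from datetime import datetime

# service-group-scn lines quoted above, one per controller log
lines = [
    "2019-08-08T23:44:40.477 | 393 | service-group-scn | oam-services | go-standby | standby |",
    "2019-08-08T23:44:55.121 | 442 | service-group-scn | vim-services | standby | go-active | |",
]

events = []
for line in lines:
    fields = [f.strip() for f in line.split("|")]
    when = datetime.strptime(fields[0], "%Y-%m-%dT%H:%M:%S.%f")
    events.append((when, fields[3], fields[4], fields[5]))

print(events[-1][0] - events[0][0])  # 0:00:14.644000, the ~15s pause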
There is a ~15-second pause before vim-services goes active, which can
be partially explained by looking at drbd activity in kern.log (lines
filtered):
2019-08-08T23:44:44.527 drbd drbd-dockerdistribution: conn( TearDown -> Unconnected )
2019-08-08T23:44:44.594 drbd drbd-etcd: conn( TearDown -> Unconnected )
2019-08-08T23:44:44.661 drbd drbd-extension: conn( TearDown -> Unconnected )
2019-08-08T23:44:45.531 drbd drbd-cgcs: conn( TearDown -> Unconnected )
2019-08-08T23:44:51.553 drbd drbd-platform: conn( TearDown -> Unconnected )
2019-08-08T23:44:52.055 drbd drbd-pgsql: conn( TearDown -> Unconnected )
2019-08-08T23:44:54.072 drbd drbd-rabbit: conn( TearDown -> Unconnected )
2019-08-08T23:44:55.425 block drbd8: role( Secondary -> Primary )
2019-08-08T23:44:55.924 block drbd5: role( Secondary -> Primary )
2019-08-08T23:44:55.924 block drbd7: role( Secondary -> Primary )
2019-08-08T23:44:55.925 block drbd1: role( Secondary -> Primary )
2019-08-08T23:44:55.925 block drbd3: role( Secondary -> Primary )
2019-08-08T23:44:55.925 block drbd0: role( Secondary -> Primary )
2019-08-08T23:44:55.925 block drbd2: role( Secondary -> Primary )
that's 11 seconds spent switching drbd roles.
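For reproducibility, a sketch of the filtering itself, assuming the
collected kern.log keeps the timestamp-first format shown above; the
window runs from the first TearDown to the last promotion to Primary:

import re
from datetime import datetime

# Keep drbd connection teardowns and role promotions, as in the
# extract above, then report the overall window.
PAT = re.compile(r"conn\( TearDown|role\( Secondary -> Primary")

stamps = []
with open("kern.log") as log:  # path is an assumption
    for line in log:
        if PAT.search(line):
            print(line.rstrip())
            stamps.append(datetime.strptime(line.split()[0],
                                            "%Y-%m-%dT%H:%M:%S.%f"))

if stamps:
    print("window:", max(stamps) - min(stamps))  # ~11s in the trace above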
To recap, for a controlled swact on a 2+2 setup with a 3-node etcd cluster, we have:
+ 4s 23:44:40 - 23:44:44 unaccounted
from oam-services go-standby/standby to drbd TearDown on controller-0
+11s 23:44:44 - 23:44:55 drbd primary role
+ 1s 23:44:55 - 23:44:56 oam-services go active
+ 1s 23:44:56 - 23:44:57 etcd peer reported active
+ 6s 23:44:57 - 23:45:03 openstack server list starts working again
These phases add up to the ~23 seconds the probe observed.