Analysis:

- [1] was done to ensure that fixed IPs are used for the cluster and mgmt
  networks when multiple (floating) IPs are present on an interface in an
  IPv6 config.
- We don't do this for the OAM network, so after controller-0 unlocks we end
  up using the OAM floating address instead of the fixed host address when
  containerd starts.
- When controller-1 is unlocked, it is the standby controller with only one
  OAM-specific host address, so containerd starts successfully with the
  correct IP.
- The containerd template does not set the stream_server_address [2].
  - Should we be providing an address here that aligns with the address we
    want containerd to listen on (i.e. the OAM host-specific address)?
  - Do we want to use the OAM address here, or should this be the MGMT
    host-specific address or loopback?

I suspect that if you restarted containerd on controller-0 (currently
standby), it would pick up the correct address and things would work
correctly. We should specify an explicit address for containerd to bind to,
to avoid IP discovery when the containerd process starts up; a sketch of what
that could look like in the template follows below.

[1] https://review.opendev.org/#/c/715120/
[2] https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/templates/config.toml.erb#L29
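A minimal sketch of pinning the stream server address in the template from
[2], assuming the manifest is extended to pass in the node's fixed address
(platform_stream_address is a hypothetical variable, and the section header
should match whatever the existing template uses for the CRI plugin):

# config.toml.erb (sketch): bind the CRI stream server explicitly instead of
# letting containerd discover an IP at startup
[plugins.cri]
  # Hypothetical variable carrying the host-specific (non-floating) address;
  # whether this should be the OAM, MGMT, or loopback address is the open
  # question above
  stream_server_address = "<%= @platform_stream_address %>"

Note that the exec failures captured below show the stream URL being dialed
from the peer controller, so a loopback bind would only be viable if
container streaming is proxied through the local kubelet rather than
redirected.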
Look at the lab and related logs:

# Ansible sets up the following:
external_oam_subnet: 2620:10a:a001:a103::6:0/64
external_oam_gateway_address: 2620:10a:a001:a103::6:0
external_oam_floating_address: 2620:10A:A001:A103::11
external_oam_node_0_address: 2620:10A:A001:A103::8
external_oam_node_1_address: 2620:10A:A001:A103::9

# Controller-0 (standby) after ansible, unlock, and swact to controller-1:
# Problem: containerd is listening on the OAM floating (oamcontroller)
# address -> should be the node address 2620:10A:A001:A103::8
controller-0:~$ sudo netstat -tulpn6 | grep contain
tcp6       0      0 2620:10a:a001:a10:40870 :::*                    LISTEN      89515/containerd

controller-0:~$ sudo netstat -tulp6 | grep contain
tcp6       0      0 oamcontroller:40870     [::]:*                  LISTEN      89515/containerd

controller-0:~$ cat /etc/resolv.conf
nameserver face::1
nameserver 2620:10a:a001:a103::2

controller-0:~$ cat /etc/hosts | grep oamcontroller
2620:10a:a001:a103::11 oamcontroller

controller-0:~$ ip a
3: eno1: mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 18:66:da:af:b5:88 brd ff:ff:ff:ff:ff:ff
    inet6 2620:10a:a001:a103::8/64 scope global   <------- containerd should be listening here
       valid_lft forever preferred_lft forever
    inet6 fe80::1a66:daff:feaf:b588/64 scope link
       valid_lft forever preferred_lft forever
14: vlan157@pxeboot0: mtu 1500 qdisc htb state UP group default qlen 1000
    link/ether 3c:fd:fe:25:d5:c0 brd ff:ff:ff:ff:ff:ff
    inet6 face::2/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::3efd:feff:fe25:d5c0/64 scope link
       valid_lft forever preferred_lft forever
15: vlan158@pxeboot0: mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 3c:fd:fe:25:d5:c0 brd ff:ff:ff:ff:ff:ff
    inet6 feed:beef::2/64 scope site
       valid_lft forever preferred_lft forever
    inet6 fe80::3efd:feff:fe25:d5c0/64 scope link
       valid_lft forever preferred_lft forever

# Controller-1 (active) after unlock, and swact from controller-0:
# Containerd is listening on the correct address: 2620:10A:A001:A103::9
controller-1:~$ sudo netstat -tulpn6 | grep contain
tcp6       0      0 2620:10a:a001:a10:39273 :::*                    LISTEN      93111/containerd

controller-1:~$ sudo netstat -tulp6 | grep contain
tcp6       0      0 v6-150-9.yow.lab.:39273 [::]:*                  LISTEN      93111/containerd

rchurch@yow-tuxlab2:~$ nslookup v6-150-9
Server:         127.0.0.1
Address:        127.0.0.1#53

Name:    v6-150-9.yow.lab.wrs.com
Address: 2620:10a:a001:a103::9

controller-1:~$ cat /etc/resolv.conf
nameserver face::1
nameserver 2620:10a:a001:a103::2

controller-1:~$ cat /etc/hosts | grep oamcontroller
2620:10a:a001:a103::11 oamcontroller

controller-1:~$ ip a
3: eno1: mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 18:66:da:af:bc:d2 brd ff:ff:ff:ff:ff:ff
    inet6 2620:10a:a001:a103::11/64 scope global
       valid_lft forever preferred_lft forever   <----- Missing "0sec" on floating OAM IP
    inet6 2620:10a:a001:a103::9/64 scope global   <----- containerd should be and IS listening here
       valid_lft forever preferred_lft forever
    inet6 fe80::1a66:daff:feaf:bcd2/64 scope link
       valid_lft forever preferred_lft forever
15: vlan157@pxeboot0: mtu 1500 qdisc htb state UP group default qlen 1000
    link/ether 3c:fd:fe:25:d7:00 brd ff:ff:ff:ff:ff:ff
    inet6 face::4/64 scope global deprecated
       valid_lft forever preferred_lft 0sec   <----- Per https://review.opendev.org/#/c/715120/
    inet6 face::1/64 scope global deprecated
       valid_lft forever preferred_lft 0sec   <----- Per https://review.opendev.org/#/c/715120/
    inet6 face::3/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::3efd:feff:fe25:d700/64 scope link
       valid_lft forever preferred_lft forever
16: vlan158@pxeboot0: mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 3c:fd:fe:25:d7:00 brd ff:ff:ff:ff:ff:ff
    inet6 feed:beef::1/64 scope site deprecated
       valid_lft forever preferred_lft 0sec   <----- Per https://review.opendev.org/#/c/715120/
    inet6 feed:beef::3/64 scope site
       valid_lft forever preferred_lft forever
    inet6 fe80::3efd:feff:fe25:d700/64 scope link
       valid_lft forever preferred_lft forever

# Tiller is accessible from the mgmt network floating IP
controller-1:~$ kubectl get ep -n kube-system tiller-deploy
NAME            ENDPOINTS         AGE
tiller-deploy   [face::2]:44134   14h

# Last message from tiller probably aligned with when controller-0 was active
2020-04-28T18:29:39.174576873Z stderr F [tiller] 2020/04/28 18:29:39 executing 0 post-delete hooks for oidc-oidc-client
2020-04-28T18:29:39.17459409Z stderr F [tiller] 2020/04/28 18:29:39 hooks complete for post-delete oidc-oidc-client
2020-04-28T18:29:39.17460406Z stderr F [tiller] 2020/04/28 18:29:39 purge requested for oidc-oidc-client
2020-04-28T18:29:39.174611595Z stderr F [storage] 2020/04/28 18:29:39 deleting release "oidc-oidc-client.v1"
2020-04-28T18:34:53.429325711Z stderr F [storage] 2020/04/28 18:34:53 listing all releases with filter
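To cross-check what each containerd resolved at startup (beyond netstat), the
effective CRI config can be dumped; this sketch assumes the CRI plugin
exposes its stream server settings in the crictl info output, which is worth
confirming on this containerd version:

# Sketch: inspect the effective CRI stream server settings on each controller
controller-0:~$ sudo crictl info | grep -i streamserver
controller-1:~$ sudo crictl info | grep -i streamserver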
# From sm-customer.log, so yes, controller-0 was active at the time of the last tiller logs
| 2020-04-28T18:00:56.910 | 196 | service-group-scn | controller-services | standby-degraded | standby |
| 2020-04-28T19:14:08.477 | 197 | node-scn          | controller-1        |                  | swact   | issued against host controller-0

# Tiller pod is on controller-0
controller-1:~$ kubectl get pods --all-namespaces -o wide -w | grep tiller
kube-system   tiller-deploy-5c8dd9fb56-lgldr   1/1   Running   0   11h   face::2   controller-0

# Get the container ID
controller-1:~$ kubectl describe pods -n kube-system tiller-deploy-5c8dd9fb56-lgldr | grep containerd
    Container ID:  containerd://bb7618a8a6305780322c804ef0991e102596d0d8448c9387e6ad4cbeedd28185

# Can't connect from controller-1: the OAM floating address is assigned to
# controller-1, but controller-0 has containerd listening on that same address
controller-1:~$ kubectl exec -it -n kube-system tiller-deploy-5c8dd9fb56-lgldr -- ls
error: unable to upgrade connection: error dialing backend: dial tcp [2620:10a:a001:a103::11]:40870: connect: connection timed out

# Can't connect from controller-0: the OAM floating address is assigned to
# controller-1, but controller-0 has containerd listening on that same address
controller-0:~# DEBUG=1 crictl exec -it bb7618a8a6305 ls
FATA[0127] execing command in container failed: error sending request: Post "http://[2620:10a:a001:a103::11]:40870/exec/va5vLJhQ": dial tcp [2620:10a:a001:a103::11]:40870: connect: connection timed out
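If the restart suspicion above is right, a quick way to validate it (a
hypothetical session, not yet run on this lab) is to restart containerd on
the standby controller-0 while controller-1 holds the floating IP and
re-check the bind address:

# Sketch: restart containerd (or use the platform's process monitor if it
# owns the service) and confirm it re-binds to the host-specific address
controller-0:~$ sudo systemctl restart containerd
controller-0:~$ sudo netstat -tulpn6 | grep contain   # expect 2620:10a:a001:a103::8, not ::11

If containerd comes back on ::8 and the crictl/kubectl exec attempts above
start working, that confirms the failure is startup-time IP discovery racing
with the floating IP assignment, and pinning stream_server_address (or the
bind address generally) is the right fix.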