Comment 3 for bug 1875891

Bob Church (rchurch) wrote :

Analysis:
- [1] was done to ensure that fixed IPs are used for the cluster and mgmt
  networks when multiple (floating) IPs are present in an IPv6 config.
  - We don't do this for the OAM network, so after controller-0 unlocks we end
    up using the OAM floating address instead of the fixed host address when
    containerd starts.
  - When controller-1 is unlocked, it is the standby controller with only one
    OAM-specific host address, so containerd starts successfully with the
    correct IP.
- The containerd template does not set the stream_server_address [2].
  - Should we be providing an address here that aligns with the address we
    want containerd to listen on (i.e. the OAM host-specific address)?
  - Do we want to use the OAM address here, or should this be the MGMT
    host-specific address or loopback?

I suspect that if you restarted containerd on controller-0 (currently standby)
it would get the correct address and things would work correctly.

We should specify an explicit address for containerd to bind to, avoiding IP
discovery when the containerd process starts up.
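If we do set stream_server_address in the template [2], a minimal sketch might look like the following. This is an assumption-laden illustration: the [plugins.cri] section name matches containerd 1.2-era configs (newer versions use [plugins."io.containerd.grpc.v1.cri"]), and the @oam_node_address ERB variable is a hypothetical placeholder, not an existing puppet fact.

```toml
# Hypothetical config.toml.erb fragment: pin containerd's CRI stream
# server to the host-specific OAM address instead of relying on IP
# discovery. "@oam_node_address" is a placeholder variable name.
[plugins.cri]
  stream_server_address = "<%= @oam_node_address %>"
```

Whether the right value is the OAM host address, the MGMT host address, or loopback is exactly the open question above; the point of the sketch is only that an explicit value removes the discovery step.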

[1] https://review.opendev.org/#/c/715120/
[2] https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/templates/config.toml.erb#L29

Look at the lab and related logs:

# Ansible sets up the following:
external_oam_subnet: 2620:10a:a001:a103::6:0/64
external_oam_gateway_address: 2620:10a:a001:a103::6:0
external_oam_floating_address: 2620:10A:A001:A103::11
external_oam_node_0_address: 2620:10A:A001:A103::8
external_oam_node_1_address: 2620:10A:A001:A103::9

# Controller-0 (standby) after ansible, unlocked, and swact to controller-1:
# Problem: Containerd is listening on the OAM controller address -> should be the node address 2620:10A:A001:A103::8
controller-0:~$ sudo netstat -tulpn6 | grep contain
tcp6 0 0 2620:10a:a001:a10:40870 :::* LISTEN 89515/containerd

controller-0:~$ sudo netstat -tulp6 | grep contain
tcp6 0 0 oamcontroller:40870 [::]:* LISTEN 89515/containerd

controller-0:~$ cat /etc/resolv.conf
nameserver face::1
nameserver 2620:10a:a001:a103::2

controller-0:~$ cat /etc/hosts | grep oamcontroller
2620:10a:a001:a103::11 oamcontroller

controller-0:~$ ip a
3: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 18:66:da:af:b5:88 brd ff:ff:ff:ff:ff:ff
    inet6 2620:10a:a001:a103::8/64 scope global <------- containerd should be listening here.
       valid_lft forever preferred_lft forever
    inet6 fe80::1a66:daff:feaf:b588/64 scope link
       valid_lft forever preferred_lft forever
14: vlan157@pxeboot0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc htb state UP group default qlen 1000
    link/ether 3c:fd:fe:25:d5:c0 brd ff:ff:ff:ff:ff:ff
    inet6 face::2/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::3efd:feff:fe25:d5c0/64 scope link
       valid_lft forever preferred_lft forever
15: vlan158@pxeboot0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 3c:fd:fe:25:d5:c0 brd ff:ff:ff:ff:ff:ff
    inet6 feed:beef::2/64 scope site
       valid_lft forever preferred_lft forever
    inet6 fe80::3efd:feff:fe25:d5c0/64 scope link
       valid_lft forever preferred_lft forever

# Controller-1 (active) after unlock and swact from controller-0:
# Containerd is listening on the correct address: 2620:10A:A001:A103::9
controller-1:~$ sudo netstat -tulpn6 | grep contain
tcp6 0 0 2620:10a:a001:a10:39273 :::* LISTEN 93111/containerd

controller-1:~$ sudo netstat -tulp6 | grep contain
tcp6 0 0 v6-150-9.yow.lab.:39273 [::]:* LISTEN 93111/containerd

rchurch@yow-tuxlab2:~$ nslookup v6-150-9
Server: 127.0.0.1
Address: 127.0.0.1#53
Name: v6-150-9.yow.lab.wrs.com
Address: 2620:10a:a001:a103::9

controller-1:~$ cat /etc/resolv.conf
nameserver face::1
nameserver 2620:10a:a001:a103::2

controller-1:~$ cat /etc/hosts | grep oamcontroller
2620:10a:a001:a103::11 oamcontroller
controller-1:~$ ip a
3: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 18:66:da:af:bc:d2 brd ff:ff:ff:ff:ff:ff
    inet6 2620:10a:a001:a103::11/64 scope global
       valid_lft forever preferred_lft forever <----- Missing "0sec" on floating OAM IP (not deprecated)
    inet6 2620:10a:a001:a103::9/64 scope global <----- containerd should be and IS listening here
       valid_lft forever preferred_lft forever
    inet6 fe80::1a66:daff:feaf:bcd2/64 scope link
       valid_lft forever preferred_lft forever
15: vlan157@pxeboot0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc htb state UP group default qlen 1000
    link/ether 3c:fd:fe:25:d7:00 brd ff:ff:ff:ff:ff:ff
    inet6 face::4/64 scope global deprecated
       valid_lft forever preferred_lft 0sec <----- Per https://review.opendev.org/#/c/715120/
    inet6 face::1/64 scope global deprecated
       valid_lft forever preferred_lft 0sec <----- Per https://review.opendev.org/#/c/715120/
    inet6 face::3/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::3efd:feff:fe25:d700/64 scope link
       valid_lft forever preferred_lft forever
16: vlan158@pxeboot0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 3c:fd:fe:25:d7:00 brd ff:ff:ff:ff:ff:ff
    inet6 feed:beef::1/64 scope site deprecated
       valid_lft forever preferred_lft 0sec <----- Per https://review.opendev.org/#/c/715120/
    inet6 feed:beef::3/64 scope site
       valid_lft forever preferred_lft forever
    inet6 fe80::3efd:feff:fe25:d700/64 scope link
       valid_lft forever preferred_lft forever

# Tiller is accessible from the mgmt network floating IP
controller-1:~$ kubectl get ep -n kube-system tiller-deploy
NAME ENDPOINTS AGE
tiller-deploy [face::2]:44134 14h

# Last message from tiller probably aligned with when controller-0 was active
2020-04-28T18:29:39.174576873Z stderr F [tiller] 2020/04/28 18:29:39 executing 0 post-delete hooks for oidc-oidc-client
2020-04-28T18:29:39.17459409Z stderr F [tiller] 2020/04/28 18:29:39 hooks complete for post-delete oidc-oidc-client
2020-04-28T18:29:39.17460406Z stderr F [tiller] 2020/04/28 18:29:39 purge requested for oidc-oidc-client
2020-04-28T18:29:39.174611595Z stderr F [storage] 2020/04/28 18:29:39 deleting release "oidc-oidc-client.v1"
2020-04-28T18:34:53.429325711Z stderr F [storage] 2020/04/28 18:34:53 listing all releases with filter

# From sm-customer.log, so yes, controller-0 was active at the time of the last tiller logs
| 2020-04-28T18:00:56.910 | 196 | service-group-scn | controller-services | standby-degraded | standby |
| 2020-04-28T19:14:08.477 | 197 | node-scn | controller-1 | | swact | issued against host controller-0

# Tiller pod is on controller-0
controller-1:~$ kubectl get pods --all-namespaces -o wide -w | grep tiller
kube-system tiller-deploy-5c8dd9fb56-lgldr 1/1 Running 0 11h face::2 controller-0 <none> <none>

# Get the container id
controller-1:~$ kubectl describe pods -n kube-system tiller-deploy-5c8dd9fb56-lgldr | grep containerd
    Container ID: containerd://bb7618a8a6305780322c804ef0991e102596d0d8448c9387e6ad4cbeedd28185

# Can't connect from controller-1: the OAM floating address is assigned to controller-1,
# but controller-0 has containerd listening on that same address
controller-1:~$ kubectl exec -it -n kube-system tiller-deploy-5c8dd9fb56-lgldr -- ls
error: unable to upgrade connection: error dialing backend: dial tcp [2620:10a:a001:a103::11]:40870: connect: connection timed out

# Can't connect from controller-0: the OAM floating address is assigned to controller-1,
# but controller-0 has containerd listening on that same address
controller-0:~# DEBUG=1 crictl exec -it bb7618a8a6305 ls
FATA[0127] execing command in container failed: error sending request: Post "http://[2620:10a:a001:a103::11]:40870/exec/va5vLJhQ": dial tcp [2620:10a:a001:a103::11]:40870: connect: connection timed out