Analysis:
- [1] was done to ensure that fixed IPs are used for the cluster and mgmt
networks when multiple (floating) IPs are present in an IPv6 config.
- We don't do this for the OAM network, so after controller-0 unlocks we end up
using the OAM floating address instead of the fixed host address when
containerd starts.
- When controller-1 is unlocked, it is the standby controller with only one
OAM host-specific address, so containerd starts successfully with the
correct IP.
- The containerd template does not set the stream_server_address [2].
- Should we be providing an address here that aligns with the address
that we want containerd to listen on (i.e. the OAM host-specific address)?
- Do we want to use the OAM address here? Should this be the MGMT
host-specific address or loopback?
I suspect that if you restarted containerd on controller-0 (currently standby)
it would get the correct address and things would work correctly.
We should specify an explicit address for containerd to bind to, so that it
does not rely on IP discovery when the containerd process starts.
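As a rough illustration of pinning the address, the sketch below renders the two streaming-server settings for a given host address. It assumes the CRI plugin options are named stream_server_address and stream_server_port (the latter a string, "0" meaning any free port). The helper name stream_server_settings is hypothetical, and the enclosing config.toml table is left out because it depends on the containerd version. Which address to pass in (OAM host, MGMT host, or loopback) is still the open question above.

```python
import ipaddress


def stream_server_settings(bind_address, port="0"):
    """Return config.toml lines that pin the CRI streaming server address.

    Hypothetical helper: the option names are assumed from the CRI plugin
    configuration and should be checked against the deployed containerd.
    """
    # Validate and normalise the address; raises ValueError if malformed.
    addr = ipaddress.ip_address(bind_address)
    return (
        'stream_server_address = "{}"\n'
        'stream_server_port = "{}"\n'
    ).format(addr.compressed, port)


if __name__ == "__main__":
    # external_oam_node_0_address from the Ansible values in this report.
    print(stream_server_settings("2620:10A:A001:A103::8"), end="")
```

Normalising through ipaddress also removes the upper/lower-case mismatch between the Ansible values and what the kernel reports.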
# Ansible sets up the following:
external_oam_subnet: 2620:10a:a001:a103::6:0/64
external_oam_gateway_address: 2620:10a:a001:a103::6:0
external_oam_floating_address: 2620:10A:A001:A103::11
external_oam_node_0_address: 2620:10A:A001:A103::8
external_oam_node_1_address: 2620:10A:A001:A103::9
# Controller-0 (standby) after ansible, unlocked, and swact to controller-1:
# Problem: Containerd is listening on the OAM controller address -> should be the node address 2620:10A:A001:A103::8
controller-0:~$ sudo netstat -tulpn6 | grep contain
tcp6 0 0 2620:10a:a001:a10:40870 :::* LISTEN 89515/containerd
controller-0:~$ ip a
3: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 18:66:da:af:b5:88 brd ff:ff:ff:ff:ff:ff
inet6 2620:10a:a001:a103::8/64 scope global <------- containerd should be listening here.
valid_lft forever preferred_lft forever
inet6 fe80::1a66:daff:feaf:b588/64 scope link
valid_lft forever preferred_lft forever
14: vlan157@pxeboot0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc htb state UP group default qlen 1000
link/ether 3c:fd:fe:25:d5:c0 brd ff:ff:ff:ff:ff:ff
inet6 face::2/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::3efd:feff:fe25:d5c0/64 scope link
valid_lft forever preferred_lft forever
15: vlan158@pxeboot0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 3c:fd:fe:25:d5:c0 brd ff:ff:ff:ff:ff:ff
inet6 feed:beef::2/64 scope site
valid_lft forever preferred_lft forever
inet6 fe80::3efd:feff:fe25:d5c0/64 scope link
valid_lft forever preferred_lft forever
# Controller-1 (active) after unlock and swact from controller-0:
# Containerd is listening on the correct address: 2620:10A:A001:A103::9
controller-1:~$ sudo netstat -tulpn6 | grep contain
tcp6 0 0 2620:10a:a001:a10:39273 :::* LISTEN 93111/containerd
controller-1:~$ cat /etc/hosts | grep oamcontroller
2620:10a:a001:a103::11 oamcontroller
controller-1:~$ ip a
3: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 18:66:da:af:bc:d2 brd ff:ff:ff:ff:ff:ff
inet6 2620:10a:a001:a103::11/64 scope global
valid_lft forever preferred_lft forever <----- Missing “0sec” on floating OAM IP
inet6 2620:10a:a001:a103::9/64 scope global <----- containerd should be and IS listening here
valid_lft forever preferred_lft forever
inet6 fe80::1a66:daff:feaf:bcd2/64 scope link
valid_lft forever preferred_lft forever
15: vlan157@pxeboot0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc htb state UP group default qlen 1000
link/ether 3c:fd:fe:25:d7:00 brd ff:ff:ff:ff:ff:ff
inet6 face::4/64 scope global deprecated
valid_lft forever preferred_lft 0sec <----- Per https://review.opendev.org/#/c/715120/
inet6 face::1/64 scope global deprecated
valid_lft forever preferred_lft 0sec <----- Per https://review.opendev.org/#/c/715120/
inet6 face::3/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::3efd:feff:fe25:d700/64 scope link
valid_lft forever preferred_lft forever
16: vlan158@pxeboot0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 3c:fd:fe:25:d7:00 brd ff:ff:ff:ff:ff:ff
inet6 feed:beef::1/64 scope site deprecated
valid_lft forever preferred_lft 0sec <----- Per https://review.opendev.org/#/c/715120/
inet6 feed:beef::3/64 scope site
valid_lft forever preferred_lft forever
inet6 fe80::3efd:feff:fe25:d700/64 scope link
valid_lft forever preferred_lft forever
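To make the "Missing 0sec" observation concrete, here is a small sketch that lists, per interface, the global inet6 addresses that are not flagged deprecated in `ip a` output. It is only an illustration of which addresses remain candidates for discovery; it is not how containerd actually selects its address, and the function name is hypothetical.

```python
import re


def preferred_global_inet6(ip_a_output):
    """Map interface name -> global inet6 addresses not flagged 'deprecated'."""
    result = {}
    iface = None
    for raw in ip_a_output.splitlines():
        line = raw.strip()
        # Interface header, e.g. "15: vlan157@pxeboot0: <BROADCAST,...> mtu 1500"
        header = re.match(r"^\d+:\s+([^:\s]+):", line)
        if header:
            iface = header.group(1).split("@")[0]
            result[iface] = []
            continue
        # Address line, e.g. "inet6 face::4/64 scope global deprecated"
        addr = re.match(r"^inet6\s+(\S+)/\d+\s+scope\s+(\w+)(.*)$", line)
        if addr and iface is not None:
            address, scope, flags = addr.groups()
            if scope == "global" and "deprecated" not in flags.split():
                result[iface].append(address)
    return result


SAMPLE = """\
3: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    inet6 2620:10a:a001:a103::11/64 scope global
       valid_lft forever preferred_lft forever
    inet6 2620:10a:a001:a103::9/64 scope global
       valid_lft forever preferred_lft forever
15: vlan157@pxeboot0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc htb state UP group default qlen 1000
    inet6 face::4/64 scope global deprecated
       valid_lft forever preferred_lft 0sec
    inet6 face::1/64 scope global deprecated
       valid_lft forever preferred_lft 0sec
    inet6 face::3/64 scope global
       valid_lft forever preferred_lft forever
"""

if __name__ == "__main__":
    for name, addresses in preferred_global_inet6(SAMPLE).items():
        print(name, addresses)
```

On the controller-1 excerpt, eno1 keeps both the floating (::11) and host (::9) OAM addresses as candidates, while vlan157 keeps only face::3.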
# Tiller is accessible from the mgmt network floating IP
controller-1:~$ kubectl get ep -n kube-system tiller-deploy
NAME ENDPOINTS AGE
tiller-deploy [face::2]:44134 14h
# Last message from tiller probably aligned with when controller-0 was active
2020-04-28T18:29:39.174576873Z stderr F [tiller] 2020/04/28 18:29:39 executing 0 post-delete hooks for oidc-oidc-client
2020-04-28T18:29:39.17459409Z stderr F [tiller] 2020/04/28 18:29:39 hooks complete for post-delete oidc-oidc-client
2020-04-28T18:29:39.17460406Z stderr F [tiller] 2020/04/28 18:29:39 purge requested for oidc-oidc-client
2020-04-28T18:29:39.174611595Z stderr F [storage] 2020/04/28 18:29:39 deleting release "oidc-oidc-client.v1"
2020-04-28T18:34:53.429325711Z stderr F [storage] 2020/04/28 18:34:53 listing all releases with filter
# From sm-customer.log, so yes controller-0 was active at the time of the last tiller logs
| 2020-04-28T18:00:56.910 | 196 | service-group-scn | controller-services | standby-degraded | standby |
| 2020-04-28T19:14:08.477 | 197 | node-scn | controller-1 | | swact | issued against host controller-0
# Tiller pod is on controller-0
controller-1:~$ kubectl get pods --all-namespaces -o wide -w | grep tiller
kube-system tiller-deploy-5c8dd9fb56-lgldr 1/1 Running 0 11h face::2 controller-0 <none> <none>
# Get the container id
controller-1:~$ kubectl describe pods -n kube-system tiller-deploy-5c8dd9fb56-lgldr | grep containerd
Container ID: containerd://bb7618a8a6305780322c804ef0991e102596d0d8448c9387e6ad4cbeedd28185
# Can't connect from controller-1: OAM floating address is assigned to controller-1 but
# controller-0 has containerd listening on that same address
controller-1:~$ kubectl exec -it -n kube-system tiller-deploy-5c8dd9fb56-lgldr -- ls
error: unable to upgrade connection: error dialing backend: dial tcp [2620:10a:a001:a103::11]:40870: connect: connection timed out
# Can't connect from controller-0: OAM floating address is assigned to controller-1 but
# controller-0 has containerd listening on that same address
controller-0:~# DEBUG=1 crictl exec -it bb7618a8a6305 ls
FATA[0127] execing command in container failed: error sending request: Post "http://[2620:10a:a001:a103::11]:40870/exec/va5vLJhQ": dial tcp [2620:10a:a001:a103::11]:40870: connect: connection timed out
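The check behind the two failures above can be sketched as follows: pull the dial target out of the error text and compare it against the OAM addresses from the Ansible values. The function names and the OAM mapping layout are hypothetical; the addresses are the ones listed in this report.

```python
import ipaddress
import re

# OAM addresses as configured by Ansible in this report.
OAM = {
    "floating": "2620:10A:A001:A103::11",
    "controller-0": "2620:10A:A001:A103::8",
    "controller-1": "2620:10A:A001:A103::9",
}


def dial_target(error_text):
    """Return (address, port) from the first 'dial tcp [addr]:port', or None."""
    match = re.search(r"dial tcp \[([0-9A-Fa-f:]+)\]:(\d+)", error_text)
    if match is None:
        return None
    return ipaddress.ip_address(match.group(1)), int(match.group(2))


def classify_dial_target(error_text, oam=OAM):
    """Name the OAM address the client tried to dial, 'unknown', or None."""
    target = dial_target(error_text)
    if target is None:
        return None
    address, _port = target
    for name, value in oam.items():
        # Compare parsed addresses so letter case and zero compression do not matter.
        if ipaddress.ip_address(value) == address:
            return name
    return "unknown"


if __name__ == "__main__":
    err = ("error: unable to upgrade connection: error dialing backend: "
           "dial tcp [2620:10a:a001:a103::11]:40870: connect: connection timed out")
    print(classify_dial_target(err))
```

For both the kubectl and crictl errors the dialed address is the floating one, which matches the diagnosis that containerd on controller-0 advertised an address it no longer holds.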
[1] https://review.opendev.org/#/c/715120/
[2] https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/templates/config.toml.erb#L29
Look at the lab and related logs:
controller-0:~$ sudo netstat -tulp6 | grep contain
tcp6 0 0 oamcontroller:40870 [::]:* LISTEN 89515/containerd
controller-0:~$ cat /etc/resolv.conf
nameserver face::1
nameserver 2620:10a:a001:a103::2
controller-0:~$ cat /etc/hosts | grep oamcontroller
2620:10a:a001:a103::11 oamcontroller
controller-1:~$ sudo netstat -tulp6 | grep contain
tcp6 0 0 v6-150-9.yow.lab.:39273 [::]:* LISTEN 93111/containerd
rchurch@yow-tuxlab2:~$ nslookup v6-150-9
Server: 127.0.0.1
Address: 127.0.0.1#53
Name: v6-150-9.yow.lab.wrs.com
Address: 2620:10a:a001:a103::9
controller-1:~$ cat /etc/resolv.conf
nameserver face::1
nameserver 2620:10a:a001:a103::2