Comment 0 for bug 1918130

Mihnea Saracin (msaracin) wrote :

Brief Description
-----------------

A Standard system deployment fails when running the bootstrap playbook at the 'Initializing Kubernetes master' step. The problem is that the etcd endpoint in the kubeadm config file is different from the one in the etcd config files.

Steps to Reproduce
------------------

Deploy a Standard system with 'cluster_host_start_address' defined in localhost.yml and set to something other than the first address of the 'cluster_host_subnet'.
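For illustration, a minimal localhost.yml fragment that sets up this condition (the subnet is the StarlingX default; the start address is an example consistent with the logs below):

cluster_host_subnet: 192.168.206.0/24
# any value other than the first host address (192.168.206.1) triggers the issue
cluster_host_start_address: 192.168.206.2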

Expected Behavior
-----------------
Bootstrap playbook completes successfully

Actual Behavior
---------------
Bootstrap playbook fails

Reproducibility
---------------
9/9

System Configuration
--------------------
Standard System

Branch/Pull Time/Commit
-----------------------
stx master build on "2021-03-01"

Last Pass
---------
N/A

Timestamp/Logs
--------------
TASK [bootstrap/bringup-essential-services : Initializing Kubernetes master] ***
fatal: [localhost]: FAILED! =>

{"changed": true, "cmd": ["kubeadm", "init", "--ignore-preflight-errors=DirAvailable--var-lib-etcd", "--config=/etc/kubernetes/kubeadm.yaml"], "delta": "0:00:15.145010", "end": "2021-03-02 18:45:29.366854", "msg": "non-zero return code", "rc": 1, "start": "2021-03-02 18:45:14.221844", "stderr": "W0302 18:45:14.265803 123259 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]\nerror execution phase preflight: [preflight] Some fatal errors occurred:\n\t[ERROR ExternalEtcdVersion]: Get https://192.168.206.2:2379/version: dial tcp 192.168.206.2:2379: connect: connection refused\n[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`\nTo see the stack trace of this error execute with --v=5 or higher", "stderr_lines": ["W0302 18:45:14.265803 123259 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]", "error execution phase preflight: [preflight] Some fatal errors occurred:", "\t[ERROR ExternalEtcdVersion]: Get https://192.168.206.2:2379/version: dial tcp 192.168.206.2:2379: connect: connection refused", "[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "[init] Using Kubernetes version: v1.18.1\n[preflight] Running pre-flight checks", "stdout_lines": ["[init] Using Kubernetes version: v1.18.1", "[preflight] Running pre-flight checks"]}

From what I see, it seems that the etcd endpoint defined in the kubeadm file is different from the one that etcd listens on:

controller-0:~$ cat /etc/kubernetes/kubeadm.yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 192.168.206.3
nodeRegistration:
  criSocket: "/var/run/containerd/containerd.sock"
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  certSANs:
  - 192.168.206.2
  - 127.0.0.1
  - 128.224.150.54
  - 128.224.150.219
  - 128.224.150.212
  extraArgs:
    default-not-ready-toleration-seconds: "30"
    default-unreachable-toleration-seconds: "30"
    feature-gates: "SCTPSupport=true,TTLAfterFinished=true,HugePageStorageMediumSize=true"
    event-ttl: "24h"
    encryption-provider-config: /etc/kubernetes/encryption-provider.yaml
  extraVolumes:
  - name: "encryption-config"
    hostPath: /etc/kubernetes/encryption-provider.yaml
    mountPath: /etc/kubernetes/encryption-provider.yaml
    readOnly: true
    pathType: File
controllerManager:
  extraArgs:
    node-monitor-period: "2s"
    node-monitor-grace-period: "20s"
    pod-eviction-timeout: "30s"
    feature-gates: "TTLAfterFinished=true"
    flex-volume-plugin-dir: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
controlPlaneEndpoint: 192.168.206.2
etcd:
  external:
    endpoints:
    - https://192.168.206.2:2379
    caFile: /etc/kubernetes/pki/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
imageRepository: "registry.local:9001/k8s.gcr.io"
kubernetesVersion: v1.18.1
networking:
  dnsDomain: cluster.local
  podSubnet: 172.16.0.0/16
  serviceSubnet: 10.96.0.0/12
---
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
nodeStatusUpdateFrequency: "4s"
featureGates:
  HugePageStorageMediumSize: true
failSwapOn: false
cgroupRoot: "/k8s-infra"

######################

controller-0:/etc/etcd# cat /etc/etcd/etcd.conf
#[member]
ETCD_NAME="controller"
ETCD_DATA_DIR="/opt/etcd/20.12/controller.etcd"
ETCD_SNAPSHOT_COUNT=10000
ETCD_HEARTBEAT_INTERVAL=100
ETCD_ELECTION_TIMEOUT=1000
ETCD_LISTEN_CLIENT_URLS="https://192.168.206.1:2379"
ETCD_ADVERTISE_CLIENT_URLS="https://192.168.206.1:2379"
ETCD_MAX_SNAPSHOTS=5
ETCD_MAX_WALS=5
ETCD_ENABLE_V2=true
#
#[proxy]
ETCD_PROXY="off"
ETCD_PROXY_FAILURE_WAIT=5000
ETCD_PROXY_REFRESH_INTERVAL=30000
ETCD_PROXY_DIAL_TIMEOUT=1000
ETCD_PROXY_WRITE_TIMEOUT=5000
ETCD_PROXY_READ_TIMEOUT=0
#
#[security]
ETCD_CERT_FILE="/etc/etcd/etcd-server.crt"
ETCD_KEY_FILE="/etc/etcd/etcd-server.key"
ETCD_CLIENT_CERT_AUTH=true
ETCD_TRUSTED_CA_FILE="/etc/etcd/ca.crt"
ETCD_PEER_CLIENT_CERT_AUTH=false
#
#[logging]
ETCD_DEBUG=false

######################

controller-0:/etc/etcd# cat /etc/etcd/etcd.yml
# Managed by Puppet
# Source URL: https://raw.githubusercontent.com/coreos/etcd/master/etcd.conf.yml.sample
# This is the configuration file for the etcd server.

# Human-readable name for this member.
name: "controller"

# Path to the data directory.
data-dir: "/opt/etcd/20.12/controller.etcd"

# Number of committed transactions to trigger a snapshot to disk.
snapshot-count: 10000

# Time (in milliseconds) of a heartbeat interval.
heartbeat-interval: 100

# Time (in milliseconds) for an election to timeout.
election-timeout: 1000

# Raise alarms when backend size exceeds the given quota. 0 means use the
# default quota.
quota-backend-bytes: 0

# List of comma separated URLs to listen on for client traffic.
listen-client-urls: "https://192.168.206.1:2379"

# Maximum number of snapshot files to retain (0 is unlimited).
max-snapshots: 5

# Maximum number of wal files to retain (0 is unlimited).
max-wals: 5

# List of this member's client URLs to advertise to the public.
# The URLs needed to be a comma-separated list.
advertise-client-urls: "https://192.168.206.1:2379"

# Accept etcd V2 client requests
enable-v2: true

# Valid values include 'on', 'readonly', 'off'
proxy: "off"

# Time (in milliseconds) an endpoint will be held in a failed state.
proxy-failure-wait: 5000

# Time (in milliseconds) of the endpoints refresh interval.
proxy-refresh-interval: 30000

# Time (in milliseconds) for a dial to timeout.
proxy-dial-timeout: 1000

# Time (in milliseconds) for a write to timeout.
proxy-write-timeout: 5000

# Time (in milliseconds) for a read to timeout.
proxy-read-timeout: 0

client-transport-security:
  # Path to the client server TLS cert file.
  cert-file: "/etc/etcd/etcd-server.crt"

  # Path to the client server TLS key file.
  key-file:

  # Enable client cert authentication.
  client-cert-auth: true

  # Path to the client server TLS trusted CA key file.
  trusted-ca-file: "/etc/etcd/ca.crt"

# Enable debug-level logging for etcd.
debug: false

######################

In the kubeadm file we have https://192.168.206.2:2379, but in the etcd config files we have https://192.168.206.1:2379.
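Assuming the certificate paths from kubeadm.yaml are in place and valid for the etcd server (an assumption; this check is not in the original logs), the mismatch can be confirmed by hand with the same GET that the kubeadm preflight check performs. Only the address etcd actually listens on responds:

controller-0:~$ curl --cacert /etc/kubernetes/pki/ca.crt \
    --cert /etc/kubernetes/pki/apiserver-etcd-client.crt \
    --key /etc/kubernetes/pki/apiserver-etcd-client.key \
    https://192.168.206.1:2379/version   # answers with the etcd version
controller-0:~$ curl --cacert /etc/kubernetes/pki/ca.crt \
    --cert /etc/kubernetes/pki/apiserver-etcd-client.crt \
    --key /etc/kubernetes/pki/apiserver-etcd-client.key \
    https://192.168.206.2:2379/version   # connect: connection refused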

I think the commit that introduced this is:

https://review.opendev.org/c/starlingx/ansible-playbooks/+/760512/39

This change in particular:

https://review.opendev.org/c/starlingx/ansible-playbooks/+/760512/39/playbookconfig/src/playbooks/roles/bootstrap/apply-bootstrap-manifest/tasks/main.yml

#####

The issue reproduces if the 'cluster_host_start_address' defined in localhost.yml is different from the first address of the 'cluster_host_subnet':

The cluster_floating_address: "{{ address_pairs['cluster_host']['start'] }}" will be different from the default_cluster_host_start_address: "{{ (cluster_host_subnet | ipaddr(1)).split('/')[0] }}", and the two values propagate to different places (see the worked example after this list):
 - platform::etcd::params::bind_address: "{{ default_cluster_host_start_address }}" (I think this is the one that gets inserted in the etcd config) (https://opendev.org/starlingx/ansible-playbooks/src/commit/820f347324d4fb7d17396d5d21f98eb2b674d23a/playbookconfig/src/playbooks/roles/bootstrap/apply-bootstrap-manifest/tasks/main.yml#L157)

 - ETCD_ENDPOINT: "https://{{ cluster_floating_address | ipwrap }}:2379" (https://opendev.org/starlingx/ansible-playbooks/src/commit/820f347324d4fb7d17396d5d21f98eb2b674d23a/playbookconfig/src/playbooks/roles/bootstrap/bringup-essential-services/tasks/bringup_kubemaster.yml#L151), which ends up in the kubeadm file.
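Worked through with the values from this deployment (assuming the default cluster_host_subnet of 192.168.206.0/24 and cluster_host_start_address: 192.168.206.2, which matches the addresses in the logs):

default_cluster_host_start_address = ('192.168.206.0/24' | ipaddr(1)) = 192.168.206.1  -> etcd bind address
cluster_floating_address = address_pairs['cluster_host']['start'] = 192.168.206.2      -> ETCD_ENDPOINT = https://192.168.206.2:2379

etcd therefore listens on 192.168.206.1 while kubeadm is told to reach it at 192.168.206.2, which is exactly the 'connection refused' seen in the preflight error.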

We can see in the ansible.log:

The bind_address

2021-03-02 18:34:01,656 p=11972 u=sysadmin | changed: [localhost] => (item=platform::etcd::params::bind_address: 192.168.206.1)

The ETCD_ENDPOINT

2021-03-02 18:45:10,760 p=11972 u=sysadmin | changed: [localhost] => (item=sed -i -e 's|<%= @etcd_endpoint %>|'$ETCD_ENDPOINT'|g' /etc/kubernetes/kubeadm.yaml)

This is the step where the templated placeholder <%= @etcd_endpoint %> in kubeadm.yaml is replaced with the value of $ETCD_ENDPOINT, i.e. https://192.168.206.2:2379.

In the auth.log:

auth.log:2021-03-02T18:45:11.000 localhost sudo: notice sysadmin : TTY=ttyS0 ; PWD=/usr/share/ansible/stx-ansible/playbooks ; USER=root ; COMMAND=/bin/sh -c echo BECOME-SUCCESS-llfipwqxlrukgtjrwlrelhqnuejcnvvd; POD_NETWORK_CIDR=172.16.0.0/16 CONTROLPLANE_ENDPOINT=192.168.206.2 VOLUME_PLUGIN_DIR=/usr/libexec/kubernetes/kubelet-plugins/volume/exec/ ETCD_ENDPOINT=https://192.168.206.2:2379 APISERVER_ADVERTISE_ADDRESS=192.168.206.3 SERVICE_NETWORK_CIDR=10.96.0.0/12 /usr/bin/python /tmp/.ansible-sysadmin/tmp/ansible-tmp-1614710711.05-29948722968306/AnsiballZ_command.py
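For reference, a minimal sketch of one possible direction for a fix (hypothetical; not the committed solution): have both sides derive from the same variable, e.g. in apply-bootstrap-manifest/tasks/main.yml set the etcd bind address from the floating address that bringup_kubemaster.yml already wraps into ETCD_ENDPOINT:

# Hypothetical sketch only: keep the etcd bind address and the kubeadm
# ETCD_ENDPOINT pointing at the same address.
platform::etcd::params::bind_address: "{{ cluster_floating_address }}"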

Test Activity
-------------
Developer Testing