Brief Description
-----------------
Standard system fails when running the bootstrap playbook at the 'Initializing Kubernetes master' step. The problem is that the etcd endpoint in the kubeadm file differs from the one in the etcd config files.
Steps to Reproduce
------------------
Deploy a Standard system with 'cluster_host_start_address' defined in localhost.yml and different from the first address of the 'cluster_host_subnet'.
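As a minimal illustration (subnet and addresses are hypothetical, chosen only to mirror the failure below), a localhost.yml fragment along these lines triggers the mismatch:

```yaml
# Hypothetical localhost.yml fragment; values are illustrative.
# The start address is deliberately NOT the first host of the subnet
# (which would be 192.168.206.1), which is what exposes the bug.
cluster_host_subnet: 192.168.206.0/24
cluster_host_start_address: 192.168.206.2
```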
Expected Behavior
------------------
Bootstrap playbook completes successfully
Actual Behavior
----------------
Bootstrap playbook fails
Reproducibility
---------------
9/9
System Configuration
--------------------
Standard System
Branch/Pull Time/Commit
-----------------------
stx master build on "2021-03-01"
Last Pass
---------
N/A
Timestamp/Logs
--------------
TASK [bootstrap/bringup-essential-services : Initializing Kubernetes master] ***
fatal: [localhost]: FAILED! =>
{"changed": true, "cmd": ["kubeadm", "init", "--ignore-preflight-errors=DirAvailable--var-lib-etcd", "--config=/etc/kubernetes/kubeadm.yaml"], "delta": "0:00:15.145010", "end": "2021-03-02 18:45:29.366854", "msg": "non-zero return code", "rc": 1, "start": "2021-03-02 18:45:14.221844", "stderr": "W0302 18:45:14.265803 123259 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]\nerror execution phase preflight: [preflight] Some fatal errors occurred:\n\t[ERROR ExternalEtcdVersion]: Get https://192.168.206.2:2379/version: dial tcp 192.168.206.2:2379: connect: connection refused\n[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`\nTo see the stack trace of this error execute with --v=5 or higher", "stderr_lines": ["W0302 18:45:14.265803 123259 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]", "error execution phase preflight: [preflight] Some fatal errors occurred:", "\t[ERROR ExternalEtcdVersion]: Get https://192.168.206.2:2379/version: dial tcp 192.168.206.2:2379: connect: connection refused", "[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "[init] Using Kubernetes version: v1.18.1\n[preflight] Running pre-flight checks", "stdout_lines": ["[init] Using Kubernetes version: v1.18.1", "[preflight] Running pre-flight checks"]}
From what I can see, the etcd endpoint defined in the kubeadm file differs from the one that etcd listens on:
controller-0:~$ cat /etc/kubernetes/kubeadm.yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 192.168.206.3
nodeRegistration:
  criSocket: "/var/run/containerd/containerd.sock"
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  certSANs:
  - 192.168.206.2
  - 127.0.0.1
  - 128.224.150.54
  - 128.224.150.219
  - 128.224.150.212
  extraArgs:
    default-not-ready-toleration-seconds: "30"
    default-unreachable-toleration-seconds: "30"
    feature-gates: "SCTPSupport=true,TTLAfterFinished=true,HugePageStorageMediumSize=true"
    event-ttl: "24h"
    encryption-provider-config: /etc/kubernetes/encryption-provider.yaml
  extraVolumes:
  - name: "encryption-config"
    hostPath: /etc/kubernetes/encryption-provider.yaml
    mountPath: /etc/kubernetes/encryption-provider.yaml
    readOnly: true
    pathType: File
controllerManager:
  extraArgs:
    node-monitor-period: "2s"
    node-monitor-grace-period: "20s"
    pod-eviction-timeout: "30s"
    feature-gates: "TTLAfterFinished=true"
    flex-volume-plugin-dir: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
controlPlaneEndpoint: 192.168.206.2
etcd:
  external:
    endpoints:
    - https://192.168.206.2:2379
    caFile: /etc/kubernetes/pki/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
imageRepository: "registry.local:9001/k8s.gcr.io"
kubernetesVersion: v1.18.1
networking:
  dnsDomain: cluster.local
  podSubnet: 172.16.0.0/16
  serviceSubnet: 10.96.0.0/12
---
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
nodeStatusUpdateFrequency: "4s"
featureGates:
  HugePageStorageMediumSize: true
failSwapOn: false
cgroupRoot: "/k8s-infra"
#######################
controller-0:/etc/etcd# cat /etc/etcd/etcd.conf
#[member]
ETCD_NAME="controller"
ETCD_DATA_DIR="/opt/etcd/20.12/controller.etcd"
ETCD_SNAPSHOT_COUNT=10000
ETCD_HEARTBEAT_INTERVAL=100
ETCD_ELECTION_TIMEOUT=1000
ETCD_LISTEN_CLIENT_URLS="https://192.168.206.1:2379"
ETCD_ADVERTISE_CLIENT_URLS="https://192.168.206.1:2379"
ETCD_MAX_SNAPSHOTS=5
ETCD_MAX_WALS=5
ETCD_ENABLE_V2=true
#
#[proxy]
ETCD_PROXY="off"
ETCD_PROXY_FAILURE_WAIT=5000
ETCD_PROXY_REFRESH_INTERVAL=30000
ETCD_PROXY_DIAL_TIMEOUT=1000
ETCD_PROXY_WRITE_TIMEOUT=5000
ETCD_PROXY_READ_TIMEOUT=0
#
#[security]
ETCD_CERT_FILE="/etc/etcd/etcd-server.crt"
ETCD_KEY_FILE="/etc/etcd/etcd-server.key"
ETCD_CLIENT_CERT_AUTH=true
ETCD_TRUSTED_CA_FILE="/etc/etcd/ca.crt"
ETCD_PEER_CLIENT_CERT_AUTH=false
#
#[logging]
ETCD_DEBUG=false
#######################
controller-0:/etc/etcd# cat /etc/etcd/etcd.yml
# Managed by Puppet
# Source URL: https://raw.githubusercontent.com/coreos/etcd/master/etcd.conf.yml.sample
# This is the configuration file for the etcd server.
# Human-readable name for this member.
name: "controller"
# Path to the data directory.
data-dir: "/opt/etcd/20.12/controller.etcd"
# Number of committed transactions to trigger a snapshot to disk.
snapshot-count: 10000
# Time (in milliseconds) of a heartbeat interval.
heartbeat-interval: 100
# Time (in milliseconds) for an election to timeout.
election-timeout: 1000
# Raise alarms when backend size exceeds the given quota. 0 means use the
# default quota.
quota-backend-bytes: 0
# List of comma separated URLs to listen on for client traffic.
listen-client-urls: "https://192.168.206.1:2379"
# Maximum number of snapshot files to retain (0 is unlimited).
max-snapshots: 5
# Maximum number of wal files to retain (0 is unlimited).
max-wals: 5
# List of this member's client URLs to advertise to the public.
# The URLs needed to be a comma-separated list.
advertise-client-urls: "https://192.168.206.1:2379"
# Accept etcd V2 client requests
enable-v2: true
# Valid values include 'on', 'readonly', 'off'
proxy: "off"
# Time (in milliseconds) an endpoint will be held in a failed state.
proxy-failure-wait: 5000
# Time (in milliseconds) of the endpoints refresh interval.
proxy-refresh-interval: 30000
# Time (in milliseconds) for a dial to timeout.
proxy-dial-timeout: 1000
# Time (in milliseconds) for a write to timeout.
proxy-write-timeout: 5000
# Time (in milliseconds) for a read to timeout.
proxy-read-timeout: 0
client-transport-security:
  # Path to the client server TLS cert file.
  cert-file: "/etc/etcd/etcd-server.crt"
  # Path to the client server TLS key file.
  key-file:
  # Enable client cert authentication.
  client-cert-auth: true
  # Path to the client server TLS trusted CA key file.
  trusted-ca-file: "/etc/etcd/ca.crt"
# Enable debug-level logging for etcd.
debug: false
#######################
In the kubeadm file we have https://192.168.206.2:2379 but in the etcd config files we have https://192.168.206.1:2379
I think the commit that introduced this is:
https://review.opendev.org/c/starlingx/ansible-playbooks/+/760512/39
This change in particular:
https://review.opendev.org/c/starlingx/ansible-playbooks/+/760512/39/playbookconfig/src/playbooks/roles/bootstrap/apply-bootstrap-manifest/tasks/main.yml
#####
The issue reproduces if the 'cluster_host_start_address' defined in localhost.yml is different from the first address of the 'cluster_host_subnet':
- The cluster_floating_address: "{{ address_pairs['cluster_host']['start'] }}" will differ from the 'default_cluster_host_start_address': "{{ (cluster_host_subnet | ipaddr(1)).split('/')[0] }}", so it propagates to "platform::etcd::params::bind_address: {{ default_cluster_host_start_address }}" (I think this is the one that gets inserted into the etcd config) (https://opendev.org/starlingx/ansible-playbooks/src/commit/820f347324d4fb7d17396d5d21f98eb2b674d23a/playbookconfig/src/playbooks/roles/bootstrap/apply-bootstrap-manifest/tasks/main.yml#L157)
- The ETCD_ENDPOINT: "https://{{ cluster_floating_address | ipwrap }}:2379" (https://opendev.org/starlingx/ansible-playbooks/src/commit/820f347324d4fb7d17396d5d21f98eb2b674d23a/playbookconfig/src/playbooks/roles/bootstrap/bringup-essential-services/tasks/bringup_kubemaster.yml#L151) is what gets into the kubeadm file.
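The divergence can be sketched in plain Python (a minimal illustration, not the playbook code; the subnet and start address are the hypothetical values from this report, and the stdlib `ipaddress` module stands in for Ansible's `ipaddr(1)` filter, which returns the first host address of the subnet):

```python
import ipaddress

# Hypothetical inputs mirroring this report.
cluster_host_subnet = "192.168.206.0/24"
cluster_host_start_address = "192.168.206.2"  # user-chosen, not the first host

# Ansible's `ipaddr(1)` yields the subnet's first host address (the playbook
# then strips the /prefix); the stdlib equivalent is indexing the network.
net = ipaddress.ip_network(cluster_host_subnet)
default_cluster_host_start_address = str(net[1])

# etcd ends up bound to the computed default address...
etcd_bind_address = default_cluster_host_start_address
# ...while kubeadm is pointed at the user-chosen floating/start address.
etcd_endpoint = f"https://{cluster_host_start_address}:2379"

print(etcd_bind_address)  # 192.168.206.1
print(etcd_endpoint)      # https://192.168.206.2:2379
# The two disagree whenever the start address is not the subnet's first
# host, which is exactly the ExternalEtcdVersion preflight failure above.
```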
We can see in the ansible.log:
The bind_address:
2021-03-02 18:34:01,656 p=11972 u=sysadmin | changed: [localhost] => (item=platform::etcd::params::bind_address: 192.168.206.1)
The ETCD_ENDPOINT:
2021-03-02 18:45:10,760 p=11972 u=sysadmin | changed: [localhost] => (item=sed -i -e 's|<%= @etcd_endpoint %>|'$ETCD_ENDPOINT'|g' /etc/kubernetes/kubeadm.yaml)
In the auth.log:
auth.log: 2021-03-02T18:45:11.000 localhost sudo: notice sysadmin : TTY=ttyS0 ; PWD=/usr/share/ansible/stx-ansible/playbooks ; USER=root ; COMMAND=/bin/sh -c echo BECOME-SUCCESS-llfipwqxlrukgtjrwlrelhqnuejcnvvd; POD_NETWORK_CIDR=172.16.0.0/16 CONTROLPLANE_ENDPOINT=192.168.206.2 VOLUME_PLUGIN_DIR=/usr/libexec/kubernetes/kubelet-plugins/volume/exec/ ETCD_ENDPOINT=https://192.168.206.2:2379 APISERVER_ADVERTISE_ADDRESS=192.168.206.3 SERVICE_NETWORK_CIDR=10.96.0.0/12 /usr/bin/python /tmp/.ansible-sysadmin/tmp/ansible-tmp-1614710711.05-29948722968306/AnsiballZ_command.py
Test Activity
-------------
Developer Testing