Activity log for bug #1918130

Date Who What changed Old value New value Message
2021-03-08 11:46:21 Mihnea Saracin bug added bug
2021-03-08 11:46:28 Mihnea Saracin starlingx: assignee Mihnea Saracin (msaracin)
2021-03-08 11:47:29 Mihnea Saracin description Brief Description
-----------------
Standard system fails when running the bootstrap playbook at the 'Initializing Kubernetes master' step. The problem is that the etcd endpoint in the kubeadm file is different from the one in the etcd config files.

Steps to Reproduce
------------------
Deploy a Standard system with 'cluster_host_start_address' defined in localhost.yml and different from the first address of the 'cluster_host_subnet'.

Expected Behavior
------------------
Bootstrap playbook completes successfully.

Actual Behavior
----------------
Bootstrap playbook fails.

Reproducibility
---------------
9/9

System Configuration
--------------------
Standard System

Branch/Pull Time/Commit
-----------------------
stx master build on "2021-03-01"

Last Pass
---------
N/A

Timestamp/Logs
--------------
TASK [bootstrap/bringup-essential-services : Initializing Kubernetes master] ***
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["kubeadm", "init", "--ignore-preflight-errors=DirAvailable--var-lib-etcd", "--config=/etc/kubernetes/kubeadm.yaml"], "delta": "0:00:15.145010", "end": "2021-03-02 18:45:29.366854", "msg": "non-zero return code", "rc": 1, "start": "2021-03-02 18:45:14.221844", "stderr": "W0302 18:45:14.265803 123259 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]\nerror execution phase preflight: [preflight] Some fatal errors occurred:\n\t[ERROR ExternalEtcdVersion]: Get https://192.168.206.2:2379/version: dial tcp 192.168.206.2:2379: connect: connection refused\n[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`\nTo see the stack trace of this error execute with --v=5 or higher", "stderr_lines": ["W0302 18:45:14.265803 123259 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]", "error execution phase preflight: [preflight] Some fatal errors occurred:", "\t[ERROR ExternalEtcdVersion]: Get https://192.168.206.2:2379/version: dial tcp 192.168.206.2:2379: connect: connection refused", "[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "[init] Using Kubernetes version: v1.18.1\n[preflight] Running pre-flight checks", "stdout_lines": ["[init] Using Kubernetes version: v1.18.1", "[preflight] Running pre-flight checks"]}

From what I see, it seems that the etcd endpoint defined in the kubeadm file is different from the one that etcd listens on:

controller-0:~$ cat /etc/kubernetes/kubeadm.yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 192.168.206.3
nodeRegistration:
  criSocket: "/var/run/containerd/containerd.sock"
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  certSANs:
  - 192.168.206.2
  - 127.0.0.1
  - 128.224.150.54
  - 128.224.150.219
  - 128.224.150.212
  extraArgs:
    default-not-ready-toleration-seconds: "30"
    default-unreachable-toleration-seconds: "30"
    feature-gates: "SCTPSupport=true,TTLAfterFinished=true,HugePageStorageMediumSize=true"
    event-ttl: "24h"
    encryption-provider-config: /etc/kubernetes/encryption-provider.yaml
  extraVolumes:
  - name: "encryption-config"
    hostPath: /etc/kubernetes/encryption-provider.yaml
    mountPath: /etc/kubernetes/encryption-provider.yaml
    readOnly: true
    pathType: File
controllerManager:
  extraArgs:
    node-monitor-period: "2s"
    node-monitor-grace-period: "20s"
    pod-eviction-timeout: "30s"
    feature-gates: "TTLAfterFinished=true"
    flex-volume-plugin-dir: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
controlPlaneEndpoint: 192.168.206.2
etcd:
  external:
    endpoints:
    - https://192.168.206.2:2379
    caFile: /etc/kubernetes/pki/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
imageRepository: "registry.local:9001/k8s.gcr.io"
kubernetesVersion: v1.18.1
networking:
  dnsDomain: cluster.local
  podSubnet: 172.16.0.0/16
  serviceSubnet: 10.96.0.0/12
---
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
nodeStatusUpdateFrequency: "4s"
featureGates:
  HugePageStorageMediumSize: true
failSwapOn: false
cgroupRoot: "/k8s-infra"

######################
controller-0:/etc/etcd# cat /etc/etcd/etcd.conf
#[member]
ETCD_NAME="controller"
ETCD_DATA_DIR="/opt/etcd/20.12/controller.etcd"
ETCD_SNAPSHOT_COUNT=10000
ETCD_HEARTBEAT_INTERVAL=100
ETCD_ELECTION_TIMEOUT=1000
ETCD_LISTEN_CLIENT_URLS="https://192.168.206.1:2379"
ETCD_ADVERTISE_CLIENT_URLS="https://192.168.206.1:2379"
ETCD_MAX_SNAPSHOTS=5
ETCD_MAX_WALS=5
ETCD_ENABLE_V2=true
#
#[proxy]
ETCD_PROXY="off"
ETCD_PROXY_FAILURE_WAIT=5000
ETCD_PROXY_REFRESH_INTERVAL=30000
ETCD_PROXY_DIAL_TIMEOUT=1000
ETCD_PROXY_WRITE_TIMEOUT=5000
ETCD_PROXY_READ_TIMEOUT=0
#
#[security]
ETCD_CERT_FILE="/etc/etcd/etcd-server.crt"
ETCD_KEY_FILE="/etc/etcd/etcd-server.key"
ETCD_CLIENT_CERT_AUTH=true
ETCD_TRUSTED_CA_FILE="/etc/etcd/ca.crt"
ETCD_PEER_CLIENT_CERT_AUTH=false
#
#[logging]
ETCD_DEBUG=false

######################
controller-0:/etc/etcd# cat /etc/etcd/etcd.yml
# Managed by Puppet
# Source URL: https://raw.githubusercontent.com/coreos/etcd/master/etcd.conf.yml.sample
# This is the configuration file for the etcd server.

# Human-readable name for this member.
name: "controller"
# Path to the data directory.
data-dir: "/opt/etcd/20.12/controller.etcd"
# Number of committed transactions to trigger a snapshot to disk.
snapshot-count: 10000
# Time (in milliseconds) of a heartbeat interval.
heartbeat-interval: 100
# Time (in milliseconds) for an election to timeout.
election-timeout: 1000
# Raise alarms when backend size exceeds the given quota. 0 means use the
# default quota.
quota-backend-bytes: 0
# List of comma separated URLs to listen on for client traffic.
listen-client-urls: "https://192.168.206.1:2379"
# Maximum number of snapshot files to retain (0 is unlimited).
max-snapshots: 5
# Maximum number of wal files to retain (0 is unlimited).
max-wals: 5
# List of this member's client URLs to advertise to the public.
# The URLs needed to be a comma-separated list.
advertise-client-urls: "https://192.168.206.1:2379"
# Accept etcd V2 client requests
enable-v2: true
# Valid values include 'on', 'readonly', 'off'
proxy: "off"
# Time (in milliseconds) an endpoint will be held in a failed state.
proxy-failure-wait: 5000
# Time (in milliseconds) of the endpoints refresh interval.
proxy-refresh-interval: 30000
# Time (in milliseconds) for a dial to timeout.
proxy-dial-timeout: 1000
# Time (in milliseconds) for a write to timeout.
proxy-write-timeout: 5000
# Time (in milliseconds) for a read to timeout.
proxy-read-timeout: 0
client-transport-security:
  # Path to the client server TLS cert file.
  cert-file: "/etc/etcd/etcd-server.crt"
  # Path to the client server TLS key file.
  key-file:
  # Enable client cert authentication.
  client-cert-auth: true
  # Path to the client server TLS trusted CA key file.
  trusted-ca-file: "/etc/etcd/ca.crt"
# Enable debug-level logging for etcd.
debug: false

######################
In the kubeadm file we have https://192.168.206.2:2379, but in the etcd config files we have https://192.168.206.1:2379.

I think the commit that introduced this is: https://review.opendev.org/c/starlingx/ansible-playbooks/+/760512/39
This change in particular: https://review.opendev.org/c/starlingx/ansible-playbooks/+/760512/39/playbookconfig/src/playbooks/roles/bootstrap/apply-bootstrap-manifest/tasks/main.yml

#####
The issue reproduces if the 'cluster_host_start_address' defined in localhost.yml is different from the first address of the 'cluster_host_subnet':
cluster_floating_address: "{{ address_pairs['cluster_host']['start'] }}" will be different from default_cluster_host_start_address: "{{ (cluster_host_subnet | ipaddr(1)).split('/')[0] }}", so it propagates to:
- platform::etcd::params::bind_address: "{{ default_cluster_host_start_address }}" (I think this is the one that gets inserted in the etcd config) (https://opendev.org/starlingx/ansible-playbooks/src/commit/820f347324d4fb7d17396d5d21f98eb2b674d23a/playbookconfig/src/playbooks/roles/bootstrap/apply-bootstrap-manifest/tasks/main.yml#L157)
- ETCD_ENDPOINT: "https://{{ cluster_floating_address | ipwrap }}:2379" (https://opendev.org/starlingx/ansible-playbooks/src/commit/820f347324d4fb7d17396d5d21f98eb2b674d23a/playbookconfig/src/playbooks/roles/bootstrap/bringup-essential-services/tasks/bringup_kubemaster.yml#L151), which gets into the kubeadm file.
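The divergence between the two variables can be sketched with Python's stdlib ipaddress module (the playbook itself uses the Ansible ipaddr filter; the subnet and override values here are illustrative, mirroring the addresses in this report):

```python
import ipaddress

# Illustrative inputs mirroring this report.
cluster_host_subnet = "192.168.206.0/24"
cluster_host_start_address = "192.168.206.2"  # user override in localhost.yml

# Equivalent of "{{ (cluster_host_subnet | ipaddr(1)).split('/')[0] }}":
# the first usable address of the subnet.
net = ipaddress.ip_network(cluster_host_subnet)
default_cluster_host_start_address = str(net[1])  # "192.168.206.1"

# etcd is told to bind to the subnet's first address...
etcd_bind_address = default_cluster_host_start_address
# ...while kubeadm is pointed at the configured start (floating) address.
etcd_endpoint = f"https://{cluster_host_start_address}:2379"

# The two disagree whenever the override is not the subnet's first address.
print(etcd_bind_address)  # 192.168.206.1
print(etcd_endpoint)      # https://192.168.206.2:2379
```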
We can see this in the ansible.log:

The bind_address:
2021-03-02 18:34:01,656 p=11972 u=sysadmin | changed: [localhost] => (item=platform::etcd::params::bind_address: 192.168.206.1)

The ETCD_ENDPOINT:
2021-03-02 18:45:10,760 p=11972 u=sysadmin | changed: [localhost] => (item=sed -i -e 's|<%= @etcd_endpoint %>|'$ETCD_ENDPOINT'|g' /etc/kubernetes/kubeadm.yaml)

And in the auth.log:
auth.log:2021-03-02T18:45:11.000 localhost sudo: notice sysadmin : TTY=ttyS0 ; PWD=/usr/share/ansible/stx-ansible/playbooks ; USER=root ; COMMAND=/bin/sh -c echo BECOME-SUCCESS-llfipwqxlrukgtjrwlrelhqnuejcnvvd; POD_NETWORK_CIDR=172.16.0.0/16 CONTROLPLANE_ENDPOINT=192.168.206.2 VOLUME_PLUGIN_DIR=/usr/libexec/kubernetes/kubelet-plugins/volume/exec/ ETCD_ENDPOINT=https://192.168.206.2:2379 APISERVER_ADVERTISE_ADDRESS=192.168.206.3 SERVICE_NETWORK_CIDR=10.96.0.0/12 /usr/bin/python /tmp/.ansible-sysadmin/tmp/ansible-tmp-1614710711.05-29948722968306/AnsiballZ_command.py

Test Activity
-------------
Developer Testing
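The mismatch these logs evidence can also be checked mechanically. A minimal sketch (not part of the playbook; the file excerpts are inlined from this report for illustration) that compares the etcd hosts referenced by kubeadm.yaml against those etcd actually listens on:

```python
import re

# Excerpts of the two files as captured in this report.
kubeadm_yaml = """
etcd:
  external:
    endpoints:
    - https://192.168.206.2:2379
"""
etcd_yml = """
listen-client-urls: "https://192.168.206.1:2379"
advertise-client-urls: "https://192.168.206.1:2379"
"""

def etcd_hosts(text):
    """Collect the host part of every https://host:2379 URL in the text."""
    return set(re.findall(r"https://([\d.]+):2379", text))

# Hosts kubeadm will dial minus hosts etcd binds to; a non-empty
# difference reproduces the ExternalEtcdVersion preflight failure above.
mismatch = etcd_hosts(kubeadm_yaml) - etcd_hosts(etcd_yml)
print(mismatch)  # {'192.168.206.2'}
```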
2021-03-08 11:48:04 Mihnea Saracin description Brief Description ----------------- Standard system fails when running the bootstrap playbook at the 'Initializing Kubernetes master' step. The problem is that the etcd endpoint in the kubeadm file is different than the one in the etcd config files. Steps to Reproduce  ------------------ Deploy a Standard system with the 'cluster_host_start_address' defined in localhost.yml and different than the first address of the 'cluster_host_subnet' Expected Behavior ------------------ Bootstrap playbook completes successfully Actual Behavior ---------------- Bootstrap playbook fails Reproducibility --------------- 9/9 System Configuration -------------------- Standard System Branch/Pull Time/Commit ----------------------- stx master build on "2021-03-01" Last Pass ---------  N/A Timestamp/Logs -------------- TASK [bootstrap/bringup-essential-services : Initializing Kubernetes master] ***  E fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["kubeadm", "init", "--ignore-preflight-errors=DirAvailable--var-lib-etcd", "--config=/etc/kubernetes/kubeadm.yaml"], "delta": "0:00:15.145010", "end": "2021-03-02 18:45:29.366854", "msg": "non-zero return code", "rc": 1, "start": "2021-03-02 18:45:14.221844", "stderr": "W0302 18:45:14.265803 123259 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]\nerror execution phase preflight: [preflight] Some fatal errors occurred:\n\t[ERROR ExternalEtcdVersion]: Get https://192.168.206.2:2379/version: dial tcp 192.168.206.2:2379: connect: connection refused\n[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`\nTo see the stack trace of this error execute with --v=5 or higher", "stderr_lines": ["W0302 18:45:14.265803 123259 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]", 
"error execution phase preflight: [preflight] Some fatal errors occurred:", "\t[ERROR ExternalEtcdVersion]: Get https://192.168.206.2:2379/version: dial tcp 192.168.206.2:2379: connect: connection refused", "[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "[init] Using Kubernetes version: v1.18.1\n[preflight] Running pre-flight checks", "stdout_lines": ["[init] Using Kubernetes version: v1.18.1", "[preflight] Running pre-flight checks"]} From what I see, It seems that the etcd endpoint defined in the kubeadm_file is different from the one that etcd listens on: controller-0:~$ cat /etc/kubernetes/kubeadm.yaml  apiVersion: kubeadm.k8s.io/v1beta2  kind: InitConfiguration  localAPIEndpoint:  advertiseAddress: 192.168.206.3  nodeRegistration:  criSocket: "/var/run/containerd/containerd.sock"  —  apiVersion: kubeadm.k8s.io/v1beta2  kind: ClusterConfiguration  apiServer:  certSANs:  - 192.168.206.2  - 127.0.0.1  - 128.224.150.54  - 128.224.150.219  - 128.224.150.212  extraArgs:  default-not-ready-toleration-seconds: "30"  default-unreachable-toleration-seconds: "30"  feature-gates: "SCTPSupport=true,TTLAfterFinished=true,HugePageStorageMediumSize=true"  event-ttl: "24h"  encryption-provider-config: /etc/kubernetes/encryption-provider.yaml  extraVolumes:  - name: "encryption-config"  hostPath: /etc/kubernetes/encryption-provider.yaml  mountPath: /etc/kubernetes/encryption-provider.yaml  readOnly: true  pathType: File  controllerManager:  extraArgs:  node-monitor-period: "2s"  node-monitor-grace-period: "20s"  pod-eviction-timeout: "30s"  feature-gates: "TTLAfterFinished=true"  flex-volume-plugin-dir: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/  controlPlaneEndpoint: 192.168.206.2  etcd:  external:  endpoints:  - https://192.168.206.2:2379  caFile: /etc/kubernetes/pki/ca.crt  certFile: 
/etc/kubernetes/pki/apiserver-etcd-client.crt  keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key  imageRepository: "registry.local:9001/k8s.gcr.io"  kubernetesVersion: v1.18.1  networking:  dnsDomain: cluster.local  podSubnet: 172.16.0.0/16  serviceSubnet: 10.96.0.0/12  —  kind: KubeletConfiguration  apiVersion: kubelet.config.k8s.io/v1beta1  nodeStatusUpdateFrequency: "4s"  featureGates:  HugePageStorageMediumSize: true  failSwapOn: false  cgroupRoot: "/k8s-infra" #########################3 controller-0:/etc/etcd# cat /etc/etcd/etcd.conf  #[member]  ETCD_NAME="controller"  ETCD_DATA_DIR="/opt/etcd/20.12/controller.etcd"  ETCD_SNAPSHOT_COUNT=10000  ETCD_HEARTBEAT_INTERVAL=100  ETCD_ELECTION_TIMEOUT=1000  ETCD_LISTEN_CLIENT_URLS="https://192.168.206.1:2379"  ETCD_ADVERTISE_CLIENT_URLS="https://192.168.206.1:2379"  ETCD_MAX_SNAPSHOTS=5  ETCD_MAX_WALS=5  ETCD_ENABLE_V2=true  #  #[proxy]  ETCD_PROXY="off"  ETCD_PROXY_FAILURE_WAIT=5000  ETCD_PROXY_REFRESH_INTERVAL=30000  ETCD_PROXY_DIAL_TIMEOUT=1000  ETCD_PROXY_WRITE_TIMEOUT=5000  ETCD_PROXY_READ_TIMEOUT=0  # #[security]  ETCD_CERT_FILE="/etc/etcd/etcd-server.crt"  ETCD_KEY_FILE="/etc/etcd/etcd-server.key"  ETCD_CLIENT_CERT_AUTH=true  ETCD_TRUSTED_CA_FILE="/etc/etcd/ca.crt"  ETCD_PEER_CLIENT_CERT_AUTH=false  #  #[logging]  ETCD_DEBUG=false ###################### controller-0:/etc/etcd# cat /etc/etcd/etcd.yml  # Managed by Puppet  # Source URL: https://raw.githubusercontent.com/coreos/etcd/master/etcd.conf.yml.sample  # This is the configuration file for the etcd server.  # Human-readable name for this member.  name: "controller"  # Path to the data directory.  data-dir: "/opt/etcd/20.12/controller.etcd"  # Number of committed transactions to trigger a snapshot to disk.  snapshot-count: 10000  # Time (in milliseconds) of a heartbeat interval.  heartbeat-interval: 100  # Time (in milliseconds) for an election to timeout.  election-timeout: 1000  # Raise alarms when backend size exceeds the given quota. 
0 means use the  # default quota.  quota-backend-bytes: 0  # List of comma separated URLs to listen on for client traffic.  listen-client-urls: "https://192.168.206.1:2379"  # Maximum number of snapshot files to retain (0 is unlimited).  max-snapshots: 5  # Maximum number of wal files to retain (0 is unlimited).  max-wals: 5  # List of this member's client URLs to advertise to the public.  # The URLs needed to be a comma-separated list.  advertise-client-urls: "https://192.168.206.1:2379"  # Accept etcd V2 client requests  enable-v2: true  # Valid values include 'on', 'readonly', 'off'  proxy: "off"  # Time (in milliseconds) an endpoint will be held in a failed state.  proxy-failure-wait: 5000  # Time (in milliseconds) of the endpoints refresh interval.  proxy-refresh-interval: 30000  # Time (in milliseconds) for a dial to timeout.  proxy-dial-timeout: 1000  # Time (in milliseconds) for a write to timeout.  proxy-write-timeout: 5000  # Time (in milliseconds) for a read to timeout.  proxy-read-timeout: 0 client-transport-security:  # Path to the client server TLS cert file.  cert-file: "/etc/etcd/etcd-server.crt"  # Path to the client server TLS key file.  key-file:  # Enable client cert authentication.  client-cert-auth: true  # Path to the client server TLS trusted CA key file.  trusted-ca-file: "/etc/etcd/ca.crt"  # Enable debug-level logging for etcd.  
debug: false ###################### In the kubeadm file we have https://192.168.206.2:2379 but in the etcd config files we have https://192.168.206.1:2379 I think the commit that introduced this is: https://review.opendev.org/c/starlingx/ansible-playbooks/+/760512/39 This change in particular: https://review.opendev.org/c/starlingx/ansible-playbooks/+/760512/39/playbookconfig/src/playbooks/roles/bootstrap/apply-bootstrap-manifest/tasks/main.yml ##### The issue reproduces if the 'cluster_host_start_address' defined in localhost.yml it's different than the first address of the 'cluster_host_subnet': The cluster_floating_address: "{{ address_pairs['cluster_host']['start'] }}" will be different than the 'default_cluster_host_start_address': "{{ (cluster_host_subnet | ipaddr(1)).split('/')[0] }} so it will propagate to:  - The platform::etcd::params::bind_address: \{{ default_cluster_host_start_address }}" (I think the one that gets inserted in the etcd config) (https://opendev.org/starlingx/ansible-playbooks/src/commit/820f347324d4fb7d17396d5d21f98eb2b674d23a/playbookconfig/src/playbooks/roles/bootstrap/apply-bootstrap-manifest/tasks/main.yml#L157)  - The ETCD_ENDPOINT: "https://\{{ cluster_floating_address | ipwrap }}:2379" (https://opendev.org/starlingx/ansible-playbooks/src/commit/820f347324d4fb7d17396d5d21f98eb2b674d23a/playbookconfig/src/playbooks/roles/bootstrap/bringup-essential-services/tasks/bringup_kubemaster.yml#L151) which gets into the kubeadm file. 
We can see in the ansible.log: The bind_address 2021-03-02 18:34:01,656 p=11972 u=sysadmin | changed: [localhost] => (item=platform::etcd::params::bind_address: 192.168.206.1) The ETCD_ENDPONT 2021-03-02 18:45:10,760 p=11972 u=sysadmin | changed: [localhost] => (item=sed -i -e 's|<%= @etcd_endpoint %>|'$ETCD_ENDPOINT'|g' /etc/kubernetes/kubeadm.yaml)in the In the auth.log auth.log:2021-03-02T18:45:11.000 localhost sudo: notice sysadmin : TTY=ttyS0 ; PWD=/usr/share/ansible/stx-ansible/playbooks ; USER=root ; COMMAND=/bin/sh -c echo BECOME-SUCCESS-llfipwqxlrukgtjrwlrelhqnuejcnvvd; POD_NETWORK_CIDR=172.16.0.0/16 CONTROLPLANE_ENDPOINT=192.168.206.2 VOLUME_PLUGIN_DIR=/usr/libexec/kubernetes/kubelet-plugins/volume/exec/ ETCD_ENDPOINT=https://192.168.206.2:2379 APISERVER_ADVERTISE_ADDRESS=192.168.206.3 SERVICE_NETWORK_CIDR=10.96.0.0/12 /usr/bin/python /tmp/.ansible-sysadmin/tmp/ansible-tmp-1614710711.05-29948722968306/AnsiballZ_command.py Test Activity ------------- Developer Testing Brief Description ----------------- Standard system fails when running the bootstrap playbook at the 'Initializing Kubernetes master' step. The problem is that the etcd endpoint in the kubeadm file is different than the one in the etcd config files. Steps to Reproduce ------------------ Deploy a Standard system with the 'cluster_host_start_address' defined in localhost.yml and different than the first address of the 'cluster_host_subnet' Expected Behavior ------------------ Bootstrap playbook completes successfully Actual Behavior ---------------- Bootstrap playbook fails Reproducibility --------------- 9/9 System Configuration -------------------- Standard System Branch/Pull Time/Commit ----------------------- stx master build on "2021-03-01" Last Pass ---------  N/A Timestamp/Logs -------------- TASK [bootstrap/bringup-essential-services : Initializing Kubernetes master] ***  E fatal: [localhost]: FAILED! 
=> {"changed": true, "cmd": ["kubeadm", "init", "--ignore-preflight-errors=DirAvailable--var-lib-etcd", "--config=/etc/kubernetes/kubeadm.yaml"], "delta": "0:00:15.145010", "end": "2021-03-02 18:45:29.366854", "msg": "non-zero return code", "rc": 1, "start": "2021-03-02 18:45:14.221844", "stderr": "W0302 18:45:14.265803 123259 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]\nerror execution phase preflight: [preflight] Some fatal errors occurred:\n\t[ERROR ExternalEtcdVersion]: Get https://192.168.206.2:2379/version: dial tcp 192.168.206.2:2379: connect: connection refused\n[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`\nTo see the stack trace of this error execute with --v=5 or higher", "stderr_lines": ["W0302 18:45:14.265803 123259 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]", "error execution phase preflight: [preflight] Some fatal errors occurred:", "\t[ERROR ExternalEtcdVersion]: Get https://192.168.206.2:2379/version: dial tcp 192.168.206.2:2379: connect: connection refused", "[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "[init] Using Kubernetes version: v1.18.1\n[preflight] Running pre-flight checks", "stdout_lines": ["[init] Using Kubernetes version: v1.18.1", "[preflight] Running pre-flight checks"]}

From what I see, it seems that the etcd endpoint defined in the kubeadm file is different from the one that etcd listens on:

controller-0:~$ cat /etc/kubernetes/kubeadm.yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 192.168.206.3
nodeRegistration:
  criSocket: "/var/run/containerd/containerd.sock"
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  certSANs:
  - 192.168.206.2
  - 127.0.0.1
  - 128.224.150.54
  - 128.224.150.219
  - 128.224.150.212
  extraArgs:
    default-not-ready-toleration-seconds: "30"
    default-unreachable-toleration-seconds: "30"
    feature-gates: "SCTPSupport=true,TTLAfterFinished=true,HugePageStorageMediumSize=true"
    event-ttl: "24h"
    encryption-provider-config: /etc/kubernetes/encryption-provider.yaml
  extraVolumes:
  - name: "encryption-config"
    hostPath: /etc/kubernetes/encryption-provider.yaml
    mountPath: /etc/kubernetes/encryption-provider.yaml
    readOnly: true
    pathType: File
controllerManager:
  extraArgs:
    node-monitor-period: "2s"
    node-monitor-grace-period: "20s"
    pod-eviction-timeout: "30s"
    feature-gates: "TTLAfterFinished=true"
    flex-volume-plugin-dir: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
controlPlaneEndpoint: 192.168.206.2
etcd:
  external:
    endpoints:
    - https://192.168.206.2:2379
    caFile: /etc/kubernetes/pki/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
imageRepository: "registry.local:9001/k8s.gcr.io"
kubernetesVersion: v1.18.1
networking:
  dnsDomain: cluster.local
  podSubnet: 172.16.0.0/16
  serviceSubnet: 10.96.0.0/12
---
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
nodeStatusUpdateFrequency: "4s"
featureGates:
  HugePageStorageMediumSize: true
failSwapOn: false
cgroupRoot: "/k8s-infra"

######################

controller-0:/etc/etcd# cat /etc/etcd/etcd.conf
#[member]
ETCD_NAME="controller"
ETCD_DATA_DIR="/opt/etcd/20.12/controller.etcd"
ETCD_SNAPSHOT_COUNT=10000
ETCD_HEARTBEAT_INTERVAL=100
ETCD_ELECTION_TIMEOUT=1000
ETCD_LISTEN_CLIENT_URLS="https://192.168.206.1:2379"
ETCD_ADVERTISE_CLIENT_URLS="https://192.168.206.1:2379"
ETCD_MAX_SNAPSHOTS=5
ETCD_MAX_WALS=5
ETCD_ENABLE_V2=true
#
#[proxy]
ETCD_PROXY="off"
ETCD_PROXY_FAILURE_WAIT=5000
ETCD_PROXY_REFRESH_INTERVAL=30000
ETCD_PROXY_DIAL_TIMEOUT=1000
ETCD_PROXY_WRITE_TIMEOUT=5000
ETCD_PROXY_READ_TIMEOUT=0
#
#[security]
ETCD_CERT_FILE="/etc/etcd/etcd-server.crt"
ETCD_KEY_FILE="/etc/etcd/etcd-server.key"
ETCD_CLIENT_CERT_AUTH=true
ETCD_TRUSTED_CA_FILE="/etc/etcd/ca.crt"
ETCD_PEER_CLIENT_CERT_AUTH=false
#
#[logging]
ETCD_DEBUG=false

######################

controller-0:/etc/etcd# cat /etc/etcd/etcd.yml
# Managed by Puppet
# Source URL: https://raw.githubusercontent.com/coreos/etcd/master/etcd.conf.yml.sample
# This is the configuration file for the etcd server.

# Human-readable name for this member.
name: "controller"
# Path to the data directory.
data-dir: "/opt/etcd/20.12/controller.etcd"
# Number of committed transactions to trigger a snapshot to disk.
snapshot-count: 10000
# Time (in milliseconds) of a heartbeat interval.
heartbeat-interval: 100
# Time (in milliseconds) for an election to timeout.
election-timeout: 1000
# Raise alarms when backend size exceeds the given quota. 0 means use the
# default quota.
quota-backend-bytes: 0
# List of comma separated URLs to listen on for client traffic.
listen-client-urls: "https://192.168.206.1:2379"
# Maximum number of snapshot files to retain (0 is unlimited).
max-snapshots: 5
# Maximum number of wal files to retain (0 is unlimited).
max-wals: 5
# List of this member's client URLs to advertise to the public.
# The URLs needed to be a comma-separated list.
advertise-client-urls: "https://192.168.206.1:2379"
# Accept etcd V2 client requests
enable-v2: true
# Valid values include 'on', 'readonly', 'off'
proxy: "off"
# Time (in milliseconds) an endpoint will be held in a failed state.
proxy-failure-wait: 5000
# Time (in milliseconds) of the endpoints refresh interval.
proxy-refresh-interval: 30000
# Time (in milliseconds) for a dial to timeout.
proxy-dial-timeout: 1000
# Time (in milliseconds) for a write to timeout.
proxy-write-timeout: 5000
# Time (in milliseconds) for a read to timeout.
proxy-read-timeout: 0

client-transport-security:
  # Path to the client server TLS cert file.
  cert-file: "/etc/etcd/etcd-server.crt"
  # Path to the client server TLS key file.
  key-file:
  # Enable client cert authentication.
  client-cert-auth: true
  # Path to the client server TLS trusted CA key file.
  trusted-ca-file: "/etc/etcd/ca.crt"

# Enable debug-level logging for etcd.
debug: false

######################

In the kubeadm file we have https://192.168.206.2:2379, but in the etcd config files we have https://192.168.206.1:2379.

I think the commit that introduced this is: https://review.opendev.org/c/starlingx/ansible-playbooks/+/760512/39
This change in particular: https://review.opendev.org/c/starlingx/ansible-playbooks/+/760512/39/playbookconfig/src/playbooks/roles/bootstrap/apply-bootstrap-manifest/tasks/main.yml

#####

The issue reproduces if the 'cluster_host_start_address' defined in localhost.yml is different from the first address of the 'cluster_host_subnet'. In that case cluster_floating_address: "{{ address_pairs['cluster_host']['start'] }}" will be different from default_cluster_host_start_address: "{{ (cluster_host_subnet | ipaddr(1)).split('/')[0] }}", so the mismatch propagates to:
 - platform::etcd::params::bind_address: "{{ default_cluster_host_start_address }}" (I think the one that gets inserted in the etcd config) (https://opendev.org/starlingx/ansible-playbooks/src/commit/820f347324d4fb7d17396d5d21f98eb2b674d23a/playbookconfig/src/playbooks/roles/bootstrap/apply-bootstrap-manifest/tasks/main.yml#L157)
 - ETCD_ENDPOINT: "https://{{ cluster_floating_address | ipwrap }}:2379" (https://opendev.org/starlingx/ansible-playbooks/src/commit/820f347324d4fb7d17396d5d21f98eb2b674d23a/playbookconfig/src/playbooks/roles/bootstrap/bringup-essential-services/tasks/bringup_kubemaster.yml#L151), which gets into the kubeadm file.
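For illustration, here is a minimal Python sketch of that divergence, using the stdlib ipaddress module as a stand-in for Ansible's ipaddr filter. The addresses are the ones from the logs; this only mirrors the two expressions above, it is not the playbook code itself:

```python
import ipaddress

# Illustrative values from this bug; cluster_host_start_address is
# assumed to be set explicitly in localhost.yml.
cluster_host_subnet = "192.168.206.0/24"
cluster_host_start_address = "192.168.206.2"  # not the subnet's first host

# Rough equivalent of (cluster_host_subnet | ipaddr(1)).split('/')[0]:
network = ipaddress.ip_network(cluster_host_subnet)
default_cluster_host_start_address = str(network[1])  # first host address

# The etcd bind address is derived from the computed default...
etcd_bind_address = default_cluster_host_start_address

# ...while the ETCD_ENDPOINT in kubeadm.yaml uses the floating address.
etcd_endpoint = "https://{}:2379".format(cluster_host_start_address)

print(etcd_bind_address)  # 192.168.206.1
print(etcd_endpoint)      # https://192.168.206.2:2379
print(etcd_bind_address == cluster_host_start_address)  # False -> mismatch
```

Whenever the two inputs differ, kubeadm's preflight check dials an address etcd never bound to, which matches the "connection refused" error above.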
We can see in the ansible.log the bind_address:
2021-03-02 18:34:01,656 p=11972 u=sysadmin | changed: [localhost] => (item=platform::etcd::params::bind_address: 192.168.206.1)
and the ETCD_ENDPOINT:
2021-03-02 18:45:10,760 p=11972 u=sysadmin | changed: [localhost] => (item=sed -i -e 's|<%= @etcd_endpoint %>|'$ETCD_ENDPOINT'|g' /etc/kubernetes/kubeadm.yaml)

In the auth.log:
auth.log:2021-03-02T18:45:11.000 localhost sudo: notice sysadmin : TTY=ttyS0 ; PWD=/usr/share/ansible/stx-ansible/playbooks ; USER=root ; COMMAND=/bin/sh -c echo BECOME-SUCCESS-llfipwqxlrukgtjrwlrelhqnuejcnvvd; POD_NETWORK_CIDR=172.16.0.0/16 CONTROLPLANE_ENDPOINT=192.168.206.2 VOLUME_PLUGIN_DIR=/usr/libexec/kubernetes/kubelet-plugins/volume/exec/ ETCD_ENDPOINT=https://192.168.206.2:2379 APISERVER_ADVERTISE_ADDRESS=192.168.206.3 SERVICE_NETWORK_CIDR=10.96.0.0/12 /usr/bin/python /tmp/.ansible-sysadmin/tmp/ansible-tmp-1614710711.05-29948722968306/AnsiballZ_command.py

Test Activity
-------------
Developer Testing
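For reference, a hypothetical localhost.yml fragment matching the reproduction steps above (the values are illustrative, taken from the addresses in the logs; any start address other than the subnet's first host should trigger the mismatch):

```yaml
# Hypothetical localhost.yml excerpt; values are illustrative.
# The subnet's first host is 192.168.206.1, so an explicit start
# address of .2 diverges from the computed default.
cluster_host_subnet: 192.168.206.0/24
cluster_host_start_address: 192.168.206.2
```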
2021-03-15 20:15:38 Frank Miller starlingx: importance Undecided Medium
2021-03-15 20:16:57 Frank Miller tags stx.5.0 stx.security
2021-03-16 10:58:32 Mihnea Saracin starlingx: status New Fix Released
2021-06-03 12:39:45 OpenStack Infra tags stx.5.0 stx.security in-f-centos8 stx.5.0 stx.security