Activity log for bug #1918130

Date Who What changed Old value New value Message
2021-03-08 11:46:21 Mihnea Saracin bug added bug
2021-03-08 11:46:28 Mihnea Saracin starlingx: assignee Mihnea Saracin (msaracin)
2021-03-08 11:47:29 Mihnea Saracin description Brief Description
-----------------
Standard system fails when running the bootstrap playbook at the 'Initializing Kubernetes master' step. The problem is that the etcd endpoint in the kubeadm file is different from the one in the etcd config files.

Steps to Reproduce
------------------
Deploy a Standard system with 'cluster_host_start_address' defined in localhost.yml and different from the first address of the 'cluster_host_subnet'.

Expected Behavior
------------------
Bootstrap playbook completes successfully.

Actual Behavior
----------------
Bootstrap playbook fails.

Reproducibility
---------------
9/9

System Configuration
--------------------
Standard System

Branch/Pull Time/Commit
-----------------------
stx master build on "2021-03-01"

Last Pass
---------
N/A

Timestamp/Logs
--------------
TASK [bootstrap/bringup-essential-services : Initializing Kubernetes master] ***
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["kubeadm", "init", "--ignore-preflight-errors=DirAvailable--var-lib-etcd", "--config=/etc/kubernetes/kubeadm.yaml"], "delta": "0:00:15.145010", "end": "2021-03-02 18:45:29.366854", "msg": "non-zero return code", "rc": 1, "start": "2021-03-02 18:45:14.221844", "stderr": "W0302 18:45:14.265803 123259 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]\nerror execution phase preflight: [preflight] Some fatal errors occurred:\n\t[ERROR ExternalEtcdVersion]: Get https://192.168.206.2:2379/version: dial tcp 192.168.206.2:2379: connect: connection refused\n[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`\nTo see the stack trace of this error execute with --v=5 or higher", "stderr_lines": ["W0302 18:45:14.265803 123259 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]", "error execution phase preflight: [preflight] Some fatal errors occurred:", "\t[ERROR ExternalEtcdVersion]: Get https://192.168.206.2:2379/version: dial tcp 192.168.206.2:2379: connect: connection refused", "[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "[init] Using Kubernetes version: v1.18.1\n[preflight] Running pre-flight checks", "stdout_lines": ["[init] Using Kubernetes version: v1.18.1", "[preflight] Running pre-flight checks"]}

From what I see, it seems that the etcd endpoint defined in the kubeadm file is different from the one that etcd listens on:

controller-0:~$ cat /etc/kubernetes/kubeadm.yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 192.168.206.3
nodeRegistration:
  criSocket: "/var/run/containerd/containerd.sock"
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  certSANs:
  - 192.168.206.2
  - 127.0.0.1
  - 128.224.150.54
  - 128.224.150.219
  - 128.224.150.212
  extraArgs:
    default-not-ready-toleration-seconds: "30"
    default-unreachable-toleration-seconds: "30"
    feature-gates: "SCTPSupport=true,TTLAfterFinished=true,HugePageStorageMediumSize=true"
    event-ttl: "24h"
    encryption-provider-config: /etc/kubernetes/encryption-provider.yaml
  extraVolumes:
  - name: "encryption-config"
    hostPath: /etc/kubernetes/encryption-provider.yaml
    mountPath: /etc/kubernetes/encryption-provider.yaml
    readOnly: true
    pathType: File
controllerManager:
  extraArgs:
    node-monitor-period: "2s"
    node-monitor-grace-period: "20s"
    pod-eviction-timeout: "30s"
    feature-gates: "TTLAfterFinished=true"
    flex-volume-plugin-dir: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
controlPlaneEndpoint: 192.168.206.2
etcd:
  external:
    endpoints:
    - https://192.168.206.2:2379
    caFile: /etc/kubernetes/pki/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
imageRepository: "registry.local:9001/k8s.gcr.io"
kubernetesVersion: v1.18.1
networking:
  dnsDomain: cluster.local
  podSubnet: 172.16.0.0/16
  serviceSubnet: 10.96.0.0/12
---
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
nodeStatusUpdateFrequency: "4s"
featureGates:
  HugePageStorageMediumSize: true
failSwapOn: false
cgroupRoot: "/k8s-infra"

######################
controller-0:/etc/etcd# cat /etc/etcd/etcd.conf
#[member]
ETCD_NAME="controller"
ETCD_DATA_DIR="/opt/etcd/20.12/controller.etcd"
ETCD_SNAPSHOT_COUNT=10000
ETCD_HEARTBEAT_INTERVAL=100
ETCD_ELECTION_TIMEOUT=1000
ETCD_LISTEN_CLIENT_URLS="https://192.168.206.1:2379"
ETCD_ADVERTISE_CLIENT_URLS="https://192.168.206.1:2379"
ETCD_MAX_SNAPSHOTS=5
ETCD_MAX_WALS=5
ETCD_ENABLE_V2=true
#
#[proxy]
ETCD_PROXY="off"
ETCD_PROXY_FAILURE_WAIT=5000
ETCD_PROXY_REFRESH_INTERVAL=30000
ETCD_PROXY_DIAL_TIMEOUT=1000
ETCD_PROXY_WRITE_TIMEOUT=5000
ETCD_PROXY_READ_TIMEOUT=0
#
#[security]
ETCD_CERT_FILE="/etc/etcd/etcd-server.crt"
ETCD_KEY_FILE="/etc/etcd/etcd-server.key"
ETCD_CLIENT_CERT_AUTH=true
ETCD_TRUSTED_CA_FILE="/etc/etcd/ca.crt"
ETCD_PEER_CLIENT_CERT_AUTH=false
#
#[logging]
ETCD_DEBUG=false

######################
controller-0:/etc/etcd# cat /etc/etcd/etcd.yml
# Managed by Puppet
# Source URL: https://raw.githubusercontent.com/coreos/etcd/master/etcd.conf.yml.sample
# This is the configuration file for the etcd server.

# Human-readable name for this member.
name: "controller"
# Path to the data directory.
data-dir: "/opt/etcd/20.12/controller.etcd"
# Number of committed transactions to trigger a snapshot to disk.
snapshot-count: 10000
# Time (in milliseconds) of a heartbeat interval.
heartbeat-interval: 100
# Time (in milliseconds) for an election to timeout.
election-timeout: 1000
# Raise alarms when backend size exceeds the given quota. 0 means use the
# default quota.
quota-backend-bytes: 0
# List of comma separated URLs to listen on for client traffic.
listen-client-urls: "https://192.168.206.1:2379"
# Maximum number of snapshot files to retain (0 is unlimited).
max-snapshots: 5
# Maximum number of wal files to retain (0 is unlimited).
max-wals: 5
# List of this member's client URLs to advertise to the public.
# The URLs needed to be a comma-separated list.
advertise-client-urls: "https://192.168.206.1:2379"
# Accept etcd V2 client requests
enable-v2: true
# Valid values include 'on', 'readonly', 'off'
proxy: "off"
# Time (in milliseconds) an endpoint will be held in a failed state.
proxy-failure-wait: 5000
# Time (in milliseconds) of the endpoints refresh interval.
proxy-refresh-interval: 30000
# Time (in milliseconds) for a dial to timeout.
proxy-dial-timeout: 1000
# Time (in milliseconds) for a write to timeout.
proxy-write-timeout: 5000
# Time (in milliseconds) for a read to timeout.
proxy-read-timeout: 0
client-transport-security:
  # Path to the client server TLS cert file.
  cert-file: "/etc/etcd/etcd-server.crt"
  # Path to the client server TLS key file.
  key-file:
  # Enable client cert authentication.
  client-cert-auth: true
  # Path to the client server TLS trusted CA key file.
  trusted-ca-file: "/etc/etcd/ca.crt"
# Enable debug-level logging for etcd.
debug: false

######################
In the kubeadm file we have https://192.168.206.2:2379, but in the etcd config files we have https://192.168.206.1:2379.

I think the commit that introduced this is: https://review.opendev.org/c/starlingx/ansible-playbooks/+/760512/39
This change in particular: https://review.opendev.org/c/starlingx/ansible-playbooks/+/760512/39/playbookconfig/src/playbooks/roles/bootstrap/apply-bootstrap-manifest/tasks/main.yml

#####
The issue reproduces if the 'cluster_host_start_address' defined in localhost.yml is different from the first address of the 'cluster_host_subnet':
cluster_floating_address: "{{ address_pairs['cluster_host']['start'] }}" will be different from default_cluster_host_start_address: "{{ (cluster_host_subnet | ipaddr(1)).split('/')[0] }}", so it propagates to:
- platform::etcd::params::bind_address: "{{ default_cluster_host_start_address }}" (I think this is the one that gets inserted in the etcd config) (https://opendev.org/starlingx/ansible-playbooks/src/commit/820f347324d4fb7d17396d5d21f98eb2b674d23a/playbookconfig/src/playbooks/roles/bootstrap/apply-bootstrap-manifest/tasks/main.yml#L157)
- ETCD_ENDPOINT: "https://{{ cluster_floating_address | ipwrap }}:2379" (https://opendev.org/starlingx/ansible-playbooks/src/commit/820f347324d4fb7d17396d5d21f98eb2b674d23a/playbookconfig/src/playbooks/roles/bootstrap/bringup-essential-services/tasks/bringup_kubemaster.yml#L151), which gets into the kubeadm file.
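The divergence between the two variables can be sketched with Python's stdlib ipaddress module (the playbook itself uses the Ansible ipaddr filter; the subnet and override values here are illustrative, mirroring the addresses in this report):

```python
import ipaddress

# Illustrative inputs mirroring this report.
cluster_host_subnet = "192.168.206.0/24"
cluster_host_start_address = "192.168.206.2"  # user override in localhost.yml

# Equivalent of "{{ (cluster_host_subnet | ipaddr(1)).split('/')[0] }}":
# the first usable address of the subnet.
net = ipaddress.ip_network(cluster_host_subnet)
default_cluster_host_start_address = str(net[1])  # "192.168.206.1"

# etcd is told to bind to the subnet's first address...
etcd_bind_address = default_cluster_host_start_address
# ...while kubeadm is pointed at the configured start (floating) address.
etcd_endpoint = f"https://{cluster_host_start_address}:2379"

# The two disagree whenever the override is not the subnet's first address.
print(etcd_bind_address)  # 192.168.206.1
print(etcd_endpoint)      # https://192.168.206.2:2379
```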
We can see this in the ansible.log:

The bind_address:
2021-03-02 18:34:01,656 p=11972 u=sysadmin | changed: [localhost] => (item=platform::etcd::params::bind_address: 192.168.206.1)

The ETCD_ENDPOINT:
2021-03-02 18:45:10,760 p=11972 u=sysadmin | changed: [localhost] => (item=sed -i -e 's|<%= @etcd_endpoint %>|'$ETCD_ENDPOINT'|g' /etc/kubernetes/kubeadm.yaml)

And in the auth.log:
auth.log:2021-03-02T18:45:11.000 localhost sudo: notice sysadmin : TTY=ttyS0 ; PWD=/usr/share/ansible/stx-ansible/playbooks ; USER=root ; COMMAND=/bin/sh -c echo BECOME-SUCCESS-llfipwqxlrukgtjrwlrelhqnuejcnvvd; POD_NETWORK_CIDR=172.16.0.0/16 CONTROLPLANE_ENDPOINT=192.168.206.2 VOLUME_PLUGIN_DIR=/usr/libexec/kubernetes/kubelet-plugins/volume/exec/ ETCD_ENDPOINT=https://192.168.206.2:2379 APISERVER_ADVERTISE_ADDRESS=192.168.206.3 SERVICE_NETWORK_CIDR=10.96.0.0/12 /usr/bin/python /tmp/.ansible-sysadmin/tmp/ansible-tmp-1614710711.05-29948722968306/AnsiballZ_command.py

Test Activity
-------------
Developer Testing
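The mismatch these logs evidence can also be checked mechanically. A minimal sketch (not part of the playbook; the file excerpts are inlined from this report for illustration) that compares the etcd hosts referenced by kubeadm.yaml against those etcd actually listens on:

```python
import re

# Excerpts of the two files as captured in this report.
kubeadm_yaml = """
etcd:
  external:
    endpoints:
    - https://192.168.206.2:2379
"""
etcd_yml = """
listen-client-urls: "https://192.168.206.1:2379"
advertise-client-urls: "https://192.168.206.1:2379"
"""

def etcd_hosts(text):
    """Collect the host part of every https://host:2379 URL in the text."""
    return set(re.findall(r"https://([\d.]+):2379", text))

# Hosts kubeadm will dial minus hosts etcd binds to; a non-empty
# difference reproduces the ExternalEtcdVersion preflight failure above.
mismatch = etcd_hosts(kubeadm_yaml) - etcd_hosts(etcd_yml)
print(mismatch)  # {'192.168.206.2'}
```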
2021-03-08 11:48:04 Mihnea Saracin description Brief Description ----------------- Standard system fails when running the bootstrap playbook at the 'Initializing Kubernetes master' step. The problem is that the etcd endpoint in the kubeadm file is different than the one in the etcd config files. Steps to Reproduce  ------------------ Deploy a Standard system with the 'cluster_host_start_address' defined in localhost.yml and different than the first address of the 'cluster_host_subnet' Expected Behavior ------------------ Bootstrap playbook completes successfully Actual Behavior ---------------- Bootstrap playbook fails Reproducibility --------------- 9/9 System Configuration -------------------- Standard System Branch/Pull Time/Commit ----------------------- stx master build on "2021-03-01" Last Pass ---------  N/A Timestamp/Logs -------------- TASK [bootstrap/bringup-essential-services : Initializing Kubernetes master] ***  E fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["kubeadm", "init", "--ignore-preflight-errors=DirAvailable--var-lib-etcd", "--config=/etc/kubernetes/kubeadm.yaml"], "delta": "0:00:15.145010", "end": "2021-03-02 18:45:29.366854", "msg": "non-zero return code", "rc": 1, "start": "2021-03-02 18:45:14.221844", "stderr": "W0302 18:45:14.265803 123259 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]\nerror execution phase preflight: [preflight] Some fatal errors occurred:\n\t[ERROR ExternalEtcdVersion]: Get https://192.168.206.2:2379/version: dial tcp 192.168.206.2:2379: connect: connection refused\n[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`\nTo see the stack trace of this error execute with --v=5 or higher", "stderr_lines": ["W0302 18:45:14.265803 123259 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]", 
"error execution phase preflight: [preflight] Some fatal errors occurred:", "\t[ERROR ExternalEtcdVersion]: Get https://192.168.206.2:2379/version: dial tcp 192.168.206.2:2379: connect: connection refused", "[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "[init] Using Kubernetes version: v1.18.1\n[preflight] Running pre-flight checks", "stdout_lines": ["[init] Using Kubernetes version: v1.18.1", "[preflight] Running pre-flight checks"]} From what I see, It seems that the etcd endpoint defined in the kubeadm_file is different from the one that etcd listens on: controller-0:~$ cat /etc/kubernetes/kubeadm.yaml  apiVersion: kubeadm.k8s.io/v1beta2  kind: InitConfiguration  localAPIEndpoint:  advertiseAddress: 192.168.206.3  nodeRegistration:  criSocket: "/var/run/containerd/containerd.sock"  —  apiVersion: kubeadm.k8s.io/v1beta2  kind: ClusterConfiguration  apiServer:  certSANs:  - 192.168.206.2  - 127.0.0.1  - 128.224.150.54  - 128.224.150.219  - 128.224.150.212  extraArgs:  default-not-ready-toleration-seconds: "30"  default-unreachable-toleration-seconds: "30"  feature-gates: "SCTPSupport=true,TTLAfterFinished=true,HugePageStorageMediumSize=true"  event-ttl: "24h"  encryption-provider-config: /etc/kubernetes/encryption-provider.yaml  extraVolumes:  - name: "encryption-config"  hostPath: /etc/kubernetes/encryption-provider.yaml  mountPath: /etc/kubernetes/encryption-provider.yaml  readOnly: true  pathType: File  controllerManager:  extraArgs:  node-monitor-period: "2s"  node-monitor-grace-period: "20s"  pod-eviction-timeout: "30s"  feature-gates: "TTLAfterFinished=true"  flex-volume-plugin-dir: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/  controlPlaneEndpoint: 192.168.206.2  etcd:  external:  endpoints:  - https://192.168.206.2:2379  caFile: /etc/kubernetes/pki/ca.crt  certFile: 
/etc/kubernetes/pki/apiserver-etcd-client.crt  keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key  imageRepository: "registry.local:9001/k8s.gcr.io"  kubernetesVersion: v1.18.1  networking:  dnsDomain: cluster.local  podSubnet: 172.16.0.0/16  serviceSubnet: 10.96.0.0/12  —  kind: KubeletConfiguration  apiVersion: kubelet.config.k8s.io/v1beta1  nodeStatusUpdateFrequency: "4s"  featureGates:  HugePageStorageMediumSize: true  failSwapOn: false  cgroupRoot: "/k8s-infra" #########################3 controller-0:/etc/etcd# cat /etc/etcd/etcd.conf  #[member]  ETCD_NAME="controller"  ETCD_DATA_DIR="/opt/etcd/20.12/controller.etcd"  ETCD_SNAPSHOT_COUNT=10000  ETCD_HEARTBEAT_INTERVAL=100  ETCD_ELECTION_TIMEOUT=1000  ETCD_LISTEN_CLIENT_URLS="https://192.168.206.1:2379"  ETCD_ADVERTISE_CLIENT_URLS="https://192.168.206.1:2379"  ETCD_MAX_SNAPSHOTS=5  ETCD_MAX_WALS=5  ETCD_ENABLE_V2=true  #  #[proxy]  ETCD_PROXY="off"  ETCD_PROXY_FAILURE_WAIT=5000  ETCD_PROXY_REFRESH_INTERVAL=30000  ETCD_PROXY_DIAL_TIMEOUT=1000  ETCD_PROXY_WRITE_TIMEOUT=5000  ETCD_PROXY_READ_TIMEOUT=0  # #[security]  ETCD_CERT_FILE="/etc/etcd/etcd-server.crt"  ETCD_KEY_FILE="/etc/etcd/etcd-server.key"  ETCD_CLIENT_CERT_AUTH=true  ETCD_TRUSTED_CA_FILE="/etc/etcd/ca.crt"  ETCD_PEER_CLIENT_CERT_AUTH=false  #  #[logging]  ETCD_DEBUG=false ###################### controller-0:/etc/etcd# cat /etc/etcd/etcd.yml  # Managed by Puppet  # Source URL: https://raw.githubusercontent.com/coreos/etcd/master/etcd.conf.yml.sample  # This is the configuration file for the etcd server.  # Human-readable name for this member.  name: "controller"  # Path to the data directory.  data-dir: "/opt/etcd/20.12/controller.etcd"  # Number of committed transactions to trigger a snapshot to disk.  snapshot-count: 10000  # Time (in milliseconds) of a heartbeat interval.  heartbeat-interval: 100  # Time (in milliseconds) for an election to timeout.  election-timeout: 1000  # Raise alarms when backend size exceeds the given quota. 
0 means use the  # default quota.  quota-backend-bytes: 0  # List of comma separated URLs to listen on for client traffic.  listen-client-urls: "https://192.168.206.1:2379"  # Maximum number of snapshot files to retain (0 is unlimited).  max-snapshots: 5  # Maximum number of wal files to retain (0 is unlimited).  max-wals: 5  # List of this member's client URLs to advertise to the public.  # The URLs needed to be a comma-separated list.  advertise-client-urls: "https://192.168.206.1:2379"  # Accept etcd V2 client requests  enable-v2: true  # Valid values include 'on', 'readonly', 'off'  proxy: "off"  # Time (in milliseconds) an endpoint will be held in a failed state.  proxy-failure-wait: 5000  # Time (in milliseconds) of the endpoints refresh interval.  proxy-refresh-interval: 30000  # Time (in milliseconds) for a dial to timeout.  proxy-dial-timeout: 1000  # Time (in milliseconds) for a write to timeout.  proxy-write-timeout: 5000  # Time (in milliseconds) for a read to timeout.  proxy-read-timeout: 0 client-transport-security:  # Path to the client server TLS cert file.  cert-file: "/etc/etcd/etcd-server.crt"  # Path to the client server TLS key file.  key-file:  # Enable client cert authentication.  client-cert-auth: true  # Path to the client server TLS trusted CA key file.  trusted-ca-file: "/etc/etcd/ca.crt"  # Enable debug-level logging for etcd.  
debug: false ###################### In the kubeadm file we have https://192.168.206.2:2379 but in the etcd config files we have https://192.168.206.1:2379 I think the commit that introduced this is: https://review.opendev.org/c/starlingx/ansible-playbooks/+/760512/39 This change in particular: https://review.opendev.org/c/starlingx/ansible-playbooks/+/760512/39/playbookconfig/src/playbooks/roles/bootstrap/apply-bootstrap-manifest/tasks/main.yml ##### The issue reproduces if the 'cluster_host_start_address' defined in localhost.yml it's different than the first address of the 'cluster_host_subnet': The cluster_floating_address: "{{ address_pairs['cluster_host']['start'] }}" will be different than the 'default_cluster_host_start_address': "{{ (cluster_host_subnet | ipaddr(1)).split('/')[0] }} so it will propagate to:  - The platform::etcd::params::bind_address: \{{ default_cluster_host_start_address }}" (I think the one that gets inserted in the etcd config) (https://opendev.org/starlingx/ansible-playbooks/src/commit/820f347324d4fb7d17396d5d21f98eb2b674d23a/playbookconfig/src/playbooks/roles/bootstrap/apply-bootstrap-manifest/tasks/main.yml#L157)  - The ETCD_ENDPOINT: "https://\{{ cluster_floating_address | ipwrap }}:2379" (https://opendev.org/starlingx/ansible-playbooks/src/commit/820f347324d4fb7d17396d5d21f98eb2b674d23a/playbookconfig/src/playbooks/roles/bootstrap/bringup-essential-services/tasks/bringup_kubemaster.yml#L151) which gets into the kubeadm file. 
We can see in the ansible.log: The bind_address 2021-03-02 18:34:01,656 p=11972 u=sysadmin | changed: [localhost] => (item=platform::etcd::params::bind_address: 192.168.206.1) The ETCD_ENDPONT 2021-03-02 18:45:10,760 p=11972 u=sysadmin | changed: [localhost] => (item=sed -i -e 's|<%= @etcd_endpoint %>|'$ETCD_ENDPOINT'|g' /etc/kubernetes/kubeadm.yaml)in the In the auth.log auth.log:2021-03-02T18:45:11.000 localhost sudo: notice sysadmin : TTY=ttyS0 ; PWD=/usr/share/ansible/stx-ansible/playbooks ; USER=root ; COMMAND=/bin/sh -c echo BECOME-SUCCESS-llfipwqxlrukgtjrwlrelhqnuejcnvvd; POD_NETWORK_CIDR=172.16.0.0/16 CONTROLPLANE_ENDPOINT=192.168.206.2 VOLUME_PLUGIN_DIR=/usr/libexec/kubernetes/kubelet-plugins/volume/exec/ ETCD_ENDPOINT=https://192.168.206.2:2379 APISERVER_ADVERTISE_ADDRESS=192.168.206.3 SERVICE_NETWORK_CIDR=10.96.0.0/12 /usr/bin/python /tmp/.ansible-sysadmin/tmp/ansible-tmp-1614710711.05-29948722968306/AnsiballZ_command.py Test Activity ------------- Developer Testing Brief Description ----------------- Standard system fails when running the bootstrap playbook at the 'Initializing Kubernetes master' step. The problem is that the etcd endpoint in the kubeadm file is different than the one in the etcd config files. Steps to Reproduce ------------------ Deploy a Standard system with the 'cluster_host_start_address' defined in localhost.yml and different than the first address of the 'cluster_host_subnet' Expected Behavior ------------------ Bootstrap playbook completes successfully Actual Behavior ---------------- Bootstrap playbook fails Reproducibility --------------- 9/9 System Configuration -------------------- Standard System Branch/Pull Time/Commit ----------------------- stx master build on "2021-03-01" Last Pass ---------  N/A Timestamp/Logs -------------- TASK [bootstrap/bringup-essential-services : Initializing Kubernetes master] ***  E fatal: [localhost]: FAILED! 
=> {"changed": true, "cmd": ["kubeadm", "init", "--ignore-preflight-errors=DirAvailable--var-lib-etcd", "--config=/etc/kubernetes/kubeadm.yaml"], "delta": "0:00:15.145010", "end": "2021-03-02 18:45:29.366854", "msg": "non-zero return code", "rc": 1, "start": "2021-03-02 18:45:14.221844", "stderr": "W0302 18:45:14.265803 123259 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]\nerror execution phase preflight: [preflight] Some fatal errors occurred:\n\t[ERROR ExternalEtcdVersion]: Get https://192.168.206.2:2379/version: dial tcp 192.168.206.2:2379: connect: connection refused\n[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`\nTo see the stack trace of this error execute with --v=5 or higher", "stderr_lines": ["W0302 18:45:14.265803 123259 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]", "error execution phase preflight: [preflight] Some fatal errors occurred:", "\t[ERROR ExternalEtcdVersion]: Get https://192.168.206.2:2379/version: dial tcp 192.168.206.2:2379: connect: connection refused", "[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "[init] Using Kubernetes version: v1.18.1\n[preflight] Running pre-flight checks", "stdout_lines": ["[init] Using Kubernetes version: v1.18.1", "[preflight] Running pre-flight checks"]}

From what I see, it seems that the etcd endpoint defined in the kubeadm file is different from the one that etcd listens on:

controller-0:~$ cat /etc/kubernetes/kubeadm.yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 192.168.206.3
nodeRegistration:
  criSocket: "/var/run/containerd/containerd.sock"
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  certSANs:
  - 192.168.206.2
  - 127.0.0.1
  - 128.224.150.54
  - 128.224.150.219
  - 128.224.150.212
  extraArgs:
    default-not-ready-toleration-seconds: "30"
    default-unreachable-toleration-seconds: "30"
    feature-gates: "SCTPSupport=true,TTLAfterFinished=true,HugePageStorageMediumSize=true"
    event-ttl: "24h"
    encryption-provider-config: /etc/kubernetes/encryption-provider.yaml
  extraVolumes:
  - name: "encryption-config"
    hostPath: /etc/kubernetes/encryption-provider.yaml
    mountPath: /etc/kubernetes/encryption-provider.yaml
    readOnly: true
    pathType: File
controllerManager:
  extraArgs:
    node-monitor-period: "2s"
    node-monitor-grace-period: "20s"
    pod-eviction-timeout: "30s"
    feature-gates: "TTLAfterFinished=true"
    flex-volume-plugin-dir: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
controlPlaneEndpoint: 192.168.206.2
etcd:
  external:
    endpoints:
    - https://192.168.206.2:2379
    caFile: /etc/kubernetes/pki/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
imageRepository: "registry.local:9001/k8s.gcr.io"
kubernetesVersion: v1.18.1
networking:
  dnsDomain: cluster.local
  podSubnet: 172.16.0.0/16
  serviceSubnet: 10.96.0.0/12
---
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
nodeStatusUpdateFrequency: "4s"
featureGates:
  HugePageStorageMediumSize: true
failSwapOn: false
cgroupRoot: "/k8s-infra"

######################

controller-0:/etc/etcd# cat /etc/etcd/etcd.conf
#[member]
ETCD_NAME="controller"
ETCD_DATA_DIR="/opt/etcd/20.12/controller.etcd"
ETCD_SNAPSHOT_COUNT=10000
ETCD_HEARTBEAT_INTERVAL=100
ETCD_ELECTION_TIMEOUT=1000
ETCD_LISTEN_CLIENT_URLS="https://192.168.206.1:2379"
ETCD_ADVERTISE_CLIENT_URLS="https://192.168.206.1:2379"
ETCD_MAX_SNAPSHOTS=5
ETCD_MAX_WALS=5
ETCD_ENABLE_V2=true
#
#[proxy]
ETCD_PROXY="off"
ETCD_PROXY_FAILURE_WAIT=5000
ETCD_PROXY_REFRESH_INTERVAL=30000
ETCD_PROXY_DIAL_TIMEOUT=1000
ETCD_PROXY_WRITE_TIMEOUT=5000
ETCD_PROXY_READ_TIMEOUT=0
#
#[security]
ETCD_CERT_FILE="/etc/etcd/etcd-server.crt"
ETCD_KEY_FILE="/etc/etcd/etcd-server.key"
ETCD_CLIENT_CERT_AUTH=true
ETCD_TRUSTED_CA_FILE="/etc/etcd/ca.crt"
ETCD_PEER_CLIENT_CERT_AUTH=false
#
#[logging]
ETCD_DEBUG=false

######################

controller-0:/etc/etcd# cat /etc/etcd/etcd.yml
# Managed by Puppet
# Source URL: https://raw.githubusercontent.com/coreos/etcd/master/etcd.conf.yml.sample
# This is the configuration file for the etcd server.

# Human-readable name for this member.
name: "controller"
# Path to the data directory.
data-dir: "/opt/etcd/20.12/controller.etcd"
# Number of committed transactions to trigger a snapshot to disk.
snapshot-count: 10000
# Time (in milliseconds) of a heartbeat interval.
heartbeat-interval: 100
# Time (in milliseconds) for an election to timeout.
election-timeout: 1000
# Raise alarms when backend size exceeds the given quota. 0 means use the
# default quota.
quota-backend-bytes: 0
# List of comma separated URLs to listen on for client traffic.
listen-client-urls: "https://192.168.206.1:2379"
# Maximum number of snapshot files to retain (0 is unlimited).
max-snapshots: 5
# Maximum number of wal files to retain (0 is unlimited).
max-wals: 5
# List of this member's client URLs to advertise to the public.
# The URLs needed to be a comma-separated list.
advertise-client-urls: "https://192.168.206.1:2379"
# Accept etcd V2 client requests
enable-v2: true
# Valid values include 'on', 'readonly', 'off'
proxy: "off"
# Time (in milliseconds) an endpoint will be held in a failed state.
proxy-failure-wait: 5000
# Time (in milliseconds) of the endpoints refresh interval.
proxy-refresh-interval: 30000
# Time (in milliseconds) for a dial to timeout.
proxy-dial-timeout: 1000
# Time (in milliseconds) for a write to timeout.
proxy-write-timeout: 5000
# Time (in milliseconds) for a read to timeout.
proxy-read-timeout: 0

client-transport-security:
  # Path to the client server TLS cert file.
  cert-file: "/etc/etcd/etcd-server.crt"
  # Path to the client server TLS key file.
  key-file:
  # Enable client cert authentication.
  client-cert-auth: true
  # Path to the client server TLS trusted CA key file.
  trusted-ca-file: "/etc/etcd/ca.crt"

# Enable debug-level logging for etcd.
debug: false

######################

In the kubeadm file we have https://192.168.206.2:2379, but in the etcd config files we have https://192.168.206.1:2379.

I think the commit that introduced this is: https://review.opendev.org/c/starlingx/ansible-playbooks/+/760512/39
This change in particular: https://review.opendev.org/c/starlingx/ansible-playbooks/+/760512/39/playbookconfig/src/playbooks/roles/bootstrap/apply-bootstrap-manifest/tasks/main.yml

#####

The issue reproduces if the 'cluster_host_start_address' defined in localhost.yml is different from the first address of the 'cluster_host_subnet'. In that case cluster_floating_address: "{{ address_pairs['cluster_host']['start'] }}" will be different from default_cluster_host_start_address: "{{ (cluster_host_subnet | ipaddr(1)).split('/')[0] }}", so the mismatch propagates to:
 - platform::etcd::params::bind_address: "{{ default_cluster_host_start_address }}" (I think the one that gets inserted in the etcd config) (https://opendev.org/starlingx/ansible-playbooks/src/commit/820f347324d4fb7d17396d5d21f98eb2b674d23a/playbookconfig/src/playbooks/roles/bootstrap/apply-bootstrap-manifest/tasks/main.yml#L157)
 - ETCD_ENDPOINT: "https://{{ cluster_floating_address | ipwrap }}:2379" (https://opendev.org/starlingx/ansible-playbooks/src/commit/820f347324d4fb7d17396d5d21f98eb2b674d23a/playbookconfig/src/playbooks/roles/bootstrap/bringup-essential-services/tasks/bringup_kubemaster.yml#L151), which gets into the kubeadm file.
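For illustration, here is a minimal Python sketch of that divergence, using the stdlib ipaddress module as a stand-in for Ansible's ipaddr filter. The addresses are the ones from the logs; this only mirrors the two expressions above, it is not the playbook code itself:

```python
import ipaddress

# Illustrative values from this bug; cluster_host_start_address is
# assumed to be set explicitly in localhost.yml.
cluster_host_subnet = "192.168.206.0/24"
cluster_host_start_address = "192.168.206.2"  # not the subnet's first host

# Rough equivalent of (cluster_host_subnet | ipaddr(1)).split('/')[0]:
network = ipaddress.ip_network(cluster_host_subnet)
default_cluster_host_start_address = str(network[1])  # first host address

# The etcd bind address is derived from the computed default...
etcd_bind_address = default_cluster_host_start_address

# ...while the ETCD_ENDPOINT in kubeadm.yaml uses the floating address.
etcd_endpoint = "https://{}:2379".format(cluster_host_start_address)

print(etcd_bind_address)  # 192.168.206.1
print(etcd_endpoint)      # https://192.168.206.2:2379
print(etcd_bind_address == cluster_host_start_address)  # False -> mismatch
```

Whenever the two inputs differ, kubeadm's preflight check dials an address etcd never bound to, which matches the "connection refused" error above.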
We can see in the ansible.log the bind_address:
2021-03-02 18:34:01,656 p=11972 u=sysadmin | changed: [localhost] => (item=platform::etcd::params::bind_address: 192.168.206.1)
and the ETCD_ENDPOINT:
2021-03-02 18:45:10,760 p=11972 u=sysadmin | changed: [localhost] => (item=sed -i -e 's|<%= @etcd_endpoint %>|'$ETCD_ENDPOINT'|g' /etc/kubernetes/kubeadm.yaml)

In the auth.log:
auth.log:2021-03-02T18:45:11.000 localhost sudo: notice sysadmin : TTY=ttyS0 ; PWD=/usr/share/ansible/stx-ansible/playbooks ; USER=root ; COMMAND=/bin/sh -c echo BECOME-SUCCESS-llfipwqxlrukgtjrwlrelhqnuejcnvvd; POD_NETWORK_CIDR=172.16.0.0/16 CONTROLPLANE_ENDPOINT=192.168.206.2 VOLUME_PLUGIN_DIR=/usr/libexec/kubernetes/kubelet-plugins/volume/exec/ ETCD_ENDPOINT=https://192.168.206.2:2379 APISERVER_ADVERTISE_ADDRESS=192.168.206.3 SERVICE_NETWORK_CIDR=10.96.0.0/12 /usr/bin/python /tmp/.ansible-sysadmin/tmp/ansible-tmp-1614710711.05-29948722968306/AnsiballZ_command.py

Test Activity
-------------
Developer Testing
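For reference, a hypothetical localhost.yml fragment matching the reproduction steps above (the values are illustrative, taken from the addresses in the logs; any start address other than the subnet's first host should trigger the mismatch):

```yaml
# Hypothetical localhost.yml excerpt; values are illustrative.
# The subnet's first host is 192.168.206.1, so an explicit start
# address of .2 diverges from the computed default.
cluster_host_subnet: 192.168.206.0/24
cluster_host_start_address: 192.168.206.2
```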
2021-03-15 20:15:38 Frank Miller starlingx: importance Undecided Medium
2021-03-15 20:16:57 Frank Miller tags stx.5.0 stx.security
2021-03-16 10:58:32 Mihnea Saracin starlingx: status New Fix Released
2021-06-03 12:39:45 OpenStack Infra tags stx.5.0 stx.security in-f-centos8 stx.5.0 stx.security