Bootstrap playbook fails at Initializing Kubernetes master

Bug #1918130 reported by Mihnea Saracin
This bug affects 1 person
Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  Medium      Mihnea Saracin

Bug Description

Brief Description
-----------------

A Standard system fails when running the bootstrap playbook at the 'Initializing Kubernetes master' step. The problem is that the etcd endpoint in the kubeadm file differs from the one in the etcd config files.

Steps to Reproduce
------------------

Deploy a Standard system with 'cluster_host_start_address' defined in localhost.yml and set to a value other than the first address of 'cluster_host_subnet'.

Expected Behavior
------------------
Bootstrap playbook completes successfully

Actual Behavior
----------------
Bootstrap playbook fails

Reproducibility
---------------
9/9

System Configuration
--------------------
Standard System

Branch/Pull Time/Commit
-----------------------
stx master build on "2021-03-01"

Last Pass
---------
 N/A

Timestamp/Logs
--------------
TASK [bootstrap/bringup-essential-services : Initializing Kubernetes master] ***
 E fatal: [localhost]: FAILED! =>

{"changed": true, "cmd": ["kubeadm", "init", "--ignore-preflight-errors=DirAvailable--var-lib-etcd", "--config=/etc/kubernetes/kubeadm.yaml"], "delta": "0:00:15.145010", "end": "2021-03-02 18:45:29.366854", "msg": "non-zero return code", "rc": 1, "start": "2021-03-02 18:45:14.221844", "stderr": "W0302 18:45:14.265803 123259 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]\nerror execution phase preflight: [preflight] Some fatal errors occurred:\n\t[ERROR ExternalEtcdVersion]: Get https://192.168.206.2:2379/version: dial tcp 192.168.206.2:2379: connect: connection refused\n[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`\nTo see the stack trace of this error execute with --v=5 or higher", "stderr_lines": ["W0302 18:45:14.265803 123259 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]", "error execution phase preflight: [preflight] Some fatal errors occurred:", "\t[ERROR ExternalEtcdVersion]: Get https://192.168.206.2:2379/version: dial tcp 192.168.206.2:2379: connect: connection refused", "[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "[init] Using Kubernetes version: v1.18.1\n[preflight] Running pre-flight checks", "stdout_lines": ["[init] Using Kubernetes version: v1.18.1", "[preflight] Running pre-flight checks"]}
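The endpoint that kubeadm failed to reach can be pulled out of the task's stderr_lines with a small helper. This is only a minimal sketch: the function name refused_etcd_endpoints is hypothetical, and the pattern assumes nothing beyond the message format shown in the log above (an optional quote around the URL is tolerated in case other kubeadm versions quote it).

```python
import re

# Matches the preflight error shown above and captures scheme://host:port.
_ETCD_VERSION_ERROR = re.compile(
    r'\[ERROR ExternalEtcdVersion\]: Get "?(https?://[^\s/"]+)/version'
)


def refused_etcd_endpoints(stderr_lines):
    """Return the etcd endpoints named in ExternalEtcdVersion errors, in order."""
    found = []
    for line in stderr_lines:
        match = _ETCD_VERSION_ERROR.search(line)
        if match:
            found.append(match.group(1))
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(found))


SAMPLE_STDERR_LINES = [
    "error execution phase preflight: [preflight] Some fatal errors occurred:",
    "\t[ERROR ExternalEtcdVersion]: Get https://192.168.206.2:2379/version: "
    "dial tcp 192.168.206.2:2379: connect: connection refused",
]

if __name__ == "__main__":
    print(refused_etcd_endpoints(SAMPLE_STDERR_LINES))
```

The returned value can then be compared against the address etcd is configured to listen on.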

From what I see, it seems that the etcd endpoint defined in the kubeadm file is different from the one that etcd listens on:

controller-0:~$ cat /etc/kubernetes/kubeadm.yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 192.168.206.3
nodeRegistration:
  criSocket: "/var/run/containerd/containerd.sock"
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  certSANs:
  - 192.168.206.2
  - 127.0.0.1
  - 128.224.150.54
  - 128.224.150.219
  - 128.224.150.212
  extraArgs:
    default-not-ready-toleration-seconds: "30"
    default-unreachable-toleration-seconds: "30"
    feature-gates: "SCTPSupport=true,TTLAfterFinished=true,HugePageStorageMediumSize=true"
    event-ttl: "24h"
    encryption-provider-config: /etc/kubernetes/encryption-provider.yaml
  extraVolumes:
  - name: "encryption-config"
    hostPath: /etc/kubernetes/encryption-provider.yaml
    mountPath: /etc/kubernetes/encryption-provider.yaml
    readOnly: true
    pathType: File
controllerManager:
  extraArgs:
    node-monitor-period: "2s"
    node-monitor-grace-period: "20s"
    pod-eviction-timeout: "30s"
    feature-gates: "TTLAfterFinished=true"
    flex-volume-plugin-dir: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
controlPlaneEndpoint: 192.168.206.2
etcd:
  external:
    endpoints:
    - https://192.168.206.2:2379
    caFile: /etc/kubernetes/pki/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
imageRepository: "registry.local:9001/k8s.gcr.io"
kubernetesVersion: v1.18.1
networking:
  dnsDomain: cluster.local
  podSubnet: 172.16.0.0/16
  serviceSubnet: 10.96.0.0/12
---
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
nodeStatusUpdateFrequency: "4s"
featureGates:
  HugePageStorageMediumSize: true
failSwapOn: false
cgroupRoot: "/k8s-infra"

######################

controller-0:/etc/etcd# cat /etc/etcd/etcd.conf
 #[member]
 ETCD_NAME="controller"
 ETCD_DATA_DIR="/opt/etcd/20.12/controller.etcd"
 ETCD_SNAPSHOT_COUNT=10000
 ETCD_HEARTBEAT_INTERVAL=100
 ETCD_ELECTION_TIMEOUT=1000
 ETCD_LISTEN_CLIENT_URLS="https://192.168.206.1:2379"
 ETCD_ADVERTISE_CLIENT_URLS="https://192.168.206.1:2379"
 ETCD_MAX_SNAPSHOTS=5
 ETCD_MAX_WALS=5
 ETCD_ENABLE_V2=true
 #
 #[proxy]
 ETCD_PROXY="off"
 ETCD_PROXY_FAILURE_WAIT=5000
 ETCD_PROXY_REFRESH_INTERVAL=30000
 ETCD_PROXY_DIAL_TIMEOUT=1000
 ETCD_PROXY_WRITE_TIMEOUT=5000
 ETCD_PROXY_READ_TIMEOUT=0
 #

#[security]
 ETCD_CERT_FILE="/etc/etcd/etcd-server.crt"
 ETCD_KEY_FILE="/etc/etcd/etcd-server.key"
 ETCD_CLIENT_CERT_AUTH=true
 ETCD_TRUSTED_CA_FILE="/etc/etcd/ca.crt"
 ETCD_PEER_CLIENT_CERT_AUTH=false
 #
 #[logging]
 ETCD_DEBUG=false

######################

controller-0:/etc/etcd# cat /etc/etcd/etcd.yml
 # Managed by Puppet
 # Source URL: https://raw.githubusercontent.com/coreos/etcd/master/etcd.conf.yml.sample
 # This is the configuration file for the etcd server.

 # Human-readable name for this member.
 name: "controller"

 # Path to the data directory.
 data-dir: "/opt/etcd/20.12/controller.etcd"

 # Number of committed transactions to trigger a snapshot to disk.
 snapshot-count: 10000

 # Time (in milliseconds) of a heartbeat interval.
 heartbeat-interval: 100

 # Time (in milliseconds) for an election to timeout.
 election-timeout: 1000

 # Raise alarms when backend size exceeds the given quota. 0 means use the
 # default quota.
 quota-backend-bytes: 0

 # List of comma separated URLs to listen on for client traffic.
 listen-client-urls: "https://192.168.206.1:2379"

 # Maximum number of snapshot files to retain (0 is unlimited).
 max-snapshots: 5

 # Maximum number of wal files to retain (0 is unlimited).
 max-wals: 5

 # List of this member's client URLs to advertise to the public.
 # The URLs needed to be a comma-separated list.
 advertise-client-urls: "https://192.168.206.1:2379"

 # Accept etcd V2 client requests
 enable-v2: true

 # Valid values include 'on', 'readonly', 'off'
 proxy: "off"

 # Time (in milliseconds) an endpoint will be held in a failed state.
 proxy-failure-wait: 5000

 # Time (in milliseconds) of the endpoints refresh interval.
 proxy-refresh-interval: 30000

 # Time (in milliseconds) for a dial to timeout.
 proxy-dial-timeout: 1000

 # Time (in milliseconds) for a write to timeout.
 proxy-write-timeout: 5000

 # Time (in milliseconds) for a read to timeout.
 proxy-read-timeout: 0

 client-transport-security:
   # Path to the client server TLS cert file.
   cert-file: "/etc/etcd/etcd-server.crt"

   # Path to the client server TLS key file.
   key-file:

   # Enable client cert authentication.
   client-cert-auth: true

   # Path to the client server TLS trusted CA key file.
   trusted-ca-file: "/etc/etcd/ca.crt"

 # Enable debug-level logging for etcd.
 debug: false

######################

In the kubeadm file we have https://192.168.206.2:2379 but in the etcd config files we have https://192.168.206.1:2379
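The same comparison can be scripted. The following is a rough sketch, not part of the playbooks: it assumes the two files have already been read into strings, uses plain regular expressions rather than a YAML parser, and all function names are hypothetical.

```python
import re


def kubeadm_etcd_endpoints(kubeadm_text):
    """Return list items of the form '- https://...' found in the kubeadm file."""
    return re.findall(r"^\s*-\s*(https?://\S+)\s*$", kubeadm_text, flags=re.MULTILINE)


def etcd_listen_urls(etcd_conf_text):
    """Return the URLs from ETCD_LISTEN_CLIENT_URLS in etcd.conf."""
    match = re.search(
        r'^\s*ETCD_LISTEN_CLIENT_URLS="([^"]*)"', etcd_conf_text, flags=re.MULTILINE
    )
    return [url.strip() for url in match.group(1).split(",")] if match else []


def unreachable_endpoints(kubeadm_text, etcd_conf_text):
    """Endpoints kubeadm will try that etcd is not configured to listen on."""
    listening = set(etcd_listen_urls(etcd_conf_text))
    return [url for url in kubeadm_etcd_endpoints(kubeadm_text) if url not in listening]


# Trimmed-down copies of the two files quoted above.
KUBEADM_SAMPLE = """\
etcd:
  external:
    endpoints:
    - https://192.168.206.2:2379
"""
ETCD_CONF_SAMPLE = 'ETCD_LISTEN_CLIENT_URLS="https://192.168.206.1:2379"\n'

if __name__ == "__main__":
    print(unreachable_endpoints(KUBEADM_SAMPLE, ETCD_CONF_SAMPLE))
```

An empty result would mean every endpoint in the kubeadm file is one that etcd listens on; with the values from this report the list contains the 192.168.206.2 endpoint.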

I think the commit that introduced this is:

https://review.opendev.org/c/starlingx/ansible-playbooks/+/760512/39

This change in particular:

https://review.opendev.org/c/starlingx/ansible-playbooks/+/760512/39/playbookconfig/src/playbooks/roles/bootstrap/apply-bootstrap-manifest/tasks/main.yml

#####

The issue reproduces if the 'cluster_host_start_address' defined in localhost.yml is different from the first address of the 'cluster_host_subnet':

The cluster_floating_address: "{{ address_pairs['cluster_host']['start'] }}" will be different from the default_cluster_host_start_address: "{{ (cluster_host_subnet | ipaddr(1)).split('/')[0] }}", and the two values propagate to different places:
 - platform::etcd::params::bind_address: "{{ default_cluster_host_start_address }}" (I think this is the one that gets inserted into the etcd config) (https://opendev.org/starlingx/ansible-playbooks/src/commit/820f347324d4fb7d17396d5d21f98eb2b674d23a/playbookconfig/src/playbooks/roles/bootstrap/apply-bootstrap-manifest/tasks/main.yml#L157)

 - ETCD_ENDPOINT: "https://{{ cluster_floating_address | ipwrap }}:2379" (https://opendev.org/starlingx/ansible-playbooks/src/commit/820f347324d4fb7d17396d5d21f98eb2b674d23a/playbookconfig/src/playbooks/roles/bootstrap/bringup-essential-services/tasks/bringup_kubemaster.yml#L151), which gets into the kubeadm file.
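To make the divergence concrete, here is a small Python sketch of the two derivations using only the standard ipaddress module. It is an approximation of the Jinja expressions above, not the playbook code: the function names are hypothetical, and the subnet 192.168.206.0/24 is an assumption inferred from the addresses in the logs.

```python
import ipaddress


def first_subnet_address(cluster_host_subnet):
    """Rough equivalent of (cluster_host_subnet | ipaddr(1)).split('/')[0]."""
    network = ipaddress.ip_network(cluster_host_subnet, strict=False)
    return str(network[1])


def etcd_endpoint(address):
    """Rough equivalent of "https://{{ address | ipwrap }}:2379"."""
    ip = ipaddress.ip_address(address)
    host = f"[{ip}]" if ip.version == 6 else str(ip)
    return f"https://{host}:2379"


def endpoints_diverge(cluster_host_subnet, cluster_host_start_address):
    """True when etcd's bind address and kubeadm's etcd endpoint would differ."""
    bind_address = first_subnet_address(cluster_host_subnet)
    floating_address = str(ipaddress.ip_address(cluster_host_start_address))
    return bind_address != floating_address


if __name__ == "__main__":
    subnet = "192.168.206.0/24"  # assumed; not stated explicitly in the report
    print(first_subnet_address(subnet))            # address etcd binds to
    print(etcd_endpoint("192.168.206.2"))          # endpoint written to kubeadm.yaml
    print(endpoints_diverge(subnet, "192.168.206.2"))
```

When the start address equals the first address of the subnet, both values coincide and the mismatch does not occur, which is consistent with the reproduction condition described above.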

We can see in the ansible.log:

The bind_address

2021-03-02 18:34:01,656 p=11972 u=sysadmin | changed: [localhost] => (item=platform::etcd::params::bind_address: 192.168.206.1)

The ETCD_ENDPOINT

2021-03-02 18:45:10,760 p=11972 u=sysadmin | changed: [localhost] => (item=sed -i -e 's|<%= @etcd_endpoint %>|'$ETCD_ENDPOINT'|g' /etc/kubernetes/kubeadm.yaml)

In the auth.log

auth.log:2021-03-02T18:45:11.000 localhost sudo: notice sysadmin : TTY=ttyS0 ; PWD=/usr/share/ansible/stx-ansible/playbooks ; USER=root ; COMMAND=/bin/sh -c echo BECOME-SUCCESS-llfipwqxlrukgtjrwlrelhqnuejcnvvd; POD_NETWORK_CIDR=172.16.0.0/16 CONTROLPLANE_ENDPOINT=192.168.206.2 VOLUME_PLUGIN_DIR=/usr/libexec/kubernetes/kubelet-plugins/volume/exec/ ETCD_ENDPOINT=https://192.168.206.2:2379 APISERVER_ADVERTISE_ADDRESS=192.168.206.3 SERVICE_NETWORK_CIDR=10.96.0.0/12 /usr/bin/python /tmp/.ansible-sysadmin/tmp/ansible-tmp-1614710711.05-29948722968306/AnsiballZ_command.py
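The path from the environment variable to the kubeadm file can be mimicked in a few lines of Python. This is only an illustration of the sed substitution seen in ansible.log, under the assumption that the template contains the literal placeholder <%= @etcd_endpoint %>; the helper names are hypothetical.

```python
import re

PLACEHOLDER = "<%= @etcd_endpoint %>"


def env_assignments(command_line):
    """Collect NAME=value tokens (upper-case names) from a logged command line."""
    return dict(re.findall(r"\b([A-Z][A-Z0-9_]*)=(\S+)", command_line))


def render_etcd_endpoint(template_text, env):
    """Replace every placeholder with ETCD_ENDPOINT, like the sed 's|...|...|g' call."""
    return template_text.replace(PLACEHOLDER, env["ETCD_ENDPOINT"])


# Shortened copy of the variables visible in the auth.log entry above.
SAMPLE_COMMAND = (
    "POD_NETWORK_CIDR=172.16.0.0/16 CONTROLPLANE_ENDPOINT=192.168.206.2 "
    "ETCD_ENDPOINT=https://192.168.206.2:2379 "
    "APISERVER_ADVERTISE_ADDRESS=192.168.206.3"
)
SAMPLE_TEMPLATE = "endpoints:\n- <%= @etcd_endpoint %>\n"

if __name__ == "__main__":
    env = env_assignments(SAMPLE_COMMAND)
    print(render_etcd_endpoint(SAMPLE_TEMPLATE, env))
```

Whatever value ETCD_ENDPOINT holds at that point is written verbatim into kubeadm.yaml, independently of the bind_address applied to etcd earlier.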

Test Activity
-------------
Developer Testing

Changed in starlingx:
assignee: nobody → Mihnea Saracin (msaracin)
description: updated
Mihnea Saracin (msaracin) wrote :
Frank Miller (sensfan22) wrote :

Marking as gating for the stx.5.0 release, as the bug was introduced by an earlier commit on the master (stx.5.0) branch.

Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.5.0 stx.security
Changed in starlingx:
status: New → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (f/centos8)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793460

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793696

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (f/centos8)
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ansible-playbooks (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/794304
Reason: bad rebase

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (f/centos8)
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ansible-playbooks (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/792195

OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/794324
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/163ec9989cc7360dba4c572b2c43effd10306048
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 4e96b762f549aadb0291cc9bcf3352ae923e94eb
Author: Mihnea Saracin <email address hidden>
Date: Sat May 22 15:48:19 2021 +0000

    Revert "Restore host filesystems with collected sizes"

    This reverts commit 255488739efa4ac072424b19f2dbb7a3adb0254e.

    Reason for revert: Did a rework to fix https://bugs.launchpad.net/starlingx/+bug/1926591. The original problem was in puppet, and this fix in ansible was not good enough, it generated some other problems.

    Change-Id: Iea79701a874effecb7fe995ac468d50081d1a84f
    Depends-On: I55ae6954d24ba32e40c2e5e276ec17015d9bba44

commit c064aacc377c8bd5336ceab825d4bcbf5af0b5e8
Author: Angie Wang <email address hidden>
Date: Fri May 21 21:28:02 2021 -0400

    Ensure apiserver keys are present before extract from tarball

    This is to fix the upgrade playbook issue that happens during
    AIO-SX upgrade from stx4.0 to stx5.0 which introduced by
    https://review.opendev.org/c/starlingx/ansible-playbooks/+/792093.
    The apiserver keys are not available in stx4.0 side so we need
    to ensure the keys under /etc/kubernetes/pki are present in the
    backed-up tarball before extracting, otherwise playbook fails
    because the keys are not found in the archive.

    Change-Id: I8602f07d1b1041a7fd3fff21e6f9a422b9784ab5
    Closes-Bug: 928925
    Signed-off-by: Angie Wang <email address hidden>

commit 0261f22ff7c23d2a8608fe3b51725c9f29931281
Author: Don Penney <email address hidden>
Date: Thu May 20 23:09:07 2021 -0400

    Update SX to DX migration to wait for coredns config

    This commit updates the SX to DX migration playbook to wait after
    modifying the system mode to duplex until the runtime manifest that
    updates coredns config has completed. The playbook will wait for up to
    20 minutes to allow for the possibilty that sysinv has multiple
    runtime manifests queued up, each of which could take several minutes.

    Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/792494
    Depends-On: https://review.opendev.org/c/starlingx/config/+/792496
    Change-Id: I3bf94d3493ae20eeb16b3fdcb27576ee18c0dc4d
    Closes-Bug: 1929148
    Signed-off-by: Don Penney <email address hidden>

commit 7c4f17bd0d92fc1122823211e1c9787829d206a9
Author: Daniel Safta <email address hidden>
Date: Wed May 19 09:08:16 2021 +0000

    Fixed missing apiserver-etcd-client certs

    When controller-1 is the active controller
    the backup archive does not contain
    /etc/etcd/apiserver-etcd-client.{crt, key}

    This change adds a new task which brings
    the certs from /etc/kubernetes/pki

    Closes-bug: 1928925
    Signed-off-by: Daniel Safta <email address hidden>
    Change-Id: I3c68377603e1af9a71d104e5b1108e9582497a09

commit e221ef8fbe51aa6ca229b584fb5632fe512ad5cb
Author: David Sullivan <email address hidden>
Date: Wed May 19 16:01:27 2021 -0500

    Support boo...

tags: added: in-f-centos8
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794611

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794906

OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794611

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/config/+/794906
Committed: https://opendev.org/starlingx/config/commit/75758b37a5a23c8811355b67e2a430a1713cd85b
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 9e420d9513e5fafb1df4d29567bc299a9e04d58d
Author: Bin Qian <email address hidden>
Date: Mon May 31 14:45:52 2021 -0400

    Add more logging to run docker login

    Add error log for running docker login. The new log could
    help identify docker login failure.

    Closes-Bug: 1930310
    Change-Id: I8a709fb6665de8301fbe3022563499a92b2a0211
    Signed-off-by: Bin Qian <email address hidden>

commit 31c77439d2cea590dfcca13cfa646522665f8686
Author: albailey <email address hidden>
Date: Fri May 28 13:42:42 2021 -0500

    Fix controller-0 downgrade failing to kill ceph

    kill_ceph_storage_monitor tried to manipulate a pmon
    file that does not exist in an AIO-DX environment.

    We no longer invoke kill_ceph_storage_monitor in an
    AIO SX or DX env.

    This allows: "system host-downgrade controller-0"
    to proceed in an AIO-DX environment where that second
    controller (controller-0) was upgraded.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I633853f75317736084feae96b5b849c601204c13

commit 0dc99eee608336fe01b58821ea404286371f1408
Author: albailey <email address hidden>
Date: Fri May 28 11:05:43 2021 -0500

    Fix file permissions failure during duplex upgrade abort

    When issuing a downgrade for controller-0 in a duplex upgrade
    abort and rollback scenario, the downgrade command was failing
    because the sysinv API does not have root permissions to set
    a file flag.
    The fix is to use RPC so the conductor can create the flag
    and allow the downgrade for controller-0 to get further.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I913bcad73309fe887a12cbb016a518da93327947

commit 7ef3724dad173754e40b45538b1cc726a458cc1c
Author: Chen, Haochuan Z <email address hidden>
Date: Tue May 25 16:16:29 2021 +0800

    Fix bug rook-ceph provision with multi osd on one host

    Test case:
    1, deploy simplex system
    2, apply rook-ceph with below override value
    value.yaml
    cluster:
      storage:
        nodes:
        - name: controller-0
          devices:
          - name: sdb
          - name: sdc
    3, reboot

    Without this fix, only osd pod could launch successfully after boot
    as vg start with ceph could not correctly add in sysinv-database

    Closes-bug: 1929511

    Change-Id: Ia5be599cd168d13d2aab7b5e5890376c3c8a0019
    Signed-off-by: Chen, Haochuan Z <email address hidden>

commit 23505ba77d76114cf8a0bf833f9a5bcd05bc1dd1
Author: Angie Wang <email address hidden>
Date: Tue May 25 18:49:21 2021 -0400

    Fix issue in partition data migration script

    The created partition dictonary partition_map is not
    an ordered dict so we need to sort it by its key -
    device node when iterating it to adjust the device
    nodes/paths for user created extra partitions to ensure
    the number of device node...

OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793696

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793460
