Backup & Restore: IPV6 Calico pods failed after controller restore

Bug #1845707 reported by Senthil Mukundakumar
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Joseph Richard
Milestone: (none)

Bug Description

Brief Description
-----------------

Active controller restore fails in an IPv6-configured lab.

Two LPs (1844686 and 1845217) that were originally raised for IPv6 configuration have been fixed. However, during the restore bootstrap some pods still fail to come up due to network issues reaching the cluster-service-subnet.

# The generated override file localhost.yml used to bootstrap controller-0 during restore. This looks good to me.
[root@controller-0 sysadmin(keystone_admin)]# more localhost.yml
dns_servers:
  - 2620:10a:a001:a103::2
pxeboot_subnet: 192.168.202.0/24
pxeboot_start_address: 192.168.202.2
pxeboot_end_address: 192.168.202.254
pxeboot_floating_address: 192.168.202.2
pxeboot_node_0_address: 192.168.202.3
pxeboot_node_1_address: 192.168.202.4
management_subnet: face::/64
management_start_address: face::2
management_end_address: face::ffff:ffff:ffff:fffe
management_floating_address: face::2
management_node_0_address: face::3
management_node_1_address: face::4
management_multicast_subnet: ff05::31:0/124
management_multicast_start_address: ff05::31:1
management_multicast_end_address: ff05::31:e
cluster_host_subnet: feed:beef::/64
cluster_host_start_address: feed:beef::2
cluster_host_end_address: feed:beef::ffff:ffff:ffff:fffe
cluster_host_floating_address: feed:beef::2
cluster_host_node_0_address: feed:beef::3
cluster_host_node_1_address: feed:beef::4
cluster_pod_subnet: dead:beef::/64
cluster_pod_start_address: dead:beef::1
cluster_pod_end_address: dead:beef::ffff:ffff:ffff:fffe
cluster_service_subnet: fd04::/112
cluster_sevice_start_address: fd04::1
cluster_service_end_address: fd04::fffe
external_oam_subnet: 2620:10a:a001:a103::/64
external_oam_start_address: 2620:10a:a001:a103::1
external_oam_end_address: 2620:10a:a001:a103:ffff:ffff:ffff:fffe
external_oam_gateway_address: 2620:10a:a001:a103::6:0
external_oam_floating_address: 2620:10a:a001:a103::1085
external_oam_node_0_address: 2620:10a:a001:a103::1083
external_oam_node_1_address: 2620:10a:a001:a103::1084
docker_no_proxy:
  - localhost
  - 127.0.0.1
  - registry.local
  - face::2
  - face::3
  - 2620:10a:a001:a103::1085
  - 2620:10a:a001:a103::1083
  - face::4
  - 2620:10a:a001:a103::1084
  - tis-lab-registry.cumulus.wrs.com
docker_http_proxy: http://yow-proxomatic.wrs.com:3128
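
# As a quick cross-check of the cluster_service_subnet value: the kubernetes.default
# API service conventionally takes the first usable address of that range, i.e. fd04::1
# here, which is the address the calico pods later try to reach. A one-liner (assuming
# python3 is available on the controller) confirms the expected address:
python3 -c "import ipaddress; print(next(ipaddress.ip_network('fd04::/112').hosts()))"
# prints: fd04::1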

# The restored address-pool info which looks fine to me
[root@controller-0 sysadmin(keystone_admin)]# system addrpool-list
uuid                                 | name                   | network              | prefix | order  | ranges                                                           | floating_address         | controller0_address      | controller1_address      | gateway_address
e62e22b5-10e5-4d6d-b3a9-39842595b18d | cluster-host-subnet    | feed:beef::          | 64     | random | ['feed:beef::2-feed:beef::ffff:ffff:ffff:fffe']                  | feed:beef::2             | feed:beef::3             | feed:beef::4             | None
37b1c194-e8d7-4d93-9a43-1a1f3e9a487e | cluster-pod-subnet     | dead:beef::          | 64     | random | ['dead:beef::1-dead:beef::ffff:ffff:ffff:fffe']                  | None                     | None                     | None                     | None
a6832f62-ab2b-4de8-ba31-9fb2c560f0b7 | cluster-service-subnet | fd04::               | 112    | random | ['fd04::1-fd04::fffe']                                           | None                     | None                     | None                     | None
6c3f2207-cf4e-4150-9b4b-d7a9140c9551 | management             | face::               | 64     | random | ['face::2-face::ffff:ffff:ffff:fffe']                            | face::2                  | face::3                  | face::4                  | None
da0cf28d-7c4a-4827-8b89-46efd49895c5 | multicast-subnet       | ff05::31:0           | 124    | random | ['ff05::31:1-ff05::31:e']                                        | None                     | None                     | None                     | None
7862aaeb-c9db-4db2-881d-d422e4269eae | oam                    | 2620:10a:a001:a103:: | 64     | random | ['2620:10a:a001:a103::1-2620:10a:a001:a103:ffff:ffff:ffff:fffe'] | 2620:10a:a001:a103::1085 | 2620:10a:a001:a103::1083 | 2620:10a:a001:a103::1084 | 2620:10a:a001:a103::6:0
17bdf0bb-022a-436a-84c9-f6cc05a21257 | pxeboot                | 192.168.202.0        | 24     | random | ['192.168.202.2-192.168.202.254']                                | 192.168.202.2            | 192.168.202.3            | 192.168.202.4            | None
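
# Any single pool can be inspected in more detail with addrpool-show (if available in
# this load), e.g. the cluster-service-subnet pool, using the uuid from the listing above:
system addrpool-show a6832f62-ab2b-4de8-ba31-9fb2c560f0b7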

# The service-parameter-list which looks fine to me
[root@controller-0 sysadmin(keystone_admin)]# system service-parameter-list
uuid                                 | service  | section     | name                         | value                                                                                                                                                                | personality | resource
fca0fb39-86a3-42da-b528-d30fcc667e5f | horizon  | auth        | lockout_retries              | 3                                                                                                                                                                    | None        | None
67ca4f23-089c-4a4a-8450-88080585931c | horizon  | auth        | lockout_seconds              | 300                                                                                                                                                                  | None        | None
192ff4fc-3781-4b8b-bdad-f85e22060a6b | radosgw  | config      | fs_size_mb                   | 25                                                                                                                                                                   | None        | None
ded1a18c-7b28-4ec3-af4d-e6034a8bc177 | http     | config      | http_port                    | 8080                                                                                                                                                                 | None        | None
fe812a73-3a2d-4962-9794-2e3061c6135c | http     | config      | https_port                   | 8443                                                                                                                                                                 | None        | None
0a5b816c-436b-4d94-ac8c-de7e7a62f4ce | radosgw  | config      | service_enabled              | false                                                                                                                                                                | None        | None
6314071b-d2fc-4a44-9314-e793b73e9195 | identity | config      | token_expiration             | 3600                                                                                                                                                                 | None        | None
13e3029e-cce1-4040-8973-84e0e9f8bb44 | platform | maintenance | bmc_access_method            | learn                                                                                                                                                                | None        | None
221ac0f7-9c83-4e86-ac65-633f3646ffc6 | platform | maintenance | controller_boot_timeout      | 1200                                                                                                                                                                 | None        | None
526242a1-8ca2-443a-b8dc-1f8fd5288f81 | platform | maintenance | heartbeat_degrade_threshold  | 6                                                                                                                                                                    | None        | None
8a518be6-04b1-43e1-ad68-844aab66c31f | platform | maintenance | heartbeat_failure_action     | fail                                                                                                                                                                 | None        | None
decad36c-6790-484f-ba6d-332705197674 | platform | maintenance | heartbeat_failure_threshold  | 10                                                                                                                                                                   | None        | None
2039dfc1-d792-49c2-87fc-6dbcb7878bf9 | platform | maintenance | heartbeat_period             | 100                                                                                                                                                                  | None        | None
bc73ba0e-b330-41c2-8baf-26b6356bbb3e | platform | maintenance | mnfa_threshold               | 2                                                                                                                                                                    | None        | None
2b907485-2b7d-45ca-920d-eb1797af6475 | platform | maintenance | mnfa_timeout                 | 0                                                                                                                                                                    | None        | None
27b8f7ec-e15b-4a18-8ede-67ae46031f80 | platform | maintenance | worker_boot_timeout          | 720                                                                                                                                                                  | None        | None
d5e96f3f-4a74-4069-89f3-b8f094c1ca3c | docker   | proxy       | http_proxy                   | http://yow-proxomatic.wrs.com:3128                                                                                                                                   | None        | None
af448175-dc27-4f6d-9a68-88a0abfb6c46 | docker   | proxy       | no_proxy                     | localhost,127.0.0.1,registry.local,[face::2],[face::3],[2620:10a:a001:a103::1085],[2620:10a:a001:a103::1083],[face::4],[2620:10a:a001:a103::1084],tis-lab-registry.cumulus.wrs.com | None        | None
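
# A quick cross-check that the restored docker proxy parameters line up with the
# docker_http_proxy/docker_no_proxy overrides shown earlier:
system service-parameter-list | grep -i proxy
grep -A 11 'docker_no_proxy' localhost.yml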

## Here is the failed ansible task from bootstrap during platform restore
TASK [bootstrap/bringup-essential-services : Wait for 120 seconds to ensure kube-system pods are all started] ***
ok: [localhost]

TASK [bootstrap/bringup-essential-services : Start parallel tasks to wait for Kubernetes component, Networking and Tiller pods to reach ready state] ***
changed: [localhost] => (item=k8s-app=calico-node)
changed: [localhost] => (item=k8s-app=calico-kube-controllers)
changed: [localhost] => (item=k8s-app=kube-proxy)
changed: [localhost] => (item=app=multus)
changed: [localhost] => (item=app=sriov-cni)
changed: [localhost] => (item=app=helm)
changed: [localhost] => (item=component=kube-apiserver)
changed: [localhost] => (item=component=kube-controller-manager)
changed: [localhost] => (item=component=kube-scheduler)

TASK [bootstrap/bringup-essential-services : Get wait tasks results] ***********************************
FAILED - RETRYING: Get wait tasks results (10 retries left).
FAILED - RETRYING: Get wait tasks results (9 retries left).
FAILED - RETRYING: Get wait tasks results (8 retries left).
FAILED - RETRYING: Get wait tasks results (7 retries left).

failed: [localhost] (item={'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': u'k8s-app=calico-node', u'ansible_job_id': u'262627741349.142217', 'failed': False, u'started': 1, 'changed': True, 'item': u'k8s-app=calico-node', u'finished': 0, u'results_file': u'/root/.ansible_async/262627741349.142217', '_ansible_ignore_errors': None, '_ansible_no_log': False}) => {"ansible_job_id": "262627741349.142217", "attempts": 5, "changed": true, "cmd": ["kubectl", "--kubeconfig=/etc/kubernetes/admin.conf", "wait", "--namespace=kube-system", "--for=condition=Ready", "pods", "--selector", "k8s-app=calico-node", "--timeout=30s"], "delta": "0:00:30.118565", "end": "2019-09-26 19:51:55.746082", "finished": 1, "item": {"ansible_job_id": "262627741349.142217", "changed": true, "failed": false, "finished": 0, "item": "k8s-app=calico-node", "results_file": "/root/.ansible_async/262627741349.142217", "started": 1}, "msg": "non-zero return code", "rc": 1, "start": "2019-09-26 19:51:25.627517", "stderr": "error: timed out waiting for the condition on pods/calico-node-msntq", "stderr_lines": ["error: timed out waiting for the condition on pods/calico-node-msntq"], "stdout": "", "stdout_lines": []}
FAILED - RETRYING: Get wait tasks results (10 retries left).
failed: [localhost] (item={'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': u'k8s-app=calico-kube-controllers', u'ansible_job_id': u'300208565192.142361', 'failed': False, u'started': 1, 'changed': True, 'item': u'k8s-app=calico-kube-controllers', u'finished': 0, u'results_file': u'/root/.ansible_async/300208565192.142361', '_ansible_ignore_errors': None, '_ansible_no_log': False}) => {"ansible_job_id": "300208565192.142361", "attempts": 2, "changed": true, "cmd": ["kubectl", "--kubeconfig=/etc/kubernetes/admin.conf", "wait", "--namespace=kube-system", "--for=condition=Ready", "pods", "--selector", "k8s-app=calico-kube-controllers", "--timeout=30s"], "delta": "0:00:30.121536", "end": "2019-09-26 19:51:56.860818", "finished": 1, "item": {"ansible_job_id": "300208565192.142361", "changed": true, "failed": false, "finished": 0, "item": "k8s-app=calico-kube-controllers", "results_file": "/root/.ansible_async/300208565192.142361", "started": 1}, "msg": "non-zero return code", "rc": 1, "start": "2019-09-26 19:51:26.739282", "stderr": "error: timed out waiting for the condition on pods/calico-kube-controllers-767467f9cf-8ssct", "stderr_lines": ["error: timed out waiting for the condition on pods/calico-kube-controllers-767467f9cf-8ssct"], "stdout": "", "stdout_lines": []}
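
# While debugging, the same readiness wait the playbook runs can be re-issued manually
# (command taken from the task output above), which makes it easy to iterate:
kubectl --kubeconfig=/etc/kubernetes/admin.conf wait --namespace=kube-system \
    --for=condition=Ready pods --selector k8s-app=calico-node --timeout=30s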

## Look for failed pods
controller-0:/home/sysadmin# kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-kube-controllers-767467f9cf-8ssct 0/1 ContainerCreating 0 27m
kube-system calico-node-msntq 0/1 CrashLoopBackOff 11 27m
kube-system coredns-7cf476b5c8-2btsp 0/1 Pending 0 27m
kube-system coredns-7cf476b5c8-dzwdb 0/1 ContainerCreating 0 27m
kube-system kube-apiserver-controller-0 1/1 Running 0 26m
kube-system kube-controller-manager-controller-0 1/1 Running 0 26m
kube-system kube-multus-ds-amd64-c6c8b 1/1 Running 0 27m
kube-system kube-proxy-mjj92 1/1 Running 0 27m
kube-system kube-scheduler-controller-0 1/1 Running 0 26m
kube-system kube-sriov-cni-ds-amd64-8s62n 1/1 Running 0 27m
kube-system tiller-deploy-7855f54f57-bcgdb 1/1 Running 0 27m
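
# The Events section of kubectl describe for the two failing pods is usually the
# quickest pointer to what is blocking them (pod names from the listing above):
kubectl --kubeconfig=/etc/kubernetes/admin.conf -n kube-system describe pod calico-node-msntq
kubectl --kubeconfig=/etc/kubernetes/admin.conf -n kube-system describe pod calico-kube-controllers-767467f9cf-8ssct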

controller-0:/home/sysadmin# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
starlingx/k8s-cni-sriov master-centos-stable-latest 938d675c0ced 2 days ago 228MB
k8s.gcr.io/kube-proxy v1.15.3 232b5c793146 5 weeks ago 82.4MB
k8s.gcr.io/kube-apiserver v1.15.3 5eb2d3fc7a44 5 weeks ago 207MB
k8s.gcr.io/kube-scheduler v1.15.3 703f9c69a5d5 5 weeks ago 81.1MB
k8s.gcr.io/kube-controller-manager v1.15.3 e77c31de5547 5 weeks ago 159MB
quay.io/airshipit/armada 8a1638098f88d92bf799ef4934abe569789b885e-ubuntu_bionic 3061a8a540ac 7 weeks ago 458MB
quay.io/calico/node v3.6.2 707815f0ee0a 4 months ago 73.2MB
quay.io/calico/cni v3.6.2 14f1e7286a2d 4 months ago 84.3MB
nfvpe/multus v3.2 45da14a16acc 6 months ago 500MB
gcr.io/kubernetes-helm/tiller v2.13.1 cb5aea7d0466 6 months ago 82.1MB
k8s.gcr.io/coredns 1.3.1 eb516548c180 8 months ago 40.3MB
k8s.gcr.io/pause 3.1 da86e6ba6ca1 21 months ago 742kB
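
# Note that the calico and coredns images are already present locally, so this is not
# an image-pull problem; a quick filter confirms it:
docker images | grep -E 'calico|coredns'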

## Check the log of failed pod calico-node-msntq
controller-0:/home/sysadmin# kubectl logs -n kube-system calico-node-msntq
Threshold time for bird readiness check: 30s
2019-09-26 20:20:39.124 [INFO][8] startup.go 256: Early log level set to info
2019-09-26 20:20:39.124 [INFO][8] startup.go 272: Using NODENAME environment for node name
2019-09-26 20:20:39.124 [INFO][8] startup.go 284: Determined node name: controller-0
2019-09-26 20:20:39.125 [INFO][8] startup.go 316: Checking datastore connection
2019-09-26 20:20:39.125 [INFO][8] startup.go 331: Hit error connecting to datastore - retry error=Get https://[fd04::1]:443/api/v1/nodes/foo: dial tcp [fd04::1]:443: connect: network is unreachable
2019-09-26 20:20:40.126 [INFO][8] startup.go 331: Hit error connecting to datastore - retry error=Get https://[fd04::1]:443/api/v1/nodes/foo: dial tcp [fd04::1]:443: connect: network is unreachable
2019-09-26 20:20:41.126 [INFO][8] startup.go 331: Hit error connecting to datastore - retry error=Get https://[fd04::1]:443/api/v1/nodes/foo: dial tcp [fd04::1]:443: connect: network is unreachable
2019-09-26 20:20:42.127 [INFO][8] startup.go 331: Hit error connecting to datastore - retry error=Get https://[fd04::1]:443/api/v1/nodes/foo: dial tcp [fd04::1]:443: connect: network is unreachable

[fd04::1]:443 is the Kubernetes API service address, which lies within the cluster-service-subnet (fd04::/112).
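
# A quick way to reproduce the same failure from the host (calico-node runs in the host
# network namespace) and to see whether an IPv6 default route is present at all. Any HTTP
# response from curl, even 401/403, means the address is reachable; "Network is unreachable"
# matches the pod logs above:
curl -g -k 'https://[fd04::1]:443/healthz'
ip -6 route show default
ip -6 route show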

Severity
--------
Critical: Unable to restore an IPv6-configured lab

Steps to Reproduce
------------------
1. Bring up the IPv6 Regular system
2. Back up the system using ansible locally
3. Re-install the controller with the same load
4. Restore the active controller (see the command sketch after this list)
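
# For reference, the local backup and restore invocations behind steps 2 and 4 look
# roughly like the following. The playbook names are from the starlingx/ansible-playbooks
# repo; the extra-vars shown are placeholders and may differ by release:
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml \
    -e "ansible_become_pass=<sysadmin-password> admin_password=<admin-password>"
# after re-installing controller-0 with the same load:
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml \
    -e "initial_backup_dir=/home/sysadmin ansible_become_pass=<sysadmin-password> admin_password=<admin-password> backup_filename=<platform-backup>.tgz"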

Expected Behavior
------------------
The active controller should be successfully restored

Actual Behavior
----------------
Active controller restore failed

Reproducibility
---------------
Reproducible

System Configuration
--------------------
IPv6 configured system

Branch/Pull Time/Commit
-----------------------
 BUILD_ID="2019-09-25_20-00-00"

Test Activity
-------------
Feature Testing

Revision history for this message
Frank Miller (sensfan22) wrote :

Joseph Richard investigated and determined the default route is missing. He believes this is the same issue as reported in https://bugs.launchpad.net/starlingx/+bug/1844192

Marking this as a duplicate.

Revision history for this message
Yang Liu (yliu12) wrote :

This issue was seen again in a Backup and Restore test.

Joseph Richard's comments regarding the new occurrence:
It did work again when I added the default route.
My fix was only to change this in puppet. In this case the puppet manifest hasn’t created the new network-scripts yet, so accept_ra is still set here. I will look at making this change in the kickstart scripts as well.
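
# A rough sketch of the checks and the manual workaround described above. The interface
# name 'enp0s8' is a placeholder for the OAM interface on this lab; the gateway comes
# from external_oam_gateway_address in localhost.yml above:
sysctl net.ipv6.conf.enp0s8.accept_ra
grep IPV6_AUTOCONF /etc/sysconfig/network-scripts/ifcfg-enp0s8
ip -6 route add default via 2620:10a:a001:a103::6:0 dev enp0s8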

Revision history for this message
Frank Miller (sensfan22) wrote :

Based on the new findings, removing the duplicate link. In addition to the commit required to address https://bugs.launchpad.net/starlingx/+bug/1844192, an additional commit is required to address the restore scenario. Assigning to Joseph Richard, who is priming a solution for this issue.

Changed in starlingx:
assignee: nobody → Joseph Richard (josephrichard)
Ghada Khalil (gkhalil)
tags: added: stx.3.0 stx.update
Ghada Khalil (gkhalil)
Changed in starlingx:
status: New → Triaged
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/688823

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/688823
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=9e388d3bc36d8772bceed7ad2b0b042985f59b09
Submitter: Zuul
Branch: master

commit 9e388d3bc36d8772bceed7ad2b0b042985f59b09
Author: Joseph Richard <email address hidden>
Date: Tue Oct 15 12:43:54 2019 -0400

    Set IPV6_AUTOCONF=no on initial onboot devices

    This commit sets IPV6_AUTOCONF=no from the anaconda installer for
    all interfaces that are enabled during the initial install.

    Once the puppet manifest is applied, and it updates the networking
    config scripts for the system, this will be overwritten by that config,
    which disables it for all configured interfaces.

    IPv6 auto-configuration is causing an issue during backup and restore,
    where router advertisements are causing the default route to be deleted,
    which causes the calico pod to fail to initialize.

    Closes-bug: 1845707
    Change-Id: I519ad0d92c66a636df0d10e79c6962296683520d
    Signed-off-by: Joseph Richard <email address hidden>
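
# A minimal sketch of the kind of installer %post step this commit describes (not the
# literal patch; the standard CentOS network-scripts path is assumed):
for cfg in /etc/sysconfig/network-scripts/ifcfg-*; do
    # Stamp IPV6_AUTOCONF=no so router advertisements cannot alter routes
    # before the puppet manifest rewrites the interface configuration.
    grep -q '^IPV6_AUTOCONF=' "$cfg" || echo 'IPV6_AUTOCONF=no' >> "$cfg"
done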

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Senthil Mukundakumar (smukunda) wrote :

Reopening the issue; it is still reproducible with the latest load, 2019-10-16_11-52-03.

Changed in starlingx:
status: Fix Released → Confirmed
Ghada Khalil (gkhalil)
Changed in starlingx:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/690732

Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/690732
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=7e6be3795d5cf97ab5f6595f822621080d26c056
Submitter: Zuul
Branch: master

commit 7e6be3795d5cf97ab5f6595f822621080d26c056
Author: Joseph Richard <email address hidden>
Date: Tue Oct 22 10:48:09 2019 -0400

    Always add default route on oam

    Currently, ansible playbook adds default route over the oam network
    only when running on a subcloud.
    This commit removes that check, enabling the default route to always
    be added.

    Closes-bug: 1845707
    Change-Id: I3de6af5135e2ca940c33f850d4e58214d84614be
    Signed-off-by: Joseph Richard <email address hidden>
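
# A simple way to verify the fix on a restored controller: confirm the OAM IPv6 default
# route is installed before the calico pods come up (gateway value from localhost.yml
# above; the interface name is lab-specific), then recheck the kube-system pods:
ip -6 route show default
# expected to show something like: default via 2620:10a:a001:a103::6:0 dev <oam-interface>
kubectl --kubeconfig=/etc/kubernetes/admin.conf get pods -n kube-system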

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Senthil Mukundakumar (smukunda) wrote :

Verified - 2019-10-28_09-12-55

tags: removed: stx.retestneeded