Backup & Restore: IPv6 Calico pods failed after controller restore
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | High | Joseph Richard |
Bug Description
Brief Description
-----------------
The active controller restore fails in an IPv6-configured lab.
Two LPs initially raised for IPv6 configuration, 1844686 and 1845217, have already been fixed. However, during the restore bootstrap some pods failed to come up due to networking issues on the cluster network.
# The override file localhost.yml generated to bootstrap controller-0 during restore. This looks good to me.
[root@controller-0 sysadmin(
dns_servers:
- 2620:10a:
pxeboot_subnet: 192.168.202.0/24
pxeboot_
pxeboot_
pxeboot_
pxeboot_
pxeboot_
management_subnet: face::/64
management_
management_
management_
management_
management_
management_
management_
management_
cluster_
cluster_
cluster_
cluster_
cluster_
cluster_
cluster_pod_subnet: dead:beef::/64
cluster_
cluster_
cluster_
cluster_
cluster_
external_
external_
external_
external_
external_
external_
external_
docker_no_proxy:
- localhost
- 127.0.0.1
- registry.local
- face::2
- face::3
- 2620:10a:
- 2620:10a:
- face::4
- 2620:10a:
- tis-lab-
docker_http_proxy: http://
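Note that the IPv6 entries in `docker_no_proxy` above (e.g. `face::2`) are listed unbracketed, while the same addresses appear bracketed in URLs (`https://[face::2]`). As an illustrative sketch only (not Docker's actual matcher), a proxy-bypass check along these lines must strip the URL brackets before comparing:

```python
# Illustrative sketch (not Docker's actual implementation) of no_proxy
# matching. IPv6 literals appear in no_proxy unbracketed (e.g. face::2),
# so a host taken from a URL must be stripped of brackets first.

def bypass_proxy(host: str, no_proxy: list) -> bool:
    """Return True if `host` should skip the HTTP proxy per the no_proxy list."""
    host = host.strip("[]").lower()  # URL form [face::2] -> face::2
    for entry in no_proxy:
        entry = entry.strip().lower()
        # Exact match, or suffix match for domain entries like registry.local
        if host == entry or host.endswith("." + entry.lstrip(".")):
            return True
    return False

no_proxy = ["localhost", "127.0.0.1", "registry.local", "face::2", "face::3"]
print(bypass_proxy("[face::2]", no_proxy))   # True  (controller address)
print(bypass_proxy("docker.io", no_proxy))   # False (goes through proxy)
```

If an IPv6 registry address were recorded bracketed in the override, a naive exact-match comparison would fail and pulls would wrongly go through the proxy.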
# The restored address-pool info which looks fine to me
[root@controller-0 sysadmin(
(The `system addrpool-list` table is garbled in the original report. Columns: uuid | name | network | prefix | order | ranges | floating_address | controller0_address | controller1_address | gateway_address. Recoverable rows: a pool on feed:beef:: (apparently the cluster-host subnet), the cluster pod subnet dead:beef::, the cluster service subnet fd04::, the management subnet face::, a multicast pool in ff05::, the OAM subnet 2620:10a:a001:a103:: (floating ::1085, controller-0 ::1083, controller-1 ::1084, gateway ::6:0), and the pxeboot subnet 192.168.202.0 (floating .2, controller-0 .3, controller-1 .4).)
# The restored service-parameter info
[root@controller-0 sysadmin(
+------
| uuid | service | section | name | value | personality | resource |
+------
(Rows garbled in the original report. Recoverable fragments: parameter names ending in "timeout", "threshold", and "action", plus one value listing no-proxy-style entries: face:..., 10a:a001:..., a001:a103:..., wrs.com.)
+------
## Here is the failed Ansible task from bootstrap during platform restore
TASK [bootstrap/
ok: [localhost]
TASK [bootstrap/
changed: [localhost] => (item=k8s-
changed: [localhost] => (item=k8s-
changed: [localhost] => (item=k8s-
changed: [localhost] => (item=app=multus)
changed: [localhost] => (item=app=
changed: [localhost] => (item=app=helm)
changed: [localhost] => (item=component
changed: [localhost] => (item=component
changed: [localhost] => (item=component
TASK [bootstrap/
FAILED - RETRYING: Get wait tasks results (10 retries left).
FAILED - RETRYING: Get wait tasks results (9 retries left).
FAILED - RETRYING: Get wait tasks results (8 retries left).
FAILED - RETRYING: Get wait tasks results (7 retries left).
failed: [localhost] (item={
FAILED - RETRYING: Get wait tasks results (10 retries left).
failed: [localhost] (item={
## Look for failed pods
controller-
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-
kube-system calico-node-msntq 0/1 CrashLoopBackOff 11 27m
kube-system coredns-
kube-system coredns-
kube-system kube-apiserver-
kube-system kube-controller
kube-system kube-multus-
kube-system kube-proxy-mjj92 1/1 Running 0 27m
kube-system kube-scheduler-
kube-system kube-sriov-
kube-system tiller-
controller-
REPOSITORY TAG IMAGE ID CREATED SIZE
starlingx/
k8s.gcr.
k8s.gcr.
k8s.gcr.
k8s.gcr.
quay.io/
quay.io/calico/node v3.6.2 707815f0ee0a 4 months ago 73.2MB
quay.io/calico/cni v3.6.2 14f1e7286a2d 4 months ago 84.3MB
nfvpe/multus v3.2 45da14a16acc 6 months ago 500MB
gcr.io/
k8s.gcr.io/coredns 1.3.1 eb516548c180 8 months ago 40.3MB
k8s.gcr.io/pause 3.1 da86e6ba6ca1 21 months ago 742kB
## Check the log of failed pod calico-node-msntq
controller-
Threshold time for bird readiness check: 30s
2019-09-26 20:20:39.124 [INFO][8] startup.go 256: Early log level set to info
2019-09-26 20:20:39.124 [INFO][8] startup.go 272: Using NODENAME environment for node name
2019-09-26 20:20:39.124 [INFO][8] startup.go 284: Determined node name: controller-0
2019-09-26 20:20:39.125 [INFO][8] startup.go 316: Checking datastore connection
2019-09-26 20:20:39.125 [INFO][8] startup.go 331: Hit error connecting to datastore - retry error=Get https://[fd04::
2019-09-26 20:20:40.126 [INFO][8] startup.go 331: Hit error connecting to datastore - retry error=Get https://[fd04::
2019-09-26 20:20:41.126 [INFO][8] startup.go 331: Hit error connecting to datastore - retry error=Get https://[fd04::
2019-09-26 20:20:42.127 [INFO][8] startup.go 331: Hit error connecting to datastore - retry error=Get https://[fd04::
[fd04::1]:443 is the cluster service address of the Kubernetes apiserver, which calico-node cannot reach.
Severity
--------
Critical: unable to restore an IPv6-configured lab
Steps to Reproduce
------------------
1. Bring up the IPv6 regular system
2. Back up the system locally using Ansible
3. Re-install the controller with the same load
4. Restore the active controller
Expected Behavior
------------------
The active controller should be successfully restored
Actual Behavior
----------------
Active controller restore failed
Reproducibility
---------------
Reproducible
System Configuration
-------
IPv6 configured system
Branch/Pull Time/Commit
-------
BUILD_
Test Activity
-------------
Feature Testing
tags: | added: stx.3.0 stx.update |
Changed in starlingx: | |
status: | New → Triaged |
importance: | Undecided → High |
Changed in starlingx: | |
status: | Confirmed → In Progress |
tags: | added: stx.retestneeded |
tags: | removed: stx.retestneeded |
Joseph Richard investigated and determined that the default route is missing. He believes this is the same issue as reported in https://bugs.launchpad.net/starlingx/+bug/1844192.
Marking this as a duplicate.
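The missing-default-route diagnosis can be verified by inspecting `ip -6 route show` on the restored node. A small sketch of that check, using illustrative sample outputs (the actual route tables from the lab are not in this report):

```python
# Sketch of the check the diagnosis implies: after restore, `ip -6 route`
# on the node had no `default` entry, so calico-node could not reach the
# apiserver cluster service IP. Sample outputs below are illustrative.

def has_default_route(ip6_route_output: str) -> bool:
    """Return True if any line of `ip -6 route show` output is a default route."""
    return any(line.split()[:1] == ["default"]
               for line in ip6_route_output.splitlines() if line.strip())

healthy = ("default via face::1 dev vlan166 proto static metric 1024\n"
           "face::/64 dev vlan166 proto kernel metric 256")
broken = "face::/64 dev vlan166 proto kernel metric 256"

print(has_default_route(healthy), has_default_route(broken))  # True False
```

On a correctly restored IPv6 controller the first case should hold; the second matches the failure state seen here.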