Backup & Restore: Calico pods don't recover after restore on a multi-node system
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Cole Walker |
Bug Description
Brief Description
-----------------
Restore fails on a system with multiple worker nodes. This happens under a specific scenario where the calico-
Severity
--------
Major
Steps to Reproduce
------------------
- On a multi-node system, ensure the calico-
- Perform a backup
- Perform a restore
Expected Behavior
------------------
The restore should pass
Actual Behavior
----------------
The restore fails
2020-08-07 20:32:04,505 p=11111 u=sysadmin | TASK [bootstrap/
Tiller pods is not ready by this time] ***
2020-08-07 20:32:04,551 p=11111 u=sysadmin | failed: [localhost] (item={
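
The failing task above times out while waiting for pods to become ready. As a triage aid, the following is a minimal sketch (not part of the StarlingX playbooks) that lists pods whose `Ready` condition is not `True`, using the JSON output of `kubectl get pods`. The `kube-system` namespace default and the helper names `fetch_pods` and `not_ready_pods` are assumptions made for illustration.

```python
import json
import subprocess


def not_ready_pods(pod_list):
    """Return names of pods whose 'Ready' condition is not 'True'.

    pod_list is the parsed output of `kubectl get pods -o json`.
    Note: completed (Succeeded) pods also report Ready=False.
    """
    names = []
    for pod in pod_list.get("items", []):
        conditions = pod.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            names.append(pod["metadata"]["name"])
    return names


def fetch_pods(namespace="kube-system", selector=None):
    """Run kubectl and return the parsed pod list.

    The namespace default is an assumption; pass a label selector
    (for example the one used by the failing task) to narrow the result.
    """
    cmd = ["kubectl", "get", "pods", "-n", namespace, "-o", "json"]
    if selector:
        cmd += ["-l", selector]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return json.loads(result.stdout)
```

On the active controller, `not_ready_pods(fetch_pods())` would show which pods are blocking the wait at the moment the task fails.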
Reproducibility
---------------
Reproducible under the conditions explained above
System Configuration
-------
multi-node system with > 1 worker node
Branch/Pull Time/Commit
-------
stx master as of 2020-06-28, but expected to be a day 1 issue
Last Pass
---------
Other B&R tests pass, but this particular config was not explicitly tested previously
Timestamp/Logs
--------------
**Start of first restore attempt**
2020-08-
yml -e 'initial_
**First failure from ansible log** (My opinion, based on the failures below, is that the calico networking pod failing to start is the trigger)
2020-08-07 20:32:04,505 p=11111 u=sysadmin | TASK [bootstrap/
Tiller pods is not ready by this time] ***
2020-08-07 20:32:04,551 p=11111 u=sysadmin | failed: [localhost] (item=
{'_ansible_parsed': True, 'stderr_lines': [u'error: timed out waiting for the condition on pods/calico-
, "msg": "Pod k8s-app=
*Tiller*
"2020-08-07 20:30:32.115038", "stderr": "error: timed out waiting for the condition on pods/tiller-
or the condition on pods/tiller-
**Tiller and calico-
**Second restore run**
2020-08-
.yml -e 'initial_
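
When comparing restore attempts, it can help to pull the failed entries and the affected pod names out of the ansible log. The sketch below assumes the line prefix seen in the excerpts above (`<timestamp> p=<pid> u=<user> | <message>`) and the `timed out waiting for the condition on pods/<name>` message; the function name `find_failures` is hypothetical.

```python
import re

# Ansible log prefix as seen in this report, e.g.
# "2020-08-07 20:32:04,551 p=11111 u=sysadmin | failed: [localhost] ..."
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"p=(?P<pid>\d+) u=(?P<user>\S+) \| (?P<msg>.*)$"
)

# Wait timeouts are reported as "... on pods/<pod-name>"
POD_RE = re.compile(
    r"timed out waiting for the condition on pods/([A-Za-z0-9.-]+)"
)


def find_failures(lines):
    """Return a list of (timestamp, [pod names]) for each 'failed:' entry.

    Lines without a timestamp prefix are treated as continuations of the
    previous entry, since the item dict is often wrapped across lines.
    """
    failures = []
    current = None
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            text = match.group("msg")
            if text.startswith("failed:"):
                current = (match.group("ts"), [])
                failures.append(current)
            else:
                current = None  # a non-failed entry ends the previous one
        else:
            text = line
        if current is not None:
            for pod in POD_RE.findall(text):
                if pod not in current[1]:
                    current[1].append(pod)
    return failures
```

Calling `find_failures(open(path))` on the ansible log (the path depends on the installation) yields one tuple per failed task item, which makes it easier to see whether the calico or the tiller pod timed out first.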
Test Activity
-------------
Testing
Workaround
----------
Unknown
tags: added: stx.networking
Changed in starlingx:
status: Triaged → In Progress
stx.5.0 / medium priority - specific B&R failure