ansible-playbook failed at get wait task results

Bug #1832852 reported by Yang Liu
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Critical
Assigned to: Tee Ngo

Bug Description

Brief Description
-----------------
ansible-playbook failed at check for controller-0 online status

Severity
--------
Critical

Steps to Reproduce
------------------
Install controller-0
Run ansible-playbook to configure controller-0

Expected Behavior
------------------
ansible-playbook ran successfully

Actual Behavior
----------------
ansible-playbook failed. See Timestamp/Logs for details.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Any system

Branch/Pull Time/Commit
-----------------------
master 20190614T013000Z

Last Pass
---------
master 20190613T013000Z

Timestamp/Logs
--------------
TASK [bringup-essential-services : Update resolv.conf file for unlock] *********
changed: [localhost]

TASK [bringup-essential-services : Check for controller-0 online status] *******
FAILED - RETRYING: Check for controller-0 online status (10 retries left).
changed: [localhost]

TASK [bringup-essential-services : Wait for 60 seconds to ensure kube-system pods are all started] ***

TASK [bringup-essential-services : Start parallel tasks to wait for Kubernetes component, Networking and Tiller pods to reach ready state] ***
changed: [localhost] => (item=k8s-app=calico-node)
changed: [localhost] => (item=k8s-app=calico-kube-controllers)
changed: [localhost] => (item=k8s-app=kube-proxy)
changed: [localhost] => (item=app=multus)
changed: [localhost] => (item=app=sriov-cni)
changed: [localhost] => (item=app=helm)
changed: [localhost] => (item=component=kube-apiserver)
changed: [localhost] => (item=component=kube-controller-manager)
changed: [localhost] => (item=component=kube-scheduler)

TASK [bringup-essential-services : Get wait tasks results] *********************
FAILED - RETRYING: Get wait tasks results (10 retries left).
changed: [localhost] => (item={'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': u'k8s-app=calico-node', u'ansible_job_id': u'915908010294.112925', 'failed': False, u'started': 1, 'changed': True, 'item': u'k8s-app=calico-node', u'finished': 0, u'results_file': u'/root/.ansible_async/915908010294.112925', '_ansible_ignore_errors': None, '_ansible_no_log': False})
FAILED - RETRYING: Get wait tasks results (10 retries left).
FAILED - RETRYING: Get wait tasks results (9 retries left).
changed: [localhost] => (item={'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': u'k8s-app=calico-kube-controllers', u'ansible_job_id': u'61243884225.113078', 'failed': False, u'started': 1, 'changed': True, 'item': u'k8s-app=calico-kube-controllers', u'finished': 0, u'results_file': u'/root/.ansible_async/61243884225.113078', '_ansible_ignore_errors': None, '_ansible_no_log': False})
changed: [localhost] => (item={'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': u'k8s-app=kube-proxy', u'ansible_job_id': u'583262417391.114092', 'failed': False, u'started': 1, 'changed': True, 'item': u'k8s-app=kube-proxy', u'finished': 0, u'results_file': u'/root/.ansible_async/583262417391.114092', '_ansible_ignore_errors': None, '_ansible_no_log': False})
changed: [localhost] => (item={'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': u'app=multus', u'ansible_job_id': u'230395068872.114362', 'failed': False, u'started': 1, 'changed': True, 'item': u'app=multus', u'finished': 0, u'results_file': u'/root/.ansible_async/230395068872.114362', '_ansible_ignore_errors': None, '_ansible_no_log': False})
changed: [localhost] => (item={'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': u'app=sriov-cni', u'ansible_job_id': u'189036162740.114451', 'failed': False, u'started': 1, 'changed': True, 'item': u'app=sriov-cni', u'finished': 0, u'results_file': u'/root/.ansible_async/189036162740.114451', '_ansible_ignore_errors': None, '_ansible_no_log': False})
changed: [localhost] => (item={'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': u'app=helm', u'ansible_job_id': u'892848644877.114531', 'failed': False, u'started': 1, 'changed': True, 'item': u'app=helm', u'finished': 0, u'results_file': u'/root/.ansible_async/892848644877.114531', '_ansible_ignore_errors': None, '_ansible_no_log': False})
failed: [localhost] (item={'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': u'component=kube-apiserver', u'ansible_job_id': u'302170236966.114604', 'failed': False, u'started': 1, 'changed': True, 'item': u'component=kube-apiserver', u'finished': 0, u'results_file': u'/root/.ansible_async/302170236966.114604', '_ansible_ignore_errors': None, '_ansible_no_log': False}) => {"ansible_job_id": "302170236966.114604", "attempts": 1, "changed": true, "cmd": ["kubectl", "--kubeconfig=/etc/kubernetes/admin.conf", "wait", "--namespace=kube-system", "--for=condition=Ready", "pods", "--selector", "component=kube-apiserver", "--timeout=30s"], "delta": "0:00:00.118897", "end": "2019-06-14 06:38:48.812429", "finished": 1, "item": {"ansible_job_id": "302170236966.114604", "changed": true, "failed": false, "finished": 0, "item": "component=kube-apiserver", "results_file": "/root/.ansible_async/302170236966.114604", "started": 1}, "msg": "non-zero return code", "rc": 1, "start": "2019-06-14 06:38:48.693532", "stderr": "error: no matching resources found", "stderr_lines": ["error: no matching resources found"], "stdout": "", "stdout_lines": []}
failed: [localhost] (item={'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': u'component=kube-controller-manager', u'ansible_job_id': u'247619792932.115001', 'failed': False, u'started': 1, 'changed': True, 'item': u'component=kube-controller-manager', u'finished': 0, u'results_file': u'/root/.ansible_async/247619792932.115001', '_ansible_ignore_errors': None, '_ansible_no_log': False}) => {"ansible_job_id": "247619792932.115001", "attempts": 1, "changed": true, "cmd": ["kubectl", "--kubeconfig=/etc/kubernetes/admin.conf", "wait", "--namespace=kube-system", "--for=condition=Ready", "pods", "--selector", "component=kube-controller-manager", "--timeout=30s"], "delta": "0:00:00.120772", "end": "2019-06-14 06:38:49.916527", "finished": 1, "item": {"ansible_job_id": "247619792932.115001", "changed": true, "failed": false, "finished": 0, "item": "component=kube-controller-manager", "results_file": "/root/.ansible_async/247619792932.115001", "started": 1}, "msg": "non-zero return code", "rc": 1, "start": "2019-06-14 06:38:49.795755", "stderr": "error: no matching resources found", "stderr_lines": ["error: no matching resources found"], "stdout": "", "stdout_lines": []}
changed: [localhost] => (item={'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': u'component=kube-scheduler', u'ansible_job_id': u'843007711619.115157', 'failed': False, u'started': 1, 'changed': True, 'item': u'component=kube-scheduler', u'finished': 0, u'results_file': u'/root/.ansible_async/843007711619.115157', '_ansible_ignore_errors': None, '_ansible_no_log': False})

PLAY RECAP *********************************************************************
localhost : ok=198 changed=120 unreachable=0 failed=1
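
For context, the failing sequence above follows the standard Ansible async/poll pattern: one background "kubectl wait" is launched per pod label selector, and a follow-up task polls the async jobs with retries. The sketch below is an illustrative reconstruction based on this log (selectors, kubeconfig path and 30s timeout taken from the failed items above); it is not the actual bringup-essential-services role source, and the retry/delay values are examples only.

    # Illustrative sketch only -- reconstructed from the log, not the real playbook.
    - name: Start parallel tasks to wait for pods to reach ready state
      command: >
        kubectl --kubeconfig=/etc/kubernetes/admin.conf wait
        --namespace=kube-system --for=condition=Ready pods
        --selector {{ item }} --timeout=30s
      async: 120                   # run each kubectl wait in the background
      poll: 0
      register: wait_tasks
      with_items:
        - k8s-app=calico-node
        - component=kube-apiserver # remaining selectors as listed in the log

    - name: Get wait tasks results
      async_status:
        jid: "{{ item.ansible_job_id }}"
      register: wait_results
      until: wait_results.finished
      retries: 10                  # matches the "10 retries left" messages above
      delay: 6                     # example value
      with_items: "{{ wait_tasks.results }}"

With this pattern an item fails as soon as its background kubectl wait exits non-zero, which is what happens here: the kube-apiserver and kube-controller-manager selectors return "error: no matching resources found" because no pods matching those selectors existed yet when the wait ran.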

Test Activity
-------------
Sanity

Numan Waheed (nwaheed)
tags: added: stx.retestneeded stx.sanity
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Tee Ngo (teewrs)
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Please double-check whether the issue really is a missing online status.

The host came online at the 12-minute uptime mark, and the online timeout for controllers is 20 minutes.

| d6f3c762-772a-48f6-ba22-4cc4283cfdfe | platform | maintenance | controller_boot_timeout | 1200 |

The auto-install logs suggest the failure was in starting the essential services.

Revision history for this message
Tee Ngo (teewrs) wrote :

This should be resolved via commit 9568d970f174349aab9b17dcd1673b13aa9ebce2.

summary: - ansible-playbook failed at checking for controller-0 online status
+ ansible-playbook failed at get wait task results
Revision history for this message
Ghada Khalil (gkhalil) wrote :

This review increases the Ansible timeout:
https://review.opendev.org/#/c/665353/

Investigation is still ongoing regarding the cause of the general system slowdowns.

tags: added: stx.2.0
Changed in starlingx:
status: New → Triaged
importance: Undecided → Critical
Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

This bug is still present on Build 20190616T233000Z

controller-0:~$ cat /etc/build.info
###
### StarlingX
### Built from master
###

OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190616T233000Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="148"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-06-16 23:30:00 +0000"
controller-0:~$ grep FAILED ansible.log
2019-06-17 00:15:59,776 p=19067 u=wrsroot | FAILED - RETRYING: Check for controller-0 online status (10 retries left).
2019-06-17 00:16:47,624 p=19067 u=wrsroot | FAILED - RETRYING: Get wait tasks results (10 retries left).
controller-0:~$

Reproduced on Simplex, Duplex and Standard.

Revision history for this message
Tee Ngo (teewrs) wrote :

Please add pods_wait_time to localhost.yml and try again as follows:

pods_wait_time: 45

If it is still failing, increase it to 60.
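
For reference, a minimal localhost.yml sketch with this override (only pods_wait_time is relevant to this bug; any other bootstrap overrides already in your localhost.yml stay as they are):

    # localhost.yml -- bootstrap override sketch
    pods_wait_time: 45   # seconds; bump to 60 if the wait tasks still fail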

Revision history for this message
Tee Ngo (teewrs) wrote :

Cristopher, can you double-check the ansible.log to see whether the bootstrap actually failed? (please grep -A2 PLAY RECAP ansible.log). A "FAILED - RETRYING" message on its own does not mean the playbook failed.

Revision history for this message
Tee Ngo (teewrs) wrote :

The command should be: grep -A2 "PLAY RECAP" ansible.log

If the bootstrap playbook failed at the "get wait tasks results" task, ansible.log should contain at least one "no matching resources found" entry, so output from the following command would confirm the issue:

     grep "error: no matching resources found" ansible.log

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Another commit to extend the timeout was merged on 2019-06-17
https://review.opendev.org/#/c/665756/

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

With image 20190617T233000Z, this error was not reproduced on any of the 4 configs (setup fails for a different reason, related to DRBD during unlock).

Now, checking the logs from the 06/16 image, this is the summary (RECAP):
PLAY RECAP *********************************************************************
localhost : ok=211 changed=127 unreachable=0 failed=1

I assume that the extra time-out was enough to prevent this failure. Let me know if you need further details. Thanks for the help.

Tee Ngo (teewrs)
Changed in starlingx:
status: Triaged → Fix Released
Revision history for this message
Yang Liu (yliu12) wrote :

No longer seen in WR daily sanity.

tags: removed: stx.retestneeded