tripleo-quickstart needs to be more resilient to ssh connectivity issues

Bug #1714014 reported by Matt Young on 2017-08-30
This bug affects 1 person
Affects: tripleo
Importance: High
Assigned to: Natal Ngétal

Bug Description

TL;DR: when tasks take too long (likely due to SSH instability), tripleo-quickstart is prone to failing with "Timeout (12s) waiting for privilege escalation prompt" errors.

Potential mitigations include:

- increasing the timeout via TQ:ansible.cfg
- upgrading to Ansible 2.3.1, to pick up https://github.com/ansible/ansible/pull/23710
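
For illustration, the first mitigation could be sketched as a small Python helper that rewrites an ansible.cfg. This is a minimal sketch, not the patch itself: the helper name `with_timeouts` is hypothetical, the value 20 is the one discussed later in this report, and it assumes the relevant settings are `timeout` under `[defaults]` and `retries` under `[ssh_connection]`.

```python
import configparser
import io


def with_timeouts(cfg_text, timeout=20, retries=3):
    """Return ansible.cfg text with the timeout and SSH retry settings applied.

    Hypothetical helper. Assumes `timeout` lives in [defaults] and
    `retries` in [ssh_connection].
    """
    # interpolation=None so '%' characters in existing values are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)

    for section in ("defaults", "ssh_connection"):
        if not parser.has_section(section):
            parser.add_section(section)

    parser.set("defaults", "timeout", str(timeout))
    parser.set("ssh_connection", "retries", str(retries))

    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    print(with_timeouts("[defaults]\ntimeout = 10\n"))
```

Note that configparser drops comments when it rewrites a file, so for a hand-maintained ansible.cfg it may be simpler to edit the two keys directly.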

---

Longer version...

We have been seeing intermittent CI failures in RDO Phase 2, where we use tripleo-quickstart to run a variety of CI jobs that validate RDO on HA, bare metal, and other configurations. The CI debugging Trello card is here:

- https://trello.com/c/e3zbRidd/261-rdophase2-ansible-ssh-timeouts-in-become-module-timeout-12s-waiting-for-privilege-escalation-prompt

Here are a few concrete examples:
===

- https://thirdparty.logs.rdoproject.org/jenkins-promote-rhel-pike-rdo_trunk-virtha-3ctlr_1comp_192gb-3/console.txt.gz

The timeout actually occurs several times in teardown tasks whose failures are ignored, until it finally hits a task whose failure is not ignored:

```
21:18:09 TASK [environment/teardown : Remove bridge whitelisting from qemu bridge helper] ***
21:18:09 task path: /home/rhos-ci/jenkins/workspace/promote-rhel-pike-rdo_trunk-virtha-3ctlr_1comp_192gb/tripleo-quickstart/roles/environment/teardown/tasks/main.yml:46
21:18:09 Tuesday 29 August 2017 21:18:09 +0000 (0:00:00.229) 0:04:37.703 ********
21:18:21 <haa-08.ha.lab.eng.bos.redhat.com> ssh_retry: attempt: 0, caught exception(Timeout (12s) waiting for privilege escalation prompt: ) from cmd (/bin/sh -c 'sudo -H -S -n -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-diprjprzcicizfblssoufgxasxhdlbna; /usr/bin/python'"'"' && sleep 0'...), pausing for 0 seconds
21:18:33 <haa-08.ha.lab.eng.bos.redhat.com> ssh_retry: attempt: 1, caught exception(Timeout (12s) waiting for privilege escalation prompt: ) from cmd (/bin/sh -c 'sudo -H -S -n -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-diprjprzcicizfblssoufgxasxhdlbna; /usr/bin/python'"'"' && sleep 0'...), pausing for 1 seconds
21:18:46 <haa-08.ha.lab.eng.bos.redhat.com> ssh_retry: attempt: 2, caught exception(Timeout (12s) waiting for privilege escalation prompt: ) from cmd (/bin/sh -c 'sudo -H -S -n -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-diprjprzcicizfblssoufgxasxhdlbna; /usr/bin/python'"'"' && sleep 0'...), pausing for 3 seconds
21:19:01 fatal: [haa-08.ha.lab.eng.bos.redhat.com]: FAILED! => {"failed": true, "msg": "Timeout (12s) waiting for privilege escalation prompt: "}
```

- https://thirdparty.logs.rdoproject.org/jenkins-oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans-27/console.txt.gz

```
TASK [repo-setup : Setup repos on live host] ***********************************
task path: /home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/tripleo-quickstart/roles/repo-setup/tasks/setup_repos.yml:1
Tuesday 29 August 2017 17:43:24 -0400 (0:00:00.247) 0:36:34.460 ********
Using module file /home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/lib/python2.7/site-packages/ansible/modules/core/commands/command.py
<undercloud> ESTABLISH SSH CONNECTION FOR USER: stack
<undercloud> SSH: EXEC ssh -vvv -F /home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/ssh.config.ansible -o StrictHostKeyChecking=no -o 'IdentityFile="/home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/id_rsa_undercloud"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=stack -o ConnectTimeout=10 -F /home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/ssh.config.ansible undercloud '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-pxlzdbjebzfjsbnptwrvskdngvvgbfld; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
<undercloud> ssh_retry: attempt: 0, caught exception(Timeout (12s) waiting for privilege escalation prompt: ) from cmd (/bin/sh -c 'sudo -H -S -n -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-pxlzdbjebzfjsbnptwrvskdngvvgbfld; /usr/bin/python'"'"' && sleep 0'...), pausing for 0 seconds
<undercloud> ESTABLISH SSH CONNECTION FOR USER: stack
<undercloud> SSH: EXEC ssh -vvv -F /home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/ssh.config.ansible -o StrictHostKeyChecking=no -o 'IdentityFile="/home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/id_rsa_undercloud"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=stack -o ConnectTimeout=10 -F /home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/ssh.config.ansible undercloud '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-pxlzdbjebzfjsbnptwrvskdngvvgbfld; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
<undercloud> ssh_retry: attempt: 1, caught exception(Timeout (12s) waiting for privilege escalation prompt: ) from cmd (/bin/sh -c 'sudo -H -S -n -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-pxlzdbjebzfjsbnptwrvskdngvvgbfld; /usr/bin/python'"'"' && sleep 0'...), pausing for 1 seconds
<undercloud> ESTABLISH SSH CONNECTION FOR USER: stack
<undercloud> SSH: EXEC ssh -vvv -F /home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/ssh.config.ansible -o StrictHostKeyChecking=no -o 'IdentityFile="/home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/id_rsa_undercloud"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=stack -o ConnectTimeout=10 -F /home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/ssh.config.ansible undercloud '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-pxlzdbjebzfjsbnptwrvskdngvvgbfld; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
<undercloud> ssh_retry: attempt: 2, caught exception(Timeout (12s) waiting for privilege escalation prompt: ) from cmd (/bin/sh -c 'sudo -H -S -n -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-pxlzdbjebzfjsbnptwrvskdngvvgbfld; /usr/bin/python'"'"' && sleep 0'...), pausing for 3 seconds
<undercloud> ESTABLISH SSH CONNECTION FOR USER: stack
<undercloud> SSH: EXEC ssh -vvv -F /home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/ssh.config.ansible -o StrictHostKeyChecking=no -o 'IdentityFile="/home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/id_rsa_undercloud"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=stack -o ConnectTimeout=10 -F /home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/ssh.config.ansible undercloud '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-pxlzdbjebzfjsbnptwrvskdngvvgbfld; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
fatal: [undercloud]: FAILED! => {
    "failed": true,
    "msg": "Timeout (12s) waiting for privilege escalation prompt: "
}
```
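
To get a rough count of how often this happens in a given console log, something like the following Python sketch could be used. It only relies on the message formats visible in the excerpts above; the function name `summarize_timeouts` is hypothetical, and the matching rules are assumptions based on these two logs.

```python
import re

# Message text as it appears in the console logs above.
TIMEOUT_RE = re.compile(r"Timeout \((\d+)s\) waiting for privilege escalation prompt")
# Marker Ansible prints when the ssh connection plugin retries a command.
RETRY_RE = re.compile(r"ssh_retry: attempt: (\d+)")


def summarize_timeouts(lines):
    """Count retried and fatal privilege-escalation timeouts in log lines."""
    retried = 0
    fatal = 0
    for line in lines:
        if not TIMEOUT_RE.search(line):
            continue
        if RETRY_RE.search(line):
            # A timeout that was caught and retried.
            retried += 1
        elif "fatal:" in line or '"msg"' in line:
            # The final failure, printed either on one line or as a
            # multi-line JSON body whose "msg" field carries the text.
            fatal += 1
    return {"retried": retried, "fatal": fatal}


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], errors="replace") as handle:
        print(summarize_timeouts(handle))
```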

Matt Young (halcyondude) wrote :

As this is intermittent, the importance is not 'critical'. However, because it jams the production chain when it does occur, I am setting it to 'high'.

Changed in tripleo:
assignee: nobody → Matt Young (halcyondude)
importance: Undecided → High
milestone: none → pike-rc2
milestone: pike-rc2 → queens-1
wes hayutin (weshayutin) on 2017-08-30
Changed in tripleo:
status: New → Triaged
Changed in tripleo:
milestone: queens-1 → queens-2
Changed in tripleo:
milestone: queens-2 → queens-3
Changed in tripleo:
milestone: queens-3 → queens-rc1
Changed in tripleo:
milestone: queens-rc1 → rocky-1
Changed in tripleo:
milestone: rocky-1 → rocky-2
Changed in tripleo:
milestone: rocky-2 → rocky-3
Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Changed in tripleo:
milestone: rocky-rc1 → stein-1
Changed in tripleo:
milestone: stein-1 → stein-2

Fix proposed to branch: master
Review: https://review.openstack.org/617663

Changed in tripleo:
assignee: Matt Young (halcyondude) → Natal Ngétal (hobbestigrou)
status: Triaged → In Progress
Natal Ngétal (hobbestigrou) wrote :

A patch to increase the SSH timeout is ready for review:

https://review.openstack.org/#/c/617663/

Natal Ngétal (hobbestigrou) wrote :

The Ansible version has already been updated. The project currently uses Ansible 2.5.7.

Sorin Sbarnea (ssbarnea) wrote :

I am all for improving the reliability of CI jobs to make them more resilient to networking glitches. Still, we need to answer a few questions before changing the defaults:

A) We need proof that this is a recurrent issue affecting more than 1/1000 jobs. Please write a query that demonstrates this on http://logstash.openstack.org and put a link to it in the ticket.

B) How does this relate to SSH retries, as documented at https://docs.ansible.com/ansible/latest/reference_appendices/config.html?highlight=retries#envvar-ANSIBLE_SSH_RETRIES ?

I see that we already have 3 retries. Does this mean that the failure case would take 3*20s instead of 3*10s (assuming the Ansible default is not already overridden by a job-defined Ansible variable)?

C) While the proposed value of 20s seems reasonable to me, I wonder whether this should be defined per cloud-job configuration instead of as a generic default.
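
As a rough way to reason about question B, the worst-case time for a task to fail can be estimated from the first log excerpt in the bug description. That excerpt shows four attempts (three caught by ssh_retry plus the final fatal one), each waiting 12s, with pauses of 0, 1 and 3 seconds in between, for about 52 seconds in total (21:18:09 to 21:19:01). The Python sketch below models that pattern. It is an assumption derived from this one log rather than from the Ansible documentation: the 2-second padding on top of the configured timeout and the `2**attempt - 1` pause formula are inferred from the observed values, and the function name is hypothetical.

```python
def worst_case_seconds(timeout, retries, padding=2):
    """Estimate how long a task takes to fail when every attempt times out.

    Assumptions inferred from the log excerpt in this report:
    - each attempt waits `timeout + padding` seconds (10 + 2 = 12s observed)
    - `retries` retried attempts are followed by one final attempt
    - the pause after retried attempt i is 2**i - 1 seconds (0, 1, 3 observed)
    """
    attempts = retries + 1
    pauses = sum(2 ** i - 1 for i in range(retries))
    return attempts * (timeout + padding) + pauses


if __name__ == "__main__":
    for timeout in (10, 20):
        print(timeout, worst_case_seconds(timeout, retries=3))
```

Under these assumptions, a timeout of 10 with 3 retries gives 52 seconds, matching the log, and raising the timeout to 20 gives 92 seconds.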

Changed in tripleo:
milestone: stein-2 → stein-3

Is this still an issue?

Changed in tripleo:
milestone: stein-3 → stein-rc1
Changed in tripleo:
milestone: stein-rc1 → train-1
Changed in tripleo:
milestone: train-1 → train-2
Changed in tripleo:
milestone: train-2 → train-3
Changed in tripleo:
milestone: train-3 → ussuri-1