tripleo-quickstart needs to be more resilient to ssh connectivity issues

Bug #1714014 reported by Matt Young
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
tripleo | Won't Fix | High | Natal Ngétal |

Bug Description

TLDR - when tasks take too long (likely due to ssh instability), tripleo-quickstart is prone to fail with "Timeout (12s) waiting for privilege escalation prompt" errors

Potential mitigations include:

- increasing the timeout in tripleo-quickstart's ansible.cfg (TQ:ansible.cfg).
- upgrading to ansible 2.3.1, to pick up https://github.com/ansible/ansible/pull/23710
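
As a rough illustration of the first mitigation, the sketch below sets the `timeout` option in the `[defaults]` section of an ansible.cfg. The helper name `with_timeout` and the value 30 are illustrative assumptions, not values taken from tripleo-quickstart. The logged ssh commands use `ConnectTimeout=10` while the error reports 12s, which suggests the prompt timeout is derived from this setting plus a small margin; that relationship is an inference from the logs.

```python
import configparser
import io


def with_timeout(cfg_text: str, seconds: int) -> str:
    """Return ansible.cfg text with [defaults] timeout set to `seconds`.

    Hypothetical helper for illustration only. Note that configparser
    drops comments when it rewrites the file.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(cfg_text)
    if not parser.has_section("defaults"):
        parser.add_section("defaults")
    parser.set("defaults", "timeout", str(seconds))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Example: raise the timeout in a minimal config (30 is an arbitrary value).
print(with_timeout("[defaults]\nretry_files_enabled = False\n", 30))
```

The same option can also be overridden per job through the `ANSIBLE_TIMEOUT` environment variable, which avoids editing the shared file.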

---

Longer version...

We have been seeing intermittent CI failures in RDO Phase 2, where we use tripleo-quickstart to run a variety of CI jobs to validate RDO on HA, bare metal, and other configurations. The CI debugging Trello card is here:

- https://trello.com/c/e3zbRidd/261-rdophase2-ansible-ssh-timeouts-in-become-module-timeout-12s-waiting-for-privilege-escalation-prompt

Here are a few concrete examples:

- https://thirdparty.logs.rdoproject.org/jenkins-promote-rhel-pike-rdo_trunk-virtha-3ctlr_1comp_192gb-3/console.txt.gz

The timeout occurs several times in teardown tasks whose failures are ignored, until it finally hits a task whose failure is not ignored:

```
21:18:09 TASK [environment/teardown : Remove bridge whitelisting from qemu bridge helper] ***
21:18:09 task path: /home/rhos-ci/jenkins/workspace/promote-rhel-pike-rdo_trunk-virtha-3ctlr_1comp_192gb/tripleo-quickstart/roles/environment/teardown/tasks/main.yml:46
21:18:09 Tuesday 29 August 2017 21:18:09 +0000 (0:00:00.229) 0:04:37.703 ********
21:18:21 <haa-08.ha.lab.eng.bos.redhat.com> ssh_retry: attempt: 0, caught exception(Timeout (12s) waiting for privilege escalation prompt: ) from cmd (/bin/sh -c 'sudo -H -S -n -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-diprjprzcicizfblssoufgxasxhdlbna; /usr/bin/python'"'"' && sleep 0'...), pausing for 0 seconds
21:18:33 <haa-08.ha.lab.eng.bos.redhat.com> ssh_retry: attempt: 1, caught exception(Timeout (12s) waiting for privilege escalation prompt: ) from cmd (/bin/sh -c 'sudo -H -S -n -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-diprjprzcicizfblssoufgxasxhdlbna; /usr/bin/python'"'"' && sleep 0'...), pausing for 1 seconds
21:18:46 <haa-08.ha.lab.eng.bos.redhat.com> ssh_retry: attempt: 2, caught exception(Timeout (12s) waiting for privilege escalation prompt: ) from cmd (/bin/sh -c 'sudo -H -S -n -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-diprjprzcicizfblssoufgxasxhdlbna; /usr/bin/python'"'"' && sleep 0'...), pausing for 3 seconds
21:19:01 fatal: [haa-08.ha.lab.eng.bos.redhat.com]: FAILED! => {"failed": true, "msg": "Timeout (12s) waiting for privilege escalation prompt: "}
```

- https://thirdparty.logs.rdoproject.org/jenkins-oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans-27/console.txt.gz

```
TASK [repo-setup : Setup repos on live host] ***********************************
task path: /home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/tripleo-quickstart/roles/repo-setup/tasks/setup_repos.yml:1
Tuesday 29 August 2017 17:43:24 -0400 (0:00:00.247) 0:36:34.460 ********
Using module file /home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/lib/python2.7/site-packages/ansible/modules/core/commands/command.py
<undercloud> ESTABLISH SSH CONNECTION FOR USER: stack
<undercloud> SSH: EXEC ssh -vvv -F /home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/ssh.config.ansible -o StrictHostKeyChecking=no -o 'IdentityFile="/home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/id_rsa_undercloud"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=stack -o ConnectTimeout=10 -F /home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/ssh.config.ansible undercloud '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-pxlzdbjebzfjsbnptwrvskdngvvgbfld; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
<undercloud> ssh_retry: attempt: 0, caught exception(Timeout (12s) waiting for privilege escalation prompt: ) from cmd (/bin/sh -c 'sudo -H -S -n -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-pxlzdbjebzfjsbnptwrvskdngvvgbfld; /usr/bin/python'"'"' && sleep 0'...), pausing for 0 seconds
<undercloud> ESTABLISH SSH CONNECTION FOR USER: stack
<undercloud> SSH: EXEC ssh -vvv -F /home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/ssh.config.ansible -o StrictHostKeyChecking=no -o 'IdentityFile="/home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/id_rsa_undercloud"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=stack -o ConnectTimeout=10 -F /home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/ssh.config.ansible undercloud '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-pxlzdbjebzfjsbnptwrvskdngvvgbfld; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
<undercloud> ssh_retry: attempt: 1, caught exception(Timeout (12s) waiting for privilege escalation prompt: ) from cmd (/bin/sh -c 'sudo -H -S -n -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-pxlzdbjebzfjsbnptwrvskdngvvgbfld; /usr/bin/python'"'"' && sleep 0'...), pausing for 1 seconds
<undercloud> ESTABLISH SSH CONNECTION FOR USER: stack
<undercloud> SSH: EXEC ssh -vvv -F /home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/ssh.config.ansible -o StrictHostKeyChecking=no -o 'IdentityFile="/home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/id_rsa_undercloud"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=stack -o ConnectTimeout=10 -F /home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/ssh.config.ansible undercloud '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-pxlzdbjebzfjsbnptwrvskdngvvgbfld; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
<undercloud> ssh_retry: attempt: 2, caught exception(Timeout (12s) waiting for privilege escalation prompt: ) from cmd (/bin/sh -c 'sudo -H -S -n -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-pxlzdbjebzfjsbnptwrvskdngvvgbfld; /usr/bin/python'"'"' && sleep 0'...), pausing for 3 seconds
<undercloud> ESTABLISH SSH CONNECTION FOR USER: stack
<undercloud> SSH: EXEC ssh -vvv -F /home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/ssh.config.ansible -o StrictHostKeyChecking=no -o 'IdentityFile="/home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/id_rsa_undercloud"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=stack -o ConnectTimeout=10 -F /home/rhos-ci/jenkins/workspace/oooq-pike-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans/ssh.config.ansible undercloud '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-pxlzdbjebzfjsbnptwrvskdngvvgbfld; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
fatal: [undercloud]: FAILED! => {
    "failed": true,
    "msg": "Timeout (12s) waiting for privilege escalation prompt: "
}
```
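
To get a sense of how often this happens in a given job, a console log can be scanned for the `ssh_retry` lines shown above. The following is a minimal sketch; the function name `count_become_timeouts` is hypothetical, and the pattern is based only on the log format in these two examples, so other Ansible versions may word the message differently.

```python
import re
from collections import Counter

# Matches lines such as:
# <undercloud> ssh_retry: attempt: 0, caught exception(Timeout (12s) waiting
# for privilege escalation prompt: ) from cmd (...), pausing for 0 seconds
RETRY_RE = re.compile(
    r"<(?P<host>[^>]+)> ssh_retry: attempt: \d+, caught exception\("
    r"Timeout \(\d+s\) waiting for privilege escalation prompt"
)


def count_become_timeouts(lines):
    """Count ssh_retry privilege-escalation timeouts per host."""
    counts = Counter()
    for line in lines:
        match = RETRY_RE.search(line)
        if match:
            counts[match.group("host")] += 1
    return counts


sample = [
    "<undercloud> ESTABLISH SSH CONNECTION FOR USER: stack",
    "<undercloud> ssh_retry: attempt: 0, caught exception(Timeout (12s) "
    "waiting for privilege escalation prompt: ) from cmd (...), pausing for 0 seconds",
    "<undercloud> ssh_retry: attempt: 1, caught exception(Timeout (12s) "
    "waiting for privilege escalation prompt: ) from cmd (...), pausing for 1 seconds",
]
print(dict(count_become_timeouts(sample)))
```

Lines carrying a timestamp prefix, as in the first example, are matched as well because the pattern is searched anywhere in the line.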

Tags: ci quickstart
Revision history for this message
Matt Young (halcyondude) wrote :

As this is intermittent, Importance is not 'critical' - however as this jams the production chain when it does occur, setting to 'high'.

Changed in tripleo:
assignee: nobody → Matt Young (halcyondude)
importance: Undecided → High
milestone: none → pike-rc2
milestone: pike-rc2 → queens-1
wes hayutin (weshayutin)
Changed in tripleo:
status: New → Triaged
Changed in tripleo:
milestone: queens-1 → queens-2
Changed in tripleo:
milestone: queens-2 → queens-3
Changed in tripleo:
milestone: queens-3 → queens-rc1
Changed in tripleo:
milestone: queens-rc1 → rocky-1
Changed in tripleo:
milestone: rocky-1 → rocky-2
Changed in tripleo:
milestone: rocky-2 → rocky-3
Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Changed in tripleo:
milestone: rocky-rc1 → stein-1
Changed in tripleo:
milestone: stein-1 → stein-2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-quickstart (master)

Fix proposed to branch: master
Review: https://review.openstack.org/617663

Changed in tripleo:
assignee: Matt Young (halcyondude) → Natal Ngétal (hobbestigrou)
status: Triaged → In Progress
Revision history for this message
Natal Ngétal (hobbestigrou) wrote :

A patch to increase the ssh timeout is ready to review:

https://review.openstack.org/#/c/617663/

Revision history for this message
Natal Ngétal (hobbestigrou) wrote :

The version of ansible has already been updated. The current version of ansible in the project is 2.5.7.

Revision history for this message
Sorin Sbarnea (ssbarnea) wrote :

I am all for improving the reliability of CI jobs to make them more resilient to networking glitches. Still, we need to answer a few questions before making a change to the defaults:

A) We need proof that this is a recurrent issue affecting more than 1/1000 jobs. Please make a query that demonstrates that on http://logstash.openstack.org and put a link to it inside the ticket.

B) How does this refer to ssh retries? As documented on: https://docs.ansible.com/ansible/latest/reference_appendices/config.html?highlight=retries#envvar-ANSIBLE_SSH_RETRIES

I see that we already have 3 retries. Does this mean that the failure case would take 3*20s instead of 3*10s (assuming the ansible default is not already overridden by a job-defined ansible variable)?

C) While the proposed value of 20s seems reasonable to me I wonder if this should not be defined per cloud-job configuration instead of having a generic default.
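
On question B, the timestamps in the first log excerpt of the bug description give a rough model of the failure-case duration: four attempts of 12s each, with pauses of 0, 1 and 3 seconds between them, for 52 seconds in total (21:18:09 to 21:19:01). The sketch below encodes that model. The function name, the 2-second margin, and the `2 ** attempt - 1` pause formula are assumptions inferred from those logs, not taken from the ansible documentation.

```python
def worst_case_seconds(timeout, retries=3, prompt_margin=2):
    """Estimate how long a task takes to fail when every attempt times out.

    Assumptions inferred from the logs in this report:
    - each attempt waits timeout + prompt_margin seconds (10 + 2 = 12s),
    - there is one initial attempt plus `retries` retries,
    - the pause after failed attempt n is 2**n - 1 seconds (0, 1, 3).
    """
    attempts = retries + 1
    per_attempt = timeout + prompt_margin
    pauses = sum(2 ** attempt - 1 for attempt in range(retries))
    return attempts * per_attempt + pauses


print(worst_case_seconds(10))  # 52, matching the 52s seen in the first log
print(worst_case_seconds(20))  # 92, the estimate for the proposed 20s timeout
```

Under these assumptions, doubling the timeout to 20s moves the failure case from about 52s to about 92s per failing task, rather than from 30s to 60s.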

Changed in tripleo:
milestone: stein-2 → stein-3
Revision history for this message
Juan Antonio Osorio Robles (juan-osorio-robles) wrote :

Is this still an issue?

Changed in tripleo:
milestone: stein-3 → stein-rc1
Changed in tripleo:
milestone: stein-rc1 → train-1
Changed in tripleo:
milestone: train-1 → train-2
Changed in tripleo:
milestone: train-2 → train-3
Changed in tripleo:
milestone: train-3 → ussuri-1
Revision history for this message
mathieu bultel (mat-bultel) wrote :

I think this bug is not relevant anymore.
I'm going to close it. If anybody wants to reopen it, feel free.

Changed in tripleo:
status: In Progress → Won't Fix
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart (master)

Change abandoned by Sorin Sbarnea (<email address hidden>) on branch: master
Review: https://review.opendev.org/617663
