Remote restore fails with timeout replacing /etc/hosts

Bug #1986693 reported by Thiago Paiva Brito
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Thiago Paiva Brito
Milestone: (none listed)

Bug Description

Brief Description
-----------------
When attempting a remote restore on VirtualBox, the operation times out at the task "Restore /etc/hosts file". It was verified that once a previous task removes the existing /etc/hosts, any operation that requires sudo takes more than 40 seconds to prompt for a password.

The discussion in [1] leads me to conclude that the system should never be left without a valid /etc/hosts, even momentarily.

[1] https://stackoverflow.com/questions/39533532/ansible-timeout-12s-waiting-for-privilege-escalation-prompt
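
As an illustration of what "a valid /etc/hosts" could mean here, the sketch below checks whether a hosts file maps a given hostname. It is not part of the restore playbook; the function names are hypothetical, and it assumes that the sudo delay comes from the target failing to resolve its own hostname (as the "unable to resolve host" message in the logs suggests).

```python
from pathlib import Path


def hosts_maps_name(hosts_text: str, hostname: str) -> bool:
    """Return True if any non-comment line in hosts_text lists hostname."""
    for raw_line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        # fields[0] is the address; the rest are the name and its aliases.
        if hostname in fields[1:]:
            return True
    return False


def hostname_is_mapped(hosts_path: str, hostname: str) -> bool:
    """Same check against a file; a missing file counts as 'not mapped'."""
    try:
        text = Path(hosts_path).read_text()
    except FileNotFoundError:
        return False
    return hosts_maps_name(text, hostname)
```

On the target, something like hostname_is_mapped("/etc/hosts", "controller-0") would be expected to return False during the window in which the file has been removed.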

Severity
--------
Major: System/Feature is usable only when the playbook is run locally

Steps to Reproduce
------------------
Run the restore playbook from a non-StarlingX system

Expected Behavior
------------------
Restore succeeds

Actual Behavior
----------------
Restore times out

Reproducibility
---------------
3/3

System Configuration
--------------------
Virtual AIO-SX

Branch/Pull Time/Commit
-----------------------
2022-08-05

Last Pass
---------
N/A

Timestamp/Logs
--------------
TASK [bootstrap/bringup-essential-services : Restore /etc/hosts file] ****************************************************************************************************************************************************************************************************
task path: /home/tbrito/workspace/repos/starlingx/cgcs-root/stx/ansible-playbooks/playbookconfig/src/playbooks/roles/bootstrap/bringup-essential-services/tasks/refresh_local_dns.yml:68
<10.127.130.10> ESTABLISH SSH CONNECTION FOR USER: sysadmin
<10.127.130.10> SSH: EXEC sshpass -d10 ssh -C -o ControlMaster=auto -o ControlPersist=60s -o Port=11100 -o 'User="sysadmin"' -o ConnectTimeout=10 -o ControlPath=/home/tbrito/.ansible/cp/4ec5b081d9 10.127.130.10 '/bin/sh -c '"'"'( umask 77 && mkdir -p "` echo /tmp/.ansible-${USER}/tmp `"&& mkdir "` echo /tmp/.ansible-${USER}/tmp/ansible-tmp-1660670732.8196375-3698666-58724316334944 `" && echo ansible-tmp-1660670732.8196375-3698666-58724316334944="` echo /tmp/.ansible-${USER}/tmp/ansible-tmp-1660670732.8196375-3698666-58724316334944 `" ) && sleep 0'"'"''
<10.127.130.10> (0, b'ansible-tmp-1660670732.8196375-3698666-58724316334944=/tmp/.ansible-sysadmin/tmp/ansible-tmp-1660670732.8196375-3698666-58724316334944\n', b'')
Using module file /tmp/tbrito_ansible-playbookstox/venv/lib/python3.9/site-packages/ansible/modules/command.py
<10.127.130.10> PUT /home/tbrito/.ansible/tmp/ansible-local-36931609zu6qrfv/tmp3za7v_d7 TO /tmp/.ansible-sysadmin/tmp/ansible-tmp-1660670732.8196375-3698666-58724316334944/AnsiballZ_command.py
<10.127.130.10> SSH: EXEC sshpass -d10 sftp -o BatchMode=no -b - -C -o ControlMaster=auto -o ControlPersist=60s -o Port=11100 -o 'User="sysadmin"' -o ConnectTimeout=10 -o ControlPath=/home/tbrito/.ansible/cp/4ec5b081d9 '[10.127.130.10]'
<10.127.130.10> (0, b'sftp> put /home/tbrito/.ansible/tmp/ansible-local-36931609zu6qrfv/tmp3za7v_d7 /tmp/.ansible-sysadmin/tmp/ansible-tmp-1660670732.8196375-3698666-58724316334944/AnsiballZ_command.py\n', b'')
<10.127.130.10> ESTABLISH SSH CONNECTION FOR USER: sysadmin
<10.127.130.10> SSH: EXEC sshpass -d10 ssh -C -o ControlMaster=auto -o ControlPersist=60s -o Port=11100 -o 'User="sysadmin"' -o ConnectTimeout=10 -o ControlPath=/home/tbrito/.ansible/cp/4ec5b081d9 10.127.130.10 '/bin/sh -c '"'"'chmod u+x /tmp/.ansible-sysadmin/tmp/ansible-tmp-1660670732.8196375-3698666-58724316334944/ /tmp/.ansible-sysadmin/tmp/ansible-tmp-1660670732.8196375-3698666-58724316334944/AnsiballZ_command.py && sleep 0'"'"''
<10.127.130.10> (0, b'', b'')
<10.127.130.10> ESTABLISH SSH CONNECTION FOR USER: sysadmin
<10.127.130.10> SSH: EXEC sshpass -d10 ssh -C -o ControlMaster=auto -o ControlPersist=60s -o Port=11100 -o 'User="sysadmin"' -o ConnectTimeout=10 -o ControlPath=/home/tbrito/.ansible/cp/4ec5b081d9 -tt 10.127.130.10 '/bin/sh -c '"'"'sudo -H -S -p "[sudo via ansible, key=yepzauysuqlwbkqgkdudhuypckarrjzf] password:" -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-yepzauysuqlwbkqgkdudhuypckarrjzf ; /usr/bin/python /tmp/.ansible-sysadmin/tmp/ansible-tmp-1660670732.8196375-3698666-58724316334944/AnsiballZ_command.py'"'"'"'"'"'"'"'"' && sleep 0'"'"''
<10.127.130.10> ESTABLISH SSH CONNECTION FOR USER: sysadmin
<10.127.130.10> SSH: EXEC sshpass -d10 ssh -C -o ControlMaster=auto -o ControlPersist=60s -o Port=11100 -o 'User="sysadmin"' -o ConnectTimeout=10 -o ControlPath=/home/tbrito/.ansible/cp/4ec5b081d9 10.127.130.10 '/bin/sh -c '"'"'rm -f -r /tmp/.ansible-sysadmin/tmp/ansible-tmp-1660670732.8196375-3698666-58724316334944/ > /dev/null 2>&1 && sleep 0'"'"''
<10.127.130.10> (0, b'', b'')
fatal: [lab_vbox_1-debian]: FAILED! => {
    "msg": "Timeout (12s) waiting for privilege escalation prompt: "
}

PLAY RECAP ***************************************************************************************************************************************************************************************************************************************************************
lab_vbox_1-debian : ok=262 changed=87 unreachable=0 failed=1 skipped=277 rescued=0 ignored=0

### Running any sudo command on the target now takes a long time ###
sysadmin@controller-0:~$ time sudo su
sudo: unable to resolve host controller-0: Name or service not known

real 0m42,083s
user 0m0,007s
sys 0m0,018s

Test Activity
-------------
Developer Testing

Workaround
----------
- Delete /etc/platform/.restore_in_progress
- Recreate /etc/hosts
- Increase timeout
- Retry restore
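
The first two workaround steps could be scripted roughly as follows. This is a minimal sketch, not a supported tool: the function name, the `root` parameter (used so the logic can be tried against a scratch directory) and the two-line hosts content are assumptions, since the report does not say what the recreated /etc/hosts should contain.

```python
from pathlib import Path


def apply_workaround(root: Path, hostname: str, address: str = "127.0.0.1") -> None:
    """Remove the restore flag and recreate a minimal hosts file under root."""
    # Step 1: delete the flag that marks a restore as in progress.
    flag = root / "etc" / "platform" / ".restore_in_progress"
    flag.unlink(missing_ok=True)  # requires Python 3.8+

    # Step 2: recreate /etc/hosts. The entries below are an assumed minimum:
    # localhost plus a mapping for the controller's own hostname.
    hosts = root / "etc" / "hosts"
    hosts.parent.mkdir(parents=True, exist_ok=True)
    hosts.write_text(f"127.0.0.1\tlocalhost\n{address}\t{hostname}\n")
```

On a real target this would be called with root=Path("/") and root privileges. The "Increase timeout" step is not covered; with Ansible it is typically done through the connection timeout setting (for example `ansible-playbook -T <seconds>`), and the report does not state which value is sufficient.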

description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/853361
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/aa75882544d8759c20254f48e703c70050dded78
Submitter: "Zuul (22348)"
Branch: master

commit aa75882544d8759c20254f48e703c70050dded78
Author: Thiago Brito <email address hidden>
Date: Tue Aug 16 15:25:41 2022 -0300

    Fixing replacement of /etc/hosts timeouts

    During the restore operation, there is a span of time where we end up
    without the /etc/hosts file and, when executing the playbooks remotely,
    we reach the 12s timeout for command. This commit fixes it by ensuring
    that /etc/hosts is available and with the bare minimum entries at all
    times and also add a rescue operation to recover the previous /etc/hosts
    file in case any task fails.

    TEST PLAN
    PASS remote play of the restore playbook
    PASS local play of the restore playbook
    PASS bootstrap newly installed system

    Closes-Bug: #1986693
    Signed-off-by: Thiago Brito <email address hidden>
    Change-Id: Iff8e56478339f660ec66e2d4f7cd8ad000b4d306
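
The idea described in the commit message above (never leaving the file absent, and recovering the previous copy if a later step fails) can be illustrated with a generic replace-then-roll-back pattern. The sketch below is not the code from the ansible-playbooks change; the function name, the ".bak" suffix and the `validate` callback are hypothetical.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def replace_file_safely(target: Path, new_text: str,
                        validate: Callable[[Path], None]) -> bool:
    """Swap new_text into target atomically; roll back if validate raises."""
    backup = target.with_name(target.name + ".bak")
    had_original = target.exists()
    if had_original:
        shutil.copy2(target, backup)  # keep the previous content for rescue

    # Write the new content next to the target so the rename below stays
    # on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=target.name + ".")
    with os.fdopen(fd, "w") as tmp:
        tmp.write(new_text)
    os.chmod(tmp_name, 0o644)  # mkstemp creates the file with mode 0600

    # Rename over the target: readers see the old file or the new one,
    # never a missing file.
    os.replace(tmp_name, target)

    try:
        validate(target)
    except Exception:
        # Rescue path: put the previous file back if there was one.
        if had_original:
            os.replace(backup, target)
        return False
    return True  # the backup copy is left in place on success
```

The key design choice is to replace the file by renaming a fully written temporary copy over it rather than deleting and recreating it, which is the window in which the sudo prompt was observed to stall.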

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Thiago Paiva Brito (outbrito)
importance: Undecided → Medium
tags: added: stx.8.0 stx.update