IPv6 Distributed Cloud: ansible-playbook ‘Wipe ceph osds’ does not support re-play / re-entrance

Bug #1847336 reported by Peng Peng
This bug affects 1 person

Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Dan Voiculeasa

Bug Description

Brief Description
-----------------
To install the distributed cloud system controller, run the ansible-playbook bootstrap first, then run 'ansible-playbook lab-install-playbook.yaml -e "@local-install-overrides.yaml"'; the second playbook failed at the 'Wipe ceph osds' task.

Severity
--------
Major

Steps to Reproduce
------------------
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/bootstrap.yml
ansible-playbook lab-install-playbook.yaml -e "@local-install-overrides.yaml"

TC-name:

Expected Behavior
------------------
The ansible 'Wipe ceph osds' task should support re-running the playbook.

Actual Behavior
----------------
The 'Wipe ceph osds' task fails when the playbook is re-run.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
DC system
IPv6

Lab-name: DC WCP_90

Branch/Pull Time/Commit
-----------------------
19.10 master as of 2019-10-06_20-00-00

Last Pass
---------

Timestamp/Logs
--------------
TASK [bootstrap/validate-config : Wipe ceph osds] ***********************************************************************************************************************************************************
fatal: [localhost]: FAILED! => {"msg": "privilege output closed while waiting for password prompt:\n+ for f in '/dev/disk/by-path/*'\n+ '[' '!' -e /dev/disk/by-path/pci-0000:00:14.0-usb-0:12:1.0-scsi-0:0:0:0 ']'\n++ readlink -f /dev/disk/by-path/pci-0000:00:14.0-usb-0:12:1.0-scsi-0:0:0:0\n+ dev=/dev/sda\n+ grep -q 'TYPE=\"disk\"'\n+ lsblk --nodeps --pairs /dev/sda\n+ '[' 0 -ne 0 ']'\n+ set -e\n+ part_no=1\n+ ceph_disk=false\n++ awk '$1 == \"part\" {print $2}'\n++ lsblk -rip /dev/sda -o TYPE,NAME\n+ for part in '$(lsblk -rip ${dev} -o TYPE,NAME | awk '\\''$1 == \"part\" {print $2}'\\'')'\n++ flock /dev/sda sgdisk -i 1 /dev/sda\nInvalid partition data!\n++ true\n+ sgdisk_part_info=\n++ awk '{print $4;}'\n++ grep 'Partition GUID code'\n++ echo ''\n+ guid=\n+ '[' '' == 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D ']'\n+ '[' '' == 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 ']'\n+ part_no=2\n+ for part in '$(lsblk -rip ${dev} -o TYPE,NAME | awk '\\''$1 == \"part\" {print $2}'\\'')'\n++ flock /dev/sda sgdisk -i 2 /dev/sda\nInvalid partition data!\n++ true\n+ sgdisk_part_info=\n++ awk '{print $4;}'\n++ grep 'Partition GUID code'\n++ echo ''\n+ guid=\n+ '[' '' == 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D ']'\n+ '[' '' == 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 ']'\n+ part_no=3\n+ '[' false == true ']'\n+ set +e\n+ for f in '/dev/disk/by-path/*'\n+ '[' '!' -e /dev/disk/by-path/pci-0000:00:14.0-usb-0:12:1.0-scsi-0:0:0:0-part1 ']'\n++ readlink -f /dev/disk/by-path/pci-0000:00:14.0-usb-0:12:1.0-scsi-0:0:0:0-part1\n+ dev=/dev/sda1\n+ grep -q 'TYPE=\"disk\"'\n+ lsblk --nodeps --pairs /dev/sda1\n+ '[' 1 -ne 0 ']'\n+ continue\n+ for f in '/dev/disk/by-path/*'\n+ '[' '!' -e /dev/disk/by-path/pci-0000:00:14.0-usb-0:12:1.0-scsi-0:0:0:0-part2 ']'\n++ readlink -f /dev/disk/by-path/pci-0000:00:14.0-usb-0:12:1.0-scsi-0:0:0:0-part2\n+ dev=/dev/sda2\n+ grep -q 'TYPE=\"disk\"'\n+ lsblk --nodeps --pairs /dev/sda2\n+ '[' 1 -ne 0 ']'\n+ continue\n+ for f in '/dev/disk/by-path/*'\n+ '[' '!' 
-e /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0 ']'\n++ readlink -f /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0\n+ dev=/dev/sdb\n+ grep -q 'TYPE=\"disk\"'\n+ lsblk --nodeps --pairs /dev/sdb\n+ '[' 0 -ne 0 ']'\n+ set -e\n+ part_no=1\n+ ceph_disk=false\n++ awk '$1 == \"part\" {print $2}'\n++ lsblk -rip /dev/sdb -o TYPE,NAME\n+ for part in '$(lsblk -rip ${dev} -o TYPE,NAME | awk '\\''$1 == \"part\" {print $2}'\\'')'\n++ flock /dev/sdb sgdisk -i 1 /dev/sdb\n+ sgdisk_part_info='Partition GUID code: 21686148-6449-6E6F-744E-656564454649 (BIOS boot partition)\nPartition unique GUID: 4C20D594-3538-4890-A8FC-49B897430427\nFirst sector: 2048 (at 1024.0 KiB)\nLast sector: 4095 (at 2.0 MiB)\nPartition size: 2048 sectors (1024.0 KiB)\nAttribute flags: 0000000000000000\nPartition name: '\\'''\\'''\n++ awk '{print $4;}'\n++ grep 'Partition GUID code'\n++ echo 'Partition GUID code: 21686148-6449-6E6F-744E-656564454649 (BIOS boot partition)\nPartition unique GUID: 4C20D594-3538-4890-A8FC-49B897430427\nFirst sector: 2048 (at 1024.0 KiB)\nLast sector: 4095 (at 2.0 MiB)\nPartition size: 2048 sectors (1024.0 KiB)\nAttribute flags: 0000000000000000\nPartition name: '\\'''\\'''\n+ guid=21686148-6449-6E6F-744E-656564454649\n+ '[' 21686148-6449-6E6F-744E-656564454649 == 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D ']'\n+ '[' 21686148-6449-6E6F-744E-656564454649 == 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 ']'\n+ part_no=2\n+ for part in '$(lsblk -rip ${dev} -o TYPE,NAME | awk '\\''$1 == \"part\" {print $2}'\\'')'\n++ flock /dev/sdb sgdisk -i 2 /dev/sdb\n+ sgdisk_part_info='Partition GUID code: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7 (Microsoft basic data)\nPartition unique GUID: C4279531-45D9-47B0-ADA7-F592B3781FE5\nFirst sector: 4096 (at 2.0 MiB)\nLast sector: 1028095 (at 502.0 MiB)\nPartition size: 1024000 sectors (500.0 MiB)\nAttribute flags: 0000000000000000\nPartition name: '\\'''\\'''\n++ awk '{print $4;}'\n++ grep 'Partition GUID code'\n++ echo 'Partition GUID code: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7 (Microsoft basic data)\nPartition unique GUID: C4279531-45D9-47B0-ADA7-F592B3781FE5\nFirst sector: 4096 (at 2.0 MiB)\nLast sector: 1028095 (at 502.0 MiB)\nPartition size: 1024000 sectors (500.0 MiB)\nAttribute flags: 0000000000000000\nPartition name: '\\'''\\'''\n+ guid=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7\n+ '[' EBD0A0A2-B9E5-4433-87C0-68B6B72699C7 == 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D ']'\n+ '[' EBD0A0A2-B9E5-4433-87C0-68B6B72699C7 == 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 ']'\n+ part_no=3\n+ for part in '$(lsblk -rip ${dev} -o TYPE,NAME | awk '\\''$1 == \"part\" {print $2}'\\'')'\n++ flock /dev/sdb sgdisk -i 3 /dev/sdb\n+ sgdisk_part_info='Partition GUID code: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7 (Microsoft basic data)\nPartition unique GUID: 403088A0-2690-4946-8578-CDD58DCA364A\nFirst sector: 1028096 (at 502.0 MiB)\nLast sector: 41988095 (at 20.0 GiB)\nPartition size: 40960000 sectors (19.5 GiB)\nAttribute flags: 0000000000000000\nPartition name: '\\'''\\'''\n++ awk '{print $4;}'\n++ grep 'Partition GUID code'\n++ echo 'Partition GUID code: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7 (Microsoft basic data)\nPartition unique GUID: 403088A0-2690-4946-8578-CDD58DCA364A\nFirst sector: 1028096 (at 502.0 MiB)\nLast sector: 41988095 (at 20.0 GiB)\nPartition size: 40960000 sectors (19.5 GiB)\nAttribute flags: 0000000000000000\nPartition name: '\\'''\\'''\n+ guid=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7\n+ '[' EBD0A0A2-B9E5-4433-87C0-68B6B72699C7 == 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D ']'\n+ '[' EBD0A0A2-B9E5-4433-87C0-68B6B72699C7 == 
45B0969E-9B03-4F30-B4C6-B4B80CEFF106 ']'\n+ part_no=4\n+ for part in '$(lsblk -rip ${dev} -o TYPE,NAME | awk '\\''$1 == \"part\" {print $2}'\\'')'\n++ flock /dev/sdb sgdisk -i 4 /dev/sdb\n+ sgdisk_part_info='Partition GUID code: E6D6D379-F507-44C2-A23C-238F2A3DF928 (Linux LVM)\nPartition unique GUID: 219FE0A2-734D-45C3-998E-7B5A5B359431\nFirst sector: 41988096 (at 20.0 GiB)\nLast sector: 522235903 (at 249.0 GiB)\nPartition size: 480247808 sectors (229.0 GiB)\nAttribute flags: 0000000000000000\nPartition name: '\\'''\\'''\n++ awk '{print $4;}'\n++ grep 'Partition GUID code'\n++ echo 'Partition GUID code: E6D6D379-F507-44C2-A23C-238F2A3DF928 (Linux LVM)\nPartition unique GUID: 219FE0A2-734D-45C3-998E-7B5A5B359431\nFirst sector: 41988096 (at 20.0 GiB)\nLast sector: 522235903 (at 249.0 GiB)\nPartition size: 480247808 sectors (229.0 GiB)\nAttribute flags: 0000000000000000\nPartition name: '\\'''\\'''\n+ guid=E6D6D379-F507-44C2-A23C-238F2A3DF928\n+ '[' E6D6D379-F507-44C2-A23C-238F2A3DF928 == 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D ']'\n+ '[' E6D6D379-F507-44C2-A23C-238F2A3DF928 == 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 ']'\n+ part_no=5\n+ '[' false == true ']'\n+ set +e\n+ for f in '/dev/disk/by-path/*'\n+ '[' '!' -e /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0-part1 ']'\n++ readlink -f /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0-part1\n+ dev=/dev/sdb1\n+ grep -q 'TYPE=\"disk\"'\n+ lsblk --nodeps --pairs /dev/sdb1\n+ '[' 1 -ne 0 ']'\n+ continue\n+ for f in '/dev/disk/by-path/*'\n+ '[' '!' -e /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0-part2 ']'\n++ readlink -f /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0-part2\n+ dev=/dev/sdb2\n+ grep -q 'TYPE=\"disk\"'\n+ lsblk --nodeps --pairs /dev/sdb2\n+ '[' 1 -ne 0 ']'\n+ continue\n+ for f in '/dev/disk/by-path/*'\n+ '[' '!' -e /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0-part3 ']'\n++ readlink -f /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0-part3\n+ dev=/dev/sdb3\n+ grep -q 'TYPE=\"disk\"'\n+ lsblk --nodeps --pairs /dev/sdb3\n+ '[' 1 -ne 0 ']'\n+ continue\n+ for f in '/dev/disk/by-path/*'\n+ '[' '!' -e /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0-part4 ']'\n++ readlink -f /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0-part4\n+ dev=/dev/sdb4\n+ grep -q 'TYPE=\"disk\"'\n+ lsblk --nodeps --pairs /dev/sdb4\n+ '[' 1 -ne 0 ']'\n+ continue\n+ for f in '/dev/disk/by-path/*'\n+ '[' '!' -e /dev/disk/by-path/pci-0000:00:1f.2-ata-2.0 ']'\n++ readlink -f /dev/disk/by-path/pci-0000:00:1f.2-ata-2.0\n+ dev=/dev/sdc\n+ grep -q 'TYPE=\"disk\"'\n+ lsblk --nodeps --pairs /dev/sdc\n+ '[' 0 -ne 0 ']'\n+ set -e\n+ part_no=1\n+ ceph_disk=false\n++ awk '$1 == \"part\" {print $2}'\n++ lsblk -rip /dev/sdc -o TYPE,NAME\n+ '[' false == true ']'\n+ set +e\n+ for f in '/dev/disk/by-path/*'\n+ '[' '!' -e /dev/disk/by-path/pci-0000:00:1f.2-ata-3.0 ']'\n++ readlink -f /dev/disk/by-path/pci-0000:00:1f.2-ata-3.0\n+ dev=/dev/sdd\n+ grep -q 'TYPE=\"disk\"'\n+ lsblk --nodeps --pairs /dev/sdd\n+ '[' 0 -ne 0 ']'\n+ set -e\n+ part_no=1\n+ ceph_disk=false\n++ awk '$1 == \"part\" {print $2}'\n++ lsblk -rip /dev/sdd -o TYPE,NAME\n+ '[' false == true ']'\n+ set +e\n"}

PLAY RECAP **************************************************************************************************************************************************************************************************
localhost : ok=88 changed=12 unreachable=0 failed=1

Test Activity
-------------
installation

Revision history for this message
Peng Peng (ppeng) wrote :
Yang Liu (yliu12)
tags: added: stx.retestneeded
Changed in starlingx:
assignee: nobody → Ovidiu Poncea (ovidiu.poncea)
Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

I ran ansible-playbook /usr/share/ansible/stx-ansible/playbooks/bootstrap.yml 3 times in a row and it does not reproduce, so re-play in a normal deployment works.

Peng, a few questions:
1. Can you upload lab-install-playbook.yaml and local-install-overrides.yaml? I want to try with them as well.

2. Is this 100% reproducible?

3. Was this run locally? It seems so, but I need confirmation.

================================

Looking over the above logs, it seems that the wipe-ceph-osds script executes correctly, yet the task throws this: "privilege output closed while waiting for password prompt:"

The error happens here:

- block:
  - name: Wipe ceph osds
    script: wipe_osds.sh
    register: results
    become: yes
    become_user: root

Which is called from here:
    - { role: bootstrap/validate-config, when: not skip_play, become: yes }

Notice that both have become: yes, which means that ansible escalates to root privileges twice.

Now, searching for similar issues, I stumbled upon a 2016 Ansible bug, https://github.com/ansible/ansible/issues/13763, which has an identical error. Although that bug got fixed and the condition is different, one comment is important:

"I am going to guess that perhaps this is because on attempt one, sudo password is asked, and by attempt 2 sudo doesnt need a password. Thus the command completes but looks like an error and ansible stops processing the rest"

This seems to be the case here: we are asking for root privileges twice. That shouldn't be a problem, but it is... which could point to an Ansible issue.

For more details I need to reproduce it. The fix may be as simple as removing
    become: yes
    become_user: root
from the code or as bad as fixing Ansible itself.
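
For illustration, a minimal sketch of that simpler option, assuming the role-level become: yes shown above stays in place (this matches the shape of the fix proposed further down in this thread, https://review.opendev.org/689730):

- block:
  - name: Wipe ceph osds
    # become/become_user dropped here: the whole bootstrap/validate-config
    # role is already invoked with become: yes, so the task runs as root
    # without a second privilege escalation.
    script: wipe_osds.sh
    register: results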

Ovidiu

Revision history for this message
Peng Peng (ppeng) wrote :

I ran it twice. Both runs hit the same issue.

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / medium priority - this affects the ability to re-run ansible playbooks, which is a supported operation.

tags: added: stx.3.0 stx.config
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.storage
Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

Retried with lab-install-playbook.yaml and it does not reproduce. I need the setup where it reproduces; when you get time, please retest and ping me once it reproduces.

Thank you!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/689730

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/689730
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=7b9d8c9a0d828ad3621596923301fe7a9ce4bb8d
Submitter: Zuul
Branch: master

commit 7b9d8c9a0d828ad3621596923301fe7a9ce4bb8d
Author: Ovidiu Poncea <email address hidden>
Date: Mon Oct 21 15:02:56 2019 +0300

    B&R: Remove unnecessary privilege escalation

    The whole role gets privilege escalation, no need to do it
    one more time for wipe_ceph_osds.

    Change-Id: Ib2a6b90934c1f8ac0178e2bf0e7af9203d00751c
    Closes-Bug: 1847336
    Signed-off-by: Ovidiu Poncea <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

The same issue was not found in a recent DC installation.

tags: removed: stx.retestneeded
Revision history for this message
Yosief Gebremariam (ygebrema) wrote :

This issue still appears to exist in the distributed cloud lab on replay. Logs are attached.

Revision history for this message
Frank Miller (sensfan22) wrote :

Changing state from Fix Released back to Triaged as the issue has occurred again in a load that includes this commit: https://review.opendev.org/689730

Additional investigation is required. Will request access to the lab where the issue occurred in order to debug, since Ovidiu and Tyler cannot reproduce it in their lab.

Changed in starlingx:
status: Fix Released → Triaged
Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
Austin Sun (sunausti) wrote :

Hi, Yang:
    has this issue been reproduced recently?

Thanks.
BR
Austin Sun.

Frank Miller (sensfan22)
Changed in starlingx:
assignee: Ovidiu Poncea (ovidiu.poncea) → Stefan Dinescu (stefandinescu)
Revision history for this message
Yosief Gebremariam (ygebrema) wrote :

I have attempted to reproduce the issue a couple of times, but in both cases a replay was successful. Initially, I failed the bootstrap by changing the registry names. The initial failure log was:
TASK [common/push-docker-images : Download images and push to local registry] *************************************************************************************************************************************************************************
fatal: [localhost]: FAILED! => {"changed": true, "msg": "non-zero return code", "rc": 1, "stderr": "Traceback (most recent call last):\n File \"/tmp/.ansible-sysadmin/tmp/ansible-tmp-1575920855.86-196888930696435/download_images.py\", line 117, in <module>\n raise Exception(\"Failed to download images %s\" % failed_downloads)\nException: Failed to download images [u'tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/iuay.io/calico/cni:v3.6.2', u'tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/iuay.io/calico/node:v3.6.2', u'tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/iuay.io/calico/kube-controllers:v3.6.2', u'tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/igcr.io/kubernetes-helm/tiller:v2.13.1', u'tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/iuay.io/airshipit/armada:8a1638098f88d92bf799ef4934abe569789b885e-ubuntu_bionic']\n", "stderr_lines": ["Traceback (most recent call last):", " File \"/tmp/.ansible-sysadmin/tmp/ansible-tmp-1575920855.86-196888930696435/download_images.py\", line 117, in <module>", " raise Exception(\"Failed to download images %s\" % failed_downloads)", "Exception: Failed to download images [u'tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/iuay.io/calico/cni:v3.6.2', u'tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/iuay.io/calico/node:v3.6.2', u'tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/iuay.io/calico/kube-controllers:v3.6.2', u'tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/igcr.io/kubernetes-helm/tiller:v2.13.1', u'tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/iuay.io/airshipit/armada:8a1638098f88d92bf799ef4934abe569789b885e-ubuntu_bionic']"], "stdout": " Image download failed: tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/k8s.gcr.io/kube-apiserver:v1.16.2500 Server Error: Internal Server Error (\"Get https://tis-lab-registry.cumulus.wrs.com:9001/v2/: dial tcp: lookup tis-lab-registry.cumulus.wrs.com on [2620:10a:a001:a103::2]:53: server misbehaving\")\nSleep 20s before retry downloading image tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/k8s.gcr.io/kube-apiserver:v1.16.2 ...\n Image download failed: tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/k8s.gcr.io/kube-apiserver:v1.16.2500 Server Error: Internal Server Error (\"Get https://tis-lab-registry.cumulus.wrs.com:9001/v2/: dial tcp: lookup tis-lab-registry.cumulus.wrs.com on [2620:10a:a001:a103::2]:53: server misbehaving\")\nSleep 20s before retry downloading image tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/k8s.gcr.io/kube-apiserver:v1.16.2 ...\nImage download succeeded: tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/k8s.gcr.io/kube-apiserver:v1.16.2\nImage push succeeded: registry.local:9001/k8s.gcr.io/kube-apiserver:v1.16.2\nImage download succeeded: tis-...

Revision history for this message
Austin Sun (sunausti) wrote :

Thanks Yosief:
      does this mean we can close this issue now?

Revision history for this message
Stefan Dinescu (stefandinescu) wrote :

It seems this issue was specific to that particular lab and could not be reproduced in any other lab. It also seems that, at the moment, it cannot be reproduced in that lab either.

I think this issue should be closed for now.

Changed in starlingx:
status: Triaged → Invalid
Peng Peng (ppeng)
tags: removed: stx.retestneeded
Revision history for this message
Sabeel Ansari (sansariwr) wrote :

Observed this problem on a VM when trying a bootstrap replay before controller-0 was unlocked.

The steps to reproduce this issue are pretty simple and straightforward.

1. Create an AIO-SX VM using the VDM with the last step, 'unlock_controller_0', commented out. Environment: Windows & Cygwin running VirtualBox 6.1.
2. Once the installation is completed, run the following command:

ansible-playbook /usr/share/ansible/stx-ansible/playbooks/bootstrap.yml

The following error is seen on bootstrap replay:

TASK [bootstrap/validate-config : Check for ssl_ca_cert file] ********************************************************************************
TASK [bootstrap/validate-config : Check OpenID Connect parameters] ***************************************************************************
TASK [bootstrap/validate-config : Check for Ceph data wipe flag] *****************************************************************************
TASK [bootstrap/validate-config : Wipe ceph osds] ********************************************************************************************fatal: [localhost]: FAILED! => {"msg": "privilege output closed while waiting for password prompt:\n+ for f in '/dev/disk/by-path/*'\n+ '[' '!' -e /dev/disk/by-path/pci-0000:00:01.1-ata-2.0 ']'\n++ readlink -f /dev/disk/by-path/pci-0000:00:01.1-ata-2.0\n+ dev=/dev/sr0\n+ grep -q 'TYPE=\"disk\"'\n+ lsblk --nodeps --pairs /dev/sr0\n+ '[' 1 -ne 0 ']'\n+ continue\n+ for f in '/dev/disk/by-path/*'\n+ '[' '!' -e /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0 ']'\n++ readlink -f /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0\n+ dev=/dev/sda\n+ grep -q 'TYPE=\"disk\"'\n+ lsblk --nodeps --pairs /dev/sda\n+ '[' 0 -ne 0 ']'\n+ set -e\n+ part_no=1\n+ ceph_disk=false\n++ awk '$1 == \"part\" {print $2}'\n++ lsblk -rip /dev/sda -o TYPE,NAME\n+ for part in '$(lsblk -rip ${dev} -o TYPE,NAME | awk '\\''$1 == \"part\" {print $2}'\\'')'\n++ flock /dev/sda sgdisk -i 1 /dev/sda\n+ sgdisk_part_info='Partition GUID code: 21686148-6449-6E6F-744E-656564454649 (BIOS boot partition)\nPartition unique GUID: 13E4D3EB-3CC3-4F95-A700-5920B722C26B\nFirst sector: 2048 (at 1024.0 KiB)\nLast sector: 4095 (at 2.0 MiB)\nPartition size: 2048 sectors (1024.0 KiB)\nAttribute flags: 0000000000000000\nPartition name: '\\''primary'\\'''\n++ grep 'Partition GUID code'\n++ awk '{print $4;}'\n++ echo 'Partition GUID code: 21686148-6449-6E6F-744E-656564454649 (BIOS boot partition)\nPartition unique GUID: 13E4D3EB-3CC3-4F95-A700-5920B722C26B\nFirst sector: 2048 (at 1024.0 KiB)\nLast sector: 4095 (at 2.0 MiB)\nPartition size: 2048 sectors (1024.0 KiB)\nAttribute flags: 0000000000000000\nPartition name: '\\''primary'\\'''\n+ guid=21686148-6449-6E6F-744E-656564454649\n+ '[' 21686148-6449-6E6F-744E-656564454649 == 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D ']'\n+ '[' 21686148-6449-6E6F-744E-656564454649 == 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 ']'\n+ part_no=2\n+ for part in '$(lsblk -rip ${dev} -o TYPE,NAME | awk '\\''$1 == \"part\" {print $2}'\\'')'\n++ flock /dev/sda sgdisk -i 2 /dev/sda\n+ sgdisk_part_info='Partition GUID code: BA5EBA11-0000-1111-2222-000000000002 (Unknown)\nPartition unique GUID: 82174DF2-2989-41A4-BC88-CB8FF9987B9E\nFirst sector: 4...

Revision history for this message
Dan Voiculeasa (dvoicule) wrote :

Steps to reproduce:
1) Either with automation or manually: install controller-0, run bootstrap, stop here.
2)Run `ansible-playbook /usr/share/ansible/stx-ansible/playbooks/bootstrap.yml`

Notes:
1) `sudo ls` asks for the password the first time. Then, for some time, you can `sudo whatever` without being asked for a password. After a while it asks for the password once more, and then for some time it doesn't.
2) The exact opposite happens when running the playbook: the first run is OK, then for some time reruns show the error. Then one more run passes, and then subsequent reruns show the error again.

3) The return code is always 0; adding `failed_when: results.rc != 0` doesn't help because the error takes precedence before the rc is interpreted.

4) It only happens if the called script writes something to stderr.
 - For example, I called a script containing `echo 123 >&2` right before the script where we're seeing the issue. The new script then failed with the same error.
   After removing the write to stderr, it went past the new script and hit the error on the wipe_osds.sh script call.
   I also did some experimenting with set -x.

 - wipe_osds.sh uses set -x, and set -x outputs each interpreted line to stderr.

5) ansible does: EXEC /bin/sh -c 'sudo -H -S -p "[sudo via ansible, key=jmesjrbeakbdiytifqllayafvphllbbc] password: " -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-jmesjrbeakbdiytifqllayafvphllbbc; /tmp/.ansible-sysadmin/tmp/ansible-tmp-1591231055.86-255198310786494/wipe_osds.sh'"'"' && sleep 0'

All of the above makes me think the issue is internal to Ansible and related to the script module.

Revision history for this message
Dan Voiculeasa (dvoicule) wrote :

One way to work around this is to remove `set -x` from wipe_osds.sh, but I'm not sure we want that; we still need that information for debugging purposes. Maybe find a way to output directly to stdout.

`Not writing to stderr if there is no error` applies to all scripts called with the `script` ansible module that run after this one.
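
As a sketch of that last idea: the script module's command line goes through /bin/sh -c on the target (see the EXEC line above), so a 2>&1 redirect can be appended to the task's free-form arguments; this is the same form shown to pass in the test play in the next comment. The task and script names are the ones from this bug, and whether exposing the trace in results.stdout is acceptable is a separate question.

  - name: Wipe ceph osds
    # The appended 2>&1 is honored because the command is run through
    # /bin/sh -c on the target; the set -x trace then goes to stdout, so
    # nothing is written to stderr while ansible waits on the become prompt.
    script: wipe_osds.sh 2>&1
    register: results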

Revision history for this message
Dan Voiculeasa (dvoicule) wrote :

ansible-playbook /home/sysadmin/play1.yml -vvvvvvvvvv

controller-0:~$ cat play1.yml
- hosts: all
  gather_facts: no
  become: yes
  vars:
    ansible_become_pass: EDITED

  tasks:
    - name: TESTB
      script: b.sh 2>&1 # or without 2>&1

# SAME thing happens with both b.sh
controller-0:~$ cat b.sh
set -x
echo 123153345

controller-0:~$ cat b.sh
echo 123153345 >&2

Results:
OK: script: b.sh 2>&1
FAIL: script: b.sh

But this only happens after the StarlingX ansible bootstrap was successfully/partially run. It remains to be determined which step of the ansible bootstrap needs to be reached and executed for subsequent replays to fail.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Re-opening as the issue is seen again in stx.4.0

Changed in starlingx:
status: Invalid → Confirmed
tags: added: stx.4.0
removed: stx.3.0
Frank Miller (sensfan22)
Changed in starlingx:
assignee: Stefan Dinescu (stefandinescu) → Dan Voiculeasa (dvoicule)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/736975

Changed in starlingx:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/736975
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=cd5fc0deb4649031eee00bca61f59e8f394c3b51
Submitter: Zuul
Branch: master

commit cd5fc0deb4649031eee00bca61f59e8f394c3b51
Author: Dan Voiculeasa <email address hidden>
Date: Fri Jun 19 13:39:00 2020 +0300

    Insert sleep into wipe_osds.sh script

    Become mechanism has issues with correctly detecting the BECOME-SUCCESS
    message. It happens when tasks quick to produce output.
    Example: calling bash scripts using `script` module.

    The python scripts seem not to be affected by this due to overhead of
    loading.
    Command and shell modules will call a python process to execute payload.

    Workaround: insert sleep. sleep 0 will not suffice.
    Testing with sleep 0 the issue reproduced 1/100.

    Created an upstream issue
    https://github.com/ansible/ansible/issues/70092

    Closes-Bug: 1847336
    Change-Id: Ibbba9c858b0dde535bfce753369a0f26159db8d8
    Signed-off-by: Dan Voiculeasa <email address hidden>
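
As an aside, the commit message's note that the command and shell modules execute their payload through a Python wrapper suggests another possible mitigation, sketched below. This is not the merged fix (which adds a sleep to wipe_osds.sh), and it assumes the script has already been copied to the target; /tmp/wipe_osds.sh is purely an illustrative path.

  - name: Wipe ceph osds
    # Alternative sketch, not the merged fix: the shell module runs its
    # payload through Ansible's Python wrapper, whose start-up overhead
    # avoids the fast-output race in BECOME-SUCCESS detection described above.
    # Assumes wipe_osds.sh was placed on the target beforehand (illustrative path).
    shell: /bin/bash /tmp/wipe_osds.sh
    register: results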

Changed in starlingx:
status: In Progress → Fix Released