IPv6 Distributed Cloud: ansible-playbook ‘Wipe ceph osds’ does not support re-play / re-entrance

Bug #1847336 reported by Peng Peng
This bug affects 1 person

Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Dan Voiculeasa

Bug Description

Brief Description
-----------------
To install the distributed cloud system controller, run the ansible-playbook bootstrap first, then run 'ansible-playbook lab-install-playbook.yaml -e "@local-install-overrides.yaml"'; the second playbook failed at the 'Wipe ceph osds' task.

Severity
--------
Major

Steps to Reproduce
------------------
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/bootstrap.yml
ansible-playbook lab-install-playbook.yaml -e "@local-install-overrides.yaml"

TC-name:

Expected Behavior
------------------
The ansible 'Wipe ceph osds' task should support re-running the playbook.

Actual Behavior
----------------
The 'Wipe ceph osds' task fails when the playbook is re-run.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
DC system
IPv6

Lab-name: DC WCP_90

Branch/Pull Time/Commit
-----------------------
19.10 master as of 2019-10-06_20-00-00

Last Pass
---------

Timestamp/Logs
--------------
TASK [bootstrap/validate-config : Wipe ceph osds] ***********************************************************************************************************************************************************
fatal: [localhost]: FAILED! => {"msg": "privilege output closed while waiting for password prompt:\n+ for f in '/dev/disk/by-path/*'\n+ '[' '!' -e /dev/disk/by-path/pci-0000:00:14.0-usb-0:12:1.0-scsi-0:0:0:0 ']'\n++ readlink -f /dev/disk/by-path/pci-0000:00:14.0-usb-0:12:1.0-scsi-0:0:0:0\n+ dev=/dev/sda\n+ grep -q 'TYPE=\"disk\"'\n+ lsblk --nodeps --pairs /dev/sda\n+ '[' 0 -ne 0 ']'\n+ set -e\n+ part_no=1\n+ ceph_disk=false\n++ awk '$1 == \"part\" {print $2}'\n++ lsblk -rip /dev/sda -o TYPE,NAME\n+ for part in '$(lsblk -rip ${dev} -o TYPE,NAME | awk '\\''$1 == \"part\" {print $2}'\\'')'\n++ flock /dev/sda sgdisk -i 1 /dev/sda\nInvalid partition data!\n++ true\n+ sgdisk_part_info=\n++ awk '{print $4;}'\n++ grep 'Partition GUID code'\n++ echo ''\n+ guid=\n+ '[' '' == 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D ']'\n+ '[' '' == 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 ']'\n+ part_no=2\n+ for part in '$(lsblk -rip ${dev} -o TYPE,NAME | awk '\\''$1 == \"part\" {print $2}'\\'')'\n++ flock /dev/sda sgdisk -i 2 /dev/sda\nInvalid partition data!\n++ true\n+ sgdisk_part_info=\n++ awk '{print $4;}'\n++ grep 'Partition GUID code'\n++ echo ''\n+ guid=\n+ '[' '' == 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D ']'\n+ '[' '' == 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 ']'\n+ part_no=3\n+ '[' false == true ']'\n+ set +e\n+ for f in '/dev/disk/by-path/*'\n+ '[' '!' -e /dev/disk/by-path/pci-0000:00:14.0-usb-0:12:1.0-scsi-0:0:0:0-part1 ']'\n++ readlink -f /dev/disk/by-path/pci-0000:00:14.0-usb-0:12:1.0-scsi-0:0:0:0-part1\n+ dev=/dev/sda1\n+ grep -q 'TYPE=\"disk\"'\n+ lsblk --nodeps --pairs /dev/sda1\n+ '[' 1 -ne 0 ']'\n+ continue\n+ for f in '/dev/disk/by-path/*'\n+ '[' '!' -e /dev/disk/by-path/pci-0000:00:14.0-usb-0:12:1.0-scsi-0:0:0:0-part2 ']'\n++ readlink -f /dev/disk/by-path/pci-0000:00:14.0-usb-0:12:1.0-scsi-0:0:0:0-part2\n+ dev=/dev/sda2\n+ grep -q 'TYPE=\"disk\"'\n+ lsblk --nodeps --pairs /dev/sda2\n+ '[' 1 -ne 0 ']'\n+ continue\n+ for f in '/dev/disk/by-path/*'\n+ '[' '!' 
-e /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0 ']'\n++ readlink -f /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0\n+ dev=/dev/sdb\n+ grep -q 'TYPE=\"disk\"'\n+ lsblk --nodeps --pairs /dev/sdb\n+ '[' 0 -ne 0 ']'\n+ set -e\n+ part_no=1\n+ ceph_disk=false\n++ awk '$1 == \"part\" {print $2}'\n++ lsblk -rip /dev/sdb -o TYPE,NAME\n+ for part in '$(lsblk -rip ${dev} -o TYPE,NAME | awk '\\''$1 == \"part\" {print $2}'\\'')'\n++ flock /dev/sdb sgdisk -i 1 /dev/sdb\n+ sgdisk_part_info='Partition GUID code: 21686148-6449-6E6F-744E-656564454649 (BIOS boot partition)\nPartition unique GUID: 4C20D594-3538-4890-A8FC-49B897430427\nFirst sector: 2048 (at 1024.0 KiB)\nLast sector: 4095 (at 2.0 MiB)\nPartition size: 2048 sectors (1024.0 KiB)\nAttribute flags: 0000000000000000\nPartition name: '\\'''\\'''\n++ awk '{print $4;}'\n++ grep 'Partition GUID code'\n++ echo 'Partition GUID code: 21686148-6449-6E6F-744E-656564454649 (BIOS boot partition)\nPartition unique GUID: 4C20D594-3538-4890-A8FC-49B897430427\nFirst sector: 2048 (at 1024.0 KiB)\nLast sector: 4095 (at 2.0 MiB)\nPartition size: 2048 sectors (1024.0 KiB)\nAttribute flags: 0000000000000000\nPartition name: '\\'''\\'''\n+ guid=21686148-6449-6E6F-744E-656564454649\n+ '[' 21686148-6449-6E6F-744E-656564454649 == 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D ']'\n+ '[' 21686148-6449-6E6F-744E-656564454649 == 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 ']'\n+ part_no=2\n+ for part in '$(lsblk -rip ${dev} -o TYPE,NAME | awk '\\''$1 == \"part\" {print $2}'\\'')'\n++ flock /dev/sdb sgdisk -i 2 /dev/sdb\n+ sgdisk_part_info='Partition GUID code: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7 (Microsoft basic data)\nPartition unique GUID: C4279531-45D9-47B0-ADA7-F592B3781FE5\nFirst sector: 4096 (at 2.0 MiB)\nLast sector: 1028095 (at 502.0 MiB)\nPartition size: 1024000 sectors (500.0 MiB)\nAttribute flags: 0000000000000000\nPartition name: '\\'''\\'''\n++ awk '{print $4;}'\n++ grep 'Partition GUID code'\n++ echo 'Partition GUID code: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7 (Microsoft basic data)\nPartition unique GUID: C4279531-45D9-47B0-ADA7-F592B3781FE5\nFirst sector: 4096 (at 2.0 MiB)\nLast sector: 1028095 (at 502.0 MiB)\nPartition size: 1024000 sectors (500.0 MiB)\nAttribute flags: 0000000000000000\nPartition name: '\\'''\\'''\n+ guid=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7\n+ '[' EBD0A0A2-B9E5-4433-87C0-68B6B72699C7 == 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D ']'\n+ '[' EBD0A0A2-B9E5-4433-87C0-68B6B72699C7 == 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 ']'\n+ part_no=3\n+ for part in '$(lsblk -rip ${dev} -o TYPE,NAME | awk '\\''$1 == \"part\" {print $2}'\\'')'\n++ flock /dev/sdb sgdisk -i 3 /dev/sdb\n+ sgdisk_part_info='Partition GUID code: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7 (Microsoft basic data)\nPartition unique GUID: 403088A0-2690-4946-8578-CDD58DCA364A\nFirst sector: 1028096 (at 502.0 MiB)\nLast sector: 41988095 (at 20.0 GiB)\nPartition size: 40960000 sectors (19.5 GiB)\nAttribute flags: 0000000000000000\nPartition name: '\\'''\\'''\n++ awk '{print $4;}'\n++ grep 'Partition GUID code'\n++ echo 'Partition GUID code: EBD0A0A2-B9E5-4433-87C0-68B6B72699C7 (Microsoft basic data)\nPartition unique GUID: 403088A0-2690-4946-8578-CDD58DCA364A\nFirst sector: 1028096 (at 502.0 MiB)\nLast sector: 41988095 (at 20.0 GiB)\nPartition size: 40960000 sectors (19.5 GiB)\nAttribute flags: 0000000000000000\nPartition name: '\\'''\\'''\n+ guid=EBD0A0A2-B9E5-4433-87C0-68B6B72699C7\n+ '[' EBD0A0A2-B9E5-4433-87C0-68B6B72699C7 == 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D ']'\n+ '[' EBD0A0A2-B9E5-4433-87C0-68B6B72699C7 == 
45B0969E-9B03-4F30-B4C6-B4B80CEFF106 ']'\n+ part_no=4\n+ for part in '$(lsblk -rip ${dev} -o TYPE,NAME | awk '\\''$1 == \"part\" {print $2}'\\'')'\n++ flock /dev/sdb sgdisk -i 4 /dev/sdb\n+ sgdisk_part_info='Partition GUID code: E6D6D379-F507-44C2-A23C-238F2A3DF928 (Linux LVM)\nPartition unique GUID: 219FE0A2-734D-45C3-998E-7B5A5B359431\nFirst sector: 41988096 (at 20.0 GiB)\nLast sector: 522235903 (at 249.0 GiB)\nPartition size: 480247808 sectors (229.0 GiB)\nAttribute flags: 0000000000000000\nPartition name: '\\'''\\'''\n++ awk '{print $4;}'\n++ grep 'Partition GUID code'\n++ echo 'Partition GUID code: E6D6D379-F507-44C2-A23C-238F2A3DF928 (Linux LVM)\nPartition unique GUID: 219FE0A2-734D-45C3-998E-7B5A5B359431\nFirst sector: 41988096 (at 20.0 GiB)\nLast sector: 522235903 (at 249.0 GiB)\nPartition size: 480247808 sectors (229.0 GiB)\nAttribute flags: 0000000000000000\nPartition name: '\\'''\\'''\n+ guid=E6D6D379-F507-44C2-A23C-238F2A3DF928\n+ '[' E6D6D379-F507-44C2-A23C-238F2A3DF928 == 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D ']'\n+ '[' E6D6D379-F507-44C2-A23C-238F2A3DF928 == 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 ']'\n+ part_no=5\n+ '[' false == true ']'\n+ set +e\n+ for f in '/dev/disk/by-path/*'\n+ '[' '!' -e /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0-part1 ']'\n++ readlink -f /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0-part1\n+ dev=/dev/sdb1\n+ grep -q 'TYPE=\"disk\"'\n+ lsblk --nodeps --pairs /dev/sdb1\n+ '[' 1 -ne 0 ']'\n+ continue\n+ for f in '/dev/disk/by-path/*'\n+ '[' '!' -e /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0-part2 ']'\n++ readlink -f /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0-part2\n+ dev=/dev/sdb2\n+ grep -q 'TYPE=\"disk\"'\n+ lsblk --nodeps --pairs /dev/sdb2\n+ '[' 1 -ne 0 ']'\n+ continue\n+ for f in '/dev/disk/by-path/*'\n+ '[' '!' -e /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0-part3 ']'\n++ readlink -f /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0-part3\n+ dev=/dev/sdb3\n+ grep -q 'TYPE=\"disk\"'\n+ lsblk --nodeps --pairs /dev/sdb3\n+ '[' 1 -ne 0 ']'\n+ continue\n+ for f in '/dev/disk/by-path/*'\n+ '[' '!' -e /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0-part4 ']'\n++ readlink -f /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0-part4\n+ dev=/dev/sdb4\n+ grep -q 'TYPE=\"disk\"'\n+ lsblk --nodeps --pairs /dev/sdb4\n+ '[' 1 -ne 0 ']'\n+ continue\n+ for f in '/dev/disk/by-path/*'\n+ '[' '!' -e /dev/disk/by-path/pci-0000:00:1f.2-ata-2.0 ']'\n++ readlink -f /dev/disk/by-path/pci-0000:00:1f.2-ata-2.0\n+ dev=/dev/sdc\n+ grep -q 'TYPE=\"disk\"'\n+ lsblk --nodeps --pairs /dev/sdc\n+ '[' 0 -ne 0 ']'\n+ set -e\n+ part_no=1\n+ ceph_disk=false\n++ awk '$1 == \"part\" {print $2}'\n++ lsblk -rip /dev/sdc -o TYPE,NAME\n+ '[' false == true ']'\n+ set +e\n+ for f in '/dev/disk/by-path/*'\n+ '[' '!' -e /dev/disk/by-path/pci-0000:00:1f.2-ata-3.0 ']'\n++ readlink -f /dev/disk/by-path/pci-0000:00:1f.2-ata-3.0\n+ dev=/dev/sdd\n+ grep -q 'TYPE=\"disk\"'\n+ lsblk --nodeps --pairs /dev/sdd\n+ '[' 0 -ne 0 ']'\n+ set -e\n+ part_no=1\n+ ceph_disk=false\n++ awk '$1 == \"part\" {print $2}'\n++ lsblk -rip /dev/sdd -o TYPE,NAME\n+ '[' false == true ']'\n+ set +e\n"}

PLAY RECAP **************************************************************************************************************************************************************************************************
localhost : ok=88 changed=12 unreachable=0 failed=1

Test Activity
-------------
installation

Revision history for this message
Peng Peng (ppeng) wrote :
Yang Liu (yliu12)
tags: added: stx.retestneeded
Changed in starlingx:
assignee: nobody → Ovidiu Poncea (ovidiu.poncea)
Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

I ran ansible-playbook /usr/share/ansible/stx-ansible/playbooks/bootstrap.yml 3 times in a row and it does not reproduce, so re-play in a normal deployment works.

Peng, a few questions:
1. Can you upload lab-install-playbook.yaml and local-install-overrides.yaml? I want to try with them as well.

2. Is this 100% reproducible?

3. Was this run locally? It seems so, but I need confirmation.

================================

Looking over the above logs, it seems that the wipe-ceph-osds script executes correctly, yet the task throws this: "privilege output closed while waiting for password prompt:"

The error happens here:

- block:
  - name: Wipe ceph osds
    script: wipe_osds.sh
    register: results
    become: yes
    become_user: root

Which is called from here:
    - { role: bootstrap/validate-config, when: not skip_play, become: yes }

Notice that both have become: yes, which means that ansible escalates to root privileges twice.

Now, searching for similar issues, I stumbled upon a 2016 Ansible bug, https://github.com/ansible/ansible/issues/13763, which has an identical error. Although that bug got fixed and the condition is different, one comment is important:

"I am going to guess that perhaps this is because on attempt one, sudo password is asked, and by attempt 2 sudo doesnt need a password. Thus the command completes but looks like an error and ansible stops processing the rest"

This seems to be the case here: we are asking for root privileges twice. That shouldn't be a problem, but it is... which could point to an Ansible issue.

For more details I need to reproduce it. The fix may be as simple as removing
    become: yes
    become_user: root
from the code or as bad as fixing Ansible itself.
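
For illustration, a minimal sketch of that simpler option, assuming the role-level become: yes shown above stays in place (this matches the shape of the fix proposed further down in this thread, https://review.opendev.org/689730):

- block:
  - name: Wipe ceph osds
    # become/become_user dropped here: the whole bootstrap/validate-config
    # role is already invoked with become: yes, so the task runs as root
    # without a second privilege escalation.
    script: wipe_osds.sh
    register: results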

Ovidiu

Revision history for this message
Peng Peng (ppeng) wrote :

I ran it twice. Both runs hit the same issue.

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / medium priority - this affects the ability to re-run ansible playbooks, which is a supported operation.

tags: added: stx.3.0 stx.config
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.storage
Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

Retried with lab-install-playbook.yaml and it does not reproduce. I need the setup where it reproduces; when you get time, please retest and ping me once it reproduces.

Thank you!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/689730

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/689730
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=7b9d8c9a0d828ad3621596923301fe7a9ce4bb8d
Submitter: Zuul
Branch: master

commit 7b9d8c9a0d828ad3621596923301fe7a9ce4bb8d
Author: Ovidiu Poncea <email address hidden>
Date: Mon Oct 21 15:02:56 2019 +0300

    B&R: Remove unnecessary privilege escalation

    The whole role gets privilege escalation, no need to do it
    one more time for wipe_ceph_osds.

    Change-Id: Ib2a6b90934c1f8ac0178e2bf0e7af9203d00751c
    Closes-Bug: 1847336
    Signed-off-by: Ovidiu Poncea <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

The same issue was not found in a recent DC installation.

tags: removed: stx.retestneeded
Revision history for this message
Yosief Gebremariam (ygebrema) wrote :

This issue still appears to exist in the distributed cloud lab on replay. Logs are attached.

Revision history for this message
Frank Miller (sensfan22) wrote :

Changing state from Fix Released back to Triaged as the issue has occurred again in a load that includes this commit: https://review.opendev.org/689730

Additional investigation is required. Will request access to the lab where the issue occurred in order to debug, since Ovidiu and Tyler cannot reproduce it in their lab.

Changed in starlingx:
status: Fix Released → Triaged
Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
Austin Sun (sunausti) wrote :

Hi, Yang:
    has this issue been reproduced recently?

Thanks.
BR
Austin Sun.

Frank Miller (sensfan22)
Changed in starlingx:
assignee: Ovidiu Poncea (ovidiu.poncea) → Stefan Dinescu (stefandinescu)
Revision history for this message
Yosief Gebremariam (ygebrema) wrote :

I have attempted to reproduce the issue a couple of times, but in both cases a replay was successful. Initially, I failed the bootstrap by changing the registry names. The initial failure log was:
TASK [common/push-docker-images : Download images and push to local registry] *************************************************************************************************************************************************************************
fatal: [localhost]: FAILED! => {"changed": true, "msg": "non-zero return code", "rc": 1, "stderr": "Traceback (most recent call last):\n File \"/tmp/.ansible-sysadmin/tmp/ansible-tmp-1575920855.86-196888930696435/download_images.py\", line 117, in <module>\n raise Exception(\"Failed to download images %s\" % failed_downloads)\nException: Failed to download images [u'tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/iuay.io/calico/cni:v3.6.2', u'tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/iuay.io/calico/node:v3.6.2', u'tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/iuay.io/calico/kube-controllers:v3.6.2', u'tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/igcr.io/kubernetes-helm/tiller:v2.13.1', u'tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/iuay.io/airshipit/armada:8a1638098f88d92bf799ef4934abe569789b885e-ubuntu_bionic']\n", "stderr_lines": ["Traceback (most recent call last):", " File \"/tmp/.ansible-sysadmin/tmp/ansible-tmp-1575920855.86-196888930696435/download_images.py\", line 117, in <module>", " raise Exception(\"Failed to download images %s\" % failed_downloads)", "Exception: Failed to download images [u'tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/iuay.io/calico/cni:v3.6.2', u'tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/iuay.io/calico/node:v3.6.2', u'tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/iuay.io/calico/kube-controllers:v3.6.2', u'tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/igcr.io/kubernetes-helm/tiller:v2.13.1', u'tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/iuay.io/airshipit/armada:8a1638098f88d92bf799ef4934abe569789b885e-ubuntu_bionic']"], "stdout": " Image download failed: tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/k8s.gcr.io/kube-apiserver:v1.16.2500 Server Error: Internal Server Error (\"Get https://tis-lab-registry.cumulus.wrs.com:9001/v2/: dial tcp: lookup tis-lab-registry.cumulus.wrs.com on [2620:10a:a001:a103::2]:53: server misbehaving\")\nSleep 20s before retry downloading image tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/k8s.gcr.io/kube-apiserver:v1.16.2 ...\n Image download failed: tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/k8s.gcr.io/kube-apiserver:v1.16.2500 Server Error: Internal Server Error (\"Get https://tis-lab-registry.cumulus.wrs.com:9001/v2/: dial tcp: lookup tis-lab-registry.cumulus.wrs.com on [2620:10a:a001:a103::2]:53: server misbehaving\")\nSleep 20s before retry downloading image tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/k8s.gcr.io/kube-apiserver:v1.16.2 ...\nImage download succeeded: tis-lab-registry.cumulus.wrs.com:9001/wrcp-staging/k8s.gcr.io/kube-apiserver:v1.16.2\nImage push succeeded: registry.local:9001/k8s.gcr.io/kube-apiserver:v1.16.2\nImage download succeeded: tis-...

Revision history for this message
Austin Sun (sunausti) wrote :

Thanks Yosief:
      does this mean we can close this issue now?

Revision history for this message
Stefan Dinescu (stefandinescu) wrote :

It seems this issue was specific to that particular lab and could not be reproduced in any other lab. It also seems that, at the moment, it cannot be reproduced in that lab either.

I think this issue should be closed for now.

Changed in starlingx:
status: Triaged → Invalid
Peng Peng (ppeng)
tags: removed: stx.retestneeded
Revision history for this message
Sabeel Ansari (sansariwr) wrote :

Observed this problem on a VM when trying a bootstrap replay before controller-0 was unlocked.

The steps to reproduce this issue are pretty simple and straightforward.

1. Create an AIO-SX VM using the VDM with the last step, 'unlock_controller_0', commented out. Environment: Windows & Cygwin running VirtualBox 6.1.
2. Once the installation is completed, run the following command:

ansible-playbook /usr/share/ansible/stx-ansible/playbooks/bootstrap.yml

The following error is seen on bootstrap replay:

TASK [bootstrap/validate-config : Check for ssl_ca_cert file] ********************************************************************************
TASK [bootstrap/validate-config : Check OpenID Connect parameters] ***************************************************************************
TASK [bootstrap/validate-config : Check for Ceph data wipe flag] *****************************************************************************
TASK [bootstrap/validate-config : Wipe ceph osds] ********************************************************************************************fatal: [localhost]: FAILED! => {"msg": "privilege output closed while waiting for password prompt:\n+ for f in '/dev/disk/by-path/*'\n+ '[' '!' -e /dev/disk/by-path/pci-0000:00:01.1-ata-2.0 ']'\n++ readlink -f /dev/disk/by-path/pci-0000:00:01.1-ata-2.0\n+ dev=/dev/sr0\n+ grep -q 'TYPE=\"disk\"'\n+ lsblk --nodeps --pairs /dev/sr0\n+ '[' 1 -ne 0 ']'\n+ continue\n+ for f in '/dev/disk/by-path/*'\n+ '[' '!' -e /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0 ']'\n++ readlink -f /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0\n+ dev=/dev/sda\n+ grep -q 'TYPE=\"disk\"'\n+ lsblk --nodeps --pairs /dev/sda\n+ '[' 0 -ne 0 ']'\n+ set -e\n+ part_no=1\n+ ceph_disk=false\n++ awk '$1 == \"part\" {print $2}'\n++ lsblk -rip /dev/sda -o TYPE,NAME\n+ for part in '$(lsblk -rip ${dev} -o TYPE,NAME | awk '\\''$1 == \"part\" {print $2}'\\'')'\n++ flock /dev/sda sgdisk -i 1 /dev/sda\n+ sgdisk_part_info='Partition GUID code: 21686148-6449-6E6F-744E-656564454649 (BIOS boot partition)\nPartition unique GUID: 13E4D3EB-3CC3-4F95-A700-5920B722C26B\nFirst sector: 2048 (at 1024.0 KiB)\nLast sector: 4095 (at 2.0 MiB)\nPartition size: 2048 sectors (1024.0 KiB)\nAttribute flags: 0000000000000000\nPartition name: '\\''primary'\\'''\n++ grep 'Partition GUID code'\n++ awk '{print $4;}'\n++ echo 'Partition GUID code: 21686148-6449-6E6F-744E-656564454649 (BIOS boot partition)\nPartition unique GUID: 13E4D3EB-3CC3-4F95-A700-5920B722C26B\nFirst sector: 2048 (at 1024.0 KiB)\nLast sector: 4095 (at 2.0 MiB)\nPartition size: 2048 sectors (1024.0 KiB)\nAttribute flags: 0000000000000000\nPartition name: '\\''primary'\\'''\n+ guid=21686148-6449-6E6F-744E-656564454649\n+ '[' 21686148-6449-6E6F-744E-656564454649 == 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D ']'\n+ '[' 21686148-6449-6E6F-744E-656564454649 == 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 ']'\n+ part_no=2\n+ for part in '$(lsblk -rip ${dev} -o TYPE,NAME | awk '\\''$1 == \"part\" {print $2}'\\'')'\n++ flock /dev/sda sgdisk -i 2 /dev/sda\n+ sgdisk_part_info='Partition GUID code: BA5EBA11-0000-1111-2222-000000000002 (Unknown)\nPartition unique GUID: 82174DF2-2989-41A4-BC88-CB8FF9987B9E\nFirst sector: 4...

Revision history for this message
Dan Voiculeasa (dvoicule) wrote :

Steps to reproduce:
1) Either with automation or manually: install controller-0, run bootstrap, stop here.
2)Run `ansible-playbook /usr/share/ansible/stx-ansible/playbooks/bootstrap.yml`

Notes:
1) `sudo ls` asks for the password the first time. Then, for some time, you can `sudo whatever` without being asked for a password. After a while it asks for the password once more, and then for some time it doesn't.
2) The exact opposite happens when running the playbook: the first run is OK, then for some time reruns show the error. Then one more run passes, and then subsequent reruns show the error again.

3) The return code is always 0; adding `failed_when: results.rc != 0` doesn't help because the error takes precedence before the rc is interpreted.

4) It only happens if the called script writes something to stderr.
 - For example, I called a script containing `echo 123 >&2` right before the script where we're seeing the issue. The new script then failed with the same error.
   After removing the write to stderr, it went past the new script and hit the error on the wipe_osds.sh script call.
   I also did some experimenting with set -x.

 - wipe_osds.sh uses set -x, and set -x outputs each interpreted line to stderr.

5) ansible does: EXEC /bin/sh -c 'sudo -H -S -p "[sudo via ansible, key=jmesjrbeakbdiytifqllayafvphllbbc] password: " -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-jmesjrbeakbdiytifqllayafvphllbbc; /tmp/.ansible-sysadmin/tmp/ansible-tmp-1591231055.86-255198310786494/wipe_osds.sh'"'"' && sleep 0'

All of the above makes me think the issue is internal to Ansible and related to the script module.

Revision history for this message
Dan Voiculeasa (dvoicule) wrote :

One way to work around this is to remove `set -x` from wipe_osds.sh, but I'm not sure we want that; we still need that information for debugging purposes. Maybe find a way to output directly to stdout.

`Not writing to stderr if there is no error` applies to all scripts called with the `script` ansible module that run after this one.
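
As a sketch of that last idea: the script module's command line goes through /bin/sh -c on the target (see the EXEC line above), so a 2>&1 redirect can be appended to the task's free-form arguments; this is the same form shown to pass in the test play in the next comment. The task and script names are the ones from this bug, and whether exposing the trace in results.stdout is acceptable is a separate question.

  - name: Wipe ceph osds
    # The appended 2>&1 is honored because the command is run through
    # /bin/sh -c on the target; the set -x trace then goes to stdout, so
    # nothing is written to stderr while ansible waits on the become prompt.
    script: wipe_osds.sh 2>&1
    register: results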

Revision history for this message
Dan Voiculeasa (dvoicule) wrote :

ansible-playbook /home/sysadmin/play1.yml -vvvvvvvvvv

controller-0:~$ cat play1.yml
- hosts: all
  gather_facts: no
  become: yes
  vars:
    ansible_become_pass: EDITED

  tasks:
    - name: TESTB
      script: b.sh 2>&1 # or without 2>&1

# SAME thing happens with both b.sh
controller-0:~$ cat b.sh
set -x
echo 123153345

controller-0:~$ cat b.sh
echo 123153345 >&2

Results:
OK: script: b.sh 2>&1
FAIL: script: b.sh

But this only happens after the StarlingX ansible bootstrap was successfully/partially run. It remains to be determined which step of the ansible bootstrap needs to be reached and executed for subsequent replays to fail.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Re-opening as the issue is seen again in stx.4.0

Changed in starlingx:
status: Invalid → Confirmed
tags: added: stx.4.0
removed: stx.3.0
Frank Miller (sensfan22)
Changed in starlingx:
assignee: Stefan Dinescu (stefandinescu) → Dan Voiculeasa (dvoicule)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/736975

Changed in starlingx:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/736975
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=cd5fc0deb4649031eee00bca61f59e8f394c3b51
Submitter: Zuul
Branch: master

commit cd5fc0deb4649031eee00bca61f59e8f394c3b51
Author: Dan Voiculeasa <email address hidden>
Date: Fri Jun 19 13:39:00 2020 +0300

    Insert sleep into wipe_osds.sh script

    Become mechanism has issues with correctly detecting the BECOME-SUCCESS
    message. It happens when tasks quick to produce output.
    Example: calling bash scripts using `script` module.

    The python scripts seem not to be affected by this due to overhead of
    loading.
    Command and shell modules will call a python process to execute payload.

    Workaround: insert sleep. sleep 0 will not suffice.
    Testing with sleep 0 the issue reproduced 1/100.

    Created an upstream issue
    https://github.com/ansible/ansible/issues/70092

    Closes-Bug: 1847336
    Change-Id: Ibbba9c858b0dde535bfce753369a0f26159db8d8
    Signed-off-by: Dan Voiculeasa <email address hidden>
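
As an aside, the commit message's note that the command and shell modules execute their payload through a Python wrapper suggests another possible mitigation, sketched below. This is not the merged fix (which adds a sleep to wipe_osds.sh), and it assumes the script has already been copied to the target; /tmp/wipe_osds.sh is purely an illustrative path.

  - name: Wipe ceph osds
    # Alternative sketch, not the merged fix: the shell module runs its
    # payload through Ansible's Python wrapper, whose start-up overhead
    # avoids the fast-output race in BECOME-SUCCESS detection described above.
    # Assumes wipe_osds.sh was placed on the target beforehand (illustrative path).
    shell: /bin/bash /tmp/wipe_osds.sh
    register: results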

Changed in starlingx:
status: In Progress → Fix Released