wipe_osds.sh fails due to race condition

Bug #2056765 reported by Erickson Silva de Oliveira
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Erickson Silva de Oliveira

Bug Description

Brief Description
-----------------
The first Ansible bootstrap attempt failed in the "common/wipe-ceph-osds : Wipe ceph osds" task with the following output:

2024-03-01 17:21:12,542 p=6063 u=sysadmin n=ansible | TASK [common/wipe-ceph-osds : Check for Ceph data wipe flag] *******************
2024-03-01 17:21:12,542 p=6063 u=sysadmin n=ansible | Friday 01 March 2024 17:21:12 +0000 (0:00:00.032) 0:00:39.271 **********
2024-03-01 17:21:12,553 p=6063 u=sysadmin n=ansible | skipping: [localhost]
2024-03-01 17:21:12,556 p=6063 u=sysadmin n=ansible | TASK [common/wipe-ceph-osds : Wipe ceph osds] **********************************
2024-03-01 17:21:12,556 p=6063 u=sysadmin n=ansible | Friday 01 March 2024 17:21:12 +0000 (0:00:00.014) 0:00:39.285 **********
2024-03-01 17:21:15,051 p=6063 u=sysadmin n=ansible | fatal: [localhost]: FAILED! => changed=true
  msg: non-zero return code
  rc: 1
  stderr: |-
    + for f in /dev/disk/by-path/*
    + '[' '!' -e /dev/disk/by-path/pci-0000:00:17.0-ata-1.0 ']'
    ++ readlink -f /dev/disk/by-path/pci-0000:00:17.0-ata-1.0
    + dev=/dev/sda
    + lsblk --nodeps --pairs /dev/sda
    + grep -q 'TYPE="disk"'
    + multipath -c /dev/sda
    + set -e
    + wipe_if_ceph_disk /dev/sda 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 0
    + __dev=/dev/sda
    + __osd_guid=4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D
    + __journal_guid=45B0969E-9B03-4F30-B4C6-B4B80CEFF106
    + __is_multipath=0
    + ceph_disk=false
    ++ flock /dev/sda sfdisk -q -l /dev/sda
    ++ awk '$1 == "Device" {i=1; next}; i {print $1}'
    + for part in $(flock "${__dev}" sfdisk -q -l "${__dev}" | awk '$1 == "Device" {i=1; next}; i {print $1}')
    ++ udevadm info /dev/sda1
    ++ grep -oP -m1 'E: PARTN=\K.*|E: DM_PART=\K.*'
    + part_no=1
    ++ flock /dev/sda sfdisk --part-type /dev/sda 1
    + guid=BA5EBA11-0000-1111-2222-000000000002
    + '[' BA5EBA11-0000-1111-2222-000000000002 = 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D ']'
    + '[' BA5EBA11-0000-1111-2222-000000000002 = 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 ']'
    + for part in $(flock "${__dev}" sfdisk -q -l "${__dev}" | awk '$1 == "Device" {i=1; next}; i {print $1}')
    ++ udevadm info /dev/sda2
    ++ grep -oP -m1 'E: PARTN=\K.*|E: DM_PART=\K.*'
    + part_no=2
    ++ flock /dev/sda sfdisk --part-type /dev/sda 2
    + guid=C12A7328-F81F-11D2-BA4B-00A0C93EC93B
    + '[' C12A7328-F81F-11D2-BA4B-00A0C93EC93B = 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D ']'
    + '[' C12A7328-F81F-11D2-BA4B-00A0C93EC93B = 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 ']'
    + for part in $(flock "${__dev}" sfdisk -q -l "${__dev}" | awk '$1 == "Device" {i=1; next}; i {print $1}')
    ++ udevadm info /dev/sda3
    ++ grep -oP -m1 'E: PARTN=\K.*|E: DM_PART=\K.*'
    + part_no=3
    ++ flock /dev/sda sfdisk --part-type /dev/sda 3
    + guid=0FC63DAF-8483-4772-8E79-3D69D8477DE4
    + '[' 0FC63DAF-8483-4772-8E79-3D69D8477DE4 = 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D ']'
    + '[' 0FC63DAF-8483-4772-8E79-3D69D8477DE4 = 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 ']'
    + for part in $(flock "${__dev}" sfdisk -q -l "${__dev}" | awk '$1 == "Device" {i=1; next}; i {print $1}')
    ++ udevadm info /dev/sda4
    ++ grep -oP -m1 'E: PARTN=\K.*|E: DM_PART=\K.*'
    + part_no=4
    ++ flock /dev/sda sfdisk --part-type /dev/sda 4
    + guid=E6D6D379-F507-44C2-A23C-238F2A3DF928
    + '[' E6D6D379-F507-44C2-A23C-238F2A3DF928 = 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D ']'
    + '[' E6D6D379-F507-44C2-A23C-238F2A3DF928 = 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 ']'
    + '[' false = true ']'
    + set +e
    + for f in /dev/disk/by-path/*
    + '[' '!' -e /dev/disk/by-path/pci-0000:00:17.0-ata-1.0-part1 ']'
    ++ readlink -f /dev/disk/by-path/pci-0000:00:17.0-ata-1.0-part1
    + dev=/dev/sda1
    + lsblk --nodeps --pairs /dev/sda1
    + grep -q 'TYPE="disk"'
    + continue
    + for f in /dev/disk/by-path/*
    + '[' '!' -e /dev/disk/by-path/pci-0000:00:17.0-ata-1.0-part2 ']'
    ++ readlink -f /dev/disk/by-path/pci-0000:00:17.0-ata-1.0-part2
    + dev=/dev/sda2
    + lsblk --nodeps --pairs /dev/sda2
    + grep -q 'TYPE="disk"'
    + continue
    + for f in /dev/disk/by-path/*
    + '[' '!' -e /dev/disk/by-path/pci-0000:00:17.0-ata-1.0-part3 ']'
    ++ readlink -f /dev/disk/by-path/pci-0000:00:17.0-ata-1.0-part3
    + dev=/dev/sda3
    + lsblk --nodeps --pairs /dev/sda3
    + grep -q 'TYPE="disk"'
    + continue
    + for f in /dev/disk/by-path/*
    + '[' '!' -e /dev/disk/by-path/pci-0000:00:17.0-ata-1.0-part4 ']'
    ++ readlink -f /dev/disk/by-path/pci-0000:00:17.0-ata-1.0-part4
    + dev=/dev/sda4
    + lsblk --nodeps --pairs /dev/sda4
    + grep -q 'TYPE="disk"'
    + continue
    + for f in /dev/disk/by-path/*
    + '[' '!' -e /dev/disk/by-path/pci-0000:00:17.0-ata-2.0 ']'
    ++ readlink -f /dev/disk/by-path/pci-0000:00:17.0-ata-2.0
    + dev=/dev/sdb
    + lsblk --nodeps --pairs /dev/sdb
    + grep -q 'TYPE="disk"'
    + multipath -c /dev/sdb
    + set -e
    + wipe_if_ceph_disk /dev/sdb 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 0
    + __dev=/dev/sdb
    + __osd_guid=4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D
    + __journal_guid=45B0969E-9B03-4F30-B4C6-B4B80CEFF106
    + __is_multipath=0
    + ceph_disk=false
    ++ flock /dev/sdb sfdisk -q -l /dev/sdb
    ++ awk '$1 == "Device" {i=1; next}; i {print $1}'
    + for part in $(flock "${__dev}" sfdisk -q -l "${__dev}" | awk '$1 == "Device" {i=1; next}; i {print $1}')
    ++ udevadm info /dev/sdb1
    ++ grep -oP -m1 'E: PARTN=\K.*|E: DM_PART=\K.*'
    + part_no=1
    ++ flock /dev/sdb sfdisk --part-type /dev/sdb 1
    + guid=4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D
    + '[' 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D = 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D ']'
    + echo 'Found Ceph OSD partition #1 /dev/sdb1, erasing!'
    + dd if=/dev/zero of=/dev/sdb1 bs=512 count=34
    ++ blockdev --getsz /dev/sdb1
    + seek_end=1873285741
    + dd if=/dev/zero of=/dev/sdb1 bs=512 count=34 seek=1873285741
    + parted -s /dev/sdb rm 1
    + ceph_disk=true
    + parted /dev/sdb p
    + grep 'ceph data'
    + for part in $(flock "${__dev}" sfdisk -q -l "${__dev}" | awk '$1 == "Device" {i=1; next}; i {print $1}')
    ++ udevadm info /dev/sdb2
    ++ grep -oP -m1 'E: PARTN=\K.*|E: DM_PART=\K.*'
    Unknown device "/dev/sdb2": No such file or directory
    + part_no=
  stderr_lines: <omitted>
  stdout: |-
    DM_MULTIPATH_DEVICE_PATH="0"
    DM_MULTIPATH_DEVICE_PATH="0"
    Found Ceph OSD partition #1 /dev/sdb1, erasing!
  stdout_lines: <omitted>
2024-03-01 17:21:15,052 p=6063 u=sysadmin n=ansible | PLAY RECAP *********************************************************************
2024-03-01 17:21:15,053 p=6063 u=sysadmin n=ansible | localhost : ok=138 changed=24 unreachable=0 failed=1 skipped=176 rescued=0 ignored=0
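
For context, the failing loop can be reconstructed from the xtrace output above (a sketch based on the trace, not the verbatim wipe_osds.sh source): the script walks the partitions reported by sfdisk, asks udevadm for each partition number, and deletes matching Ceph partitions with parted. Deleting partition 1 makes udev re-read the disk, and the udevadm query for the next partition races with that reload:

    # Sketch reconstructed from the xtrace above -- not the verbatim wipe_osds.sh source.
    wipe_if_ceph_disk() {
        local __dev=$1 __osd_guid=$2 __journal_guid=$3
        local ceph_disk=false
        for part in $(flock "${__dev}" sfdisk -q -l "${__dev}" \
                        | awk '$1 == "Device" {i=1; next}; i {print $1}'); do
            # After the "parted ... rm" below, udev re-processes the disk and can
            # briefly remove/re-add the remaining partition nodes, so this query
            # fails for /dev/sdb2 and part_no ends up empty.
            part_no=$(udevadm info "${part}" | grep -oP -m1 'E: PARTN=\K.*|E: DM_PART=\K.*')
            guid=$(flock "${__dev}" sfdisk --part-type "${__dev}" "${part_no}")
            if [ "${guid}" = "${__osd_guid}" ] || [ "${guid}" = "${__journal_guid}" ]; then
                echo "Found Ceph OSD partition #${part_no} ${part}, erasing!"
                dd if=/dev/zero of="${part}" bs=512 count=34   # wipe partition head
                parted -s "${__dev}" rm "${part_no}"           # triggers a udev reload of the disk
                ceph_disk=true
            fi
        done
    }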

Severity
--------
Minor: Prevents the first Ansible bootstrap attempt from succeeding on lab installations performed on top of a disk that contains an older installation.

Steps to Reproduce
------------------
Install an ISO image onto a server in the All-in-One simplex configuration.

Expected Behavior
------------------
The installation should succeed on the first Ansible bootstrap attempt.

Actual Behavior
----------------
The first Ansible bootstrap attempt fails due to the wipe_osds.sh issue shown in the quoted logs above.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
AIO-SX

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
Changed in starlingx:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/912463
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/e63bfd683e5ffe3db941f921129b6fafab790c0f
Submitter: "Zuul (22348)"
Branch: master

commit e63bfd683e5ffe3db941f921129b6fafab790c0f
Author: Erickson Silva de Oliveira <email address hidden>
Date: Mon Mar 11 11:13:39 2024 -0300

    Add flock in wipe_osds.sh to avoid race condition

    When running the script to wipe the OSDs, it sometimes happened
    that the second partition was not found, even though it existed.

    Analyzing the code, it was possible to reproduce the problem,
    which is caused by a race condition: running the parted command
    on the first partition causes udev to reload the partition
    table while the second partition is still being processed.

    To solve this problem, the "flock" command was used to lock the
    entire disk, not just the partition.

    Additionally, the use of udevadm has also been removed.

    Test Plan:
      - PASS: (AIO-SX) Replace wipe_osds.sh with changes,
              run and check script output

    Closes-Bug: 2056765

    Change-Id: Icbc351a868b413a51dc8273ca422737d39756b3b
    Signed-off-by: Erickson Silva de Oliveira <email address hidden>
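
For reference, a minimal sketch of the approach the commit message describes: hold a single exclusive flock on the whole disk for the entire per-disk wipe and derive partition numbers from the node names instead of querying udevadm. This is illustrative only, not the merged wipe_osds.sh change; sfdisk --delete stands in here for the parted call seen in the trace:

    # Illustrative sketch only -- the merged change may differ in detail.
    wipe_if_ceph_disk() {
        local __dev=$1 __osd_guid=$2 __journal_guid=$3
        (
            # Hold one exclusive BSD lock on the whole disk for the entire loop;
            # systemd-udevd defers processing block-device uevents while the
            # whole-disk node is locked, so the partition nodes are not
            # removed/re-added while we are still iterating.
            flock -x 9
            for part in $(sfdisk -q -l "${__dev}" \
                            | awk '$1 == "Device" {i=1; next}; i {print $1}'); do
                # Partition number from the node name (e.g. /dev/sdb2 -> 2,
                # /dev/nvme0n1p2 -> 2); no udevadm query needed.
                part_no=$(grep -oE '[0-9]+$' <<< "${part}")
                guid=$(sfdisk --part-type "${__dev}" "${part_no}")
                if [ "${guid}" = "${__osd_guid}" ] || [ "${guid}" = "${__journal_guid}" ]; then
                    echo "Found Ceph OSD partition #${part_no} ${part}, erasing!"
                    dd if=/dev/zero of="${part}" bs=512 count=34   # wipe partition head
                    sfdisk --delete "${__dev}" "${part_no}"        # stand-in for "parted -s ... rm"
                fi
            done
        ) 9<"${__dev}"
    }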

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.10.0 stx.storage
Changed in starlingx:
assignee: nobody → Erickson Silva de Oliveira (esilvade)