Disk not found causes ceph osd bootstrap to fail

Bug #1824787 reported by wangwei
This bug affects 2 people

| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| kolla | Fix Released | Medium | Unassigned | |
| Rocky | Fix Released | Medium | Unassigned | |
| Stein | Fix Released | Medium | Unassigned | |
| Train | Fix Released | Medium | Unassigned | |

Bug Description

I am deploying a ceph cluster using the following configuration:

```
ceph-node1 mon/mgr/osd disk:sdb/sdc/sdd

ceph-node2 mon/osd disk:sdb/sdc/sdd

ceph-node3 mon/osd disk:sdb/sdc/sdd
```
The commands used to initialize the disks are as follows:

```
sudo sgdisk --zap-all -- /dev/sdb
sudo sgdisk --zap-all -- /dev/sdc
sudo sgdisk --zap-all -- /dev/sdd

sudo /sbin/parted /dev/sdb -s -- mklabel gpt mkpart KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1 1 -1
sudo /sbin/parted /dev/sdc -s -- mklabel gpt mkpart KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO2 1 -1
sudo /sbin/parted /dev/sdd -s -- mklabel gpt mkpart KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO3 1 -1
```

During deployment, however, some OSD bootstraps fail:

```
"stderr": "+ sudo -E kolla_set_configs\n
INFO:__main__:Loading config file at /var/lib/kolla/config_files/config.json\n
INFO:__main__:Validating config file\n
INFO:__main__:Kolla config strategy set to: COPY_ALWAYS\n
INFO:__main__:Copying service configuration files\n
INFO:__main__:Copying /var/lib/kolla/config_files/ceph.conf to /etc/ceph/ceph.conf\n
INFO:__main__:Setting permission for /etc/ceph/ceph.conf\n
INFO:__main__:Copying /var/lib/kolla/config_files/ceph.client.admin.keyring to /etc/ceph/ceph.client.admin.keyring\n
INFO:__main__:Setting permission for /etc/ceph/ceph.client.admin.keyring\n
INFO:__main__:Writing out command to execute\n
++ cat /run_command\n
+ CMD='/usr/bin/ceph-osd -f --public-addr 192.168.10.12 --cluster-addr 192.168.10.12'\n
+ ARGS=\n
+ [[ ! -n '' ]]\n
+ . kolla_extend_start\n
++ [[ ! -d /var/log/kolla/ceph ]]\n
+++ stat -c %a /var/log/kolla/ceph\n
++ [[ 2755 != \\7\\5\\5 ]]\n
++ chmod 755 /var/log/kolla/ceph\n
++ [[ -n 0 ]]\n
++ CEPH_JOURNAL_TYPE_CODE=45B0969E-9B03-4F30-B4C6-B4B80CEFF106\n
++ CEPH_OSD_TYPE_CODE=4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D\n
++ CEPH_OSD_BS_WAL_TYPE_CODE=0FC63DAF-8483-4772-8E79-3D69D8477DE4\n
++ CEPH_OSD_BS_DB_TYPE_CODE=CE8DF73C-B89D-45B0-AD98-D45332906d90\n
++ ceph quorum_status\n
++ [[ False == \\F\\a\\l\\s\\e ]]\n
++ [[ bluestore == \\b\\l\\u\\e\\s\\t\\o\\r\\e ]]\n
++ [[ /dev/sdd =~ /dev/loop ]]\n
++ sgdisk --zap-all -- /dev/sdd1\n
++ '[' -n '' ']'\n
++ sgdisk --zap-all -- /dev/sdd\n
++ sgdisk --new=1:0:+100M --mbrtogpt -- /dev/sdd\n
++ sgdisk --largest-new=2 --mbrtogpt -- /dev/sdd\n
++ sgdisk --zap-all -- /dev/sdd2\n
Problem opening /dev/sdd2 for reading! Error is 2.\n
The specified file does not exist!\n
Problem opening '' for writing! Program will now terminate.\n
Warning! MBR not overwritten! Error is 2!\n",

```

The relevant kolla code is as follows (kolla/docker/ceph/ceph-osd/extend_start.sh):

```
                sgdisk --zap-all -- "${OSD_BS_DEV}"
                sgdisk --new=1:0:+100M --mbrtogpt -- "${OSD_BS_DEV}"
                sgdisk --largest-new=2 --mbrtogpt -- "${OSD_BS_DEV}"
                sgdisk --zap-all -- "${OSD_BS_DEV}"2
```

A partprobe command should be added here so that the partition table is re-read before the new partition is used:

```
                sgdisk --zap-all -- "${OSD_BS_DEV}"
                sgdisk --new=1:0:+100M --mbrtogpt -- "${OSD_BS_DEV}"
                sgdisk --largest-new=2 --mbrtogpt -- "${OSD_BS_DEV}"
                partprobe || true
                sgdisk --zap-all -- "${OSD_BS_DEV}"2
```

wangwei (wangwei-david) wrote :

https://review.openstack.org/652612

This commit fixes this problem

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla (master)

Reviewed: https://review.openstack.org/652612
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=ddab09fdd8d7c08ce8db70948ea12571bb4267a8
Submitter: Zuul
Branch: master

commit ddab09fdd8d7c08ce8db70948ea12571bb4267a8
Author: wangwei <email address hidden>
Date: Mon Apr 15 19:13:47 2019 +0900

    Fix the problem of osd initialization failed

    When deploying osd, if the user does not use the extra block
    partition, the kolla will automatically partition the disk and then
    clean up the data on the disk partition. Sometimes the disk partition
    will not be updated, there will be an error not finding the partition.

    This commit fixes the problem.

    Change-Id: I14708f38614dcb75268c2f460ae3d921748c2d10
    Closes-bug: #1824787

Changed in kolla:
status: New → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/652923

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.openstack.org/652925

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla (stable/stein)

Reviewed: https://review.openstack.org/652925
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=ea14e33bc694b8a3befda6cf751d4a2032b98c4a
Submitter: Zuul
Branch: stable/stein

commit ea14e33bc694b8a3befda6cf751d4a2032b98c4a
Author: wangwei <email address hidden>
Date: Mon Apr 15 19:13:47 2019 +0900

    Fix the problem of osd initialization failed

    When deploying osd, if the user does not use the extra block
    partition, the kolla will automatically partition the disk and then
    clean up the data on the disk partition. Sometimes the disk partition
    will not be updated, there will be an error not finding the partition.

    This commit fixes the problem.

    Change-Id: I14708f38614dcb75268c2f460ae3d921748c2d10
    Closes-bug: #1824787
    (cherry picked from commit ddab09fdd8d7c08ce8db70948ea12571bb4267a8)

tags: added: in-stable-stein
tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla (stable/rocky)

Reviewed: https://review.openstack.org/652923
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=9f85941324904fc61f88bc14175e4e364cee7f6f
Submitter: Zuul
Branch: stable/rocky

commit 9f85941324904fc61f88bc14175e4e364cee7f6f
Author: wangwei <email address hidden>
Date: Mon Apr 15 19:13:47 2019 +0900

    Fix the problem of osd initialization failed

    When deploying osd, if the user does not use the extra block
    partition, the kolla will automatically partition the disk and then
    clean up the data on the disk partition. Sometimes the disk partition
    will not be updated, there will be an error not finding the partition.

    This commit fixes the problem.

    Change-Id: I14708f38614dcb75268c2f460ae3d921748c2d10
    Closes-bug: #1824787
    (cherry picked from commit ddab09fdd8d7c08ce8db70948ea12571bb4267a8)

Magnus Lööf (magnus-loof) wrote :

We are still seeing this in our CI system. It started to appear around the same time the patch was issued.

After some testing, it seems to appear /sometimes/ but not always. The fix neither worsened the problem nor solved it completely.

As a temporary workaround, I added a `sleep 120` after the `partprobe || true`, which solves it for me in our CI environment.

wangwei (wangwei-david) wrote :

@Magnus Lööf (magnus-loof)

Can you share the error log?

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla 7.0.3

This issue was fixed in the openstack/kolla 7.0.3 release.

Guillaume Chenuet (gchenuet) wrote :

Hi,

Same error on my side with kolla-ansible version 7.0.3.
Because we deploy many OSDs at the same time on the same host, the deployment sometimes fails (on a random disk) at the `partprobe || true` command.

The workaround written by Magnus Lööf (magnus-loof) works for me (with the sleep reduced to 60), but it is not a really clean way to fix it.

Should we use an "until" block around the `sgdisk --zap-all -- "${OSD_BS_DEV}"2` command?
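
The "until" idea raised here could look roughly like the following. This is a minimal sketch, not code from any kolla change: the function name `retry_until`, its arguments, and the attempt counts are hypothetical. It retries a command a bounded number of times instead of sleeping for a fixed period.

```shell
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry_until <max_attempts> <delay_seconds> <command> [args...]
retry_until() {
    local max_attempts="$1" delay="$2"
    shift 2
    local attempt=1
    # "$@" is the command to retry; the loop ends as soon as it returns 0.
    until "$@"; do
        if [ "${attempt}" -ge "${max_attempts}" ]; then
            echo "giving up after ${attempt} attempts: $*" >&2
            return 1
        fi
        attempt=$(( attempt + 1 ))
        sleep "${delay}"
    done
    return 0
}
```

For example, `retry_until 10 1 test -b "${OSD_BS_DEV}"2` placed before the final `sgdisk --zap-all` would wait up to about 9 seconds for the partition node to exist, and fail explicitly if it never does.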

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla (master)

Fix proposed to branch: master
Review: https://review.opendev.org/670001

wangwei (wangwei-david) wrote :

@Magnus Lööf, Guillaume Chenuet

Hi, in my production environment I used `sleep 1` to avoid this problem.
Because the Ceph OSD initialization I use differs from the community version, I thought this was a problem specific to my setup.

I have submitted a commit that solves this problem more thoroughly:

https://review.opendev.org/#/c/670001/

In this commit I used the same partprobe invocation as ceph-disk:

udevadm settle --timeout=600
flock -s ${device} partprobe ${device}
udevadm settle --timeout=600

The reason is that `partprobe || true` refreshes the information for all disks, whereas the commands above refresh only the specified disk.

I then added a loop that waits for the partition, for at most 10 s. In general, this is enough time for the partition to appear.
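
The approach described in this comment can be sketched as follows. This is an illustration under stated assumptions, not the code from review 670001: the function names `reread_partition_table` and `wait_for_partition` are hypothetical, and the 10-second limit with 1-second polling is taken from the description here. The first function needs root and a real block device; the second only polls a path.

```shell
# Hypothetical helper: ask the kernel to re-read one device's partition table,
# using the command sequence quoted in the comment.
reread_partition_table() {
    local device="$1"
    udevadm settle --timeout=600
    flock -s "${device}" partprobe "${device}"
    udevadm settle --timeout=600
}

# Hypothetical helper: poll once per second until the partition node exists.
# Returns 0 when it appears, 1 if it is still missing after the timeout.
wait_for_partition() {
    local partition="$1" timeout="${2:-10}"
    local waited=0
    # -e keeps the sketch testable with ordinary files; -b would be stricter.
    until [ -e "${partition}" ]; do
        if [ "${waited}" -ge "${timeout}" ]; then
            echo "partition ${partition} did not appear within ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$(( waited + 1 ))
    done
    return 0
}
```

In the bootstrap script, a call such as `reread_partition_table "${OSD_BS_DEV}" && wait_for_partition "${OSD_BS_DEV}2"` would sit between the `sgdisk --largest-new=2` step and the `sgdisk --zap-all -- "${OSD_BS_DEV}"2` step. Limiting partprobe to one device matters when several OSDs are bootstrapped on the same host at once.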

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla (master)

Reviewed: https://review.opendev.org/670001
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=342c9f0cd0dc4b97870dc40fab6c5a7d241b40ed
Submitter: Zuul
Branch: master

commit 342c9f0cd0dc4b97870dc40fab6c5a7d241b40ed
Author: wangwei <email address hidden>
Date: Wed Jul 10 16:29:48 2019 +0900

    Add partition detection to fix osd initialization failure

    The changes in the following commit did not completely solve the
    problem of osd initialization failure:
    https://review.opendev.org/652612

    In order to solve this problem, the partprobe command consistent with
    ceph-disk is added, and then a loop is added to detect the partition,
    which is acquired every 1s. When the partition appears, the osd
    initialization is continued.

    Change-Id: I0ca255c6358132d9e3acfa6b610b70a78756512c
    Closes-bug: #1824787

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla 8.0.0.0rc2

This issue was fixed in the openstack/kolla 8.0.0.0rc2 release candidate.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/676657

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/676658

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla (stable/stein)

Reviewed: https://review.opendev.org/676657
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=64c1acd32537b0746f9f6ada2b9f63a676990ecc
Submitter: Zuul
Branch: stable/stein

commit 64c1acd32537b0746f9f6ada2b9f63a676990ecc
Author: wangwei <email address hidden>
Date: Wed Jul 10 16:29:48 2019 +0900

    Add partition detection to fix osd initialization failure

    The changes in the following commit did not completely solve the
    problem of osd initialization failure:
    https://review.opendev.org/652612

    In order to solve this problem, the partprobe command consistent with
    ceph-disk is added, and then a loop is added to detect the partition,
    which is acquired every 1s. When the partition appears, the osd
    initialization is continued.

    Change-Id: I0ca255c6358132d9e3acfa6b610b70a78756512c
    Closes-bug: #1824787
    (cherry picked from commit 342c9f0cd0dc4b97870dc40fab6c5a7d241b40ed)

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla (stable/rocky)

Reviewed: https://review.opendev.org/676658
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=e6eb8b030eaa366d3c9ef3ea58c23ac7128e14de
Submitter: Zuul
Branch: stable/rocky

commit e6eb8b030eaa366d3c9ef3ea58c23ac7128e14de
Author: wangwei <email address hidden>
Date: Wed Jul 10 16:29:48 2019 +0900

    Add partition detection to fix osd initialization failure

    The changes in the following commit did not completely solve the
    problem of osd initialization failure:
    https://review.opendev.org/652612

    In order to solve this problem, the partprobe command consistent with
    ceph-disk is added, and then a loop is added to detect the partition,
    which is acquired every 1s. When the partition appears, the osd
    initialization is continued.

    Change-Id: I0ca255c6358132d9e3acfa6b610b70a78756512c
    Closes-bug: #1824787
    (cherry picked from commit 342c9f0cd0dc4b97870dc40fab6c5a7d241b40ed)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla 7.0.4

This issue was fixed in the openstack/kolla 7.0.4 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla 8.0.1

This issue was fixed in the openstack/kolla 8.0.1 release.

Mark Goddard (mgoddard)
Changed in kolla:
milestone: none → 9.0.0
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla 9.0.0.0rc1

This issue was fixed in the openstack/kolla 9.0.0.0rc1 release candidate.
