ceph-disk times out when preparing an NVMe based OSD

Bug #1913343 reported by Bob Church
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  High        Bob Church

Bug Description

Brief Description
-----------------
When ceph-disk is used to prepare a new OSD on an NVMe drive, an intermittent failure is observed: the puppet-executed command exceeds its timeout.

2020-12-08T13:37:06.376 Debug: 2020-12-08 13:37:06 +0000 Executing: '/bin/true # comment to satisfy puppet syntax requirements
2020-12-08T13:37:06.378 set -ex
2020-12-08T13:37:06.381 disk=$(readlink -f /dev/disk/by-path/pci-0000:84:00.0-nvme-1)
2020-12-08T13:37:06.383 ceph-disk --verbose --log-stdout prepare --filestore --cluster-uuid 1003fbcc-ffc6-47fc-841d-3f9d785ea99a --osd-uuid 21361e38-b0b9-4915-a4ee-f353ab44ad37 --osd-id 0 --fs-type xfs --zap-disk ${disk} $(readlink -f )
2020-12-08T13:37:06.385 mkdir -p /var/lib/ceph/osd/ceph-0
2020-12-08T13:37:06.387 ceph auth del osd.0 || true
2020-12-08T13:37:06.390 part=${disk}
2020-12-08T13:37:06.393 if [[ $part == nvme ]]; then
2020-12-08T13:37:06.395 part=${part}p1
2020-12-08T13:37:06.397 else
2020-12-08T13:37:06.399 part=${part}1
2020-12-08T13:37:06.401 fi
2020-12-08T13:37:06.403 mount $(readlink -f ${part}) /var/lib/ceph/osd/ceph-0
2020-12-08T13:37:06.406 ceph-osd --id 0 --mkfs --mkkey --mkjournal
2020-12-08T13:37:06.410 ceph auth add osd.0 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-0/keyring
2020-12-08T13:37:06.413 umount /var/lib/ceph/osd/ceph-0
2020-12-08T13:37:06.423 '
2020-12-08T13:47:06.263 Error: 2020-12-08 13:47:06 +0000 Command exceeded timeout
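
The script in the log above picks the first partition differently for NVMe disks (`p1` suffix) and other disks (`1` suffix). As logged, the test `[[ $part == nvme ]]` only matches the literal string `nvme`; assuming the glob wildcards were lost when the log was rendered, the intended logic might look like the sketch below. The function name `first_partition` is hypothetical and does not appear in the original script.

```shell
# Sketch only: derive the first partition's device name from a disk path.
# Assumption: any path containing "nvme" uses the "p<N>" partition suffix.
first_partition() {
    case "$1" in
        *nvme*) printf '%s\n' "${1}p1" ;;   # e.g. /dev/nvme1n1 -> /dev/nvme1n1p1
        *)      printf '%s\n' "${1}1"  ;;   # e.g. /dev/sdb     -> /dev/sdb1
    esac
}
```

A `case` pattern is used here instead of `[[ ... ]]` so that the sketch also runs under a plain POSIX shell.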

Severity
--------
Critical: System/Feature is not usable due to the defect.

Steps to Reproduce
------------------
Add the Ceph storage backend and then add an OSD to the system:

$ system storage-backend-add ceph --confirmed
$ watch -n 10 system storage-backend-list
$ system host-disk-list controller-0 | awk '/\/dev\/nvme1n1/{print $2}' | xargs -i system host-stor-add controller-0 {}
$ system host-disk-list controller-1 | awk '/\/dev\/nvme1n1/{print $2}' | xargs -i system host-stor-add controller-1 {}
$ watch "system host-stor-list controller-0; system host-stor-list controller-1"

Expected Behavior
-----------------
The stors (OSDs) become "configured".

Actual Behavior
---------------
The stors (OSDs) end up in "configuration-failed".

Reproducibility
---------------
Intermittent.

System Configuration
--------------------
AIO-DX

Branch/Pull Time/Commit
-----------------------
N/A

Last Pass
---------
N/A

Timestamp/Logs
--------------
Debugged on a live system; no corresponding collect logs are available.

Test Activity
-------------
Developer Testing

Workaround
----------
Manually execute the ceph-disk command outside of the puppet context to prepare the OSD, then lock and unlock the host(s).
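
As a rough sketch of the manual step, the ceph-disk invocation from the log can be wrapped in a small helper. The function name `prepare_osd` and the `DRY_RUN` switch are hypothetical additions for illustration; the ceph-disk options are copied from the logged command, with the trailing empty `$(readlink -f )` argument left out. The UUIDs and OSD id must come from the affected system.

```shell
# Sketch only: run (or, with DRY_RUN=1, just print) the ceph-disk prepare
# command seen in the puppet log.
# Arguments: $1 = disk device, $2 = cluster UUID, $3 = OSD UUID, $4 = OSD id
prepare_osd() {
    # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is set and non-empty.
    ${DRY_RUN:+echo} ceph-disk --verbose --log-stdout prepare --filestore \
        --cluster-uuid "$2" --osd-uuid "$3" --osd-id "$4" \
        --fs-type xfs --zap-disk "$1"
}
```

For example, `DRY_RUN=1 prepare_osd "$(readlink -f /dev/disk/by-path/pci-0000:84:00.0-nvme-1)" <cluster-uuid> <osd-uuid> 0` prints the command for review before it is run for real. The remaining steps of the logged script (mount, `ceph-osd --mkfs`, `ceph auth add`, umount) are not covered by this sketch.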

Bob Church (rchurch)
Changed in starlingx:
importance: Undecided → High
Bob Church (rchurch)
Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: added: stx.5.0