Hosts randomly 'losing' disks, breaking ceph-osd service enumeration
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Ubuntu Cloud Archive | Fix Released | High | James Page | | |
Queens | Fix Released | High | James Page | | |
Rocky | Fix Released | High | James Page | | |
Stein | Fix Released | High | James Page | | |
Train | Fix Released | High | James Page | | |
ceph (Ubuntu) | Fix Released | High | James Page | | |
Bionic | Fix Released | High | James Page | | |
Disco | Fix Released | High | James Page | | |
Eoan | Fix Released | High | James Page | | |
Bug Description
[Impact]
For deployments where the bluestore DB and WAL are placed on separate underlying devices, it's possible on reboot that the LVs configured on these devices have not yet been scanned and detected. The OSD boot process ignores this and tries to boot the OSD as soon as the primary LV supporting the OSD is detected, so the OSD crashes because the required block device symlinks are not present.
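To make the failure mode concrete, here is a minimal sketch of how one might check whether the block device symlinks an OSD needs actually resolve. The names `block`, `block.db` and `block.wal` and the helper `missing_links` are assumptions for illustration; this is not the code path ceph itself uses.

```python
import os

# Symlink names assumed for a bluestore OSD directory with separate DB and WAL.
EXPECTED_LINKS = ("block", "block.db", "block.wal")


def missing_links(osd_dir, names=EXPECTED_LINKS):
    """Return the names under osd_dir that do not resolve to an existing target."""
    # os.path.exists() follows symlinks, so a dangling symlink counts as missing.
    return [n for n in names if not os.path.exists(os.path.join(osd_dir, n))]
```

Calling `missing_links()` on an OSD data directory (for example one under `/var/lib/ceph/osd/`, an assumed location) right after boot would list whichever links are absent or dangling at that moment.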
[Test Case]
Deploy ceph with bluestore + separate DB and WAL devices.
Reboot servers.
OSD devices will fail to start after reboot (it's a race, so not always).
[Regression Potential]
Low - the fix has landed upstream and simply ensures that if a separate LV is expected for the DB and WAL devices for an OSD, the OSD will not try to boot until they are present.
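The idea behind the fix (do not start the OSD until the expected devices are present) can be illustrated with a simple polling loop. This is only a sketch under assumptions: the function name `wait_for_paths`, the timeout and the polling interval are hypothetical and do not reflect the actual upstream patch.

```python
import os
import time


def wait_for_paths(paths, timeout=30.0, interval=0.5):
    """Poll until every path exists; return True on success, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if all(os.path.exists(p) for p in paths):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would pass the device paths of the DB and WAL LVs it expects and only proceed with OSD start-up when the function returns `True`.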
[Original Bug Report]
Ubuntu 18.04.2 Ceph deployment.
Ceph OSD devices utilizing LVM volumes pointing to udev-based physical devices.
The LVM module is supposed to create PVs from devices using the links in the /dev/disk/by-dname/ folder, which are created by udev.
However, on reboot it sometimes happens (not always, so it looks like a race condition) that Ceph services cannot start and pvdisplay doesn't show any volumes. The folder /dev/disk/by-dname/, however, has all the necessary devices created by the end of the boot process.
The behaviour can be worked around manually by running the "#/sbin/lvm pvscan --cache --activate ay /dev/nvme0n1" command to re-activate the LVM components, after which the services can be started.
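For hosts with several affected devices, the manual workaround could be scripted. The sketch below only wraps the command quoted in the report; the helper names `pvscan_command` and `reactivate` are hypothetical, and the command needs root privileges to do anything useful.

```python
import subprocess


def pvscan_command(device, lvm="/sbin/lvm"):
    """Build the pvscan invocation quoted in the report for one device."""
    return [lvm, "pvscan", "--cache", "--activate", "ay", device]


def reactivate(device, run=subprocess.run):
    """Run the pvscan workaround for one device and return its exit code."""
    # check=False: report the exit code to the caller instead of raising.
    result = run(pvscan_command(device), check=False)
    return result.returncode
```

For example, `reactivate("/dev/nvme0n1")` would run the same command as in the report; the Ceph services still have to be started afterwards.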
tags: | added: canonical-bootstack |
affects: | systemd (Ubuntu) → ceph (Ubuntu) |
Changed in ceph (Ubuntu): | |
importance: | Critical → High |
status: | Triaged → In Progress |
Changed in ceph (Ubuntu Bionic): | |
status: | New → In Progress |
Changed in ceph (Ubuntu Disco): | |
status: | New → In Progress |
assignee: | nobody → James Page (james-page) |
Changed in ceph (Ubuntu Bionic): | |
assignee: | nobody → James Page (james-page) |
importance: | Undecided → High |
Changed in ceph (Ubuntu Disco): | |
importance: | Undecided → High |
description: | updated |
no longer affects: | cloud-archive/pike |
Changed in cloud-archive: | |
status: | Fix Committed → Fix Released |
Status changed to 'Confirmed' because the bug affects multiple users.