Ceph OSD Containers Fail when Device Names Change
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
kolla-ansible | Won't Fix | Undecided | Unassigned |
Bug Description
When a system reboots, the discovery order of disks can change for various reasons. If this device name change happens to a Ceph OSD disk with a co-resident journal, the Ceph OSD container will no longer start. The data partition is referenced by its UUID, so it remains tied to the correct disk partition. The journal on the same disk, however, is referenced by its traditional device name, such as /dev/sdj2, which is unreliable across boots. This mismatch between the data partition and its journal sends the container into a restart loop.
The fix for this issue is to configure the journal using the UUID of its partition, as found in /dev/disk/
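To illustrate why a UUID-based name survives a rename, here is a minimal Python sketch that lists the stable symlinks in one directory and resolves each to the kernel device name it currently points at. The function name `map_stable_names` is hypothetical, and which subdirectory of /dev/disk/ to pass in (for example the udev-managed /dev/disk/by-partuuid) is an assumption, not something this report specifies.

```python
import os


def map_stable_names(by_id_dir):
    """Return {symlink name: resolved device path} for each symlink in by_id_dir.

    by_id_dir is assumed to be a udev-managed directory such as
    /dev/disk/by-partuuid (an assumption; the report does not name it).
    """
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        # Only symlinks are of interest; skip any regular files.
        if os.path.islink(link):
            # realpath follows the link to the current kernel name,
            # e.g. /dev/sdj2 on one boot and /dev/sdk2 on the next.
            mapping[name] = os.path.realpath(link)
    return mapping
```

Calling `map_stable_names("/dev/disk/by-partuuid")` before and after a reboot should show the same keys even when the resolved device paths differ, which is the property the journal reference lacks when it uses a name like /dev/sdj2.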
A suggested way to reproduce this issue is to attach two or more qcow disks to a VM for use as Ceph OSDs. Label the disks for co-resident journals and data. After the kolla deploy completes, note the UUIDs of the Ceph devices using the blkid command. Shut down the VM and swap the names of the qcow files so that the data changes device names. Restart the VM and verify that the device names of the Ceph disks have changed. Then start the Ceph OSD container to see the error.
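The "note the UUIDs, then verify the device names changed" part of the reproduction can be checked mechanically. The sketch below is a hedged illustration: it assumes the default `blkid` output format (`/dev/sdX1: KEY="value" ...`) was saved to text before and after the swap, and the helper names `parse_blkid` and `renamed_devices` are hypothetical.

```python
import re


def parse_blkid(output, key="PARTUUID"):
    """Parse default-format blkid output into {key value: device path}.

    Assumes lines shaped like: /dev/sdj1: UUID="..." PARTUUID="..."
    """
    mapping = {}
    # \b keeps key="UUID" from matching inside PARTUUID or PTUUID.
    pattern = re.compile(r'\b%s="([^"]*)"' % re.escape(key))
    for line in output.splitlines():
        device, sep, tags = line.partition(": ")
        if not sep:
            continue  # not a "device: tags" line
        match = pattern.search(tags)
        if match:
            mapping[match.group(1)] = device
    return mapping


def renamed_devices(before, after):
    """Return {identifier: (old device, new device)} for identifiers that moved."""
    return {
        ident: (before[ident], after[ident])
        for ident in before
        if ident in after and before[ident] != after[ident]
    }
```

Feeding the two saved outputs through `parse_blkid` and then `renamed_devices` lists every partition whose device name changed across the restart. A non-empty result for the Ceph disks confirms the precondition for the failure before starting the OSD container.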
Changed in kolla-ansible: | |
status: | New → Confirmed |
status: | Confirmed → In Progress |
assignee: | nobody → Michal Nasiadka (mnasiadka) |
Changed in kolla-ansible: | |
assignee: | Michal Nasiadka (mnasiadka) → nobody |
Changed in kolla-ansible: | |
status: | In Progress → Confirmed |
Changed in kolla-ansible: | |
status: | Confirmed → Won't Fix |