Ceph OSD Containers Fail when Device Names Change

Bug #1701148 reported by James McEvoy
This bug affects 3 people
Affects: kolla-ansible
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

When a system reboots, the discovery order of disks can change for various reasons. If this device name change happens to a ceph osd disk with a co-resident journal, the ceph osd container will no longer start. This happens because the data partition of the disk correctly uses the uuid name of the disk, so it remains tied to the correct disk partition. The problem is that the journal on the same disk uses the traditional device name, such as /dev/sdj2, which is unreliable across boots. This mismatch between the data and its journal puts the container into a restart loop.

The fix for this issue is to configure the uuid of the partition found in /dev/disk/by-partuuid/ to link the journal and its data together across reboots.
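
As a rough illustration of that idea (not the kolla-ansible implementation), the sketch below looks up the /dev/disk/by-partuuid/ symlink that resolves to a given journal device such as /dev/sdj2. The function name stable_partuuid_path and its by_partuuid_dir parameter are hypothetical, introduced only for this example.

```python
import os


def stable_partuuid_path(device, by_partuuid_dir="/dev/disk/by-partuuid"):
    """Return the by-partuuid symlink that resolves to `device`, or None.

    Hypothetical helper: `device` is a kernel name such as /dev/sdj2,
    which may change across boots; the returned symlink path does not.
    """
    target = os.path.realpath(device)
    for name in sorted(os.listdir(by_partuuid_dir)):
        link = os.path.join(by_partuuid_dir, name)
        # Each entry is a symlink to the current kernel device name.
        if os.path.realpath(link) == target:
            return link
    return None
```

Called as stable_partuuid_path("/dev/sdj2") on a running host, this would return the persistent path to record for the journal instead of the kernel name, assuming udev has populated /dev/disk/by-partuuid/.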

A suggested way to reproduce this issue is to attach two or more qcow disks to a VM for use as ceph OSDs. Label the disks for co-resident journals and data. Then, after the kolla deploy completes, make a note of the uuid of the ceph devices using the command blkid. Shut down the VM and swap the names of the qcow files so the data changes device names. Restart the VM and verify that the device names of the ceph disks have changed. Next, start the ceph osd container to see the error.
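
To make the "verify that the device names changed" step less error-prone, the blkid output saved before and after the swap can be compared by PARTUUID. The sketch below assumes blkid's default `device: KEY="value" ...` output format; the helper names parse_blkid and renamed_partitions are hypothetical, not part of any tool.

```python
import re

# Matches KEY="value" tags as printed by blkid in its default format.
_PAIR = re.compile(r'(\w+)="([^"]*)"')


def parse_blkid(text):
    """Map PARTUUID -> device name from saved `blkid` output."""
    mapping = {}
    for line in text.splitlines():
        device, sep, rest = line.partition(": ")
        if not sep:
            continue  # skip blank or unexpected lines
        tags = dict(_PAIR.findall(rest))
        if "PARTUUID" in tags:
            mapping[tags["PARTUUID"]] = device
    return mapping


def renamed_partitions(before, after):
    """Return {PARTUUID: (old_device, new_device)} for renamed partitions."""
    old, new = parse_blkid(before), parse_blkid(after)
    return {
        uuid: (old[uuid], new[uuid])
        for uuid in old
        if uuid in new and old[uuid] != new[uuid]
    }
```

Feeding it the two saved outputs lists every partition whose kernel name moved. A non-empty result for the ceph disks confirms the precondition for the failure before starting the container.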

Changed in kolla-ansible:
status: New → Confirmed
status: Confirmed → In Progress
assignee: nobody → Michal Nasiadka (mnasiadka)
Changed in kolla-ansible:
assignee: Michal Nasiadka (mnasiadka) → nobody
Changed in kolla-ansible:
status: In Progress → Confirmed
Mark Goddard (mgoddard)
Changed in kolla-ansible:
status: Confirmed → Won't Fix