OSCI/ServerStack: non-pristine devices cause random testing failures

Bug #1840836 reported by Alex Kavanagh
This bug affects 2 people
Affects                     Status        Importance  Assigned to   Milestone
Ceph OSD Charm              Invalid       High        Ryan Beisner
OpenStack Charm Test Infra  Fix Released  High        Unassigned

Bug Description

When running on ServerStack/OSCI, ceph-osd can fail with a 'non-pristine device' error. This is due to a complex interplay of the technologies that deliver a block device to the ceph-osd unit as part of Juju storage. Essentially, the issue is:

1. The bundle specifies a block device to be provided to ceph-osd.
2. Juju (storage) requests that the OpenStack provider (essentially Cinder) attach a block device to the ceph-osd unit as part of provisioning.
3. Cinder responds with a block device name, which is provided to the ceph-osd charm.
4. Intermittently, the ceph-osd unit boots and gets its block device, BUT under a different name from the one provided to the charm (via config).
5. Ceph then fails because the block device name it has for the OSD doesn't match the block device on the unit, most likely due to the random order in which block devices are assigned when the unit boots.
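
To illustrate what a 'non-pristine device' check might look like, here is a minimal Python sketch. It assumes a device counts as pristine when its leading bytes are all zero. That criterion, the function name `looks_pristine`, and the 2048-byte window are illustrative assumptions, not the charm's confirmed implementation. If the name handed to the charm resolves to a different disk on the unit (for example a swap device), a check of this kind would report it as non-pristine.

```python
def looks_pristine(path, nbytes=2048):
    """Return True if the first `nbytes` bytes at `path` are all zero.

    Assumption for illustration only: "pristine" means a zeroed header.
    """
    with open(path, "rb") as dev:
        head = dev.read(nbytes)
    # A short read (device smaller than the window) is treated as not pristine.
    return len(head) == nbytes and head == bytes(nbytes)
```

On a unit this would be called as `looks_pristine("/dev/vdb")`, where `/dev/vdb` is only an example path and read access to the device is required.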

This bug is a clearing house for solutions/tests to solve this problem, which may include:

1. Disabling swap on the ceph-osd unit.
2. Using a loop device on the unit instead, set up via cloud-init, for consistent naming.

Tags: uosci
Ryan Beisner (1chb1n) wrote :
tags: added: uosci
Chris MacNaughton (chris.macnaughton) wrote :

I suggest that something like https://review.opendev.org/#/c/668489/ would likely resolve this issue.

Ryan Beisner (1chb1n) wrote :
Ryan Beisner (1chb1n) wrote :

For clarity, the flavor change in https://review.opendev.org/#/c/668489/ caused the selected instance type to be one without a swap device, which was theorized to be contributing to the issue with device naming.

Rather than push constraints everywhere, I adjusted the flavor in the CI so that the default instance type no longer has a swap device, but has 2GB (up from 1536M).

Let's watch and see if these disappear on the whole.

Changed in charm-ceph-osd:
status: New → Incomplete
assignee: nobody → Ryan Beisner (1chb1n)
importance: Undecided → High
Changed in charm-test-infra:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Ryan Beisner (1chb1n)
Ryan Beisner (1chb1n)
Changed in charm-ceph-osd:
status: Incomplete → Invalid
Ryan Beisner (1chb1n) wrote :

This recent, unrelated change was failing on the non-pristine issue in this bug:
 https://review.opendev.org/#/c/677874/

I've retriggered testing after the flavor modifications; need to confirm when that is done.

Ryan Beisner (1chb1n) wrote :
Changed in charm-test-infra:
status: In Progress → Confirmed
assignee: Ryan Beisner (1chb1n) → nobody
Ryan Beisner (1chb1n) wrote :

On further inspection, I had only updated 1 flavor to remove the swap disk. I've now updated all of the flavors and confirmed that none have swap devices defined. Also digging to confirm which flavors actually get used in these tests.
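
A check like "no flavor defines a swap device" could be scripted roughly as follows. This is a sketch under assumptions: it takes flavor records as plain dicts with `name` and `swap` keys (as might be exported from the cloud's flavor listing), treats an empty or missing `swap` value as no swap, and uses made-up flavor names in the sample data. It is not the procedure actually used in the CI.

```python
def flavors_with_swap(flavors):
    """Return the names of flavors whose swap size is greater than zero.

    Each flavor is a dict with a "name" and an optional "swap" value (MB).
    An empty string, None, or a missing key is treated as no swap.
    """
    return [f["name"] for f in flavors if int(f.get("swap") or 0) > 0]


if __name__ == "__main__":
    # Hypothetical sample records, not the real CI flavors.
    sample = [
        {"name": "example.noswap", "swap": ""},
        {"name": "example.swap", "swap": 1024},
    ]
    print(flavors_with_swap(sample))
```

An empty result would correspond to the state described above, where none of the flavors have a swap device defined.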

Ryan Beisner (1chb1n) wrote :

On a ceph-proxy func test run, confirmed that all instances are m1.small (no swap).

Camille Rodriguez (camille.rodriguez) wrote :
Camille Rodriguez (camille.rodriguez) wrote :
Ryan Beisner (1chb1n) wrote :

I believe that the occurrences in comments #9 and #10 are actually a separate bug, limited to Disco and Eoan: https://bugs.launchpad.net/charm-nova-compute/+bug/1842751

Further, I have yet to see the SAME type of occurrence as this original bug after we adjusted the flavors.

Changed in charm-test-infra:
status: Confirmed → Fix Committed
David Ames (thedac)
Changed in charm-test-infra:
milestone: none → 19.10
David Ames (thedac)
Changed in charm-test-infra:
status: Fix Committed → Fix Released