OSDs all move from Unit Is Ready to Non-pristine devices detected after a few minutes

Bug #1781453 reported by Chris Procter
This bug affects 4 people
Affects: Ceph OSD Charm
Status: Fix Released
Importance: High
Assigned to: Dmitrii Shcherbakov

Bug Description

We have a Ceph cluster containing 15 nodes, each with 10 NVMe devices, using Vault and encryption at rest.

The charm deploys correctly, and when we unlock Vault the units show as formatting and tuning the storage, eventually reaching active with the status "Unit is Ready (10 OSDs)". Then, over the course of about 10 minutes, they all switch to blocked with a status of "Non-pristine devices detected, consult `list-disks`, `zap-disk` and `blacklist-*` actions."

Running the list-disks action on any of the units gives:
results:
  blacklist: '[]'
  disks: '[''/dev/nvme4n1'', ''/dev/nvme8n1'', ''/dev/nvme6n1'', ''/dev/nvme5n1'',
    ''/dev/nvme9n1'', ''/dev/nvme10n1'', ''/dev/nvme3n1'', ''/dev/nvme7n1'', ''/dev/nvme2n1'',
    ''/dev/nvme0n1'', ''/dev/nvme1n1'']'
  non-pristine: '[''/dev/nvme4n1'', ''/dev/nvme8n1'', ''/dev/nvme6n1'', ''/dev/nvme5n1'',
    ''/dev/nvme9n1'', ''/dev/nvme10n1'', ''/dev/nvme3n1'', ''/dev/nvme7n1'', ''/dev/nvme2n1'',
    ''/dev/nvme0n1'', ''/dev/nvme1n1'']'

There are no Python tracebacks in the logs.

Tags: cpe-onsite
Ryan Beisner (1chb1n) wrote :

Please add juju status and the sanitized bundle used to deploy this.

Also, a juju crashdump is ideal for analysis, though it may contain private information.

Changed in charm-ceph-osd:
status: New → Incomplete
Alexander Litvinov (alitvinov) wrote :
Chris Procter (chrisp262) wrote :

Ceph seems to think it's all OK:

ubuntu@juju-0a39cf-14-lxd-2:~$ sudo ceph -k /etc/ceph/ceph.client.admin.keyring -s
  cluster:
    id: 615f9d3e-8652-11e8-a479-00163ef72832
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum juju-0a39cf-14-lxd-2,juju-0a39cf-16-lxd-1,juju-0a39cf-15-lxd-1
    mgr: juju-0a39cf-16-lxd-1(active), standbys: juju-0a39cf-15-lxd-1, juju-0a39cf-14-lxd-2
    osd: 117 osds: 117 up, 117 in

  data:
    pools: 2 pools, 2560 pgs
    objects: 0 objects, 0 bytes
    usage: 235 GB used, 422 TB / 422 TB avail
    pgs: 2560 active+clean

and systemctl shows all the OSD units as running and active.

Changed in charm-ceph-osd:
status: Incomplete → New
Alexander Litvinov (alitvinov) wrote :
tags: added: cpe-onsite
Alexander Litvinov (alitvinov) wrote :
Chris Procter (chrisp262) wrote :

What appears to be happening is that the relation with one of the mons changes. This fires the mon-relation-changed hook, calling the mon_relation() function in hooks/ceph_hooks.py.

mon_relation() calls prepare_disks_and_activate(), which tries to format any legitimate devices that it finds. If they are already formatted by something, it sensibly refuses to format them.

Unfortunately, if the OSDs are already running when the mon relation changes, the charm detects the OSD disks as non-pristine and sets the status to "blocked" with "Non-pristine devices detected" (see line 435 in hooks/ceph_hooks.py).

The list of "legitimate" devices is gathered from the osd-devices config option and then filtered by:
# filter osd-devices that are file system paths
# filter osd-devices that does not exist on this unit
# filter osd-devices that are already mounted
# filter osd-devices that are active bluestore devices

So there is no filter for "this device is already being used by me".

The only check that comes close is "already mounted", but that uses the MOUNTPOINT attribute from lsblk, which is set to "" for all our disks, so that filter has no effect.
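
Here is a rough paraphrase of the filtering described above (not the charm's exact code; the helper names are mine), showing why a MOUNTPOINT-based filter lets bluestore OSD devices straight through to the pristine check:

    import os
    import subprocess

    def is_mounted(dev):
        # lsblk -no MOUNTPOINT prints nothing for bluestore OSD block devices,
        # because they are never mounted as filesystems.
        out = subprocess.check_output(
            ['lsblk', '-no', 'MOUNTPOINT', dev]).decode().strip()
        return bool(out)

    def candidate_devices(configured):
        devices = [d for d in configured if d.startswith('/dev/')]  # skip file system paths
        devices = [d for d in devices if os.path.exists(d)]         # skip devices not on this unit
        devices = [d for d in devices if not is_mounted(d)]         # "already mounted" filter
        # nothing here asks "is this device already one of my OSDs"
        return devices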

James Page (james-page) wrote :

I think 'active bluestore device' would match for this deployment. In any case, if a disk has been successfully processed once, the charm records that and won't touch it again; it's the first pre-flight check.

Chris Procter (chrisp262) wrote :

    devices = [dev for dev in devices
               if not ceph.is_active_bluestore_device(dev)]
    log('Checking for pristine devices: "{}"'.format(devices), level=DEBUG)
    if not all(ceph.is_pristine_disk(dev) for dev in devices):
        status_set('blocked',
                   'Non-pristine devices detected, consult '
                   '`list-disks`, `zap-disk` and `blacklist-*` actions.')
        return

If the active-bluestore check caught it, the device would get filtered out by the ceph.is_active_bluestore_device(dev) call, but a look in our log suggests that this is great at filtering out the WAL device (nvme10n1) but not the OSD devices:

root@ceph-mon-4:/var/log/juju# grep "Checking for pristine devices:" unit-nvme-ceph-osd-4.log
2018-07-13 11:55:09 DEBUG juju-log secrets-storage:224: Checking for pristine devices: "['/dev/nvme0n1', '/dev/nvme1n1', '/dev/nvme2n1', '/dev/nvme3n1', '/dev/nvme4n1', '/dev/nvme5n1', '/dev/nvme6n1', '/dev/nvme7n1', '/dev/nvme8n1']"
2018-07-13 12:06:55 DEBUG juju-log mon:46: Checking for pristine devices: "['/dev/nvme0n1', '/dev/nvme1n1', '/dev/nvme2n1', '/dev/nvme3n1', '/dev/nvme4n1', '/dev/nvme5n1', '/dev/nvme6n1', '/dev/nvme7n1', '/dev/nvme8n1']"
2018-07-13 12:45:32 DEBUG juju-log mon:46: Checking for pristine devices: "['/dev/nvme0n1', '/dev/nvme1n1', '/dev/nvme2n1', '/dev/nvme3n1', '/dev/nvme4n1', '/dev/nvme5n1', '/dev/nvme6n1', '/dev/nvme7n1', '/dev/nvme8n1']"
2018-07-13 13:01:24 DEBUG juju-log mon:46: Checking for pristine devices: "['/dev/nvme0n1', '/dev/nvme1n1', '/dev/nvme2n1', '/dev/nvme3n1', '/dev/nvme4n1', '/dev/nvme5n1', '/dev/nvme6n1', '/dev/nvme7n1', '/dev/nvme8n1']"

If the charm is skipping disks it has seen before, I can't see the mechanism for that.

The bright side is that it returns before the formatting code, so it's just that Juju has its state set incorrectly.

Chris Procter (chrisp262) wrote :

BTW, yes, this is bluestore. I think the "already mounted" check would work perfectly for filestore.

Dmitrii Shcherbakov (dmitriis) wrote :

I think what happens is:

1) get_devices https://github.com/openstack/charm-ceph-osd/blob/stable/18.05/hooks/ceph_hooks.py#L493-L512

gets called by prepare_disks_and_activate https://github.com/openstack/charm-ceph-osd/blob/stable/18.05/hooks/ceph_hooks.py#L422-L423

2) the prepare_disks_and_activate code falls through to is_active_bluestore_device

https://github.com/openstack/charm-ceph-osd/blob/stable/18.05/lib/ceph/utils.py#L1643-L1664

3) as get_devices returns a list of /dev/nvme* entries, is_active_bluestore_device immediately evaluates to False, because the underlying device is not an LVM physical volume (its decrypted device-mapper device is):

    if not lvm.is_lvm_physical_volume(dev):
        return False

So the check should take device encryption into account: if a device has a LUKS header and it is decrypted, check whether the device-mapper-created block device is an LVM physical volume.
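
A minimal sketch of that kind of LUKS-aware check (my own helper names, not the merged fix): resolve the raw device to its dm-crypt mapping, if one exists, and run the existing LVM test against the mapping instead:

    import subprocess

    from charmhelpers.contrib.storage.linux import lvm

    def luks_mapping_for(dev):
        # List child devices; a child of TYPE 'crypt' is the decrypted
        # device-mapper block device created by dm-crypt/LUKS.
        out = subprocess.check_output(
            ['lsblk', '-lno', 'NAME,TYPE', dev]).decode()
        for line in out.splitlines():
            fields = line.split()
            if len(fields) == 2 and fields[1] == 'crypt':
                return '/dev/mapper/{}'.format(fields[0])
        return None

    def is_active_bluestore_device(dev):
        # Inspect the decrypted mapping when there is one, otherwise the raw device.
        target = luks_mapping_for(dev) or dev
        if not lvm.is_lvm_physical_volume(target):
            return False
        # ... the rest of the existing volume-group checks, applied to `target` ...
        return True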

Dmitrii Shcherbakov (dmitriis) wrote :

The check for already-processed devices mentioned by James is there, but it is only done further down the call chain from prepare_disks_and_activate; in particular:

code path 1: prepare_disks_and_activate -> *non-pristine device filtering logic* -> *if not all devices are pristine, set the non-pristine-devices-detected status*

code path 2: prepare_disks_and_activate -> *non-pristine device filtering logic* -> *all filtered devices are considered to be pristine* -> ceph.is_bootstrapped() -> utils.py:osdize -> utils.py:osdize_dev does the local unitdb-based check (https://github.com/openstack/charms.ceph/blob/4d8f31d/ceph/utils.py#L902-L906)

So it seems that there are several things that need to be done:

1) prepare_disks_and_activate needs to account for already processed osd-devices in unitdata (see the sketch after this list);
2) prepare_disks_and_activate needs to consider non-processed osd-devices with LUKS header as non-pristine (regardless of whether a mapping is present or not);
3) osdize_dev should ignore active LUKS devices just in case;
4) zap action code should ignore requests for mapped LUKS devices;
5) list_disks should take osd-devices present in unitdata into account and also consider unprocessed devices with LUKS header as non-pristine.
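
As an illustration of (1), here is a minimal sketch of how prepare_disks_and_activate could skip devices already recorded by this unit, modelled on the osdize_dev check linked above (the helper name is mine):

    from charmhelpers.core.unitdata import kv

    def filter_processed(devices):
        # Devices the unit has already turned into OSDs are recorded in the
        # local key/value store; drop them before the pristine check runs.
        already_processed = kv().get('osd-devices', [])
        return [dev for dev in devices if dev not in already_processed]

    # In prepare_disks_and_activate the pristine check would then only look at:
    #     devices = filter_processed(get_devices())
    #     if not all(ceph.is_pristine_disk(dev) for dev in devices):
    #         ...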

NOTE: With ceph-volume the charm relies on unified prepare+activate functionality of 'ceph-volume lvm create' (only that command is used in osdize_dev and ceph.start_osds is extraneous for LVM-based setups):
http://docs.ceph.com/docs/luminous/ceph-volume/lvm/create/#ceph-volume-lvm-create
"This subcommand wraps the two-step process to provision a new osd (calling prepare first and then activate) into a single one."

Dmitrii Shcherbakov (dmitriis) wrote :
Changed in charm-ceph-osd:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Dmitrii Shcherbakov (dmitriis)
milestone: none → 18.08
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-osd (stable/18.05)

Fix proposed to branch: stable/18.05
Review: https://review.openstack.org/583256

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-osd (master)

Reviewed: https://review.openstack.org/583207
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-osd/commit/?id=e340cc851c0d409ae20c59bcf9e2199e1e0a277f
Submitter: Zuul
Branch: master

commit e340cc851c0d409ae20c59bcf9e2199e1e0a277f
Author: Dmitrii Shcherbakov <email address hidden>
Date: Sat Jul 14 22:48:20 2018 +0300

    ignore devices that have already been processed

    Similar to how osdize in charms.ceph checks for already processed
    devices we need to avoid checking if they are pristine or not.

    Additionally, mapped LUKS devices need to be filtered from being zapped
    as they may hold valuable data. They are only used as underlying devices
    for device mapper and dmcrypt to provide a decrypted block device
    abstration so if they really need to be zapped a mapping needs to be
    removed first.

    This change also pulls charms.ceph modifications.

    Change-Id: I96b3d40b3f9e56681be142377e454b15f9e22be3
    Co-Authored-By: Dmitrii Shcherbakov <email address hidden>
    Co-Authored-By: Chris Procter <email address hidden>
    Closes-Bug: 1781453

Changed in charm-ceph-osd:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-osd (stable/18.05)

Reviewed: https://review.openstack.org/583256
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-osd/commit/?id=98eec96859d8c913fa1a1706f489aaf5dc853583
Submitter: Zuul
Branch: stable/18.05

commit 98eec96859d8c913fa1a1706f489aaf5dc853583
Author: Dmitrii Shcherbakov <email address hidden>
Date: Sat Jul 14 22:48:20 2018 +0300

    ignore devices that have already been processed

    Similar to how osdize in charms.ceph checks for already processed
    devices we need to avoid checking if they are pristine or not.

    Additionally, mapped LUKS devices need to be filtered from being zapped
    as they may hold valuable data. They are only used as underlying devices
    for device mapper and dmcrypt to provide a decrypted block device
    abstration so if they really need to be zapped a mapping needs to be
    removed first.

    This change also pulls charms.ceph modifications.

    Change-Id: I96b3d40b3f9e56681be142377e454b15f9e22be3
    Co-Authored-By: Dmitrii Shcherbakov <email address hidden>
    Co-Authored-By: Chris Procter <email address hidden>
    Closes-Bug: 1781453
    (cherry picked from commit 8dad311db43e3c57c3569cc8e062dd8846470647)

James Page (james-page)
Changed in charm-ceph-osd:
status: Fix Committed → Fix Released