Ceph monitor provisioning on storage system unexpectedly allows adding a compute into the quorum (missing semantic check)

Bug #1840073 reported by Wendy Mitchell
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Daniel Badea

Bug Description

Brief Description
-----------------
Ceph monitor provisioning (on a storage system) unexpectedly allows adding a compute into the quorum
(a semantic check is needed).

Severity
--------
Standard

Steps to Reproduce
------------------
1. On a 2 controller + 2 storage + X compute system, lock and delete storage-0. This leaves two ceph monitors: controller-0 and controller-1.
2. Provision one of the computes as the new ceph monitor (instead of the remaining storage node)
by locking the node, running system ceph-mon-add <nodename>, and then unlocking the host.

3. Once the host unlocks, check ceph -s. You'll see the following (a scripted quorum check is sketched after the output):

$ ceph -s
  cluster:
    id: 364fbdf0-9747-4a40-9f8c-9e0cc107342d
    health: HEALTH_WARN
            1 osds down
            1 host (4 osds) down
            181/1272 objects misplaced (14.230%)
            Degraded data redundancy: 455/1272 objects degraded (35.770%), 55 pgs degraded, 428 pgs undersized
            1/4 mons down, quorum controller-0,controller-1,compute-1

  services:
    mon: 4 daemons, quorum controller-0,controller-1,compute-1, out of quorum: storage-0
    mgr: controller-0(active), standbys: controller-1
    osd: 8 osds: 4 up, 5 in; 172 remapped pgs

  data:
    pools: 5 pools, 600 pgs
    objects: 636 objects, 2.4 GiB
    usage: 3.4 GiB used, 2.3 TiB / 2.3 TiB avail
    pgs: 455/1272 objects degraded (35.770%)
             181/1272 objects misplaced (14.230%)
             373 active+undersized
             172 active+clean+remapped
             55 active+undersized+degraded

  io:
    client: 84 KiB/s rd, 482 KiB/s wr, 98 op/s rd, 90 op/s wr
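
The stray worker monitor can also be spotted programmatically. The following is a minimal Python sketch (not part of StarlingX tooling; it assumes the ceph CLI is reachable from the node it runs on) that flags any quorum member whose hostname is not a controller or storage node:

import json
import subprocess

# Monitors are expected only on controllers and storage nodes in a
# storage configuration.
ALLOWED_PREFIXES = ("controller-", "storage-")

# 'ceph quorum_status --format json' lists the current quorum members
# under the "quorum_names" key.
status = json.loads(
    subprocess.check_output(["ceph", "quorum_status", "--format", "json"]))

for mon in status["quorum_names"]:
    if not mon.startswith(ALLOWED_PREFIXES):
        print("unexpected monitor in quorum: %s" % mon)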

Expected Behavior
------------------
Did not expect the compute node to be allowed into the quorum in step 2.
In a storage configuration, only controller-0, controller-1 and a storage node should be allowed in the quorum (no other configuration is supported).
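
Conceptually, the missing semantic check boils down to rejecting a ceph-mon-add request for a worker whenever a storage node is provisioned. The sketch below only illustrates that rule; the host-record fields and the check_ceph_mon_add helper are assumptions, not the sysinv implementation:

def check_ceph_mon_add(target_host, all_hosts):
    # A provisioned storage node means the cluster uses the
    # storage-nodes model (controller-0/1 + a storage node as monitors).
    storage_provisioned = any(
        h["personality"] == "storage" and h["provisioned"] for h in all_hosts)
    if storage_provisioned and target_host["personality"] == "worker":
        raise ValueError("cannot add a ceph monitor to a worker host: "
                         "the deployment uses the storage-nodes model")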

Actual Behavior
----------------
In step 2, the addition of the worker node was not rejected.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Standard 2 controller + 2 storage + X computes

Branch/Pull Time/Commit
-----------------------
BUILD_ID="2019-08-12_20-59-00"

Last Pass
---------

Timestamp/Logs
--------------

Test Activity
-------------
bug retest

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / medium priority - this is an issue only if the user provides the wrong monitor during provisioning, so it can be avoided, but the system should still prevent this user error.

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.3.0 stx.config stx.storage
Changed in starlingx:
assignee: nobody → Daniel Badea (daniel.badea)
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/678595

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/678595
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=0b30e535121bf2c36da46e32913583ab7555011d
Submitter: Zuul
Branch: master

commit 0b30e535121bf2c36da46e32913583ab7555011d
Author: Daniel Badea <email address hidden>
Date: Mon Aug 26 15:18:30 2019 +0000

    ceph: enable storage model if storage nodes provisioned

    As the system is provisioned by adding new nodes, the storage
    model can change between controller-based and storage-based,
    depending on the type of nodes being added and how the ceph
    monitors are configured.

    In the case of a storage lab with (for example) 2 controllers,
    2 storage nodes and 2 workers configured:
    - ceph monitors are configured to run on: controller-0,
      controller-1 and storage-0
    - ceph OSDs are configured on: storage-0 and storage-1
    Once storage-0 is locked and deleted, looking only at the
    controller and worker nodes there is no difference between this
    setup and a 2 controller + 2 compute system that is being
    provisioned.

    To fix this issue, look for other provisioned storage nodes.
    If any is found, the storage model was already configured.

    Change-Id: Ie23721a24a8b643d4ac26256fe8532806f2672da
    Closes-bug: 1840073
    Signed-off-by: Daniel Badea <email address hidden>
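
In code terms, the approach described in the commit message amounts to inferring the deployment model from the remaining provisioned hosts. The sketch below only illustrates that idea; the function name, field names and return values are assumptions, not the code merged in the review above:

def infer_ceph_deployment_model(hosts):
    # Any provisioned storage node pins the model to storage-nodes,
    # even after storage-0 has been locked and deleted.
    for host in hosts:
        if host["personality"] == "storage" and host["provisioned"]:
            return "storage-nodes"
    # With no storage nodes, monitors run on the controllers (and
    # possibly one worker), i.e. the controller-based model.
    return "controller-nodes"

A ceph-mon-add request for a worker can then be rejected whenever this returns "storage-nodes", matching the semantic check sketched under "Expected Behavior" above.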

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Verified the following semantic checks on build 2019-09-22_20-00-00:

1. Attempt to add compute-0 as ceph-mon while storage-0 is still provisioned
$ system ceph-mon-add compute-0
Ceph monitor already configured for host 'storage-0'.

*** 2. Attempt to add compute-0 as ceph-mon after storage-0 is locked and deleted
$ system ceph-mon-add compute-0
Can not add a storage monitor to a worker if ceph's deployments model is already set to storage-nodes.

3. Attempt to add storage-1 as ceph-mon while storage-1 is still online and available
$ system ceph-mon-add storage-1
Host storage-1 must be locked and online.

4. Add storage-1 as ceph-mon after storage-1 is locked
$ system ceph-mon-add storage-1
+--------------+--------------------------------------+
| Property     | Value                                |
+--------------+--------------------------------------+
| uuid         | f803ccac-5d47-40ea-85ce-bbcbf2219cc4 |
| ceph_mon_gib | 20                                   |
| created_at   | 2019-09-23T20:54:17.590746+00:00     |
| updated_at   | None                                 |
| state        | configured                           |
| task         | None                                 |
+--------------+--------------------------------------+

tags: removed: stx.retestneeded