ceph_mon stuck in configuring state

Bug #1831064 reported by Brent Rowsell
This bug affects 1 person
Affects      Status         Importance   Assigned to    Milestone
StarlingX    Fix Released   Medium       Tingjie Chen

Bug Description

Brief Description
-----------------
I was deploying a standard system and configured the third Ceph monitor on a worker node (worker-1). The monitor never reached the configured state and stayed stuck in the configuring state, even though the monitor itself did get configured and joined the quorum.

system ceph-mon-list
+--------------------------------------+--------------+--------------+-------------+-----------------------------------------------------------------+
| uuid                                 | ceph_mon_gib | hostname     | state       | task                                                            |
+--------------------------------------+--------------+--------------+-------------+-----------------------------------------------------------------+
| 2414dc0a-f355-4e85-80cb-20c19882d503 | 20           | worker-1     | configuring | {u'controller-1': 'configuring', u'controller-0': 'configured'} |
| 4a5b2c45-dcd8-4739-b6be-421e1275fbff | 20           | controller-1 | configured  | None                                                            |
| b3238d16-29f2-4be6-9d99-d67e9417d6c3 | 20           | controller-0 | configured  | None                                                            |
+--------------------------------------+--------------+--------------+-------------+-----------------------------------------------------------------+
[wrsroot@controller-0 ~(keystone_admin)]$
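
The task column for worker-1 above is a Python dict printed as a string, with one entry per controller. As a minimal sketch (the helper name `pending_controllers` is hypothetical and not part of sysinv), the following parses that string and lists the controllers that have not yet reported 'configured'. It assumes the field always has the dict-repr form shown in the table, or the literal text None.

```python
import ast


def pending_controllers(task_field):
    """Return sorted controller names whose per-host task is not 'configured'.

    Hypothetical helper: assumes task_field is the text of the 'task' column,
    i.e. either "None" or a Python dict literal such as
    "{u'controller-1': 'configuring', u'controller-0': 'configured'}".
    """
    if not task_field or task_field.strip() == "None":
        return []
    # literal_eval safely parses the dict literal; the u'' prefix is accepted.
    tasks = ast.literal_eval(task_field)
    return sorted(host for host, state in tasks.items() if state != "configured")


task = "{u'controller-1': 'configuring', u'controller-0': 'configured'}"
print(pending_controllers(task))  # ['controller-1']
```

Applied to the row above, this points at controller-1 as the host that has not yet reported back, which may help narrow down where to look in the logs.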

ceph -s
  cluster:
    id: 2008f13a-da9c-4bd3-80d7-93125f149d6f
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum controller-0,controller-1,worker-1
    mgr: controller-0(active), standbys: controller-1
    osd: 2 osds: 2 up, 2 in
    rgw: 1 daemon active

  data:
    pools: 5 pools, 320 pgs
    objects: 2.62 k objects, 4.6 GiB
    usage: 9.6 GiB used, 919 GiB / 929 GiB avail
    pgs: 320 active+clean

  io:
    client: 2.4 MiB/s wr, 0 op/s rd, 228 op/s wr
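
The ceph -s output above shows worker-1 in the monitor quorum even though sysinv still reports it as configuring. A rough way to check for that mismatch is sketched below. The function name `quorum_members` is hypothetical, and the regular expression assumes the plain-text "mon:" line format shown in this report, which may differ between Ceph releases.

```python
import re


def quorum_members(status_text):
    """Extract monitor names from the 'mon:' line of plain-text `ceph -s` output.

    Hypothetical helper: assumes a line of the form
    'mon: 3 daemons, quorum controller-0,controller-1,worker-1'.
    """
    match = re.search(r"^\s*mon:\s*\d+ daemons?, quorum (\S+)",
                      status_text, re.MULTILINE)
    return match.group(1).split(",") if match else []


status = """\
  services:
    mon: 3 daemons, quorum controller-0,controller-1,worker-1
    mgr: controller-0(active), standbys: controller-1
"""
members = quorum_members(status)
print("worker-1" in members)  # True
```

A host that appears in the quorum while its ceph-mon-list state is still configuring matches the symptom described in this report.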

Severity
--------
Major

Steps to Reproduce
------------------
system ceph-mon-add <worker node>

Expected Behavior
------------------
Ceph monitor goes to configured state

Actual Behavior
----------------
Ceph monitor stuck in configuring state

Reproducibility
---------------
Not sure

System Configuration
--------------------
Standard

Branch/Pull Time/Commit
-----------------------
"2019-05-29 17:08:35 -0400"

Last Pass
---------
May 25th

Timestamp/Logs
--------------
2019-05-30 01:14:18.582 107959 INFO sysinv.api.controllers.v1.ceph_mon [-] Creating ceph-mon DB entry for host uuid 4e9348ec-df38-422e-879f-43aa3168e029: {'forihostid': 3, 'device_path': None, 'uuid': '2414dc0a-f355-4e85-80cb-20c19882d503', 'ceph_mon_gib': 20, 'created_at': None, 'hostname': None, 'updated_at': None, 'ihost_uuid': u'4e9348ec-df38-422e-879f-43aa3168e029', 'state': 'configuring', 'task': "{u'controller-1': 'configuring', u'controller-0': 'configuring'}", 'ceph_mon_dev': None, 'id': None}

Logs attached

Test Activity
-------------
Other

Revision history for this message
Brent Rowsell (brent-rowsell) wrote :

Added logs

Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.2.0
tags: added: stx.config
Ghada Khalil (gkhalil)
tags: added: stx.storage
Tingjie Chen (silverhandy) wrote :

I have deployed many times and hit this only once (stuck in configuring, although it appeared to be on a controller rather than a worker). The reproduction rate seems low, so it may be worth validating against the latest code.

Changed in starlingx:
assignee: nobody → Tingjie Chen (silverhandy)
Tingjie Chen (silverhandy) wrote :

I tried to reproduce the issue with the latest release ISO: http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190609T233000Z

However, I could not reproduce the stuck ceph-mon state, so I suggest we keep monitoring this issue and not gate the stx.2.0 release on it.

[wrsroot@controller-1 ~(keystone_admin)]$ system ceph-mon-list
+--------------------------------------+--------------+--------------+------------+------+
| uuid | ceph_mon_gib | hostname | state | task |
+--------------------------------------+--------------+--------------+------------+------+
| 2a16a921-c672-47f4-a7f4-9209491bd67d | 20 | compute-0 | configured | None |
| 353a5202-6097-4c0d-a1eb-75aac91a56bb | 20 | controller-1 | configured | None |
| c3f7cce8-8f7b-4e23-a396-ee4465cf6c4a | 20 | controller-0 | configured | None |
+--------------------------------------+--------------+--------------+------------+------+

Changed in starlingx:
status: New → Incomplete
Tingjie Chen (silverhandy) wrote :

Since the issue has not been reproduced for 3 weeks, I am setting it to Fix Released.

Changed in starlingx:
status: Incomplete → Fix Released