ceph-mon-modify does not update the ceph-mon partition on worker node

Bug #1827119 reported by Maria Yousaf
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Tingjie Chen

Bug Description

Brief Description
-----------------
If the user runs system ceph-mon-modify to enlarge the ceph-mon partition, only the controllers go config out-of-date, and the user must lock/unlock the controllers to clear the alarms. Checking the filesystem afterwards shows that ceph-mon has been resized to the new value on the controllers, as expected. For the worker node, however, there is no indication to lock/unlock the node and ceph-mon stays at the default value of 20GB. This disagrees with the output of system ceph-mon-list. Even if the user does a lock/unlock on the worker node, the filesystem size stays at 20GB. If this is the expected behaviour for the worker node, the output of system ceph-mon-list needs to be corrected.

Severity
--------
Minor

Steps to Reproduce
------------------
1. Run 'system ceph-mon-modify controller-0 ceph_mon_gib=40'
2. Both controllers go config-out-of-date.
3. Lock/unlock each controller.
4. Run 'ceph df' on all nodes and observe whether ceph-mon has increased to the new value, i.e. 40GB.
5. The controllers are updated but the compute stays at 20GB. This disagrees with what system ceph-mon-list reports, which shows the compute ceph-mon size as now being 40GB (a quick cross-check follows the table below):

[wrsroot@controller-1 ~(keystone_admin)]$ system ceph-mon-list
+--------------------------------------+--------------+--------------+------------+------+
| uuid                                 | ceph_mon_gib | hostname     | state      | task |
+--------------------------------------+--------------+--------------+------------+------+
| 0fee88b3-56f0-4f3e-bf99-2423e81eda3b | 40           | controller-0 | configured | None |
| 3e01735e-d80e-4ee5-b631-c4adb403710c | 40           | compute-1    | configured | None |
| c8667a02-c201-43b2-be61-7ed0e0ad8af8 | 40           | controller-1 | configured | None |
+--------------------------------------+--------------+--------------+------------+------+
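
For quick cross-checking, a minimal sketch that compares what the database reports with what is actually provisioned on each monitor host. It assumes passwordless ssh between the hosts and the standard /etc/platform/openrc credentials file, which may not hold on every lab; the host names are the ones from this report.

#!/bin/bash
# Compare the configured ceph_mon_gib with the actual ceph-mon LV on each monitor host.
# Hypothetical helper for this report, not a StarlingX tool.
source /etc/platform/openrc          # load keystone_admin credentials (assumed standard location)
system ceph-mon-list                 # what sysinv believes
for host in controller-0 controller-1 compute-1; do
    echo "--- ${host} ---"
    ssh "${host}" "df -H | grep /var/lib/ceph/mon"   # what the filesystem actually shows
done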

Expected Behavior
------------------
Ceph-mon is updated on filesystems of all affected nodes.

Actual Behavior
----------------
Ceph-mon is updated on controllers only.

Reproducibility
---------------
Tried once.

System Configuration
--------------------
Standard system (2 controllers + 2 computes)

Branch/Pull Time/Commit
-----------------------
master build: 20190427T013000Z

Last Pass
---------
I don't believe this has been run on StarlingX before.

Timestamp/Logs
--------------
N/A

Test Activity
-------------
Regression

Revision history for this message
Frank Miller (sensfan22) wrote :

Marking stx.2.0 gating as ceph functionality is required for all StarlingX configurations.

Assigning to Cindy and requesting assistance to identify a prime to investigate this issue.

Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Cindy Xie (xxie1)
tags: added: stx.2.0 stx.retestneeded
Revision history for this message
Cindy Xie (xxie1) wrote :

Maria, can you retest using the latest ISO (after 5/3 Ceph upgrade)?

Changed in starlingx:
assignee: Cindy Xie (xxie1) → Tingjie Chen (silverhandy)
Revision history for this message
Maria Yousaf (myousaf) wrote :

This is still a problem:

[wrsroot@controller-0 ~(keystone_admin)]$ df -H | grep /var/lib/ceph/mon
/dev/mapper/cgts--vg-ceph--mon--lv 43G 115M 40G 1% /var/lib/ceph/mon

controller-1:~$ df -H | grep /var/lib/ceph/mon
/dev/mapper/cgts--vg-ceph--mon--lv 43G 115M 40G 1% /var/lib/ceph/mon

compute-1:~$ df -H | grep /var/lib/ceph/mon
/dev/mapper/cgts--vg-ceph--mon--lv 22G 111M 20G 1% /var/lib/ceph/mon

###
### StarlingX
### Built from master
###

OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190506T233000Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="93"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-05-06 23:30:00 +0000"

[wrsroot@controller-0 ~(keystone_admin)]$ system ceph-mon-list
+--------------------------------------+--------------+--------------+------------+------+
| uuid                                 | ceph_mon_gib | hostname     | state      | task |
+--------------------------------------+--------------+--------------+------------+------+
| 7351acaa-837d-4456-9144-ad64ed977d63 | 40           | controller-1 | configured | None |
| bf488244-a765-42e6-9c27-90cd6012d892 | 40           | compute-1    | configured | None |
| e6415d2e-0289-419f-81a8-94bfcbb443aa | 40           | controller-0 | configured | None |
+--------------------------------------+--------------+--------------+------------+------+

tags: added: stx.storage
Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

We have two approaches here, depending on the path we want to take:

1. Drop the resize functionality. ceph-mon uses less than 1GB of data (more like 500MB) and we allocate 20GB; the Ceph docs recommend 10GB just in case. I'll throw a question on Ceph's mailing list, as there is no clear specification regarding ceph-mon limits.
2. Since the ceph-mon partitions have to be equal on all nodes, we implement resize for worker nodes and for storage nodes (neither exists today). There are two solutions here:
  A. Simplified but inconsistent solution: extend ceph-mon for computes and storages as well. The problem is that if we don't have enough space in cgts-vg we can't extend it on those nodes either. The good part is that we allocate the entire disk space to it, so there should be enough space as long as we put a limit on it (if memory serves, the current limit is set at 40GB).
Another problem is that we manage all other LVs in cgts-vg through the 'system controllerfs-*' commands, see B below.
  B. Generic, long-term and consistent solution: since the ceph-mon data resides on a logical volume in cgts-vg, and all the other LVs there are managed through the 'system controllerfs-*' commands, it makes sense to also manage ceph-mon through controllerfs commands. The problem is that this mechanism exists for neither worker nor storage nodes. Therefore, to make this work we would need a small story that:
- renames the 'system controllerfs-*' commands to 'system nodefs-*'
- enables rootfs modifications for all node types
- removes the existing ceph-mon-gib functionality.

I would go for #1, but we need confirmation on the Ceph mailing list that there is no risk in going above the 20GB we currently allocate to ceph-mon; otherwise 2A is the simplest solution to implement now, given that in following releases we plan to containerize Ceph and will need to revisit this mechanism anyway.
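
To make option 2A concrete, here is a rough manual equivalent of the resize on a single ceph-mon host. The LV path and the lvextend call are the ones the puppet manifest issues according to the comments below; the vgs free-space check and the filesystem-grow step (ext4 assumed) are illustrative assumptions, not the actual manifest.

#!/bin/bash
# Rough manual equivalent of option 2A on one ceph-mon host (sketch only).
NEW_SIZE_GIB=40                     # desired ceph_mon_gib
LV=/dev/cgts-vg/ceph-mon-lv         # LV backing /var/lib/ceph/mon

# 1. Check how much room cgts-vg has left before touching anything.
vgs --noheadings --units g -o vg_free cgts-vg

# 2. Grow the logical volume to the new size (the call puppet issues on lock/unlock).
lvextend -L "${NEW_SIZE_GIB}G" "${LV}"

# 3. Grow the filesystem on top of it -- assuming ext4 here, which may not match the real layout.
resize2fs "${LV}"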

Changed in starlingx:
status: Triaged → In Progress
Cindy Xie (xxie1)
tags: added: stx.distro.other
Revision history for this message
Tingjie Chen (silverhandy) wrote :

Thanks Ovidiu for your suggestions,

For solution #1: for StarlingX users it is a problem if the ceph-mon data reaches the limit without any warning message. The Ceph cluster already has the mon_data_size_warn/mon_data_avail_warn settings, and I think we need a command to set up a similar threshold or limit mechanism.

For solution #2A: some Ceph deployments allocate the entire disk and share it between mon data and other data (OSDs, logs, etc.), with a warning threshold percentage for the ceph-mon data. It is simple to implement but has cross-impact: if the system logs grow wildly and fill up the disk space, the ceph-mon data has no way to expand even though it has not reached its threshold size or percentage.

Solution #2B is the ideal solution, but considering the complexity of the source change and the stability risk, I think we have to make a compromise. I currently lean towards #2A; yes, the next containerized Ceph upgrade will need to reconsider a long-term solution and may have other concerns.

Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

Re #1. We got 3 answers on the ceph mailing list:

1. http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-May/034672.html
2. http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-May/034674.html
3. http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-May/034709.html

Summary is:
1. Under normal conditions the size is small (~1.5GB).
2. The ceph-mon data grows if OSDs misbehave (i.e. OSDs down, nodes down) for a long time, as the cluster needs to keep previous epochs for replay once the misbehaving OSD rejoins the cluster.
3. Once replay is done, the previous data is cleaned up and the space is released (so no leakage, which is good!).
4. There has to be enough space for replays; the recommendation is ~64 GB, but it depends on the cluster size (note that our clusters are quite small; 4-8 storage nodes each with 4 OSDs is a small cluster from Ceph's perspective). As you said Tingjie, it is better to have a warning and allow the user to take action (increase the mon partition size & fix the error condition).

The conclusion is that we still need the resize => #1 is out of the way. The initial 20GB is OK but has to be resizable.

Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

So, indeed Tingjie, #2B is too much. #1 is not a good idea (see previous comment). #2A is the way to go, coupled with setting "mon data avail warn" to 10 (i.e. an alarm will be raised when ceph-mon disk usage reaches 90%) in $MY_REPO/stx/stx-integ/ceph/ceph/files/ceph.conf.

In our case ceph-mon resides on a separate partition (i.e. an LVM logical volume), so there is no need to worry about other processes filling it (no logging nor OSD data is written there; our logging goes to /var/log/ceph and OSDs keep data on the OSD disk or journal disk) => 90% is a good static limit for the threshold (the default is 70%, which is too loose).

Also, it's a good idea to disable "mon data size warn" (setting it to 0 should do it), else we would have to update it each time ceph-mon is resized, or the alarm would be raised at 15GB.
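
For reference, the change described above amounts to a fragment like the following in that ceph.conf. The values are the ones proposed in the comment; treating 0 as "disabled" for mon data size warn is the assumption stated there, not something verified here.

[mon]
    # warn when the ceph-mon partition reaches 90% usage (i.e. only 10% available)
    mon data avail warn = 10
    # proposed: disable the absolute-size warning so it does not fire at 15GB after a resize
    mon data size warn = 0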

Cindy Xie (xxie1)
tags: removed: stx.distro.other
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/660889

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (master)

Change abandoned by Tingjie Chen (<email address hidden>) on branch: master
Review: https://review.opendev.org/660889
Reason: Abandoned since the MON data needs to stay in sync across the 3 MONs; changing the controllers without worker/storage does not make sense.

Revision history for this message
Tingjie Chen (silverhandy) wrote :

Some updates on the issue...

Command: system ceph-mon-modify <controller> ceph_mon_gib=<number>
Literally it supports controllers only, but after adding storage/worker
support there are challenges.

With the current implementation, it updates the controllers and the
worker/storage node (lvextend on cgts-vg), but it only checks the free size on
the controller filesystems; we cannot get the cgts-vg information for worker/storage,
so there are 2 issues:
1. Worker/storage nodes (if they host a ceph-mon) are updated in addition to
the controllers, which does not match the literal meaning of the command
(controllers only), refer LP: 1827119
2. There is no way to check the fs of worker/storage. If there is not enough free
space to extend cgts-vg, then on lock/unlock the puppet script executes
/usr/sbin/lvextend -L xxxk /dev/cgts-vg/ceph-mon-lv and fails, and the
worker/storage node ends up in a failed state, refer LP: 1828262

So it is not simple to extend ceph-mon-modify ceph_mon_gib to compute and storage nodes.

One way to resolve the issue is to update the controllers only, following the
literal meaning of the command, since there is no way to check the free
space of cgts-vg on the other nodes, and implementing the check on worker/storage is
complicated. The problem is that ceph_mon_gib then cannot be configured on
storage/worker nodes, yet the ceph mons need to stay in sync and equal mon sizes are required.
Another way is to add a check mechanism before lvextend on the worker/storage node:
when the node reboots, if the puppet check fails it raises an alarm so the user can adjust.
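
A back-of-the-envelope version of the second option (check before lvextend), expressed as a shell check one could run on the worker/storage node itself. The real fix lives in sysinv/puppet, so this is only a sketch of the idea; the helper and its variable names are hypothetical.

#!/bin/bash
# Sketch of a pre-check before growing the ceph-mon LV on a worker/storage host
# (hypothetical helper, not the actual sysinv/puppet code).
REQUESTED_GIB=40    # new ceph_mon_gib being applied

current=$(lvs --noheadings --units g -o lv_size cgts-vg/ceph-mon-lv | tr -dc '0-9.')
free=$(vgs --noheadings --units g -o vg_free cgts-vg | tr -dc '0-9.')

# Refuse to proceed if cgts-vg cannot absorb the growth; the real fix reports this
# through the API/alarms instead of failing later in puppet.
if awk -v f="$free" -v c="$current" -v r="$REQUESTED_GIB" 'BEGIN { exit !(f < r - c) }'; then
    echo "cgts-vg: only ${free}G free, cannot grow ceph-mon-lv from ${current}G to ${REQUESTED_GIB}G" >&2
    exit 1
fi
/usr/sbin/lvextend -L "${REQUESTED_GIB}G" /dev/cgts-vg/ceph-mon-lv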

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/661900

Revision history for this message
Tingjie Chen (silverhandy) wrote :

I have submitted a gerrit patch: https://review.opendev.org/661900, which follows the #2A solution.
The way we resolve the issue is to extend the check mechanism to the controllers
and the worker/storage nodes; if the cgts-vg limit check fails on any node, the
command returns an error message and no action is taken.

Revision history for this message
Kristine Bujold (kbujold) wrote :

Something to note is that ceph-mon is also configurable from Horizon under "Admin/Platform/System Configuration/Controller Filesystem"

Revision history for this message
Kristine Bujold (kbujold) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/667044

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (master)

Change abandoned by Tingjie Chen (<email address hidden>) on branch: master
Review: https://review.opendev.org/667044
Reason: Abandoned since it was for debugging only

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/661900
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=39cb9d0c0b66a4f4920cd26ecd27bdf196d9b260
Submitter: Zuul
Branch: master

commit 39cb9d0c0b66a4f4920cd26ecd27bdf196d9b260
Author: Chen, Tingjie <email address hidden>
Date: Fri Jul 12 21:49:42 2019 +0800

    Refine command ceph-mon-modify for all nodes support

    Command: system ceph-mon-modify <controller> ceph_mon_gib=<number>
    literally supports controllers only, but after adding storage/worker
    support there are challenges.

    With the current implementation, it updates the controllers and the
    worker/storage node (lvextend on cgts-vg), but only checks the free
    size on the controller filesystems; we cannot get the cgts-vg
    information for worker/storage, so there are 2 issues:
    1. Worker/storage nodes (if they host a ceph-mon) are updated in
    addition to the controllers, which does not match the literal meaning
    of the command (controllers only), refer LP: 1827119
    2. There is no way to check the fs of worker/storage; if there is not
    enough free space to extend cgts-vg, then on lock/unlock the puppet
    script executes /usr/sbin/lvextend -L xxxk /dev/cgts-vg/ceph-mon-lv
    and fails, leaving the worker/storage node in a failed state, refer
    LP: 1828262

    The way we resolve the issue is to extend the check mechanism to the
    controllers and worker/storage nodes; if the cgts-vg limit check fails
    on any node, the command returns an error message and no action is
    taken.

    Closes-Bug: 1827119
    Change-Id: I106581bde1ebbe56cd34e35fa734435bd0c1a268
    Signed-off-by: Chen, Tingjie <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

20190807T053000Z
2 controller +2 worker node system

The behavior is still not quite as expected in terms of compute-0 alarms and the system ceph-mon-list "state" output,
i.e. see the ****--- markers in step 6 and step 7 below

1. Check the ceph mon size prior to making any changes
[sysadmin@controller-0 ~(keystone_admin)]$ df -H | grep /var/lib/ceph/mon
/dev/mapper/cgts--vg-ceph--mon--lv 22G 108M 20G 1% /var/lib/ceph/mon
controller-1:~$ df -H | grep /var/lib/ceph/mon
/dev/mapper/cgts--vg-ceph--mon--lv 22G 108M 20G 1% /var/lib/ceph/mon
compute-0:~$ df -H | grep /var/lib/ceph/mon
/dev/mapper/cgts--vg-ceph--mon--lv 22G 108M 20G 1% /var/lib/ceph/mon

2. Attempt changes
$ system ceph-mon-modify controller-0 ceph_mon_gib=40
Node: compute-0 Total target growth size 20 GiB for database (doubled for upgrades), glance, scratch, backup, extension and ceph-mon exceeds growth limit of 1 GiB.
$ system ceph-mon-modify controller-0 ceph_mon_gib=10
ceph_mon_gib = 10. Value must be between 21 and 40.
[sysadmin@controller-0 ~(keystone_admin)]$ system ceph-mon-modify controller-0 ceph_mon_gib=25
Node: compute-0 Total target growth size 5 GiB for database (doubled for upgrades), glance, scratch, backup, extension and ceph-mon exceeds growth limit of 1 GiB.

3. command executed here to make the change:
$ system ceph-mon-modify controller-0 ceph_mon_gib=21
+--------------------------------------+--------------+--------------+-------------+-----------------------------------------------------------------+
| uuid                                 | ceph_mon_gib | hostname     | state       | task                                                            |
+--------------------------------------+--------------+--------------+-------------+-----------------------------------------------------------------+
| 4df4455f-ff14-4433-9d24-645cc920e673 | 21           | compute-0    | configuring | {u'controller-1': 'configuring', u'controller-0': 'configured'} |
| da081bb3-36a5-4122-ab13-619d8888c299 | 21           | controller-1 | configured  | None                                                            |
| ec58c58d-ded3-4169-9796-1047f1949aa4 | 21           | controller-0 | configured  | None                                                            |
+--------------------------------------+--------------+--------------+-------------+-----------------------------------------------------------------+

***4. Config out-of-date alarms are triggered (but not a compute-0 alarm???)
controller-0 shows Config out-of-date

  250.001 controller-0 Configuration is out-of-date. host=controller-0 major 2019-08-08T16:26:07
 250.001 controller-1 Configuration is out-of-date. host=con...


Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

This LP should be reopened.

Numan Waheed (nwaheed)
Changed in starlingx:
status: Fix Released → Confirmed
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :
Revision history for this message
Tingjie Chen (silverhandy) wrote :

Hi Wendy, I suppose your key points are:

1/ No config out-of-date alarm on compute-0 to indicate the change.
2/ In ceph-mon-list, compute-0 stays in the configuring state and cannot recover.

For the second point, I cannot reproduce the configuring status on compute-0; may I ask how frequently it reproduces?

--------------------------------

After host-lock, host-unlock and swact on controller-0 and controller-1, and host lock/unlock on compute-0:

[sysadmin@controller-0 ~(keystone_admin)]$ system ceph-mon-list
+--------------------------------------+--------------+--------------+------------+------+
| uuid | ceph_mon_gib | hostname | state | task |
+--------------------------------------+--------------+--------------+------------+------+
| 38742a06-ea02-402f-b95d-2b286969dd53 | 21 | compute-0 | configured | None |
| ccebf7db-7607-4271-a24c-b3dceb4a482c | 21 | controller-1 | configured | None |
| dac91d09-b221-492c-a510-faba1ba6dcf4 | 21 | controller-0 | configured | None |
+--------------------------------------+--------------+--------------+------------+------+

[sysadmin@controller-0 ~(keystone_admin)]$ df -H | grep /var/lib/ceph/mon
/dev/mapper/cgts--vg-ceph--mon--lv 23G 108M 21G 1% /var/lib/ceph/mon

controller-1:~$ df -H | grep /var/lib/ceph/mon
/dev/mapper/cgts--vg-ceph--mon--lv 23G 108M 21G 1% /var/lib/ceph/mon

compute-0:~$ df -H | grep /var/lib/ceph/mon
/dev/mapper/cgts--vg-ceph--mon--lv 23G 108M 21G 1% /var/lib/ceph/mon

Revision history for this message
Tingjie Chen (silverhandy) wrote :

I have tried twice for the 2 points:
1/ The alarm for compute-0 is missing from the alarm list; this issue is confirmed, and I will raise another patch to fix it.
2/ compute-0 stuck in the configuring status; this still cannot be reproduced.

[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
+----------+-----------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------+----------+----------------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+-----------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------+----------+----------------------------+
| 250.001 | controller-1 Configuration is out-of-date. | host=controller-1 | major | 2019-08-19T11:15:03.902690 |
| 250.001 | controller-0 Configuration is out-of-date. | host=controller-0 | major | 2019-08-19T11:15:03.798282 |
| 800.010 | Potential data loss. No available OSDs in storage replication group group-0: no OSDs | cluster=a4c1e115-f27a-4c83-9c4d-3bc500e5f3e5.peergroup=group-0 | critical | 2019-08-19T11:10:10.389689 |
| 400.005 | Communication failure detected with peer over port ens6 on host controller-1 | host=controller-1.network=oam | major | 2019-08-19T08:41:12.293739 |
| 800.001 | Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or undersized]. Please check 'ceph -s' for more details. | cluster=a4c1e115-f27a-4c83-9c4d-3bc500e5f3e5 | warning | 2019-08-19T08:39:20.555559 |
| 400.005 | Communication failure detected with peer over port ens6 on host controller-0 | host=controller-0.network=oam | major | 2019-08-19T08:19:52.216585 |
+----------+-----------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------+----------+----------------------------+

after swact and ceph-mon-add compute-1:

[sysadmin@controller-1 ~(keystone_admin)]$ system ceph-mon-list
+--------------------------------------+--------------+--------------+------------+------+
| uuid | ceph_mon_gib | hostname | state | task |
+--------------------------------------+--------------+--------------+------------+------+
| b0b97fbe-b876-4656-8719-4e10d90b37cb | 21 | controller-1 | configured | None |
| d51a32dc-57a6-4605-8573-89c452507c35 | 21 | controller-0 | configured |...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/677424

Changed in starlingx:
status: Confirmed → In Progress
Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/677424
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=abebb638603cb3297ae133d4e388d034d7d3eff8
Submitter: Zuul
Branch: master

commit abebb638603cb3297ae133d4e388d034d7d3eff8
Author: Chen, Tingjie <email address hidden>
Date: Tue Aug 20 15:50:41 2019 +0800

    Add alarm for worker in ceph-mon

    In a 2+2 deployment, ceph-mon has 3 replicas on controller-0/1 and
    compute-0. When settings such as ceph_mon_gib are changed, the alarm
    list lacks a dedicated message for the compute node.

    The change lists all ceph monitors on controller/storage/worker nodes,
    and an alarm is produced on the dedicated node when its storage
    configuration changes.

    Partial-bug: 1827119
    Change-Id: I9d9624b1a82e52d800ab9d594a180641c854a039
    Signed-off-by: Chen, Tingjie <email address hidden>

Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Tingjie, The commit above has "Partial-bug" instead of "Closes-Bug". Are you expecting to make additional code changes for this bug? Is this only a partial fix?

Revision history for this message
Tingjie Chen (silverhandy) wrote :

@Ghada, in my opinion there are 2 points raised by Maria:
1/ The alarm for compute-0 is missing from the alarm list; this issue is confirmed, and I will raise another patch to fix it.
2/ compute-0 stuck in the configuring status; this still cannot be reproduced.

The patch https://review.opendev.org/#/c/677424/ resolves point 1,
but I cannot reproduce the second point and need confirmation from the reporter, thanks.

Revision history for this message
yong hu (yhu6) wrote :

https://review.opendev.org/#/c/677424/ was merged, but @Tingjie will further check with @Wendy about the other aspect.

Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

The 'configuring' state should clear itself on lock/unlock even if there are issues with ceph-mon during initial configuration. Therefore, I also recommend a retest for 2/

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as Fix Released based on Ovidiu's recommendation.
If an issue is encountered during retest, this bug can be re-opened or a new one can be created.

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
John Kruszewski (jiggernaut) wrote :

# RETEST STATUS
PASSED

# CONFIGURATION
2 + 2 Controller Storage Config

# LOAD TESTED
BUILD_ID="2019-10-09_20-00-00"

# TESTING

1. Verify ceph-mon prior to changing
 [sysadmin@controller-0 ~(keystone_admin)]$ system ceph-mon-list
 +--------------------------------------+--------------+--------------+------------+------+
 | uuid                                 | ceph_mon_gib | hostname     | state      | task |
 +--------------------------------------+--------------+--------------+------------+------+
 | d225b915-8adf-4b24-961d-df7855e2cebf | 20           | compute-0    | configured | None |
 | d850eb6e-c880-4188-bc78-43bd162a0ff6 | 20           | controller-0 | configured | None |
 | df67b4c3-3962-4767-a4fa-dcdeb0f8e53d | 20           | controller-1 | configured | None |
 +--------------------------------------+--------------+--------------+------------+------+
 [sysadmin@controller-0 ~(keystone_admin)]$ df -H | grep /var/lib/ceph/mon
 /dev/mapper/cgts--vg-ceph--mon--lv 22G 105M 20G 1% /var/lib/ceph/mon

 controller-1:~$ df -H | grep /var/lib/ceph/mon
 /dev/mapper/cgts--vg-ceph--mon--lv 22G 105M 20G 1% /var/lib/ceph/mon

 compute-0:~$ df -H | grep /var/lib/ceph/mon
 /dev/mapper/cgts--vg-ceph--mon--lv 22G 107M 20G 1% /var/lib/ceph/mon

2. Increase size of ceph_mon
 [sysadmin@controller-0 ~(keystone_admin)]$ system ceph-mon-modify controller-0 ceph_mon_gib=21
 +--------------------------------------+--------------+--------------+------------+------+
 | uuid                                 | ceph_mon_gib | hostname     | state      | task |
 +--------------------------------------+--------------+--------------+------------+------+
 | d225b915-8adf-4b24-961d-df7855e2cebf | 21           | compute-0    | configured | None |
 | d850eb6e-c880-4188-bc78-43bd162a0ff6 | 21           | controller-0 | configured | None |
 | df67b4c3-3962-4767-a4fa-dcdeb0f8e53d | 21           | controller-1 | configured | None |
 +--------------------------------------+--------------+--------------+------------+------+

3. After lock/unlock of controllers, compute still reports config out-of-date
 [sysadmin@controller-1 ~(keystone_admin)]$ fm alarm-list
 +----------+------------------------------------------+----------------+----------+----------------------------+
 | Alarm ID | Reason Text                              | Entity ID      | Severity | Time Stamp                 |
 +----------+------------------------------------------+----------------+----------+----------------------------+
 | 250.001  | compute-0 Configuration is out-of-date.  | host=compute-0 | major    | 2019-10-11T14:45:05.903134 |
 +----------+------------------------------------------+----------------+----------+----------------------------+
 [sysadmin@controller-1 ~(keystone_admin)]$ system ceph...


tags: removed: stx.retestneeded