Ceph observed to briefly report health error during regression

Bug #1801772 reported by Maria Yousaf
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Ovidiu Poncea
Milestone: (none)

Bug Description

Brief Description
-----------------
Ceph briefly reports health errors during test execution

Severity
--------
Major

Steps to Reproduce
------------------
1. In test_storgroup_semantic_checks, we see that all nodes are available:

[2018-10-28 03:39:14,302] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-region-name RegionOne host-list'
[2018-10-28 03:39:15,970] 382 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | storage-0 | storage | unlocked | enabled | available |
| 4 | storage-1 | storage | unlocked | enabled | available |
| 5 | compute-0 | compute | unlocked | enabled | available |
| 6 | compute-1 | compute | unlocked | enabled | available |
| 7 | compute-2 | compute | unlocked | enabled | available |
| 8 | compute-3 | compute | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
2. Then the health of the ceph cluster is checked (okay):
[2018-10-28 03:39:16,074] 194 INFO MainThread verify_fixtures.ceph_precheck:: Verify the health of the CEPH cluster
[2018-10-28 03:39:16,074] 419 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2018-10-28 03:39:16,075] 262 DEBUG MainThread ssh.send :: Send 'ceph -s'
[2018-10-28 03:39:16,362] 382 DEBUG MainThread ssh.expect :: Output:
    cluster 71b975c7-b1ae-447f-bf71-bce3127ad63a
     health HEALTH_OK
     monmap e8: 2 mons at {controller-1=192.168.205.103:6789/0,storage-0=192.168.205.149:6789/0}
            election epoch 52, quorum 0,1 controller-1,storage-0
     osdmap e272: 4 osds: 4 up, 4 in
            flags sortbitwise,require_jewel_osds
      pgmap v9981: 2176 pgs, 13 pools, 1786 MB data, 1356 objects
            3790 MB used, 2599 GB / 2602 GB avail
                2176 active+clean
3. Then the health of the cluster is checked again shortly after and the following is seen:

[2018-10-28 03:39:16,466] 198 INFO MainThread verify_fixtures.ceph_precheck:: Verify if there are OSDs provisioned
[2018-10-28 03:39:16,466] 419 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2018-10-28 03:39:16,466] 262 DEBUG MainThread ssh.send :: Send 'ceph -s'
[2018-10-28 03:39:16,747] 382 DEBUG MainThread ssh.expect :: Output:
    cluster 71b975c7-b1ae-447f-bf71-bce3127ad63a
     health HEALTH_ERR
            no osds
     monmap e1: 1 mons at {controller-0=192.168.205.102:6789/0}
            election epoch 3, quorum 0 controller-0
     osdmap e1: 0 osds: 0 up, 0 in
            flags sortbitwise,require_jewel_osds
      pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                  64 creating
[wrsroot@controller-1 ~(keystone_admin)]$
[2018-10-28 03:39:16,747] 262 DEBUG MainThread ssh.send :: Send 'echo $?'
[2018-10-28 03:39:16,850] 382 DEBUG MainThread ssh.expect :: Output:
0
[wrsroot@controller-1 ~(keystone_admin)]$
[2018-10-28 03:39:16,850] 72 INFO MainThread storage_helper.get_num_osds:: There are 0 OSDs on the system
[2018-10-28 03:39:16,851] 200 INFO MainThread verify_fixtures.ceph_precheck:: System has 0 OSDS
[2018-10-28 03:39:16,866] 53 DEBUG MainThread conftest.update_results:: ***Failure at test setup: /home/svc-cgcsauto/wassp-repos.new/testcases/cgcs/CGCSAuto/testfixtures/verify_fixtures.py:201: AssertionError: There are no OSDs assigned

4. Shortly afterwards, alarms are checked (only the NTP alarm is present):

[2018-10-28 03:39:23,095] 262 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-region-name RegionOne alarm-list --nowrap --uuid'
[2018-10-28 03:39:24,455] 382 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+------------------------------------------------------------------------+-----------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+------------------------------------------------------------------------+-----------------------+----------+----------------------------+
| 70a0a511-92e2-44d2-931d-2f73a5d6ba17 | 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-0.ntp | major | 2018-10-28T02:32:31.489740 |
+--------------------------------------+----------+------------------------------------------------------------------------+-----------------------+----------+----------------------------+
[wrsroot@controller-1 ~(keystone_admin)]$

It looks like a short-term blip in the Ceph state: the second query returns a freshly formed single-monitor view (monmap e1, osdmap e1, no OSDs) even though the same cluster reported HEALTH_OK with four OSDs moments earlier. A quick manual check for such a blip is sketched after the next log excerpt.

test_swift_basic_object_copy[large] shows a similar failure signature:

[2018-10-28 03:40:14,675] 382 DEBUG MainThread ssh.expect :: Output:
    cluster 71b975c7-b1ae-447f-bf71-bce3127ad63a
     health HEALTH_ERR
            no osds
     monmap e1: 1 mons at {controller-0=192.168.205.102:6789/0}
            election epoch 3, quorum 0 controller-0
     osdmap e1: 0 osds: 0 up, 0 in
            flags sortbitwise,require_jewel_osds
      pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                  64 creating
[wrsroot@controller-1 ~(keystone_admin)]$
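
A quick way to confirm whether this is a transient blip is to poll the cluster status a few times in a row and see whether the reported state flips between HEALTH_OK and HEALTH_ERR. A minimal sketch (illustrative only, not part of the CGCSAuto test framework; run from the active controller shell with admin credentials loaded):

    # Poll Ceph health 10 times, one second apart; dump the full status on any blip.
    for i in $(seq 1 10); do
        state=$(ceph health)
        echo "$(date +%T) ${state}"
        if [ "${state}" != "HEALTH_OK" ]; then ceph -s; fi
        sleep 1
    done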

Expected Behavior
------------------
No Ceph health error is seen; the cluster remains HEALTH_OK

Actual Behavior
----------------
A Ceph health error (HEALTH_ERR: no osds) is briefly reported and then clears

Reproducibility
---------------
Intermittent.

System Configuration
--------------------
Storage

Branch/Pull Time/Commit
-----------------------
master as of 2018-10-26_11-56-15

Revision history for this message
Ghada Khalil (gkhalil) wrote:

Targeting stx.2019.03. The issue is related to the Ceph on all-in-one feature.

Changed in starlingx:
assignee: nobody → Ovidiu Poncea (ovidiu.poncea)
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.2019.03 stx.config
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote: Fix merged to stx-config (master)

Reviewed: https://review.openstack.org/618127
Committed: https://git.openstack.org/cgit/openstack/stx-config/commit/?id=907f8de09ede32d24da677af90a2de4abbe7353f
Submitter: Zuul
Branch: master

commit 907f8de09ede32d24da677af90a2de4abbe7353f
Author: Ovidiu Poncea <email address hidden>
Date: Wed Nov 14 17:26:00 2018 +0200

    Fix Ceph split brain on multinode

    Due to changes that make Ceph work in single-node configurations,
    'mon_initial_members = controller-0' was added to /etc/ceph.conf.
    This setting allows Ceph to run with a single monitor. On a
    multinode system it may cause a split brain, because controller-0
    may start by itself if it cannot immediately connect to the good
    cluster.

    This also causes problems during Backup & Restore, when the
    controllers are reinstalled but the older Ceph cluster is recovered.
    In that case the valid monitor data present on storage-0 should be
    downloaded by the monitor on controller-0. The problem is that with
    mon_initial_members = controller-0 the monitor on controller-0 will
    happily start by itself and will not download the valid data from
    storage-0, leading to a possible split-brain condition.

    This commit sets the option for single-node configurations only and
    keeps it undefined for multinode.

    Change-Id: I01d5563260ad211b04fe4904c32f255f4e683b07
    Closes-bug: 1801772
    Signed-off-by: Ovidiu Poncea <email address hidden>
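
In practice this means that, after the fix, 'mon_initial_members' should only appear in the Ceph configuration on single-node (all-in-one) systems, while multinode systems leave it unset so the monitor on controller-0 has to join the existing quorum rather than form one on its own. A hedged spot-check on a running controller (the commit text refers to /etc/ceph.conf; the conventional path is /etc/ceph/ceph.conf):

    # A match is expected only on single-node configurations after the fix.
    grep -Hn 'mon_initial_members' /etc/ceph.conf /etc/ceph/ceph.conf 2>/dev/null \
        || echo "mon_initial_members not set (expected on multinode)"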

Changed in starlingx:
status: In Progress → Fix Released
Ken Young (kenyis)
tags: added: stx.2019.05
removed: stx.2019.03
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05