Ceph alarms sometimes appear after swact on AIO-duplex system

Bug #1851597 reported by Yang Liu
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Paul-Ionut Vaduva

Bug Description

Brief Description
-----------------
Ceph storage alarms (800.001 / 800.011) are sometimes raised after a controller swact on an AIO-duplex system; they clear on their own after several minutes.

Severity
--------
Minor

Steps to Reproduce
------------------
system host-swact

TC-name: test_swact_controller
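
As a concrete reproduction sketch (the host name, the use of /etc/platform/openrc, and the poll loop below are my assumptions, not part of the recorded test case), the swact and the alarm check can be scripted as:

  source /etc/platform/openrc

  # Swact away from the currently active controller.
  system host-swact controller-0

  # Watch for the storage alarms reported in this bug (800.001 / 800.011).
  for i in $(seq 1 60); do
      fm alarm-list --nowrap | grep -E '800\.001|800\.011' || echo "no storage alarms"
      sleep 10
  done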

Expected Behavior
------------------
Controller swacted; no new alarms generated.

Actual Behavior
----------------
Sometimes storage alarms are observed; they clear automatically after a while:

| 753a102b-5bdf-473e-a344-3797bcdc0329 | 800.001 | Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or undersized]. Please check 'ceph -s' for more details. | cluster=af683010-fb38-4968-8f02-6b7d3a54cc1e | warning | 2019-11-06T21:04:52.153937 |
| 57e45af3-9cb6-4163-8ebe-1302b4a289d5 | 800.011 | Loss of replication in replication group group-0: OSDs are down | cluster=af683010-fb38-4968-8f02-6b7d3a54cc1e.peergroup=group-0.host=controller-1 | major | 2019-11-06T21:03:51.456714 |
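
The 800.001 alarm text itself points at 'ceph -s'; while the alarms are present, the underlying health checks can be inspected with the read-only Ceph CLI commands below (run on the active controller):

  ceph -s              # overall status; HEALTH_WARN while PGs are degraded/undersized
  ceph health detail   # names the specific checks (PG_DEGRADED, OSD_DOWN, ...)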

Reproducibility
---------------
Intermittent

These alarms were seen after 2 of 3 swacts in a sanity run, but I did not see them for the two manual swacts I tried.

After the first swact following a fresh install, the alarms took roughly 7-10 minutes to clear; after the second swact they took 3-4 minutes; they did not appear after the third swact.
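
To put a number on how long the alarms take to clear, a small timing loop like the following could be used after each swact (this helper is my sketch, not part of the sanity suite):

  start=$(date +%s)
  while fm alarm-list --nowrap | grep -qE '800\.001|800\.011'; do
      sleep 30
  done
  echo "storage alarms cleared after $(( $(date +%s) - start )) seconds"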

System Configuration
--------------------
Two node system
Lab-name:
wcp78-79

Branch/Pull Time/Commit
-----------------------
20191106

Last Pass
---------
Not sure

Timestamp/Logs
--------------
[2019-11-06 21:02:36,326] 311 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-0'

[2019-11-06 21:10:46,746] 311 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'

| 753a102b-5bdf-473e-a344-3797bcdc0329 | 800.001 | Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or undersized]. Please check 'ceph -s' for more details. | cluster=af683010-fb38-4968-8f02-6b7d3a54cc1e | warning | 2019-11-06T21:04:52.153937 |
| 57e45af3-9cb6-4163-8ebe-1302b4a289d5 | 800.011 | Loss of replication in replication group group-0: OSDs are down | cluster=af683010-fb38-4968-8f02-6b7d3a54cc1e.peergroup=group-0.host=controller-1 | major | 2019-11-06T21:03:51.456714 |

Test Activity
-------------
Sanity

Revision history for this message
Yang Liu (yliu12) wrote :
Revision history for this message
Peng Peng (ppeng) wrote :

Issue was also reproduced on a regular system.
Lab: WCP_71_75
Load: 2019-11-06_10-52-51

Frank Miller (sensfan22)
Changed in starlingx:
assignee: nobody → Paul-Ionut Vaduva (pvaduva)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Low priority / not gating; intermittent issue and the alarm eventually clears.

tags: added: stx.config stx.storage
Changed in starlingx:
importance: Undecided → Low
status: New → Triaged
Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
Paul-Ionut Vaduva (pvaduva) wrote :

I can't reproduce this on a virtual setup or on lab WCP_76_77 with a 2019-12-13 build.
The only storage alarms I see are the ones raised during initial setup, for any version I try, virtual or physical.
I suggest a retest of this issue.

Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
2020-01-13_00-10-00
wcp_78_79
Log @
https://files.starlingx.kube.cengn.ca/launchpad/1851597

Frank Miller (sensfan22)
tags: added: stx.4.0
Revision history for this message
Paul-Ionut Vaduva (pvaduva) wrote :

I have strong indications that this has the same root cause as bug 1856064,
with the difference that here there is only one ceph-mon, which listens on the floating IP
and is moved to the new active controller during a swact.

2020-01-13 16:42:15.866247 mon.controller mon.0 [abcd:204::1]:6789/0 10 : cluster [INF] osd.1 failed (root=storage-tier,chassis=group-0,host=controller-1) (1 reporters from different host after 24.000200 >= grace 20.000000)
2020-01-13 16:42:16.344732 mon.controller mon.0 [abcd:204::1]:6789/0 11 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2020-01-13 16:42:16.345016 mon.controller mon.0 [abcd:204::1]:6789/0 12 : cluster [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
2020-01-13 16:42:18.726316 osd.1 osd.1 [abcd:204::3]:6800/108549 1 : cluster [WRN] Monitor daemon marked osd.1 down, but it is still running
2020-01-13 16:43:19.898651 mon.controller mon.0 [abcd:204::1]:6789/0 43 : cluster [WRN] Health check failed: Degraded data redundancy: 64 pgs undersized (PG_DEGRADED)
2020-01-13 16:49:42.494075 mon.controller mon.0 [abcd:204::1]:6789/0 141 : cluster [INF] Manager daemon controller-0 is unresponsive, replacing it with standby daemon controller-1
2020-01-13 16:49:42.504798 mon.controller mon.0 [abcd:204::1]:6789/0 154 : cluster [INF] Manager daemon controller-1 is now available
2020-01-13 16:50:00.555075 mon.controller mon.0 [abcd:204::1]:6789/0 176 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2020-01-13 16:50:00.555190 mon.controller mon.0 [abcd:204::1]:6789/0 177 : cluster [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
2020-01-13 16:50:00.566175 mon.controller mon.0 [abcd:204::1]:6789/0 178 : cluster [INF] osd.1 [abcd:204::1]:6800/212907 boot
2020-01-13 16:50:00.567005 mon.controller mon.0 [abcd:204::1]:6789/0 181 : cluster [INF] osd.0 failed (root=storage-tier,chassis=group-0,host=controller-0) (connection refused reported by osd.1)
2020-01-13 16:50:01.556863 mon.controller mon.0 [abcd:204::1]:6789/0 187 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2020-01-13 16:50:01.556910 mon.controller mon.0 [abcd:204::1]:6789/0 188 : cluster [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
2020-01-13 16:50:01.826484 mon.controller mon.0 [abcd:204::1]:6789/0 190 : cluster [INF] Active manager daemon controller-1 restarted
2020-01-13 16:50:01.826528 mon.controller mon.0 [abcd:204::1]:6789/0 191 : cluster [INF] Activating manager daemon controller-1
2020-01-13 16:50:02.591493 mon.controller mon.0 [abcd:204::1]:6789/0 205 : cluster [INF] Manager daemon controller-1 is now available
2020-01-13 16:50:04.576222 mon.controller mon.0 [abcd:204::1]:6789/0 208 : cluster [WRN] Health check failed: Reduced data availability: 32 pgs inactive, 64 pgs down (PG_AVAILABILITY)
2020-01-13 16:50:04.576262 mon.controller mon.0 [abcd:204::1]:6789/0 209 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 64 pgs undersized)
2020-01-13 16:51:01.905472 mon.controller mon.0 [abcd:204::1]:6789/0 217 : cluster [WRN] Health check update: Reduced data availability: 64 pgs inactive, 64 pgs down (PG_AVAILABILITY)
2020-01...
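
To confirm the single-monitor topology described above (one ceph-mon bound to the floating controller address, in this lab [abcd:204::1]:6789), something like the following can be run on the active controller; these are standard Ceph commands, and the expected output is my reading of the logs above rather than captured output:

  ceph mon dump   # expect a single mon listening on the floating IP
  ceph osd tree   # one OSD per controller; the peer's OSD flaps during the swact
  ceph -w         # follow cluster log messages like the ones quoted above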


Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as Fix Released. As per the notes above, this is suspected to be addressed by https://bugs.launchpad.net/starlingx/+bug/1856064, which merged on 2020-03-31.

Changed in starlingx:
status: Triaged → Fix Released
Revision history for this message
Yang Liu (yliu12) wrote :

Closing as it is not seen in recent sanity runs.

tags: removed: stx.retestneeded