Comment 6 for bug 1851597

Paul-Ionut Vaduva (pvaduva) wrote :

I have strong indications that this has the same root cause as bug 1856064,
with the difference that there is only one ceph-mon, which listens on the floating IP and
is moved to the new active controller during a swact.

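A minimal sketch of the relevant configuration (assuming the floating IP is the `[abcd:204::1]` address seen in the logs below; the exact ceph.conf on these controllers may differ): with a single monitor bound to the floating address, every daemon follows that address to whichever controller is currently active.

```ini
[global]
# Single monitor reachable only via the controller floating IP;
# after a swact, this endpoint moves to the new active controller.
mon host = [abcd:204::1]:6789
```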
2020-01-13 16:42:15.866247 mon.controller mon.0 [abcd:204::1]:6789/0 10 : cluster [INF] osd.1 failed (root=storage-tier,chassis=group-0,host=controller-1) (1 reporters from different host after 24.000200 >= grace 20.000000)
2020-01-13 16:42:16.344732 mon.controller mon.0 [abcd:204::1]:6789/0 11 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2020-01-13 16:42:16.345016 mon.controller mon.0 [abcd:204::1]:6789/0 12 : cluster [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
2020-01-13 16:42:18.726316 osd.1 osd.1 [abcd:204::3]:6800/108549 1 : cluster [WRN] Monitor daemon marked osd.1 down, but it is still running
2020-01-13 16:43:19.898651 mon.controller mon.0 [abcd:204::1]:6789/0 43 : cluster [WRN] Health check failed: Degraded data redundancy: 64 pgs undersized (PG_DEGRADED)
2020-01-13 16:49:42.494075 mon.controller mon.0 [abcd:204::1]:6789/0 141 : cluster [INF] Manager daemon controller-0 is unresponsive, replacing it with standby daemon controller-1
2020-01-13 16:49:42.504798 mon.controller mon.0 [abcd:204::1]:6789/0 154 : cluster [INF] Manager daemon controller-1 is now available
2020-01-13 16:50:00.555075 mon.controller mon.0 [abcd:204::1]:6789/0 176 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2020-01-13 16:50:00.555190 mon.controller mon.0 [abcd:204::1]:6789/0 177 : cluster [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
2020-01-13 16:50:00.566175 mon.controller mon.0 [abcd:204::1]:6789/0 178 : cluster [INF] osd.1 [abcd:204::1]:6800/212907 boot
2020-01-13 16:50:00.567005 mon.controller mon.0 [abcd:204::1]:6789/0 181 : cluster [INF] osd.0 failed (root=storage-tier,chassis=group-0,host=controller-0) (connection refused reported by osd.1)
2020-01-13 16:50:01.556863 mon.controller mon.0 [abcd:204::1]:6789/0 187 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2020-01-13 16:50:01.556910 mon.controller mon.0 [abcd:204::1]:6789/0 188 : cluster [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
2020-01-13 16:50:01.826484 mon.controller mon.0 [abcd:204::1]:6789/0 190 : cluster [INF] Active manager daemon controller-1 restarted
2020-01-13 16:50:01.826528 mon.controller mon.0 [abcd:204::1]:6789/0 191 : cluster [INF] Activating manager daemon controller-1
2020-01-13 16:50:02.591493 mon.controller mon.0 [abcd:204::1]:6789/0 205 : cluster [INF] Manager daemon controller-1 is now available
2020-01-13 16:50:04.576222 mon.controller mon.0 [abcd:204::1]:6789/0 208 : cluster [WRN] Health check failed: Reduced data availability: 32 pgs inactive, 64 pgs down (PG_AVAILABILITY)
2020-01-13 16:50:04.576262 mon.controller mon.0 [abcd:204::1]:6789/0 209 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 64 pgs undersized)
2020-01-13 16:51:01.905472 mon.controller mon.0 [abcd:204::1]:6789/0 217 : cluster [WRN] Health check update: Reduced data availability: 64 pgs inactive, 64 pgs down (PG_AVAILABILITY)
2020-01-13 16:55:21.210652 mon.controller mon.0 [abcd:204::1]:6789/0 265 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2020-01-13 16:55:21.210702 mon.controller mon.0 [abcd:204::1]:6789/0 266 : cluster [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
2020-01-13 16:55:21.214586 mon.controller mon.0 [abcd:204::1]:6789/0 267 : cluster [INF] osd.0 [abcd:204::2]:6800/84862 boot
2020-01-13 16:55:22.218598 mon.controller mon.0 [abcd:204::1]:6789/0 272 : cluster [INF] osd.1 failed (root=storage-tier,chassis=group-0,host=controller-1) (connection refused reported by osd.0)
2020-01-13 16:55:23.215140 mon.controller mon.0 [abcd:204::1]:6789/0 277 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2020-01-13 16:55:23.215181 mon.controller mon.0 [abcd:204::1]:6789/0 278 : cluster [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
2020-01-13 16:55:24.217619 mon.controller mon.0 [abcd:204::1]:6789/0 280 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2020-01-13 16:55:24.217654 mon.controller mon.0 [abcd:204::1]:6789/0 281 : cluster [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
2020-01-13 16:55:24.221437 mon.controller mon.0 [abcd:204::1]:6789/0 282 : cluster [INF] osd.1 [abcd:204::1]:6800/212907 boot
2020-01-13 16:55:22.217468 osd.1 osd.1 [abcd:204::1]:6800/212907 1 : cluster [ERR] map e42 had wrong cluster addr ([abcd:204::1]:6801/212907 != my [abcd:204::4]:6801/212907)
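The final [ERR] line above is the key signature: the OSD map carries the floating IP while the OSD itself binds the node's own address. A small illustrative parser (hypothetical helper, not part of the product) pulls the two mismatched addresses out of that line:

```python
import re

# The [ERR] line from the log excerpt above.
ERR_LINE = ("2020-01-13 16:55:22.217468 osd.1 osd.1 [abcd:204::1]:6800/212907 1 : "
            "cluster [ERR] map e42 had wrong cluster addr "
            "([abcd:204::1]:6801/212907 != my [abcd:204::4]:6801/212907)")

def parse_addr_mismatch(line):
    """Return (map_addr, actual_addr) from a 'wrong cluster addr' log line,
    or None if the line does not match."""
    m = re.search(r"wrong cluster addr \((\S+) != my (\S+)\)", line)
    return m.groups() if m else None

map_addr, my_addr = parse_addr_mismatch(ERR_LINE)
# map_addr is the floating IP recorded in the OSD map ([abcd:204::1]:...),
# my_addr is the address the OSD actually bound ([abcd:204::4]:...),
# which is the mismatch left behind after the swact.
```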