transient mon<->osd connectivity HEALTH_WARN events don't self clear in 13.2.4

Bug #1819437 reported by Gareth Woolridge
This bug affects 1 person
Affects        Status        Importance  Assigned to  Milestone
ceph (Ubuntu)  Fix Released  Undecided   Unassigned
Xenial         Invalid       Undecided   Unassigned
Bionic         Fix Released  Undecided   Dan Hill
Eoan           Fix Released  Undecided   Dan Hill
Focal          Fix Released  Undecided   Unassigned

Bug Description

In a recently Juju-deployed 13.2.4 ceph cluster (part of an OpenStack Rocky deploy) we experienced a HEALTH_WARN event that appeared to be associated with a short planned network outage, but that did not clear without human intervention:

    health: HEALTH_WARN
            6 slow ops, oldest one blocked for 112899 sec, daemons [mon.shinx,mon.sliggoo] have slow ops.
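
For monitoring purposes, the warning text above can be broken into its parts. The following is a minimal sketch, assuming the message keeps exactly the wording shown in this report; the function name `parse_slow_ops` and the regular expression are illustrative and not part of any ceph tooling.

```python
import re

# Pattern modelled on the HEALTH_WARN line quoted in this report (assumption:
# other ceph releases may word the message differently).
SLOW_OPS_RE = re.compile(
    r"(?P<count>\d+) slow ops, oldest one blocked for (?P<age>\d+) sec, "
    r"daemons \[(?P<daemons>[^\]]*)\] have slow ops"
)


def parse_slow_ops(message):
    """Return count, blocked time and daemon list, or None if no match."""
    m = SLOW_OPS_RE.search(message)
    if m is None:
        return None
    daemons = [d.strip() for d in m.group("daemons").split(",") if d.strip()]
    return {
        "count": int(m.group("count")),
        "blocked_sec": int(m.group("age")),
        "daemons": daemons,
    }


summary = parse_slow_ops(
    "6 slow ops, oldest one blocked for 112899 sec, "
    "daemons [mon.shinx,mon.sliggoo] have slow ops."
)
print(summary)
```

A blocked time of 112899 seconds is a little over 31 hours, far longer than the short network outage that triggered it.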

We can correlate this back to a known network event, but all OSDs are up and the cluster otherwise looks healthy:

ubuntu@juju-df624b-4-lxd-14:~$ sudo ceph osd tree
ID  CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
 -1       7.64076 root default
-13       0.90970     host happiny
  8   hdd 0.90970         osd.8        up  1.00000 1.00000
 -5       0.90970     host jynx
  9   hdd 0.90970         osd.9        up  1.00000 1.00000
 -3       1.63739     host piplup
  0   hdd 0.81870         osd.0        up  1.00000 1.00000
  3   hdd 0.81870         osd.3        up  1.00000 1.00000
 -9       1.63739     host raichu
  5   hdd 0.81870         osd.5        up  1.00000 1.00000
  6   hdd 0.81870         osd.6        up  1.00000 1.00000
-11       0.90919     host shinx
  7   hdd 0.90919         osd.7        up  1.00000 1.00000
 -7       1.63739     host sliggoo
  1   hdd 0.81870         osd.1        up  1.00000 1.00000
  4   hdd 0.81870         osd.4        up  1.00000 1.00000

ubuntu@shinx:~$ sudo ceph daemon mon.shinx ops
{
    "ops": [
        {
            "description": "osd_failure(failed timeout osd.0 10.48.2.158:6804/211414 for 31sec e911 v911)",
            "initiated_at": "2019-03-07 00:40:43.282823",
            "age": 113953.696205,
            "duration": 113953.696225,
            "type_data": {
                "events": [
                    {
                        "time": "2019-03-07 00:40:43.282823",
                        "event": "initiated"
                    },
                    {
                        "time": "2019-03-07 00:40:43.282823",
                        "event": "header_read"
                    },
                    {
                        "time": "0.000000",
                        "event": "throttled"
                    },
                    {
                        "time": "0.000000",
                        "event": "all_read"
                    },
                    {
                        "time": "0.000000",
                        "event": "dispatched"
                    },
                    {
                        "time": "2019-03-07 00:40:43.283360",
                        "event": "mon:_ms_dispatch"
                    },
                    {
                        "time": "2019-03-07 00:40:43.283360",
                        "event": "mon:dispatch_op"
                    },
                    {
                        "time": "2019-03-07 00:40:43.283360",
                        "event": "psvc:dispatch"
                    },
                    {
                        "time": "2019-03-07 00:40:43.283370",
                        "event": "osdmap:preprocess_query"
                    },
                    {
                        "time": "2019-03-07 00:40:43.283371",
                        "event": "osdmap:preprocess_failure"
                    },
                    {
                        "time": "2019-03-07 00:40:43.283386",
                        "event": "osdmap:prepare_update"
                    },
                    {
                        "time": "2019-03-07 00:40:43.283386",
                        "event": "osdmap:prepare_failure"
                    }
                ],
                "info": {
                    "seq": 48576937,
                    "src_is_mon": false,
                    "source": "osd.8 10.48.2.206:6800/1226277",
                    "forwarded_to_leader": false
                }
            }
        },
        {
            "description": "osd_failure(failed timeout osd.3 10.48.2.158:6800/211410 for 31sec e911 v911)",
            "initiated_at": "2019-03-07 00:40:43.282997",
            "age": 113953.696032,
            "duration": 113953.696127,
            "type_data": {
                "events": [
                    {
                        "time": "2019-03-07 00:40:43.282997",
                        "event": "initiated"
                    },
                    {
                        "time": "2019-03-07 00:40:43.282997",
                        "event": "header_read"
                    },
                    {
                        "time": "0.000000",
                        "event": "throttled"
                    },
                    {
                        "time": "0.000000",
                        "event": "all_read"
                    },
                    {
                        "time": "0.000000",
                        "event": "dispatched"
                    },
                    {
                        "time": "2019-03-07 00:40:43.284394",
                        "event": "mon:_ms_dispatch"
                    },
                    {
                        "time": "2019-03-07 00:40:43.284395",
                        "event": "mon:dispatch_op"
                    },
                    {
                        "time": "2019-03-07 00:40:43.284395",
                        "event": "psvc:dispatch"
                    },
                    {
                        "time": "2019-03-07 00:40:43.284402",
                        "event": "osdmap:preprocess_query"
                    },
                    {
                        "time": "2019-03-07 00:40:43.284403",
                        "event": "osdmap:preprocess_failure"
                    },
                    {
                        "time": "2019-03-07 00:40:43.284416",
                        "event": "osdmap:prepare_update"
                    },
                    {
                        "time": "2019-03-07 00:40:43.284417",
                        "event": "osdmap:prepare_failure"
                    }
                ],
                "info": {
                    "seq": 48576958,
                    "src_is_mon": false,
                    "source": "osd.8 10.48.2.206:6800/1226277",
                    "forwarded_to_leader": false
                }
            }
        },
        {
            "description": "osd_failure(failed timeout osd.7 10.48.2.157:6800/650064 for 1sec e916 v916)",
            "initiated_at": "2019-03-07 00:41:08.839840",
            "age": 113928.139188,
            "duration": 113928.139359,
            "type_data": {
                "events": [
                    {
                        "time": "2019-03-07 00:41:08.839840",
                        "event": "initiated"
                    },
                    {
                        "time": "2019-03-07 00:41:08.839840",
                        "event": "header_read"
                    },
                    {
                        "time": "0.000000",
                        "event": "throttled"
                    },
                    {
                        "time": "0.000000",
                        "event": "all_read"
                    },
                    {
                        "time": "0.000000",
                        "event": "dispatched"
                    },
                    {
                        "time": "2019-03-07 00:41:08.840040",
                        "event": "mon:_ms_dispatch"
                    },
                    {
                        "time": "2019-03-07 00:41:08.840040",
                        "event": "mon:dispatch_op"
                    },
                    {
                        "time": "2019-03-07 00:41:08.840040",
                        "event": "psvc:dispatch"
                    },
                    {
                        "time": "2019-03-07 00:41:08.840058",
                        "event": "osdmap:preprocess_query"
                    },
                    {
                        "time": "2019-03-07 00:41:08.840060",
                        "event": "osdmap:preprocess_failure"
                    },
                    {
                        "time": "2019-03-07 00:41:08.840080",
                        "event": "osdmap:prepare_update"
                    },
                    {
                        "time": "2019-03-07 00:41:08.840081",
                        "event": "osdmap:prepare_failure"
                    }
                ],
                "info": {
                    "seq": 48578207,
                    "src_is_mon": false,
                    "source": "osd.6 10.48.2.161:6800/499396",
                    "forwarded_to_leader": false
                }
            }
        }
    ],
    "num_ops": 3
}
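
The dump above shows three osd_failure reports that have been sitting in the mon for more than a day even though every OSD is up. A rough sketch of how such stale entries could be picked out of the JSON follows. The field names (`ops`, `description`, `age`, `type_data.info.source`) are taken from the output in this report; the helper name `stale_osd_failures` and the one-hour threshold are assumptions made for illustration, and the sample data is trimmed from the dump above.

```python
import json
import re

# Matches the description format seen in this report's ops dump.
FAILURE_RE = re.compile(r"osd_failure\(failed timeout (osd\.\d+) ")


def stale_osd_failures(ops_dump, min_age_sec=3600.0):
    """List (reported_osd, reporting_osd, age) for old osd_failure ops.

    min_age_sec is an arbitrary illustrative threshold, not a ceph default.
    """
    stale = []
    for op in ops_dump.get("ops", []):
        m = FAILURE_RE.match(op.get("description", ""))
        if m and op.get("age", 0.0) >= min_age_sec:
            source = op.get("type_data", {}).get("info", {}).get("source", "")
            stale.append((m.group(1), source.split(" ")[0], op["age"]))
    return stale


# Trimmed copy of the dump quoted in this report.
SAMPLE = json.loads("""
{
  "ops": [
    {"description": "osd_failure(failed timeout osd.0 10.48.2.158:6804/211414 for 31sec e911 v911)",
     "age": 113953.696205,
     "type_data": {"info": {"source": "osd.8 10.48.2.206:6800/1226277"}}},
    {"description": "osd_failure(failed timeout osd.3 10.48.2.158:6800/211410 for 31sec e911 v911)",
     "age": 113953.696032,
     "type_data": {"info": {"source": "osd.8 10.48.2.206:6800/1226277"}}},
    {"description": "osd_failure(failed timeout osd.7 10.48.2.157:6800/650064 for 1sec e916 v916)",
     "age": 113928.139188,
     "type_data": {"info": {"source": "osd.6 10.48.2.161:6800/499396"}}}
  ],
  "num_ops": 3
}
""")

stale = stale_osd_failures(SAMPLE)
for reported, reporter, age in stale:
    print(f"{reporter} reported {reported} as failed {age / 3600:.1f} h ago")
```

In practice the input would come from saving the output of `sudo ceph daemon mon.<id> ops` to a file and loading it with `json.load`.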

This looks remarkably like:

https://tracker.ceph.com/issues/24531

I restarted the two affected mons in turn; the cluster returned to HEALTH_OK and the issue did not recur.

Expected behaviour: ceph health should recover from a temporary network event without user interaction.

Dan Hill (hillpd) wrote :

This issue has been resolved upstream:
pr#30519 in 12.2.13
pr#30481 in 13.2.7
pr#30480 in 14.2.5
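
As a rough aid for checking whether a given cluster version already carries the fix, the list above can be turned into a small lookup. This is only a sketch: the helper name `has_fix` is hypothetical, it covers just the three release series named in this comment, and it returns `None` for any other series rather than guessing.

```python
# First fixed point release per major series, as listed in this comment.
FIRST_FIXED = {
    12: (12, 2, 13),  # luminous
    13: (13, 2, 7),   # mimic
    14: (14, 2, 5),   # nautilus
}


def has_fix(version):
    """Return True/False for known series, None when the series is not listed."""
    parts = tuple(int(p) for p in version.split(".")[:3])
    first_fixed = FIRST_FIXED.get(parts[0])
    if first_fixed is None:
        return None
    return parts >= first_fixed


for v in ("13.2.4", "13.2.8", "12.2.12", "14.2.5"):
    print(v, has_fix(v))
```

The affected cluster in this report ran 13.2.4, which predates the mimic fix in 13.2.7.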

The mimic fix has been released, but be advised that upgrading from 13.2.6 -> 13.2.7 may cause OSD crashes [0]. We will be updating our packaging to 13.2.8 to address this issue.

The 12.2.13 and 14.2.7 point releases landed upstream last week. We are working on stable release updates (SRUs) for these packages. You can follow and contribute to the SRU progress at [1] and [2], respectively.

[0] https://tracker.ceph.com/issues/43106
[1] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1861793
[2] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1861789

Eric Desrochers (slashd)
Changed in ceph (Ubuntu Focal):
status: New → Fix Released
Changed in ceph (Ubuntu Xenial):
status: New → Invalid
Dan Hill (hillpd)
Changed in ceph (Ubuntu Bionic):
status: New → In Progress
Changed in ceph (Ubuntu Eoan):
status: New → In Progress
Changed in ceph (Ubuntu Bionic):
assignee: nobody → Dan Hill (hillpd)
Changed in ceph (Ubuntu Eoan):
assignee: nobody → Dan Hill (hillpd)
Dan Hill (hillpd)
Changed in ceph (Ubuntu Eoan):
status: In Progress → Fix Released
Dan Hill (hillpd) wrote :

Posting an update with recent SRU activity.

The 12.2.13 SRU is in progress; the package is held up by a regression tracked in bug 1871820.
The 13.2.8 SRU for rocky and stein is now available in -updates (bug 1864514).
The 14.2.8 SRU for eoan and train is now available in -updates (bug 1861789).

Dan Hill (hillpd) wrote :

The 12.2.13 SRU for bionic and queens is available in -updates (bug 1861793).

Changed in ceph (Ubuntu Bionic):
status: In Progress → Fix Released