transient mon<->osd connectivity HEALTH_WARN events don't self clear in 13.2.4
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
ceph (Ubuntu) | Fix Released | Undecided | Unassigned |
Xenial | Invalid | Undecided | Unassigned |
Bionic | Fix Released | Undecided | Dan Hill |
Eoan | Fix Released | Undecided | Dan Hill |
Focal | Fix Released | Undecided | Unassigned |
Bug Description
In a recently Juju-deployed Ceph 13.2.4 cluster (part of an OpenStack Rocky deployment), we experienced a non-clearing HEALTH_WARN event that appeared to be associated with a short planned network outage but did not clear without human intervention:
health: HEALTH_WARN
6 slow ops, oldest one blocked for 112899 sec, daemons [mon.shinx,
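For monitoring purposes, the warning line above can be parsed to see how long the oldest op has been stuck. The following is a minimal sketch, assuming the summary keeps the "N slow ops, oldest one blocked for S sec" wording quoted here; the function name `parse_slow_ops` is hypothetical and not part of Ceph.

```python
import re

# Pattern matches the wording of the warning quoted in this report;
# other Ceph releases may phrase it differently (assumption).
SLOW_OPS_RE = re.compile(r"(\d+) slow ops, oldest one blocked for (\d+) sec")


def parse_slow_ops(summary):
    """Return (op_count, blocked_seconds), or None if the line does not match."""
    match = SLOW_OPS_RE.search(summary)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = "6 slow ops, oldest one blocked for 112899 sec, daemons [mon.shinx,"
count, seconds = parse_slow_ops(line)
print(f"{count} slow ops, oldest blocked for {seconds / 3600:.1f} hours")
# prints: 6 slow ops, oldest blocked for 31.4 hours
```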
We can correlate this back to a known network event, but all OSDs are up and the cluster otherwise looks healthy:
ubuntu@
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 7.64076 root default
-13 0.90970 host happiny
8 hdd 0.90970 osd.8 up 1.00000 1.00000
-5 0.90970 host jynx
9 hdd 0.90970 osd.9 up 1.00000 1.00000
-3 1.63739 host piplup
0 hdd 0.81870 osd.0 up 1.00000 1.00000
3 hdd 0.81870 osd.3 up 1.00000 1.00000
-9 1.63739 host raichu
5 hdd 0.81870 osd.5 up 1.00000 1.00000
6 hdd 0.81870 osd.6 up 1.00000 1.00000
-11 0.90919 host shinx
7 hdd 0.90919 osd.7 up 1.00000 1.00000
-7 1.63739 host sliggoo
1 hdd 0.81870 osd.1 up 1.00000 1.00000
4 hdd 0.81870 osd.4 up 1.00000 1.00000
ubuntu@shinx:~$ sudo ceph daemon mon.shinx ops
{
"ops": [
{
"age": 113953.696205,
],
}
}
},
{
"age": 113953.696032,
],
}
}
},
{
"age": 113928.139188,
],
}
}
}
],
"num_ops": 3
}
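To spot stuck ops without reading the dump by hand, the JSON can be filtered by age. This is a minimal sketch, assuming the output carries the `ops` list and a per-op `age` field in seconds as seen above; `stale_ops` and the one-hour threshold are illustrative choices, not part of Ceph.

```python
import json


def stale_ops(dump_text, max_age_sec=3600.0):
    """Return ages (seconds) of ops older than max_age_sec, oldest first."""
    dump = json.loads(dump_text)
    ages = [op["age"] for op in dump.get("ops", []) if "age" in op]
    return sorted((a for a in ages if a > max_age_sec), reverse=True)


# Trimmed-down stand-in for the dump above; only the fields used here are kept.
sample = """
{
  "ops": [
    {"age": 113953.696205},
    {"age": 113953.696032},
    {"age": 113928.139188}
  ],
  "num_ops": 3
}
"""

for age in stale_ops(sample):
    print(f"op blocked for {age / 3600:.1f} hours")
```

In practice the text passed to `stale_ops` would be the output of the `ceph daemon mon.shinx ops` command shown above.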
This looks remarkably like:
https:/
I restarted the two affected mons in turn; the cluster returned to HEALTH_OK and the issue did not recur.
Expected behaviour: ceph health should recover from a temporary network event without user interaction.
Changed in ceph (Ubuntu Focal):
status: New → Fix Released
Changed in ceph (Ubuntu Xenial):
status: New → Invalid
Changed in ceph (Ubuntu Bionic):
status: New → In Progress
Changed in ceph (Ubuntu Eoan):
status: New → In Progress
Changed in ceph (Ubuntu Bionic):
assignee: nobody → Dan Hill (hillpd)
Changed in ceph (Ubuntu Eoan):
assignee: nobody → Dan Hill (hillpd)
Changed in ceph (Ubuntu Eoan):
status: In Progress → Fix Released
This issue has been resolved upstream:
pr#30519 in 12.2.13
pr#30481 in 13.2.7
pr#30480 in 14.2.5
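The fixed releases listed above can be turned into a quick check for whether a given cluster version already carries the fix. This is a minimal sketch, assuming a plain upstream `x.y.z` version string (distro package suffixes would need stripping first); `has_fix` and `FIXED_IN` are hypothetical names used only for illustration.

```python
# Minimum fixed point release per release series, taken from the list above.
FIXED_IN = {
    12: (12, 2, 13),  # luminous
    13: (13, 2, 7),   # mimic
    14: (14, 2, 5),   # nautilus
}


def has_fix(version):
    """Return True/False for a series listed above, None for any other series."""
    parts = tuple(int(p) for p in version.split(".")[:3])
    fixed = FIXED_IN.get(parts[0])
    if fixed is None:
        return None  # series not covered by this report
    return parts >= fixed


for v in ("13.2.4", "13.2.8", "14.2.7"):
    print(v, has_fix(v))
```

The 13.2.4 release from the original report comes out as not fixed, while 13.2.8 and 14.2.7 do.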
The mimic fix has been released, but be advised that upgrading from 13.2.6 -> 13.2.7 may cause OSD crashes [0]. We will be updating our packaging to 13.2.8 to address this issue.
The 12.2.13 and 14.2.7 point releases landed upstream last week. We are working on stable release updates (SRUs) for these packages. You can follow and contribute to the SRU progress at [1] and [2], respectively.
[0] https://tracker.ceph.com/issues/43106
[1] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1861793
[2] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1861789