2019-08-15 18:33:18 |
Kellen Renshaw |
bug |
|
|
added bug |
2019-09-10 19:52:19 |
Eric Desrochers |
tags |
|
sts |
|
2019-09-10 20:07:56 |
Eric Desrochers |
bug |
|
|
added subscriber Eric Desrochers |
2020-02-13 19:14:47 |
Dan Hill |
ceph (Ubuntu): status |
New |
Triaged |
|
2020-02-13 19:14:52 |
Dan Hill |
ceph (Ubuntu): assignee |
|
Dan Hill (hillpd) |
|
2020-02-13 19:15:01 |
Dan Hill |
ceph (Ubuntu): importance |
Undecided |
Medium |
|
2020-04-10 18:05:16 |
Dan Hill |
summary |
Ceph 12.2.11-0ubuntu0.18.04.2 doesn't honor suicide_grace |
Sharded OpWQ drops suicide_grace after waiting for work |
|
2020-04-10 19:11:52 |
Dan Hill |
description |
Multiple incidents have been seen where ops were blocked for various reasons and the suicide_grace timeout was not observed, meaning that the OSD failed to suicide as expected. |
[Impact]
The Sharded OpWQ will opportunistically wait for more work when processing an
empty queue. While waiting, the heartbeat timeout and suicide_grace values are
modified. On Luminous, the `threadpool_default_timeout` grace is left applied
and suicide_grace is left disabled. On later releases both the grace and
suicide_grace are left disabled.
After finding work, the original work queue grace/suicide_grace values are
not re-applied. This can result in hung operations that do not trigger an OSD
suicide recovery.
The missing suicide recovery was observed on Luminous 12.2.11. The environment
was consistently hitting a known authentication race condition (issue#37778
[0]) due to repeated OSD service restarts on a node exhibiting MCEs from a
faulty DIMM.
The auth race condition would stall pg operations. In some cases the hung ops
would persist for hours without suicide recovery.
[Test Case]
- In-Progress -
Haven't landed on a reliable reproducer. Currently testing the fix by
exercising I/O. Since the fix applies to all versions of Ceph, the plan is to
let this bake in the latest release before considering a back-port.
[Regression Potential]
This fix improves suicide_grace coverage of the Sharded OpWQ.
This change is made in a critical code path that drives client I/O. An OSD
suicide will trigger a service restart and repeated restarts (flapping) will
adversely impact cluster performance.
The fix mitigates risk by keeping the applied suicide_grace value consistent
with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix
is also restricted to the empty queue edge-case that drops the suicide_grace
timeout. The suicide_grace value is only re-applied when work is found after
waiting on an empty queue.
- In-Progress -
The fix will bake upstream on later levels before back-port consideration. |
|
2020-04-10 19:13:44 |
Dan Hill |
description |
[Impact]
The Sharded OpWQ will opportunistically wait for more work when processing an
empty queue. While waiting, the heartbeat timeout and suicide_grace values are
modified. On Luminous, the `threadpool_default_timeout` grace is left applied
and suicide_grace is left disabled. On later releases both the grace and
suicide_grace are left disabled.
After finding work, the original work queue grace/suicide_grace values are
not re-applied. This can result in hung operations that do not trigger an OSD
suicide recovery.
The missing suicide recovery was observed on Luminous 12.2.11. The environment
was consistently hitting a known authentication race condition (issue#37778
[0]) due to repeated OSD service restarts on a node exhibiting MCEs from a
faulty DIMM.
The auth race condition would stall pg operations. In some cases the hung ops
would persist for hours without suicide recovery.
[Test Case]
- In-Progress -
Haven't landed on a reliable reproducer. Currently testing the fix by
exercising I/O. Since the fix applies to all versions of Ceph, the plan is to
let this bake in the latest release before considering a back-port.
[Regression Potential]
This fix improves suicide_grace coverage of the Sharded OpWQ.
This change is made in a critical code path that drives client I/O. An OSD
suicide will trigger a service restart and repeated restarts (flapping) will
adversely impact cluster performance.
The fix mitigates risk by keeping the applied suicide_grace value consistent
with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix
is also restricted to the empty queue edge-case that drops the suicide_grace
timeout. The suicide_grace value is only re-applied when work is found after
waiting on an empty queue.
- In-Progress -
The fix will bake upstream on later levels before back-port consideration. |
[Impact]
The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. On Luminous, the `threadpool_default_timeout` grace is left applied and suicide_grace is left disabled. On later releases both the grace and suicide_grace are left disabled.
After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery.
The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM.
The auth race condition would stall pg operations. In some cases, the hung ops would persist for hours without suicide recovery.
[Test Case]
- In-Progress -
Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all versions of Ceph, the plan is to let this bake in the latest release before considering a back-port.
[Regression Potential]
This fix improves suicide_grace coverage of the Sharded OpWQ.
This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance.
The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty queue edge-case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue.
- In-Progress -
The fix will bake upstream on later levels before back-port consideration. |
|
2020-04-10 19:31:21 |
Dan Hill |
attachment added |
|
ceph_12.2.13-0ubuntu0.18.04.1+20200409sf00238701b1.debdiff https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1840348/+attachment/5351517/+files/ceph_12.2.13-0ubuntu0.18.04.1+20200409sf00238701b1.debdiff |
|
2020-04-10 19:41:44 |
Dan Hill |
nominated for series |
|
Ubuntu Focal |
|
2020-04-10 19:41:44 |
Dan Hill |
bug task added |
|
ceph (Ubuntu Focal) |
|
2020-04-10 19:41:44 |
Dan Hill |
nominated for series |
|
Ubuntu Bionic |
|
2020-04-10 19:41:44 |
Dan Hill |
bug task added |
|
ceph (Ubuntu Bionic) |
|
2020-04-10 19:41:44 |
Dan Hill |
nominated for series |
|
Ubuntu Eoan |
|
2020-04-10 19:41:44 |
Dan Hill |
bug task added |
|
ceph (Ubuntu Eoan) |
|
2020-04-10 19:42:12 |
Dan Hill |
ceph (Ubuntu Bionic): status |
New |
Confirmed |
|
2020-04-10 19:42:17 |
Dan Hill |
ceph (Ubuntu Bionic): assignee |
|
Dan Hill (hillpd) |
|
2020-04-10 19:42:21 |
Dan Hill |
ceph (Ubuntu Eoan): assignee |
|
Dan Hill (hillpd) |
|
2020-04-10 19:42:27 |
Dan Hill |
ceph (Ubuntu Bionic): importance |
Undecided |
Medium |
|
2020-04-10 19:42:30 |
Dan Hill |
ceph (Ubuntu Eoan): importance |
Undecided |
Medium |
|
2020-04-10 19:42:33 |
Dan Hill |
ceph (Ubuntu Eoan): status |
New |
Confirmed |
|
2020-04-10 19:42:36 |
Dan Hill |
ceph (Ubuntu Focal): status |
Triaged |
Confirmed |
|
2020-04-10 19:44:28 |
Dan Hill |
description |
[Impact]
The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. On Luminous, the `threadpool_default_timeout` grace is left applied and suicide_grace is left disabled. On later releases both the grace and suicide_grace are left disabled.
After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery.
The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM.
The auth race condition would stall pg operations. In some cases, the hung ops would persist for hours without suicide recovery.
[Test Case]
- In-Progress -
Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all versions of Ceph, the plan is to let this bake in the latest release before considering a back-port.
[Regression Potential]
This fix improves suicide_grace coverage of the Sharded OpWQ.
This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance.
The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty queue edge-case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue.
- In-Progress -
The fix will bake upstream on later levels before back-port consideration. |
[Impact]
The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. On Luminous, the `threadpool_default_timeout` grace is left applied and suicide_grace is left disabled. On later releases both the grace and suicide_grace are left disabled.
After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery.
The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM.
The auth race condition would stall pg operations. In some cases, the hung ops would persist for hours without suicide recovery.
[Test Case]
- In-Progress -
Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all versions of Ceph, the plan is to let this bake in the latest release before considering a back-port.
[Regression Potential]
This fix improves suicide_grace coverage of the Sharded OpWQ.
This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance.
The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty queue edge-case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue.
- In-Progress -
The fix needs to bake upstream on later levels before back-port consideration. |
|
2020-04-10 20:22:54 |
Ubuntu Foundations Team Bug Bot |
tags |
sts |
patch sts |
|
2020-04-10 20:23:02 |
Ubuntu Foundations Team Bug Bot |
bug |
|
|
added subscriber Ubuntu Sponsors Team |
2020-04-13 22:42:16 |
Dan Hill |
description |
[Impact]
The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. On Luminous, the `threadpool_default_timeout` grace is left applied and suicide_grace is left disabled. On later releases both the grace and suicide_grace are left disabled.
After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery.
The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM.
The auth race condition would stall pg operations. In some cases, the hung ops would persist for hours without suicide recovery.
[Test Case]
- In-Progress -
Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all versions of Ceph, the plan is to let this bake in the latest release before considering a back-port.
[Regression Potential]
This fix improves suicide_grace coverage of the Sharded OpWQ.
This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance.
The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty queue edge-case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue.
- In-Progress -
The fix needs to bake upstream on later levels before back-port consideration. |
[Impact]
The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. The `threadpool_default_timeout` grace is left applied and suicide_grace is disabled.
After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery.
The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM.
The auth race condition would stall pg operations. In some cases, the hung ops would persist for hours without suicide recovery.
[Test Case]
- In-Progress -
Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all versions of Ceph, the plan is to let this bake in the latest release before considering a back-port.
[Regression Potential]
This fix improves suicide_grace coverage of the Sharded OpWQ.
This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance.
The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty queue edge-case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue.
- In-Progress -
The fix needs to bake upstream on later levels before back-port consideration. |
|
2020-04-16 00:52:51 |
Dan Hill |
description |
[Impact]
The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. The `threadpool_default_timeout` grace is left applied and suicide_grace is disabled.
After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery.
The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM.
The auth race condition would stall pg operations. In some cases, the hung ops would persist for hours without suicide recovery.
[Test Case]
- In-Progress -
Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all versions of Ceph, the plan is to let this bake in the latest release before considering a back-port.
[Regression Potential]
This fix improves suicide_grace coverage of the Sharded OpWQ.
This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance.
The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty queue edge-case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue.
- In-Progress -
The fix needs to bake upstream on later levels before back-port consideration. |
[Impact]
The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. The `threadpool_default_timeout` grace is left applied and suicide_grace is disabled.
After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery.
The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM.
The auth race condition would stall pg operations. In some cases, the hung ops would persist for hours without suicide recovery.
[Test Case]
I have not identified a reliable reproducer. Currently testing the fix by exercising I/O.
Recommend letting this bake upstream before considering a back-port.
[Regression Potential]
This fix improves suicide_grace coverage of the Sharded OpWQ.
This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance.
The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty queue edge-case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue.
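The empty-queue edge case described above can be illustrated with a small, self-contained model. This is only a sketch and not the Ceph code or the actual patch: `HeartbeatHandle`, `Shard`, `process_once()` and the numeric timeout values are hypothetical stand-ins chosen for illustration. It assumes the worker's timeouts were armed with the work queue values before entry, that the idle wait switches to a default grace with suicide_grace disabled (0), and that the fix re-applies the work queue values only when work is found after the wait.

```cpp
#include <cassert>
#include <deque>

// Hypothetical stand-in for a worker thread's heartbeat handle.
// A suicide_grace of 0 models "suicide timeout disabled".
struct HeartbeatHandle {
  int grace;
  int suicide_grace;
};

// Hypothetical stand-in for one shard: its queue plus timeout settings.
// All numeric values are illustrative assumptions.
struct Shard {
  std::deque<int> queue;       // pending work items
  int wq_grace = 15;           // work queue heartbeat grace
  int wq_suicide_grace = 150;  // work queue suicide grace
  int default_grace = 60;      // grace applied while waiting for work
};

// Models one pass over the shard. `work_arrives` simulates an item being
// queued while the thread waits on an empty queue. `reapply_after_wait`
// selects the fixed behaviour. Returns true if an item was processed.
bool process_once(Shard& s, HeartbeatHandle& hb, bool work_arrives,
                  bool reapply_after_wait) {
  if (s.queue.empty()) {
    // Idle wait: relax the heartbeat and disable the suicide timeout.
    hb.grace = s.default_grace;
    hb.suicide_grace = 0;
    if (!work_arrives) {
      return false;  // nothing to do, timeouts stay relaxed
    }
    s.queue.push_back(1);  // work shows up during the wait
    if (reapply_after_wait) {
      // Fixed behaviour: restore the values that were in effect on entry.
      hb.grace = s.wq_grace;
      hb.suicide_grace = s.wq_suicide_grace;
    }
  }
  assert(!s.queue.empty());
  s.queue.pop_front();  // the op would be handled here
  return true;
}
```

In this model, calling `process_once(s, hb, true, false)` on an empty shard leaves `hb.suicide_grace` at 0 while the op is handled, which corresponds to the hung operations that never trigger a suicide. With `reapply_after_wait` set to true, the handle carries the work queue values again before the op is processed.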
- In-Progress -
Opened upstream tracker for issue#45076 [1] and fix pr#34575 [2]
[0] https://tracker.ceph.com/issues/37778
[1] https://tracker.ceph.com/issues/45076
[2] https://github.com/ceph/ceph/pull/34575 |
|
2020-08-18 16:59:08 |
Brian Murray |
ceph (Ubuntu Eoan): status |
Confirmed |
Won't Fix |
|
2020-11-30 17:38:54 |
Dan Hill |
ceph (Ubuntu Focal): status |
Confirmed |
Fix Released |
|
2020-11-30 21:11:52 |
Mathew Hodson |
ceph (Ubuntu): status |
Confirmed |
Fix Released |
|
2020-12-03 02:00:02 |
Dan Hill |
bug task added |
|
cloud-archive |
|
2020-12-03 02:02:06 |
Dan Hill |
cloud-archive: assignee |
|
Dan Hill (hillpd) |
|
2020-12-08 18:53:13 |
Billy Olsen |
nominated for series |
|
cloud-archive/train |
|
2020-12-08 18:53:13 |
Billy Olsen |
bug task added |
|
cloud-archive/train |
|
2020-12-08 18:53:13 |
Billy Olsen |
nominated for series |
|
cloud-archive/rocky |
|
2020-12-08 18:53:13 |
Billy Olsen |
bug task added |
|
cloud-archive/rocky |
|
2020-12-08 18:53:13 |
Billy Olsen |
nominated for series |
|
cloud-archive/queens |
|
2020-12-08 18:53:13 |
Billy Olsen |
bug task added |
|
cloud-archive/queens |
|
2020-12-08 18:53:13 |
Billy Olsen |
nominated for series |
|
cloud-archive/stein |
|
2020-12-08 18:53:13 |
Billy Olsen |
bug task added |
|
cloud-archive/stein |
|
2020-12-08 18:57:07 |
Dan Hill |
cloud-archive/train: status |
New |
Fix Released |
|
2020-12-08 18:58:28 |
Dan Hill |
cloud-archive/rocky: status |
New |
Invalid |
|
2020-12-08 18:58:40 |
Dan Hill |
cloud-archive/stein: status |
New |
In Progress |
|
2020-12-08 18:58:44 |
Dan Hill |
cloud-archive/train: assignee |
|
Dan Hill (hillpd) |
|
2020-12-08 18:58:47 |
Dan Hill |
cloud-archive/stein: assignee |
|
Dan Hill (hillpd) |
|
2020-12-08 18:58:50 |
Dan Hill |
cloud-archive/rocky: assignee |
|
Dan Hill (hillpd) |
|
2020-12-08 18:58:54 |
Dan Hill |
cloud-archive/queens: assignee |
|
Dan Hill (hillpd) |
|
2020-12-08 18:59:06 |
Dan Hill |
ceph (Ubuntu Bionic): status |
Confirmed |
In Progress |
|
2020-12-08 18:59:19 |
Dan Hill |
cloud-archive/queens: status |
New |
In Progress |
|
2020-12-08 18:59:33 |
Dan Hill |
cloud-archive: status |
New |
Fix Released |
|
2020-12-08 19:01:04 |
Billy Olsen |
cloud-archive/rocky: status |
Invalid |
Won't Fix |
|
2020-12-17 18:52:37 |
Dan Hill |
cloud-archive/train: importance |
Undecided |
Medium |
|
2020-12-17 18:53:01 |
Dan Hill |
cloud-archive/stein: importance |
Undecided |
Medium |
|
2020-12-17 18:53:06 |
Dan Hill |
cloud-archive/queens: importance |
Undecided |
Medium |
|
2020-12-17 18:53:09 |
Dan Hill |
cloud-archive: importance |
Undecided |
Medium |
|
2020-12-17 18:53:15 |
Dan Hill |
cloud-archive/rocky: importance |
Undecided |
Medium |
|
2021-10-18 15:58:41 |
Dan Hill |
bug |
|
|
added subscriber Dan Hill |
2021-10-18 15:59:35 |
Kellen Renshaw |
cloud-archive/stein: status |
In Progress |
Won't Fix |
|
2021-10-18 16:00:38 |
Kellen Renshaw |
ceph (Ubuntu Bionic): assignee |
Dan Hill (hillpd) |
Kellen Renshaw (krenshaw) |
|
2021-10-18 16:00:44 |
Kellen Renshaw |
cloud-archive/queens: assignee |
Dan Hill (hillpd) |
Kellen Renshaw (krenshaw) |
|
2021-10-18 16:42:15 |
Dan Streetman |
removed subscriber Ubuntu Sponsors Team |
|
|
|
2021-10-26 20:13:35 |
Brian Murray |
ceph (Ubuntu Bionic): status |
In Progress |
Fix Committed |
|
2021-10-26 20:13:38 |
Brian Murray |
bug |
|
|
added subscriber Ubuntu Stable Release Updates Team |
2021-10-26 20:13:41 |
Brian Murray |
bug |
|
|
added subscriber SRU Verification |
2021-10-26 20:13:45 |
Brian Murray |
tags |
patch sts |
patch sts verification-needed verification-needed-bionic |
|