Activity log for bug #1840348

Date Who What changed Old value New value Message
2019-08-15 18:33:18 Kellen Renshaw bug added bug
2019-09-10 19:52:19 Eric Desrochers tags sts
2019-09-10 20:07:56 Eric Desrochers bug added subscriber Eric Desrochers
2020-02-13 19:14:47 Dan Hill ceph (Ubuntu): status New Triaged
2020-02-13 19:14:52 Dan Hill ceph (Ubuntu): assignee Dan Hill (hillpd)
2020-02-13 19:15:01 Dan Hill ceph (Ubuntu): importance Undecided Medium
2020-04-10 18:05:16 Dan Hill summary Ceph 12.2.11-0ubuntu0.18.04.2 doesn't honor suicide_grace Sharded OpWQ drops suicide_grace after waiting for work
2020-04-10 19:11:52 Dan Hill description Multiple incidents have been seen where ops were blocked for various reasons and the suicide_grace timeout was not observed, meaning that the OSD failed to suicide as expected. [Impact] The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. On Luminous, the `threadpool_default_timeout` grace is left applied and suicide_grace is left disabled. On later releases both the grace and suicide_grace are left disabled. After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery. The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM. The auth race condition would stall PG operations. In some cases the hung ops would persist for hours without suicide recovery. [Test Case] - In-Progress - Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all versions of Ceph, the plan is to let this bake in the latest release before considering a back-port. [Regression Potential] This fix improves suicide_grace coverage of the Sharded OpWQ. This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance. The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty-queue edge case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue. - In-Progress - The fix will bake upstream on later levels before back-port consideration.
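The failure mode described above is easier to see in code. The following is a minimal C++ sketch, not Ceph's actual implementation: the heartbeat handle, timeout values, and queue types are simplified stand-ins for the bookkeeping that `OSD::ShardedOpWQ::_process()` performs around its idle wait.

```cpp
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

// Hypothetical stand-in for Ceph's heartbeat bookkeeping: a watchdog
// elsewhere compares these deadlines against the clock and aborts the
// OSD once suicide_grace is exceeded. A value of 0 disables the check.
struct HeartbeatHandle {
  std::chrono::seconds grace{15};          // soft timeout: warn only
  std::chrono::seconds suicide_grace{150}; // hard timeout: abort the OSD
  void reset(std::chrono::seconds g, std::chrono::seconds sg) {
    grace = g;
    suicide_grace = sg;
  }
};

struct ShardedOpWQSketch {
  std::mutex m;
  std::condition_variable cv;
  std::deque<std::function<void()>> q;
  HeartbeatHandle hb;

  // Simplified _process() loop illustrating the bug described above.
  void process_loop() {
    for (;;) {
      std::unique_lock<std::mutex> l(m);
      if (q.empty()) {
        // While idle, relax the timeouts so a quiet shard is not
        // mistaken for a hung one (suicide_grace disabled entirely).
        hb.reset(std::chrono::seconds(30), std::chrono::seconds(0));
        cv.wait(l, [this] { return !q.empty(); });
        // BUG: work was found, but the original work-queue
        // grace/suicide_grace are never re-applied here.
      }
      auto op = std::move(q.front());
      q.pop_front();
      l.unlock();
      op(); // if this op hangs, no suicide recovery ever fires
    }
  }
};
```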
2020-04-10 19:13:44 Dan Hill description [Impact] The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. On Luminous, the `threadpool_default_timeout` grace is left applied and suicide_grace is left disabled. On later releases both the grace and suicide_grace are left disabled. After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery. The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM. The auth race condition would stall PG operations. In some cases the hung ops would persist for hours without suicide recovery. [Test Case] - In-Progress - Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all versions of Ceph, the plan is to let this bake in the latest release before considering a back-port. [Regression Potential] This fix improves suicide_grace coverage of the Sharded OpWQ. This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance. The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty-queue edge case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue. - In-Progress - The fix will bake upstream on later levels before back-port consideration. [Impact] The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. On Luminous, the `threadpool_default_timeout` grace is left applied and suicide_grace is left disabled. On later releases both the grace and suicide_grace are left disabled. After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery. The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM. The auth race condition would stall PG operations. In some cases, the hung ops would persist for hours without suicide recovery. [Test Case] - In-Progress - Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all versions of Ceph, the plan is to let this bake in the latest release before considering a back-port. [Regression Potential] This fix improves suicide_grace coverage of the Sharded OpWQ. This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance. The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty-queue edge case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue. - In-Progress - The fix will bake upstream on later levels before back-port consideration.
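For context on the "suicide recovery" the description refers to: a watchdog periodically checks each worker thread's heartbeat, and a breach of suicide_grace aborts the daemon so the service manager can restart it with a clean slate. A hedged sketch of that check follows, with illustrative names and values rather than Ceph's actual HeartbeatMap API.

```cpp
#include <chrono>
#include <cstdio>
#include <cstdlib>

using Clock = std::chrono::steady_clock;

// Hypothetical per-worker heartbeat record: last_touch is refreshed
// whenever the worker makes progress.
struct WorkerHeartbeat {
  Clock::time_point last_touch = Clock::now();
  std::chrono::seconds grace{15};
  std::chrono::seconds suicide_grace{150};
};

// Watchdog check: exceeding grace only logs a warning; exceeding
// suicide_grace aborts the daemon so the service manager restarts it.
// With suicide_grace left at 0 (disabled), a hung worker is never
// recovered -- the failure mode in this bug.
void check_heartbeat(const WorkerHeartbeat& w) {
  const auto idle = Clock::now() - w.last_touch;
  if (w.grace.count() > 0 && idle > w.grace)
    std::fprintf(stderr, "heartbeat_map: worker timed out\n");
  if (w.suicide_grace.count() > 0 && idle > w.suicide_grace)
    std::abort(); // "suicide": crash so systemd restarts the OSD
}
```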
2020-04-10 19:31:21 Dan Hill attachment added ceph_12.2.13-0ubuntu0.18.04.1+20200409sf00238701b1.debdiff https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1840348/+attachment/5351517/+files/ceph_12.2.13-0ubuntu0.18.04.1+20200409sf00238701b1.debdiff
2020-04-10 19:41:44 Dan Hill nominated for series Ubuntu Focal
2020-04-10 19:41:44 Dan Hill bug task added ceph (Ubuntu Focal)
2020-04-10 19:41:44 Dan Hill nominated for series Ubuntu Bionic
2020-04-10 19:41:44 Dan Hill bug task added ceph (Ubuntu Bionic)
2020-04-10 19:41:44 Dan Hill nominated for series Ubuntu Eoan
2020-04-10 19:41:44 Dan Hill bug task added ceph (Ubuntu Eoan)
2020-04-10 19:42:12 Dan Hill ceph (Ubuntu Bionic): status New Confirmed
2020-04-10 19:42:17 Dan Hill ceph (Ubuntu Bionic): assignee Dan Hill (hillpd)
2020-04-10 19:42:21 Dan Hill ceph (Ubuntu Eoan): assignee Dan Hill (hillpd)
2020-04-10 19:42:27 Dan Hill ceph (Ubuntu Bionic): importance Undecided Medium
2020-04-10 19:42:30 Dan Hill ceph (Ubuntu Eoan): importance Undecided Medium
2020-04-10 19:42:33 Dan Hill ceph (Ubuntu Eoan): status New Confirmed
2020-04-10 19:42:36 Dan Hill ceph (Ubuntu Focal): status Triaged Confirmed
2020-04-10 19:44:28 Dan Hill description [Impact] The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. On Luminous, the `threadpool_default_timeout` grace is left applied and suicide_grace is left disabled. On later releases both the grace and suicide_grace are left disabled. After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery. The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM. The auth race condition would stall PG operations. In some cases, the hung ops would persist for hours without suicide recovery. [Test Case] - In-Progress - Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all versions of Ceph, the plan is to let this bake in the latest release before considering a back-port. [Regression Potential] This fix improves suicide_grace coverage of the Sharded OpWQ. This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance. The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty-queue edge case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue. - In-Progress - The fix will bake upstream on later levels before back-port consideration. [Impact] The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. On Luminous, the `threadpool_default_timeout` grace is left applied and suicide_grace is left disabled. On later releases both the grace and suicide_grace are left disabled. After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery. The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM. The auth race condition would stall PG operations. In some cases, the hung ops would persist for hours without suicide recovery. [Test Case] - In-Progress - Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all versions of Ceph, the plan is to let this bake in the latest release before considering a back-port. [Regression Potential] This fix improves suicide_grace coverage of the Sharded OpWQ. This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance. The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty-queue edge case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue. - In-Progress - The fix needs to bake upstream on later levels before back-port consideration.
2020-04-10 20:22:54 Ubuntu Foundations Team Bug Bot tags sts patch sts
2020-04-10 20:23:02 Ubuntu Foundations Team Bug Bot bug added subscriber Ubuntu Sponsors Team
2020-04-13 22:42:16 Dan Hill description [Impact] The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. On Luminous, the `threadpool_default_timeout` grace is left applied and suicide_grace is left disabled. On later releases both the grace and suicide_grace are left disabled. After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery. The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM. The auth race condition would stall PG operations. In some cases, the hung ops would persist for hours without suicide recovery. [Test Case] - In-Progress - Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all versions of Ceph, the plan is to let this bake in the latest release before considering a back-port. [Regression Potential] This fix improves suicide_grace coverage of the Sharded OpWQ. This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance. The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty-queue edge case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue. - In-Progress - The fix needs to bake upstream on later levels before back-port consideration. [Impact] The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. The `threadpool_default_timeout` grace is left applied and suicide_grace is disabled. After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery. The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM. The auth race condition would stall PG operations. In some cases, the hung ops would persist for hours without suicide recovery. [Test Case] - In-Progress - Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all versions of Ceph, the plan is to let this bake in the latest release before considering a back-port. [Regression Potential] This fix improves suicide_grace coverage of the Sharded OpWQ. This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance. The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty-queue edge case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue. - In-Progress - The fix needs to bake upstream on later levels before back-port consideration.
2020-04-16 00:52:51 Dan Hill description [Impact] The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. The `threadpool_default_timeout` grace is left applied and suicide_grace is disabled. After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery. The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM. The auth race condition would stall PG operations. In some cases, the hung ops would persist for hours without suicide recovery. [Test Case] - In-Progress - Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all versions of Ceph, the plan is to let this bake in the latest release before considering a back-port. [Regression Potential] This fix improves suicide_grace coverage of the Sharded OpWQ. This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance. The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty-queue edge case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue. - In-Progress - The fix needs to bake upstream on later levels before back-port consideration. [Impact] The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. The `threadpool_default_timeout` grace is left applied and suicide_grace is disabled. After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery. The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM. The auth race condition would stall PG operations. In some cases, the hung ops would persist for hours without suicide recovery. [Test Case] I have not identified a reliable reproducer. Currently testing the fix by exercising I/O. Recommend letting this bake upstream before considering a back-port. [Regression Potential] This fix improves suicide_grace coverage of the Sharded OpWQ. This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance. The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty-queue edge case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue. - In-Progress - Opened upstream tracker for issue#45076 [1] and fix pr#34575 [2] [0] https://tracker.ceph.com/issues/37778 [1] https://tracker.ceph.com/issues/45076 [2] https://github.com/ceph/ceph/pull/34575
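With the upstream tracker and fix PR now linked, the shape of the change as the description states it (re-apply the pre-`_process()` suicide_grace once work is found after waiting on an empty queue) can be sketched by extending the earlier `ShardedOpWQSketch`. This is an illustrative sketch under those assumptions, not the actual upstream patch in pr#34575.

```cpp
// Added as a member of the earlier ShardedOpWQSketch. Names and
// timeout values remain hypothetical stand-ins, not Ceph identifiers.
void process_loop_fixed() {
  for (;;) {
    std::unique_lock<std::mutex> l(m);
    if (q.empty()) {
      // Idle wait still relaxes the timeouts, exactly as before.
      hb.reset(std::chrono::seconds(30), std::chrono::seconds(0));
      cv.wait(l, [this] { return !q.empty(); });
      // FIX: restore the work-queue grace/suicide_grace now that work
      // was found, so a hung op once again trips the suicide watchdog.
      hb.reset(std::chrono::seconds(15), std::chrono::seconds(150));
    }
    auto op = std::move(q.front());
    q.pop_front();
    l.unlock();
    op();
  }
}
```

Confining the re-application to the empty-queue wait path keeps the change narrow, which is the risk mitigation the [Regression Potential] section describes.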
2020-08-18 16:59:08 Brian Murray ceph (Ubuntu Eoan): status Confirmed Won't Fix
2020-11-30 17:38:54 Dan Hill ceph (Ubuntu Focal): status Confirmed Fix Released
2020-11-30 21:11:52 Mathew Hodson ceph (Ubuntu): status Confirmed Fix Released
2020-12-03 02:00:02 Dan Hill bug task added cloud-archive
2020-12-03 02:02:06 Dan Hill cloud-archive: assignee Dan Hill (hillpd)
2020-12-08 18:53:13 Billy Olsen nominated for series cloud-archive/train
2020-12-08 18:53:13 Billy Olsen bug task added cloud-archive/train
2020-12-08 18:53:13 Billy Olsen nominated for series cloud-archive/rocky
2020-12-08 18:53:13 Billy Olsen bug task added cloud-archive/rocky
2020-12-08 18:53:13 Billy Olsen nominated for series cloud-archive/queens
2020-12-08 18:53:13 Billy Olsen bug task added cloud-archive/queens
2020-12-08 18:53:13 Billy Olsen nominated for series cloud-archive/stein
2020-12-08 18:53:13 Billy Olsen bug task added cloud-archive/stein
2020-12-08 18:57:07 Dan Hill cloud-archive/train: status New Fix Released
2020-12-08 18:58:28 Dan Hill cloud-archive/rocky: status New Invalid
2020-12-08 18:58:40 Dan Hill cloud-archive/stein: status New In Progress
2020-12-08 18:58:44 Dan Hill cloud-archive/train: assignee Dan Hill (hillpd)
2020-12-08 18:58:47 Dan Hill cloud-archive/stein: assignee Dan Hill (hillpd)
2020-12-08 18:58:50 Dan Hill cloud-archive/rocky: assignee Dan Hill (hillpd)
2020-12-08 18:58:54 Dan Hill cloud-archive/queens: assignee Dan Hill (hillpd)
2020-12-08 18:59:06 Dan Hill ceph (Ubuntu Bionic): status Confirmed In Progress
2020-12-08 18:59:19 Dan Hill cloud-archive/queens: status New In Progress
2020-12-08 18:59:33 Dan Hill cloud-archive: status New Fix Released
2020-12-08 19:01:04 Billy Olsen cloud-archive/rocky: status Invalid Won't Fix
2020-12-17 18:52:37 Dan Hill cloud-archive/train: importance Undecided Medium
2020-12-17 18:53:01 Dan Hill cloud-archive/stein: importance Undecided Medium
2020-12-17 18:53:06 Dan Hill cloud-archive/queens: importance Undecided Medium
2020-12-17 18:53:09 Dan Hill cloud-archive: importance Undecided Medium
2020-12-17 18:53:15 Dan Hill cloud-archive/rocky: importance Undecided Medium
2021-10-18 15:58:41 Dan Hill bug added subscriber Dan Hill
2021-10-18 15:59:35 Kellen Renshaw cloud-archive/stein: status In Progress Won't Fix
2021-10-18 16:00:38 Kellen Renshaw ceph (Ubuntu Bionic): assignee Dan Hill (hillpd) Kellen Renshaw (krenshaw)
2021-10-18 16:00:44 Kellen Renshaw cloud-archive/queens: assignee Dan Hill (hillpd) Kellen Renshaw (krenshaw)
2021-10-18 16:42:15 Dan Streetman removed subscriber Ubuntu Sponsors Team
2021-10-26 20:13:35 Brian Murray ceph (Ubuntu Bionic): status In Progress Fix Committed
2021-10-26 20:13:38 Brian Murray bug added subscriber Ubuntu Stable Release Updates Team
2021-10-26 20:13:41 Brian Murray bug added subscriber SRU Verification
2021-10-26 20:13:45 Brian Murray tags patch sts patch sts verification-needed verification-needed-bionic