Activity log for bug #1840348

Date Who What changed Old value New value Message
2019-08-15 18:33:18 Kellen Renshaw bug added bug
2019-09-10 19:52:19 Eric Desrochers tags sts
2019-09-10 20:07:56 Eric Desrochers bug added subscriber Eric Desrochers
2020-02-13 19:14:47 Dan Hill ceph (Ubuntu): status New Triaged
2020-02-13 19:14:52 Dan Hill ceph (Ubuntu): assignee Dan Hill (hillpd)
2020-02-13 19:15:01 Dan Hill ceph (Ubuntu): importance Undecided Medium
2020-04-10 18:05:16 Dan Hill summary Ceph 12.2.11-0ubuntu0.18.04.2 doesn't honor suicide_grace Sharded OpWQ drops suicide_grace after waiting for work
2020-04-10 19:11:52 Dan Hill description Multiple incidents have been seen where ops were blocked for various reasons and the suicide_grace timeout was not observed, meaning that the OSD failed to suicide as expected. [Impact] The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. On Luminous, the `threadpool_default_timeout` grace is left applied and suicide_grace is left disabled. On later releases both the grace and suicide_grace are left disabled. After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery. The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM. The auth race condition would stall PG operations. In some cases the hung ops would persist for hours without suicide recovery. [Test Case] - In-Progress - Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all versions of Ceph, the plan is to let this bake in the latest release before considering a back-port. [Regression Potential] This fix improves suicide_grace coverage of the Sharded OpWQ. This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance. The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty-queue edge case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue. - In-Progress - The fix will bake upstream on later levels before back-port consideration.
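The failure mode described above is easier to see in code. The following is a minimal C++ sketch, not Ceph's actual implementation: the heartbeat handle, timeout values, and queue types are simplified stand-ins for the bookkeeping that `OSD::ShardedOpWQ::_process()` performs around its idle wait.

```cpp
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

// Hypothetical stand-in for Ceph's heartbeat bookkeeping: a watchdog
// elsewhere compares these deadlines against the clock and aborts the
// OSD once suicide_grace is exceeded. A value of 0 disables the check.
struct HeartbeatHandle {
  std::chrono::seconds grace{15};          // soft timeout: warn only
  std::chrono::seconds suicide_grace{150}; // hard timeout: abort the OSD
  void reset(std::chrono::seconds g, std::chrono::seconds sg) {
    grace = g;
    suicide_grace = sg;
  }
};

struct ShardedOpWQSketch {
  std::mutex m;
  std::condition_variable cv;
  std::deque<std::function<void()>> q;
  HeartbeatHandle hb;

  // Simplified _process() loop illustrating the bug described above.
  void process_loop() {
    for (;;) {
      std::unique_lock<std::mutex> l(m);
      if (q.empty()) {
        // While idle, relax the timeouts so a quiet shard is not
        // mistaken for a hung one (suicide_grace disabled entirely).
        hb.reset(std::chrono::seconds(30), std::chrono::seconds(0));
        cv.wait(l, [this] { return !q.empty(); });
        // BUG: work was found, but the original work-queue
        // grace/suicide_grace are never re-applied here.
      }
      auto op = std::move(q.front());
      q.pop_front();
      l.unlock();
      op(); // if this op hangs, no suicide recovery ever fires
    }
  }
};
```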
2020-04-10 19:13:44 Dan Hill description [Impact] The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. On Luminous, the `threadpool_default_timeout` grace is left applied and suicide_grace is left disabled. On later releases both the grace and suicide_grace are left disabled. After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery. The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM. The auth race condition would stall PG operations. In some cases the hung ops would persist for hours without suicide recovery. [Test Case] - In-Progress - Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all versions of Ceph, the plan is to let this bake in the latest release before considering a back-port. [Regression Potential] This fix improves suicide_grace coverage of the Sharded OpWQ. This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance. The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty-queue edge case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue. - In-Progress - The fix will bake upstream on later levels before back-port consideration. [Impact] The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. On Luminous, the `threadpool_default_timeout` grace is left applied and suicide_grace is left disabled. On later releases both the grace and suicide_grace are left disabled. After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery. The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM. The auth race condition would stall PG operations. In some cases, the hung ops would persist for hours without suicide recovery. [Test Case] - In-Progress - Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all versions of Ceph, the plan is to let this bake in the latest release before considering a back-port. [Regression Potential] This fix improves suicide_grace coverage of the Sharded OpWQ. This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance. The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty-queue edge case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue. - In-Progress - The fix will bake upstream on later levels before back-port consideration.
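For context on the "suicide recovery" the description refers to: a watchdog periodically checks each worker thread's heartbeat, and a breach of suicide_grace aborts the daemon so the service manager can restart it with a clean slate. A hedged sketch of that check follows, with illustrative names and values rather than Ceph's actual HeartbeatMap API.

```cpp
#include <chrono>
#include <cstdio>
#include <cstdlib>

using Clock = std::chrono::steady_clock;

// Hypothetical per-worker heartbeat record: last_touch is refreshed
// whenever the worker makes progress.
struct WorkerHeartbeat {
  Clock::time_point last_touch = Clock::now();
  std::chrono::seconds grace{15};
  std::chrono::seconds suicide_grace{150};
};

// Watchdog check: exceeding grace only logs a warning; exceeding
// suicide_grace aborts the daemon so the service manager restarts it.
// With suicide_grace left at 0 (disabled), a hung worker is never
// recovered -- the failure mode in this bug.
void check_heartbeat(const WorkerHeartbeat& w) {
  const auto idle = Clock::now() - w.last_touch;
  if (w.grace.count() > 0 && idle > w.grace)
    std::fprintf(stderr, "heartbeat_map: worker timed out\n");
  if (w.suicide_grace.count() > 0 && idle > w.suicide_grace)
    std::abort(); // "suicide": crash so systemd restarts the OSD
}
```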
2020-04-10 19:31:21 Dan Hill attachment added ceph_12.2.13-0ubuntu0.18.04.1+20200409sf00238701b1.debdiff https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1840348/+attachment/5351517/+files/ceph_12.2.13-0ubuntu0.18.04.1+20200409sf00238701b1.debdiff
2020-04-10 19:41:44 Dan Hill nominated for series Ubuntu Focal
2020-04-10 19:41:44 Dan Hill bug task added ceph (Ubuntu Focal)
2020-04-10 19:41:44 Dan Hill nominated for series Ubuntu Bionic
2020-04-10 19:41:44 Dan Hill bug task added ceph (Ubuntu Bionic)
2020-04-10 19:41:44 Dan Hill nominated for series Ubuntu Eoan
2020-04-10 19:41:44 Dan Hill bug task added ceph (Ubuntu Eoan)
2020-04-10 19:42:12 Dan Hill ceph (Ubuntu Bionic): status New Confirmed
2020-04-10 19:42:17 Dan Hill ceph (Ubuntu Bionic): assignee Dan Hill (hillpd)
2020-04-10 19:42:21 Dan Hill ceph (Ubuntu Eoan): assignee Dan Hill (hillpd)
2020-04-10 19:42:27 Dan Hill ceph (Ubuntu Bionic): importance Undecided Medium
2020-04-10 19:42:30 Dan Hill ceph (Ubuntu Eoan): importance Undecided Medium
2020-04-10 19:42:33 Dan Hill ceph (Ubuntu Eoan): status New Confirmed
2020-04-10 19:42:36 Dan Hill ceph (Ubuntu Focal): status Triaged Confirmed
2020-04-10 19:44:28 Dan Hill description [Impact] The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. On Luminous, the `threadpool_default_timeout` grace is left applied and suicide_grace is left disabled. On later releases both the grace and suicide_grace are left disabled. After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery. The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM. The auth race condition would stall PG operations. In some cases, the hung ops would persist for hours without suicide recovery. [Test Case] - In-Progress - Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all versions of Ceph, the plan is to let this bake in the latest release before considering a back-port. [Regression Potential] This fix improves suicide_grace coverage of the Sharded OpWQ. This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance. The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty-queue edge case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue. - In-Progress - The fix will bake upstream on later levels before back-port consideration. [Impact] The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. On Luminous, the `threadpool_default_timeout` grace is left applied and suicide_grace is left disabled. On later releases both the grace and suicide_grace are left disabled. After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery. The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM. The auth race condition would stall PG operations. In some cases, the hung ops would persist for hours without suicide recovery. [Test Case] - In-Progress - Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all versions of Ceph, the plan is to let this bake in the latest release before considering a back-port. [Regression Potential] This fix improves suicide_grace coverage of the Sharded OpWQ. This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance. The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty-queue edge case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue. - In-Progress - The fix needs to bake upstream on later levels before back-port consideration.
2020-04-10 20:22:54 Ubuntu Foundations Team Bug Bot tags sts patch sts
2020-04-10 20:23:02 Ubuntu Foundations Team Bug Bot bug added subscriber Ubuntu Sponsors Team
2020-04-13 22:42:16 Dan Hill description [Impact] The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. On Luminous, the `threadpool_default_timeout` grace is left applied and suicide_grace is left disabled. On later releases both the grace and suicide_grace are left disabled. After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery. The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM. The auth race condition would stall PG operations. In some cases, the hung ops would persist for hours without suicide recovery. [Test Case] - In-Progress - Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all versions of Ceph, the plan is to let this bake in the latest release before considering a back-port. [Regression Potential] This fix improves suicide_grace coverage of the Sharded OpWQ. This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance. The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty-queue edge case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue. - In-Progress - The fix needs to bake upstream on later levels before back-port consideration. [Impact] The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. The `threadpool_default_timeout` grace is left applied and suicide_grace is disabled. After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery. The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM. The auth race condition would stall PG operations. In some cases, the hung ops would persist for hours without suicide recovery. [Test Case] - In-Progress - Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all versions of Ceph, the plan is to let this bake in the latest release before considering a back-port. [Regression Potential] This fix improves suicide_grace coverage of the Sharded OpWQ. This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance. The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty-queue edge case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue. - In-Progress - The fix needs to bake upstream on later levels before back-port consideration.
2020-04-16 00:52:51 Dan Hill description [Impact] The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. The `threadpool_default_timeout` grace is left applied and suicide_grace is disabled. After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery. The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM. The auth race condition would stall PG operations. In some cases, the hung ops would persist for hours without suicide recovery. [Test Case] - In-Progress - Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all versions of Ceph, the plan is to let this bake in the latest release before considering a back-port. [Regression Potential] This fix improves suicide_grace coverage of the Sharded OpWQ. This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance. The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty-queue edge case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue. - In-Progress - The fix needs to bake upstream on later levels before back-port consideration. [Impact] The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. The `threadpool_default_timeout` grace is left applied and suicide_grace is disabled. After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery. The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM. The auth race condition would stall PG operations. In some cases, the hung ops would persist for hours without suicide recovery. [Test Case] I have not identified a reliable reproducer. Currently testing the fix by exercising I/O. Recommend letting this bake upstream before considering a back-port. [Regression Potential] This fix improves suicide_grace coverage of the Sharded OpWQ. This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance. The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty-queue edge case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue. - In-Progress - Opened upstream tracker for issue#45076 [1] and fix pr#34575 [2] [0] https://tracker.ceph.com/issues/37778 [1] https://tracker.ceph.com/issues/45076 [2] https://github.com/ceph/ceph/pull/34575
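With the upstream tracker and fix PR now linked, the shape of the change as the description states it (re-apply the pre-`_process()` suicide_grace once work is found after waiting on an empty queue) can be sketched by extending the earlier `ShardedOpWQSketch`. This is an illustrative sketch under those assumptions, not the actual upstream patch in pr#34575.

```cpp
// Added as a member of the earlier ShardedOpWQSketch. Names and
// timeout values remain hypothetical stand-ins, not Ceph identifiers.
void process_loop_fixed() {
  for (;;) {
    std::unique_lock<std::mutex> l(m);
    if (q.empty()) {
      // Idle wait still relaxes the timeouts, exactly as before.
      hb.reset(std::chrono::seconds(30), std::chrono::seconds(0));
      cv.wait(l, [this] { return !q.empty(); });
      // FIX: restore the work-queue grace/suicide_grace now that work
      // was found, so a hung op once again trips the suicide watchdog.
      hb.reset(std::chrono::seconds(15), std::chrono::seconds(150));
    }
    auto op = std::move(q.front());
    q.pop_front();
    l.unlock();
    op();
  }
}
```

Confining the re-application to the empty-queue wait path keeps the change narrow, which is the risk mitigation the [Regression Potential] section describes.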
2020-08-18 16:59:08 Brian Murray ceph (Ubuntu Eoan): status Confirmed Won't Fix
2020-11-30 17:38:54 Dan Hill ceph (Ubuntu Focal): status Confirmed Fix Released
2020-11-30 21:11:52 Mathew Hodson ceph (Ubuntu): status Confirmed Fix Released
2020-12-03 02:00:02 Dan Hill bug task added cloud-archive
2020-12-03 02:02:06 Dan Hill cloud-archive: assignee Dan Hill (hillpd)
2020-12-08 18:53:13 Billy Olsen nominated for series cloud-archive/train
2020-12-08 18:53:13 Billy Olsen bug task added cloud-archive/train
2020-12-08 18:53:13 Billy Olsen nominated for series cloud-archive/rocky
2020-12-08 18:53:13 Billy Olsen bug task added cloud-archive/rocky
2020-12-08 18:53:13 Billy Olsen nominated for series cloud-archive/queens
2020-12-08 18:53:13 Billy Olsen bug task added cloud-archive/queens
2020-12-08 18:53:13 Billy Olsen nominated for series cloud-archive/stein
2020-12-08 18:53:13 Billy Olsen bug task added cloud-archive/stein
2020-12-08 18:57:07 Dan Hill cloud-archive/train: status New Fix Released
2020-12-08 18:58:28 Dan Hill cloud-archive/rocky: status New Invalid
2020-12-08 18:58:40 Dan Hill cloud-archive/stein: status New In Progress
2020-12-08 18:58:44 Dan Hill cloud-archive/train: assignee Dan Hill (hillpd)
2020-12-08 18:58:47 Dan Hill cloud-archive/stein: assignee Dan Hill (hillpd)
2020-12-08 18:58:50 Dan Hill cloud-archive/rocky: assignee Dan Hill (hillpd)
2020-12-08 18:58:54 Dan Hill cloud-archive/queens: assignee Dan Hill (hillpd)
2020-12-08 18:59:06 Dan Hill ceph (Ubuntu Bionic): status Confirmed In Progress
2020-12-08 18:59:19 Dan Hill cloud-archive/queens: status New In Progress
2020-12-08 18:59:33 Dan Hill cloud-archive: status New Fix Released
2020-12-08 19:01:04 Billy Olsen cloud-archive/rocky: status Invalid Won't Fix
2020-12-17 18:52:37 Dan Hill cloud-archive/train: importance Undecided Medium
2020-12-17 18:53:01 Dan Hill cloud-archive/stein: importance Undecided Medium
2020-12-17 18:53:06 Dan Hill cloud-archive/queens: importance Undecided Medium
2020-12-17 18:53:09 Dan Hill cloud-archive: importance Undecided Medium
2020-12-17 18:53:15 Dan Hill cloud-archive/rocky: importance Undecided Medium
2021-10-18 15:58:41 Dan Hill bug added subscriber Dan Hill
2021-10-18 15:59:35 Kellen Renshaw cloud-archive/stein: status In Progress Won't Fix
2021-10-18 16:00:38 Kellen Renshaw ceph (Ubuntu Bionic): assignee Dan Hill (hillpd) Kellen Renshaw (krenshaw)
2021-10-18 16:00:44 Kellen Renshaw cloud-archive/queens: assignee Dan Hill (hillpd) Kellen Renshaw (krenshaw)
2021-10-18 16:42:15 Dan Streetman removed subscriber Ubuntu Sponsors Team
2021-10-26 20:13:35 Brian Murray ceph (Ubuntu Bionic): status In Progress Fix Committed
2021-10-26 20:13:38 Brian Murray bug added subscriber Ubuntu Stable Release Updates Team
2021-10-26 20:13:41 Brian Murray bug added subscriber SRU Verification
2021-10-26 20:13:45 Brian Murray tags patch sts patch sts verification-needed verification-needed-bionic