Ubuntu
linux package

Bug #1791790
Activity log

Activity log for bug #1791790

Date	Who	What changed	Old value	New value	Message
2018-09-10 19:01:15	Steven Haber	bug			added bug
2018-09-10 19:30:05	Ubuntu Kernel Bot	linux (Ubuntu): status	New	Incomplete
2018-09-10 19:53:37	Steven Haber	attachment added		apport.linux.4RjMFB.apport https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1791790/+attachment/5187217/+files/apport.linux.4RjMFB.apport
2018-09-10 19:55:58	Steven Haber	linux (Ubuntu): status	Incomplete	Confirmed
2018-09-12 19:17:46	Joseph Salisbury	linux (Ubuntu): assignee		Joseph Salisbury (jsalisbury)
2018-09-12 19:17:49	Joseph Salisbury	linux (Ubuntu): importance	Undecided	High
2018-09-12 19:17:51	Joseph Salisbury	linux (Ubuntu): status	Confirmed	In Progress
2018-09-12 19:17:56	Joseph Salisbury	nominated for series		Ubuntu Xenial
2018-09-12 19:17:56	Joseph Salisbury	bug task added		linux (Ubuntu Xenial)
2018-09-12 19:18:02	Joseph Salisbury	linux (Ubuntu Xenial): status	New	In Progress
2018-09-12 19:18:05	Joseph Salisbury	linux (Ubuntu Xenial): importance	Undecided	High
2018-09-12 19:18:08	Joseph Salisbury	linux (Ubuntu Xenial): assignee		Joseph Salisbury (jsalisbury)
2018-09-20 09:55:40	Joseph Salisbury	summary	Kernel hang on drive pull caused by incomplete backport for bug 1597908	Kernel hang on drive pull caused by regression introduced by commit 287922eb0b18
2018-09-20 10:00:55	Joseph Salisbury	description	A bug was introduced when backporting the fix for http://bugs.launchpad.net/bugs/1597908. This bug exists in all Ubuntu 16.04 LTS 4.4 kernels >= 4.4.0-36, and many other non-LTS kernels. This patch changes the context in which timeout work is scheduled for block devices in the kernel. Previously, timeout work was executed directly from the timer callback that fired when a deadline was met. After the patch, timeout work is scheduled using a background work queue. This means that by the time the work executes, the device queue which originally scheduled the work could be torn down. In order to prevent this, the patch takes a reference on the device queue when executing the timeout work. The problem is that the last reference to this queue can be removed before the timeout work can be executed. During teardown, the block system executes a freeze followed by a drain. The freeze drops the last reference on the queue. The drain tries to clean up any outstanding work, including timeout work. After a freeze, the timeout work in the background queue is unable to obtain a reference, and exits early without completing work. The work is now permanently stuck in the queue and it will never be completed. The drain in the device teardown path spins indefinitely. The bug manifests as a hang that looks like this: [<ffffffff81829f15>] schedule+0x35/0x80 [<ffffffffc014aea9>] hpsa_scan_start+0x109/0x140 [hpsa] [<ffffffff810c3cb0>] ? wake_atomic_t_function+0x60/0x60 [<ffffffffc014b602>] hpsa_rescan_ctlr_worker+0x1d2/0x652 [hpsa] [<ffffffff8109a2c5>] process_one_work+0x165/0x480 [<ffffffff8109a62b>] worker_thread+0x4b/0x4c0 [<ffffffff8109a5e0>] ? process_one_work+0x480/0x480 [<ffffffff810a0808>] kthread+0xd8/0xf0 [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0 [<ffffffff8182e38f>] ret_from_fork+0x3f/0x70 [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0 The fix exists upstream. 
It applies, builds, and runs cleanly on Ubuntu's most recent 4.4 kernel. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=4e9b6f20828ac880dbc1fa2fdbafae779473d1af We hit this bug nearly 100% of the time on some of our HP hardware. The HPSA driver has a tendency to aggressively remove missing devices, so it widens the race. As a result, we've been building our own kernel with this patch applied. It would be really nice if we could get it into mainline Ubuntu. Let me know what additional information is needed. Thanks!	== SRU Justification == The following commit was applied to Xenial and introduced this regression: 287922eb0b18 ("block: defer timeouts to a workqueue") This regression was introduced in mainline as of v4.5-rc1. Bionic was also affected by this regression, but it already got the fix when commit 4e9b6f20828a was applied to mainline in v4.15-rc1. The regression caused a kernel hang because the HPSA driver has a tendency to aggressively remove missing devices. == Fix == 4e9b6f20828a ("block: Fix a race between blk_cleanup_queue() and timeout handling") == Regression Potential == Low. This commit fixes a regression and has been cc'd to stable, so it has had additional upstream review. This commit is already applied to Bionic and Cosmic. == Test Case == A test kernel was built with this patch and tested by the original bug reporter. The bug reporter states the test kernel resolved the bug. A bug was introduced when backporting the fix for http://bugs.launchpad.net/bugs/1597908. This bug exists in all Ubuntu 16.04 LTS 4.4 kernels >= 4.4.0-36, and many other non-LTS kernels. This patch changes the context in which timeout work is scheduled for block devices in the kernel. Previously, timeout work was executed directly from the timer callback that fired when a deadline was met. After the patch, timeout work is scheduled using a background work queue. 
This means that by the time the work executes, the device queue which originally scheduled the work could be torn down. In order to prevent this, the patch takes a reference on the device queue when executing the timeout work. The problem is that the last reference to this queue can be removed before the timeout work can be executed. During teardown, the block system executes a freeze followed by a drain. The freeze drops the last reference on the queue. The drain tries to clean up any outstanding work, including timeout work. After a freeze, the timeout work in the background queue is unable to obtain a reference, and exits early without completing work. The work is now permanently stuck in the queue and it will never be completed. The drain in the device teardown path spins indefinitely. The bug manifests as a hang that looks like this: [<ffffffff81829f15>] schedule+0x35/0x80 [<ffffffffc014aea9>] hpsa_scan_start+0x109/0x140 [hpsa] [<ffffffff810c3cb0>] ? wake_atomic_t_function+0x60/0x60 [<ffffffffc014b602>] hpsa_rescan_ctlr_worker+0x1d2/0x652 [hpsa] [<ffffffff8109a2c5>] process_one_work+0x165/0x480 [<ffffffff8109a62b>] worker_thread+0x4b/0x4c0 [<ffffffff8109a5e0>] ? process_one_work+0x480/0x480 [<ffffffff810a0808>] kthread+0xd8/0xf0 [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0 [<ffffffff8182e38f>] ret_from_fork+0x3f/0x70 [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0 The fix exists upstream. It applies, builds, and runs cleanly on Ubuntu's most recent 4.4 kernel. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=4e9b6f20828ac880dbc1fa2fdbafae779473d1af We hit this bug nearly 100% of the time on some of our HP hardware. The HPSA driver has a tendency to aggressively remove missing devices, so it widens the race. As a result, we've been building our own kernel with this patch applied. It would be really nice if we could get it into mainline Ubuntu. Let me know what additional information is needed. Thanks!
2018-09-20 10:01:07	Joseph Salisbury	tags		xenial
2018-10-02 10:20:21	Kleber Sacilotto de Souza	linux (Ubuntu Xenial): status	In Progress	Fix Committed
2018-10-04 10:05:14	Brad Figg	tags	xenial	verification-needed-xenial xenial
2018-10-04 23:28:25	Steven Haber	tags	verification-needed-xenial xenial	verification-done-xenial xenial
2018-10-22 16:31:21	Launchpad Janitor	linux (Ubuntu Xenial): status	Fix Committed	Fix Released
2018-10-22 16:31:21	Launchpad Janitor	cve linked		2018-9363
2019-01-23 01:04:35	Joseph Salisbury	linux (Ubuntu): status	In Progress	Fix Released
2019-07-24 20:23:55	Brad Figg	tags	verification-done-xenial xenial	cscc verification-done-xenial xenial
