Activity log for bug #1730717

Date Who What changed Old value New value Message
2017-11-07 17:46:16 Iain Lane bug added bug
2017-11-07 17:46:35 Iain Lane attachment added bad run console-log https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+attachment/5005464/+files/laney-test25.log
2017-11-07 17:46:52 Iain Lane attachment added good run console-log https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+attachment/5005465/+files/laney-test14.log
2017-11-07 17:47:06 Iain Lane bug task added qemu-kvm (Ubuntu)
2017-11-07 17:49:20 Iain Lane description (new value below; the old value was identical except that it read "Guests are sometimes failing to reboot." rather than "17.10 (and bionic) guests are sometimes failing to reboot.")

This is impacting us for ubuntu autopkgtests. Eventually the whole region ends up dying because each worker is hit by this bug in turn and backs off until the next reset (6 hourly).

17.10 (and bionic) guests are sometimes failing to reboot. When this happens, you see the following in the console:

  [[0;32m OK [0m] Reached target Shutdown.
  [ 191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
  [ 219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
  [ 226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
  [ 226.704958] »(detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
  [ 226.706093] All QSes seen, last rcu_sched kthread activity 15002 (4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
  [ 226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

One host that exhibited this behaviour was:

  Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

with the guest running:

  Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0 (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017 (Ubuntu 4.13.0-16.19-generic 4.13.4)

The affected cloud region is running the xenial/Ocata cloud archive, so the version of qemu-kvm in there may also be relevant.

Here's how I reproduced it in lcy01:

  $ for n in {1..30}; do nova boot --flavor m1.small --image ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; done
  $ <ssh to each instance> sudo reboot
  # wait a minute or so for the instances to all reboot
  $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} | tail; done

On bad instances you'll see the "soft lockup" message; on good ones it'll reboot as normal. We've seen good and bad instances on multiple compute hosts - it doesn't feel to me like a host problem but rather a race condition somewhere that's somehow either triggered, or triggered much more often, by what lcy01 is running. I always saw this on the first reboot - never on first boot, and never on the n>1th boot. (But if it's a race then that might not mean much.) I'll attach a bad and a good console log for reference. If you're at Canonical then see internal RT #107135 for some other details.
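The reproduction steps above can be sketched as a single script. This is a rough consolidation only, wrapped in a dry-run helper so the commands can be previewed without an OpenStack environment; the image and net-name values are copied from the report, and a configured nova CLI is assumed when DRY_RUN is unset.

```shell
# Dry-run sketch of the reproduction from the report. By default the
# commands are only printed; unset DRY_RUN to execute them for real.

run() {
  if [ -n "${DRY_RUN:-1}" ]; then
    echo "+ $*"          # dry run (default): print the command only
  else
    "$@"                 # live run: execute against the region
  fi
}

boot_instances() {
  # Boot 30 small artful guests, as in the report.
  n=1
  while [ "$n" -le 30 ]; do
    run nova boot --flavor m1.small \
      --image ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img \
      --key-name "testbed-$(hostname)" \
      --nic net-name=net_ues_proposed_migration \
      "laney-test${n}"
    n=$((n + 1))
  done
}

check_consoles() {
  # After rebooting each instance over ssh, look for the soft-lockup
  # message in the tail of each console log.
  n=1
  while [ "$n" -le 30 ]; do
    echo "=== ${n} ==="
    run sh -c "nova console-log laney-test${n} | tail"
    n=$((n + 1))
  done
}

boot_instances
check_consoles
```

The manual `sudo reboot` on each instance still happens between the two phases, as in the original steps.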
2017-11-07 18:00:25 Ubuntu Kernel Bot linux (Ubuntu): status New Incomplete
2017-11-07 18:00:27 Ubuntu Kernel Bot tags xenial
2017-11-07 19:20:40 Joseph Salisbury linux (Ubuntu): importance Undecided High
2017-11-07 19:20:51 Joseph Salisbury tags xenial kernel-key xenial
2017-11-07 19:26:24 Joseph Salisbury linux (Ubuntu): status Incomplete Triaged
2017-11-07 19:26:32 Joseph Salisbury nominated for series Ubuntu Bionic
2017-11-07 19:26:32 Joseph Salisbury bug task added linux (Ubuntu Bionic)
2017-11-07 19:26:32 Joseph Salisbury bug task added qemu-kvm (Ubuntu Bionic)
2017-11-07 19:26:32 Joseph Salisbury nominated for series Ubuntu Artful
2017-11-07 19:26:32 Joseph Salisbury bug task added linux (Ubuntu Artful)
2017-11-07 19:26:32 Joseph Salisbury bug task added qemu-kvm (Ubuntu Artful)
2017-11-07 19:26:38 Joseph Salisbury linux (Ubuntu Artful): status New Triaged
2017-11-07 19:26:42 Joseph Salisbury linux (Ubuntu Artful): importance Undecided High
2017-11-08 11:56:08 Christian Ehrhardt  bug added subscriber ChristianEhrhardt
2017-11-10 07:03:00 Junien F bug added subscriber Junien Fridrick
2017-11-10 07:03:02 Junien F bug added subscriber The Canonical Sysadmins
2017-11-16 13:17:35 Launchpad Janitor qemu-kvm (Ubuntu): status New Confirmed
2017-11-16 13:17:35 Launchpad Janitor qemu-kvm (Ubuntu Artful): status New Confirmed
2017-11-16 22:56:40 Joseph Salisbury linux (Ubuntu Artful): assignee Joseph Salisbury (jsalisbury)
2017-11-16 22:56:43 Joseph Salisbury linux (Ubuntu Bionic): assignee Joseph Salisbury (jsalisbury)
2017-11-16 22:56:48 Joseph Salisbury linux (Ubuntu Artful): status Triaged In Progress
2017-11-16 22:56:52 Joseph Salisbury linux (Ubuntu Bionic): status Triaged In Progress
2017-11-16 22:57:04 Joseph Salisbury nominated for series Ubuntu Zesty
2017-11-16 22:57:04 Joseph Salisbury bug task added linux (Ubuntu Zesty)
2017-11-16 22:57:04 Joseph Salisbury bug task added qemu-kvm (Ubuntu Zesty)
2017-11-16 22:57:11 Joseph Salisbury linux (Ubuntu Zesty): status New Incomplete
2017-11-16 22:57:14 Joseph Salisbury linux (Ubuntu Zesty): importance Undecided High
2017-11-16 22:57:19 Joseph Salisbury linux (Ubuntu Zesty): assignee Joseph Salisbury (jsalisbury)
2017-12-04 16:41:49 Joseph Salisbury tags kernel-key xenial kernel-da-key xenial
2017-12-07 17:26:20 Seth Forshee linux (Ubuntu Bionic): status In Progress Fix Committed
2017-12-07 17:26:20 Seth Forshee linux (Ubuntu Bionic): assignee Joseph Salisbury (jsalisbury) Seth Forshee (sforshee)
2017-12-07 17:30:19 Colin Ian King description (old value: the description as updated on 2017-11-07; new value: the same description with the following SRU justification prepended, separated by a horizontal rule)

== SRU Justification ==

The fix for bug 1672819 can cause a lockup because it can spin indefinitely waiting for a child to exit.

[FIX]
Add a sauce patch on top of the original fix that inserts a reasonably small delay and an upper bound on the number of retries made before bailing out with an error. This avoids the lockup and is also less aggressive in the retry loop.

[TEST]
Without the fix the machine hangs. With the fix, the lockup no longer occurs.

[REGRESSION POTENTIAL]
The interruptible sleep may have some unforeseen impact on racy userspace code that expects the system call to return quickly when the race condition occurs, and instead gets delayed by a few milliseconds while the retry loop spins. However, code that relies on fork/exec timing inside pthreads, where this particular code path could bite, is generally non-POSIX-conforming racy code anyhow.
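The actual fix is a kernel sauce patch, but the "small delay plus an upper bound on retries" pattern it describes can be illustrated with a hypothetical userspace analogy (the function name and parameters here are illustrative, not from the patch):

```shell
# Hypothetical analogy of the bounded-retry approach from the SRU
# justification: instead of spinning indefinitely on a condition, retry
# a fixed number of times with a short sleep, then bail out with an
# error rather than locking up.
# Usage: retry_bounded MAX_RETRIES DELAY_SECONDS COMMAND [ARGS...]
retry_bounded() {
  max=$1; delay=$2; shift 2
  attempt=0
  while ! "$@"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after ${attempt} attempts" >&2
      return 1           # bounded: error out instead of spinning forever
    fi
    sleep "$delay"       # small delay between retries
  done
  return 0
}
```

For example, `retry_bounded 5 0.01 true` returns success immediately, while `retry_bounded 5 0.01 false` fails after five attempts instead of spinning.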
2017-12-19 16:04:47 Ryan Harper bug added subscriber Ryan Harper
2018-01-23 10:08:04 Stefan Bader linux (Ubuntu Zesty): status Incomplete Won't Fix
2018-02-01 20:34:12 Khaled El Mously linux (Ubuntu Artful): status In Progress Fix Committed
2018-03-19 10:58:05 Stefan Bader tags kernel-da-key xenial kernel-da-key verification-needed-artful xenial
2018-04-03 14:10:10 Launchpad Janitor linux (Ubuntu Artful): status Fix Committed Fix Released
2018-04-03 14:10:10 Launchpad Janitor cve linked 2017-0861
2018-04-03 14:10:10 Launchpad Janitor cve linked 2017-1000407
2018-04-03 14:10:10 Launchpad Janitor cve linked 2017-15129
2018-04-03 14:10:10 Launchpad Janitor cve linked 2017-16994
2018-04-03 14:10:10 Launchpad Janitor cve linked 2017-17448
2018-04-03 14:10:10 Launchpad Janitor cve linked 2017-17450
2018-04-03 14:10:10 Launchpad Janitor cve linked 2017-17741
2018-04-03 14:10:10 Launchpad Janitor cve linked 2017-17805
2018-04-03 14:10:10 Launchpad Janitor cve linked 2017-17806
2018-04-03 14:10:10 Launchpad Janitor cve linked 2017-17807
2018-04-03 14:10:10 Launchpad Janitor cve linked 2018-1000026
2018-04-03 14:10:10 Launchpad Janitor cve linked 2018-5332
2018-04-03 14:10:10 Launchpad Janitor cve linked 2018-5333
2018-04-03 14:10:10 Launchpad Janitor cve linked 2018-5344
2018-04-03 14:32:22 Christian Ehrhardt  qemu-kvm (Ubuntu Zesty): status New Won't Fix
2018-04-03 14:32:26 Christian Ehrhardt  qemu-kvm (Ubuntu Bionic): status Confirmed Won't Fix
2018-04-03 14:32:28 Christian Ehrhardt  qemu-kvm (Ubuntu Artful): status Confirmed Won't Fix
2018-04-03 15:28:37 Seth Forshee linux (Ubuntu Bionic): status Fix Committed Fix Released