[Hyper-V] srcu: Lock srcu_data structure in srcu_gp_start()
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
linux (Ubuntu) | | Undecided | Unassigned | |
linux (Ubuntu Xenial) | | Undecided | Unassigned | |
linux (Ubuntu Bionic) | | Undecided | Unassigned | |
linux (Ubuntu Cosmic) | | Undecided | Unassigned | |
linux-azure (Ubuntu) | | Medium | Marcelo Cerri | |
linux-azure (Ubuntu Xenial) | | Undecided | Unassigned | |
linux-azure (Ubuntu Bionic) | | Undecided | Marcelo Cerri | |
linux-azure (Ubuntu Cosmic) | | Undecided | Marcelo Cerri | |
Bug Description
We had a customer seeing traces like the following:
Stack trace from kern.log:
2018-10-
2018-10-
2018-10-
2018-10-
2018-10-
2018-10-
2018-10-
2018-10-
2018-10-
2018-10-
2018-10-
2018-10-
2018-10-
2018-10-
2018-10-
2018-10-
2018-10-
2018-10-
2018-10-
2018-10-
2018-10-
2018-10-
2018-10-
2018-10-
2018-10-
2018-10-
2018-10-
2018-10-
Error Code: INFO: task kworker/u16:0:16678 blocked for more than 120 seconds.
We are seeing more issues with fsnotify-related callbacks. These are not soft or hard lockups, but they seem to significantly degrade the responsiveness of systemd (and, from there, everything else).
The following upstream commit may fix this issue, but it is in Paul's RCU tree and not in linux-next or upstream yet:
srcu: Lock srcu_data structure in srcu_gp_start()
The srcu_gp_start() function is called with the srcu_struct structure's
->lock held, but not with the srcu_data structure's ->lock. This is
problematic because this function accesses and updates the srcu_data
structure's ->srcu_cblist, which is protected by that lock. Failing to
hold this lock can result in corruption of the SRCU callback lists,
which in turn can result in arbitrarily bad results.
This commit therefore makes srcu_gp_start() acquire the srcu_data
structure's ->lock across the calls to rcu_segcblist_
rcu_segcblist_
Please investigate this issue and evaluate the proposed fix.
CVE References
- 2017-5715
- 2017-5753
- 2017-5754
- 2018-14625
- 2018-14633
- 2018-14678
- 2018-15471
- 2018-16882
- 2018-18021
- 2018-18397
- 2018-18653
- 2018-18710
- 2018-18955
- 2018-19407
- 2018-19824
- 2018-19854
- 2018-5391
- 2018-6559
- 2018-7755
- 2018-9363
- 2019-3459
- 2019-3460
- 2019-6133
- 2019-6974
- 2019-7221
- 2019-7222
- 2019-7308
- 2019-8912
- 2019-8956
- 2019-8980
- 2019-9003
- 2019-9162
- 2019-9213
Changed in linux-azure (Ubuntu): | |
status: | New → Confirmed |
tags: | added: kernel-da-key kernel-hyper-v |
Changed in linux-azure (Ubuntu): | |
importance: | Undecided → Medium |
status: | Confirmed → Triaged |
Marcelo Cerri (mhcerri) wrote : | #1 |
Joshua R. Poulson (jrp) wrote : | #2 |
It's a heavy database workload from an online site, so it is difficult to build a simple repro for it.
Changed in linux-azure (Ubuntu): | |
assignee: | nobody → Joseph Salisbury (jsalisbury) |
status: | Triaged → In Progress |
Joseph Salisbury (jsalisbury) wrote : | #3 |
I built a test kernel with commit eb4c2382272ae7 from linux-next. This commit relies on commit d633198088bd9 for the definition of spin_lock_rcu_node and its corresponding unlock. That commit was added to mainline in v4.16-rc1.
The test kernel can be downloaded from:
http://
It sounds like this bug is difficult to reproduce, but it would be great if the affected customer was willing to test this kernel.
Note about installing test kernels:
• If the test kernel is prior to 4.15 (Bionic), you need to install the linux-image and linux-image-extra .deb packages.
• If the test kernel is 4.15 (Bionic) or newer, you need to install the linux-modules, linux-modules-extra and linux-image-
Thanks in advance!
tags: | added: bjf |
tags: | removed: bjf |
tags: | added: bjf-tracking |
Joshua R. Poulson (jrp) wrote : | #4 |
Joe, is your kernel different than the one Gavin Guo built for us here? https:/
We have enabled extra debugging and given that kernel to our internal customer, who is attempting to repro. Since the repro takes a very long time, it is difficult to decide whether the fix is working or not.
Paul McKenney upstream has submitted a pull request for this patch (and others) to go into 4.21. Getting some "burn in" time upstream hasn't really started in earnest yet, but there is no negative discussion about the PR, and I am tempted to get this into the regular based on Paul's comment: "Lock srcu_data structure in srcu_gp_start(), fixing a an extremely rare but also extremely embarrassing concurrency bug, courtesy of Dennis Krein."
Gavin Guo (mimi0213kimo) wrote : | #5 |
Hi Joshua,
I just reviewed the commit Joseph provided in the launchpad. It's the
same as the two patches I backported.
eb4c2382272a srcu: Lock srcu_data structure in srcu_gp_start()
d633198088bd srcu: Prohibit call_srcu() use under raw spinlocks
The commit id eb4c2382272a is the latest linux-next commit id. When I
backported the above two patches, the eb4c2382272a was in Paul's
rcu-next tree. So, for the SRU process, Joseph's backports are the formal ones.
Joseph Salisbury (jsalisbury) wrote : | #6 |
Hi Joshua,
I touched base with Gavin to compare our trees. My test kernel is the current Azure kernel with two commits applied: d633198088bd9 and eb4c2382272ae7. Commit eb4c2382272ae7 is the patch from Dennis Krein in linux-next:
eb4c2382272a ("srcu: Lock srcu_data structure in srcu_gp_start()")
Gavin and I both had the same set of commits. I can submit an SRU request for this if you don't want to wait for the testing, since it could take a long time. If I submit it this week, it won't land in the Azure kernel until the next SRU cycle in the new year. Just let us know what you think.
Here are the dates for the next cycle:
cycle: 14-Jan through 03-Feb
=======
11-Jan Last day for kernel commits for this cycle.
14-Jan - 18-Jan Kernel prep week.
21-Jan - 01-Feb Bug verification & Regression testing.
31-Jan Release 18.04.2 kernels to -updates
04-Feb Release remaining kernels to -updates.
Looking at these dates, we may want to SRU it this week, due to a company shutdown between 24-Dec and 06-Jan.
Changed in linux-azure (Ubuntu): | |
assignee: | Joseph Salisbury (jsalisbury) → Marcelo Cerri (mhcerri) |
status: | In Progress → Triaged |
Marcelo Cerri (mhcerri) wrote : | #7 |
Hi, Josh.
Did you have any feedback from the customer regarding the test kernel?
How do you want to proceed with that?
Joshua R. Poulson (jrp) wrote : | #8 |
No new instances of the problem on the test cluster for many weeks. Let's move forward with this change.
overlord (lazamarius1) wrote : | #9 |
Will this fix be available for Linux 4.15.0-generic x86_64, or is it available already?
I am currently on Linux 4.15.0-43-generic x86_64. Some servers have this issue and others are fine. I am not sure what triggers the problem, but when it triggers, kworker, dockerd, and systemd go into uninterruptible sleep and I need to reboot the servers to recover.
After a while the issue reappears, so I would like to patch the servers as fast as possible.
Thanks!
overlord (lazamarius1) wrote : | #10 |
Using Ubuntu 16.04.5 LTS btw.
Joshua R. Poulson (jrp) wrote : | #11 |
The fix was picked up for upstream stable 4.19.15 and 4.20.2. I would expect the generic kernels to eventually pick up this fix.
Marcelo Cerri (mhcerri) wrote : | #12 |
Hi, Josh. Should we apply the fixes for the 4.15 and 4.18 linux-azure kernel then?
Joshua R. Poulson (jrp) wrote : | #13 |
Yes, please, per comment #8
Changed in linux-azure (Ubuntu Bionic): | |
assignee: | nobody → Marcelo Cerri (mhcerri) |
Changed in linux-azure (Ubuntu Cosmic): | |
assignee: | nobody → Marcelo Cerri (mhcerri) |
Changed in linux-azure (Ubuntu Bionic): | |
status: | New → Confirmed |
Changed in linux-azure (Ubuntu Cosmic): | |
status: | New → Confirmed |
Changed in linux-azure (Ubuntu): | |
status: | Triaged → Confirmed |
Changed in linux-hwe (Ubuntu): | |
status: | New → Confirmed |
Launchpad Janitor (janitor) wrote : | #14 |
Status changed to 'Confirmed' because the bug affects multiple users.
affects: | linux-hwe (Ubuntu) → linux-meta-hwe (Ubuntu) |
Changed in linux-meta-hwe (Ubuntu Bionic): | |
status: | New → Confirmed |
Changed in linux-meta-hwe (Ubuntu Cosmic): | |
status: | New → Confirmed |
no longer affects: | linux-meta-hwe (Ubuntu) |
no longer affects: | linux-meta-hwe (Ubuntu Bionic) |
no longer affects: | linux-meta-hwe (Ubuntu Cosmic) |
no longer affects: | linux-azure (Ubuntu Bionic) |
no longer affects: | linux-azure (Ubuntu Cosmic) |
Changed in linux-azure (Ubuntu Bionic): | |
status: | New → In Progress |
Changed in linux-azure (Ubuntu Cosmic): | |
status: | New → In Progress |
Changed in linux-azure (Ubuntu Bionic): | |
assignee: | nobody → Marcelo Cerri (mhcerri) |
Changed in linux-azure (Ubuntu Cosmic): | |
assignee: | nobody → Marcelo Cerri (mhcerri) |
Changed in linux-azure (Ubuntu Bionic): | |
importance: | Undecided → Medium |
Changed in linux-azure (Ubuntu Cosmic): | |
importance: | Undecided → Medium |
This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:
apport-collect 1802021
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the Ubuntu Kernel Team.
Changed in linux (Ubuntu): | |
status: | New → Incomplete |
Changed in linux (Ubuntu Bionic): | |
status: | New → Incomplete |
Changed in linux (Ubuntu Cosmic): | |
status: | New → Incomplete |
tags: | added: bionic |
Marcelo Cerri (mhcerri) wrote : | #17 |
overlord (lazamarius1) wrote : | #18 |
Hey guys, I am not sure how to do this but can you also make a patch for Xenial, 4.15.0-generic, for linux-hwe package? We have the same problem and we can't upgrade the boxes to a newer release so this would really help us.
affects: | linux (Ubuntu) → linux-hwe (Ubuntu) |
Changed in linux-hwe (Ubuntu): | |
status: | Incomplete → Confirmed |
overlord (lazamarius1) wrote : | #19 |
Can't run the logs collection tool since the system is stuck in D state and times out. (Also the tool is not installed on the systems). We just need this fix back-ported on 4.15.0-generic for Xenial on linux-hwe package (we don't use linux-azure)
Changed in linux-hwe (Ubuntu Bionic): | |
status: | Incomplete → Confirmed |
Changed in linux-hwe (Ubuntu Cosmic): | |
status: | Incomplete → Confirmed |
information type: | Public → Public Security |
Changed in linux-hwe (Ubuntu): | |
assignee: | nobody → overlord (lazamarius1) |
assignee: | overlord (lazamarius1) → nobody |
information type: | Public Security → Public |
Marcelo Cerri (mhcerri) wrote : | #21 |
Hi, @overlord.
I changed it to "linux" instead because xenial/linux-hwe is simply a backport of bionic/linux, so we need to apply the fix to bionic/linux first and it will then be included in xenial/linux-hwe automatically.
Mauricio Faria de Oliveira (mfo) wrote : | #22 |
Marcelo @mhcerri,
Would you be able to provide a test kernel for bionic/linux-hwe so that @lazamarius1 can provide test results for -generic?
I'll be happy to do that as well if you're short on time right now.
(I guess the patchset is the same you posted for linux-azure.)
Thanks,
Mauricio
overlord (lazamarius1) wrote : | #23 |
Thanks Marcelo! However I still see the unassigned package is linux-hwe, and I can't add Xenial tag.
I also noticed I received an update today from 4.15.0-43-generic to 4.15.0-45-generic, but I cannot tell if this update has the fix from this bug.
Could you help me with some directions on this? Where could I check?
Thanks!
overlord (lazamarius1) wrote : | #24 |
Mauricio @mfo, is there a way to reproduce the issue easily in order to test it? I was not able to find one.
The only way I can tell whether the issue is still there is to apply the patch and wait for the servers to hit the problem, which could take days or weeks. When that happens in my case, docker tasks end up in D state and the load average climbs to 100 very quickly. Then a certain kworker also enters D state, and possibly init/systemd does as well in the end. The only recovery action is to restart the box.
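One way to spot the symptom described above is to list tasks whose process state starts with D (uninterruptible sleep). The following is only a sketch: the helper name list_d_state is made up for this example, and it assumes a procps-style `ps -eo pid,stat,comm` listing (with its header line) on stdin.

```shell
# Hypothetical helper: read `ps -eo pid,stat,comm` output on stdin and
# print "PID COMMAND" for every task whose state begins with D.
list_d_state() {
    awk 'NR > 1 && $2 ~ /^D/ {
        pid = $1
        $1 = $2 = ""          # drop the PID and STAT columns
        sub(/^ +/, "")        # trim the separators left behind
        print pid, $0         # the remainder is the command name
    }'
}

# Usage on a live system:
#   ps -eo pid,stat,comm | list_d_state
```

A task showing up here once is normal during I/O; what matters for this bug is tasks such as kworker, dockerd or systemd staying in that state.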
David Coronel (davecore) wrote : | #25 |
Hi overlord. An easy way to check the updates included in released kernels is to look at the "-changes" mailing list for your Ubuntu release.
In this situation it would be https:/
And you can find that new kernel here:
https:/
You can see this kernel 4.15.0-45 does not include the fix for this LP bug #1802021.
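The same check can be scripted against a changelog in the Debian format shown in this bug. The sketch below is an illustration, not an official tool: the function name versions_fixing_bug is hypothetical, and it assumes entry headers of the form "package (version) series; urgency=..." and bug references written as "(LP: #NNNNNNN)".

```shell
# Hypothetical helper: read a Debian-style changelog on stdin and print
# "package (version)" for each entry that mentions the LP bug number $1.
versions_fixing_bug() {
    awk -v bug="LP: #$1)" '
        # Remember the most recent entry header, e.g.
        # "linux-azure (4.18.0-1009.9) cosmic; urgency=medium".
        /^[a-z0-9.+-]+ \([^)]+\) [a-z-]+; urgency=/ { header = $1 " " $2 }
        # Print each matching entry header only once.
        index($0, bug) && header != last { print header; last = header }
    '
}

# Usage, with <package> left as a placeholder for an installed kernel package:
#   zcat /usr/share/doc/<package>/changelog.Debian.gz | versions_fixing_bug 1802021
```

An empty result means none of the entries in that changelog reference the bug, which is what the 4.15.0-45 changelog would give for #1802021.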
David Coronel (davecore) wrote : | #26 |
@overlord: AFAIK, there is no simple reproducer test case for this issue. The ideal testing scenario for a bug and fix like this one would be for each user who reported this issue to use a test kernel with the fix in their environment and report back if the issue still manifests or not after some time has passed with that test kernel in your affected environment.
overlord (lazamarius1) wrote : | #27 |
Thanks David, my intention was to patch the systems and wait for it to reproduce (usually we get the issue back in 3-8 days or so...)
Also thanks for the pointers for checking the release notes for certain releases.
Is there a connection between the changes listed in the second link and the code base for those changes? I mean, maybe the LP bug does not appear in the changes because it was not tagged correctly, but the code fix is already there. Does this make sense? I know the code fix is already available starting with 4.19.16, but I don't know how the back-porting is handled.
David Coronel (davecore) wrote : | #28 |
You can clone the ubuntu-xenial kernel:
git clone git://kernel.
And then grep for the commit you're looking for. There are a few different ways to do it; I do:
git log --oneline | grep "Expose SMT control init function"
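For this bug, the subject to look for contains parentheses, so a fixed-string match avoids any regex surprises. A minimal sketch, assuming `git log --oneline` output is piped in; the helper name has_subject is made up for this example.

```shell
# Hypothetical helper: read `git log --oneline` output on stdin and succeed
# if any line contains the literal string $1 (no regex interpretation).
has_subject() {
    grep -qF -- "$1"
}

# Usage, run inside a clone of the kernel tree:
#   git log --oneline | has_subject 'srcu: Lock srcu_data structure in srcu_gp_start()' \
#       && echo "fix present" || echo "fix not found"
```

The second patch mentioned in this bug, 'srcu: Prohibit call_srcu() use under raw spinlocks', can be checked the same way.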
no longer affects: | linux-hwe (Ubuntu) |
no longer affects: | linux-hwe (Ubuntu Bionic) |
no longer affects: | linux-hwe (Ubuntu Cosmic) |
no longer affects: | linux-azure (Ubuntu Bionic) |
no longer affects: | linux-azure (Ubuntu Cosmic) |
Changed in linux-azure (Ubuntu Bionic): | |
status: | New → Fix Committed |
Changed in linux-azure (Ubuntu Cosmic): | |
status: | New → Fix Committed |
Changed in linux-azure (Ubuntu Bionic): | |
assignee: | nobody → Marcelo Cerri (mhcerri) |
Changed in linux-azure (Ubuntu Cosmic): | |
assignee: | nobody → Marcelo Cerri (mhcerri) |
This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:
apport-collect 1802021
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the Ubuntu Kernel Team.
Changed in linux (Ubuntu): | |
status: | New → Incomplete |
Changed in linux (Ubuntu Bionic): | |
status: | New → Incomplete |
Changed in linux (Ubuntu Cosmic): | |
status: | New → Incomplete |
Changed in linux (Ubuntu): | |
status: | Incomplete → Confirmed |
Changed in linux (Ubuntu Bionic): | |
status: | Incomplete → Confirmed |
Changed in linux (Ubuntu Cosmic): | |
status: | Incomplete → Confirmed |
Mauricio Faria de Oliveira (mfo) wrote : | #30 |
Hi Marcelo (@mhcerri),
We have another user who confirmed the 2 patches submitted for linux-azure also fix the problem on linux(-generic).
srcu: Prohibit call_srcu() use under raw spinlocks
srcu: Lock srcu_data structure in srcu_gp_start()
Could they be submitted for linux as well?
Thank you very much,
Mauricio
Brad Figg (brad-figg) wrote : | #31 |
This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-
If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.
See https:/
tags: | added: verification-needed-bionic |
overlord (lazamarius1) wrote : | #32 |
Hi @brad-figg,
Can we get a proposed fix for Xenial linux(-generic) package?
Or is that planned after the Bionic one?
Mauricio Faria de Oliveira (mfo) wrote : | #33 |
Hi @lazamarius1,
The fix for linux generic should be applied in the next kernel SRU cycle.
The current cycle ends in late February [1].
David Coronel (davecore) wrote : | #34 |
@lazamarius1: Just to clarify, the fix is scheduled to go in the 4.15 kernel in Bionic which is the same kernel as the Xenial HWE kernel. So there's no need to add anything to the Affects section. You will see a new linux-hwe 4.15 kernel in xenial-proposed once this is ready to test. Thanks!
Launchpad Janitor (janitor) wrote : | #35 |
This bug was fixed in the package linux-azure - 4.18.0-1011.11
---------------
linux-azure (4.18.0-1011.11) cosmic; urgency=medium
* linux-azure: 4.18.0-1011.11 -proposed tracker (LP: #1816081)
* 4.15.0-1037 does not see all PCI devices on GPU VMs (LP: #1816106)
- Revert "PCI: hv: Make sure the bus domain is really unique"
linux-azure (4.18.0-1009.9) cosmic; urgency=medium
* Allow I/O schedulers to be loaded with modprobe in linux-azure
(LP: #1813211)
- [Config] linux-azure: Enable all IO schedulers as modules
* [Hyper-V] srcu: Lock srcu_data structure in srcu_gp_start() (LP: #1802021)
- srcu: Lock srcu_data structure in srcu_gp_start()
* CONFIG_
(LP: #1813866)
- [Config]: disable CONFIG_
[ Ubuntu: 4.18.0-15.16 ]
* Ubuntu boot failure. 4.18.0-14 boot stalls. (does not boot) (LP: #1814555)
- Revert "drm/i915/
* Userspace break as a result of missing patch backport (LP: #1813873)
- tty: Don't hold ldisc lock in tty_reopen() if ldisc present
-- Stefan Bader <email address hidden> Fri, 15 Feb 2019 17:16:24 +0100
Changed in linux-azure (Ubuntu Cosmic): | |
status: | Fix Committed → Fix Released |
Launchpad Janitor (janitor) wrote : | #37 |
This bug was fixed in the package linux-azure - 4.18.0-
---------------
linux-azure (4.18.0-
* linux-azure: 4.18.0-
* Miscellaneous Ubuntu changes
- Start new release
[ Ubuntu: 4.18.0-1011.11 ]
* linux-azure: 4.18.0-1011.11 -proposed tracker (LP: #1816081)
* 4.15.0-1037 does not see all PCI devices on GPU VMs (LP: #1816106)
- Revert "PCI: hv: Make sure the bus domain is really unique"
linux-azure (4.18.0-
* linux-azure: 4.18.0-
* Packaging resync (LP: #1786013)
- [Packaging] update helper scripts
* Move bionic/linux-azure to 4.18 (LP: #1815451)
- [Packaging] bionic/linux-azure is now a backport of cosmic/linux-azure
* Miscellaneous Ubuntu changes
- Start new release
[ Ubuntu: 4.18.0-1009.9 ]
* Allow I/O schedulers to be loaded with modprobe in linux-azure
(LP: #1813211)
- [Config] linux-azure: Enable all IO schedulers as modules
* [Hyper-V] srcu: Lock srcu_data structure in srcu_gp_start() (LP: #1802021)
- srcu: Lock srcu_data structure in srcu_gp_start()
* CONFIG_
(LP: #1813866)
- [Config]: disable CONFIG_
* Ubuntu boot failure. 4.18.0-14 boot stalls. (does not boot) (LP: #1814555)
- Revert "drm/i915/
* Userspace break as a result of missing patch backport (LP: #1813873)
- tty: Don't hold ldisc lock in tty_reopen() if ldisc present
linux-azure (4.18.0-
* linux-azure-edge: 4.18.0-
[ Ubuntu: 4.18.0-1008.8 ]
* linux-azure: 4.18.0-1008.8 -proposed tracker (LP: #1811415)
* Cosmic update: 4.18.19 upstream stable release (LP: #1810820)
- [Config] Update config after 4.18.0-14.15 rebase
* Packaging resync (LP: #1786013)
- [Packaging] update helper scripts
* linux: 4.18.0-14.15 -proposed tracker (LP: #1811406)
* CPU hard lockup with rigorous writes to NVMe drive (LP: #1810998)
- blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
- blk-wbt: move disable check into get_limit()
- blk-wbt: use wq_has_sleeper() for wq active check
- blk-wbt: fix has-sleeper queueing check
- blk-wbt: abstract out end IO completion handler
- blk-wbt: improve waking of tasks
* To reduce the Realtek USB cardreader power consumption (LP: #1811337)
- mmc: core: Introduce MMC_CAP_
- mmc: rtsx_usb_sdmmc: Don't runtime resume the device while changing led
- mmc: rtsx_usb_sdmmc: Re-work runtime PM support
- mmc: rtsx_usb_sdmmc: Re-work card detection/removal support
- memstick: rtsx_usb_ms: Add missing pm_runtime_
- misc: rtsx_usb: Use USB remote wakeup signaling for card insertion detection
- memstick: Prevent memstick host from getting runtime suspended during card
detection
- memstick: rtsx_usb_ms: Use ms_dev() helper
- memstick: rtsx_...
Changed in linux-azure (Ubuntu Bionic): | |
status: | Fix Committed → Fix Released |
Changed in linux-azure (Ubuntu Xenial): | |
status: | New → Fix Committed |
Launchpad Janitor (janitor) wrote : | #39 |
This bug was fixed in the package linux-azure - 4.18.0-1011.11
---------------
linux-azure (4.18.0-1011.11) cosmic; urgency=medium
* linux-azure: 4.18.0-1011.11 -proposed tracker (LP: #1816081)
* 4.15.0-1037 does not see all PCI devices on GPU VMs (LP: #1816106)
- Revert "PCI: hv: Make sure the bus domain is really unique"
linux-azure (4.18.0-1009.9) cosmic; urgency=medium
* Allow I/O schedulers to be loaded with modprobe in linux-azure
(LP: #1813211)
- [Config] linux-azure: Enable all IO schedulers as modules
* [Hyper-V] srcu: Lock srcu_data structure in srcu_gp_start() (LP: #1802021)
- srcu: Lock srcu_data structure in srcu_gp_start()
* CONFIG_
(LP: #1813866)
- [Config]: disable CONFIG_
[ Ubuntu: 4.18.0-15.16 ]
* Ubuntu boot failure. 4.18.0-14 boot stalls. (does not boot) (LP: #1814555)
- Revert "drm/i915/
* Userspace break as a result of missing patch backport (LP: #1813873)
- tty: Don't hold ldisc lock in tty_reopen() if ldisc present
-- Stefan Bader <email address hidden> Fri, 15 Feb 2019 17:16:24 +0100
Changed in linux-azure (Ubuntu): | |
status: | Confirmed → Fix Released |
Changed in linux (Ubuntu Bionic): | |
status: | Confirmed → Fix Committed |
Changed in linux (Ubuntu Cosmic): | |
status: | Confirmed → Fix Committed |
Terry Rudd (terrykrudd) wrote : | #40 |
This is the final reminder to please verify that the kernel in -proposed resolves the issue for which you've filed this bug report. Canonical is planning to release these kernels early next week. Thank you in advance!
Andrei S (darthside) wrote : | #41 |
Do you know when a fix for linux-xenial will be available? Looks like that's the only one remaining.
Terry Rudd (terrykrudd) wrote : | #42 |
Andrei, I will discuss with engineering to confirm availability and get back to you
Launchpad Janitor (janitor) wrote : | #43 |
This bug was fixed in the package linux-azure - 4.15.0-1040.44
---------------
linux-azure (4.15.0-1040.44) xenial; urgency=medium
* linux-azure: 4.15.0-1040.44 -proposed tracker (LP: #1817038)
* Packaging resync (LP: #1786013)
- [Packaging] resync retpoline extraction
* CONFIG_
(LP: #1813866)
- [Config]: disable CONFIG_
- [Config] Update configs
* Allow I/O schedulers to be loaded with modprobe in linux-azure
(LP: #1813211)
- [Config] linux-azure: Enable all IO schedulers as modules
* [Hyper-V] srcu: Lock srcu_data structure in srcu_gp_start() (LP: #1802021)
- srcu: Prohibit call_srcu() use under raw spinlocks
- srcu: Lock srcu_data structure in srcu_gp_start()
[ Ubuntu: 4.15.0-46.49 ]
* linux: 4.15.0-46.49 -proposed tracker (LP: #1814726)
* mprotect fails on ext4 with dax (LP: #1799237)
- x86/speculation
* kernel BUG at /build/
- iscsi target: fix session creation failure handling
- scsi: iscsi: target: Set conn->sess to NULL when iscsi_login_
fails
- scsi: iscsi: target: Fix conn_ops double free
* user_copy in user from ubuntu_
(LP: #1812198)
- selftests: user: return Kselftest Skip code for skipped tests
- selftests: kselftest: change KSFT_SKIP=4 instead of KSFT_PASS
- selftests: kselftest: Remove outdated comment
* RTL8822BE WiFi Disabled in Kernel 4.18.0-12 (LP: #1806472)
- SAUCE: staging: rtlwifi: allow RTLWIFI_DEBUG_ST to be disabled
- [Config] CONFIG_
- SAUCE: Add r8822be to signature inclusion list
* kernel oops in bcache module (LP: #1793901)
- SAUCE: bcache: never writeback a discard operation
* CVE-2018-18397
- userfaultfd: use ENOENT instead of EFAULT if the atomic copy user fails
- userfaultfd: shmem: allocate anonymous memory for MAP_PRIVATE shmem
- userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas
- userfaultfd: shmem: add i_size checks
- userfaultfd: shmem: UFFDIO_COPY: set the page dirty if VM_WRITE is not set
* Ignore "incomplete report" from Elan touchpanels (LP: #1813733)
- HID: i2c-hid: Ignore input report if there's no data present on Elan
touchpanels
* Vsock connect fails with ENODEV for large CID (LP: #1813934)
- vhost/vsock: fix vhost vsock cid hashing inconsistent
* SRU: Fix thinkpad 11e 3rd boot hang (LP: #1804604)
- ACPI / LPSS: Force LPSS quirks on boot
* Bionic update: upstream stable patchset 2019-01-17 (LP: #1812229)
- scsi: sd_zbc: Fix variable type and bogus comment
- KVM/Eventfd: Avoid crash when assign and deassign specific eventfd in
parallel.
- x86/apm: Don't access __preempt_count with zeroed fs
- x86/events/
- x86/MCE: Remove min interval polling limitation
- fat: fix memory allocation failure handling of match_strdup()
- ALSA: hda/realtek - Add Panasonic CF-SZ6 headset jack quirk
- ARCv2:...
Changed in linux-azure (Ubuntu Xenial): | |
status: | Fix Committed → Fix Released |
Brad Figg (brad-figg) wrote : | #45 |
This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-
If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.
See https:/
tags: | added: verification-needed-cosmic |
overlord (lazamarius1) wrote : | #46 |
Hi Brad @brad-figg,
5 days are not enough to test this bug.
Citing David:
"@overlord: AFAIK, there is no simple reproducer test case for this issue. The ideal testing scenario for a bug and fix like this one would be for each user who reported this issue to use a test kernel with the fix in their environment and report back if the issue still manifests or not after some time has passed with that test kernel in your affected environment."
In most cases, systems showed the bug symptoms after 40+ days of uptime.
Plus, is there a xenial fix for linux-hwe yet? We really need this patch in the 4.15 kernel.
Thanks,
Marius
Mauricio Faria de Oliveira (mfo) wrote : | #47 |
Hi Marius @lazamarius1,
Per the kernel.ubuntu.com schedule, the version for Bionic/linux -> Xenial/linux-hwe should land soon.
You can verify the version/timestamps for each package/release at the bottom of these pages
(the linux-hwe version comes a bit after the corresponding linux version)
https:/
https:/
As far as testing goes, yes, this issue might take longer to reproduce. However, initial testing by another user, done in order to first submit the fix to Ubuntu, showed good results, so that is already a good sign for the fix individually.
Its integration with other fixes, i.e., testing it in -proposed, will be done by that other user as well, so together with your testing that should increase the chances of finding out whether the issue still happens.
There's also regression testing of the kernel builds, which can spot failures, so that contributes too.
Hope this helps,
Mauricio
Mauricio Faria de Oliveira (mfo) wrote : | #48 |
@lazamarius1,
Actually linux-hwe for Bionic with this fix has just been uploaded.
See in https:/
Changelog
linux-hwe (4.15.0-
...
* [Hyper-V] srcu: Lock srcu_data structure in srcu_gp_start() (LP: #1802021)
- srcu: Prohibit call_srcu() use under raw spinlocks
- srcu: Lock srcu_data structure in srcu_gp_start()
...
overlord (lazamarius1) wrote : | #49 |
Thanks Mauricio! @mfo
Will begin deploying this and let you guys know as soon as possible.
Andrei S (darthside) wrote : | #50 |
@mfo
Just to confirm that we installed the right proposed version of the kernel as this doesn't have any steps to reproduce. It only reproduces randomly after a certain amount of time.
These are the steps we followed to install
echo 'deb http://
echo -e "Package: *\nPin: release a=xenial-
apt-get -t xenial-proposed install -y linux-image-
reboot
Installed packages
dpkg -l | grep -i 4.15.0-47
ii linux-image-
ii linux-modules-
dpkg -l | grep -i hwe
ii linux-generic-
ii linux-headers-
ii linux-image-
uname -a
Linux myhostname 4.15.0-47-generic #50~16.04.1-Ubuntu SMP Fri Mar 15 16:06:21 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
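To script the same confirmation across many hosts, the running kernel's ABI version can be compared against the first version that carries the fix (4.15.0-47 for linux/linux-hwe and 4.15.0-1040 for linux-azure on Xenial, per the changelogs in this bug). This is a sketch under assumptions: the helper names are made up, GNU sort with the -V option is assumed, and only versions of the same kernel flavour should be compared.

```shell
# Hypothetical helper: succeed when version $1 is greater than or equal to
# version $2, using GNU sort's version ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: strip the flavour suffix from a `uname -r` style
# string, e.g. 4.15.0-47-generic -> 4.15.0-47.
kernel_abi() {
    printf '%s\n' "${1%-*}"
}

# Usage on a host running the -generic (or HWE) kernel:
#   version_ge "$(kernel_abi "$(uname -r)")" 4.15.0-47 && echo "fix expected"
```

Note that kernel_abi expects the flavour suffix to be present, as it is in `uname -r` output.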
Andrei S (darthside) wrote : | #51 |
As Marius mentioned we already deployed this patch on several instances and now we are monitoring them to see if this still happens. At least 2 of them were affected by the bug before patching so we had to reboot them.
If the above information (previous post) is correct, and considering the time it may take for this to reproduce or not, I think you might want to include this fix in the existing release cycle.
I was experiencing a situation with moby dockerd entering a state similar to comment 0 here, running kubernetes and linux kernel 4.15.0-1037-azure. This was an 8 node cluster. Observed with combinations:
kubernetes 1.11.5 + moby runtime 3.0.1 + Ubuntu 16.04.5
kubernetes 1.11.7 + moby runtime 3.0.4 + Ubuntu 16.04.10
The longest window between outages prior was 4 days, with the shortest being less than a day.
I have observed 2 weeks of uptime on 8 nodes without observation of the original symptoms since upgrading the kubernetes node kernel to 4.15.0-1040-azure. I am confident the kernel patch has resolved our problem.
Ref https:/
Mauricio Faria de Oliveira (mfo) wrote : | #53 |
Updating bug tags to verification done.
As mentioned by users in this LP bug, the verification period of 5 days is _usually_ not enough to reproduce this problem; however, we have some data points suggesting the fix is good.
1) The fix was first delivered in linux-azure 3 weeks ago and has reportedly resolved the issue for @alanjcastonguay: the issue used to occur within 4 days at the most, and it hasn't happened for 2 weeks on 8 nodes (which is statistically very positive; and it helps that the fix is not specific to -azure).
2) One of the users who reported this in linux (-generic), has verified a test kernel with this fix for weeks, based upon which the fix has been submitted after linux-azure had it. The same user has verified -proposed for about a week now, and it's looking good.
3) Users in this LP bug have been running the -proposed kernel in multiple nodes for about a week now too, and haven't hit the issue yet.
On top of 1), with 2) and 3) combined, and the schedule for -proposed verification, this seems to be a reasonable compromise between results and test time.
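As a rough illustration of why data point 1) is encouraging, the sketch below assumes (purely as a simplifying model, not something measured in this bug) that outages on the unpatched cluster arrived at random with a mean interval of 4 days, the longest gap reported. Under that exponential model, the chance of 14 outage-free days without any fix is exp(-14/4). The function name p_no_failure is made up for this example.

```shell
# Hypothetical helper: probability of seeing no failure for $1 days when
# failures arrive at random with a mean interval of $2 days
# (exponential model, an assumption made for illustration only).
p_no_failure() {
    awk -v t="$1" -v m="$2" 'BEGIN { printf "%.3f\n", exp(-t / m) }'
}

# 14 quiet days against a 4-day mean interval.
p_no_failure 14 4
```

That works out to about 3%, so two quiet weeks would be unlikely if the bug were still present at the old rate. A shorter real mean interval would make the figure smaller still.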
cheers,
Mauricio
tags: |
added: verification-done-bionic verification-done-cosmic removed: verification-needed-bionic verification-needed-cosmic |
Launchpad Janitor (janitor) wrote : | #54 |
This bug was fixed in the package linux - 4.15.0-47.50
---------------
linux (4.15.0-47.50) bionic; urgency=medium
* linux: 4.15.0-47.50 -proposed tracker (LP: #1819716)
* Packaging resync (LP: #1786013)
- [Packaging] resync getabis
- [Packaging] update helper scripts
- [Packaging] resync retpoline extraction
* C++ demangling support missing from perf (LP: #1396654)
- [Packaging] fix a mistype
* arm-smmu-v3 arm-smmu-v3.3.auto: CMD_SYNC timeout (LP: #1818162)
- iommu/arm-smmu-v3: Fix unexpected CMD_SYNC timeout
* Crash in nvme_irq_check() when using threaded interrupts (LP: #1818747)
- nvme-pci: fix out of bounds access in nvme_cqe_pending
* CVE-2019-9213
- mm: enforce min addr even if capable() in expand_downwards()
* CVE-2019-3460
- Bluetooth: Check L2CAP option sizes returned from l2cap_get_conf_opt
* amdgpu with mst WARNING on blanking (LP: #1814308)
- drm/amd/display: Don't use dc_link in link_encoder
- drm/amd/display: Move wait for hpd ready out from edp power control.
- drm/amd/display: eDP sequence BL off first then DP blank.
- drm/amd/display: Fix unused variable compilation error
- drm/amd/display: Fix warning about misaligned code
- drm/amd/display: Fix MST dp_blank REG_WAIT timeout
* tun/tap: unable to manage carrier state from userland (LP: #1806392)
- tun: implement carrier change
* CVE-2019-8980
- exec: Fix mem leak in kernel_read_file
* raw_skew in timer from the ubuntu_
(LP: #1811194)
- selftest: timers: Tweak raw_skew to SKIP when ADJ_OFFSET/other clock
adjustments are in progress
* [Packaging] Allow overlay of config annotations (LP: #1752072)
- [Packaging] config-check: Add an include directive
* CVE-2019-7308
- bpf: move {prev_,}insn_idx into verifier env
- bpf: move tmp variable into ax register in interpreter
- bpf: enable access to ax register also from verifier rewrite
- bpf: restrict map value pointer arithmetic for unprivileged
- bpf: restrict stack pointer arithmetic for unprivileged
- bpf: restrict unknown scalars of mixed signed bounds for unprivileged
- bpf: fix check_map_access smin_value test when pointer contains offset
- bpf: prevent out of bounds speculation on pointer arithmetic
- bpf: fix sanitation of alu op with pointer / scalar type from different
paths
- bpf: add various test cases to selftests
* CVE-2017-5753
- bpf: properly enforce index mask to prevent out-of-bounds speculation
- bpf: fix inner map masking to prevent oob under speculation
* BPF: kernel pointer leak to unprivileged userspace (LP: #1815259)
- bpf/verifier: disallow pointer subtraction
* squashfs hardening (LP: #1816756)
- squashfs: more metadata hardening
- squashfs metadata 2: electric boogaloo
- squashfs: more metadata hardening
- Squashfs: Compute expected length from inode size rather than block length
* efi/arm/arm64: Allow SetVirtualAddre
- efi/arm/arm64: Allow SetVirtualAddre
* Update ENA driver to version 2.0.3K (LP: #1816806)...
Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote: #55
This bug was fixed in the package linux - 4.18.0-17.18
---------------
linux (4.18.0-17.18) cosmic; urgency=medium
* linux: 4.18.0-17.18 -proposed tracker (LP: #1819624)
* Packaging resync (LP: #1786013)
- [Packaging] resync getabis
- [Packaging] update helper scripts
* C++ demangling support missing from perf (LP: #1396654)
- [Packaging] fix a mistype
* arm-smmu-v3 arm-smmu-v3.3.auto: CMD_SYNC timeout (LP: #1818162)
- iommu/arm-smmu-v3: Fix unexpected CMD_SYNC timeout
* Crash in nvme_irq_check() when using threaded interrupts (LP: #1818747)
- nvme-pci: fix out of bounds access in nvme_cqe_pending
* CVE-2019-9003
- ipmi: fix use-after-free of user->release_
* CVE-2019-9162
- netfilter: nf_nat_snmp_basic: add missing length checks in ASN.1 cbs
* CVE-2019-9213
- mm: enforce min addr even if capable() in expand_downwards()
* CVE-2019-3460
- Bluetooth: Check L2CAP option sizes returned from l2cap_get_conf_opt
* tun/tap: unable to manage carrier state from userland (LP: #1806392)
- tun: implement carrier change
* CVE-2019-8980
- exec: Fix mem leak in kernel_read_file
* [Packaging] Allow overlay of config annotations (LP: #1752072)
- [Packaging] config-check: Add an include directive
* amdgpu with mst WARNING on blanking (LP: #1814308)
- drm/amd/display: Fix MST dp_blank REG_WAIT timeout
* CVE-2019-7308
- bpf: move {prev_,}insn_idx into verifier env
- bpf: move tmp variable into ax register in interpreter
- bpf: enable access to ax register also from verifier rewrite
- bpf: restrict map value pointer arithmetic for unprivileged
- bpf: restrict stack pointer arithmetic for unprivileged
- bpf: restrict unknown scalars of mixed signed bounds for unprivileged
- bpf: fix check_map_access smin_value test when pointer contains offset
- bpf: prevent out of bounds speculation on pointer arithmetic
- bpf: fix sanitation of alu op with pointer / scalar type from different
paths
- bpf: add various test cases to test_verifier
- bpf: add various test cases to selftests
* CVE-2017-5753
- bpf: fix inner map masking to prevent oob under speculation
* Use memblock quirk instead of delayed allocation for GICv3 LPI tables
(LP: #1816425)
- efi/arm: Revert "Defer persistent reservations until after paging_init()"
- arm64, mm, efi: Account for GICv3 LPI tables in static memblock reserve
table
* efi/arm/arm64: Allow SetVirtualAddre
- efi/arm/arm64: Allow SetVirtualAddre
* Update ENA driver to version 2.0.3K (LP: #1816806)
- net: ena: update driver version from 2.0.2 to 2.0.3
- net: ena: fix race between link up and device initalization
- net: ena: fix crash during failed resume from hibernation
* Silent "Unknown key" message when pressing keyboard backlight hotkey
(LP: #1817063)
- platform/x86: dell-wmi: Ignore new keyboard backlight change event
* CVE-2018-19824
- ALSA: usb-audio: Fix UAF decrement if card has no live interfaces in card.c
* CVE-2019-3459
- Bluetooth: Verify that l2cap_get...
Changed in linux (Ubuntu Cosmic):
status: Fix Committed → Fix Released
tags: added: cscc
Hi, Josh. Do you have a specific workload that triggers that issue?