tcmu user space crash results in kernel module hang.
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | Undecided | Unassigned |
Bionic | Fix Released | High | Kamal Mostafa |
Cosmic | Fix Released | High | Kamal Mostafa |
linux-gcp (Ubuntu) | Fix Released | Undecided | Kamal Mostafa |
Bionic | Fix Released | High | Kamal Mostafa |
Cosmic | Fix Released | High | Kamal Mostafa |
Bug Description
With the 4.15.0 kernel version used in bionic GCP/GKE, if the tcmu user space code crashes while handling a netlink message from the kernel, the kernel module will be stuck waiting for the response. This situation can only be resolved with a server reboot.
=== SRU Justification ===
[Impact]
In the current Ubuntu image, this particular user-space crash results in a hang that hobbles the entire physical server.
[Fix]
The request is for Canonical to backport the patch https:/
A number of mainline upstream commits had to be applied as part of the backport.
[Test]
unavailable
[Regression Potential]
Low. These are all mainline commits, and they affect only that driver.
Changed in linux-gcp (Ubuntu):
assignee: nobody → Kamal Mostafa (kamalmostafa)
Changed in linux-gcp (Ubuntu Bionic):
assignee: nobody → Kamal Mostafa (kamalmostafa)
Changed in linux-gcp (Ubuntu Bionic):
status: New → In Progress
importance: Undecided → High
Changed in linux-gcp (Ubuntu):
status: New → Invalid
Changed in linux (Ubuntu Bionic):
status: New → In Progress
assignee: nobody → Kamal Mostafa (kamalmostafa)
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Cosmic):
status: In Progress → Fix Committed
Changed in linux (Ubuntu):
status: Incomplete → Fix Released
There may be some confusion between the requested patch and the expected functionality. The requested patch (https://patchwork.kernel.org/patch/10319623/) adds only a new **configfs** entry called "reset_netlink", but the returned test results (below) suggest that the tester was expecting a new **module parameter** called "reset_netlink". I also note that the discussion thread attached to the patchwork patch does mention the idea of doing it as a module parameter, but that idea was apparently rejected in favor of the configfs scheme implemented by the patch. Is it possible that the tester was originally working with an old version of the patch?
Also, the tester mentions expecting "block_netlink and reset_netlink" module params, but neither the patchwork patch nor mainline contains any reference to "block_netlink" at all.
My test kernel, with the reset_netlink patchwork patch (and many tcmu prerequisites from mainline), is available here:
git source: https://git.launchpad.net/~kamalmostafa/ubuntu/+source/linux/+git/bionic/log/?h=tcmu-patch
binary pkgs: https://kernel.ubuntu.com/~kamal/.tmp.4hthjg7s/
TEST RESULTS (via terrykrudd):
------------------------------
Unfortunately the build you gave us does not seem to include the kernel
patch to reset and unblock the tcmu netlink socket.
dhana@gke-dhana-cluster-default-pool-596c314f-xn9f:~$ uname -a
Linux gke-dhana-cluster-default-pool-596c314f-xn9f 4.15.0-1028-gcp #29+tcmu0 SMP Fri Mar 8 20:33:39 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
dhana@gke-dhana-cluster-default-pool-596c314f-xn9f:~$ dpkg -l | grep 1028
ii linux-image-unsigned-4.15.0-1028-gcp 4.15.0-1028.29+tcmu0 amd64 Linux kernel image for version 4.15.0 on 64 bit x86 SMP
ii linux-modules-4.15.0-1028-gcp 4.15.0-1028.29+tcmu0 amd64 Linux kernel extra modules for version 4.15.0 on 64 bit x86 SMP
iU linux-modules-extra-4.15.0-1028-gcp 4.15.0-1028.29+tcmu0 amd64 Linux kernel extra modules for version 4.15.0 on 64 bit x86 SMP
With the patch, /sys/module/target_core_user/parameters/ should
include block_netlink and reset_netlink entries such as:
[root@demo06 crash]# ls -al /sys/module/target_core_user/parameters/
total 0
drwxr-xr-x 2 root root 0 Mar 11 16:47 .
drwxr-xr-x 6 root root 0 Mar 8 11:37 ..
-rw-r--r-- 1 root root 4096 Mar 8 14:31 block_netlink
-rw-r--r-- 1 root root 4096 Mar 11 16:47 global_max_data_area_mb
--w------- 1 root root 4096 Mar 8 14:31 reset_netlink
But on the GKE instance with the patched kernel and modules we see:
dhana@gke-dhana-cluster-default-pool-596c314f-xn9f:~$ ls -al /sys/module/target_core_user/parameters/
total 0
drwxr-xr-x 2 root root 0 Mar 11 23:32 .
drwxr-xr-x 6 root root 0 Mar 11 23:32 ..
-rw-r--r-- 1 root root 4096 Mar 11 23:32 global_max_data_area_mb
As you see the block_netlink and reset_netlink entries are not present.
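For anyone repeating this check on another instance, the manual `ls` comparison can be scripted. The following is only a minimal sketch: the directory and the two entry names are taken from the listings in this report, while the helper name `missing_entries` is made up for illustration and is not part of any tcmu tooling.

```python
import os

# Directory and entry names as they appear in the tester's listings above.
PARAM_DIR = "/sys/module/target_core_user/parameters"
EXPECTED = ("block_netlink", "reset_netlink")


def missing_entries(param_dir, expected=EXPECTED):
    """Return the expected entry names that are absent from param_dir.

    Hypothetical helper: it only compares directory entries and does not
    read or write any of the parameter files.
    """
    try:
        present = set(os.listdir(param_dir))
    except FileNotFoundError:
        # Directory does not exist (e.g. module not loaded): report all as missing.
        present = set()
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    missing = missing_entries(PARAM_DIR)
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all expected entries present")
```

Note that this only inspects the module parameter directory. Since the requested patchwork patch is described above as adding a configfs entry instead, a "missing" result here does not by itself show that the patch is absent from the kernel.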