linux hwe i386 kernel 5.0.0-21.22~18.04.1 crashes on Lenovo x220

Bug #1838115 reported by Colin Ian King on 2019-07-26
This bug affects 16 people
Affects          Importance  Assigned to
linux (Ubuntu)   Critical    Colin Ian King
Bionic           High        Unassigned
Disco            Critical    Colin Ian King

Bug Description

== SRU Justification BIONIC HWE, DISCO ==

Installed Bionic 18.04.2 i386 Desktop (using xubuntu) on a Lenovo x220i and upgraded to proposed. The 5.0.0.22 kernel crashes in various ways, with video corruption being the main visible feature.

The CPU is an i3-2350M, a 64-bit capable CPU, booted with UEFI firmware disabled, so using traditional BIOS.

1. Crashes can be just complete hangs, no ability to switch virtual console
2. Crashes may just result in screen turning off, no video and hang and/or reboot
3. Crashes sometimes allow virtual console. Can see watchdog hang checks appearing on 1 or more CPUs.

Tried the i386 https://kernel.ubuntu.com/~kernel-ppa/mainline/?C=N;O=D kernels:

https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.0.21/ - same issue
https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.20.17/ - same issue
https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19.61/ - OK

We therefore can conclude:

1. Issue appears between 4.19.61 and 4.20.17
2. Issue is in upstream kernel
3. Issue is not caused by an Ubuntu kernel patch per se (e.g. security fix, Ubuntu SAUCE patch, etc.)

I repeated this with a VM installation and I don't see the issue, so this probably is a hardware (or firmware?) specific issue.

== Fix ==

Backport wiggle of upstream fix

3f8fd02b1bf1d7ba964485a56f2f4b53ae88c167 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")

== Test ==

Without the fix, i386 xubuntu on various Lenovo platforms crashes during early boot with random video corruption, hangs, lockups or even reboots.

With the fix, it boots fine.

== Regression Potential ==

Higher than normal, as this touches the mm sync and the fix has only just hit upstream, so it has not had much of a soak test. Testing shows it fixes a kitten-killer breakage, so I think the risk vs benefit is worth considering.

description: updated

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1838115

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Colin Ian King (colin-king) wrote :

One other data point: booting the upgraded machine into the 4.15 kernel works OK, so this corruption/hang is most probably not to do with changes in userspace on the upgrade to -proposed.

Colin Ian King (colin-king) wrote :

Tried the same on an HP Mini 210-1000 (Atom N450) and it works fine, so this is an H/W-specific issue.

Colin Ian King (colin-king) wrote :

Tried https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.1.20/ - less corruption and now the splash screen starts but as soon as splash transitions to the X display we get an immediate hang.

I believe the 5.0.0 crash occurs before the splash screen starts, so it's really early in the boot process.

Colin Ian King (colin-king) wrote :

Tried: https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.2.3/
No splash screen, and screen turns off after several seconds and machine is completely hung after that.

Most probably all of this is a regression in the i915 driver for 32 bit platforms with Intel® HD Graphics 3000

Changed in linux (Ubuntu):
importance: Undecided → Critical
Sean Feole (sfeole) wrote :

Installed the Bionic 18.04.2 i386 Desktop iso (same one used by Colin) on a Dell PowerEdge R320 and upgraded to the 5.0.0.22. I was able to successfully boot the host on 5.0.0.

ubuntu@ubuntu-PowerEdge-R320:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.2 LTS
Release: 18.04
Codename: bionic
ubuntu@ubuntu-PowerEdge-R320:~$ uname -a
Linux ubuntu-PowerEdge-R320 5.0.0-21-generic #22~18.04.1-Ubuntu SMP Thu Jul 4 17:23:57 UTC 2019 i686 i686 i686 GNU/Linux

ubuntu@ubuntu-PowerEdge-R320:~$ apt-cache policy linux-generic-hwe-18.04
linux-generic-hwe-18.04:
  Installed: 5.0.0.21.78
  Candidate: 5.0.0.21.78
  Version table:
 *** 5.0.0.21.78 500
        500 http://us.archive.ubuntu.com/ubuntu bionic-proposed/main i386 Packages

attached dmidecode of hardware

Sean Feole (sfeole) wrote :

   Static hostname: ubuntu-PowerEdge-R320
         Icon name: computer-server
           Chassis: server
        Machine ID: 0f66377cdd544d0285f8c82631d6dff0
           Boot ID: d1397f34572f43adbc0e7cee95a9fc3d
  Operating System: Ubuntu 18.04.2 LTS
            Kernel: Linux 5.0.0-21-generic
      Architecture: x86

$ sudo cat /var/log/installer/media-info
Xubuntu 18.04.2 LTS "Bionic Beaver" - Release i386 (20190210)

Colin Ian King (colin-king) wrote :

So I've tested it on a range of Lenovos that I can get my hands on (kids, family, etc.), and the following fail in the same way:

Lenovo X220 Intel(R) Core(TM) i3-2350M (HD Graphics 3000)
Lenovo X230 Intel(R) Core(TM) i5-3210M (HD Graphics 4000)
Lenovo L430 Intel(R) Core(TM) i3-3120M (HD Graphics 4000)
Lenovo T420 Intel(R) Core(TM) i5-2520M (HD Graphics 3000)

Colin Ian King (colin-king) wrote :

Looks like this is a kernel modeset bug; booting with i915.modeset=0 fixes the issue.

Colin Ian King (colin-king) wrote :

Did a bisect on the drm i915 driver, came up with no root cause.
Did a full bisect between 4.18 and 4.20, bisected down to a PTI commit:

7757d607c6b31867777de42e1fb0210b9c5d8b70 is the first bad commit
commit 7757d607c6b31867777de42e1fb0210b9c5d8b70
Author: Joerg Roedel <email address hidden>
Date: Wed Jul 18 11:41:14 2018 +0200

    x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32

    Allow PTI to be compiled on x86_32.

This seems to make sense: it's x86_32 specific and explains the more random nature of the crashes.
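
The bisect above follows the usual binary search over an ordered commit range. A minimal sketch in Python, assuming a hypothetical `is_bad` callback (standing in for "build, boot and watch for the crash") and a made-up list of commit labels. It only illustrates the idea and is not how `git bisect` itself is implemented:

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit in an ordered list.

    Assumes commits[0] is known good, commits[-1] is known bad, and every
    commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi]


# Hypothetical history: the labels are placeholders, not real commits.
history = ["good-base", "c1", "c2", "pti-x86_32", "c4", "bad-tip"]
bad_from = history.index("pti-x86_32")
culprit = first_bad(history, lambda c: history.index(c) >= bad_from)
print(culprit)
```

Each step halves the remaining range, which is why a full bisect between two releases is tedious but tractable even when each test means booting real hardware.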

Colin Ian King (colin-king) wrote :

OK, I've proved that this is the core issue: 5.0.0-21.22~18.04.1 boots successfully with nopti.
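
For anyone repeating this test, it helps to confirm that the running kernel really was booted with PTI turned off. A minimal sketch in Python, assuming the check is done by looking for the `nopti` or `pti=off` parameters on the kernel command line. The helper names and example strings are hypothetical, and other switches that may also disable PTI are not covered:

```python
def pti_disabled(cmdline: str) -> bool:
    """Return True if the command line carries nopti or pti=off."""
    params = cmdline.split()  # parameters are whitespace separated
    return "nopti" in params or "pti=off" in params


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line of the running kernel (Linux only)."""
    with open(path, encoding="ascii", errors="replace") as f:
        return f.read().strip()


# Example strings are illustrative, not taken from the affected machine.
print(pti_disabled("BOOT_IMAGE=/boot/vmlinuz ro quiet splash nopti"))
print(pti_disabled("BOOT_IMAGE=/boot/vmlinuz ro quiet splash"))
```

On a real system, `pti_disabled(read_cmdline())` would perform the same check against `/proc/cmdline`. Matching whole parameters rather than substrings avoids false positives from unrelated options that merely contain the text.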

Changed in linux (Ubuntu):
status: Incomplete → In Progress
Colin Ian King (colin-king) wrote :

Tried today's Linux tip, boots OK, so a reverse bisect to tip should prove to be instructive on a potential fix.

Colin Ian King (colin-king) wrote :

Upstream fix 3f8fd02b1bf1d7ba964485a56f2f4b53ae88c167 resolves the issue in linux tip:

commit 3f8fd02b1bf1d7ba964485a56f2f4b53ae88c167
Author: Joerg Roedel <email address hidden>
Date: Fri Jul 19 20:46:52 2019 +0200

    mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()

description: updated
description: updated
Colin Ian King (colin-king) wrote :

Fix sent to kernel team mailing list for review: https://lists.ubuntu.com/archives/kernel-team/2019-July/102627.html

Changed in linux (Ubuntu):
assignee: nobody → Colin Ian King (colin-king)
Andy Whitcroft (apw) on 2019-07-29
Changed in linux (Ubuntu Disco):
status: New → In Progress
importance: Undecided → Critical
assignee: nobody → Colin Ian King (colin-king)
Colin Ian King (colin-king) wrote :

Tested the fix on the same kit with an amd64 kernel build and it works OK.

Stefan Bader (smb) on 2019-07-29
Changed in linux (Ubuntu Disco):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu Bionic):
status: New → Confirmed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-disco' to 'verification-done-disco'. If the problem still exists, change the tag 'verification-needed-disco' to 'verification-failed-disco'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-disco
Dima (dima2017) wrote :

I've tested https://kernel.ubuntu.com/~cking/lp1827884/4.15 on xenial. It works.

tags: added: verification-done-xenial
Dima (dima2017) wrote :

user@user:~$ cat /proc/cmdline
BOOT_IMAGE=/@/boot/vmlinuz-4.15.0-57-generic root=UUID=50c3a930-a3e9-4444-b4ea-1646d276c7c6 ro rootflags=subvol=@ ipv6.disable=1 zswap.enabled=0 raid=noautodetect priority=low video=SVIDEO-1:d module_blacklist=r8169,mii,msr,jfs,xfs,bluetooth,hfs,hfsplus,ufs,minix,ntfs,joydev,mac_hid
user@user:~$
user@user:~$ uname -a
Linux user 4.15.0-57-generic #63~lp1827884 SMP Mon Jul 29 15:10:18 UTC 2019 i686 i686 i686 GNU/Linux
user@user:~$

Dima (dima2017) wrote :

I don't know how to add Xenial as affected.

Seth Forshee (sforshee) on 2019-07-30
Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Colin Ian King (colin-king) wrote :

Xenial is not affected.

Colin Ian King (colin-king) wrote :

Tested disco -proposed 5.0.0-23, all OK.

tags: added: verification-done-disco
removed: verification-needed-disco

On 7/30/19 3:39 PM, Colin Ian King wrote:
> Tested disco -proposed 5.0.0-23, all OK.
>
> ** Tags removed: verification-needed-disco
> ** Tags added: verification-done-disco
>

Hello Colin,

Thanks for all your work. I am so bummed you have put in so much work
but I sincerely appreciate it. I hope you get some good sleep.

Terry

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.0.0-23.24

---------------
linux (5.0.0-23.24) disco; urgency=medium

  * disco/linux: 5.0.0-23.24 -proposed tracker (LP: #1838271)

  * linux hwe i386 kernel 5.0.0-21.22~18.04.1 crashes on Lenovo x220
    (LP: #1838115)
    - x86/mm: Check for pfn instead of page in vmalloc_sync_one()
    - x86/mm: Sync also unmappings in vmalloc_sync_all()
    - mm/vmalloc.c: add priority threshold to __purge_vmap_area_lazy()
    - mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()

 -- Stefan Bader <email address hidden> Mon, 29 Jul 2019 16:49:23 +0200

Changed in linux (Ubuntu Disco):
status: Fix Committed → Fix Released
Dima (dima2017) wrote :

Xenial hwe kernels are affected. Non-hwe isn't affected indeed.

Dima (dima2017) wrote :

I've read Juerg's explanation [1]. I guess this commit will appear in xenial repos automatically. Thank you.
[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1827884

What about the 4.15.0 kernel for bionic? The problem still exists; the 4.15.0 kernel in -proposed does not contain this fix.
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1827884

Dima (dima2017) wrote :

xenial-proposed still doesn't either. The fix for 4.15 does work (https://kernel.ubuntu.com/~cking/lp1827884/4.15). Just copy it to xenial-proposed or something.

(Sorry for adding wrong tag, now I know we need to wait for kernel-bot request).

tags: removed: verification-done-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.2.0-10.11

---------------
linux (5.2.0-10.11) eoan; urgency=medium

  * eoan/linux: 5.2.0-10.11 -proposed tracker (LP: #1838113)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync git-ubuntu-log

  * Eoan update: v5.2.4 upstream stable release (LP: #1838428)
    - bnx2x: Prevent load reordering in tx completion processing
    - caif-hsi: fix possible deadlock in cfhsi_exit_module()
    - hv_netvsc: Fix extra rcu_read_unlock in netvsc_recv_callback()
    - igmp: fix memory leak in igmpv3_del_delrec()
    - ipv4: don't set IPv6 only flags to IPv4 addresses
    - ipv6: rt6_check should return NULL if 'from' is NULL
    - ipv6: Unlink sibling route in case of failure
    - net: bcmgenet: use promisc for unsupported filters
    - net: dsa: mv88e6xxx: wait after reset deactivation
    - net: make skb_dst_force return true when dst is refcounted
    - net: neigh: fix multiple neigh timer scheduling
    - net: openvswitch: fix csum updates for MPLS actions
    - net: phy: sfp: hwmon: Fix scaling of RX power
    - net_sched: unset TCQ_F_CAN_BYPASS when adding filters
    - net: stmmac: Re-work the queue selection for TSO packets
    - net/tls: make sure offload also gets the keys wiped
    - nfc: fix potential illegal memory access
    - r8169: fix issue with confused RX unit after PHY power-down on RTL8411b
    - rxrpc: Fix send on a connected, but unbound socket
    - sctp: fix error handling on stream scheduler initialization
    - sctp: not bind the socket in sctp_connect
    - sky2: Disable MSI on ASUS P6T
    - tcp: be more careful in tcp_fragment()
    - tcp: fix tcp_set_congestion_control() use from bpf hook
    - tcp: Reset bytes_acked and bytes_received when disconnecting
    - vrf: make sure skb->data contains ip header to make routing
    - net/mlx5e: IPoIB, Add error path in mlx5_rdma_setup_rn
    - net: bridge: mcast: fix stale nsrcs pointer in igmp3/mld2 report handling
    - net: bridge: mcast: fix stale ipv6 hdr pointer when handling v6 query
    - net: bridge: don't cache ether dest pointer on input
    - net: bridge: stp: don't cache eth dest pointer before skb pull
    - macsec: fix use-after-free of skb during RX
    - macsec: fix checksumming after decryption
    - netrom: fix a memory leak in nr_rx_frame()
    - netrom: hold sock when setting skb->destructor
    - selftests: txring_overwrite: fix incorrect test of mmap() return value
    - net/tls: fix poll ignoring partially copied records
    - net/tls: reject offload of TLS 1.3
    - net/mlx5e: Fix port tunnel GRE entropy control
    - net/mlx5e: Rx, Fix checksum calculation for new hardware
    - net/mlx5e: Fix return value from timeout recover function
    - net/mlx5e: Fix error flow in tx reporter diagnose
    - bnxt_en: Fix VNIC accounting when enabling aRFS on 57500 chips.
    - mlxsw: spectrum_dcb: Configure DSCP map as the last rule is removed
    - net/mlx5: E-Switch, Fix default encap mode
    - mlxsw: spectrum: Do not process learned records with a dummy FID
    - dma-buf: balance refcount inbalance
    - dma-buf: Discard old fence_excl on retrying get_fences_rcu for realloc
    - Revert "gpio/spi: Fix spi-gpio...

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Stefan Bader (smb) on 2019-08-12
Changed in linux (Ubuntu Bionic):
status: Confirmed → Fix Committed
importance: Undecided → High

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Valentyna (valia0906) wrote :

I've tested kernel version 4.15.0.59.61 on bionic i386. The problem seems to be solved.

tags: added: verification-done-bionic
removed: verification-needed-bionic

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Dima (dima2017) wrote :

It does work on xenial. (Seems we don't need xorg hwe for hwe kernels now.)

user@user:~$ cat /proc/cmdline
BOOT_IMAGE=/@/boot/vmlinuz-4.15.0-59-generic root=UUID=50c3a930-a3e9-4444-b4ea-1646d276c7c6 ro rootflags=subvol=@ ipv6.disable=1 zswap.enabled=0 raid=noautodetect priority=low video=SVIDEO-1:d module_blacklist=r8169,mii,msr,jfs,xfs,bluetooth,hfs,hfsplus,ufs,minix,ntfs,joydev,mac_hid
user@user:~$ uname -a
Linux user 4.15.0-59-generic #66~16.04.1-Ubuntu SMP Wed Aug 14 15:42:01 UTC 2019 i686 i686 i686 GNU/Linux

tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-60.67

---------------
linux (4.15.0-60.67) bionic; urgency=medium

  * bionic/linux: 4.15.0-60.67 -proposed tracker (LP: #1841086)

  * [Regression] net test from ubuntu_kernel_selftests failed due to bpf test
    compilation issue (LP: #1840935)
    - SAUCE: Fix "bpf: relax verifier restriction on BPF_MOV | BPF_ALU"

  * [Regression] failed to compile seccomp test from ubuntu_kernel_selftests
    (LP: #1840932)
    - Revert "selftests: skip seccomp get_metadata test if not real root"

  * Packaging resync (LP: #1786013)
    - [Packaging] resync getabis

linux (4.15.0-59.66) bionic; urgency=medium

  * bionic/linux: 4.15.0-59.66 -proposed tracker (LP: #1840006)

  * zfs not completely removed from bionic tree (LP: #1840051)
    - SAUCE: (noup) remove completely the zfs code

  * Packaging resync (LP: #1786013)
    - [Packaging] update helper scripts

  * [18.04 FEAT] Enhanced hardware support (LP: #1836857)
    - s390: report new CPU capabilities
    - s390: add alignment hints to vector load and store

  * [18.04 FEAT] Enhanced CPU-MF hardware counters - kernel part (LP: #1836860)
    - s390/cpum_cf: Add support for CPU-MF SVN 6
    - s390/cpumf: Add extended counter set definitions for model 8561 and 8562

  * ideapad_laptop disables WiFi/BT radios on Lenovo Y540 (LP: #1837136)
    - platform/x86: ideapad-laptop: Remove no_hw_rfkill_list

  * Stacked onexec transitions fail when under NO NEW PRIVS restrictions
    (LP: #1839037)
    - SAUCE: apparmor: fix nnp subset check failure when, stacking

  * bcache: bch_allocator_thread(): hung task timeout (LP: #1784665) // Tight
    timeout for bcache removal causes spurious failures (LP: #1796292)
    - SAUCE: bcache: fix deadlock in bcache_allocator

  * bcache: bch_allocator_thread(): hung task timeout (LP: #1784665)
    - bcache: never writeback a discard operation
    - bcache: improve bcache_reboot()
    - bcache: fix writeback target calc on large devices
    - bcache: add journal statistic
    - bcache: fix high CPU occupancy during journal
    - bcache: use pr_info() to inform duplicated CACHE_SET_IO_DISABLE set
    - bcache: fix incorrect sysfs output value of strip size
    - bcache: fix error return value in memory shrink
    - bcache: fix using of loop variable in memory shrink
    - bcache: Fix indentation
    - bcache: Add __printf annotation to __bch_check_keys()
    - bcache: Annotate switch fall-through
    - bcache: Fix kernel-doc warnings
    - bcache: Remove an unused variable
    - bcache: Suppress more warnings about set-but-not-used variables
    - bcache: Reduce the number of sparse complaints about lock imbalances
    - bcache: Fix a compiler warning in bcache_device_init()
    - bcache: Move couple of string arrays to sysfs.c
    - bcache: Move couple of functions to sysfs.c
    - bcache: Replace bch_read_string_list() by __sysfs_match_string()

  * linux hwe i386 kernel 5.0.0-21.22~18.04.1 crashes on Lenovo x220
    (LP: #1838115)
    - x86/mm: Check for pfn instead of page in vmalloc_sync_one()
    - x86/mm: Sync also unmappings in vmalloc_sync_all()
    - mm/vmalloc.c: add priority threshold to __purge_vmap_area_lazy()...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released