Ubuntu18.04.01: [Power9] power8 Compat guest(RHEL7.6) crashes during guest boot with > 256G of memory (kernel/kvm)

Bug #1818645 reported by bugproxy on 2019-03-05
Affects                           Importance  Assigned to
The Ubuntu-power-systems project  High        Canonical Kernel Team
linux (Ubuntu)                    High        Canonical Kernel Team
  Bionic                          Undecided   Unassigned

Bug Description

== SRU Justification ==

Rebooting a PowerPC VM with > 256G of memory results in a guest crash.

== Fix ==

Backport commit 46dec40fb741 ("KVM: PPC: Book3S HV: Don't truncate HPTE index in xlate function").
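
The report does not spell out why the threshold sits at 256G. One plausible reading of the commit title is sketched below in Python; it is an illustration, not the kernel code. It rests on assumptions that are not stated in this report: each HPTE occupies 16 bytes, the hash page table is sized at 1/128 of guest RAM rounded up to a power of two, and the affected code held the HPTE index in a 32-bit signed integer before shifting it left by 4 to form a byte offset. All function names are hypothetical.

```python
import ctypes

HPTE_BYTES = 16  # assumption: one HPTE is two 64-bit doublewords


def assumed_hpt_bytes(guest_ram_gib):
    """Assumed sizing rule: RAM rounded up to a power of two, divided by 128."""
    ram_bytes = guest_ram_gib << 30
    pow2_ram = 1 << (ram_bytes - 1).bit_length()
    return pow2_ram >> 7


def last_hpte_index(guest_ram_gib):
    """Highest HPTE index in a table of the assumed size."""
    return assumed_hpt_bytes(guest_ram_gib) // HPTE_BYTES - 1


def offset_as_int32(index):
    """Byte offset (index << 4) as a C 32-bit signed int would store it."""
    return ctypes.c_int32(index << 4).value


def offset_as_int64(index):
    """Byte offset (index << 4) as a C 64-bit signed int would store it."""
    return ctypes.c_int64(index << 4).value


# Compare a 256 GiB guest with the 258 GiB guest size used in this report.
for gib in (256, 258):
    idx = last_hpte_index(gib)
    print(gib, idx, offset_as_int32(idx), offset_as_int64(idx))
```

Under these assumptions the largest offset for a 256G guest still fits in 32 bits, while a slightly larger guest yields a negative offset for the highest entries. That would be consistent with the threshold observed here, but it is an inference rather than something the report confirms.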

== Regression Potential ==

Low. Fix is trivial.

== Test Case ==

Create a PowerPC VM with > 256G of memory and reboot it repeatedly.
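
The test case could be automated along these lines. This is a minimal sketch, assuming the guest is managed by libvirt and that `virsh reboot` and `virsh domstate` are available on the host; the function name, the `run` parameter, and the 60-second settle time are hypothetical choices, not part of the report. It checks only the libvirt domain state, so a guest that hangs while QEMU keeps running would not be detected.

```python
import subprocess
import time


def reboot_until_failure(domain, cycles, run=subprocess.run, settle_seconds=60):
    """Reboot `domain` up to `cycles` times; return the failing cycle or None."""
    for cycle in range(1, cycles + 1):
        # Ask libvirt to reboot the guest.
        run(["virsh", "reboot", domain], check=True)
        # Assumed grace period for the guest to come back up.
        time.sleep(settle_seconds)
        # Query the domain state; anything other than "running" counts as a failure.
        result = run(["virsh", "domstate", domain],
                     check=True, capture_output=True, text=True)
        if result.stdout.strip() != "running":
            return cycle
    return None
```

Calling `reboot_until_failure("avocado-vt-vm1", 10)` on the host would reboot the guest from this report up to ten times and return the number of the first cycle after which the domain was no longer running, or `None` if every cycle succeeded.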

== Original description ==

== Comment: #0 - Satheesh Rajendran <email address hidden> - 2019-02-28 04:38:22 ==
---Problem Description---
Power8 Compat guest(RHEL 7.6) crashes during guest boot with > 256G of memory (kernel/kvm)

Contact Information = <email address hidden>

---uname output---
Host Kernel: 4.15.0-1016-ibm-gt

ii qemu-system-ppc 1:2.11+dfsg-1ubuntu7.8-1ibm3 ppc64el QEMU full system emulation binaries (ppc)

ii libvirt-bin 4.0.0-1ubuntu8.6 ppc64el programs for the libvirt library

Guest kernel: 3.10.0-957.5.1.el7.ppc64le (rhel7.6 zstream)

Machine Type = power9 ppc64le

---Steps to Reproduce---
 1. Boot a power8 compat guest with memory >256G
virsh define avocado-vt-vm1;virsh start --console avocado-vt-vm1 (guest xml sosreport attached)
----Guest crashes while booting

2019-02-28 10:36:44.752+0000: starting up libvirt version: 4.0.0, package: 1ubuntu8.6 (Christian Ehrhardt <email address hidden> Fri, 09 Nov 2018 07:42:01 +0100), qemu version: 2.11.1(Debian 1:2.11+dfsg-1ubuntu7.8-1ibm3), hostname: cs-host-f37-ac922-3.pok.ibm.com
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin QEMU_AUDIO_DRV=none /usr/bin/qemu-system-ppc64 -name guest=avocado-vt-vm1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-15-avocado-vt-vm1/master-key.aes -machine pseries-2.11,accel=kvm,usb=off,dump-guest-core=off -m 264192 -realtime mlock=off -smp 256,sockets=256,cores=1,threads=1 -uuid f4e14f88-bf1b-4cc3-b6d6-958d514d6c18 -display none -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-15-avocado-vt-vm1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device qemu-xhci,id=usb,bus=pci.0,addr=0x3 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive file=/home/sath/avocado-fvt-wrapper/data/avocado-vt/images/rhel76-ppc64le.qcow2,format=qcow2,if=none,id=drive-scsi0-0-0-0 -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -netdev tap,fd=28,id=hostnet0,vhost=on,vhostfd=30 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:fc:fd:fe,bus=pci.0,addr=0x1 -chardev pty,id=charserial0 -device spapr-vty,chardev=charserial0,id=serial0,reg=0x30000000 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-15-avocado-vt-vm1/org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on
2019-02-28 10:36:44.759+0000: 19598: info : libvirt version: 4.0.0, package: 1ubuntu8.6 (Christian Ehrhardt <email address hidden> Fri, 09 Nov 2018 07:42:01 +0100)
2019-02-28 10:36:44.759+0000: 19598: info : hostname: cs-host-f37-ac922-3
2019-02-28 10:36:44.759+0000: 19598: info : virObjectUnref:350 : OBJECT_UNREF: obj=0x76d594111710
2019-02-28T10:36:44.781703Z qemu-system-ppc64: -chardev pty,id=charserial0: char device redirected to /dev/pts/3 (label charserial0)
2019-02-28T10:36:44.781945Z qemu-system-ppc64: warning: Number of SMP cpus requested (256) exceeds the recommended cpus supported by KVM (128)
2019-02-28T10:36:44.781953Z qemu-system-ppc64: warning: Number of hotpluggable cpus requested (256) exceeds the recommended cpus supported by KVM (128)
2019-02-28 10:37:18.060+0000: shutting down, reason=crashed
2019-02-28T10:37:18.071969Z qemu-system-ppc64: terminating on signal 15 from pid 14056 (/usr/sbin/libvirtd)
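
To confirm that the guest in this log is above the 256G threshold, the `-m` value can be converted by hand or with a short script. The sketch below assumes the plain numeric form of `-m`, which QEMU interprets as MiB; the function name is hypothetical and suffixed or `size=` forms are not handled.

```python
import re

MIB_PER_GIB = 1024


def guest_memory_gib(qemu_cmdline):
    """Return the value of a plain numeric -m option in GiB, or None if absent."""
    # Require whitespace after "-m" so options such as -machine or -mon do not match.
    match = re.search(r"(?:^|\s)-m\s+(\d+)(?=\s|$)", qemu_cmdline)
    if match is None:
        return None
    return int(match.group(1)) / MIB_PER_GIB


# Excerpt copied from the command line in the log above.
excerpt = ("-machine pseries-2.11,accel=kvm,usb=off,dump-guest-core=off "
           "-m 264192 -realtime mlock=off")
print(guest_memory_gib(excerpt))
```

For the command line above this gives 258 GiB, just over the 256G boundary.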

*Additional Instructions for <email address hidden>:
-Post a private note with access information to the machine that the bug is occurring on.

== Comment: #3 - Satheesh Rajendran <email address hidden> - 2019-02-28 04:51:45 ==
Possible Upstream fix: https://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc.git/commit/?h=kvm-ppc-next&id=46dec40fb741f00f1864580130779aeeaf24fb3d

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-175853 severity-high targetmilestone-inin---
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Frank Heimes (frank-heimes) wrote :

Assuming that this happens with all guests with more than 256GB of memory, since it's a KVM host kernel side issue.

Changed in ubuntu-power-systems:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)

------- Comment From <email address hidden> 2019-03-06 02:58 EDT-------
(In reply to comment #13)
> Assuming that this happens with all guests with more than 256GB of memory,
> since it's a KVM host kernel side issue.

It happens only with Power8 compat (HPT) guests, and specifically with older guest kernels, in this case 3.10.0-957.5.1.el7.ppc64le.

This upstream commit is a likely fix: https://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc.git/commit/?h=kvm-ppc-next&id=46dec40fb741f00f1864580130779aeeaf24fb3d

Regards,
-Satheesh.

Manoj Iyer (manjo) on 2019-03-11
Changed in linux (Ubuntu):
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
Andrew Cloke (andrew-cloke) wrote :

Marking as “invalid” as this issue was not found with the generic Ubuntu kernel. I understand a parallel support case has been raised against the appropriate project through the support portal.

Changed in ubuntu-power-systems:
status: Triaged → Invalid
Changed in linux (Ubuntu):
status: New → Invalid
bugproxy (bugproxy) on 2019-03-13
tags: added: severity-critical
removed: severity-high
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-03-18 16:15 EDT-------
Removing the genesis keyword. The fix for this issue is already contained in the Ubuntu 18.04.1 for Genesis kernel (request was made through SalesForce). But I think it is still a valid Ubuntu 18.04.1 LTS issue.

Juerg Haefliger (juergh) wrote :

Nominating for Bionic as the referenced commit is a valid fix for Bionic.

Juerg Haefliger (juergh) on 2019-03-19
description: updated
description: updated
Juerg Haefliger (juergh) wrote :

Can you please give the following test kernel a try and let me know if that resolves the issue?
https://kernel.ubuntu.com/~juergh/lp1818645/

Andrew Cloke (andrew-cloke) wrote :

Changing bug status as this fix is now targeted for the generic Ubuntu kernel.

Changed in ubuntu-power-systems:
status: Invalid → Triaged
Changed in linux (Ubuntu):
status: Invalid → New
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-03-20 07:38 EDT-------
(In reply to comment #31)
> Can you please give the following test kernel a try and let me know if that
> resolves the issue?
> https://kernel.ubuntu.com/~juergh/lp1818645/

Tested with the above host kernel; the issue is fixed.

Host:
# uname -a
Linux xxx 4.15.0-47-generic #50+lp1818645 SMP Tue Mar 19 07:47:19 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux

Guest:
[root@localhost ~]# lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 1
NUMA node(s): 1
Model: 2.2 (pvr 004e 1202)
Model name: POWER8 (architected), altivec supported
Hypervisor vendor: KVM
Virtualization type: para
L1d cache: 32K
L1i cache: 32K
NUMA node0 CPU(s): 0-127
[root@localhost ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:            445           2         441           0           1         441
Swap:             4           0           4
[root@localhost ~]# uname -a
Linux localhost.localdomain 3.10.0-957.5.1.el7.ppc64le #1 SMP Wed Dec 19 15:44:52 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

Regards,
-Satheesh

bugproxy (bugproxy) on 2019-03-20
tags: added: targetmilestone-inin18043
removed: targetmilestone-inin---
Changed in linux (Ubuntu Bionic):
status: New → Fix Committed
Frank Heimes (frank-heimes) wrote :

Since the fix is already included in the disco kernel, I'm changing the root 'linux (Ubuntu)' entry (which is usually used to track the current development release kernel) to Fix Released.
Because the bionic entry is Fix Committed, I'm changing the project entry to Fix Committed as well.

Changed in linux (Ubuntu):
status: New → Fix Released
Changed in ubuntu-power-systems:
status: Triaged → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic

Hello IBM,

Could you please verify whether the Bionic kernel currently in -proposed fixes the issue as expected?

Thank you.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-04-17 03:01 EDT-------
Tested on the host with the proposed kernel (4.15.0-48-generic #51) and the issue is fixed; this bug can be closed.

Host kernel: 4.15.0-48-generic #51
Guest Kernel: 3.10.0-957.5.1.el7.ppc64le
libvirt-bin 4.0.0-1ubuntu8.8
qemu-kvm 1:2.11+dfsg-1ubuntu7.12

Guest Mem: 400G

Guest console:

Red Hat Enterprise Linux Server 7.6 (Maipo)
Kernel 3.10.0-957.5.1.el7.ppc64le on an ppc64le

localhost login: root
Password:
Last login: Wed Apr 17 02:54:58 on hvc0
[root@localhost ~]# ls
anaconda-ks.cfg original-ks.cfg perl5
[root@localhost ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:            395           3         390           0           1         390
Swap:             4           0           4
[root@localhost ~]# lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 256
NUMA node(s): 1
Model: 2.2 (pvr 004e 1202)
Model name: POWER8 (architected), altivec supported
Hypervisor vendor: KVM
Virtualization type: para
L1d cache: 32K
L1i cache: 32K
NUMA node0 CPU(s): 0-255
[root@localhost ~]# dmesg -l3
[root@localhost ~]#

Regards,
-Satheesh

Thank you, Satheesh.

Marking bug as verified.

tags: added: verification-done-bionic
removed: verification-needed-bionic
bugproxy (bugproxy) on 2019-04-22
tags: added: targetmilestone-inin18041
removed: targetmilestone-inin18043
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-48.51

---------------
linux (4.15.0-48.51) bionic; urgency=medium

  * linux: 4.15.0-48.51 -proposed tracker (LP: #1822820)

  * Packaging resync (LP: #1786013)
    - [Packaging] update helper scripts
    - [Packaging] resync retpoline extraction

  * 3b080b2564287be91605bfd1d5ee985696e61d3c in ubuntu_btrfs_kernel_fixes
    triggers system hang on i386 (LP: #1812845)
    - btrfs: raid56: properly unmap parity page in finish_parity_scrub()

  * [P9][LTCTest][Opal][FW910] cpupower monitor shows multiple stop Idle_Stats
    (LP: #1719545)
    - cpupower : Fix header name to read idle state name

  * [amdgpu] screen corruption when using touchpad (LP: #1818617)
    - drm/amdgpu/gmc: steal the appropriate amount of vram for fw hand-over (v3)
    - drm/amdgpu: Free VGA stolen memory as soon as possible.

  * [SRU][B/C/OEM]IOMMU: add kernel dma protection (LP: #1820153)
    - ACPICA: AML parser: attempt to continue loading table after error
    - ACPI / property: Allow multiple property compatible _DSD entries
    - PCI / ACPI: Identify untrusted PCI devices
    - iommu/vt-d: Force IOMMU on for platform opt in hint
    - iommu/vt-d: Do not enable ATS for untrusted devices
    - thunderbolt: Export IOMMU based DMA protection support to userspace
    - iommu/vt-d: Disable ATS support on untrusted devices

  * Add basic support to NVLink2 passthrough (LP: #1819989)
    - powerpc/powernv/npu: Do not try invalidating 32bit table when 64bit table is
      enabled
    - powerpc/powernv: call OPAL_QUIESCE before OPAL_SIGNAL_SYSTEM_RESET
    - powerpc/powernv: Export opal_check_token symbol
    - powerpc/powernv: Make possible for user to force a full ipl cec reboot
    - powerpc/powernv/idoa: Remove unnecessary pcidev from pci_dn
    - powerpc/powernv: Move npu struct from pnv_phb to pci_controller
    - powerpc/powernv/npu: Move OPAL calls away from context manipulation
    - powerpc/pseries/iommu: Use memory@ nodes in max RAM address calculation
    - powerpc/pseries/npu: Enable platform support
    - powerpc/pseries: Remove IOMMU API support for non-LPAR systems
    - powerpc/powernv/npu: Check mmio_atsd array bounds when populating
    - powerpc/powernv/npu: Fault user page into the hypervisor's pagetable

  * Huawei Hi1822 NIC has poor performance (LP: #1820187)
    - net-next: hinic: fix a problem in free_tx_poll()
    - hinic: remove ndo_poll_controller
    - net-next/hinic: add checksum offload and TSO support
    - hinic: Fix l4_type parameter in hinic_task_set_tunnel_l4
    - net-next/hinic:replace multiply and division operators
    - net-next/hinic:add rx checksum offload for HiNIC
    - net-next/hinic:fix a bug in set mac address
    - net-next/hinic: fix a bug in rx data flow
    - net: hinic: fix null pointer dereference on pointer hwdev
    - hinic: optmize rx refill buffer mechanism
    - net-next/hinic:add shutdown callback
    - net-next/hinic: replace disable_irq_nosync/enable_irq

  * [CONFIG] please enable highdpi font FONT_TER16x32 (LP: #1819881)
    - Fonts: New Terminus large console font
    - [Config]: enable highdpi Terminus 16x32 font support

  * [19.04 FEAT] qeth: Enhanced link...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released