[Ubuntu 18.04.1] POWER9 - Nvidia Volta - Kernel changes to enable Nvidia driver on bare metal

Bug #1772991 reported by bugproxy on 2018-05-23
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
The Ubuntu-power-systems project | Fix Released | High | Canonical Kernel Team | -
linux (Ubuntu) | Fix Released | High | Joseph Salisbury | -
Bionic | Fix Released | High | Joseph Salisbury | -

Bug Description

== SRU Justification ==
12 kernel patches have been identified as needed to support Nvidia Volta
on bare metal. All are accepted upstream in 4.17. Three of those are
already in bionic, leaving a total of 9 remaining commits needed in bionic.
This pull request is for those other 9 commits.

== Regression Potential ==
All of the commits are specific to powerpc.

== Test Case ==
A test kernel was built with these patches and tested by IBM.

== Comment: #0 - Barry B. Arndt <email address hidden> - 2018-05-23 13:40:33 ==
12 kernel patches have been identified as needed to support Nvidia Volta on bare metal. All are accepted upstream in 4.17. Three of those are already in bionic, leaving a total of 9 remaining commits needed in bionic. Those 9 commits are:

720c84046c26 powerpc/npu-dma.c: Fix crash after __mmu_notifier_register failure
2b74e2a9b39d powerpc/powernv/npu: Fix deadlock in mmio_invalidate()
5ee573e8ef03 powerpc/powernv/mce: Don't silently restart the machine
fb5924fddf9e powerpc/mm: Flush cache on memory hot(un)plug
7fd6641de28f powerpc/powernv/memtrace: Let the arch hotunplug code flush cache
28a5933e8d36 powerpc/powernv/npu: Add lock to prevent race in concurrent context init/destroy
a1409adac748 powerpc/powernv/npu: Prevent overwriting of pnv_npu2_init_contex() callback parameters
d0cf9b561ca9 powerpc/powernv/npu: Do a PID GPU TLB flush when invalidating a large address range
75ecfb49516c powerpc/mce: Fix a bug where mce loops on memory UE.

We cherry-picked the commits, and all applied cleanly. The resultant kernel built successfully and loaded.
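
For anyone reproducing the backport, here is a minimal sketch that turns the commit list above into `git cherry-pick` commands. The helper names (`parse_commits`, `cherry_pick_commands`) are hypothetical, the `-x` flag (which records the upstream SHA in the new commit message) is an assumption rather than something stated in this report, and the listed order is not necessarily the order in which the commits were applied. The script only prints the commands; it does not run them.

```python
# Commit list copied from the bug description: abbreviated SHA, then subject.
COMMITS = """\
720c84046c26 powerpc/npu-dma.c: Fix crash after __mmu_notifier_register failure
2b74e2a9b39d powerpc/powernv/npu: Fix deadlock in mmio_invalidate()
5ee573e8ef03 powerpc/powernv/mce: Don't silently restart the machine
fb5924fddf9e powerpc/mm: Flush cache on memory hot(un)plug
7fd6641de28f powerpc/powernv/memtrace: Let the arch hotunplug code flush cache
28a5933e8d36 powerpc/powernv/npu: Add lock to prevent race in concurrent context init/destroy
a1409adac748 powerpc/powernv/npu: Prevent overwriting of pnv_npu2_init_contex() callback parameters
d0cf9b561ca9 powerpc/powernv/npu: Do a PID GPU TLB flush when invalidating a large address range
75ecfb49516c powerpc/mce: Fix a bug where mce loops on memory UE.
"""


def parse_commits(text):
    """Split each non-empty line into a (sha, subject) pair."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            sha, subject = line.split(maxsplit=1)
            pairs.append((sha, subject))
    return pairs


def cherry_pick_commands(pairs):
    """Build one `git cherry-pick -x <sha>` argument list per commit.

    Nothing is executed here; the caller decides whether to run the commands.
    """
    return [["git", "cherry-pick", "-x", sha] for sha, _subject in pairs]


if __name__ == "__main__":
    for cmd in cherry_pick_commands(parse_commits(COMMITS)):
        print(" ".join(cmd))
```

Run from a checkout of the Bionic kernel tree that has the upstream commits available, the printed commands could be reviewed and then executed one at a time, so that a conflict in any single commit is easy to spot.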

bugproxy (bugproxy) on 2018-05-23
tags: added: architecture-ppc64le bugnameltc-168165 severity-high targetmilestone-inin18041
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Changed in ubuntu-power-systems:
importance: Undecided → High
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
tags: added: p9 triage-g
Changed in linux (Ubuntu):
status: New → In Progress
importance: Undecided → High
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Bionic):
status: New → In Progress
importance: Undecided → High
assignee: nobody → Joseph Salisbury (jsalisbury)
Joseph Salisbury (jsalisbury) wrote :

I built a Bionic test kernel with the 9 commits posted in the bug description. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1772991

Can you test this kernel and see if it resolves this bug?

Note about installing test kernels:
• If the test kernel is older than 4.15 (Bionic), you need to install the linux-image and linux-image-extra .deb packages.
• If the test kernel is 4.15 (Bionic) or newer, you need to install the linux-modules, linux-modules-extra and linux-image-unsigned .deb packages.
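
The package rule in the note above can be written as a small lookup. This is only an illustrative sketch: the function name `packages_for_test_kernel` is hypothetical, and it returns package name prefixes, not the full .deb file names (those depend on the exact build and are not listed in this report).

```python
import re


def packages_for_test_kernel(version):
    """Return the .deb package name prefixes to install for a test kernel.

    `version` is a kernel version string such as "4.15.0-24.26"; only the
    leading major.minor part is used for the comparison against 4.15.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised kernel version: {version!r}")
    major_minor = (int(match.group(1)), int(match.group(2)))

    if major_minor < (4, 15):
        # Kernels older than 4.15 ship the image and extra modules this way.
        return ["linux-image", "linux-image-extra"]
    # 4.15 (Bionic) and newer split modules out and use an unsigned image.
    return ["linux-modules", "linux-modules-extra", "linux-image-unsigned"]
```

For the Bionic test kernel in this bug, which is based on 4.15, the second branch applies.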

Thanks in advance!

------- Comment From <email address hidden> 2018-05-24 10:20 EDT-------
Hi Canonical,

Fred (thanks!) tested this kernel and it is working fine. He was even able to install the CUDA stack on an AC922, and it worked great!

Joseph Salisbury (jsalisbury) wrote :
description: updated
Changed in ubuntu-power-systems:
status: New → In Progress
Breno Leitão (breno-leitao) wrote :

Hi Joseph,

I tried to find out whether the patchset above was accepted/acked, but I didn't find anything. Do you know whether the patchset would meet the SRU criteria?

Joseph Salisbury (jsalisbury) wrote :

We are still waiting for a couple of acks at this point. The patch set should meet the SRU criteria; it is just waiting for review. I'll resend the SRU request if the acks don't happen soon.

Joseph Salisbury (jsalisbury) wrote :

These patches are now applied to the Bionic master-next repo.

Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Changed in ubuntu-power-systems:
status: In Progress → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
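
The tag change requested in the verification comment amounts to swapping one tag for another. The sketch below models that swap on a plain list of tag strings; `record_verification` is a hypothetical helper for illustration and does not call any Launchpad API.

```python
def record_verification(tags, series, passed):
    """Replace 'verification-needed-<series>' with the outcome tag.

    Returns a new sorted list of tags; the input list is left unchanged.
    """
    needed = f"verification-needed-{series}"
    outcome = f"verification-{'done' if passed else 'failed'}-{series}"
    if needed not in tags:
        raise ValueError(f"tag {needed!r} is not set on this bug")
    return sorted((set(tags) - {needed}) | {outcome})
```

For example, a successful test of the Bionic kernel would turn `verification-needed-bionic` into `verification-done-bionic` while leaving the other tags in place.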

Hello IBM,

Could you please verify the fix(es) with the Bionic kernel currently in -proposed?

Thank you.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-06-25 09:59 EDT-------
I am working on this and plan to finish today or tomorrow (6/26).

Douglas Lehr (dllehr-ibm) wrote :

I can verify, from IBM's standpoint, that we were able to use Volta GPUs with POWER9. The Nvidia driver was enabled correctly.

Thank you Douglas for the verification!

tags: added: verification-done-bionic
removed: verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-24.26

---------------
linux (4.15.0-24.26) bionic; urgency=medium

  * linux: 4.15.0-24.26 -proposed tracker (LP: #1776338)

  * Bionic update: upstream stable patchset 2018-06-06 (LP: #1775483)
    - drm: bridge: dw-hdmi: Fix overflow workaround for Amlogic Meson GX SoCs
    - i40e: Fix attach VF to VM issue
    - tpm: cmd_ready command can be issued only after granting locality
    - tpm: tpm-interface: fix tpm_transmit/_cmd kdoc
    - tpm: add retry logic
    - Revert "ath10k: send (re)assoc peer command when NSS changed"
    - bonding: do not set slave_dev npinfo before slave_enable_netpoll in
      bond_enslave
    - ipv6: add RTA_TABLE and RTA_PREFSRC to rtm_ipv6_policy
    - ipv6: sr: fix NULL pointer dereference in seg6_do_srh_encap()- v4 pkts
    - KEYS: DNS: limit the length of option strings
    - l2tp: check sockaddr length in pppol2tp_connect()
    - net: validate attribute sizes in neigh_dump_table()
    - llc: delete timers synchronously in llc_sk_free()
    - tcp: don't read out-of-bounds opsize
    - net: af_packet: fix race in PACKET_{R|T}X_RING
    - tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established sockets
    - net: fix deadlock while clearing neighbor proxy table
    - team: avoid adding twice the same option to the event list
    - net/smc: fix shutdown in state SMC_LISTEN
    - team: fix netconsole setup over team
    - packet: fix bitfield update race
    - tipc: add policy for TIPC_NLA_NET_ADDR
    - pppoe: check sockaddr length in pppoe_connect()
    - vlan: Fix reading memory beyond skb->tail in skb_vlan_tagged_multi
    - amd-xgbe: Add pre/post auto-negotiation phy hooks
    - sctp: do not check port in sctp_inet6_cmp_addr
    - amd-xgbe: Improve KR auto-negotiation and training
    - strparser: Do not call mod_delayed_work with a timeout of LONG_MAX
    - amd-xgbe: Only use the SFP supported transceiver signals
    - strparser: Fix incorrect strp->need_bytes value.
    - net: sched: ife: signal not finding metaid
    - tcp: clear tp->packets_out when purging write queue
    - net: sched: ife: handle malformed tlv length
    - net: sched: ife: check on metadata length
    - llc: hold llc_sap before release_sock()
    - llc: fix NULL pointer deref for SOCK_ZAPPED
    - net: ethernet: ti: cpsw: fix tx vlan priority mapping
    - virtio_net: split out ctrl buffer
    - virtio_net: fix adding vids on big-endian
    - KVM: s390: force bp isolation for VSIE
    - s390: correct module section names for expoline code revert
    - microblaze: Setup dependencies for ASM optimized lib functions
    - commoncap: Handle memory allocation failure.
    - scsi: mptsas: Disable WRITE SAME
    - cdrom: information leak in cdrom_ioctl_media_changed()
    - m68k/mac: Don't remap SWIM MMIO region
    - block/swim: Check drive type
    - block/swim: Don't log an error message for an invalid ioctl
    - block/swim: Remove extra put_disk() call from error path
    - block/swim: Rename macros to avoid inconsistent inverted logic
    - block/swim: Select appropriate drive on device open
    - block/swim: Fix array bounds check
    - block/swim: Fix IO error at end of medium
    -...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released
Brad Figg (brad-figg) on 2019-07-24
tags: added: cscc