[Witherspoon-DD2.2][Ubu 18.10] [4.18.0-7-generic ] OS booting thrown with nouveau errors; OS booted successfully

Bug #1794055 reported by bugproxy on 2018-09-24
This bug affects 1 person
Affects                            Importance   Assigned to
The Ubuntu-power-systems project   High         Canonical Kernel Team
linux (Ubuntu)                     High         Canonical Kernel Team
Cosmic                             High         Canonical Kernel Team

Bug Description

== Comment: #0 - Kalpana Shetty <email address hidden> - 2018-09-15 23:55:13 ==
---Problem Description---
[Witherspoon-DD2.2][Ubu 18.10] [4.18.0-7-generic ] OS booting thrown with nouveau errors

Contact Information = <email address hidden>, <email address hidden>

---uname output---
root@ltc-wcwsp3:~# uname -a
Linux ltc-wcwsp3 4.18.0-7-generic #8-Ubuntu SMP Tue Aug 28 18:20:56 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = Witherspoon DD2.2 LC

Steps:
1. Netinstall Ubu 18.10 on Witherspoon-LC-DD2.2 6GPU system ------> PASS
2. Boot the OS ---> PASS, but errors related to the open source NVIDIA driver (nouveau) are thrown on the console.

  [Disk: sdb2 / c0302064-c5a3-49a7-8bd4-402283e6fcbe]
    Ubuntu, with Linux 4.18.0-7-generic (recovery mode)
    Ubuntu, with Linux 4.18.0-7-generic
    Ubuntu
  [Disk: nvme0n1p2 / c5d042f1-812e-49e0-94b2-ade477084061]
    Ubuntu, with Linux 4.18.0-7-generic (recovery mode)
 * Ubuntu, with Linux 4.18.0-7-generic
    Ubuntu

  System information
  System configuration
  System status log
  Language
  Rescan devices
  Retrieve config from URL
  Plugins (0)
  Exit to shell
 Enter=accept, e=edit, n=new, x=exit, l=language, g=log, h=help
The system is going down NOW!
Sent SIGTERM to all processes
Sent SIGKILL to all processes
[ 57.513329] kexec_core: Starting new kernel
[ 149.358703978,5] OPAL: Switch to big-endian OS
[ 153.355498935,5] OPAL: Switch to little-endian OS
[ 2.943735] integrity: Unable to open file: /etc/keys/x509_ima.der (-2)
[ 2.943738] integrity: Unable to open file: /etc/keys/x509_evm.der (-2)
[ 3.132733] vio vio: uevent: failed to send synthetic uevent
[ 4.058698] nouveau 0004:04:00.0: gr: failed to load gr/sw_nonctx
[ 4.129215] nouveau 0004:04:00.0: DRM: failed to create kernel channel, -22
[ 19.126509] nouveau 0004:04:00.0: DRM: failed to idle channel 0 [DRM]
[ 19.281450] nouveau 0004:05:00.0: gr: failed to load gr/sw_nonctx
[ 19.351322] nouveau 0004:05:00.0: DRM: failed to create kernel channel, -22
[ 34.350509] nouveau 0004:05:00.0: DRM: failed to idle channel 0 [DRM]
[ 34.502063] nouveau 0004:06:00.0: gr: failed to load gr/sw_nonctx
[ 34.572144] nouveau 0004:06:00.0: DRM: failed to create kernel channel, -22
[ 49.570509] nouveau 0004:06:00.0: DRM: failed to idle channel 0 [DRM]
[ 49.734754] nouveau 0035:03:00.0: gr: failed to load gr/sw_nonctx
[ 49.805057] nouveau 0035:03:00.0: DRM: failed to create kernel channel, -22
[ 64.802510] nouveau 0035:03:00.0: DRM: failed to idle channel 0 [DRM]
[ 64.955442] nouveau 0035:04:00.0: gr: failed to load gr/sw_nonctx
[ 65.025537] nouveau 0035:04:00.0: DRM: failed to create kernel channel, -22

[ 80.022509] nouveau 0035:04:00.0: DRM: failed to idle channel 0 [DRM]
[ 80.181169] nouveau 0035:05:00.0: gr: failed to load gr/sw_nonctx
[ 80.251481] nouveau 0035:05:00.0: DRM: failed to create kernel channel, -22
[ 95.250509] nouveau 0035:05:00.0: DRM: failed to idle channel 0 [DRM]
/dev/nvme0n1p2: recovering journal
/dev/nvme0n1p2: clean, 72569/97681408 files, 7384418/390701312 blocks
-.mount
kmod-static-nodes.service
dev-hugepages.mount
dev-mqueue.mount
sys-kernel-debug.mount
ufw.service
lvm2-lvmetad.service
systemd-remount-fs.service
systemd-random-seed.service
systemd-sysusers.service
keyboard-setup.service
systemd-tmpfiles-setup-dev.service
lvm2-monitor.service
finalrd.service
console-setup.service
swapfile.swap
ebtables.service
systemd-udevd.service
systemd-journald.service
systemd-journal-flush.service
systemd-tmpfiles-setup.service
systemd-update-utmp.service
[ 100.997765] vio vio: uevent: failed to send synthetic uevent
systemd-udev-trigger.service
systemd-timesyncd.service
apparmor.service
lvm2-pvscan@8:3.service
systemd-modules-load.service
sys-kernel-config.mount
sys-fs-fuse-connections.mount
systemd-sysctl.service
ondemand.service
dbus.service
irqbalance.service
opal-prd.service
lxcfs.service
atd.service
cron.service
iprdump.service
iprinit.service
systemd-logind.service
iprupdate.service
systemd-networkd.service
rsyslog.service
polkit.service
accounts-daemon.service
lxd-containers.service
networkd-dispatcher.service
var-lib-lxcfs.mount
tmp-selftest\x2dmountpoint\x2d039055037.mount
snapd.service
snapd.seeded.service
systemd-resolved.service
systemd-networkd-wait-online.service
blk-availability.service
systemd-user-sessions.service
apport.service

Ubuntu Cosmic Cuttlefish (development branch) ltc-wcwsp3 hvc0

ltc-wcwsp3 login:
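
The nouveau lines in the console log above repeat the same three failures for each GPU. A minimal sketch for tallying them per PCI address follows, assuming the log is available as plain text; the function name `nouveau_failures` and the regular expression are illustrative and not part of any tool mentioned in this report.

```python
import re
from collections import defaultdict

# Matches console lines such as:
# [ 4.058698] nouveau 0004:04:00.0: gr: failed to load gr/sw_nonctx
NOUVEAU_LINE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+nouveau\s+"
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]):\s+"
    r"(?P<msg>.*failed.*)"
)


def nouveau_failures(log_text):
    """Return {pci_address: [(timestamp, message), ...]} for nouveau 'failed' lines."""
    failures = defaultdict(list)
    for line in log_text.splitlines():
        match = NOUVEAU_LINE.search(line)
        if match:
            failures[match["dev"]].append((float(match["ts"]), match["msg"]))
    return dict(failures)


# A few lines copied from the console log in this report.
sample = """\
[ 4.058698] nouveau 0004:04:00.0: gr: failed to load gr/sw_nonctx
[ 4.129215] nouveau 0004:04:00.0: DRM: failed to create kernel channel, -22
[ 19.126509] nouveau 0004:04:00.0: DRM: failed to idle channel 0 [DRM]
[ 19.281450] nouveau 0004:05:00.0: gr: failed to load gr/sw_nonctx
"""

for device, events in nouveau_failures(sample).items():
    print(device, len(events), "failure lines")
```

Keeping the timestamps with each message makes it possible to see how long the boot spends on each GPU before moving on to the next one.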

== Comment: #2 - Kalpana Shetty <email address hidden> - 2018-09-16 00:07:26 ==
sosreport -> http://9.114.13.132/repo/bugs/ubu/sosreport-BZ171506.171506-20180915235600.tar.xz

== Comment: #3 - Kalpana Shetty <email address hidden> - 2018-09-16 00:33:02 ==

== Comment: #4 - Praveen K. Pandey <email address hidden> - 2018-09-19 05:52:23 ==
Facing a nouveau-related error on a POWER8 system as well:

[ 4.764818] nouveau 0002:01:00.0: fifo: fault 00 [READ] at 0000000000020000 engine 0c [HOST6] client 06 [GPC0/L1_2] reason 02 [PTE] on channel 0 [03ffb18000 DRM]
[ 4.942169] nouveau 000a:01:00.0: fifo: fault 00 [READ] at 0000000000020000 engine 0c [HOST6] client 06 [GPC0/L1_2] reason 02 [PTE] on channel 0 [03ffb18000 DRM]
/dev/sdb2: clean, 132397/61054976 files, 5995714/244188416 blocks
[ 11.206278] vio vio: uevent: failed to send synthetic uevent
[ OK ] Started Show Plymouth Boot Screen.
[ OK ] Reached target Local Encrypted Volumes.
[ OK ] Started Forward Password Requests to Plymouth Directory Watch.
plymouth-start.service
[ OK ] Started ebtables ruleset management.

== Comment: #5 - Chandni Verma <email address hidden> - 2018-09-20 16:41:49 ==
--- screening ---

From provided dmesg, I notice:

1294 [ 19.281478] nouveau 0004:05:00.0: bios: version 88.00.13.00.02
1295 [ 19.282753] nouveau 0004:05:00.0: Direct firmware load for nvidia/gv100/gr/sw_nonctx.bin failed with error -2
1296 [ 19.282755] nouveau 0004:05:00.0: gr: failed to load gr/sw_nonctx
1297 [ 19.282813] nouveau 0004:05:00.0: Using 32-bit DMA via iommu

..

1322 [ 34.367713] nouveau 0004:06:00.0: NVIDIA GV100 (140000a1)
1323 [ 34.497152] nouveau 0004:06:00.0: bios: version 88.00.13.00.02
1324 [ 34.502736] nouveau 0004:06:00.0: Direct firmware load for nvidia/gv100/gr/sw_nonctx.bin failed with error -2
1325 [ 34.502738] nouveau 0004:06:00.0: gr: failed to load gr/sw_nonctx
1326 [ 34.502797] nouveau 0004:06:00.0: Using 32-bit DMA via iommu

..

up to 6 instances of the above...

Looks like an NVIDIA firmware issue.

== Comment: #6 - Luciano Chavez <email address hidden> - 2018-09-20 17:03:31 ==
(In reply to comment #5)
> --- screening ---
>
> From provided dmesg, I notice:
>
>
> 1294 [ 19.281478] nouveau 0004:05:00.0: bios: version 88.00.13.00.02
> 1295 [ 19.282753] nouveau 0004:05:00.0: Direct firmware load for
> nvidia/gv100/gr/sw_nonctx.bin failed with error -2
> 1296 [ 19.282755] nouveau 0004:05:00.0: gr: failed to load gr/sw_nonctx
> 1297 [ 19.282813] nouveau 0004:05:00.0: Using 32-bit DMA via iommu
>
> ..
>
> 1322 [ 34.367713] nouveau 0004:06:00.0: NVIDIA GV100 (140000a1)
> 1323 [ 34.497152] nouveau 0004:06:00.0: bios: version 88.00.13.00.02
> 1324 [ 34.502736] nouveau 0004:06:00.0: Direct firmware load for
> nvidia/gv100/gr/sw_nonctx.bin failed with error -2
> 1325 [ 34.502738] nouveau 0004:06:00.0: gr: failed to load gr/sw_nonctx
> 1326 [ 34.502797] nouveau 0004:06:00.0: Using 32-bit DMA via iommu
>
> ..
>
> upto 6 instances of the above...
>
>
> Looks like an NVIDIA firmware issue.

Well, I think those messages mean that the nouveau module can't find the firmware file, as opposed to it being a FW issue. It might be a packaging issue, if this is not actually causing any real problems. Probably best to mirror this to Canonical for their comment.
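
To illustrate the point about the file not being found: error -2 in the "Direct firmware load ... failed" lines corresponds to ENOENT (no such file or directory). The sketch below decodes the code and checks whether the file exists. It assumes `/lib/firmware` as the firmware root, which is only one of the locations the kernel searches; the function name `describe_firmware_error` is hypothetical.

```python
import errno
from pathlib import Path


def describe_firmware_error(name, code, firmware_root="/lib/firmware"):
    """Explain a 'Direct firmware load for <name> failed with error <code>' line."""
    # Kernel messages report negative errno values, e.g. -2 for ENOENT.
    symbol = errno.errorcode.get(-code, "unknown")
    path = Path(firmware_root) / name
    return {
        "errno": symbol,
        "expected_path": str(path),
        "present": path.is_file(),
    }


info = describe_firmware_error("nvidia/gv100/gr/sw_nonctx.bin", -2)
print(info)
```

If `present` is False on the affected machine, that supports the reading that the file is simply not installed rather than that the firmware itself is faulty.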

== Comment: #10 - Chandni Verma <email address hidden> - 2018-09-24 03:25:35 ==

bugproxy (bugproxy) wrote : dmesg
  • dmesg (112.3 KiB, application/octet-stream)

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-171506 severity-high targetmilestone-inin1810
bugproxy (bugproxy) wrote : sosreport

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Andrew Cloke (andrew-cloke) wrote :

It looks like this issue is seen with the "in development" 18.10 4.18 kernel. Does the same issue occur with 18.04 (4.15 kernel)?

Changed in ubuntu-power-systems:
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
Manoj Iyer (manjo) wrote :

Looks like we don't ship the nvidia/gv100/gr/sw_nonctx.bin in our linux-firmware package. This firmware is also missing in git://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git and would need to be upstreamed first to the linux-firmware tree so that we can sync to Ubuntu.

Currently the following nvidia firmware is available upstream:

linux-firmware$ find . -name sw_nonctx.bin
./nvidia/gm204/gr/sw_nonctx.bin
./nvidia/gp107/gr/sw_nonctx.bin
./nvidia/gp108/gr/sw_nonctx.bin
./nvidia/gm206/gr/sw_nonctx.bin
./nvidia/gp100/gr/sw_nonctx.bin
./nvidia/gm20b/gr/sw_nonctx.bin
./nvidia/gp10b/gr/sw_nonctx.bin
./nvidia/gp106/gr/sw_nonctx.bin
./nvidia/gm200/gr/sw_nonctx.bin
./nvidia/gk20a/sw_nonctx.bin
./nvidia/gp104/gr/sw_nonctx.bin
./nvidia/gp102/gr/sw_nonctx.bin
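
The same check as the `find` command above can be sketched in Python, which makes it easy to test for a specific chipset such as gv100. This assumes a local linux-firmware checkout with an `nvidia/` directory at its top level; the function name `chips_with_firmware` is illustrative, and the demo builds a throwaway directory tree instead of using a real checkout.

```python
import tempfile
from pathlib import Path


def chips_with_firmware(checkout, filename="sw_nonctx.bin"):
    """Return sorted chipset directory names under <checkout>/nvidia that contain filename."""
    nvidia_dir = Path(checkout) / "nvidia"
    return sorted({
        match.relative_to(nvidia_dir).parts[0]
        for match in nvidia_dir.rglob(filename)
    })


# Demo on a temporary tree that mimics two of the paths listed above.
with tempfile.TemporaryDirectory() as checkout:
    for relative in ("nvidia/gp100/gr/sw_nonctx.bin", "nvidia/gk20a/sw_nonctx.bin"):
        path = Path(checkout) / relative
        path.parent.mkdir(parents=True)
        path.touch()
    chips = chips_with_firmware(checkout)
    print(chips)
    print("gv100 present:", "gv100" in chips)
```

Taking the first path component below `nvidia/` handles both layouts seen in the listing, with and without the intermediate `gr/` directory.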

Changed in linux (Ubuntu):
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
Manoj Iyer (manjo) on 2018-10-01
Changed in ubuntu-power-systems:
status: New → Incomplete
Changed in linux (Ubuntu):
status: New → Incomplete
Andrew Cloke (andrew-cloke) wrote :

Marking as incomplete awaiting nvidia firmware for this card landing in linux-firmware upstream.

Changed in linux (Ubuntu):
status: Incomplete → In Progress
assignee: Canonical Kernel Team (canonical-kernel-team) → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Cosmic):
assignee: Joseph Salisbury (jsalisbury) → nobody
status: In Progress → Incomplete
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)

------- Comment From <email address hidden> 2018-10-29 10:34 EDT-------
Is there anything IBM can help with on making the nvidia firmware for this card available in linux-firmware upstream?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-11-12 16:03 EDT-------
It looks like this is upstream now (added right after the check for it):
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/nvidia/gv100/

Kalpana, do you want to just add the gv100 code to your /lib/firmware/nvidia to see if that resolves it?
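
A rough sketch of that manual step, assuming a local clone of the linux-firmware tree and Python 3.8 or later (for `dirs_exist_ok`). The function name `install_chip_firmware` and the demo paths are hypothetical; on a real system the target root would be `/lib/firmware` and the copy would need root privileges.

```python
import shutil
import tempfile
from pathlib import Path


def install_chip_firmware(checkout, chip, firmware_root):
    """Copy <checkout>/nvidia/<chip> into <firmware_root>/nvidia/<chip>; return the copied files."""
    source = Path(checkout) / "nvidia" / chip
    target = Path(firmware_root) / "nvidia" / chip
    if not source.is_dir():
        raise FileNotFoundError(f"{source} not found in checkout")
    # Symlinks in the source are followed, so the target holds regular files.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return sorted(p.relative_to(target).as_posix() for p in target.rglob("*") if p.is_file())


# Demo with temporary directories standing in for the clone and for /lib/firmware.
with tempfile.TemporaryDirectory() as tmp:
    gr_dir = Path(tmp) / "linux-firmware" / "nvidia" / "gv100" / "gr"
    gr_dir.mkdir(parents=True)
    (gr_dir / "sw_nonctx.bin").write_bytes(b"placeholder")
    copied = install_chip_firmware(Path(tmp) / "linux-firmware", "gv100", Path(tmp) / "firmware")
    print(copied)
```

Because nouveau loads the firmware when it probes the device, the module has to be reloaded or the system rebooted before the change shows up in dmesg.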

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-11-12 20:21 EDT-------
(In reply to comment #20)
> It looks like this is upstream now (added right after the check for it):
> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/
> tree/nvidia/gv100/
>
> Kalpana, do you want to just add the gv100 code to your /lib/firmware/nvidia
> to see if that resolves it?

sure, let me try this.

tags: added: kernel-da-key
Manoj Iyer (manjo) wrote :

After you have verified that adding the firmware fixes this for you, please add a note here so that we can start the SRU process of adding that firmware to Ubuntu.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-12-06 02:22 EDT-------
I recreated the problem, with the errors visible in dmesg (and on the console), and then added the firmware to /lib/firmware/nvidia/gv100. After that:
mranweil@ltc-wspoon5:~$ dmesg|grep -i nouv
[ 6.632529] nouveau 0004:04:00.0: enabling device (0140 -> 0142)
[ 6.632613] nouveau 0004:04:00.0: Using 32-bit DMA via iommu
[ 6.632721] nouveau 0004:04:00.0: NVIDIA GV100 (140000a1)
<snip>
[ 7.061963] nouveau 0035:03:00.0: DRM: Pointer to TMDS table invalid
[ 7.061966] nouveau 0035:03:00.0: DRM: DCB version 4.1
[ 7.063141] nouveau 0035:03:00.0: DRM: MM: using COPY for buffer copies
[ 7.063154] [drm] Initialized nouveau 1.3.1 20120801 for 0035:03:00.0 on minor 2
mranweil@ltc-wspoon5:~$

So it looks like the firmware from the current git tree addresses the error messages. I didn't do anything further with the driver.
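
The verification above can be sketched as a small check over dmesg text: no nouveau line should contain "failed", and each GPU should show an "Initialized nouveau" line. The function name `check_nouveau` and both regular expressions are illustrative assumptions based on the log lines quoted in this report.

```python
import re

PCI = r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]"
# e.g. "nouveau 0004:04:00.0: gr: failed to load gr/sw_nonctx"
FAILED = re.compile(rf"nouveau ({PCI}): .*failed")
# e.g. "[drm] Initialized nouveau 1.3.1 20120801 for 0035:03:00.0 on minor 2"
INITIALIZED = re.compile(rf"\[drm\] Initialized nouveau \S+ \d+ for ({PCI}) on minor \d+")


def check_nouveau(dmesg_text):
    """Return the PCI addresses with failures and those that initialized."""
    return {
        "failed": sorted(set(FAILED.findall(dmesg_text))),
        "initialized": sorted(set(INITIALIZED.findall(dmesg_text))),
    }


# Lines copied from the dmesg output quoted above.
after_fix = """\
[ 6.632721] nouveau 0004:04:00.0: NVIDIA GV100 (140000a1)
[ 7.063141] nouveau 0035:03:00.0: DRM: MM: using COPY for buffer copies
[ 7.063154] [drm] Initialized nouveau 1.3.1 20120801 for 0035:03:00.0 on minor 2
"""

print(check_nouveau(after_fix))
```

This only confirms that the error messages are gone; like the comment above, it says nothing about whether the driver is otherwise functional.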

Changed in ubuntu-power-systems:
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu Cosmic):
status: Incomplete → Confirmed