Comment 0 for bug 2007746

Mustafa Kemal Gilor (mustafakemalgilor) wrote :

[ Impact ]

 * Microsoft Azure NV-series instances with NVIDIA GRID drivers started to experience X server crashes when set up according to Microsoft's official guide to installing NVIDIA drivers [1].

 * Root cause analysis showed that the crash occurs when a device is configured with BusID "PCI:0@<domain_id>:0:0", where the domain ID is >= 32767, while the hyperv_drm kernel module is loaded.

 * Either removing the BusID specification or unloading the hyperv_drm kernel module seems to fix the crash.

 * The crash happens while the X server is enumerating PCI devices: it dereferences a NULL pointer while trying to access the PCI device info.

 * The reason why it only happens while the hyperv_drm kernel module is loaded is that the hyperv_drm module does not expose PCI hardware information since it's a virtual device.

 * The upstream patch [2] addresses the issue and it's confirmed that the xserver with the patch does not experience the crash.

 * Ubuntu Focal `xorg-server` package does not include the patch [2] at the moment (xserver-xorg-core=2:1.20.13-1ubuntu1~20.04.6).

 [1]: https://learn.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup#install-grid-drivers-on-nv-or-nvv3-series-vms
 [2]: https://github.com/freedesktop/xorg-xserver/commit/0d93bbfa2cfacbb73741f8bed0e32fa1a656b928
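
The two conditions described above (a BusID with a large domain ID, plus a loaded hyperv_drm module) can be checked with a small script. This is a minimal sketch, not part of the upstream fix: all function names are hypothetical, the BusID layout `PCI:bus@domain:device:function` is assumed from the examples in this report, and the 32767 threshold is taken from the root cause analysis above.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; threshold comes from this report.

# Print the PCI domain of a BusID such as "PCI:0@32828:0:0".
# Returns 1 if the BusID has no "@domain" part.
busid_domain() {
    case "$1" in
        PCI:*@*:*) ;;
        *) return 1 ;;
    esac
    rest=${1#*@}                  # drop everything up to and including "@"
    printf '%s\n' "${rest%%:*}"   # keep the text before the next ":"
}

# Print every uncommented BusID value in an xorg.conf-style file.
# Assumes the keyword is spelled exactly "BusID".
active_busids() {
    sed -n 's/^[[:space:]]*BusID[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

# Succeed if the named module appears in a /proc/modules-style listing.
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}"
}

# Succeed if hyperv_drm is loaded and some active BusID has a domain >= 32767.
crash_conditions_met() {
    conf=$1
    modules=${2:-/proc/modules}
    module_loaded hyperv_drm "$modules" || return 1
    for id in $(active_busids "$conf"); do
        domain=$(busid_domain "$id") || continue
        [ "$domain" -ge 32767 ] && return 0
    done
    return 1
}
```

For example, `crash_conditions_met /etc/X11/xorg.conf && echo "this setup matches the reported conditions"` would flag an affected configuration. A match only means the preconditions are present; whether the server actually crashes still depends on the installed xserver build.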

[ Test Plan ]

Part (a) is quoted from Microsoft's official guide [1].

Part (a):

 * Spawn a Microsoft Azure NV-series instance with an NVIDIA GRID-supported GPU
   - e.g. `NV36adms A10`
 * Install updates, required tooling, and the desktop environment:
   - sudo apt-get update
   - sudo apt-get upgrade -y
   - sudo apt-get dist-upgrade -y
   - sudo apt-get install build-essential ubuntu-desktop -y
   - sudo apt-get install linux-azure -y
 * Disable nouveau kernel driver:
   # Create a blacklist file /etc/modprobe.d/nouveau.conf with the following contents:
   blacklist nouveau
   blacklist lbm-nouveau
 * Reboot the VM, re-connect, and then stop X server:
   - sudo reboot
   # wait for the reboot, reconnect, and continue:
   - sudo systemctl stop lightdm.service
 * Download and install the NVIDIA GRID driver:
   - wget -O NVIDIA-Linux-x86_64-grid.run https://go.microsoft.com/fwlink/?linkid=874272
   - chmod +x NVIDIA-Linux-x86_64-grid.run
   - sudo ./NVIDIA-Linux-x86_64-grid.run
   - # When the setup asks whether you want to run the nvidia-xconfig utility to update your X configuration file, select Yes.
 * Copy /etc/nvidia/gridd.conf.template to /etc/nvidia/gridd.conf
   - sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
 * Edit /etc/nvidia/gridd.conf
   - sudo nano /etc/nvidia/gridd.conf
   # Append the following lines:
   IgnoreSP=FALSE
   EnableUI=FALSE
   # Remove this line if present:
   FeatureType=0
   # And save.
 * Reboot the VM
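
The two manual file edits in Part (a), the nouveau blacklist and the gridd.conf changes, could also be scripted so the setup is repeatable. The following is a rough sketch with hypothetical function names. It goes slightly beyond the guide in one respect: it also drops earlier `IgnoreSP=` and `EnableUI=` lines so that running it twice does not duplicate the settings.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Write the nouveau blacklist to the given path
# (default: /etc/modprobe.d/nouveau.conf, which requires root).
write_nouveau_blacklist() {
    printf 'blacklist nouveau\nblacklist lbm-nouveau\n' \
        > "${1:-/etc/modprobe.d/nouveau.conf}"
}

# Apply the gridd.conf edits from Part (a) to the given file:
# remove "FeatureType=0" and set IgnoreSP=FALSE and EnableUI=FALSE.
update_gridd_conf() {
    gridd_conf=$1
    gridd_tmp=$(mktemp) || return 1
    # Filter out the line to remove and any earlier copies of the two settings.
    grep -v -e '^FeatureType=0[[:space:]]*$' \
            -e '^IgnoreSP=' \
            -e '^EnableUI=' "$gridd_conf" > "$gridd_tmp"
    # Append the settings required by the guide.
    printf 'IgnoreSP=FALSE\nEnableUI=FALSE\n' >> "$gridd_tmp"
    # Overwrite in place so the original file's ownership and mode are kept.
    cat "$gridd_tmp" > "$gridd_conf" && rm -f "$gridd_tmp"
}
```

Both helpers need root privileges when pointed at the real system paths, for example `update_gridd_conf /etc/nvidia/gridd.conf` run from a root shell after the template has been copied.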

Part (b):

 * Ensure that the hyperv_drm kernel module is loaded:
   - sudo modprobe hyperv_drm
 * Replace /etc/X11/xorg.conf with the attached xorg.conf file
 * Try to start the `xserver`:
   - sudo startx
 * `xserver` should crash with output similar to the following:
  X.Org X Server 1.20.13
  X Protocol Version 11, Revision 0
  Build Operating System: linux Ubuntu
  Current Operating System: Linux a10test 5.15.0-1031-azure #38~20.04.1-Ubuntu SMP Mon Jan 9 18:23:48 UTC 2023 x86_64
  Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-1031-azure root=PARTUUID=4cac852b-afba-447b-b3e7-c002155c1305 ro console=tty1 console=ttyS0 earlyprintk=ttyS0 panic=-1
  Build Date: 07 February 2023 12:48:13PM
  xorg-server 2:1.20.13-1ubuntu1~20.04.6 (For technical support please see http://www.ubuntu.com/support)
  Current version of pixman: 0.38.4
    Before reporting problems, check http://wiki.x.org
    to make sure that you have the latest version.
  Markers: (--) probed, (**) from config file, (==) default setting,
    (++) from command line, (!!) notice, (II) informational,
    (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
  (==) Log file: "/var/log/Xorg.1.log", Time: Sat Feb 18 10:54:26 2023
  (==) Using config file: "/etc/X11/xorg.conf"
  (==) Using system config directory "/usr/share/X11/xorg.conf.d"
  (EE)
  (EE) Backtrace:
  (EE) 0: /usr/lib/xorg/Xorg (OsLookupColor+0x13c) [0x55e7787c5ecc]
  (EE) 1: /lib/x86_64-linux-gnu/libpthread.so.0 (funlockfile+0x60) [0x7f9576cac420]
  (EE) 2: /usr/lib/xorg/Xorg (xf86PlatformDeviceCheckBusID+0xa7) [0x55e7786c4db7]
  (EE) 3: /usr/lib/xorg/Xorg (xf86PlatformMatchDriver+0x700) [0x55e7786bf1b0]
  (EE) 4: /usr/lib/xorg/Xorg (xf86CallDriverProbe+0x5c) [0x55e7786971dc]
  (EE) 5: /usr/lib/xorg/Xorg (xf86BusConfig+0x43) [0x55e778697b23]
  (EE) 6: /usr/lib/xorg/Xorg (InitOutput+0x90b) [0x55e7786a59eb]
  (EE) 7: /usr/lib/xorg/Xorg (InitFonts+0x1d4) [0x55e778667ea4]
  (EE) 8: /lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main+0xf3) [0x7f9576ac8083]
  (EE) 9: /usr/lib/xorg/Xorg (_start+0x2e) [0x55e778651ace]
  (EE)
  (EE) Segmentation fault at address 0x124
  (EE)
  Fatal server error:
  (EE) Caught signal 11 (Segmentation fault). Server aborting
  (EE)
  (EE)
  Please consult the The X.Org Foundation support
     at http://wiki.x.org
   for help.
  (EE) Please also check the log file at "/var/log/Xorg.1.log" for additional information.
  (EE)
  (EE) Server terminated with error (1). Closing log file.
  ^Cxinit: giving up
  xinit: unable to connect to X server: Connection refused
  xinit: unexpected signal 2
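
When repeating the test before and after installing a patched package, it may help to check the log for this specific crash signature rather than reading the backtrace by eye. A minimal sketch, assuming the log keeps the same wording as the output above; the function name is hypothetical:

```shell
#!/bin/sh
# Sketch only: the helper name is hypothetical.

# Succeed if the given Xorg log contains a backtrace frame in
# xf86PlatformDeviceCheckBusID together with the segfault message.
log_shows_busid_crash() {
    grep -q '(EE) [0-9][0-9]*: .*(xf86PlatformDeviceCheckBusID' "$1" &&
        grep -q 'Caught signal 11 (Segmentation fault)' "$1"
}
```

With the unpatched package, `log_shows_busid_crash /var/log/Xorg.1.log` is expected to succeed after the failed `startx`; with the fix applied, it should no longer match. The log file name depends on the display number, so the path may differ.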

[ Where problems could occur ]

 * The regression risk is low: the patch is well isolated and essentially only adds a NULL check that the code was already assumed to perform.

[ Other Info ]

 * Workaround #1: Unload the hyperv_drm kernel module:
   - sudo modprobe -r hyperv_drm
 * Workaround #2: Comment out the BusID line in the "Device" section of /etc/X11/xorg.conf:
   Section "Device"
      Identifier "Device0"
      Driver "nvidia"
      VendorName "NVIDIA Corporation"
      # BusID "PCI:0@32828:0:0"
      Option "HardDPMS" "false"
      Option "CustomEDID" "DFP-0:/etc/X11/vdisplay.edid"
   EndSection
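
Workaround #2 can also be applied non-interactively. The sketch below is one possible way to do it, with a hypothetical function name. It comments out every active BusID line in the file, not only the one in a particular "Device" section, and keeps a backup copy that is overwritten on each run.

```shell
#!/bin/sh
# Sketch only: the helper name is hypothetical.

# Comment out every active BusID line in an xorg.conf-style file,
# saving the previous contents as "<file>.bak".
comment_out_busid() {
    cp -p "$1" "$1.bak" || return 1
    # Lines that are already commented do not match and stay unchanged.
    sed 's/^\([[:space:]]*\)\(BusID[[:space:]]\)/\1# \2/' "$1.bak" > "$1"
}
```

Running `comment_out_busid /etc/X11/xorg.conf` as root should produce a "Device" section like the one shown above, and the original file can be restored from `/etc/X11/xorg.conf.bak`.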