Driver installation takes about 20 minutes on a 64-core system

Bug #1688431 reported by Alex Tu on 2017-05-05
This bug affects 1 person
Affects                               Status        Importance  Assigned to     Milestone
HWE Next                              Fix Released  Undecided   Unassigned
nvidia-graphics-drivers-375 (Ubuntu)  Fix Released  High        Alberto Milone

Bug Description

Ubuntu version: 16.04
Kernel: 4.4.0-67-generic

Issue:
The current makefile setting is "make -j$(nproc)".

On a 64-core system, the nvidia driver[1] installation gets stuck at "Building initial module for 4.4.0-67-generic" for about 20 minutes.

Workaround:
Repack the driver with the setting changed to "make -j16". The "Building initial module" step then takes only about 3 minutes.
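
As an alternative to hardcoding "-j16", the job count could be capped so that smaller machines still use all of their cores. The following is a minimal sketch only, not the packaged fix: the function name capped_jobs and its optional second argument (an explicit core count, useful for testing) are hypothetical, and the default limit of 16 is taken from the workaround above.

```shell
#!/bin/sh
# Hypothetical helper: print the number of parallel make jobs to use,
# which is the smaller of the core count and a maximum.
#   $1 = maximum number of jobs (defaults to 16, the value from the workaround)
#   $2 = core count (defaults to the output of nproc; falls back to 1)
capped_jobs() {
    max_jobs="${1:-16}"
    cores="${2:-$(nproc 2>/dev/null || echo 1)}"
    if [ "$cores" -gt "$max_jobs" ]; then
        echo "$max_jobs"
    else
        echo "$cores"
    fi
}

# Show the make invocation that would result on this machine.
echo "make -j$(capped_jobs 16)"
```

On the 64-core system from this report the helper would yield "make -j16", while a 4-core machine would keep "make -j4".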

Investigation:
According to iotop, 35 processes were using more than 90% CPU and 23 processes more than 50%. This may be evidence that the heavy I/O load generated by -j$(nproc) causes the whole system to hang during the nvidia driver installation.

 htop:
 http://paste.ubuntu.com/24514786/
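
A count like the one above can also be taken from a script instead of being read off an interactive tool. This is a rough sketch under stated assumptions: the helper name count_above is hypothetical, and it simply counts the numeric values on standard input that exceed a threshold.

```shell
#!/bin/sh
# Hypothetical helper: read one numeric value per line from stdin and
# print how many of them are strictly greater than the threshold in $1.
count_above() {
    awk -v limit="$1" '$1 + 0 > limit + 0 { n++ } END { print n + 0 }'
}
```

For example, "ps -eo pcpu= | count_above 90" would print the number of processes above 90% CPU. Note that ps reports CPU time averaged over each process's lifetime, so the result will not match the instantaneous figures shown by htop exactly.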

[1] https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa/+packages

Alex Tu (alextu) wrote :

Added verbose debug messages to check what the build was doing after printing "Building initial module for 4.4.0-67-generic".

The attached tarball includes the messages for -j$(nproc) and -j16:
├── make-nvidia-j16-verbose-complete.log : the complete output of the build with -j16
├── make-nvidia-jnproc-verbose-complete.log : the complete output of the build with -j$(nproc)
├── make-nvidia-jnproc-verbose.log : the output copied when "Building initial module for 4.4.0-67-generic" appeared
├── make-nvidia-jnproc-verbose-2.log : the output copied after "Building initial module for 4.4.0-67-generic" appeared and the build had been stuck for a while

Changed in nvidia-graphics-drivers-375 (Ubuntu):
status: New → In Progress
importance: Undecided → High
assignee: nobody → Alberto Milone (albertomilone)
tags: added: originate-from-1675061 somerville
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nvidia-graphics-drivers-375 - 375.66-0ubuntu1

---------------
nvidia-graphics-drivers-375 (375.66-0ubuntu1) artful; urgency=medium

  * New upstream release:
    - Added support for the following GPUs:
      o GeForce GTX 1080 Ti
      o Quadro P3000
      o Quadro M520
      o TITAN Xp
    - Fixed a bug that could cause EGL applications to crash when
      calling eglInitialize() multiple times on X11-backed displays.
    - Fixed a regression that could cause rendering corruption on a
      monitor connected via DisplayPort upon a modeset event (for
      example, changing resolutions or power cycling the monitor).
    - Fixed a bug that could cause OpenGL applications to crash when
      VT switching between multiple X servers.
    - Fixed a bug that caused the system to become unresponsive after
      resuming from power management suspend/hibernate.  Additional
      symptoms of this bug included display flickering and "Xid 56"
      errors in the kernel log.
    - Fixed a bug that caused backlight brightness to not be
      controllable on some notebooks with DisplayPort internal
      panels.
    - Fixed a bug that left HDMI and DisplayPort audio muted after a
      framebuffer console mode was restored. For some displays, this
      caused the display to remain blank.
    - Fixed a bug that caused audio over DisplayPort to stop working
      when the monitor was unplugged and plugged back in or awoken
      from DPMS power-saving mode.
    - Restored support for the following GPU:
      GRID K520
    - Fixed a regression that caused corruption in certain
      applications, such as window border shadows in Unity, after
      resuming from suspend.
    - Fixed a bug that could cause some applications to crash when
      running with PRIME Sync.
    - Fixed a bug that prevented PRIME Sync from working on notebooks
      with GeForce GTX 4xx and 5xx series GPUs.
    - Fixed a bug that caused OpenGL apps to have excessive CPU usage
      when running with PRIME Sync but without native displays
      enabled.
    - Fixed a bug that could cause PRIME Sync to deadlock in the
      kernel, particularly common on Linux 4.10.
    - Fixed a bug that caused PRIME Sync to run slowly on systems
      with Pascal GPUs.

  [ Alberto Milone ]
  * debian/templates/dkms_nvidia.conf.in:
    - Drop buildfix_kernel_4.10.patch.
    - Limit the amount of cores to a maximum of 16 (LP: #1688431).

  [ Jeremy Bicha ]
  * Depend on xserver-xorg-legacy (LP: #1559576).

 -- Alberto Milone <email address hidden> Fri, 05 May 2017 15:13:39 +0200

Changed in nvidia-graphics-drivers-375 (Ubuntu):
status: In Progress → Fix Released
Changed in hwe-next:
status: New → Fix Released