rcu_sched detected stalls with kernel 3.19.0-58, NVIDIA driver, and docker
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Invalid | Undecided | Unassigned |
Bug Description
---Problem Description---
Seeing occasional rcu_sched detected stalls on 14.04 LTS with kernel 3.19.0-58. The system is running docker containers, and has the NVIDIA GPU driver loaded. We've seen about 4 stalls in the last month, all with the 3.19.0-58 kernel, and with the NVIDIA 352.93 and 361.49 drivers.
---uname output---
Linux dldev1 3.19.0-58-generic #64~14.04.1-Ubuntu SMP Fri Mar 18 19:05:01 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux
---Additional Hardware Info---
2 x NVIDIA K80 GPU adapter:
$ lspci | grep NV
0002:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0002:04:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0006:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0006:04:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
Machine Type = 8247-42L
---System Hang---
Usual symptom is that the system is unresponsive except maybe for ping and writing the stall-detection messages to the console. Login/getty isn't available via ssh or on the console. The system must be power cycled to recover.
Attached is the kernel log from a stall detection on May 18th. The detection first occurs at: May 18 15:17:55.
The system is later rebooted and those messages indicate the kernel (3.19.0-58) and NVIDIA driver version (352.93) that were active at the time.
We've suffered 3 or 4 stalls since, all with the same kernel, but some with a newer NVIDIA driver (361.49).
Unfortunately, information about the newer stalls wasn't preserved in the various log files (and we're not capturing the console constantly), so we don't have detailed data for those.
We'd welcome any suggestions for how to collect additional data for these occurrences.
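One way to preserve the stall traces across the forced power cycle is netconsole, which streams kernel messages over UDP to a second machine. A sketch follows; all IP addresses, ports, the interface name, and the MAC address are placeholders to be replaced with real values for this network, and it requires root plus a reachable receiving host:

```shell
# Stream kernel messages (including RCU stall traces) to another host over
# UDP so they survive the power cycle. Everything below is illustrative:
#   format: netconsole=srcport@srcip/dev,dstport@dstip/dstmac
sudo modprobe netconsole \
    netconsole=6665@10.0.0.2/eth0,6666@10.0.0.1/00:11:22:33:44:55

# Ensure all kernel message levels reach the console:
sudo dmesg -n 8

# On the receiving host (10.0.0.1 in this sketch), capture with e.g.:
#   nc -l -u 6666 | tee dldev1-console.log
```

This is a host-configuration fragment rather than a script; it needs to be re-applied after each reboot (or made persistent via /etc/modules and module options) to catch the next occurrence.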
I can't say for sure that we haven't seen the stalls on other systems, but they're occurring fairly frequently on this system, and it's unusual in that it's running both Docker and the NVIDIA GPU driver. So maybe aufs or the NVIDIA driver is somehow involved.
From the kern.log, the call trace points to some kind of deadlock in aufs:
May 18 15:17:55 dldev1 kernel: [713670.798624] Task dump for CPU 3:
May 18 15:17:55 dldev1 kernel: [713670.798628] cc1 R running task 0 99183 99173 0x00040004
May 18 15:17:55 dldev1 kernel: [713670.798633] Call Trace:
May 18 15:17:55 dldev1 kernel: [713670.798643] [c000000fa64673a0] [c0000000000cf004] wake_up_
May 18 15:17:55 dldev1 kernel: [713670.798671] [c000000fa6467570] [c000000fa64675d0] 0xc000000fa64675d0
May 18 15:17:55 dldev1 kernel: [713670.798676] [c000000fa64675d0] [c000000000a1b050] __schedule+
May 18 15:17:55 dldev1 kernel: [713670.798679] [c000000fa64677f0] [c000000fa6467850] 0xc000000fa6467850
May 18 15:17:55 dldev1 kernel: [713670.798682] Task dump for CPU 75:
May 18 15:17:55 dldev1 kernel: [713670.798684] cc1 D 00001000005d9410 0 99427 99405 0x00040004
May 18 15:17:55 dldev1 kernel: [713670.798688] Call Trace:
May 18 15:17:55 dldev1 kernel: [713670.798691] [c0000017efdd3460] [c0000017efdd34a0] 0xc0000017efdd34a0 (unreliable)
May 18 15:17:55 dldev1 kernel: [713670.798695] [c0000017efdd3630] [c0000017efdd3690] 0xc0000017efdd3690
May 18 15:17:55 dldev1 kernel: [713670.798698] [c0000017efdd3690] [c000000000a1b050] __schedule+
May 18 15:17:55 dldev1 kernel: [713670.798702] [c0000017efdd38b0] [c000000000a1f128] rwsem_down_
May 18 15:17:55 dldev1 kernel: [713670.798706] [c0000017efdd3940] [c000000000a1e538] down_write+
May 18 15:17:55 dldev1 kernel: [713670.798716] [c0000017efdd3970] [d00000001ead562c] do_ii_write_
May 18 15:17:55 dldev1 kernel: [713670.798724] [c0000017efdd39a0] [d00000001eac0e98] aufs_read_
May 18 15:17:55 dldev1 kernel: [713670.798733] [c0000017efdd39e0] [d00000001ead8208] aufs_d_
May 18 15:17:55 dldev1 kernel: [713670.798737] [c0000017efdd3aa0] [c0000000002c88f8] lookup_
May 18 15:17:55 dldev1 kernel: [713670.798740] [c0000017efdd3b10] [c0000000002cb620] path_lookupat+
May 18 15:17:55 dldev1 kernel: [713670.798743] [c0000017efdd3be0] [c0000000002cbe68] filename_
May 18 15:17:55 dldev1 kernel: [713670.798746] [c0000017efdd3c30] [c0000000002cde04] user_path_
May 18 15:17:55 dldev1 kernel: [713670.798749] [c0000017efdd3d20] [c0000000002be744] vfs_fstatat+
May 18 15:17:55 dldev1 kernel: [713670.798753] [c0000017efdd3d80] [c0000000002bee14] SyS_newlstat+
May 18 15:17:55 dldev1 kernel: [713670.798757] [c0000017efdd3e30] [c000000000009258] system_
Other info from the log:
May 18 16:00:00 dldev1 kernel: [ 11.802963] nvidia 0002:03:00.0: enabling device (0140 -> 0142)
May 18 16:00:00 dldev1 kernel: [ 11.803114] nvidia 0002:04:00.0: enabling device (0140 -> 0142)
May 18 16:00:00 dldev1 kernel: [ 11.803221] nvidia 0006:03:00.0: enabling device (0140 -> 0142)
May 18 16:00:00 dldev1 kernel: [ 11.803349] nvidia 0006:04:00.0: enabling device (0140 -> 0142)
May 18 16:00:00 dldev1 kernel: [ 11.803598] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0002:03:00.0 on minor 0
May 18 16:00:00 dldev1 kernel: [ 11.803654] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0002:04:00.0 on minor 1
May 18 16:00:00 dldev1 kernel: [ 11.803700] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0006:03:00.0 on minor 2
May 18 16:00:00 dldev1 kernel: [ 11.803742] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0006:04:00.0 on minor 3
May 18 16:00:00 dldev1 kernel: [ 11.803749] NVRM: loading NVIDIA UNIX ppc64le Kernel Module 352.93 Tue Apr 5 17:31:42 PDT 2016
....
May 18 16:00:01 dldev1 kernel: [ 278.416678] aufs 3.x-rcN-20150105
May 18 16:00:01 dldev1 kernel: [ 278.952335] bridge: automatic filtering via arp/ip/ip6tables has been deprecated. Update your scripts to load br_netfilter if you need this.
May 18 16:00:01 dldev1 kernel: [ 278.953351] Bridge firewalling registered
May 18 16:00:01 dldev1 kernel: [ 278.957164] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
May 18 16:00:01 dldev1 kernel: [ 278.997765] ip_tables: (C) 2000-2006 Netfilter Core Team
May 18 16:00:01 dldev1 kernel: [ 279.013814] nvidia 0002:03:00.0: Using 64-bit DMA iommu bypass
May 18 16:00:03 dldev1 kernel: [ 280.480596] IPv6: ADDRCONF(
May 18 16:00:04 dldev1 kernel: [ 282.035907] nvidia 0002:04:00.0: Using 64-bit DMA iommu bypass
May 18 16:00:07 dldev1 kernel: [ 285.065144] nvidia 0006:03:00.0: Using 64-bit DMA iommu bypass
May 18 16:00:10 dldev1 kernel: [ 288.150339] nvidia 0006:04:00.0: Using 64-bit DMA iommu bypass
...
May 18 16:01:55 dldev1 kernel: [ 393.265785] aufs au_opts_
May 18 16:01:56 dldev1 kernel: [ 393.328552] aufs au_opts_
May 18 16:01:56 dldev1 kernel: [ 393.366023] device veth0beb24c entered promiscuous mode
May 18 16:01:56 dldev1 kernel: [ 393.367228] IPv6: ADDRCONF(
May 18 16:01:56 dldev1 kernel: [ 393.367235] docker0: port 1(veth0beb24c) entered forwarding state
May 18 16:01:56 dldev1 kernel: [ 393.367314] docker0: port 1(veth0beb24c) entered forwarding state
May 18 16:01:56 dldev1 kernel: [ 393.367825] docker0: port 1(veth0beb24c) entered disabled state
May 18 16:01:56 dldev1 kernel: [ 393.526038] docker0: port 1(veth0beb24c) entered disabled state
May 18 16:01:56 dldev1 kernel: [ 393.530143] device veth0beb24c left promiscuous mode
May 18 16:01:56 dldev1 kernel: [ 393.530148] docker0: port 1(veth0beb24c) entered disabled state
> Hi Breno,
>
> As per Ubuntu support, all new fixes will target 16.04 and Canonical will
> not automatically fix 14.04. If one needs a fix in 14.04, it has to go
> through as a special request to Canonical. So should we check if this
> problem happens on Ubuntu 16.04?
That is correct. Ubuntu 14.04 fixes go through the SRU process; i.e., they need to be fixed in 16.10 first and then backported to 16.04 and 14.04.
Kernel 4.4 for Ubuntu 16.04.5 is already in proposed. Does it also contain the bug? I understand that kernel 3.19 is already EOL.
Here is the requested docker info; we're using the docker packages provided by Canonical:
$ dpkg -l | egrep 'docker|aufs'
ii aufs-tools 1:3.2+20130722-1.1 ppc64el Tools to manage aufs filesystems
ii docker.io 1.9.1~git201512
$ docker version
Client:
Version: 1.9.1
API version: 1.21
Go version: go1.4.2 gccgo (GCC) 5.2.1 20151016 (Advance-
Git commit: 18bfacb
Built: Thu Dec 17 19:19:02 UTC 2015
OS/Arch: linux/ppc64le
Server:
Version: 1.9.1
API version: 1.21
Go version: go1.4.2 gccgo (GCC) 5.2.1 20151016 (Advance-
Git commit: 18bfacb
Built: Thu Dec 17 19:19:02 UTC 2015
OS/Arch: linux/ppc64le
$ docker info
Containers: 8
Images: 27
Server Version: 1.9.1
Storage Driver: aufs
Root Dir: /var/lib/
Backing Filesystem: extfs
Dirs: 43
Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.19.0-58-generic
Operating System: Ubuntu 14.04.4 LTS
CPUs: 192
Total Memory: 127.5 GiB
Name: dldev1
ID: NKEF:PLNM:
WARNING: No swap limit support
I had said: "we're using the docker packages provided by Canonical".
That's not true. The 'docker.io' package we're using is from:
deb http://
The aufs.ko does come from Canonical:
$ dpkg -S aufs.ko
linux-image-
Another occurrence this afternoon, again while building a software project in a Docker container.
Yesterday was building 'Caffe', which is mostly C or C++. Today was building 'TensorFlow', which includes a lot of Java.
Still seeing this hang on Ubuntu 16.04 with 4.4.0-38 kernel.
Similar scenario: building TensorFlow in a container often causes the problem. I assume this is because the build creates lots of filesystem artifacts.
Seems that running 'sync' every few seconds during the build makes the problem less likely to strike.
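The sync workaround described above can be scripted so it runs alongside a build. A minimal sketch; the function name, interval, and iteration bound are illustrative, not from the report:

```shell
# Hypothetical workaround sketch: flush dirty pages every few seconds while
# a container build runs, to shrink the window in which the suspected aufs
# writeback deadlock can trigger.
periodic_sync() {
    count=${1:-0}   # number of iterations; 0 means run until killed
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        sync                           # flush dirty pages to disk
        sleep "${SYNC_INTERVAL:-5}"    # default: every 5 seconds
        i=$((i + 1))
    done
}
```

During a build one might run `periodic_sync &` in the background and `kill` it once the build completes.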
Will attach kernel log from an instance on 16.04 / 4.4.0-38 from Oct 7th.
Changed in linux (Ubuntu):
status: Incomplete → Invalid
assignee: Taco Screen team (taco-screen-team) → nobody
tags: added: cscc