rcu_sched detected stalls with kernel 3.19.0-58, NVIDIA driver, and docker

Bug #1633223 reported by bugproxy
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Invalid
Undecided
Unassigned

Bug Description

---Problem Description---
Seeing occasional rcu_sched detected stalls on 14.04 LTS with kernel 3.19.0-58. The system is running docker containers, and has the NVIDIA GPU driver loaded. We've seen about 4 stalls in the last month, all with the 3.19.0-58 kernel, and with the NVIDIA 352.93 and 361.49 drivers.

---uname output---
Linux dldev1 3.19.0-58-generic #64~14.04.1-Ubuntu SMP Fri Mar 18 19:05:01 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

---Additional Hardware Info---
2 x NVIDIA K80 GPU adapter:
$ lspci | grep NV
0002:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0002:04:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0006:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0006:04:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

Machine Type = 8247-42L

---System Hang---
 Usual symptom is that the system is unresponsive except maybe for ping and writing the stall-detection messages to the console. Login/getty isn't available either via ssh nor on the console. System must be power cycled to recover.

Attached is the kernel log from a stall detection on May 18th. The detection first occurs at: May 18 15:17:55.

The system is later rebooted and those messages indicate the kernel (3.19.0-58) and NVIDIA driver version (352.93) that were active at the time.

We've suffered 3 or 4 stalls since, all with the same kernel, but some with a newer NVIDIA driver (361.49).

Unfortunately, information about the newer stalls wasn't preserved in the various log files (and we're not capturing the console constantly), so we don't have detailed data for those.

We'd welcome any suggestions for how to collect additional data for these occurrences.

I can't say for sure that we haven't seen the stalls on other systems, but they're occuring fairly frequently on this system, and it's unusual in that it's running both Docker and NVIDIA GPU driver. So maybe aufs or the NVIDIA driver are somehow involved.

From the kern.log,

The Call trace points to some kind of deadlock in aufs -

May 18 15:17:55 dldev1 kernel: [713670.798624] Task dump for CPU 3:
May 18 15:17:55 dldev1 kernel: [713670.798628] cc1 R running task 0 99183 99173 0x00040004
May 18 15:17:55 dldev1 kernel: [713670.798633] Call Trace:
May 18 15:17:55 dldev1 kernel: [713670.798643] [c000000fa64673a0] [c0000000000cf004] wake_up_worker+0x44/0x60 (unreliable)
May 18 15:17:55 dldev1 kernel: [713670.798671] [c000000fa6467570] [c000000fa64675d0] 0xc000000fa64675d0
May 18 15:17:55 dldev1 kernel: [713670.798676] [c000000fa64675d0] [c000000000a1b050] __schedule+0x370/0x900
May 18 15:17:55 dldev1 kernel: [713670.798679] [c000000fa64677f0] [c000000fa6467850] 0xc000000fa6467850
May 18 15:17:55 dldev1 kernel: [713670.798682] Task dump for CPU 75:
May 18 15:17:55 dldev1 kernel: [713670.798684] cc1 D 00001000005d9410 0 99427 99405 0x00040004
May 18 15:17:55 dldev1 kernel: [713670.798688] Call Trace:
May 18 15:17:55 dldev1 kernel: [713670.798691] [c0000017efdd3460] [c0000017efdd34a0] 0xc0000017efdd34a0 (unreliable)
May 18 15:17:55 dldev1 kernel: [713670.798695] [c0000017efdd3630] [c0000017efdd3690] 0xc0000017efdd3690
May 18 15:17:55 dldev1 kernel: [713670.798698] [c0000017efdd3690] [c000000000a1b050] __schedule+0x370/0x900
May 18 15:17:55 dldev1 kernel: [713670.798702] [c0000017efdd38b0] [c000000000a1f128] rwsem_down_write_failed+0x288/0x400
May 18 15:17:55 dldev1 kernel: [713670.798706] [c0000017efdd3940] [c000000000a1e538] down_write+0x88/0x90
May 18 15:17:55 dldev1 kernel: [713670.798716] [c0000017efdd3970] [d00000001ead562c] do_ii_write_lock+0x8c/0xd0 [aufs]
May 18 15:17:55 dldev1 kernel: [713670.798724] [c0000017efdd39a0] [d00000001eac0e98] aufs_read_lock+0xb8/0xd0 [aufs]
May 18 15:17:55 dldev1 kernel: [713670.798733] [c0000017efdd39e0] [d00000001ead8208] aufs_d_revalidate+0x98/0x7a0 [aufs]
May 18 15:17:55 dldev1 kernel: [713670.798737] [c0000017efdd3aa0] [c0000000002c88f8] lookup_fast+0x368/0x3b0
May 18 15:17:55 dldev1 kernel: [713670.798740] [c0000017efdd3b10] [c0000000002cb620] path_lookupat+0x180/0x970
May 18 15:17:55 dldev1 kernel: [713670.798743] [c0000017efdd3be0] [c0000000002cbe68] filename_lookup+0x58/0x140
May 18 15:17:55 dldev1 kernel: [713670.798746] [c0000017efdd3c30] [c0000000002cde04] user_path_at_empty+0x84/0xe0
May 18 15:17:55 dldev1 kernel: [713670.798749] [c0000017efdd3d20] [c0000000002be744] vfs_fstatat+0x84/0x140
May 18 15:17:55 dldev1 kernel: [713670.798753] [c0000017efdd3d80] [c0000000002bee14] SyS_newlstat+0x34/0x60
May 18 15:17:55 dldev1 kernel: [713670.798757] [c0000017efdd3e30] [c000000000009258] system_call+0x38/0xd0

Other info from the log -

May 18 16:00:00 dldev1 kernel: [ 11.802963] nvidia 0002:03:00.0: enabling device (0140 -> 0142)
May 18 16:00:00 dldev1 kernel: [ 11.803114] nvidia 0002:04:00.0: enabling device (0140 -> 0142)
May 18 16:00:00 dldev1 kernel: [ 11.803221] nvidia 0006:03:00.0: enabling device (0140 -> 0142)
May 18 16:00:00 dldev1 kernel: [ 11.803349] nvidia 0006:04:00.0: enabling device (0140 -> 0142)
May 18 16:00:00 dldev1 kernel: [ 11.803598] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0002:03:00.0 on minor 0
May 18 16:00:00 dldev1 kernel: [ 11.803654] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0002:04:00.0 on minor 1
May 18 16:00:00 dldev1 kernel: [ 11.803700] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0006:03:00.0 on minor 2
May 18 16:00:00 dldev1 kernel: [ 11.803742] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0006:04:00.0 on minor 3
May 18 16:00:00 dldev1 kernel: [ 11.803749] NVRM: loading NVIDIA UNIX ppc64le Kernel Module 352.93 Tue Apr 5 17:31:42 PDT 2016
....

May 18 16:00:01 dldev1 kernel: [ 278.416678] aufs 3.x-rcN-20150105
May 18 16:00:01 dldev1 kernel: [ 278.952335] bridge: automatic filtering via arp/ip/ip6tables has been deprecated. Update your scripts to load br_netfilter if you need this.
May 18 16:00:01 dldev1 kernel: [ 278.953351] Bridge firewalling registered
May 18 16:00:01 dldev1 kernel: [ 278.957164] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
May 18 16:00:01 dldev1 kernel: [ 278.997765] ip_tables: (C) 2000-2006 Netfilter Core Team
May 18 16:00:01 dldev1 kernel: [ 279.013814] nvidia 0002:03:00.0: Using 64-bit DMA iommu bypass
May 18 16:00:03 dldev1 kernel: [ 280.480596] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
May 18 16:00:04 dldev1 kernel: [ 282.035907] nvidia 0002:04:00.0: Using 64-bit DMA iommu bypass
May 18 16:00:07 dldev1 kernel: [ 285.065144] nvidia 0006:03:00.0: Using 64-bit DMA iommu bypass
May 18 16:00:10 dldev1 kernel: [ 288.150339] nvidia 0006:04:00.0: Using 64-bit DMA iommu bypass
...
May 18 16:01:55 dldev1 kernel: [ 393.265785] aufs au_opts_verify:1612:docker[3444]: dirperm1 breaks the protection by the permission bits on the lower branch
May 18 16:01:56 dldev1 kernel: [ 393.328552] aufs au_opts_verify:1612:docker[3444]: dirperm1 breaks the protection by the permission bits on the lower branch
May 18 16:01:56 dldev1 kernel: [ 393.366023] device veth0beb24c entered promiscuous mode
May 18 16:01:56 dldev1 kernel: [ 393.367228] IPv6: ADDRCONF(NETDEV_UP): veth0beb24c: link is not ready
May 18 16:01:56 dldev1 kernel: [ 393.367235] docker0: port 1(veth0beb24c) entered forwarding state
May 18 16:01:56 dldev1 kernel: [ 393.367314] docker0: port 1(veth0beb24c) entered forwarding state
May 18 16:01:56 dldev1 kernel: [ 393.367825] docker0: port 1(veth0beb24c) entered disabled state
May 18 16:01:56 dldev1 kernel: [ 393.526038] docker0: port 1(veth0beb24c) entered disabled state
May 18 16:01:56 dldev1 kernel: [ 393.530143] device veth0beb24c left promiscuous mode
May 18 16:01:56 dldev1 kernel: [ 393.530148] docker0: port 1(veth0beb24c) entered disabled state

> Hi Breno,
>
> As per Ubuntu support, all new fixes will target 16.04 and Canonical will
> not automatically fix 14.04. If one needs a fix in 14.04, it has to go
> through as a special request to Canonical. So should we check if this
> problem happens on Ubuntu 16.04?

That is correct. Ubuntu 14.04 fixes will be SRU. I.e, they need to be fixed in 16.10 first, and then backport to 16.04 an 14.04.

Kernel 4.4 for Ubuntu 16.04.5 is already in proposed. Does it contain the bug also? I understand that kernel 3.19 is already EOL.

The requested docker info; we're using the docker packages provided by Canonical:

$ dpkg -l | egrep 'docker|aufs'
ii aufs-tools 1:3.2+20130722-1.1 ppc64el Tools to manage aufs filesystems
ii docker.io 1.9.1~git20151210-18bfacb ppc64el Linux container runtime

$ docker version
Client:
 Version: 1.9.1
 API version: 1.21
 Go version: go1.4.2 gccgo (GCC) 5.2.1 20151016 (Advance-Toolchain-) [ibm/gcc-5-branch, revision: 229493 merged from gcc-5-branch, revision 228917]
 Git commit: 18bfacb
 Built: Thu Dec 17 19:19:02 UTC 2015
 OS/Arch: linux/ppc64le

Server:
 Version: 1.9.1
 API version: 1.21
 Go version: go1.4.2 gccgo (GCC) 5.2.1 20151016 (Advance-Toolchain-) [ibm/gcc-5-branch, revision: 229493 merged from gcc-5-branch, revision 228917]
 Git commit: 18bfacb
 Built: Thu Dec 17 19:19:02 UTC 2015
 OS/Arch: linux/ppc64le

$ docker info
Containers: 8
Images: 27
Server Version: 1.9.1
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 43
 Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.19.0-58-generic
Operating System: Ubuntu 14.04.4 LTS
CPUs: 192
Total Memory: 127.5 GiB
Name: dldev1
ID: NKEF:PLNM:HHXN:3DAD:3CLU:Z2GP:AZLR:MZRO:ADYA:5GOV:YZRN:DWOS
WARNING: No swap limit support

I had said: "we're using the docker packages provided by Canonical".

That's not true. The 'docker.io' package we're using is from:

deb http://ftp.unicamp.br/pub/ppc64el/ubuntu/14_04/docker-ppc64el/ trusty main

The aufs.ko does come from Canonical:

$ dpkg -S aufs.ko
linux-image-extra-3.19.0-58-generic: /lib/modules/3.19.0-58-generic/kernel/ubuntu/aufs/aufs.ko

Another occurrence this afternoon, again while building a software project in a Docker container.

Yesterday was building 'Caffe', which is mostly C or C++. Today was building 'TensorFlow' which includes a lot of Java.

Still seeing this hang on Ubuntu 16.04 with 4.4.0-38 kernel.

Similar scenario--building TensorFlow in a container often causes the problem. I assume because the build is creating lots of filesystem artifacts.

Seems that running 'sync' every few seconds during the build makes the problem less likely to strike.

Will attach kernel log from an instance on 16.04 / 4.4.0-38 from Oct 7th.

Revision history for this message
bugproxy (bugproxy) wrote : Kernel log from May 18 stall detection

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-142751 severity-high targetmilestone-inin16042
Revision history for this message
bugproxy (bugproxy) wrote : host 'dldev1' hang from morning of June 20th

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : host 'dldev1' hang from afternoon of June 21st

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : host 'dldev1' hang from afternoon of June 22nd

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : host 'dldev1' hang from morning of June 24th

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : host 'dldev1' hang from afternoon of June 29th

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : host 'dldev1' hang from afternoon of June 29th (3rd hang)

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : host 'dldev1' hang from afternoon of June 29th (4th hang)

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : host 'dlfire5' hang from afternoon of Oct 7th

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)
Revision history for this message
Felix Abecassis (flx42) wrote :

Did you see #1533043? Check your version of aufs (`modinfo aufs`)

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-10-14 08:49 EDT-------
Sorry, I had missed the question.

On the 14.04 system:

$ modinfo aufs
filename: /lib/modules/3.19.0-58-generic/kernel/ubuntu/aufs/aufs.ko
alias: fs-aufs
version: 3.x-rcN-20150105
description: aufs -- Advanced multi layered unification filesystem
author: Junjiro R. Okajima <email address hidden>
license: GPL
srcversion: 8F9DB2EF45822D994AFF58A
depends:
intree: Y
vermagic: 3.19.0-58-generic SMP mod_unload modversions
signer: Magrathea: Glacier signing key
sig_key: 6E:0E:39:1E:F5:B2:57:BE:88:F7:C4:C1:4B:4B:5C:56:74:8E:DF:C5
sig_hashalgo: sha512
parm: brs:use <sysfs>/fs/aufs/si_*/brN (int)
parm: allow_userns:allow unprivileged to mount under userns (bool)

The kernel on the 16.04 system has been updated since the hang, but at the time, aufs was:

$ modinfo /lib/modules/4.4.0-38-generic/kernel/fs/aufs/aufs.ko
filename: /lib/modules/4.4.0-38-generic/kernel/fs/aufs/aufs.ko
alias: fs-aufs
version: 4.x-rcN-20160111
description: aufs -- Advanced multi layered unification filesystem
author: Junjiro R. Okajima <email address hidden>
license: GPL
srcversion: BD1C47D04FE5C2D88F63AAB
depends:
intree: Y
vermagic: 4.4.0-38-generic SMP mod_unload modversions
parm: brs:use <sysfs>/fs/aufs/si_*/brN (int)
parm: allow_userns:allow unprivileged to mount under userns (bool)

Revision history for this message
bugproxy (bugproxy) wrote : Kernel log from May 18 stall detection

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : host 'dldev1' hang from morning of June 20th

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : host 'dldev1' hang from afternoon of June 21st

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : host 'dldev1' hang from afternoon of June 22nd

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : host 'dldev1' hang from morning of June 24th

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : host 'dldev1' hang from afternoon of June 29th

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : host 'dldev1' hang from afternoon of June 29th (3rd hang)

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : host 'dldev1' hang from afternoon of June 29th (4th hang)

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : host 'dlfire5' hang from afternoon of Oct 7th

Default Comment by Bridge

Revision history for this message
Felix Abecassis (flx42) wrote :

On 16.04, the version of aufs looks correct, maybe it's a different bug. Did you also try with Docker 1.12 on 16.04?
Looking at the other bug, there is a Docker container than can help diagnose if you are encountering the aufs issue:
docker run -it --rm akihirosuda/test18180

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-10-14 18:37 EDT-------
Our systems are now upgraded to:

$ uname -a
Linux dlfire5 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 13:45:41 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

$ dpkg -l docker.io
...
ii docker.io 1.12.1-0ubuntu13~16.04.1 ppc64el

$ modinfo aufs
filename: /lib/modules/4.4.0-43-generic/kernel/fs/aufs/aufs.ko
alias: fs-aufs
version: 4.x-rcN-20160111
description: aufs -- Advanced multi layered unification filesystem
author: Junjiro R. Okajima <email address hidden>
license: GPL
srcversion: BD1C47D04FE5C2D88F63AAB
depends:
intree: Y
vermagic: 4.4.0-43-generic SMP mod_unload modversions
parm: brs:use <sysfs>/fs/aufs/si_*/brN (int)
parm: allow_userns:allow unprivileged to mount under userns (bool)

I'll be doing more builds in the coming days. Will report if I see the problem again.

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-10-18 11:04 EDT-------
Suffered another hang on a system with:

$ uname -a
Linux dlfire5 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 13:45:41 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

$ dpkg -l | grep docker.io
ii docker.io 1.12.1-0ubuntu13~16.04.1 ppc64el Linux container runtime

$ modinfo aufs
filename: /lib/modules/4.4.0-43-generic/kernel/fs/aufs/aufs.ko
alias: fs-aufs
version: 4.x-rcN-20160111
description: aufs -- Advanced multi layered unification filesystem
author: Junjiro R. Okajima <email address hidden>
license: GPL
srcversion: BD1C47D04FE5C2D88F63AAB
depends:
intree: Y
vermagic: 4.4.0-43-generic SMP mod_unload modversions
parm: brs:use <sysfs>/fs/aufs/si_*/brN (int)
parm: allow_userns:allow unprivileged to mount under userns (bool)

Attaching kernel log file "dlfire5-oct18-kern.log"

Revision history for this message
bugproxy (bugproxy) wrote : host 'dlfire5' hang from morning of Oct 18th

------- Comment (attachment only) From <email address hidden> 2016-10-18 11:04 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-10-18 12:53 EDT-------
Same host hung again under similar workload.
Will attach kernel log from new hang as "dlfire5-oct18b-kern.log".

Revision history for this message
bugproxy (bugproxy) wrote : host 'dlfire5' hang from morning of Oct 18th (2nd hang)

------- Comment (attachment only) From <email address hidden> 2016-10-18 12:55 EDT-------

Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

Revisiting this bug, is this issue still persisting?

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-02-27 17:37 EDT-------
The system where we had seen most of these hangs is now running kernel 4.4.0-62 (and corresponding aufs.ko module still based on 4.x-rcN-20160111).

NVIDIA driver has been updated to 361.119.

The system's been running the new code levels since beginning of February, and has been running the same workload (build GPU-related projects in docker containers) as earlier.

We have not seen any of these hangs during the last month.

Manoj Iyer (manjo)
Changed in linux (Ubuntu):
status: Incomplete → Invalid
assignee: Taco Screen team (taco-screen-team) → nobody
Revision history for this message
bugproxy (bugproxy) wrote : host 'dlfire5' hang from morning of Oct 18th
Revision history for this message
bugproxy (bugproxy) wrote : host 'dlfire5' hang from morning of Oct 18th (2nd hang)
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-02-08 19:50 EDT-------
We haven't seen the problem since moving to (and beyond) the kernel and NVIDIA driver levels in my previous update.

Feel free to close. Thank you!

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-11-05 00:32 EDT-------
(In reply to comment #37)
> We haven't seen the problem since moving to (and beyond) the kernel and
> NVIDIA driver levels in my previous update.
>
> Feel free to close. Thank you!

Closing in LTC bugzilla.

Brad Figg (brad-figg)
tags: added: cscc
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.