Ubuntu16.04.01VM:Docker-Powervm aufs bad file panic while running tests in a docker container

Bug #1650062 reported by bugproxy
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
New
Undecided
Seth Forshee

Bug Description

== Comment: #0 - Vinutha GS - 2016-12-13 02:47:35 ==
When some of the base and io tests were run inside a docker container, the par crashed and below are the stack trace and other details.

Steps to re-create -
1. Install 16.04.02 on a PowerVM lpar.
2. Ran setup general.
3. Ran docker scripts[home grown scripts] which does docker package installation and other setups required to run STAF cases inside docker container.
4. We have docker image using which we launch containers and start tests inside containers.
If complete details are required on how to execute scripts, please let me know.
5. STAF Base and IO tests were started inside containers successfully, after sometime, I see partition is in XMON.

Docker info -
docker info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 1.12.1
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 0
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null host bridge overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor
Kernel Version: 4.4.0-53-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: ppc64le
CPUs: 24
Total Memory: 49.89 GiB
Name: bamlp3
ID: I7VI:G4RJ:RHTQ:WNGV:52FK:K7AZ:YDJQ:KFUM:P3UA:MZ3I:5XUY:WV3N
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8

docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
61f2b8ab0a86 32d545c3ea01 "/bin/sh -c ./staf_io" 24 minutes ago Up 24 minutes bamlp3-io
151da0322172 590e44f15214 "/bin/sh -c ./staf_ba" 30 minutes ago Up 30 minutes bamlp3-base

Stack trace -
8:mon> t
[c000000a5e147d10] d00000000a04ca98 aufs_flush_nondir+0x38/0x50 [aufs]
[c000000a5e147d40] c0000000002e0428 filp_close+0x68/0xe0
[c000000a5e147dc0] c00000000030f71c __close_fd+0xcc/0x150
[c000000a5e147e00] c0000000002e04d4 SyS_close+0x34/0x90
[c000000a5e147e30] c000000000009204 system_call+0x38/0xb4
--- Exception: c00 (System Call) at 00003fff8bc217d8
SP (3fffd85203b0) is in userspace
8:mon> e
cpu 0x8: Vector: 300 (Data Access) at [c000000a5e147a40]
    pc: d00000000a04bdd4: au_do_flush+0x44/0x220 [aufs]
    lr: d00000000a04ca98: aufs_flush_nondir+0x38/0x50 [aufs]
    sp: c000000a5e147cc0
   msr: 8000000000009033
   dar: 28
 dsisr: 40000000
  current = 0xc000000a8b7fc8e0
  paca = 0xc00000000fb44c00 softe: 0 irq_happened: 0x01
    pid = 11936, comm = remap_file_page
8:mon>

Release details -
uname -r
4.4.0-53-generic

 uname -a
Linux bamlp4 4.4.0-53-generic #74-Ubuntu SMP Fri Dec 2 15:59:36 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

== Comment: #6 - Vinutha GS - 2016-12-14 03:16:18 ==
Please find the attached sosreport.
Also i have followed the steps for k-dump, It is enabled now.
I'm going to start the tests once again.

== Comment: #12 - Kevin W. Rudd - 2016-12-14 16:06:46 ==
The basic reason for the panic is that close was called on a file
that was no longer valid. The f_count value was -8 for some reason,
so it passed the following check in filep_close():

        if (!file_count(filp)) {
                printk(KERN_ERR "VFS: Close: file count is 0\n");
                return 0;
        }

It then blew up in au_do_flush() because f_inode was NULL.

Revision history for this message
bugproxy (bugproxy) wrote : sosreport

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-149901 severity-critical targetmilestone-inin---
Revision history for this message
bugproxy (bugproxy) wrote : dmesg from kdump

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : file pointer passed to au_do_flush()

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-01-04 10:07 EDT-------
Canonical.

We have no official aufs expertise over here, so we are looking to you for further assistance with this issue.

Thanks.

Revision history for this message
Seth Forshee (sforshee) wrote :

From the attachment file->f_inode is NULL. That shouldn't be the case for a file passed to the aufs flush callback, so it's not a bug for aufs to assume it will be non-NULL. However in this case the file has a negative reference count, so there must be some earlier problem with reference counting.

I'm not all that familiar with aufs, so I'll have to spend some time looking for unbalanced fputs in aufs. It's possible hoewever that the problem lies outside of aufs.

Revision history for this message
Seth Forshee (sforshee) wrote :

I haven't found any obvious source in aufs for the extra fputs which I suspect are causing this problem. If you could either give me more information so I can run the same tests myself (I'm guessing the problem isn't arch-specific) or else reproduce the problem with the kernel patched with something like the following, perhaps we can catch it in the act. However we also might just catch legitimate fputs since the erroneous ones could occur while the refcount is still positive ...

diff --git a/fs/file_table.c b/fs/file_table.c
index df66450fb443..d4911a6e8331 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -264,7 +264,9 @@ static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput);

 void fput(struct file *file)
 {
- if (atomic_long_dec_and_test(&file->f_count)) {
+ long cnt = atomic_long_dec_return(&file->f_count);
+ WARN_ON(cnt < 0);
+ if (cnt == 0) {
                struct task_struct *task = current;

                if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {

Manoj Iyer (manjo)
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Seth Forshee (sforshee)
Revision history for this message
bugproxy (bugproxy) wrote : Console logs

------- Comment (attachment only) From <email address hidden> 2017-02-03 06:26 EDT-------

bugproxy (bugproxy)
tags: added: severity-high
removed: severity-critical
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-04-06 16:33 EDT-------
Hello Canonical.

Since we have been unable to reproduce this issue with a newer 4.8.0-34 kernel with debugging in place, we are closing it on our end as unreproducible. Thanks for the assistance.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.