Creating conntrack entry failure with kernel 4.4.0-89

Bug #1709032 reported by Jakub Libosvar on 2017-08-07
20
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Linux
Confirmed
Critical
Unassigned
linux (Ubuntu)
Critical
Kleber Sacilotto de Souza
Xenial
Critical
Kleber Sacilotto de Souza

Bug Description

The functional job failure rate is at 100%. Every time some test gets stuck and job is killed after timeout.

logstash query: http://logstash.openstack.org/#dashboard/file/logstash.json?query=build_name%3A%5C%22gate-neutron-dsvm-functional-ubuntu-xenial%5C%22%20AND%20tags%3Aconsole%20AND%20message%3A%5C%22Killed%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20timeout%20-s%209%5C%22

2017-08-05 12:36:50.127672 | /home/jenkins/workspace/gate-neutron-dsvm-functional-ubuntu-xenial/devstack-gate/functions.sh: line 1129: 15261 Killed timeout -s 9 ${REMAINING_TIME}m bash -c "source $WORKSPACE/devstack-gate/functions.sh && $cmd"

There are a few test executors left, which means there are more tests stuck:

stack 15468 15445 15468 0.0 0.0 328 796 /bin/sh -c OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \ OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \ OS_LOG_CAPTURE=${OS_LOG_CAPTURE:-1} \ OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-160} \ ${PYTHON:-python} -m subunit.run discover -t ./ ${OS_TEST_PATH:-./neutron/tests/unit} --load-list /tmp/tmpDTLPoX
stack 15469 15468 15469 1.5 1.8 139332 150008 python -m subunit.run discover -t ./ ./neutron/tests/functional --load-list /tmp/tmpDTLPoX
stack 15470 15445 15470 0.0 0.0 328 700 /bin/sh -c OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \ OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \ OS_LOG_CAPTURE=${OS_LOG_CAPTURE:-1} \ OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-160} \ ${PYTHON:-python} -m subunit.run discover -t ./ ${OS_TEST_PATH:-./neutron/tests/unit} --load-list /tmp/tmpICNqRQ
stack 15471 15470 15471 1.6 2.0 152056 164812 python -m subunit.run discover -t ./ ./neutron/tests/functional --load-list /tmp/tmpICNqRQ
stack 15474 15445 15474 0.0 0.0 328 792 /bin/sh -c OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \ OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \ OS_LOG_CAPTURE=${OS_LOG_CAPTURE:-1} \ OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-160} \ ${PYTHON:-python} -m subunit.run discover -t ./ ${OS_TEST_PATH:-./neutron/tests/unit} --load-list /tmp/tmpe646Tl
stack 15475 15474 15475 1.6 1.9 149972 162516 python -m subunit.run discover -t ./ ./neutron/tests/functional --load-list /tmp/tmpe646Tl
stack 15476 15445 15476 0.0 0.0 328 804 /bin/sh -c OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \ OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \ OS_LOG_CAPTURE=${OS_LOG_CAPTURE:-1} \ OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-160} \ ${PYTHON:-python} -m subunit.run discover -t ./ ${OS_TEST_PATH:-./neutron/tests/unit} --load-list /tmp/tmpv2ovhz
stack 15477 15476 15477 1.2 1.8 136760 149160 python -m subunit.run discover -t ./ ./neutron/tests/functional --load-list /tmp/tmpv2ovhz
stack 15478 15445 15478 0.0 0.0 328 712 /bin/sh -c OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \ OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \ OS_LOG_CAPTURE=${OS_LOG_CAPTURE:-1} \ OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-160} \ ${PYTHON:-python} -m subunit.run discover -t ./ ${OS_TEST_PATH:-./neutron/tests/unit} --load-list /tmp/tmpDqXE8S
stack 15479 15478 15479 1.5 1.9 148784 161004 python -m subunit.run discover -t ./ ./neutron/tests/functional --load-list /tmp/tmpDqXE8S
stack 15480 15445 15480 0.0 0.0 328 804 /bin/sh -c OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \ OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \ OS_LOG_CAPTURE=${OS_LOG_CAPTURE:-1} \ OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-160} \ ${PYTHON:-python} -m subunit.run discover -t ./ ${OS_TEST_PATH:-./neutron/tests/unit} --load-list /tmp/tmpTmmShS
stack 15482 15480 15482 1.6 1.9 148856 161516 python -m subunit.run discover -t ./ ./neutron/tests/functional --load-list /tmp/tmpTmmShS

tags: added: functional-tests gate-failure
Changed in neutron:
importance: Undecided → Critical
status: New → Confirmed

Change abandoned by Jakub Libosvar (<email address hidden>) on branch: master
Review: https://review.openstack.org/491403
Reason: It's communication with ovsdb getting stuck

On mobile at the moment so I can't easily find it, but there was a recent change to that kernel detection for the ovs compile from source script. I would confirm that didn't mess up what got compiled.

Jakub Libosvar (libosvar) wrote :

We don't use that in functional tests. We compile kernel module only for fullstack.

Jakub Libosvar (libosvar) wrote :

Also worth to note that this issue is not reproducible on Ocata branch.

Change abandoned by Jakub Libosvar (<email address hidden>) on branch: master
Review: https://review.openstack.org/491403

With enormous help of Daniel Alvarez, we were able to root cause analyse this issue. We merged functional tests using conntrack [1] and then later Ubuntu kernel at the gate was bumped from 4.4.0-87.110 to 4.4.0-89.112. The newer kernel contains a bug in conntrack and all conntrack calls get stuck, the process can't be even killed by sending a SIGKILL signal.

http://logs.openstack.org/53/490753/3/check/gate-neutron-dsvm-functional-ubuntu-xenial-nv/68f386d/logs/syslog.txt.gz#_Aug_09_05_28_58

Download full text (4.6 KiB)

The reproducer is following:

kernel: 4.4.0-89-generic
conntrack: 1:1.4.3-3
conntrackd: 1:1.4.3-3

Create a conntrack entry:

sudo conntrack -I --protonum tcp --src 1.2.3.4 --sport 65535 --dst 8.8.8.8 --dport 60000 --state ESTABLISHED --timeout 120

Trace from dmesg:
 [ 2964.587682] ------------[ cut here ]------------
 [ 2964.588883] kernel BUG at /build/linux-YaBj6t/linux-4.4.0/net/netfilter/nf_conntrack_extend.c:91!
 [ 2964.589954] invalid opcode: 0000 [#1] SMP
 [ 2964.590556] Modules linked in: br_netfilter bridge openvswitch libcrc32c nf_conntrack_netlink nfnetlink ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nls_utf8 ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp nf_conntrack_ipv4 isofs nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables hid_generic ppdev crct10dif_pclmul crc32_pclmul usbhid hid snd_pcsp ghash_clmulni_intel joydev aesni_intel snd_pcm input_leds parport_pc aes_x86_64 i2c_piix4 snd_timer lrw evbug parport snd gf128mul 8250_fintek mac_hid serio_raw glue_helper soundcore ablk_helper cryptd ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi 8021q garp mrp stp llc autofs4 ttm drm_kms_helper
 [ 2964.598769] syscopyarea sysfillrect sysimgblt fb_sys_fops drm psmouse pata_acpi floppy
 [ 2964.599587] CPU: 0 PID: 12029 Comm: conntrack Not tainted 4.4.0-89-generic #112-Ubuntu
 [ 2964.600347] Hardware name: Fedora Project OpenStack Nova, BIOS 1.9.1-5.el7_3.1 04/01/2014
 [ 2964.601178] task: ffff8802331b5940 ti: ffff8800ba5dc000 task.ti: ffff8800ba5dc000
 [ 2964.602169] RIP: 0010:[<ffffffffc0368211>] [<ffffffffc0368211>] __nf_ct_ext_add_length+0x141/0x1b0 [nf_conntrack]
 [ 2964.603408] RSP: 0018:ffff8800ba5df9a0 EFLAGS: 00010246
 [ 2964.604043] RAX: 0000000000000009 RBX: ffff880234303180 RCX: 0000000002080020
 [ 2964.604802] RDX: 0000000000000000 RSI: 0000000000000009 RDI: 0000000000000000
 [ 2964.606483] RBP: ffff8800ba5df9e8 R08: ffff88023fc1a0c0 R09: ffff8800bb108560
 [ 2964.607298] R10: ffff8800bb108500 R11: 000000003a8d6867 R12: ffff8800bb108500
 [ 2964.608090] R13: ffff8800ba5dfb58 R14: ffffffff81ef5f00 R15: ffff8800ba5dfa94
 [ 2964.608883] FS: 00007f4784492700(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
 [ 2964.609895] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [ 2964.610542] CR2: 00007f4784071520 CR3: 00000000bab56000 CR4: 00000000000006f0
 [ 2964.611327] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [ 2964.612120] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 [ 2964.612873] Stack:
 [ 2964.613197] 0000006000000078 0000000000000009 ffff880234303180 fffffffffffffff4
 [ 2964.614137] ffff880234303180 0000000000000002 ffff8800ba5dfb58 ffffffff81ef5f00
 [ 2964.615091] ffff8800ba5dfa94 ffff8800ba5dfa70 ffffffffc03a4c34 0000000000000000
 [ 2964.616096] Call Trace:
 [ 2964.616429] [<ffffffffc03a4c34>] ctnetlink_create_conntrack+0x244/0x4d0 [nf_conntrack_netlink]
 [ 2964.617433] [<ffffffffc035fecd>] ? __nf_conntrack_find_get+0x34d/0x370 [nf_conntrack]
 [ 2964.618392] [<ffffffffc03a728b>] ctnetlink_new_conntrack+...

Read more...

summary: - functional job tests get stuck
+ Creating conntrack entry failure with kernel 4.4.0-89
affects: neutron → linux
Daniel Alvarez (dalvarezs) wrote :

New kernel Ubuntu-4.4.0-89.112HEADUbuntu-4.4.0-89.112master was
tagged 9 days ago (07/31/2017) [0].

From a quick look, the only commit around this function is [1].

[0] http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/commit/?id=64de31ed97a03ec1b86fd4f76e445506dce55b02
[1] http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/commit/?id=2ad4caea651e1cc0fc86111ece9f9d74de825b78

James Page (james-page) wrote :

Re-confirmed - I see the same regression in behaviour between -81 -> -89 using the test case in #11.

Changed in linux (Ubuntu):
status: New → Confirmed
importance: Undecided → Critical
tags: added: kernel-key
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with a revert of commit 2ad4caea651. The test kernel can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1709032/

Can you test this kernel and see if it resolves this bug?

Jakub Libosvar (libosvar) wrote :

Thanks Joseph, I tested both, creating a conntrack entry using conntrack CLI and running the Neutron functional tests, both are fine with the revert.

Jay Vosburgh (jvosburgh) wrote :

The panic appears to be fixed upstream via:

commit 9c3f3794926a997b1cab6c42480ff300efa2d162
Author: Liping Zhang <email address hidden>
Date: Sat Mar 25 16:35:29 2017 +0800

    netfilter: nf_ct_ext: fix possible panic after nf_ct_extend_unregister

    If one cpu is doing nf_ct_extend_unregister while another cpu is doing
    __nf_ct_ext_add_length, then we may hit BUG_ON(t == NULL). Moreover,
    there's no synchronize_rcu invocation after set nf_ct_ext_types[id] to
    NULL, so it's possible that we may access invalid pointer.
[...]

Jakub Libosvar (libosvar) wrote :

Is it possible to backport the patch to 4.4.0 kernel in Xenial LTS version?

Daniel Alvarez (dalvarezs) wrote :

Are you sure 9c3f3794 fixes it? I'm not a kernel expert but since the backtrace differs and, according to the commit message, I'd say it's a different issue.
We could try backporting it manually and check but maybe reverting 2ad4caea651 suffices until
ubuntu guys fix the bug in a future tag?

Joseph Salisbury (jsalisbury) wrote :

Just to confirm, the revert did fix the bug as posted in comment #15?

Joseph Salisbury (jsalisbury) wrote :

I'll build a Xenial test kernel with the commit Jay mentioned in comment #16 and post it shortly.

Joseph Salisbury (jsalisbury) wrote :

I built a xenial test kernel with a pick of commit 9c3f379492 and WITHOUT commit 2ad4caea651 reverted.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1709032/patched

Can you give that kernel a test?

Jakub Libosvar (libosvar) wrote :

I cherry-picked 9c3f3794 to Ubuntu-4.4.0-89.112 tag and built the kernel and I'm able to create a conntrack entry without any issues or traces, so it *does* fix the issue. I assume what I had is the same as what Joseph provided :) Thanks for helping us out guys!

Changed in linux (Ubuntu Xenial):
status: New → In Progress
Changed in linux (Ubuntu):
status: Confirmed → In Progress
Changed in linux (Ubuntu Xenial):
importance: Undecided → Critical
Changed in linux (Ubuntu):
assignee: nobody → Kleber Sacilotto de Souza (kleber-souza)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Kleber Sacilotto de Souza (kleber-souza)
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-92.115

---------------
linux (4.4.0-92.115) xenial; urgency=low

  * linux: 4.4.0-92.115 -proposed tracker (LP: #1709812)

  * Creating conntrack entry failure with kernel 4.4.0-89 (LP: #1709032)
    - Revert "netfilter: synproxy: fix conntrackd interaction"

 -- Kleber Sacilotto de Souza <email address hidden> Thu, 10 Aug 2017 10:29:51 +0200

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
status: Fix Committed → Fix Released

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial

Xenial kernel 4.4.0-92.115 had a quick fix for the issue, which was to revert the patch that introduced the regression ("netfilter: synproxy: fix conntrackd interaction"). The kernel build that's in -proposed (currently 4.4.0-93.116) reapplies this patch and applies on top the proper fix found upstream ("netfilter: nf_ct_ext: fix possible panic after nf_ct_extend_unregister").

Could someone please verify if the kernel in -proposed really fixes the issue?

Thank you.

tags: removed: kernel-key

Hi @libosvar,

Could you please verify the follow-up fix with the xenial kernel in -proposed?

Thank you.

Jakub Libosvar (libosvar) wrote :

We enabled back our failing CI tests after switching to kernel 4.4.0-92.115 and everything went back to normal. Current 4.4.0-93.116 works just fine too. Thanks a lot for a quick help! I think you can close this bug as resolved :)

tags: added: verification-done-xenial
removed: verification-needed-xenial
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers