DPDK: with Netns sometimes kernel crash is seen

Bug #1691245 reported by Vinod Nair
This bug affects 1 person
Affects: Juniper Openstack (status tracked in Trunk)
Series   Status          Importance   Assigned to   Milestone
R4.0     Fix Committed   High         Vinod Nair
Trunk    Fix Committed   High         Vinod Nair

Bug Description

With LBaaS on DPDK, either the vrouter or the kernel crashes. This bug tracks the kernel crash.

[16459.111459] ------------[ cut here ]------------
[16459.126781] kernel BUG at /build/linux-lts-xenial-gUF4JR/linux-lts-xenial-4.4.0/net/core/dev.c:6943!
[16459.161261] invalid opcode: 0000 [#1] SMP
[16459.179984] Modules linked in: rte_kni(OE) igb_uio(OE) veth xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables uio nbd ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi 8021q garp mrp stp llc bonding input_leds joydev ipmi_ssif intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm dm_multipath irqbypass crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw ioatdma gf128mul glue_helper ablk_helper cryptd sb_edac edac_core shpchp lpc_ich wmi ipmi_si ipmi_msghandler 8250_fintek mac_hid vhost_net
[16459.451793] vhost macvtap macvlan lp parport ses enclosure hid_generic i40e ixgbe igb vxlan mpt3sas ip6_udp_tunnel usbhid i2c_algo_bit mdio udp_tunnel dca ahci hid raid_class ptp libahci scsi_transport_sas pps_core fjes [last unloaded: igb_uio]
[16459.595452] CPU: 25 PID: 23353 Comm: ip Tainted: G OE 4.4.0-31-generic #50~14.04.1-Ubuntu
[16459.673869] Hardware name: Supermicro SSG-6027R-E1R12L/X9DRD-7LN4F, BIOS 3.0 07/09/2013
[16459.757228] task: ffff8818528b1b80 ti: ffff88183f0bc000 task.ti: ffff88183f0bc000
[16459.845449] RIP: 0010:[<ffffffff816f6e88>] [<ffffffff816f6e88>] netdev_run_todo+0x2f8/0x300
[16459.938533] RSP: 0018:ffff88183f0bfbe8 EFLAGS: 00010206
[16459.986174] RAX: ffff8820351ce090 RBX: ffff8820351ce000 RCX: 0000000000000020
[16460.034563] RDX: 0000000000000100 RSI: 0000000000000100 RDI: ffffffff81d385a0
[16460.082646] RBP: ffff88183f0bfc30 R08: 0000000000000003 R09: 0000000000000000
[16460.130450] R10: 00000000353f9d01 R11: ffffea0080d4fe40 R12: ffff8820351ce450
[16460.178029] R13: 00000001003da50a R14: 00000001003da50a R15: 00000000000000fa
[16460.224782] FS: 00007f5d5ebea740(0000) GS:ffff88203f440000(0000) knlGS:0000000000000000
[16460.317730] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16460.364254] CR2: 0000000000610a68 CR3: 000000184daca000 CR4: 00000000000406e0
[16460.410719] Stack:
[16460.455625] ffff88203458b100 ffff8820351ce000 ffff88183f0bfbf8 ffff88183f0bfbf8
[16460.545971] ffff88203458b100 ffff88203458b100 ffff882035ad7000 0000000000000020
[16460.634642] 0000000000000000 ffff88183f0bfc48 ffffffff8170259d ffff88203358e000
[16460.725128] Call Trace:
[16460.768897] [<ffffffff8170259d>] rtnetlink_rcv+0x2d/0x30
[16460.812784] [<ffffffff81724d95>] netlink_unicast+0xf5/0x1e0
[16460.855893] [<ffffffff81725259>] netlink_sendmsg+0x3d9/0x630
[16460.898045] [<ffffffff8137cb7d>] ? aa_sock_msg_perm+0x5d/0x140
[16460.939790] [<ffffffff816d6d58>] sock_sendmsg+0x38/0x50
[16460.980983] [<ffffffff816d7680>] ___sys_sendmsg+0x270/0x290
[16461.021279] [<ffffffff811915f7>] ? lru_cache_add_active_or_unevictable+0x27/0x90
[16461.100271] [<ffffffff811b18b5>] ? handle_pte_fault+0xaf5/0x1470
[16461.140124] [<ffffffff812140a2>] ? __dentry_kill+0x162/0x1e0
[16461.179255] [<ffffffff816d7fd2>] __sys_sendmsg+0x42/0x80
[16461.217240] [<ffffffff816d8022>] SyS_sendmsg+0x12/0x20
[16461.254629] [<ffffffff817f6f36>] entry_SYSCALL_64_fastpath+0x16/0x75
[16461.292016] Code: 00 00 48 c7 c7 b8 97 b6 81 e8 e5 69 98 ff e9 14 ff ff ff be
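
For triage, the key fields of the trace above (the BUG location, the faulting symbol, and the process that triggered it) can be pulled out mechanically. The following is a minimal sketch, assuming the oops text is available as a string; the function name `summarize_oops` and the regular expressions are illustrative and not part of any Contrail or kernel tooling. It does not attempt to say which assertion sits at dev.c:6943, since the report does not identify it.

```python
import re

# Three lines copied verbatim from the trace in this report.
OOPS_EXCERPT = """\
[16459.126781] kernel BUG at /build/linux-lts-xenial-gUF4JR/linux-lts-xenial-4.4.0/net/core/dev.c:6943!
[16459.595452] CPU: 25 PID: 23353 Comm: ip Tainted: G OE 4.4.0-31-generic #50~14.04.1-Ubuntu
[16459.845449] RIP: 0010:[<ffffffff816f6e88>] [<ffffffff816f6e88>] netdev_run_todo+0x2f8/0x300
"""

# Hypothetical patterns written for the line shapes seen above.
BUG_RE = re.compile(r"kernel BUG at (?P<path>\S+):(?P<line>\d+)!")
COMM_RE = re.compile(r"PID: (?P<pid>\d+) Comm: (?P<comm>\S+)")
RIP_RE = re.compile(
    r"RIP: .*\] (?P<symbol>[\w.]+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
)


def summarize_oops(text):
    """Return a dict with the BUG location, process, and faulting symbol."""
    summary = {}
    for line in text.splitlines():
        bug = BUG_RE.search(line)
        if bug:
            summary["bug_file"] = bug.group("path")
            summary["bug_line"] = int(bug.group("line"))
        comm = COMM_RE.search(line)
        if comm:
            summary["pid"] = int(comm.group("pid"))
            summary["comm"] = comm.group("comm")
        rip = RIP_RE.search(line)
        if rip:
            summary["symbol"] = rip.group("symbol")
            summary["offset"] = int(rip.group("offset"), 16)
            summary["size"] = int(rip.group("size"), 16)
    return summary
```

Calling `summarize_oops(OOPS_EXCERPT)` on the excerpt shows that the BUG fired in `netdev_run_todo` while the `ip` command was running, which is consistent with the later observation that the crash involves network namespaces.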

root@cmbu-starwars-01:/var/crash/201705152254# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.5 LTS"
root@cmbu-starwars-01:/var/crash/201705152254# uname -a
Linux cmbu-starwars-01 4.4.0-31-generic #50~14.04.1-Ubuntu SMP Wed Jul 13 01:07:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Kernel core dump in: /cs-shared/bugs/1691245
Version: 4.0.0.0-5 Mitaka

Vinod Nair (vinodnair)
summary: - DPDK: with Lbbas sometimes kernel crash is also seen
+ DPDK: with Lbaas sometimes kernel crash is also seen
description: updated
Kiran (kiran-kn80) wrote : Re: DPDK: with Lbaas sometimes kernel crash is also seen

Vinod, as discussed, can you please downgrade the kernel to 3.13.0.x, use my debug image, and see whether the crash still happens?

Jeba Paulaiyan (jebap)
tags: added: blocker crashes
Vinod Nair (vinodnair) wrote :

I had already run the test after downgrading to 3.13.106. The kernel crash has not been seen so far, but the vrouter crashes every time:
#0 dpdk_if_tx (vif=<optimized out>, pkt=<optimized out>) at vrouter/dpdk/vr_dpdk_interface.c:1666

Jeba Paulaiyan (jebap)
tags: added: releasenote
removed: blocker
information type: Proprietary → Public
Vinod Nair (vinodnair) wrote :

The crash is seen when the compute node has a netns, such as SNAT or LBaaS, and runs the default kernel of the OS.
In this case it was 4.4.0-31-generic.

cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.5 LTS"

summary: - DPDK: with Lbaas sometimes kernel crash is also seen
+ DPDK: with Netns sometimes kernel crash is seen
Kiran (kiran-kn80) wrote :

Hi Vinod,
Had a discussion with Raja. Can you please test this with DPDK 17.02?

- Kiran

Vinod Nair (vinodnair) wrote :

The kernel crash is also seen with 17.x.

Vinod Nair (vinodnair) wrote :

The crash is seen with the 4.4 kernel, which comes with Ubuntu 14.04.5.

The crash is no longer seen with the 3.13.0.110 kernel.
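
The kernel observations in this thread can be turned into a quick check of whether a host runs a kernel series for which the crash was reported. This is a minimal sketch, assuming only the two data points given in the comments (crash on 4.4, no crash on 3.13); the names `kernel_series` and `crash_reported_for` are hypothetical, and other kernel series are simply unknown rather than safe.

```python
import platform
import re

# Assumption: based solely on the comments in this bug report.
# Crash reported on the 4.4 series; not seen on the 3.13 series.
CRASH_REPORTED_SERIES = {(4, 4)}


def kernel_series(release):
    """Extract (major, minor) from a release string such as '4.4.0-31-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def crash_reported_for(release):
    """True if the crash was reported on this kernel series in this thread."""
    return kernel_series(release) in CRASH_REPORTED_SERIES


def running_kernel_affected():
    """Apply the same check to the kernel of the current host."""
    return crash_reported_for(platform.release())
```

For example, `crash_reported_for("4.4.0-31-generic")` is true, while `crash_reported_for("3.13.0.110")` is false. A false result only means the thread has no crash report for that series.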

Raja Sivaramakrishnan (raja-u) wrote :

Vinod, can you please check if this can be reproduced?

ram yadav (ryadav) wrote :

Vinod,
Please try to reproduce this by running the sanity tests serially instead of in parallel, as is done today, to see whether we observe the issue. This will also help us figure out whether the observed failure has a pattern and is repeatable.

Thanks,
Ram

OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/35412
Submitter: ryadav (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R4.0

Review in progress for https://review.opencontrail.org/35413
Submitter: ryadav (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/35413
Committed: http://github.com/Juniper/contrail-vrouter/commit/40fbd6d20f8ce803834be5fea8b8620245cff564
Submitter: Zuul (<email address hidden>)
Branch: R4.0

commit 40fbd6d20f8ce803834be5fea8b8620245cff564
Author: ram-yadav <email address hidden>
Date: Fri Sep 8 14:46:08 2017 -0700

Increase the time to wait for start-up of contrail-vrouter-dpdk

Increased the startup time to 240 seconds for contrail-vrouter-dpdk as
startup taking longer on ubuntu 16.04.

Change-Id: I56f445c3c5f08954fdffcbfa2fac0172acbcede7
Closes-Bug: #1691245

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/35412
Committed: http://github.com/Juniper/contrail-vrouter/commit/15630c56d6b356a0b976cf1d54d019aebebac470
Submitter: Zuul (<email address hidden>)
Branch: master

commit 15630c56d6b356a0b976cf1d54d019aebebac470
Author: ram-yadav <email address hidden>
Date: Fri Sep 8 14:46:08 2017 -0700

Increase the time to wait for start-up of contrail-vrouter-dpdk

Increased the startup time to 240 seconds for contrail-vrouter-dpdk as
startup taking longer on ubuntu 16.04.

Change-Id: I56f445c3c5f08954fdffcbfa2fac0172acbcede7
Closes-Bug: #1691245
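
The commit message above describes raising a start-up wait to 240 seconds, but this thread does not show how the wait is implemented. As an illustration only, a bounded polling wait of that kind might look like the sketch below; the function `wait_until_ready`, its parameters, and the `is_ready` callback are hypothetical and are not taken from the contrail-vrouter change.

```python
import time


def wait_until_ready(is_ready, timeout_s=240.0, poll_interval_s=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True or timeout_s elapses.

    Returns True if the service became ready in time, False otherwise.
    clock and sleep are injectable so the loop can be tested without
    real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

A caller would pass its own readiness probe, for example a function that checks whether the process has opened its control socket. Using a monotonic clock keeps the deadline stable if the wall clock is adjusted during start-up.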

ram yadav (ryadav) wrote :

I was planning to recreate this problem by running traffic between two VMs on CB build 4.0.1.0-50.
I ran into multiple issues with sm-lite and a contrail-vrouter-dpdk timeout, which have been fixed. With CB 50, however, RabbitMQ keeps failing. It looks like this is a known issue that is fixed in the latest release.
I will try to recreate the issue with the latest build.

ram yadav (ryadav) wrote :

Managed to bring up a Contrail cluster with two DPDK compute nodes.
Running bidirectional throughput between the VMs using dpdk pktgen.
Haven't observed any kernel panic yet; the dpdk pktgen has been running for 18 hours now.
Will run it for one more day and see if a kernel panic is observed; otherwise we will depend on Vinod's effort to recreate this issue.

ram yadav (ryadav) wrote :

Bidirectional VM traffic has been running for 6 days now. No kernel panic yet.
At this point we will wait for Vinod to recreate the issue.
