SYMC: Seeing kernel backtrace and vrouter process stuck in INIT state with contrail 2.01 and kernel 3.13

Bug #1449162 reported by Varun Lodaya
This bug affects 1 person
Affects                                       Status      Importance  Assigned to        Milestone
Juniper Openstack (status tracked in Trunk)
  Trunk                                       Incomplete  High        Anand H. Krishnan
OpenContrail                                  Incomplete  High        Anand H. Krishnan

Bug Description

We are seeing frequent kernel backtraces on hypervisors running Contrail 2.01-41 with kernel version 3.13. The only way to recover is to reboot the hypervisor. We need to find the root cause, as this is seriously affecting our uptime.

Following is the backtrace:
2015-04-25T12:28:05.944331+00:00 b0c010ash2018 kernel: [535885.190684] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
2015-04-25T12:28:05.944348+00:00 b0c010ash2018 kernel: [535885.215308] IP: [<ffffffff811b0657>] kmem_cache_alloc+0x77/0x1f0
2015-04-25T12:28:05.944349+00:00 b0c010ash2018 kernel: [535885.230779] PGD 12df28d067 PUD 12d69d0067 PMD 0
2015-04-25T12:28:05.944350+00:00 b0c010ash2018 kernel: [535885.247356] Oops: 0000 [#1] SMP
2015-04-25T12:28:05.944351+00:00 b0c010ash2018 kernel: [535885.264374] Modules linked in: veth vhost_net macvtap macvlan vhost xt_conntrack iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack 8021q mrp garp bridge stp llc ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables nbd vrouter(OX) ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bonding x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm dm_multipath crct10dif_pclmul crc32_pclmul dcdbas ghash_clmulni_intel scsi_dh xfs gpio_ich joydev ipmi_devintf aesni_intel msr ablk_helper lp mac_hid cryptd sb_edac edac_core lrw gf128mul parport mei_me wmi ioatdma lpc_ich mei ipmi_si glue_helper libcrc32c shpchp acpi_power_meter aes_x86_64 hid_generic usbhid hid ixgbe igb megaraid_sas dca i2c_algo_bit ptp pps_core mdio
2015-04-25T12:28:05.944354+00:00 b0c010ash2018 kernel: [535885.575624] CPU: 28 PID: 45796 Comm: openstack-statu Tainted: G OX 3.13.0-49-generic #81~precise1-Ubuntu
2015-04-25T12:28:05.944355+00:00 b0c010ash2018 kernel: [535885.654294] Hardware name: Dell Inc. PowerEdge R720xd/0X3D66, BIOS 2.4.3 07/09/2014
2015-04-25T12:28:05.944356+00:00 b0c010ash2018 kernel: [535885.737038] task: ffff8812f2ce0000 ti: ffff8812c70fe000 task.ti: ffff8812c70fe000
2015-04-25T12:28:05.944356+00:00 b0c010ash2018 kernel: [535885.826571] RIP: 0010:[<ffffffff811b0657>] [<ffffffff811b0657>] kmem_cache_alloc+0x77/0x1f0
2015-04-25T12:28:05.944357+00:00 b0c010ash2018 kernel: [535885.923753] RSP: 0018:ffff8812c70ffd90 EFLAGS: 00010282
2015-04-25T12:28:05.944358+00:00 b0c010ash2018 kernel: [535885.974387] RAX: 0000000000000000 RBX: 0000000001200011 RCX: 000000000003e8bc
2015-04-25T12:28:05.944359+00:00 b0c010ash2018 kernel: [535886.078298] RDX: 000000000003e8bb RSI: 00000000000000d0 RDI: 00000000000162a0
2015-04-25T12:28:05.944360+00:00 b0c010ash2018 kernel: [535886.188858] RBP: ffff8812c70ffde0 R08: ffff88181fbd62a0 R09: ffffffff8108be94
2015-04-25T12:28:05.944363+00:00 b0c010ash2018 kernel: [535886.302853] R10: ffff88187fffbf00 R11: 00007fff8841b000 R12: 0000000000000001
2015-04-25T12:28:05.944365+00:00 b0c010ash2018 kernel: [535886.418492] R13: ffff88181f403900 R14: ffff88181f403900 R15: 00000000000000d0
2015-04-25T12:28:05.944387+00:00 b0c010ash2018 kernel: [535886.536915] FS: 00007fbc18506700(0000) GS:ffff88181fbc0000(0000) knlGS:0000000000000000
2015-04-25T12:28:05.944388+00:00 b0c010ash2018 kernel: [535886.656231] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2015-04-25T12:28:05.944393+00:00 b0c010ash2018 kernel: [535886.716245] CR2: 0000000000000001 CR3: 00000012d6b18000 CR4: 00000000001427e0
2015-04-25T12:28:05.944394+00:00 b0c010ash2018 kernel: [535886.833304] Stack:
2015-04-25T12:28:05.944395+00:00 b0c010ash2018 kernel: [535886.889899] ffffffff8108be94 ffff8817ef6a0aa8 ffff8817f7189880 0000000000000000
2015-04-25T12:28:05.944395+00:00 b0c010ash2018 kernel: [535887.004037] ffff8812c70ffde0 0000000001200011 0000000000000000 ffffffff81c452a0
2015-04-25T12:28:05.944396+00:00 b0c010ash2018 kernel: [535887.117912] 0000000000000000 0000000000000000 ffff8812c70ffe10 ffffffff8108be94
2015-04-25T12:28:05.944397+00:00 b0c010ash2018 kernel: [535887.232613] Call Trace:
2015-04-25T12:28:05.944397+00:00 b0c010ash2018 kernel: [535887.288242] [<ffffffff8108be94>] ? alloc_pid+0x24/0x2e0
2015-04-25T12:28:05.944400+00:00 b0c010ash2018 kernel: [535887.344143] [<ffffffff8108be94>] alloc_pid+0x24/0x2e0
2015-04-25T12:28:05.944401+00:00 b0c010ash2018 kernel: [535887.398737] [<ffffffff8106998a>] copy_process.part.27+0x8ca/0xf50
2015-04-25T12:28:05.944401+00:00 b0c010ash2018 kernel: [535887.452662] [<ffffffff8110331b>] ? audit_filter_rules.isra.7+0x55b/0xad0
2015-04-25T12:28:05.944402+00:00 b0c010ash2018 kernel: [535887.506223] [<ffffffff8106a090>] copy_process+0x80/0x90
2015-04-25T12:28:05.944402+00:00 b0c010ash2018 kernel: [535887.558678] [<ffffffff8106a1d2>] do_fork+0x62/0x280
2015-04-25T12:28:05.944403+00:00 b0c010ash2018 kernel: [535887.610125] [<ffffffff81103924>] ? audit_filter_syscall+0x94/0xe0
2015-04-25T12:28:05.944412+00:00 b0c010ash2018 kernel: [535887.661481] [<ffffffff8106a476>] SyS_clone+0x16/0x20
2015-04-25T12:28:05.944413+00:00 b0c010ash2018 kernel: [535887.711599] [<ffffffff8176ead9>] stub_clone+0x69/0x90
2015-04-25T12:28:05.944414+00:00 b0c010ash2018 kernel: [535887.760582] [<ffffffff8176e77d>] ? system_call_fastpath+0x1a/0x1f
2015-04-25T12:28:05.944414+00:00 b0c010ash2018 kernel: [535887.808776] Code: 00 00 49 8b 50 08 4d 8b 20 49 8b 40 10 4d 85 e4 0f 84 5e 01 00 00 48 85 c0 0f 84 55 01 00 00 49 63 45 20 49 8b 7d 00 48 8d 4a 01 <49> 8b 1c 04 4c 89 e0 65 48 0f c7 0f 0f 94 c0 84 c0 74 b7 49 63
2015-04-25T12:28:05.944415+00:00 b0c010ash2018 kernel: [535887.954161] RIP [<ffffffff811b0657>] kmem_cache_alloc+0x77/0x1f0
2015-04-25T12:28:05.944416+00:00 b0c010ash2018 kernel: [535888.001499] RSP <ffff8812c70ffd90>
2015-04-25T12:28:05.944416+00:00 b0c010ash2018 kernel: [535888.047379] CR2: 0000000000000001
2015-04-25T12:28:05.944419+00:00 b0c010ash2018 kernel: [535888.157161] ---[ end trace 8e74a782f5824da3 ]---
2015-04-25T12:28:06.040139+00:00 b0c010ash2018 kernel: [535888.207652] [sched_delayed] sched: RT throttling activated
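
When reading an oops like the one above, it can help to check which general-purpose register holds the faulting address reported in CR2. The following is a minimal sketch, not part of the original report: the function names and the `SAMPLE` excerpt (a few register lines copied from the trace, with the syslog prefix stripped) are illustrative assumptions.

```python
import re

# Matches "NAME: <16 hex digits>" pairs such as "RAX: 0000000000000000"
# or "CR2: 0000000000000001". Segment-prefixed values ("RSP: 0018:...")
# are intentionally skipped because they are not plain 16-digit values.
REG_RE = re.compile(r"\b(R[A-Z0-9]{1,2}|CR[0-9]):\s*([0-9a-fA-F]{16})\b")

# Excerpt of register lines from the trace (hypothetical constant name).
SAMPLE = """\
RAX: 0000000000000000 RBX: 0000000001200011 RCX: 000000000003e8bc
RBP: ffff8812c70ffde0 R08: ffff88181fbd62a0 R09: ffffffff8108be94
R10: ffff88187fffbf00 R11: 00007fff8841b000 R12: 0000000000000001
CR2: 0000000000000001 CR3: 00000012d6b18000 CR4: 00000000001427e0
"""


def parse_registers(log_text):
    """Return {register name: integer value}, keeping the first occurrence."""
    regs = {}
    for name, value in REG_RE.findall(log_text):
        regs.setdefault(name, int(value, 16))
    return regs


def registers_matching_fault(log_text):
    """List general-purpose registers whose value equals CR2."""
    regs = parse_registers(log_text)
    fault = regs.get("CR2")
    if fault is None:
        return []
    return sorted(
        name for name, value in regs.items()
        if value == fault and not name.startswith("CR")
    )


if __name__ == "__main__":
    print(registers_matching_fault(SAMPLE))
```

On the excerpt, the only match is R12, which holds the same value (0x1) as CR2. What that implies about the root cause is not established anywhere in this report.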

Tags: vrouter
tags: added: vrouter
Anand H. Krishnan (anandhk) wrote :

Hi

Can you please check whether there was a steady memory leak before this crash happened? It is difficult to see what has gone wrong without a dump.

Thanks,

Varun Lodaya (varun-lodaya) wrote : Re: [Bug 1449162] Re: Seeing kernel backtrace and vrouter process stuck in INIT state with contrail 2.01 and kernel 3.13

Hey Anand,

It's kind of tricky to figure out after the reboot whether there was a
memory leak. We are trying to plot memory usage for the existing
hypervisors to see whether they are running into memory leaks. Do you
want us to check anything specific?
Btw, we have seen more hypervisors with kernel 3.13 crash in the last 2
days.

Thanks,
Varun
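
One possible way to look for a steady leak on the hypervisors that are still up is to sample a field of /proc/meminfo at intervals and fit a line through the samples. This is only a sketch under assumptions: the thread does not say which counter to watch, so the default field `SUnreclaim` (unreclaimable slab memory), the sampling interval, and all function names here are hypothetical choices.

```python
import time


def read_meminfo_kb(text):
    """Parse /proc/meminfo-style text into {field name: numeric value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def slope_per_second(samples):
    """Least-squares slope of (timestamp, value) pairs; 0.0 if undefined."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        return 0.0
    return sum((t - mean_t) * (v - mean_v) for t, v in samples) / denom


def sample_field(field="SUnreclaim", count=5, interval=60.0,
                 path="/proc/meminfo"):
    """Collect `count` (timestamp, kB) samples of one meminfo field."""
    samples = []
    for i in range(count):
        with open(path) as fh:
            samples.append((time.time(), read_meminfo_kb(fh.read())[field]))
        if i + 1 < count:
            time.sleep(interval)
    return samples


if __name__ == "__main__":
    growth = slope_per_second(sample_field())
    print("SUnreclaim growth: %.2f kB/s" % growth)
```

A slope that stays positive over hours would be consistent with a leak, though short sampling windows can be dominated by normal cache churn.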

On 4/28/15, 2:21 AM, "Anand H. Krishnan" <email address hidden> wrote:

>Hi
>
>Can you please check whether there is a steady memory leak before this
>crash happened? It is difficult to see what has gone wrong without a
>dump.
>
>Thanks,

Changed in juniperopenstack:
importance: Undecided → High
Changed in opencontrail:
importance: Undecided → High
Changed in opencontrail:
assignee: nobody → Anand H. Krishnan (anandhk)
summary: - Seeing kernel backtrace and vrouter process stuck in INIT state with
- contrail 2.01 and kernel 3.13
+ SYMC: Seeing kernel backtrace and vrouter process stuck in INIT state
+ with contrail 2.01 and kernel 3.13
Changed in opencontrail:
status: New → Incomplete