contrail vrouter compute nodes shutdown abruptly with kernel call trace

Bug #1683305 reported by kalagesan
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Juniper Openstack
Status tracked in Trunk
R2.21.x
Fix Committed
Undecided
Unassigned
R3.0
Fix Committed
Undecided
Unassigned
R3.0.3.x
New
Undecided
Unassigned
R3.1
Fix Committed
High
Unassigned
R3.2
Fix Committed
Undecided
Unassigned
R3.2.3.x
Fix Committed
Undecided
Unassigned
Trunk
Fix Committed
Undecided
Unassigned

Bug Description

Sudden shutdown on few of the compute nodes in production setup.

Contrail Version is V2.21.2 Build 55

A Kernel Call trace was output in the customer's environment, causing Compute Node (kw1ap-vscp0107n) to shutdown.

Below is the time when shutdown occurred.
- kw1ap-vscp0107n(Ubuntu 14.04) === Feb 23 05:43
And there was additional information from the customer.
To analyze another problem, the following vRouter commands was executed at intervals of 10 seconds. (between 18/1/2017 and 15/2/2017)
- vrfstats --dump
- dropstats
- vif --list
During this period 12 Compute Nodes units got shutdown.
Among them, it is said that five units Kernel Call Traces were output and sudden shutdown occurred.
After Feb 15 customer's stopped executing the command, and only issue occurred kw1ap-vscp0107n.Customer powered on the machines to recover them.

Ubuntu 14.04 : 3 units
Redhat 7.1 : 2 units
Below is the time when shutdown occurred.
- kw1bp-vscp0038n(Rhel 7.1) === Feb 10 11:10
- kw1ap-vscp0031n(Rhel 7.1) === Jan 22 03:49
- kw1ap-vscp0027n(Ubuntu 14.04) === Jan 20 01:07
- kw1ap-vscp0090n(Ubuntu 14.04) === Feb 14 09:39
- os5ap-vscp0023n(Ubuntu 14.04) === Jan 23 09:59

It seems that ComputeNode will become unstable by periodically executing the vrouter analysis command from the Customer's analysis result.

Customer confirmed there is no crash files(/var/crashes) from the compute nodes where this issue happened.

I checked /var/log/messages from one of the compute node kw1bp-vscp0038n(Rhel 7.1) where issue occurred on Feb 10 11:10 , I see vrouter related information in the kernel call trace errors.

Feb 10 11:10:46 kw1bp-vscp0038n-jp1 kernel: [14735550.241354] WARNING: at lib/list_debug.c:53 __list_del_entry+0x63/0xd0()
Feb 10 11:10:46 kw1bp-vscp0038n-jp1 kernel: [14735550.241355] list_del corruption, ffffea0012217d60->next is LIST_POISON1 (dead000000100100)
Feb 10 11:10:46 kw1bp-vscp0038n-jp1 kernel: [14735550.241355] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver vhost_net macvtap macvla
sv3 nfs_acl nfs lockd sunrpc fscache binfmt_misc mptctl mptbase sg xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat n
f_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ipt_REJECT iptable_filter ip_tables tun bridge vrouter(OF) 8021q garp stp mrp llc bonding vfa
owerclamp coretemp iTCO_wdt iTCO_vendor_support sr_mod cdrom ipmi_devintf intel_rapl kvm_intel kvm crct10dif_pclmul crc32_pclmul cr
sni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core lpc_ich i2c_i801 mfd_core hpilo hpwdt ioatdma
cpufreq acpi_power_meter dm_mirror dm_region_hash dm_log dm_mod xfs libcrc32c usb_storage sd_mod crc_t10dif crct10dif
lt i2c_algo_bit drm_kms_helper ttm ixgbe drm tg3 mdio i2c_core dca ptp hpsa pps_core
Feb 10 11:10:46 kw1bp-vscp0038n-jp1 kernel: [14735550.241386] CPU: 6 PID: 31894 Comm: skipcpio Tainted: GF B W O-------------- 3.10.0-229.el7.x86_64 #1
Feb 10 11:10:46 kw1bp-vscp0038n-jp1 kernel: [14735550.241387] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
Feb 10 11:10:46 kw1bp-vscp0038n-jp1 kernel: [14735550.241387] ffff881186493a80 0000000060fee02e ffff881186493a38 ffffffff81603f36

Logs are in root@10219.48.123, pwd:Jtaclab123
Logs upload path: /home/kannan/computeshutdownkernel

Latest update from customer:

 current status:

The issue has been occurred more than once even after stopped executing vRouter commands.
However, the core or crash file has been not generated.
The customer is considering investigation from the OS side.
About investigating vRouter side, customer understand that analysis is difficult if there is no crash file, but they want us to continue the investigation on the vrouter side in parallel?

These files are "/var/log/contrail" and "vrouter --info" of each nodes which the issue occurred recently.
- 20170412_kw1bp-vscp0039n_var_log.tar.gz
- 20170412_kw1bp-vscp0055n_var_log.tar.gz
- 20170412_kw1bp-vscp0058n_var_log.tar.gz
- 20170412_vrouter_info.zip

About each nodes timestamp of the issue are following.

kw1bp-vscp0039n
kw1bp-vscp0055n
kw1bp-vscp0058n
kw1ap-vscp0010n(collecting kernel dump)

-------------------------------------------------------------------------------------------------------------------------------- -----------
<kw1bp-vscp0039n>(refer to "messages.8")
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.570279] BUG: Bad page state in process vhost-5576 pfn:1fc5bbe
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.600095] page:ffffea007f16ef80 count:-1 mapcount:0 mapping: (null) index:0x0
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.638139] page flags: 0x2fffff00000000()
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.658229] page dumped because: nonzero _count
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680159] Modules linked in: fuse btrfs zlib_deflate raid6_pq xor msdos ext4 mbcache jbd2 rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver vhost_net macvtap macvlan softdog nfsv3 nfs_acl nfs lockd sunrpc fscache binfmt_misc mptctl mptbase sg xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ipt_REJECT iptable_filter ip_tables tun bridge vrouter(OF) 8021q garp stp mrp llc bonding vfat fat nf_conntrack ipmi_devintf dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt iTCO_vendor_support sr_mod cdrom intel_powerclamp coretemp intel_rapl kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core lpc_ich i2c_i801 mfd_core hpilo hpwdt ioatdma shpchp wmi ipmi_si ipmi_msghandler pcc_cpufreq acpi_power_meter xfs libcrc32c usb_storage sd_mod crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm drm ixgbe tg3 mdio i2c_core dca ptp hpsa pps_core
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680243] CPU: 0 PID: 5587 Comm: vhost-5576 Tainted: GF B O-------------- 3.10.0-229.el7.x86_64 #1
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680245] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680247] ffffea007f16ef80 00000000b93cafda ffff8802e2d63988 ffffffff81603f36
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680253] ffff8802e2d639b0 ffffffff815ff178 ffff881fffa169e0 0000000000003639
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680258] 0000000000000001 ffff8802e2d63ab8 ffffffff8115fe68 ffff8802e2d639e8
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680264] Call Trace:
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680274] [<ffffffff81603f36>] dump_stack+0x19/0x1b
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680278] [<ffffffff815ff178>] bad_page.part.59+0xdf/0xfc
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680284] [<ffffffff8115fe68>] get_page_from_freelist+0x848/0x9b0
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680290] [<ffffffff810a0728>] ? __wake_up_common+0x58/0x90
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680294] [<ffffffff81160169>] __alloc_pages_nodemask+0x199/0xb90
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680320] [<ffffffffa0471e27>] ? kvm_apic_set_irq+0x27/0x30 [kvm]
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680334] [<ffffffffa0472147>] ? kvm_irq_delivery_to_apic_fast+0x177/0x280 [kvm]
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680340] [<ffffffff814ebba1>] ? __kmalloc_reserve.isra.27+0x31/0x90
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680343] [<ffffffff814ec88d>] ? __alloc_skb+0x5d/0x2d0
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680348] [<ffffffff8119efd9>] alloc_pages_current+0xa9/0x170
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680351] [<ffffffff814e6a67>] sock_alloc_send_pskb+0x257/0x380
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680366] [<ffffffff810a0728>] ? __wake_up_common+0x58/0x90
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680372] [<ffffffffa02bc700>] tun_get_user+0x1b0/0x8b0 [tun]
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680377] [<ffffffffa02bce5a>] tun_sendmsg+0x5a/0x80 [tun]
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680382] [<ffffffffa0558a88>] handle_tx+0x1b8/0x540 [vhost_net]
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680387] [<ffffffffa0558e45>] handle_tx_kick+0x15/0x20 [vhost_net]
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680391] [<ffffffffa055579b>] vhost_worker+0xfb/0x1e0 [vhost_net]
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680395] [<ffffffffa05556a0>] ? memory_access_ok.isra.9+0xd0/0xd0 [vhost_net]
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680400] [<ffffffff8109726f>] kthread+0xcf/0xe0
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680404] [<ffffffff810971a0>] ? kthread_create_on_node+0x140/0x140
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680409] [<ffffffff81613cfc>] ret_from_fork+0x7c/0xb0
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.680412] [<ffffffff810971a0>] ? kthread_create_on_node+0x140/0x140
Mar 24 03:51:32 kw1bp-vscp0039n-jp1 sh[1120]: abrt-dump-oops: Found oopses: 1
<snip>
Mar 24 03:54:32 kw1bp-vscp0039n-jp1 logger: os-prober: debug: /dev/sda4: is active swap
===> shutdown after this log

-------------------------------------------------------------------------------------------------------------------------------- -----------
<kw1bp-vscp0055n>(refer to "syslog.14")
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.858764] BUG: Bad page state in process qemu-system-x86 pfn:188b45f
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.894628] page:ffffea00622d17c0 count:-1 mapcount:0 mapping: (null) index:0x0
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.937382] page flags: 0x2ffff0000000000()
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.961859] Modules linked in: vhost_net vhost macvtap macvlan rpcsec_gss_krb5 nfsv4 softdog nfsv3 mptctl mptbase ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vrouter(OX) openvswitch gre vxlan ip_tunnel libcrc32c gpio_ich dm_multipath x86_pkg_temp_thermal intel_powerclamp scsi_dh coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd nfsd ipmi_devintf auth_rpcgss hpwdt hpilo lpc_ich ioatdma nf_conntrack nfs_acl wmi acpi_power_meter ipmi_si mac_hid nfs 8021q lockd garp stp sunrpc fscache mrp llc bonding lp parport nls_iso8859_1 nbd ixgbe tg3 dca ptp pps_core hpsa mdio
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.961912] CPU: 0 PID: 12824 Comm: qemu-system-x86 Tainted: G OX 3.13.0-40-generic #69-Ubuntu
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.961914] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 09/13/2016
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.961915] 0000000000003639 ffff881fffa03ad0 ffffffff8171f226 ffffea00622d17c0
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.961919] ffff881fffa03ae8 ffffffff81719d40 0000000000000000 ffff881fffa03bc8
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.961922] ffffffff811584ca ffff88207fffce08 0000000200000000 ffff881fffa03b40
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.961925] Call Trace:
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.961926] <IRQ> [<ffffffff8171f226>] dump_stack+0x45/0x56
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.961936] [<ffffffff81719d40>] bad_page.part.61+0xcf/0xe8
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.961940] [<ffffffff811584ca>] get_page_from_freelist+0x85a/0x930
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.961945] [<ffffffff81524978>] ? tun_net_xmit+0x1e8/0x320
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.961947] [<ffffffff81158724>] __alloc_pages_nodemask+0x184/0xb80
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.961951] [<ffffffff8161cf80>] ? __skb_get_hash+0x20/0x140
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.961984] [<ffffffffa02f921f>] ? linux_rx_handler+0x2cf/0x7b0 [vrouter]
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.961988] [<ffffffff811974f3>] alloc_pages_current+0xa3/0x160
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.961993] [<ffffffff816122fc>] __netdev_alloc_frag+0x8c/0x140
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.961996] [<ffffffff81614397>] __netdev_alloc_skb+0x77/0xc0
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.962006] [<ffffffffa01329e4>] ixgbe_clean_rx_irq+0xf4/0xb10 [ixgbe]
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.962011] [<ffffffffa0134140>] ixgbe_poll+0x460/0x800 [ixgbe]
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.962016] [<ffffffff81623c02>] net_rx_action+0x152/0x250
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.962020] [<ffffffff8106cc1c>] __do_softirq+0xec/0x2c0
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.962022] [<ffffffff8106d165>] irq_exit+0x105/0x110
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.962026] [<ffffffff81732066>] do_IRQ+0x56/0xc0
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.962031] [<ffffffffa1ee1360>] ? vmx_invpcid_supported+0x20/0x20 [kvm_intel]
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.962034] [<ffffffff8172772d>] common_interrupt+0x6d/0x6d
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.962035] <EOI> [<ffffffff81730850>] ? irq_entries_start+0x310/0x400
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.962041] [<ffffffffa1ee162c>] ? vmx_handle_external_intr+0x6c/0x70 [kvm_intel]
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.962055] [<ffffffffa0462d66>] vcpu_enter_guest+0x766/0xce0 [kvm]
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.962060] [<ffffffffa1ee6088>] ? vmx_deliver_posted_interrupt+0x38/0x90 [kvm_intel]
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.962072] [<ffffffffa04818c5>] ? kvm_apic_local_deliver+0x65/0x70 [kvm]
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.962080] [<ffffffffa04672a8>] kvm_arch_vcpu_ioctl_run+0x1e8/0x460 [kvm]
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.962087] [<ffffffffa0451042>] kvm_vcpu_ioctl+0x2a2/0x5e0 [kvm]
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.962090] [<ffffffff8172b324>] ? __do_page_fault+0x204/0x560
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.962094] [<ffffffff8118224b>] ? change_protection+0x9b/0xb0
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.962098] [<ffffffff811d0bf0>] do_vfs_ioctl+0x2e0/0x4c0
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.962106] [<ffffffffa045c504>] ? kvm_on_user_return+0x74/0x80 [kvm]
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.962109] [<ffffffff811d0e51>] SyS_ioctl+0x81/0xa0
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.962111] [<ffffffff8172fd2d>] system_call_fastpath+0x1a/0x1f
Mar 18 17:30:54 kw1ap-vscp0055n-jp1 kernel: [970355.962112] Disabling lock debugging due to kernel taint
===> shutdown after this log

-------------------------------------------------------------------------------------------------------------------------------- -----------
<kw1bp-vscp0058n>(refer to "syslog.22")
Mar 15 05:19:08 kw1ap-vscp0058n-jp1 kernel: [17631790.061221] BUG: Bad page state in process cat pfn:4785ba
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.087547] page:ffffea0011e16e80 count:-1 mapcount:0 mapping: (null) index:0x0
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.125715] page flags: 0x2ffff0000000000()
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.146040] Modules linked in: rpcsec_gss_krb5 nfsv4 vhost_net vhost macvtap macvlan softdog nfsv3 ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp bridge mptctl mptbase ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vrouter(OX) openvswitch gre vxlan ip_tunnel libcrc32c x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 gpio_ich dm_multipath scsi_dh lrw ioatdma nfsd gf128mul glue_helper ablk_helper hpwdt ipmi_si wmi hpilo lpc_ich mac_hid acpi_power_meter cryptd auth_rpcgss nfs_acl nfs lockd ipmi_devintf sunrpc nf_conntrack 8021q fscache garp stp mrp llc bonding lp parport nls_iso8859_1 ixgbe tg3 dca nbd ptp pps_core hpsa mdio
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.146090] CPU: 28 PID: 30443 Comm: cat Tainted: G B OX 3.13.0-40-generic #69-Ubuntu
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.146091] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.146093] ffff881fffc17830 ffff8816cbe1db80 ffffffff8171f226 ffffea0011e16e80
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.146096] ffff8816cbe1db98 ffffffff81719d40 0000000000000001 ffff8816cbe1dc78
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.146111] ffffffff811584ca ffff88207fffce08 0000000200000000 ffff880a9a30b000
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.146114] Call Trace:
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.146120] [<ffffffff8171f226>] dump_stack+0x45/0x56
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.146124] [<ffffffff81719d40>] bad_page.part.61+0xcf/0xe8
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.146140] [<ffffffff811584ca>] get_page_from_freelist+0x85a/0x930
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.146148] [<ffffffff81158724>] __alloc_pages_nodemask+0x184/0xb80
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.146165] [<ffffffff810a3cfe>] ? dequeue_task_fair+0x44e/0x660
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.146167] [<ffffffff8109d4e5>] ? sched_clock_cpu+0xb5/0x100
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.146172] [<ffffffff81012609>] ? __switch_to+0x169/0x4c0
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.146181] [<ffffffff811974f3>] alloc_pages_current+0xa3/0x160
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.146190] [<ffffffff811c62b8>] pipe_write+0x258/0x520
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.146197] [<ffffffff811bd21a>] do_sync_write+0x5a/0x90
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.146199] [<ffffffff811bd9a4>] vfs_write+0xb4/0x1f0
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.146201] [<ffffffff811bd87e>] ? vfs_read+0xee/0x160
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.146206] [<ffffffff811be3d9>] SyS_write+0x49/0xa0
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.146209] [<ffffffff8172fd2d>] system_call_fastpath+0x1a/0x1f
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.146554] BUG: Bad page state in process cat pfn:fffffe787ffe8de0
Mar 15 05:19:09 kw1ap-vscp0058n-jp1 kernel: [17631790.177003] page:ffff881fffa37810 count:0 mapcount:32 mapping: (null) index:0xba0000003b
Mar 15 05:19:10 kw1ap-vscp0058n-jp1 snmpd[11189]: Cannot statfs /var/lib/nova/instances#012: Permission denied
Mar 15 05:19:10 kw1ap-vscp0058n-jp1 snmpd[11189]: Cannot statfs /var/lib/nova/mnt/b096e83694c3aae72d8881d3254ebfcc#012: Permission denied
===> shutdown after this log

-------------------------------------------------------------------------------------------------------------------------------- -----------
<kw1ap-vscp0010n>(collecting kernel dump)
- Host
kw1ap-vscp0010n

- OS
Ubuntu 14.04 LTS/Linux 3.13.0-40-generic x86_64

- timestamp
Apr 12 05:03:28 Call Trace occur
Apr 12 05:14:52 start of collecting kdump (* it was started about 10 minutes after call trace was occurred.)
Apr 12 05:38:12 end of collecting kdump

- kernel dump
dump.201704120524

- Call Trace
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.382254] BUG: Bad page state in process swapper/3 pfn:1a84c1
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.429096] page:ffffea0006a13040 count:-1 mapcount:0 mapping: (null) index:0x0
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.481189] page flags: 0x2ffff0000000000()
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514108] Modules linked in: vrouter(OX) nfsv3 vhost_net vhost macvtap macvlan softdog rpcsec_gss_krb5 nfsv4 mptctl mptbase ip6table_filter ip6_ tables iptable_filter ip_tables ebtable_nat ebtables x_tables cpuid x86_pkg_temp_thermal intel_powerclamp gpio_ich coretemp kvm_intel kvm dm_multipath crct10dif_pclmul scsi_dh crc32_pclmul ghash_c lmulni_intel aesni_intel aes_x86_64 nfsd lrw gf128mul auth_rpcgss nfs_acl glue_helper ablk_helper cryptd ipmi_si usb_storage nfs lpc_ich hpwdt acpi_power_meter hpilo ioatdma wmi mac_hid ipmi_devin tf lockd sunrpc fscache nf_conntrack 8021q garp stp mrp llc bonding lp parport nls_iso8859_1 nbd ixgbe tg3 dca ptp pps_core mdio hpsa [last unloaded: vrouter]
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514150] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G OX 3.13.0-40-generic #69-Ubuntu
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514152] Hardware name: HP ProLiant DL360 Gen9, BIOS P89 07/20/2015
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514153] 0000000000003639 ffff881fffa63ad0 ffffffff8171f226 ffffea0006a13040
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514157] ffff881fffa63ae8 ffffffff81719d40 0000000000000000 ffff881fffa63bc8
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514160] ffffffff811584ca ffff88207fffce08 0000000200000000 ffff883fd09c9868
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514163] Call Trace:
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514164] <IRQ> [<ffffffff8171f226>] dump_stack+0x45/0x56
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514174] [<ffffffff81719d40>] bad_page.part.61+0xcf/0xe8
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514178] [<ffffffff811584ca>] get_page_from_freelist+0x85a/0x930
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514180] [<ffffffff81158724>] __alloc_pages_nodemask+0x184/0xb80
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514185] [<ffffffff8161cf80>] ? __skb_get_hash+0x20/0x140
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514194] [<ffffffffa03f721f>] ? linux_rx_handler+0x2cf/0x7b0 [vrouter]
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514197] [<ffffffff811974f3>] alloc_pages_current+0xa3/0x160
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514202] [<ffffffff816122fc>] __netdev_alloc_frag+0x8c/0x140
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514205] [<ffffffff81614397>] __netdev_alloc_skb+0x77/0xc0
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514214] [<ffffffffa01209e4>] ixgbe_clean_rx_irq+0xf4/0xb10 [ixgbe]
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514219] [<ffffffffa0122140>] ixgbe_poll+0x460/0x800 [ixgbe]
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514223] [<ffffffff81623c02>] net_rx_action+0x152/0x250
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514227] [<ffffffff8106cc1c>] __do_softirq+0xec/0x2c0
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514229] [<ffffffff8106d165>] irq_exit+0x105/0x110
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514233] [<ffffffff81732066>] do_IRQ+0x56/0xc0
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514236] [<ffffffff8172772d>] common_interrupt+0x6d/0x6d
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514237] <EOI> [<ffffffff815d1d82>] ? cpuidle_enter_state+0x52/0xc0
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514243] [<ffffffff815d1ea9>] cpuidle_idle_call+0xb9/0x1f0
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514248] [<ffffffff8101d2be>] arch_cpu_idle+0xe/0x30
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514251] [<ffffffff810bee55>] cpu_startup_entry+0xc5/0x290
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514255] [<ffffffff8104134d>] start_secondary+0x21d/0x2d0
Apr 12 05:03:28 kw1ap-vscp0010n-jp1 kernel: [35286576.514256] Disabling lock debugging due to kernel taint

-------------------------------------------------------------------------------------------------------------------------------- -----------

In addition, the customers are upgrading their vrouters sequentially to build 57, but the following versions are mixed at the present time.
2.21.2-36
2.21.3-55
2.21.3-57

However, it seems that the log of "Bad page state" is occurred only in 2.21.2-36.
<log of Bad page state>
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.570279] BUG: Bad page state in process vhost-5576 pfn:1fc5bbe
Mar 24 03:51:31 kw1bp-vscp0039n-jp1 kernel: [18388342.600095] page:ffffea007f16ef80 count:-1 mapcount:0 mapping: (null) index:0x0

About the log of "Bad page state". The customer is guessing following.
Although the unused memory area was normally "count:0", it was "count:-1" due to "double free". as result log was output.
In Contrail vrouter 2.21, several bugs related to "double free" were hit with searching.
Are there relevance to these?
===
<search result>
https://github.com/Juniper/contrail-vrouter/commit/b1da8d0b14d6aa62c510899100ccc1756d3b7c50
https://bugs.launchpad.net/juniperopenstack/+bugs?field.searchtext=double+free& search=Search&field.status%3Alist=NEW&field.status%3Alist=INCOMPLETE_WITH_RESPONSE& field.status%3Alist=INCOMPLETE_WITHOUT_RESPONSE&field.status%3Alist=CONFIRMED&field.status%3Alist=TRIAGED& field.status%3Alist=INPROGRESS&field.status%3Alist=FIXCOMMITTED& field.assignee=&field.bug_reporter=&field.omit_dupes=on&field.has_patch=&field.has_no_package=
===

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/31048
Submitter: Anand H. Krishnan (<email address hidden>)

tags: added: vrouter
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/31048
Committed: http://github.com/Juniper/contrail-vrouter/commit/3e274254f25563c3fe8ed564c9d3ef77880b7dcd
Submitter: Zuul (<email address hidden>)
Branch: master

commit 3e274254f25563c3fe8ed564c9d3ef77880b7dcd
Author: Anand H. Krishnan <email address hidden>
Date: Fri May 5 15:00:18 2017 +0530

A cloned skbuff should not be sent to napi_gro_receive

When skb_gro_receive merges skbs/fragments, it does not take
a reference to the underlying page, thus transferring the
original reference it had to the first packet of the packet
bundle. This will work fine for non-cloned packets.

However, for cloned packets, when the gro-ed packet is
eventually freed (because the original skb was not cloned and
hence reference was 1), the merged skb's frags also get freed
(put_page-ed) without taking into account the other references
that were held for the fragments (dataref). This will result in
bad page problems.

Hence, avoid GRO for cloned packets.

Change-Id: Ief768873bf4ea57bd88e3f809392bec92fc41932
Closes-Bug: #1683305

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.21.x

Review in progress for https://review.opencontrail.org/31125
Submitter: Anand H. Krishnan (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.2

Review in progress for https://review.opencontrail.org/31156
Submitter: Anand H. Krishnan (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/31125
Committed: http://github.com/Juniper/contrail-vrouter/commit/fcf28b4ec975c4de18afc0b859c4ace5684d0eef
Submitter: Zuul (<email address hidden>)
Branch: R2.21.x

commit fcf28b4ec975c4de18afc0b859c4ace5684d0eef
Author: Anand H. Krishnan <email address hidden>
Date: Mon May 8 15:08:00 2017 +0530

A cloned skbuff should not be sent to napi_gro_receive

When skb_gro_receive merges skbs/fragments, it does not take
a reference to the underlying page, thus transferring the
original reference it had to the first packet of the packet
bundle. This will work fine for non-cloned packets.

However, for cloned packets, when the gro-ed packet is
eventually freed (because the original skb was not cloned and
hence reference was 1), the merged skb's frags also get freed
(put_page-ed) without taking into account the other references
that were held for the fragments (dataref). This will result in
bad page problems.

Hence, avoid GRO for cloned packets.

Change-Id: I867dc6eac82e1ebc7a0f496b928d66ff62409542
Closes-Bug: #1683305

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/31156
Committed: http://github.com/Juniper/contrail-vrouter/commit/d16e1662d50bddd9ea545c25b4266fd916a96ec0
Submitter: Zuul (<email address hidden>)
Branch: R3.2

commit d16e1662d50bddd9ea545c25b4266fd916a96ec0
Author: Anand H. Krishnan <email address hidden>
Date: Fri May 5 15:00:18 2017 +0530

A cloned skbuff should not be sent to napi_gro_receive

When skb_gro_receive merges skbs/fragments, it does not take
a reference to the underlying page, thus transferring the
original reference it had to the first packet of the packet
bundle. This will work fine for non-cloned packets.

However, for cloned packets, when the gro-ed packet is
eventually freed (because the original skb was not cloned and
hence reference was 1), the merged skb's frags also get freed
(put_page-ed) without taking into account the other references
that were held for the fragments (dataref). This will result in
bad page problems.

Hence, avoid GRO for cloned packets.

Change-Id: Ief768873bf4ea57bd88e3f809392bec92fc41932
Closes-Bug: #1683305

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.0

Review in progress for https://review.opencontrail.org/31284
Submitter: Anand H. Krishnan (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.1

Review in progress for https://review.opencontrail.org/31286
Submitter: Anand H. Krishnan (<email address hidden>)

Jeba Paulaiyan (jebap)
tags: added: blocker
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/31286
Committed: http://github.com/Juniper/contrail-vrouter/commit/92a8f11a7df7e65f8deaaafc51fb1b33f51ac0a2
Submitter: Zuul (<email address hidden>)
Branch: R3.1

commit 92a8f11a7df7e65f8deaaafc51fb1b33f51ac0a2
Author: Anand H. Krishnan <email address hidden>
Date: Mon May 8 15:08:00 2017 +0530

A cloned skbuff should not be sent to napi_gro_receive

When skb_gro_receive merges skbs/fragments, it does not take
a reference to the underlying page, thus transferring the
original reference it had to the first packet of the packet
bundle. This will work fine for non-cloned packets.

However, for cloned packets, when the gro-ed packet is
eventually freed (because the original skb was not cloned and
hence reference was 1), the merged skb's frags also get freed
(put_page-ed) without taking into account the other references
that were held for the fragments (dataref). This will result in
bad page problems.

Hence, avoid GRO for cloned packets.

Change-Id: I867dc6eac82e1ebc7a0f496b928d66ff62409542
Closes-Bug: #1683305

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/31284
Committed: http://github.com/Juniper/contrail-vrouter/commit/ccb1df4579af30460f1ebd6a5fbe84c2526181b1
Submitter: Zuul (<email address hidden>)
Branch: R3.0

commit ccb1df4579af30460f1ebd6a5fbe84c2526181b1
Author: Anand H. Krishnan <email address hidden>
Date: Mon May 8 15:08:00 2017 +0530

A cloned skbuff should not be sent to napi_gro_receive

When skb_gro_receive merges skbs/fragments, it does not take
a reference to the underlying page, thus transferring the
original reference it had to the first packet of the packet
bundle. This will work fine for non-cloned packets.

However, for cloned packets, when the gro-ed packet is
eventually freed (because the original skb was not cloned and
hence reference was 1), the merged skb's frags also get freed
(put_page-ed) without taking into account the other references
that were held for the fragments (dataref). This will result in
bad page problems.

Hence, avoid GRO for cloned packets.

Change-Id: I867dc6eac82e1ebc7a0f496b928d66ff62409542
Closes-Bug: #1683305

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.2.3.x

Review in progress for https://review.opencontrail.org/32594
Submitter: Vinay Vithal Mahuli (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/32594
Committed: http://github.com/Juniper/contrail-vrouter/commit/497f2bf997bd22faff8e959b7758720b11052b09
Submitter: Zuul (<email address hidden>)
Branch: R3.2.3.x

commit 497f2bf997bd22faff8e959b7758720b11052b09
Author: Anand H. Krishnan <email address hidden>
Date: Fri May 5 15:00:18 2017 +0530

A cloned skbuff should not be sent to napi_gro_receive

When skb_gro_receive merges skbs/fragments, it does not take
a reference to the underlying page, thus transferring the
original reference it had to the first packet of the packet
bundle. This will work fine for non-cloned packets.

However, for cloned packets, when the gro-ed packet is
eventually freed (because the original skb was not cloned and
hence reference was 1), the merged skb's frags also get freed
(put_page-ed) without taking into account the other references
that were held for the fragments (dataref). This will result in
bad page problems.

Hence, avoid GRO for cloned packets.

Change-Id: Ief768873bf4ea57bd88e3f809392bec92fc41932
Closes-Bug: #1683305

information type: Proprietary → Public
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.