system crash due to kernel 2.6.32-504.1.3.el6.mos61.x86_64 vlan handling regression

Bug #1486537 reported by Chaz Chandler
This bug affects 1 person
Affects             Status     Importance  Assigned to  Milestone
Fuel for OpenStack  Invalid    High        MOS Linux
  6.1.x             Won't Fix  High        MOS Linux
  7.0.x             Invalid    Undecided   MOS Linux
  8.0.x             Invalid    Undecided   MOS Linux

Bug Description

As also reported in https://access.redhat.com/solutions/1168493:
"Experienced a crash of prod server due to a possible regression in VLAN and GRO handling code."

<4>Aug 18 11:45:39 node-81 kernel: ------------[ cut here ]------------
<4>Aug 18 11:45:39 node-81 kernel: invalid opcode: 0000 [#1] SMP
<4>Aug 18 11:45:39 node-81 kernel: last sysfs file: /sys/devices/system/node/node1/meminfo
<4>Aug 18 11:45:39 node-81 kernel: CPU 2
<4>Aug 18 11:45:39 node-81 kernel: tunnel6 af_key xt_conntrack ipt_REDIRECT iptable_mangle ipt_REJECT ipt_MASQUERADE iptable_nat nf_nat veth nf_conntrack_ipv4 nf_defrag_ipv4 xt_NOTRACK iptable_raw xt_comment xt_multiport bridge bonding 8021q garp stp llc openvswitch(U) gre libcrc32c nls_utf8 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables rdma_ucm(U) ib_ucm(U) rdma_cm(U) iw_cm(U) eth_ipoib(U) ib_ipoib(U) ib_cm(U) ib_uverbs(U) ib_umad(U) mlx5_ib(U) mlx5_core(U) mlx4_en(U) mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) ipv6 mlx4_core(U) compat(U) ext2 knem(U) serio_raw tg3 ptp pps_core k10temp amd64_edac_mod edac_core edac_mce_amd shpchp forcedeth sg i2c_nforce2 i2c_core ext4 jbd2 mbcache sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas pata_amd ata_generic pata_acpi sata_nv dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>Aug 18 11:45:39 node-81 kernel:
<4>Aug 18 11:45:39 node-81 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-504.1.3.el6.mos61.x86_64 #1 Sun Microsystems Sun Fire X2200 M2 /S39
<4>Aug 18 11:45:39 node-81 kernel: RIP: 0010:[<ffffffff814533bf>] [<ffffffff814533bf>] skb_segment+0x77f/0x7b0
<4>Aug 18 11:45:39 node-81 kernel: RSP: 0018:ffff880028303a30 EFLAGS: 00010206
<4>Aug 18 11:45:39 node-81 kernel: RAX: 0000000000000000 RBX: ffff880384421d80 RCX: 0000000000000000
<4>Aug 18 11:45:39 node-81 kernel: RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffff880384421d80
<4>Aug 18 11:45:39 node-81 kernel: RBP: ffff880028303b00 R08: 00000000000005ee R09: ffff8802101807c0
<4>Aug 18 11:45:39 node-81 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff88028c2699c0
<4>Aug 18 11:45:39 node-81 kernel: R13: 00000000000005ee R14: ffff8802101807c0 R15: 00000000000005a8
<4>Aug 18 11:45:39 node-81 kernel: FS: 00007f2bb3e96700(0000) GS:ffff880028300000(0000) knlGS:0000000000000000
<4>Aug 18 11:45:39 node-81 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>Aug 18 11:45:39 node-81 kernel: CR2: 0000000000cdc038 CR3: 00000004882e8000 CR4: 00000000000007e0
<4>Aug 18 11:45:39 node-81 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>Aug 18 11:45:39 node-81 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Aug 18 11:45:39 node-81 kernel: Process swapper (pid: 0, threadinfo ffff8806164bc000, task ffff880616492040)
<4>Aug 18 11:45:39 node-81 kernel: Stack:
<4>Aug 18 11:45:39 node-81 kernel: ffff8802101807c0 ffffffff000005ee ffff880247a103e0 1800ffffa03da6dc
<4>Aug 18 11:45:39 node-81 kernel: <d> 000000000000003e 0000003e002699c0 00ffe8f900000000 0000000000000000
<4>Aug 18 11:45:39 node-81 kernel: <d> 0000000000000046 ffffffffffffffba 0000000000000000 ffff880384421d80
<4>Aug 18 11:45:39 node-81 kernel: Call Trace:
<4>Aug 18 11:45:39 node-81 kernel: <IRQ>
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff814a6814>] tcp_tso_segment+0xf4/0x300
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff814caadf>] inet_gso_segment+0x10f/0x2b0
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff814613d8>] skb_mac_gso_segment+0xa8/0x280
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff81461607>] __skb_gso_segment+0x57/0xc0
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff81461683>] skb_gso_segment+0x13/0x20
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff8146172b>] dev_hard_start_xmit+0x9b/0x480
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff81458bf4>] ? skb_tx_hash+0x14/0x20
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffffa039075e>] ? parent_select_q+0xe/0x10 [eth_ipoib]
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff81461d4d>] dev_queue_xmit+0x1bd/0x320
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffffa04177a8>] br_dev_queue_push_xmit+0x88/0xc0 [bridge]
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffffa0417838>] br_forward_finish+0x58/0x60 [bridge]
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffffa04178ea>] __br_forward+0xaa/0xd0 [bridge]
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff8148bfa6>] ? nf_hook_slow+0x76/0x120
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffffa041796d>] br_forward+0x5d/0x70 [bridge]
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffffa04187a8>] br_handle_frame_finish+0x168/0x460 [bridge]
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffffa0418c4a>] br_handle_frame+0x1aa/0x250 [bridge]
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff8145c929>] __netif_receive_skb+0x529/0x750
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff8145cbea>] process_backlog+0x9a/0x100
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff814620a3>] net_rx_action+0x103/0x2f0
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff81015463>] ? native_sched_clock+0x13/0x80
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff81015463>] ? native_sched_clock+0x13/0x80
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff8107d8b1>] __do_softirq+0xc1/0x1e0
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff8100c30c>] call_softirq+0x1c/0x30
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff8100fc15>] do_softirq+0x65/0xa0
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff8107d765>] irq_exit+0x85/0x90
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff810331b5>] smp_call_function_single_interrupt+0x35/0x40
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff8100bdb3>] call_function_single_interrupt+0x13/0x20
<4>Aug 18 11:45:39 node-81 kernel: <EOI>
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff81040f8b>] ? native_safe_halt+0xb/0x10
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff8152f446>] ? notifier_call_chain+0x16/0x80
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff8101687d>] default_idle+0x4d/0xb0
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff81016943>] c1e_idle+0x63/0x120
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff81009fc6>] cpu_idle+0xb6/0x110
<4>Aug 18 11:45:39 node-81 kernel: [<ffffffff8152223c>] start_secondary+0x2be/0x301
<4>Aug 18 11:45:39 node-81 kernel: Code: 06 0f 1f 00 48 89 df 48 8b 1f e8 9d cb ff ff 48 85 db 75 f0 48 c7 45 88 f4 ff ff ff e9 e5 fc ff ff f0 41 ff 86 e4 00 00 00 eb 97 <0f> 0b eb fe 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 eb f3 0f 0b
<4>Aug 18 11:45:39 node-81 kernel: RSP <ffff880028303a30>
<4>Aug 18 11:45:39 node-81 kernel: ---[ end trace 12e21933558f88b6 ]---
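For triage, the faulting function and the reliable call-trace frames can be pulled out of a log like the one above. The following is a rough sketch, not a general oops parser: the regular expressions assume the 2.6.32-style line layout seen in this report, and the function name `summarize_oops` is made up for illustration.

```python
import re

# "RIP: 0010:[<addr>] [<addr>] symbol+0xoff/0xlen" -> faulting symbol
RIP_RE = re.compile(r"RIP: .*\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")
# "[<addr>] symbol+0xoff/0xlen [module]"; a leading "? " marks an unreliable frame
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\? )?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def summarize_oops(lines):
    """Return (faulting_symbol, [(symbol, module_or_None), ...])."""
    rip = None
    frames = []
    in_trace = False
    for line in lines:
        if rip is None:
            match = RIP_RE.search(line)
            if match:
                rip = match.group(1)
                continue
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            match = FRAME_RE.search(line)
            # Skip frames flagged with "?", which the kernel could not verify.
            if match and not match.group(1):
                frames.append((match.group(2), match.group(3)))
    return rip, frames
```

Fed the log above, this should report `skb_segment` as the faulting symbol, reached from the bridge forwarding path through `dev_queue_xmit` and the GSO segmentation functions.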

This causes a hard system crash. The fix recommended by Red Hat is:

    RHEL 6.6: Upgrade to kernel-2.6.32-504.23.4.el6 from Errata RHSA-2015:1081 or later
    RHEL 6.5 EUS: Upgrade to kernel-2.6.32-431.61.2.el6 from Errata RHSA-2015:1583 or later
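A quick way to compare a node's kernel against the two fixed builds listed above is sketched below. It is only a numeric heuristic: it assumes the release string follows the `2.6.32-<build>.el6...` pattern, and a vendor rebuild such as `mos61` may carry backports that the number does not reflect. The names `FIXED_BUILDS`, `release_fields` and `has_fixed_build` are illustrative.

```python
import re

# Fixed builds quoted in this report, keyed by RHEL 6 kernel stream.
FIXED_BUILDS = {
    504: (504, 23, 4),  # RHEL 6.6, RHSA-2015:1081
    431: (431, 61, 2),  # RHEL 6.5 EUS, RHSA-2015:1583
}


def release_fields(release):
    """Extract the numeric build fields, e.g. '2.6.32-504.1.3.el6' -> (504, 1, 3)."""
    match = re.match(r"2\.6\.32-(\d+(?:\.\d+)*)", release)
    if match is None:
        raise ValueError("not a 2.6.32 el6-style release: %r" % release)
    return tuple(int(part) for part in match.group(1).split("."))


def has_fixed_build(release):
    """True/False for the two known streams, None when the stream is not listed."""
    fields = release_fields(release)
    fixed = FIXED_BUILDS.get(fields[0])
    if fixed is None:
        return None  # stream not covered by the advisories quoted above
    return fields >= fixed
```

For the kernel in this report, `has_fixed_build("2.6.32-504.1.3.el6.mos61.x86_64")` returns `False`. On a live node the argument could come from `platform.release()`.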

Changed in fuel:
assignee: nobody → MOS Linux (mos-linux)
Nastya Urlapova (aurlapova) wrote :

Chaz, could you attach the Fuel version and a diagnostic snapshot?

Changed in fuel:
status: New → Incomplete
Aleksander Mogylchenko (amogylchenko) wrote :

We have already switched to a newer kernel in 7.0, so 7.0 is not affected.

Changed in fuel:
milestone: none → 6.1-updates
status: Incomplete → Confirmed
importance: Undecided → High
Pavel Boldin (pboldin) wrote :

We really need to get the do_tcp_sendpages patch into RHEL upstream and update the kernel as well.

Chaz Chandler (charles-l-chandler) wrote :

Sorry, premature submit. Let's try that again:

Fuel version (from "yum info fuel"):
Version : 6.1.0
Release : 6023.1

More to follow as I gather info.

Chaz Chandler (charles-l-chandler) wrote :

How I discovered the issue:
- Create a new cluster with 2014.2.2-6.1 on CentOS w/ HA, Ceph, KVM, Neutron w/ VLANs
- Install Mellanox Fuel plugin 1.0.0 and select only "Mellanox Openstack features" (since my hardware doesn't support SR-IOV and I don't need iSER for Cinder since I'm running Ceph)
- Setup: 1 controller, 4 compute, 2 storage nodes
- Deploy
- Start a VM and install Cloudbreak
- Configure Cloudbreak and OpenStack to talk to each other to deploy Ambari+HDP clusters
- Create a basic Ambari+HDP cluster config and deploy it
- A short time into deploying the cluster, the controller node crashes with the above message and hangs (no reboot, no crash dump). This happens every time.

My guess is that Cloudbreak gives Neutron a pretty good workout and, at some point while configuring networking, runs into this regression edge case.

Chaz Chandler (charles-l-chandler) wrote :

I can't generate a dump with 'fuel snapshot' because I get this error:

File "/usr/lib64/python2.6/httplib.py", line 686, in _set_hostport
    raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
httplib.InvalidURL: nonnumeric port: ''
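The empty quotes in that message indicate that the text after the last colon in the host string was empty, i.e. something like `host:` with no port was passed in. Here is a simplified sketch of that kind of host/port split, written from memory of the Python 2.6 `httplib` logic rather than copied from it. The function name `split_hostport` and the host `fuel-master.example` are made up for illustration, and where the empty port comes from in the Fuel client configuration is not known from this report.

```python
def split_hostport(host, default_port=80):
    """Split 'host:port' roughly the way Python 2.6 httplib does (simplified)."""
    i = host.rfind(":")
    j = host.rfind("]")  # a ':' inside an IPv6 literal is not a port separator
    if i > j:
        try:
            port = int(host[i + 1:])
        except ValueError:
            # An empty string after the colon ends up here as well.
            raise ValueError("nonnumeric port: '%s'" % host[i + 1:])
        host = host[:i]
    else:
        port = default_port
    return host, port


# A well-formed value parses; a trailing colon reproduces the message.
print(split_hostport("fuel-master.example:8000"))
try:
    split_hostport("fuel-master.example:")
except ValueError as err:
    print(err)
```

If this reading is right, checking the server address and port that the `fuel` client is configured with would be the first thing to look at.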

Changed in fuel:
milestone: 6.1-updates → 9.0
status: Confirmed → New
no longer affects: fuel/future
Changed in fuel:
status: New → Invalid
Chaz Chandler (charles-l-chandler) wrote :

By the way, I'm still seeing something highly similar with MOS 7 on Ubuntu (see attached). We know Red Hat addressed the bug and backported the fix to 2.6, but do we know if Ubuntu ever addressed it in Trusty?

Denis Meltsaykin (dmeltsaykin) wrote :

Won't fix for 6.x series as they are unsupported and there was no progress on the issue.
