Ubuntu 17.04: machine crashes with Oops in dccp_v4_ctl_send_reset while running stress-ng.

Bug #1654073 reported by bugproxy on 2017-01-04
Affects          Status  Importance  Assigned to  Milestone
linux (Ubuntu)           High        Tim Gardner
Zesty                    High        Tim Gardner

Bug Description

== Comment: #0 - PAVITHRA R. PRAKASH - 2016-12-28 03:39:50 ==
---Problem Description---

Ubuntu 17.04: machine crashes with Oops while running stress-ng.

---Steps followed-----

1. Install 17.04 on NV machine.
2. apt-get install stress-ng
3. stress-ng -a 0

Logs
====
dccp af_alg joydev input_leds mac_hid at24 nvmem_core ofpart cmdlinepart powernv_flash mtd opal_prd powernv_rng ipmi_powernv ipmi_msghandler ibmpowernv uio_pdrv_genirq uio vmx_crypto ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear ses enclosure scsi_transport_sas hid_generic ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt bnx2x fb_sys_fops drm aacraid tg3 usbhid uas hid usb_storage mdio ahci libcrc32c libahci crc32c_vpmsum
[165938083361,3] OPAL: Trying a CPU re-init with flags: 0x1
[166456504189,3] OPAL: CPU 0x29 not in OPAL !
[167510150475,3] OPAL: Trying a CPU re-init with flags: 0x2
[168022446397,3] OPAL: CPU 0x29 not in OPAL !
[ 237.047216] CPU: 33 PID: 34694 Comm: stress-ng-dccp Not tainted 4.9.0-11-generic #12-Ubuntu
[ 237.047315] task: c000003312a69400 task.stack: c000003779698000
[ 237.047402] NIP: d00000002e7b0a7c LR: d00000002e7b21cc CTR: c000000000a0dd00
[ 237.047509] REGS: c000003fff68f670 TRAP: 0300 Not tainted (4.9.0-11-generic)
[ 237.047613] MSR: 900000010280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>
[ 237.049028] CR: 24002282 XER: 20000000
[ 237.049103] CFAR: c000000000008a60 DAR: 00000000000002f4 DSISR: 40000000 SOFTE: 1
GPR00: d00000002e7b21cc c000003fff68f8f0 d00000002e7bb670 0000000000000001
GPR04: c000002b5dbdc400 c000003c74c0a460 0000000000000474 c000003c74c0a474
GPR08: c000003c74c0a000 0000000000000000 c00000348cd25200 0000000000000000
GPR12: 0000000000002200 c000000007b72900 c000003fff68c000 0000000000000000
GPR16: 0000000000000000 0000000000000040 0000000000000001 0000000000002713
GPR20: 000000000000cc84 000000000100007f 000000000100007f c0000000013b2f00
GPR24: 0000000000000001 0000000000000001 0000000000000000 0000000000000004
GPR28: c0000000013b2f00 c000003c74c0a474 0000000000000000 c000002b5dbdc400
NIP [d00000002e7b0a7c] dccp_v4_ctl_send_reset+0xa4/0x2f0 [dccp_ipv4]
[ 237.051403] LR [d00000002e7b21cc] dccp_v4_rcv+0x5d4/0x850 [dccp_ipv4]
[ 237.051486] Call Trace:
[ 237.051529] [c000003fff68f8f0] [000000002713cc84] 0x2713cc84 (unreliable)
[ 237.051649] [c000003fff68f970] [d00000002e7b21cc] dccp_v4_rcv+0x5d4/0x850 [dccp_ipv4]
[ 237.051779] [c000003fff68fa50] [c000000000a01e40] ip_local_deliver_finish+0x170/0x350
[ 237.051932] [c000003fff68faa0] [c000000000a0276c] ip_local_deliver+0x5c/0x130
[ 237.052038] [c000003fff68fb10] [c000000000a02278] ip_rcv_finish+0x258/0x510
[ 237.052151] [c000003fff68fba0] [c000000000a02b44] ip_rcv+0x304/0x420
[ 237.052263] [c000003fff68fc30] [c0000000009a28bc] __netif_receive_skb_core+0x97c/0xda0
[ 237.052388] [c000003fff68fd10] [c0000000009a7ab4] process_backlog+0xd4/0x1e0
[ 237.052489] [c000003fff68fd80] [c0000000009a6f0c] net_rx_action+0x35c/0x480
[ 237.052603] [c000003fff68fe90] [c000000000b22a6c] __do_softirq+0x18c/0x3fc
[ 237.052726] [c000003fff68ff90] [c000000000029fb0] call_do_softirq+0x14/0x24
[ 237.052848] [c00000377969b920] [c00000000001765c] do_softirq_own_stack+0x5c/0xa0
[ 237.052992] [c00000377969b960] [c0000000000cfd48] do_softirq.part.3+0x68/0x90
[ 237.053112] [c00000377969b990] [c0000000000cfe44] __local_bh_enable_ip+0xd4/0x100
[ 237.053240] [c00000377969b9b0] [c000000000a06724] ip_finish_output2+0x244/0x460
[ 237.053372] [c00000377969ba50] [c000000000a0977c] ip_output+0xcc/0x180
[ 237.053485] [c00000377969bae0] [c000000000a08c78] ip_local_out+0x68/0x90
[ 237.053607] [c00000377969bb20] [d000000021966978] dccp_transmit_skb+0x320/0x550 [dccp]
[ 237.053739] [c00000377969bb90] [d00000002196732c] dccp_connect+0xf4/0x1f0 [dccp]
[ 237.053890] [c00000377969bc10] [d00000002e7b0320] dccp_v4_connect+0x308/0x400 [dccp_ipv4]
[ 237.054213] [c00000377969bc90] [c000000000a51678] __inet_stream_connect+0x158/0x400
[ 237.065276] [c00000377969bd20] [c000000000a51978] inet_stream_connect+0x58/0x90
[ 237.074757] [c00000377969bd60] [c00000000097eeac] SyS_connect+0x10c/0x130
[ 237.092889] [c00000377969be30] [c00000000000bd84] system_call+0x38/0xe0
[ 237.107020] Instruction dump:
[ 237.107085] 7ca82a14 f9210020 f9210028 ebdc0b98 f9210030 f9210038 f9210040 f9210048
[ 237.117663] 419e01c4 e86a00ae 2fa30000 419e01b8 <893e02f4> e95e0060 7c9f2378 897e0149
[ 237.133181] ---[ end trace ef25e246c86e0bcc ]---
[ 237.133270]
[ 237.146244] Sending IPI to other CPUs
[ 237.147330] IPI complete

== Comment: #8 - Kevin W. Rudd - 2017-01-03 15:16:06 ==
The panic happened because the control socks had been cleared:

  dccp = {
    v4_ctl_sk = 0x0,
    v6_ctl_sk = 0x0
  },

dccp_v4_ctl_send_reset() ended up calling dccp_v4_route_skb() with a NULL ctl_sk. Close race of some sort?
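
The register dump in the original report appears consistent with this NULL-pointer reading. As a rough check, the sketch below decodes the instruction word marked <893e02f4> in the Instruction dump as a PowerPC D-form load and adds its displacement to the value of GPR30 from the log. The helper name decode_d_form and the constant names are made up for illustration, and which structure field lives at that offset is not shown in the report, so the sketch does not name one.

```python
# Illustrative only: decode a PowerPC D-form load/store word.
# Layout (most significant bit first): 6-bit opcode, 5-bit RT, 5-bit RA,
# 16-bit signed displacement.

def decode_d_form(word):
    """Return (opcode, rt, ra, displacement) for a 32-bit D-form word."""
    opcode = (word >> 26) & 0x3F
    rt = (word >> 21) & 0x1F
    ra = (word >> 16) & 0x1F
    disp = word & 0xFFFF
    if disp & 0x8000:          # sign-extend the 16-bit displacement
        disp -= 0x10000
    return opcode, rt, ra, disp


# Values copied from the oops above (names are hypothetical).
FAULTING_WORD = 0x893E02F4     # word marked <...> in "Instruction dump"
GPR30 = 0x0000000000000000     # third value on the "GPR28:" line
DAR = 0x00000000000002F4       # faulting data address from the log

opcode, rt, ra, disp = decode_d_form(FAULTING_WORD)
effective_address = GPR30 + disp

# Primary opcode 34 is lbz (load byte and zero) in the Power ISA.
print(f"opcode={opcode} rt=r{rt} ra=r{ra} disp={disp:#x}")
print(f"effective address={effective_address:#x} DAR={DAR:#x}")
```

If the decode is right, the load uses r30 as its base, r30 holds zero, and the computed address equals the DAR, which is what a dereference through a NULL socket pointer would look like.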

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-150129 severity-high targetmilestone-inin---
bugproxy (bugproxy) wrote : dmesg

bugproxy (bugproxy) wrote : rxskb

bugproxy (bugproxy) wrote : net

bugproxy (bugproxy) wrote : dev

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: New → Triaged
Tim Gardner (timg-tpi) wrote :

In dmesg I see these suspicious errors:

[ 222.904306] Injecting memory failure for page 0x3a44ee at 0x3fff85660000
[ 222.904621] Memory failure: 0x3a44ee: recovery action for dirty LRU page: Recovered

Are they normal? The network crash could simply be a second-order effect.

Colin Ian King (colin-king) wrote :

I believe the MADV_HWPOISON flag on the madvise stressor is triggering the "Injecting memory failure for page" errors. These are just test messages, see madvise(2):

"This feature is intended for testing of memory error-handling code; it is available only if the kernel was configured with CONFIG_MEMORY_FAILURE".

Tim Gardner (timg-tpi) wrote :

Also, Colin King pointed out, "the dccp crash on that bug occurs because I added a dccp stressor in before christmas; it's a little used part of the networking stack, so I guess I hit something that's new". It is possible this really is an upstream network stack bug as Kevin Rudd pointed out.

bugproxy (bugproxy) on 2017-01-06
tags: added: targetmilestone-inin1704
removed: targetmilestone-inin---

Kevin W. Rudd (kevinr) wrote :

Deleted accidental attachment dups.

------- Comment From <email address hidden> 2017-02-08 15:48 EDT-------
Hello Colin and Tim,

Any further progress? Please let us know if you need anything further from IBM. Thanks!

------- Comment on attachment From <email address hidden> 2017-03-20 11:13 EDT-------

Hello Canonical.

Applying dccp commit 449809a66c1d0b1563dee84493e14bf3104d2d7e on top of the 4.10.0-13 kernel source appears to have resolved this issue.

Please consider including this commit: tcp/dccp: block BH for SYN processing

Tim Gardner (timg-tpi) on 2017-03-20
Changed in linux (Ubuntu Zesty):
assignee: Canonical Kernel Team (canonical-kernel-team) → Tim Gardner (timg-tpi)
status: Triaged → Fix Committed

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.10.0-15.17

---------------
linux (4.10.0-15.17) zesty; urgency=low

  [ Tim Gardner ]

  * Release Tracking Bug
    - LP: #1675868

  * In ZZ-BML (POWER9):ubuntu17.04 installation Fails (LP: #1675771)
    - powerpc/64s: fix handling of non-synchronous machine checks
    - powerpc/64s: allow machine check handler to set severity and initiator
    - powerpc/64s: POWER9 machine check handler

  * [Feature] R3 mwait support for Knights Mill (LP: #1637550)
    - x86/cpufeature: Enable RING3MWAIT for Knights Landing
    - x86/cpufeature: Enable RING3MWAIT for Knights Mill
    - x86/msr: Add MSR_MISC_FEATURE_ENABLES and RING3MWAIT bit
    - x86/elf: Add HWCAP2 to expose ring 3 MONITOR/MWAIT
    - x86/cpufeature: Add RING3MWAIT to CPU features

  * [Feature] GLK:New device IDs (LP: #1645951)
    - mfd: intel-lpss: Add Intel Gemini Lake PCI IDs
    - pwm: lpss: Add Intel Gemini Lake PCI ID
    - i2c: i801: Add support for Intel Gemini Lake
    - spi: pxa2xx: Add support for Intel Gemini Lake
    - [Config] CONFIG_PINCTRL_GEMINILAKE=m
    - pinctrl: intel: Add Intel Gemini Lake pin controller support

  * Zesty update to v4.10.5 stable release (LP: #1675032)
    - net/mlx5e: Register/unregister vport representors on interface attach/detach
    - net/mlx5e: Do not reduce LRO WQE size when not using build_skb
    - net/mlx5e: Fix broken CQE compression initialization
    - net/mlx5e: Update MPWQE stride size when modifying CQE compress state
    - net/mlx5e: Fix wrong CQE decompression
    - vxlan: correctly validate VXLAN ID against VXLAN_N_VID
    - vti6: return GRE_KEY for vti6
    - vxlan: don't allow overwrite of config src addr
    - ipv4: add missing initialization for flowi4_uid
    - ipv4: mask tos for input route
    - sctp: set sin_port for addr param when checking duplicate address
    - net sched actions: decrement module reference count after table flush.
    - l2tp: avoid use-after-free caused by l2tp_ip_backlog_recv
    - vxlan: lock RCU on TX path
    - geneve: lock RCU on TX path
    - mlxsw: spectrum_router: Avoid potential packets loss
    - net: bridge: allow IPv6 when multicast flood is disabled
    - net: don't call strlen() on the user buffer in packet_bind_spkt()
    - net: net_enable_timestamp() can be called from irq contexts
    - ipv6: orphan skbs in reassembly unit
    - dccp: Unlock sock before calling sk_free()
    - amd-xgbe: Stop the PHY before releasing interrupts
    - amd-xgbe: Be sure to set MDIO modes on device (re)start
    - amd-xgbe: Don't overwrite SFP PHY mod_absent settings
    - bonding: use ETH_MAX_MTU as max mtu
    - strparser: destroy workqueue on module exit
    - tcp: fix various issues for sockets morphing to listen state
    - net: fix socket refcounting in skb_complete_wifi_ack()
    - net: fix socket refcounting in skb_complete_tx_timestamp()
    - net/sched: act_skbmod: remove unneeded rcu_read_unlock in tcf_skbmod_dump
    - dccp: fix use-after-free in dccp_feat_activate_values
    - team: use ETH_MAX_MTU as max mtu
    - vrf: Fix use-after-free in vrf_xmit
    - net/tunnel: set inner protocol in network gro hooks
    - uapi: fix linux/packet_diag.h use...

Changed in linux (Ubuntu Zesty):
status: Fix Committed → Fix Released

------- Comment (attachment only) From <email address hidden> 2017-04-11 02:01 EDT-------

------- Comment From <email address hidden> 2017-04-19 11:08 EDT-------
FYI Canonical:

It appears that the issue was not resolved with the most recent patch as this panic was recently seen with the 4.10.0-19-generic kernel. Unfortunately, no new vmcore is available due to the issue currently being worked in LP Bug 1680349.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-10-13 13:03 EDT-------
*** Bug 157351 has been marked as a duplicate of this bug. ***
