Looks to me like 4.13 is missing this set of patches from Cong Wang:
commit 623859ae06b85cabba79ce78f0d49e67783d4c34
Merge: 8f56246 35c55fc
Author: David S. Miller <email address hidden>
Date: Thu Nov 9 10:03:10 2017 +0900
Merge branch 'net-sched-race-fix'
Cong Wang says:
====================
net_sched: close the race between call_rcu() and cleanup_net()
This patchset tries to fix the race between call_rcu() and
cleanup_net() again. Without holding the netns refcnt the
tc_action_net_exit() in netns workqueue could be called before
filter destroy works in tc filter workqueue. This patchset
moves the netns refcnt from tc actions to tcf_exts, without
breaking per-netns tc actions.
Patch 1 reverts the previous fix, patch 2 introduces two new
API's to help to address the bug and the rest patches switch
to the new API's. Please see each patch for details.
I was not able to reproduce this bug, but now after adding
some delay in filter destroy work I manage to trigger the
crash. After this patchset, the crash is not reproducible
any more and the debugging printk's show the order is expected
too.
====================
Fixes: ddf97ccdd7cb ("net_sched: add network namespace support for tc action
Reported-by: Lucas Bates <email address hidden>
Cc: Lucas Bates <email address hidden>
Cc: Jamal Hadi Salim <email address hidden>
Cc: Jiri Pirko <email address hidden>
Signed-off-by: Cong Wang <email address hidden>
Signed-off-by: David S. Miller <email address hidden>
Note the comment he makes about “filter destroy work” and how the final function in the trace is __inet_del_ifa(). As the trace shows (Workqueue: netns cleanup_net), the machine is executing the netns cleanup_net() function when the panic occurs. This series of patches has not been backported to the 4.13.16 kernel.
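The ordering problem described in the cover letter above can be illustrated with a small userspace toy model. This is only a sketch and not kernel code: the `Netns` class, the `actions` table, `filter_destroy_work()` and `run()` are hypothetical names invented for illustration, and the "workqueues" are modelled as plain sequential calls. It shows only the general idea that deferred work which pins the namespace with a reference keeps the namespace teardown from running first.

```python
class Netns:
    """Toy stand-in for a network namespace with a reference count (hypothetical)."""

    def __init__(self):
        self.refcnt = 1            # reference held by the namespace's owner
        self.actions = {"act0"}    # stands in for the per-netns tc action table
        self.torn_down = False

    def maybe_get(self):
        """Take a reference unless the count has already dropped to zero."""
        if self.refcnt == 0:
            return False
        self.refcnt += 1
        return True

    def put(self):
        """Drop a reference; the last put tears the namespace down."""
        self.refcnt -= 1
        if self.refcnt == 0:
            self.actions = None    # models the netns exit path freeing the table
            self.torn_down = True


def filter_destroy_work(ns, holds_ref):
    """Deferred filter destroy: it still needs the per-netns action table."""
    if ns.actions is None:
        result = "stale access"    # the table is already gone: the bug
    else:
        ns.actions.discard("act0")
        result = "ok"
    if holds_ref:
        ns.put()                   # release the pin taken when the work was queued
    return result


def run(hold_ref):
    """Queue the destroy work, let the owner exit, then run the deferred work."""
    ns = Netns()
    pinned = hold_ref and ns.maybe_get()   # the fixed variant pins the netns first
    ns.put()                               # owner exits; teardown runs only if unpinned
    return filter_destroy_work(ns, pinned)


if __name__ == "__main__":
    print("without reference:", run(False))
    print("with reference:   ", run(True))
```

In this model, `run(False)` lets the teardown happen before the deferred work runs, while `run(True)` delays the teardown until the deferred work drops its reference. The real patches involve RCU callbacks and two separate workqueues, which this sketch does not attempt to reproduce.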
James Page asked me to post some findings here:
Here’s the trace I’m getting (same as the one in comment #10):
[ 5152.142936] device s1 left promiscuous mode
[ 5152.427823] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 5152.428422] IP: rtmsg_ifa+0x30/0xd0
[ 5152.428816] *pdpt = 0000000033f65001 *pde = 0000000000000000
[ 5152.428820]
[ 5152.429682] Oops: 0000 [#1] SMP
[ 5152.430046] Modules linked in: veth netconsole openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack ppdev snd_hda_codec_generic snd_hda_intel snd_hda_codec joydev snd_hda_core snd_hwdep snd_pcm input_leds serio_raw snd_timer snd pvpanic parport_pc i2c_piix4 soundcore mac_hid parport sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear qxl ttm crc32_pclmul drm_kms_helper pcbc aesni_intel syscopyarea aes_i586 sysfillrect crypto_simd sysimgblt fb_sys_fops psmouse cryptd virtio_net virtio_blk drm pata_acpi floppy
[ 5152.433348] CPU: 1 PID: 90 Comm: kworker/u4:3 Tainted: G W 4.13.0-16-generic #19-Ubuntu
[ 5152.433852] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 5152.434346] Workqueue: netns cleanup_net
[ 5152.434816] task: f17aa100 task.stack: f4ef0000
[ 5152.435302] EIP: rtmsg_ifa+0x30/0xd0
[ 5152.435780] EFLAGS: 00010246 CPU: 1
[ 5152.436254] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 014000c0
[ 5152.436764] ESI: 00000000 EDI: f063a6c0 EBP: f4ef1dcc ESP: f4ef1db4
[ 5152.437267] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 5152.437780] CR0: 80050033 CR2: 00000000 CR3: 33c3f4a0 CR4: 001406f0
[ 5152.438311] Call Trace:
[ 5152.438816] __inet_del_ifa+0xbb/0x260
[ 5152.439344] ? igmpv3_clear_delrec+0x28/0xa0
[ 5152.439868] inetdev_event+0x22f/0x4e0
[ 5152.440401] ? skb_dequeue+0x5b/0x70
[ 5152.440934] ? wireless_nlevent_flush+0x4c/0x90
[ 5152.441487] notifier_call_chain+0x4e/0x70
[ 5152.442016] raw_notifier_call_chain+0x11/0x20
[ 5152.442554] call_netdevice_notifiers_info+0x2a/0x60
[ 5152.443097] rollback_registered_many+0x21c/0x380
[ 5152.443646] unregister_netdevice_many.part.102+0x10/0x80
[ 5152.444180] default_device_exit_batch+0x134/0x160
[ 5152.444709] ? do_wait_intr_irq+0x80/0x80
[ 5152.445223] ops_exit_list.isra.8+0x4d/0x60
[ 5152.445744] cleanup_net+0x18e/0x260
[ 5152.446264] process_one_work+0x1a0/0x390
[ 5152.446790] worker_thread+0x37/0x440
[ 5152.447321] kthread+0xf3/0x110
[ 5152.447843] ? process_one_work+0x390/0x390
[ 5152.448380] ? kthread_create_on_node+0x20/0x20
[ 5152.448919] ret_from_fork+0x19/0x24
[ 5152.449462] Code: 55 89 e5 57 56 53 89 d7 89 ce 83 ec 0c 85 c9 89 45 e8 c7 45 f0 00 00 00 00 74 06 8b 41 08 89 45 f0 8b 47 0c 31 c9 ba c0 00 40 01 <8b> 00 8b 80 20 03 00 00 6a ff 89 45 ec b8 60 00 00 00 e8 19 46
[ 5152.450719] EIP: rtmsg_ifa+0x30/0xd0 SS:ESP: 0068:f4ef1db4
[ 5152.451308] CR2: 0000000000000000
[ 5152.451885] ---[ end trace 5cdfc95a5b343f5c ]---