network protocol appear call trace (rawv6_getsockopt->ipv6_getsockopt...))

Bug #1850605 reported by Andrew Cloke
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
kunpeng920
Won't Fix
Undecided
Unassigned
Ubuntu-18.04
Won't Fix
Undecided
Unassigned
Ubuntu-18.04-hwe
Won't Fix
Undecided
Unassigned
Ubuntu-19.04
Won't Fix
Undecided
Unassigned
Ubuntu-19.10
Won't Fix
Undecided
Unassigned
Ubuntu-20.04
Won't Fix
Undecided
Unassigned
Upstream-kernel
Invalid
Undecided
Unassigned

Bug Description

[Bug Description]

ubuntu 18.04.3 appear network protocal call trace and result in one cpu stuck (watchdog: BUG: soft lockup - CPU#4 stuck for 22s ).

[Steps to Reproduce]
1) taishan 2280V2 client <==> taishan 2280V2 server
2)modprobe ib_umad
  modprobe hns-roce-hw-v2
3) server : ib_atomic_bw -d hns_0 -A CMP_AND_SWAP -e
   client:ib_atomic_bw -d hns_0 -A CMP_AND_SWAP -e

4)server operation:
 mount -t debugfs none /mnt ;echo 7 > /proc/sys/kernel/printk
 echo '1' > /sys/class/net/ethx/device/sriov_numvfs
 echo help > /mnt/hns3/0000\:7d\:00.0/cmd
 ethtool -s ethx msglvl 0x0
 ethtool -s ethX msglvl probe on
 ping client
5) seRver : ib_send_bw -d hnsX -x 0 -c RC --use-srq -R
           ib_send_bw -d hnsX -x 0 -c RC --use-srq -R server_ip
   client: ib_send_bw -d hnsX -x 0 -c RC --use-srq -R --run_infinitely
           ib_send_bw -d hnsX -x 0 -c RC --use-srq -R server_ip --run_infinitely

6) server: apt install netperf
           ethtool -N ethx rx-flow-hash udp4 sdfn
           watch -d -n 1 "ethtool -S ethx | grep rxq|grep pktnum"
           ethtool -N ethx rx-flow-hash udp4 d
           ethtool -n ethx rx-flow-hash udp4
client (3 repeat): netperf -H IPxxx -t UDP_STREAM -l 10 -- -m 10240
                   netperf -H IPxxx -t UDP_STREAM -l 10 -- -m 10240
7) server: ib_send_bw -d hns_0 -n 5 -s1024 -S 3 -x 3
   client :ib_send_bw -d hns_0 -n 5 -s1024 -S 3 -x 3 serverip
8) ufw disable

[Actual Results]
Oct 15 11:46:08 ubuntu kernel: [108358.278824] watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [ip6tables:41601]
Oct 15 11:46:08 ubuntu kernel: [108358.286208] Modules linked in: binfmt_misc sctp libcrc32c iptable_filter bpfilter ip6table_filter ip6_tables 8021q garp mrp stp llc ipmi_ssif aes_ce_blk crypto_simd nls_iso8859_1 cryptd aes_ce_cipher rpcrdma sunrpc rdma_ucm crct10dif_ce ib_iser ghash_ce rdma_cm iw_cm aes_arm64 ib_umad libiscsi ib_ipoib sha2_ce scsi_transport_iscsi input_leds joydev ib_cm sha256_arm64 sha1_ce hns_roce_hw_v2 hns_roce hibmc_drm ttm drm_kms_helper drm fb_sys_fops syscopyarea sysfillrect sysimgblt ipmi_si tpm_tis_spi ipmi_devintf ipmi_msghandler spi_dw_mmio spi_dw cppc_cpufreq sch_fq_codel ip_tables x_tables autofs4 ses enclosure mlx5_ib marvell ib_uverbs ib_core hid_generic mlx5_core hisi_sas_v3_hw tls hns3 usbhid hisi_sas_main mlxfw ixgbe hclge devlink hid libsas hnae3 hinic ahci scsi_transport_sas xfrm_algo mdio gpio_dwapb
Oct 15 11:46:08 ubuntu kernel: [108358.286253] CPU: 4 PID: 41601 Comm: ip6tables Not tainted 5.0.0-23-generic #24~18.04.1-Ubuntu
Oct 15 11:46:08 ubuntu kernel: [108358.286254] Hardware name: Huawei TaiShan 2280 V2/BC82AMDA, BIOS TA BIOS 2280-A CS V2.B070.01 09/04/2019
Oct 15 11:46:08 ubuntu kernel: [108358.286255] pstate: 20400009 (nzCv daif +PAN -UAO)
Oct 15 11:46:08 ubuntu kernel: [108358.286262] pc : alloc_counters.isra.5+0x144/0x188 [ip6_tables]
Oct 15 11:46:08 ubuntu kernel: [108358.286264] lr : alloc_counters.isra.5+0x70/0x188 [ip6_tables]
Oct 15 11:46:08 ubuntu kernel: [108358.286265] sp : ffff000023ff3be0
Oct 15 11:46:08 ubuntu kernel: [108358.286265] x29: ffff000023ff3be0 x28: ffffdd595f213040
Oct 15 11:46:08 ubuntu kernel: [108358.286267] x27: ffff000020b49000 x26: 0000000000000000
Oct 15 11:46:08 ubuntu kernel: [108358.286268] x25: ffffdd593f8eff58 x24: 0000000000000000
Oct 15 11:46:08 ubuntu kernel: [108358.286270] x23: ffff3de097bcc738 x22: 0000000000000000
Oct 15 11:46:08 ubuntu kernel: [108358.286271] x21: ffff3de097bcd3b4 x20: ffffdd595f213000
Oct 15 11:46:08 ubuntu kernel: [108358.286272] x19: ffffdd595f213040 x18: 0000000000000000
Oct 15 11:46:08 ubuntu kernel: [108358.286273] x17: 0000000000000000 x16: ffff3de09711af48
Oct 15 11:46:08 ubuntu kernel: [108358.286274] x15: 0000000000000000 x14: ffffa2e540000000
Oct 15 11:46:08 ubuntu kernel: [108358.286275] x13: 0140000000000000 x12: 0040000000000041
Oct 15 11:46:08 ubuntu kernel: [108358.286277] x11: ffff000060b49000 x10: ffff3de097c1e1c8
Oct 15 11:46:08 ubuntu kernel: [108358.286278] x9 : 0000000000000000 x8 : 0000000000000001
Oct 15 11:46:08 ubuntu kernel: [108358.286279] x7 : 0040000000000001 x6 : 0000000000001000
Oct 15 11:46:08 ubuntu kernel: [108358.286280] x5 : ffff000020b49000 x4 : ffff3ddf977f5f58
Oct 15 11:46:08 ubuntu kernel: [108358.286281] x3 : 000000005f656369 x2 : 00009f79a80fa000
Oct 15 11:46:08 ubuntu kernel: [108358.286282] x1 : 00009f79a80fa000 x0 : ffff5ddf7f740000
Oct 15 11:46:08 ubuntu kernel: [108358.286284] Call trace:
Oct 15 11:46:08 ubuntu kernel: [108358.286286] alloc_counters.isra.5+0x144/0x188 [ip6_tables]
Oct 15 11:46:08 ubuntu kernel: [108358.286288] do_ip6t_get_ctl+0x258/0x418 [ip6_tables]
Oct 15 11:46:08 ubuntu kernel: [108358.286292] nf_getsockopt+0x64/0xa8
Oct 15 11:46:08 ubuntu kernel: [108358.286294] ipv6_getsockopt+0x1a4/0x2b8
Oct 15 11:46:08 ubuntu kernel: [108358.286296] rawv6_getsockopt+0x60/0xe8
Oct 15 11:46:08 ubuntu kernel: [108358.286298] sock_common_getsockopt+0x50/0x68
Oct 15 11:46:08 ubuntu kernel: [108358.286299] __arm64_sys_getsockopt+0x90/0xf8
Oct 15 11:46:08 ubuntu kernel: [108358.286302] el0_svc_common+0x78/0x120
Oct 15 11:46:08 ubuntu kernel: [108358.286303] el0_svc_handler+0x38/0x78
Oct 15 11:46:08 ubuntu kernel: [108358.286305] el0_svc+0x8/0xc
Oct 15 11:46:41 ubuntu kernel: [108390.854822] rcu: INFO: rcu_sched self-detected stall on CPU
Oct 15 11:46:41 ubuntu kernel: [108390.860471] rcu: 4-....: (14998 ticks this GP) idle=672/1/0x4000000000000002 softirq=431381/431381 fqs=7500
Oct 15 11:46:41 ubuntu kernel: [108390.870454] rcu: (t=15003 jiffies g=3365597 q=977)
Oct 15 11:46:41 ubuntu kernel: [108390.875499] Task dump for CPU 4:
Oct 15 11:46:41 ubuntu kernel: [108390.875500] ip6tables R running task 0 41601 41600 0x00000006
Oct 15 11:46:41 ubuntu kernel: [108390.875503] Call trace:
Oct 15 11:46:41 ubuntu kernel: [108390.875506] dump_backtrace+0x0/0x188
Oct 15 11:46:41 ubuntu kernel: [108390.875507] show_stack+0x24/0x30
Oct 15 11:46:41 ubuntu kernel: [108390.875509] sched_show_task+0x120/0x158
Oct 15 11:46:41 ubuntu kernel: [108390.875511] dump_cpu_task+0x48/0x58
Oct 15 11:46:41 ubuntu kernel: [108390.875513] rcu_dump_cpu_stacks+0xcc/0x10c
Oct 15 11:46:41 ubuntu kernel: [108390.875514] rcu_check_callbacks+0x7b0/0x868
Oct 15 11:46:41 ubuntu kernel: [108390.875515] update_process_times+0x34/0x60
Oct 15 11:46:41 ubuntu kernel: [108390.875517] tick_sched_handle.isra.5+0x34/0x70
Oct 15 11:46:41 ubuntu kernel: [108390.875518] tick_sched_timer+0x50/0xa0
Oct 15 11:46:41 ubuntu kernel: [108390.875520] __hrtimer_run_queues+0x11c/0x368
Oct 15 11:46:41 ubuntu kernel: [108390.875522] hrtimer_interrupt+0xf4/0x2a8
Oct 15 11:46:41 ubuntu kernel: [108390.875524] arch_timer_handler_phys+0x38/0x58
Oct 15 11:46:41 ubuntu kernel: [108390.875525] handle_percpu_devid_irq+0x90/0x240
Oct 15 11:46:41 ubuntu kernel: [108390.875527] generic_handle_irq+0x34/0x50
Oct 15 11:46:41 ubuntu kernel: [108390.875529] __handle_domain_irq+0x68/0xc0
Oct 15 11:46:41 ubuntu kernel: [108390.875530] gic_handle_irq+0x6c/0x168
Oct 15 11:46:41 ubuntu kernel: [108390.875531] el1_irq+0xe8/0x1c0
Oct 15 11:46:41 ubuntu kernel: [108390.875533] alloc_counters.isra.5+0x144/0x188 [ip6_tables]
Oct 15 11:46:41 ubuntu kernel: [108390.875535] do_ip6t_get_ctl+0x258/0x418 [ip6_tables]
Oct 15 11:46:41 ubuntu kernel: [108390.875537] nf_getsockopt+0x64/0xa8
Oct 15 11:46:41 ubuntu kernel: [108390.875538] ipv6_getsockopt+0x1a4/0x2b8
Oct 15 11:46:41 ubuntu kernel: [108390.875539] rawv6_getsockopt+0x60/0xe8
Oct 15 11:46:41 ubuntu kernel: [108390.875542] sock_common_getsockopt+0x50/0x68
Oct 15 11:46:41 ubuntu kernel: [108390.875543] __arm64_sys_getsockopt+0x90/0xf8
Oct 15 11:46:41 ubuntu kernel: [108390.875545] el0_svc_common+0x78/0x120
Oct 15 11:46:41 ubuntu kernel: [108390.875547] el0_svc_handler+0x38/0x78
Oct 15 11:46:41 ubuntu kernel: [108390.875548] el0_svc+0x8/0xc
Oct 15 11:47:08 ubuntu kernel: [108418.278822] watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [ip6tables:41601]
Oct 15 11:47:08 ubuntu kernel: [108418.286202] Modules linked in: binfmt_misc sctp libcrc32c iptable_filter bpfilter ip6table_filter ip6_tables 8021q garp mrp stp llc ipmi_ssif aes_ce_blk crypto_simd nls_iso8859_1 cryptd aes_ce_cipher rpcrdma sunrpc rdma_ucm crct10dif_ce ib_iser ghash_ce rdma_cm iw_cm aes_arm64 ib_umad libiscsi ib_ipoib sha2_ce scsi_transport_iscsi input_leds joydev ib_cm sha256_arm64 sha1_ce hns_roce_hw_v2 hns_roce hibmc_drm ttm drm_kms_helper drm fb_sys_fops syscopyarea sysfillrect sysimgblt ipmi_si tpm_tis_spi ipmi_devintf ipmi_msghandler spi_dw_mmio spi_dw cppc_cpufreq sch_fq_codel ip_tables x_tables autofs4 ses enclosure mlx5_ib marvell ib_uverbs ib_core hid_generic mlx5_core hisi_sas_v3_hw tls hns3 usbhid hisi_sas_main mlxfw ixgbe hclge devlink hid libsas hnae3 hinic ahci scsi_transport_sas xfrm_algo mdio gpio_dwapb
Oct 15 11:47:08 ubuntu kernel: [108418.286230] CPU: 4 PID: 41601 Comm: ip6tables Tainted: G L 5.0.0-23-generic #24~18.04.1-Ubuntu
Oct 15 11:47:08 ubuntu kernel: [108418.286231] Hardware name: Huawei TaiShan 2280 V2/BC82AMDA, BIOS TA BIOS 2280-A CS V2.B070.01 09/04/2019
Oct 15 11:47:08 ubuntu kernel: [108418.286232] pstate: 20400009 (nzCv daif +PAN -UAO)
Oct 15 11:47:08 ubuntu kernel: [108418.286234] pc : alloc_counters.isra.5+0x144/0x188 [ip6_tables]
Oct 15 11:47:08 ubuntu kernel: [108418.286236] lr : alloc_counters.isra.5+0x70/0x188 [ip6_tables]
Oct 15 11:47:08 ubuntu kernel: [108418.286237] sp : ffff000023ff3be0
Oct 15 11:47:08 ubuntu kernel: [108418.286237] x29: ffff000023ff3be0 x28: ffffdd595f213040
Oct 15 11:47:08 ubuntu kernel: [108418.286239] x27: ffff000020b49000 x26: 0000000000000000
Oct 15 11:47:08 ubuntu kernel: [108418.286241] x25: ffffdd593f8eff58 x24: 0000000000000000
Oct 15 11:47:08 ubuntu kernel: [108418.286242] x23: ffff3de097bcc738 x22: 0000000000000000
Oct 15 11:47:08 ubuntu kernel: [108418.286243] x21: ffff3de097bcd3b4 x20: ffffdd595f213000
Oct 15 11:47:08 ubuntu kernel: [108418.286244] x19: ffffdd595f213040 x18: 0000000000000000
Oct 15 11:47:08 ubuntu kernel: [108418.286246] x17: 0000000000000000 x16: ffff3de09711af48
Oct 15 11:47:08 ubuntu kernel: [108418.286247] x15: 0000000000000000 x14: ffffa2e540000000
Oct 15 11:47:08 ubuntu kernel: [108418.286248] x13: 0140000000000000 x12: 0040000000000041
Oct 15 11:47:08 ubuntu kernel: [108418.286249] x11: ffff000060b49000 x10: ffff3de097c1e1c8
Oct 15 11:47:08 ubuntu kernel: [108418.286250] x9 : 0000000000000000 x8 : 0000000000000001
Oct 15 11:47:08 ubuntu kernel: [108418.286252] x7 : 0040000000000001 x6 : 0000000000001000
Oct 15 11:47:08 ubuntu kernel: [108418.286253] x5 : ffff000020b49000 x4 : ffff3ddf977f5f58
Oct 15 11:47:08 ubuntu kernel: [108418.286254] x3 : 000000005f656369 x2 : 00009f79a80fa000
Oct 15 11:47:08 ubuntu kernel: [108418.286255] x1 : 00009f79a80fa000 x0 : ffff5ddf7f740000
Oct 15 11:47:08 ubuntu kernel: [108418.286256] Call trace:
Oct 15 11:47:08 ubuntu kernel: [108418.286258] alloc_counters.isra.5+0x144/0x188 [ip6_tables]
Oct 15 11:47:08 ubuntu kernel: [108418.286260] do_ip6t_get_ctl+0x258/0x418 [ip6_tables]
Oct 15 11:47:08 ubuntu kernel: [108418.286261] nf_getsockopt+0x64/0xa8
Oct 15 11:47:08 ubuntu kernel: [108418.286263] ipv6_getsockopt+0x1a4/0x2b8
Oct 15 11:47:08 ubuntu kernel: [108418.286264] rawv6_getsockopt+0x60/0xe8
Oct 15 11:47:08 ubuntu kernel: [108418.286265] sock_common_getsockopt+0x50/0x68
Oct 15 11:47:08 ubuntu kernel: [108418.286266] __arm64_sys_getsockopt+0x90/0xf8
Oct 15 11:47:08 ubuntu kernel: [108418.286268] el0_svc_common+0x78/0x120
Oct 15 11:47:08 ubuntu kernel: [108418.286269] el0_svc_handler+0x38/0x78
Oct 15 11:47:08 ubuntu kernel: [108418.286270] el0_svc+0x8/0xc

[Expected Results]
no calltrace

[Reproducibility]
low probability

[Additional information]
(Firmware version, kernel version, affected hardware, etc. if required):

[Resolution]

Revision history for this message
dann frazier (dannf) wrote :

I am a little confused by the reproduction steps. Are all of these commands necessary? Should it 100% reproduce the problem? I attempted to run these steps as recorded in the attached file, but I was not able to reproduce. Please check the attached file. I also confirmed I am running the 5.0.0-23 kernel on both sides.

Changed in kunpeng920:
status: New → Incomplete
Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

@Xinwei has described that this issue was found while running a larger test suite.
It only occurs sporadically, and is quite hard to reproduce.
He has only seen it approximately once or twice in a month, although it's not clear how many times the test suite was run during that time.

Revision history for this message
Fred Kimmy (kongzizaixian) wrote :

this another reproduce way as follow:
1) apt-get install qemu-kvm libvirt-bin virtinst bridge-utils cpu-checker virt-manager virt-viewer
2)cofig this network bridge
>>vi /etc/netplan/01-netcfg.yaml
network:
version: 2
renderer: networkd
ethernets:
enp1s0:
dhcp4: no
dhcp6: no
bridges:
br0:
interfaces: [ enp1s0 ]
dhcp4: no
addresses: [ 172.19.47.145/24 ]
gateway4: 172.19.47.1
>>netplan apply
3) create virtual disk
qemu-img create -f qcow2 disk1.qcow2 10G
4)install this system
service libvirtd start
virt-install \
--virt-type=kvm \
--name vm1 \
--hvm \
--memory 8192 \
--vcpus=4 \
--os-type=linux \
--os-variant=ubuntu13.04 \
--network bridge=br0,model=virtio \
--location=/home/uose-pearl2-18.06.iso \
--disk path=/disk1.qcow2,bus=virtio,format=qcow2 \
--video=virtio \
--boot uefi \
--extra-args='console=ttyAMA0' \
--force
5) virsh start vm1;virsh destroy vm1

Can you this following step and it reproduce once a week?

Revision history for this message
dann frazier (dannf) wrote :

I wonder if the steps in comment #3 are actually causing the issue, or just coincidental.

What we are see in the log is that the ip6tables command is running, and appears to be hanging in the kernel. However, none of the steps in comment #3 appear to cause an exec() of /sbin/ip6tables. I verified this by setting a tracepoint for ip6tables execs[*] and running the test.

Is there something else running ip6tables commands in your environment?

[*]
$ echo sched:sched_process_exec | sudo tee /sys/kernel/debug/tracing/set_event
$ echo 'filename == "/sbin/ip6tables"' | sudo tee \
  /sys/kernel/debug/tracing/events/sched/sched_process_exec/filter
$ printf "" | sudo tee /sys/kernel/debug/tracing/trace

Revision history for this message
dann frazier (dannf) wrote :

Going by the traceback, I believe we are stuck in the seqcount read loop in ip6_tables.c:get_counters(). This could happen if a writer has taken the lock and failed to release it. However, I don't know if we can prove that theory or find out who the writer is from just the traceback. If possible, could you collect a crashdump on a system after it has hung? sosreport output from the system would also be useful.

Also, as mentioned in comment #4, I think something is running the /sbin/ip6tables command when the hang occurs. Can you take a look at your setup to see if you can determine what that might be?

Revision history for this message
Fred Kimmy (kongzizaixian) wrote :

=>Also, as mentioned in comment #4, I think something is running the /sbin/ip6tables command when the hang =?occurs. Can you take a look at your setup to see if you can determine what that might be?

our envirnment only use this comment #3 method to run it and set this net config.

Revision history for this message
Fred Kimmy (kongzizaixian) wrote :

Is there something else running ip6tables commands in your environment?

[*]
 $ echo sched:sched_process_exec | sudo tee /sys/kernel/debug/tracing/set_event
 $ echo 'filename == "/sbin/ip6tables"' | sudo tee \
   /sys/kernel/debug/tracing/events/sched/sched_process_exec/filter
 $ printf "" | sudo tee /sys/kernel/debug/tracing/trace

==> today our envirnment have reproduce it, but I hard to output any command to collect any informations.

Revision history for this message
Fred Kimmy (kongzizaixian) wrote :

comment #7 %s/output/input/g

Taihsiang Ho (tai271828)
summary: - network protocal appear call trace
+ network protocol appear call trace
(rawv6_getsockopt->ipv6_getsockopt...))
Revision history for this message
Ike Panhc (ikepanhc) wrote :

19.04 is EOLed. There is no more SRU cycle for it.

Ike Panhc (ikepanhc)
tags: added: ikeradar
Revision history for this message
Ike Panhc (ikepanhc) wrote :

I put kreiken on infinite loop as #3. Will check again on next Monday and see if we can reproduce.

ubuntu@kreiken:~$ while true; do sleep 5;sudo virsh destroy vm1;sleep 5;sudo virsh start vm1; done
Domain vm1 destroyed

Domain vm1 started

Revision history for this message
Ike Panhc (ikepanhc) wrote :

kreiken on infinite loop for 20hr and can not reproduce any kernel oops/backtrace

ubuntu@kreiken:/var/log$ grep 'Modules linked in' kern.log
ubuntu@kreiken:/var/log$ w
 03:39:00 up 20:08, 1 user, load average: 0.55, 0.53, 0.58
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
ubuntu pts/1 10.228.68.2 02:25 4.00s 0.14s 0.02s w

Revision history for this message
Ike Panhc (ikepanhc) wrote :

After 3 days running VM start/shutdown as #3 but I can not reproduce any CPU stuck on kreiken.

The same kernel backtrace is observed on d05-1 and is solved by re-seat all DIMM and PCIe cards. Is it possible that the root cause is some cards not seat well?

Revision history for this message
Ike Panhc (ikepanhc) wrote :
Revision history for this message
Ike Panhc (ikepanhc) wrote :

For now we can not reproduce this issue anymore. I am closing this bug and we can re-open it when we see kernel call trace again.

Changed in kunpeng920:
status: Incomplete → Won't Fix
Ike Panhc (ikepanhc)
tags: removed: ikeradar
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.