Kernel Crash [general protection fault: 0000 [#1] SMP NOPTI]

Bug #1962485 reported by Ammad Ali
This bug affects 1 person
Affects         Status     Importance  Assigned to  Milestone
linux (Ubuntu)  Confirmed  Undecided   Unassigned

Bug Description

Hi,

I am running the OpenStack Xena release on Ubuntu Focal. Today one of my compute nodes crashed due to a kernel fault, and a dump was generated in /var/crash/. Below is the kernel trace from the crash dump.

[455151.890114] general protection fault: 0000 [#1] SMP NOPTI
[455151.890285] CPU: 43 PID: 83232 Comm: qemu-system-x86 Kdump: loaded Tainted: G OE 5.4.0-88-generic #99-Ubuntu
[455151.890612] Hardware name: Dell Inc. PowerEdge R6525/XXXXX, BIOS 2.5.6 10/06/2021
[455151.890842] RIP: 0010:count_subheaders.part.0+0x26/0x60
[455151.890998] Code: 00 00 00 90 0f 1f 44 00 00 48 83 3f 00 74 4d 55 48 89 e5 41 55 45 31 ed 41 54 45 31 e4 53 48 89 fb 48 8b 7b 18 48 85 ff 74 23 <48> 83 3f 00 74 25 e8 cf ff ff ff 41 01 c5 48 83 c3 40 48 83 3b 00
[455151.891552] RSP: 0018:ffffa6b477487b88 EFLAGS: 00010202
[455151.891707] RAX: 0000000000000000 RBX: ffff9387c594f280 RCX: 0000000000000000
[455151.891918] RDX: 0000000000000060 RSI: ffff9390702a72c0 RDI: 0314a8c0f1b16f3e
[455151.892130] RBP: ffffa6b477487ba0 R08: 0000000000000000 R09: ffffffffbc6ed7f0
[455151.892341] R10: ffffa6b477487cd0 R11: 0000000000000001 R12: 0000000000000000
[455151.892552] R13: 0000000000000000 R14: ffff9391e5684000 R15: ffffffffbd5f9880
[455151.892767] FS: 00007f69950c75c0(0000) GS:ffff9391feac0000(0000) knlGS:0000000000000000
[455151.893016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[455151.893207] CR2: 00007f61e9e45000 CR3: 0000017c54afa000 CR4: 0000000000340ee0
[455151.893434] Call Trace:
[455151.893514] count_subheaders.part.0+0x31/0x60
[455151.893646] unregister_sysctl_table+0x30/0x90
[455151.893781] unregister_net_sysctl_table+0xe/0x10
[455151.893922] __devinet_sysctl_unregister.isra.0+0x2c/0x60
[455151.894082] devinet_sysctl_unregister+0x29/0x40
[455151.894220] inetdev_event+0x1e8/0x560
[455151.894334] ? skb_dequeue+0x5f/0x70
[455151.894444] notifier_call_chain+0x55/0x80
[455151.894565] ? notifier_call_chain+0x55/0x80
[455151.894693] raw_notifier_call_chain+0x16/0x20
[455151.894829] call_netdevice_notifiers_info+0x2e/0x60
[455151.894983] ? tun_show_owner+0x60/0x60
[455151.895098] rollback_registered_many+0x36e/0x520
[455151.895239] unregister_netdevice_queue+0x94/0x120
[455151.895383] __tun_detach+0x421/0x430
[455151.895495] tun_chr_close+0x3a/0x70
[455151.895605] __fput+0xcc/0x260
[455151.895698] ____fput+0xe/0x10
[455151.895792] task_work_run+0x8f/0xb0
[455151.895903] exit_to_usermode_loop+0x131/0x160
[455151.896036] do_syscall_64+0x163/0x190
[455151.896150] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[455151.896302] RIP: 0033:0x7f69965ba3fb
[455151.896410] Code: 03 00 00 00 0f 05 48 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 f3 fb ff ff 8b 7c 24 0c 41 89 c0 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2f 44 89 c7 89 44 24 0c e8 31 fc ff ff 8b 44
[455151.896975] RSP: 002b:00007ffdff14b350 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
[455151.897201] RAX: 0000000000000000 RBX: 0000557fe0875e50 RCX: 00007f69965ba3fb
[455151.897412] RDX: 0000557fe0748f40 RSI: 0000000000000001 RDI: 000000000000002b
[455151.897637] RBP: 0000557fe0887460 R08: 0000000000000000 R09: 0000000000000000
[455151.904390] R10: 0000000000000032 R11: 0000000000000293 R12: 0000557fe0875e50
[455151.911165] R13: 0000000000000001 R14: 0000557fe09efc10 R15: 0000557fe0747900
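
A note on reading the register dump above: the faulting bytes marked <48> 83 3f 00 appear to decode to cmpq $0x0,(%rdi), and RDI holds 0x0314a8c0f1b16f3e, which is not a canonical x86-64 address. Dereferencing a non-canonical address raises a general protection fault rather than a page fault, which would be consistent with the oops header. The check can be sketched in C as follows, assuming 48-bit virtual addresses (4-level paging); the helper name is_canonical is made up for illustration and is not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): return true if addr is canonical
 * for a CPU with va_bits of virtual address space (48 with 4-level paging).
 * Canonical means bits 63..(va_bits - 1) are either all zero or all one.
 */
static bool is_canonical(uint64_t addr, unsigned va_bits)
{
    assert(va_bits >= 2 && va_bits <= 64);

    /* Keep only the bits that must be copies of each other. */
    uint64_t upper = addr >> (va_bits - 1);
    /* The same number of bits, all set. */
    uint64_t all_ones = (UINT64_C(1) << (65 - va_bits)) - 1;

    return upper == 0 || upper == all_ones;
}
```

Under that assumption, is_canonical(0x0314a8c0f1b16f3e, 48) is false for RDI, while the kernel pointer in RBX (0xffff9387c594f280) passes the check, so RDI looks like stale or overwritten data rather than a valid pointer.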

I didn't find any documented details about this bug on kernel 5.4. I have uploaded the logs using the ubuntu-bug linux command.

# uname -a
Linux kvm03-a1-r01-khi04.rapid.pk 5.4.0-88-generic #99-Ubuntu SMP Thu Sep 23 17:29:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

# cat /proc/version_signature
Ubuntu 5.4.0-88.99-generic 5.4.140

I am using Dell R6525 with EPYC 7532 CPUs.

Let me know if any more information is needed.

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: linux-image-5.4.0-88-generic 5.4.0-88.99
ProcVersionSignature: Ubuntu 5.4.0-88.99-generic 5.4.140
Uname: Linux 5.4.0-88-generic x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Feb 28 17:38 seq
 crw-rw---- 1 root audio 116, 33 Feb 28 17:38 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.11-0ubuntu27.20
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CasperMD5CheckResult: pass
Date: Mon Feb 28 21:20:20 2022
InstallationDate: Installed on 2021-07-29 (214 days ago)
InstallationMedia: Ubuntu-Server 20.04.2 LTS "Focal Fossa" - Release amd64 (20210201.2)
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
MachineType: Dell Inc. PowerEdge R6525
PciMultimedia:

ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 EFI VGA
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-5.4.0-88-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro iommu=pt intel_iommu=on swapaccount=1 vga=normal nofb nomodeset video=vesafb:off i915.modeset=0 crashkernel=512M
RelatedPackageVersions:
 linux-restricted-modules-5.4.0-88-generic N/A
 linux-backports-modules-5.4.0-88-generic N/A
 linux-firmware 1.187.19
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 10/06/2021
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 2.5.6
dmi.board.name: 0GK70M
dmi.board.vendor: Dell Inc.
dmi.board.version: A10
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr2.5.6:bd10/06/2021:svnDellInc.:pnPowerEdgeR6525:pvr:rvnDellInc.:rn0GK70M:rvrA10:cvnDellInc.:ct23:cvr:
dmi.product.family: PowerEdge
dmi.product.name: PowerEdge R6525
dmi.product.sku: SKU=NotProvided;ModelName=PowerEdge R6525
dmi.sys.vendor: Dell Inc.
---
ProblemType: Bug
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Feb 28 17:38 seq
 crw-rw---- 1 root audio 116, 33 Feb 28 17:38 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.11-0ubuntu27.20
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CasperMD5CheckResult: pass
DistroRelease: Ubuntu 20.04
InstallationDate: Installed on 2021-07-29 (214 days ago)
InstallationMedia: Ubuntu-Server 20.04.2 LTS "Focal Fossa" - Release amd64 (20210201.2)
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
MachineType: Dell Inc. PowerEdge R6525
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 EFI VGA
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-5.4.0-88-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro iommu=pt intel_iommu=on swapaccount=1 vga=normal nofb nomodeset video=vesafb:off i915.modeset=0 crashkernel=512M
ProcVersionSignature: Ubuntu 5.4.0-88.99-generic 5.4.140
RelatedPackageVersions:
 linux-restricted-modules-5.4.0-88-generic N/A
 linux-backports-modules-5.4.0-88-generic N/A
 linux-firmware 1.187.19
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
Tags: focal uec-images
Uname: Linux 5.4.0-88-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: N/A
_MarkForUpload: True
dmi.bios.date: 10/06/2021
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 2.5.6
dmi.board.name: 0GK70M
dmi.board.vendor: Dell Inc.
dmi.board.version: A10
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr2.5.6:bd10/06/2021:svnDellInc.:pnPowerEdgeR6525:pvr:rvnDellInc.:rn0GK70M:rvrA10:cvnDellInc.:ct23:cvr:
dmi.product.family: PowerEdge
dmi.product.name: PowerEdge R6525
dmi.product.sku: SKU=NotProvided;ModelName=PowerEdge R6525
dmi.sys.vendor: Dell Inc.

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Ammad Ali (syedammad83) wrote : CRDA.txt

apport information

tags: added: apport-collected
description: updated
Ammad Ali (syedammad83) wrote : CurrentDmesg.txt

apport information

Ammad Ali (syedammad83) wrote : Lspci.txt

apport information

Ammad Ali (syedammad83) wrote : Lspci-vt.txt

apport information

Ammad Ali (syedammad83) wrote : Lsusb.txt

apport information

Ammad Ali (syedammad83) wrote : Lsusb-t.txt

apport information

Ammad Ali (syedammad83) wrote : Lsusb-v.txt

apport information

Ammad Ali (syedammad83) wrote : ProcCpuinfo.txt

apport information

Ammad Ali (syedammad83) wrote : ProcCpuinfoMinimal.txt

apport information

Ammad Ali (syedammad83) wrote : ProcInterrupts.txt

apport information

Ammad Ali (syedammad83) wrote : ProcModules.txt

apport information

Ammad Ali (syedammad83) wrote : UdevDb.txt

apport information

Ammad Ali (syedammad83) wrote : WifiSyslog.txt

apport information

Ammad Ali (syedammad83) wrote : acpidump.txt

apport information

tags: added: sts
David Hill (david-hill-ubisoft) wrote :

Maybe the same as https://lore.kernel<email address hidden>/T/ ?

Matthew Ruffell (mruffell) wrote :

Hi David,

Thanks for the link, I think that is the most plausible explanation I have
seen so far.

The only problem is, if we look at the patch:

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 7a3ab3427369..24001112c323 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -686,7 +686,6 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
                 if (tun)
                         xdp_rxq_info_unreg(&tfile->xdp_rxq);
                 ptr_ring_cleanup(&tfile->tx_ring, tun_ptr_free);
-                sock_put(&tfile->sk);
         }
 }

@@ -702,6 +701,9 @@ static void tun_detach(struct tun_file *tfile, bool clean)
         if (dev)
                 netdev_state_change(dev);
         rtnl_unlock();
+
+        if (clean)
+                sock_put(&tfile->sk);
 }

 static void tun_detach_all(struct net_device *dev)

It moves the final sock_put(&tfile->sk) from the end of __tun_detach() to tun_detach(), after the call to netdev_state_change(dev).

 685 static void __tun_detach(struct tun_file *tfile, bool clean)
 686 {
...
 725         if (clean) {
 726                 if (tun && tun->numqueues == 0 && tun->numdisabled == 0) {
 727                         netif_carrier_off(tun->dev);
 728
 729                         if (!(tun->flags & IFF_PERSIST) &&
 730                             tun->dev->reg_state == NETREG_REGISTERED)
 731                                 unregister_netdevice(tun->dev);
 732                 }
 733                 if (tun)
 734                         xdp_rxq_info_unreg(&tfile->xdp_rxq);
 735                 ptr_ring_cleanup(&tfile->tx_ring, tun_ptr_free);
 736                 sock_put(&tfile->sk);
 737         }
 738 }
 739
 740 static void tun_detach(struct tun_file *tfile, bool clean)
 741 {
 742         struct tun_struct *tun;
 743         struct net_device *dev;
 744
 745         rtnl_lock();
 746         tun = rtnl_dereference(tfile->tun);
 747         dev = tun ? tun->dev : NULL;
 748         __tun_detach(tfile, clean);
 749         if (dev)
 750                 netdev_state_change(dev);
 751         rtnl_unlock();
 752 }

This more or less makes sense, but if you look at the call trace in the bug:

...
[455151.894444] notifier_call_chain+0x55/0x80
...
[455151.895239] unregister_netdevice_queue+0x94/0x120
[455151.895383] __tun_detach+0x421/0x430
...

$ eu-addr2line -ifae ./vmlinux-5.4.0-88-generic __tun_detach+0x421
0xffffffff8178b991
unregister_netdevice inlined at /build/linux-q2DMsi/linux-5.4.0/drivers/net/tun.c:731:5 in __tun_detach
/build/linux-q2DMsi/linux-5.4.0/include/linux/netdevice.h:2677:1
__tun_detach
/build/linux-q2DMsi/linux-5.4.0/drivers/net/tun.c:731:5

We reach notifier_call_chain() not from netdev_state_change(), as described in the linked report, but from unregister_netdevice() at line 731. This means we have not yet run sock_put(&tfile->sk) at line 736.
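
For anyone following along, each frame in the call trace has the form symbol+0xOFFSET/0xSIZE: the return address lies OFFSET bytes into a function that is SIZE bytes long, which is what makes the eu-addr2line lookup above possible. Below is a minimal sketch in C of splitting such a frame; struct trace_frame and parse_frame() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for one "symbol+0xOFF/0xSIZE" call-trace entry. */
struct trace_frame {
    char symbol[128];
    unsigned long offset; /* bytes from the start of the function */
    unsigned long size;   /* total size of the function in bytes */
};

/* Parse e.g. "__tun_detach+0x421/0x430". Returns 0 on success, -1 otherwise. */
static int parse_frame(const char *text, struct trace_frame *out)
{
    assert(text != NULL && out != NULL);

    /* %lx accepts an optional 0x prefix, so the hex fields parse directly. */
    int n = sscanf(text, " %127[^+ ]+%lx/%lx",
                   out->symbol, &out->offset, &out->size);
    return n == 3 ? 0 : -1;
}
```

With this sketch, "__tun_detach+0x421/0x430" yields offset 0x421 in a 0x430-byte function, so the call sits close to the end of __tun_detach(). Frames prefixed with "? " are rejected by this parser, which suits a first pass since those entries are marked as unreliable in the oops output.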

Puzzling, isn't it? There are calls to sock_put(&tfile->sk) earlier in __tun_detach(); maybe one of them already freed the socket, which would explain the behaviour.

But then, when we run sock_put(&tfile->sk) again, wouldn't we run into use-after-free territory when we try to free the socket a second time?

/* Ungrab socket and destroy it, if it was the last reference. */
static inline void sock_put(struct sock *sk)
{
        if (refcount_dec_and_test(&sk->sk_refcnt))
                sk_free(sk);
}
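
To make the concern concrete, here is a small userspace analogue of the put-and-free pattern in C. It is only a sketch: struct obj, obj_new(), obj_hold() and obj_put() are hypothetical names, and C11 atomics stand in for the kernel's refcount_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical refcounted object standing in for struct sock. */
struct obj {
    atomic_int refcnt;
};

/* Allocate an object holding one initial reference. */
static struct obj *obj_new(void)
{
    struct obj *o = malloc(sizeof(*o));
    if (o)
        atomic_init(&o->refcnt, 1);
    return o;
}

/* Take an additional reference. */
static void obj_hold(struct obj *o)
{
    atomic_fetch_add(&o->refcnt, 1);
}

/* Drop one reference; free the object and return true if it was the last. */
static bool obj_put(struct obj *o)
{
    int old = atomic_fetch_sub(&o->refcnt, 1);

    assert(old > 0); /* a put with no reference left is a bug */
    if (old == 1) {
        free(o);
        return true;
    }
    return false;
}
```

In this model, every obj_put() must be balanced by an earlier obj_new() or obj_hold(). One put too many frees the object while another holder still expects it to be alive, and anything that holder does afterwards, including its own final put, touches freed memory.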

I have a second call trace that I have been debugging along with ...


Matthew Ruffell (mruffell) wrote :

ovs-vsctl[51186]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --timeout=120 --oneline --format=json --db=tcp:127.0.0.1:6640 -- --if-exists del-port br-int tap8c883ee5-5f
kernel: device tap8c883ee5-5f left promiscuous mode
lldpd[2309]: removal request for address of fe80::fc16:3eff:fe07:2be2%27, but no knowledge of it
systemd-networkd[1608]: tap8c883ee5-5f: Link DOWN
systemd-networkd[1608]: tap8c883ee5-5f: Lost carrier
kernel: general protection fault: 0000 [#1] SMP NOPTI
kernel: CPU: 41 PID: 25064 Comm: privsep-helper Tainted: G W 5.4.0-81-generic #91~18.04.1-Ubuntu
kernel: Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 07/16/2020
kernel: RIP: 0010:count_subheaders.part.15+0x41/0x60
kernel: Code: 31 e4 53 48 89 fb 48 8b 7b 18 48 85 ff 75 1b 41 bc 01 00 00 00 48 83 c3 40 48 83 3b 00 75 e7 43 8d 04 2c 5b 41 5c 41 5d 5d c3 <48> 83 3f 00 b8 01 00 00 00 74 05 e8 af ff ff ff 41 01 c5 eb d6 31
kernel: RSP: 0018:ffffa8fa4d0437a0 EFLAGS: 00010286
kernel: RAX: 0000000000000001 RBX: ffff98cfefac0800 RCX: 0000000000000000
kernel: RDX: 000000000000001b RSI: ffff98f85782cac0 RDI: 8c0a25048c0abfe4
kernel: RBP: ffffa8fa4d0437b8 R08: 0000000000000000 R09: 000000000000000a
kernel: R10: ffffa8fa4d0438e8 R11: 0000000000031220 R12: 0000000000000000
kernel: R13: 0000000000000000 R14: ffff98cff12e0000 R15: ffff98f85782ca00
kernel: FS: 00007f7f3f9f9700(0000) GS:ffff98e05fb40000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 000000000225b7f8 CR3: 000000248bf1c006 CR4: 00000000007626e0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: PKRU: 55555554
kernel: Call Trace:
kernel: count_subheaders.part.15+0x51/0x60
kernel: unregister_sysctl_table+0x31/0xb0
kernel: unregister_net_sysctl_table+0xe/0x10
kernel: __devinet_sysctl_unregister.isra.25+0x2b/0x50
kernel: devinet_sysctl_unregister+0x29/0x40
kernel: inetdev_event+0x1f0/0x570
kernel: ? skb_dequeue+0x60/0x70
kernel: notifier_call_chain+0x4c/0x70
kernel: ? notifier_call_chain+0x4c/0x70
kernel: ? tun_show_group+0x60/0x60
kernel: raw_notifier_call_chain+0x16/0x20
kernel: call_netdevice_notifiers_info+0x2d/0x60
kernel: rollback_registered_many+0x346/0x520
kernel: ? mem_cgroup_throttle_swaprate+0x1d/0x140
kernel: unregister_netdevice_many.part.127+0x12/0x90
kernel: unregister_netdevice_many+0x16/0x20
kernel: rtnl_delete_link+0x4e/0x80
kernel: rtnl_dellink+0x12d/0x2b0
kernel: ? __nla_parse+0x22/0x30
kernel: ? rtnl_dump_ifinfo+0x360/0x5d0
kernel: ? ns_capable+0x10/0x20
kernel: rtnetlink_rcv_msg+0x296/0x340
kernel: ? aa_label_sk_perm.part.4+0x10f/0x160
kernel: ? _cond_resched+0x19/0x40
kernel: ? rtnl_calcit.isra.30+0x120/0x120
kernel: netlink_rcv_skb+0x51/0x120
kernel: rtnetlink_rcv+0x15/0x20
kernel: netlink_unicast+0x1a4/0x250
kernel: netlink_sendmsg+0x2eb/0x3f0
kernel: sock_sendmsg+0x63/0x70
kernel: __sys_sendto+0x13f/0x180
kernel: ? handle_mm_fault+0xcb/0x210
kernel: ? __do_page_fault+0x2be/0x4d0
kernel: __x64_sys_sendto+0x28/0x30
kernel: do_syscall_64+0x57/0x190
kernel: entry_SYSCALL_64_after_hw...

