unexpected system behaviour after kernel update

Bug #2034701 reported by Hajo Locke
This bug affects 1 person
Affects          Status     Importance  Assigned to  Milestone
linux (Ubuntu)   Confirmed  Undecided   Unassigned

Bug Description

Hello,

Since the latest kernel updates on Ubuntu 20.04 and 22.04, we see lines like these in syslog after boot:

Sep 7 13:10:53 myhost kernel: [ 1.202296] pci 0000:00:15.4: BAR 13: no space for [io size 0x1000]
Sep 7 13:10:53 myhost kernel: [ 1.202387] pci 0000:00:15.4: BAR 13: failed to assign [io size 0x1000]

We see this on Ubuntu 20.04 after installing kernel 5.4.0-162-generic and on Ubuntu 22.04 after installing kernel 5.15.0-83-generic.
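
To check many hosts for the same boot messages, the two log lines above can be matched mechanically. The following is only a minimal sketch: the function name `find_bar_failures` and the third sample line are hypothetical, and the pattern is derived solely from the two messages quoted in this report, so other variants of the message may not match.

```python
import re

# Pattern derived from the two kernel messages quoted in this report, e.g.
#   pci 0000:00:15.4: BAR 13: no space for [io size 0x1000]
#   pci 0000:00:15.4: BAR 13: failed to assign [io size 0x1000]
BAR_RE = re.compile(
    r"pci (?P<device>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"BAR (?P<bar>\d+): (?P<problem>no space for|failed to assign) "
    r"\[(?P<kind>io|mem) size (?P<size>0x[0-9a-f]+)"
)


def find_bar_failures(lines):
    """Return one dict per PCI BAR allocation failure found in the log lines."""
    failures = []
    for line in lines:
        match = BAR_RE.search(line)
        if match:
            entry = match.groupdict()
            entry["bar"] = int(entry["bar"])
            entry["size"] = int(entry["size"], 16)
            failures.append(entry)
    return failures


sample = [
    "Sep 7 13:10:53 myhost kernel: [ 1.202296] pci 0000:00:15.4: BAR 13: no space for [io size 0x1000]",
    "Sep 7 13:10:53 myhost kernel: [ 1.202387] pci 0000:00:15.4: BAR 13: failed to assign [io size 0x1000]",
    # Hypothetical unrelated line, only here to show that it is skipped.
    "Sep 7 13:10:54 myhost systemd[1]: an unrelated message",
]

for failure in find_bar_failures(sample):
    print(failure["device"], failure["bar"], failure["problem"], hex(failure["size"]))
```

In practice the list of lines would come from reading /var/log/syslog on each host instead of the inline sample.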

Since then we have noticed some strange server behaviour:

- unexpected reboots during working hours
- systems hanging/freezing, which requires a hard stop of the machine
- systems with high load but no significant number of processes, even on non-productive machines that usually have zero load

We run most of our Ubuntu servers as virtual machines in a VMware environment with vSphere and ESXi 7.0.3.
There were no VMware updates. I can say this with certainty because I am also the person responsible for VMware, and I did not install any ESXi updates in the last few days.
Other operating systems are not affected, as far as I can see.

Thank you,
Hans

P.S. I wanted to help by choosing the fitting package from the dropdown "In what package did you find this bug?" above this input field, but packages like linux-image-5.15.0-83-generic (or similar, I tried several) seem to be unknown, and the search responds with unsuitable suggestions.

Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Libera.chat.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/2034701/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
Paul White (paulw2u)
affects: ubuntu → linux (Ubuntu)
tags: added: focal jammy
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 2034701

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Hajo Locke (hajo-locke) wrote :

Our servers are not connected to the public internet, so I can't run apport-collect:

ERROR: connecting to Launchpad failed: [Errno 110] Connection timed out
You can reset the credentials by removing the file "/root/.cache/apport/launchpad.credentials"

This morning we saw the same as yesterday: again 2 servers were down and I had to hard-stop them.
Login via ssh was possible with extreme delay; any other commands like reboot etc. failed, nothing happened at all.

Hajo

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Hajo Locke (hajo-locke) wrote :
Download full text (25.9 KiB)

I collected some more log lines, produced by the non-productive member of an haproxy failover system that usually has zero load. Over the weekend several servers were affected. I think this is serious.

Sep 8 17:01:06 myhost kernel: [115882.039808] CIFS: Attempting to mount \\host123\folder
Sep 8 17:01:06 myhost systemd[1]: mnt-win_share.mount: Deactivated successfully.
Sep 8 17:01:06 myhost kernel: [115882.497413] BUG: kernel NULL pointer dereference, address: 0000000000000040
Sep 8 17:01:06 myhost kernel: [115882.497460] #PF: supervisor read access in kernel mode
Sep 8 17:01:06 myhost kernel: [115882.497485] #PF: error_code(0x0000) - not-present page
Sep 8 17:01:06 myhost kernel: [115882.497502] PGD 0 P4D 0
Sep 8 17:01:06 myhost kernel: [115882.497515] Oops: 0000 [#1] SMP PTI
Sep 8 17:01:06 myhost kernel: [115882.497530] CPU: 1 PID: 1046703 Comm: kworker/1:2 Not tainted 5.15.0-83-generic #92-Ubuntu
Sep 8 17:01:06 myhost kernel: [115882.497560] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
Sep 8 17:01:06 myhost kernel: [115882.497598] Workqueue: cifsoplockd cifs_oplock_break [cifs]
Sep 8 17:01:06 myhost kernel: [115882.497691] RIP: 0010:cifs_oplock_break+0x202/0x5c0 [cifs]
Sep 8 17:01:06 myhost kernel: [115882.497738] Code: 89 45 b8 c0 eb 02 83 e3 01 e8 aa f6 ff ff 84 db 75 34 49 8b 47 48 4c 89 e2 0f b7 4d b8 4d 89 f0 4c 89 ee 4c 89 ff 48 8b 40 48 <48> 8b 40 40 48 8b 80 28 02 00 00 ff d0 0f 1f 00 41 89 c4 f6 05 30
Sep 8 17:01:06 myhost kernel: [115882.497794] RSP: 0018:ffffa74603d87e10 EFLAGS: 00010246
Sep 8 17:01:06 myhost kernel: [115882.497812] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Sep 8 17:01:06 myhost kernel: [115882.497834] RDX: 000000000000000b RSI: 236fcc211deddb5a RDI: ffff94088f58e000
Sep 8 17:01:06 myhost kernel: [115882.497856] RBP: ffffa74603d87e70 R08: ffff9408a1cd6a40 R09: 0000000000000000
Sep 8 17:01:06 myhost kernel: [115882.497879] R10: ffff9408a2da3000 R11: ffffffffffffff00 R12: 000000000000000b
Sep 8 17:01:06 myhost kernel: [115882.497902] R13: 236fcc211deddb5a R14: ffff9408a1cd6a40 R15: ffff94088f58e000
Sep 8 17:01:06 myhost kernel: [115882.497925] FS: 0000000000000000(0000) GS:ffff940bafc80000(0000) knlGS:0000000000000000
Sep 8 17:01:06 myhost kernel: [115882.497951] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 8 17:01:06 myhost kernel: [115882.497971] CR2: 0000000000000040 CR3: 00000001008dc005 CR4: 00000000007706e0
Sep 8 17:01:06 myhost kernel: [115882.498018] PKRU: 55555554
Sep 8 17:01:06 myhost kernel: [115882.498029] Call Trace:
Sep 8 17:01:06 myhost kernel: [115882.498040] <TASK>
Sep 8 17:01:06 myhost kernel: [115882.498051] ? show_trace_log_lvl+0x1d6/0x2ea
Sep 8 17:01:06 myhost kernel: [115882.498071] ? show_trace_log_lvl+0x1d6/0x2ea
Sep 8 17:01:06 myhost kernel: [115882.498098] ? process_one_work+0x228/0x3d0
Sep 8 17:01:06 myhost kernel: [115882.498662] ? show_regs.part.0+0x23/0x29
Sep 8 17:01:06 myhost kernel: [115882.499192] ? __die_body.cold+0x8/0xd
Sep 8 17:01:06 myhost kernel: [115882.499703] ? __die+0x2b/0x37
Sep 8 17:01:06 myhost kernel: [115882.500219] ? page_fault_oops+0x13b...
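
When comparing such traces across several servers, it helps to reduce each oops to a few key fields: the kind of fault, the faulting address, the running kernel and the function named in the RIP line. The sketch below assumes the line layout seen in the log above; `summarize_oops` and the regular expressions are illustrative, not part of any kernel or Ubuntu tooling.

```python
import re

# Patterns written against the oops lines quoted in this report.
BUG_RE = re.compile(r"BUG: (?P<what>.+?), address: (?P<address>[0-9a-f]+)")
CPU_RE = re.compile(r"Comm: (?P<comm>\S+) .*? (?P<release>\d+\.\d+\.\d+-\S+) #")
RIP_RE = re.compile(
    r"RIP: [0-9a-f]+:(?P<symbol>[\w.]+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
    r"(?: \[(?P<module>\w+)\])?"
)


def summarize_oops(lines):
    """Collect the first occurrence of each key field from kernel oops lines."""
    summary = {}
    for line in lines:
        for pattern in (BUG_RE, CPU_RE, RIP_RE):
            match = pattern.search(line)
            if match:
                for key, value in match.groupdict().items():
                    # Keep the first value seen for each field.
                    summary.setdefault(key, value)
    return summary


sample = [
    "Sep 8 17:01:06 myhost kernel: [115882.497413] BUG: kernel NULL pointer dereference, address: 0000000000000040",
    "Sep 8 17:01:06 myhost kernel: [115882.497530] CPU: 1 PID: 1046703 Comm: kworker/1:2 Not tainted 5.15.0-83-generic #92-Ubuntu",
    "Sep 8 17:01:06 myhost kernel: [115882.497691] RIP: 0010:cifs_oplock_break+0x202/0x5c0 [cifs]",
]

summary = summarize_oops(sample)
print(summary["what"], "at", summary["address"])
print("in", summary["symbol"], "module", summary["module"], "on", summary["release"])
```

For the trace above this points at `cifs_oplock_break` in the `cifs` module, running from a `cifsoplockd` workqueue worker on 5.15.0-83-generic.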

Hajo Locke (hajo-locke) wrote :

It seems that Ubuntu 22.04 is affected in a critical way. So far we have had freezes only on Ubuntu 22.04, but on Ubuntu 20.04 there are similar log lines.
After some hours of running, Ubuntu 22.04 crashes and kernel/systemd are damaged.

Hajo Locke (hajo-locke) wrote :

uname

Linux myhost 5.15.0-83-generic #92-Ubuntu SMP Mon Aug 14 09:30:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Hajo Locke (hajo-locke) wrote :

/proc/version_signature
Ubuntu 5.15.0-83.92-generic 5.15.116

Hajo Locke (hajo-locke) wrote :
Download full text (102.5 KiB)

dmesg

[ 0.000000] Linux version 5.15.0-83-generic (buildd@lcy02-amd64-027) (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #92-Ubuntu SMP Mon Aug 14 09:30:42 UTC 2023 (Ubuntu 5.15.0-83.92-generic 5.15.116)
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic root=UUID=16488428-7c39-44f3-b82c-95b445a58fea ro
[ 0.000000] KERNEL supported cpus:
[ 0.000000] Intel GenuineIntel
[ 0.000000] AMD AuthenticAMD
[ 0.000000] Hygon HygonGenuine
[ 0.000000] Centaur CentaurHauls
[ 0.000000] zhaoxin Shanghai
[ 0.000000] Disabled fast string operations
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009f3ff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009f400-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000dc000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bfedffff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000bfee0000-0x00000000bfefefff] ACPI data
[ 0.000000] BIOS-e820: [mem 0x00000000bfeff000-0x00000000bfefffff] ACPI NVS
[ 0.000000] BIOS-e820: [mem 0x00000000bff00000-0x00000000bfffffff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000f0000000-0x00000000f7ffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec0ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fffe0000-0x00000000ffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000043fffffff] usable
[ 0.000000] NX (Execute Disable) protection: active
[ 0.000000] SMBIOS 2.7 present.
[ 0.000000] DMI: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
[ 0.000000] vmware: hypercall mode: 0x02
[ 0.000000] Hypervisor detected: VMware
[ 0.000000] vmware: TSC freq read from hypervisor : 2099.999 MHz
[ 0.000000] vmware: Host bus clock speed read from hypervisor : 66000000 Hz
[ 0.000000] vmware: using clock offset of 6397541515 ns
[ 0.000016] tsc: Detected 2099.999 MHz processor
[ 0.002306] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[ 0.002313] e820: remove [mem 0x000a0000-0x000fffff] usable
[ 0.002324] last_pfn = 0x440000 max_arch_pfn = 0x400000000
[ 0.002374] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT
[ 0.002397] total RAM covered: 31744M
[ 0.002629] Found optimal setting for mtrr clean up
[ 0.002631] gran_size: 64K chunk_size: 64K num_reg: 5 lose cover RAM: 0G
[ 0.002709] e820: update [mem 0xc0000000-0xffffffff] usable ==> reserved
[ 0.002720] last_pfn = 0xc0000 max_arch_pfn = 0x400000000
[ 0.008571] found SMP MP-table at [mem 0x000f6a70-0x000f6a7f]
[ 0.008609] Using GB pages for direct mapping
[ 0.008902] RAMDISK: [mem 0x2aee3000-0x31768fff]
[ 0.008911] ACPI: Early table checksum verification disabled
[ 0.008915] ACPI: RSDP 0x00000000000F6A00 000024 (v02 PTLTD )
[ 0.008921] ACPI: XSDT 0x00000000BFEEE8FB 00005C (v01 INTEL 440BX 06040000 VMW 01324272)
[ 0.008929] ACPI: FACP 0x00000000BFEF...

Hajo Locke (hajo-locke) wrote :
Download full text (138.5 KiB)

lspci -vvnn

00:00.0 Host bridge [0600]: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge [8086:7190] (rev 01)
        Subsystem: VMware Virtual Machine Chipset [15ad:1976]
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Kernel driver in use: agpgart-intel

00:01.0 PCI bridge [0604]: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge [8086:7191] (rev 01) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle+ MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap- 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=64
        I/O behind bridge: [disabled]
        Memory behind bridge: [disabled]
        Prefetchable memory behind bridge: [disabled]
        Secondary status: 66MHz+ FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity- SERR- NoISA+ VGA- VGA16- MAbort- >Reset- FastB2B+
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-

00:07.0 ISA bridge [0601]: Intel Corporation 82371AB/EB/MB PIIX4 ISA [8086:7110] (rev 08)
        Subsystem: VMware Virtual Machine Chipset [15ad:1976]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0

00:07.1 IDE interface [0101]: Intel Corporation 82371AB/EB/MB PIIX4 IDE [8086:7111] (rev 01) (prog-if 8a [ISA Compatibility mode controller, supports both channels switched to PCI native mode, supports bus mastering])
        Subsystem: VMware Virtual Machine Chipset [15ad:1976]
        Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 64
        Region 0: Memory at 000001f0 (32-bit, non-prefetchable) [virtual] [size=8]
        Region 1: Memory at 000003f0 (type 3, non-prefetchable) [virtual]
        Region 2: Memory at 00000170 (32-bit, non-prefetchable) [virtual] [size=8]
        Region 3: Memory at 00000370 (type 3, non-prefetchable) [virtual]
        Region 4: I/O ports at 1060 [virtual] [size=16]
        Kernel driver in use: ata_piix
        Kernel modules: pata_acpi

00:07.3 Bridge [0680]: Intel Corporation 82371AB/EB/MB PIIX4 ACPI [8086:7113] (rev 08)
        Subsystem: VMware Virtual Machine Chipset [15ad:1976]
        Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin ? routed to IRQ 9
        Kernel modules: i2c_piix4

00:07.7 System peripheral [0880]: VMware Virtual Machine Communication Interface [15ad:0740] (rev 10...

Hajo Locke (hajo-locke) wrote :

I found a workaround, but I am still convinced that we have some kind of bug.
I think I should explain our typical system setup, for better understanding.

The typical field of application is failover systems. Overall we run very little software and the systems have minimal load.
We have 2 servers in a cluster realised with pacemaker/corosync. They manage an haproxy resource and a floating IP. We do this with different Ubuntu releases, from 18.04 through 20.04 to 22.04.

Our systems are bound to Windows Active Directory with SSSD (System Security Services Daemon) (https://schroeffu.ch/2019/09/linux-active-directory-ldap-ssh-login-mit-sssd-und-realmd/), so it is possible to log in with our AD credentials.

The last component is the Altiris Server Management Suite agent (formerly Symantec, now Broadcom), which runs with root privileges and helps to manage our computer landscape. And this is where I located the problem.

Every evening the agent runs a bash script that I wrote 3 years ago. It is a small script of 90 lines: it collects some data, mounts a Windows file share and finally uploads some small files before unmounting the share. Nothing special, it takes around 5 seconds to complete, but this seems to be where the problem is.

As far as I can see, every affected server shows in syslog the lines about the kernel bug that I uploaded on 2023-09-11 (#4). In some cases something unexpected happens and triggers this bug. This has been happening since 5.15.0-83-generic.
The system becomes unstable: high load without running processes, and every command takes forever to complete. Mostly we had to do a VMware hard stop, because even "reboot -f" failed. I have already uploaded some logs.
I deactivated this job and the problems disappeared. I was not able to trigger the problem by running the script manually. While the job was active, every morning we had a bunch of servers in this state between life and death.
So I cannot confirm a change on our side; I still think this is a newly introduced bug.
I would like to hear from you, please tell me your opinion on this case. It is strange that I report a bug with a documented kernel error and no one gets back to me.

Thanks,
Hajo
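
The link described in the comment above, a kernel BUG appearing right after the nightly CIFS mount, can be checked per host by comparing the kernel uptime stamps in syslog. This is only a sketch under stated assumptions: `oops_after_cifs_mount` is a hypothetical helper, and the default 5-second window is taken from the reported runtime of the script, not from any kernel behaviour.

```python
import re

# Kernel lines in syslog carry an uptime stamp in brackets, e.g. "[115882.039808]".
STAMP_RE = re.compile(r"kernel: \[\s*(?P<uptime>\d+\.\d+)\] (?P<message>.*)")


def oops_after_cifs_mount(lines, window=5.0):
    """Return (mount_uptime, bug_uptime) pairs where a kernel BUG line
    follows a CIFS mount attempt within `window` seconds."""
    pairs = []
    last_mount = None
    for line in lines:
        match = STAMP_RE.search(line)
        if not match:
            continue  # not a kernel line, e.g. a systemd message
        uptime = float(match.group("uptime"))
        message = match.group("message")
        if message.startswith("CIFS: Attempting to mount"):
            last_mount = uptime
        elif message.startswith("BUG:") and last_mount is not None:
            if 0 <= uptime - last_mount <= window:
                pairs.append((last_mount, uptime))
    return pairs


# Lines copied from the syslog excerpt posted earlier in this report.
sample = [
    r"Sep 8 17:01:06 myhost kernel: [115882.039808] CIFS: Attempting to mount \\host123\folder",
    "Sep 8 17:01:06 myhost systemd[1]: mnt-win_share.mount: Deactivated successfully.",
    "Sep 8 17:01:06 myhost kernel: [115882.497413] BUG: kernel NULL pointer dereference, address: 0000000000000040",
]

for mount_time, bug_time in oops_after_cifs_mount(sample):
    print(f"BUG {bug_time - mount_time:.3f}s after CIFS mount attempt")
```

Running this over the syslog of each affected server would show whether every oops follows a mount attempt from the nightly job, or whether some occur independently of it.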
