watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [gnome-shell:1112]

Bug #1796385 reported by Bounty
Affects         Status       Importance   Assigned to   Milestone
Linux           Confirmed    Medium
linux (Ubuntu)  Incomplete   Medium       Unassigned

Bug Description

I can use the system for a while, then at random, the screen blinks and freezes. Must reboot.
Seems to happen both with Wayland and Xorg.

ProblemType: KernelOops
DistroRelease: Ubuntu 18.10
Package: linux-image-4.18.0-8-generic 4.18.0-8.9
ProcVersionSignature: Ubuntu 4.18.0-8.9-generic 4.18.7
Uname: Linux 4.18.0-8-generic x86_64
Annotation: Your system might become unstable now and might need to be restarted.
ApportVersion: 2.20.10-0ubuntu11
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: greg 2039 F.... pulseaudio
 /dev/snd/controlC1: greg 2039 F.... pulseaudio
Date: Tue Oct 2 15:56:23 2018
Failure: oops
InstallationDate: Installed on 2018-09-28 (7 days ago)
InstallationMedia: Ubuntu 18.10 "Cosmic Cuttlefish" - Beta amd64 (20180927)
IwConfig:
 lo no wireless extensions.

 eno1 no wireless extensions.
MachineType: Gigabyte Technology Co., Ltd. Default string
ProcFB: 0 radeondrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.18.0-8-generic root=UUID=ce06b10d-2a7f-49db-a15b-85554d9a7e4d ro quiet splash vt.handoff=1
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions: kerneloops-daemon N/A
RfKill:

SourcePackage: linux
Title: watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [gnome-shell:1112]
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 06/07/2017
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: F23
dmi.board.asset.tag: Default string
dmi.board.name: X99-UD4-CF
dmi.board.vendor: Gigabyte Technology Co., Ltd.
dmi.board.version: x.x
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 3
dmi.chassis.vendor: Default string
dmi.chassis.version: Default string
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvrF23:bd06/07/2017:svnGigabyteTechnologyCo.,Ltd.:pnDefaultstring:pvrDefaultstring:rvnGigabyteTechnologyCo.,Ltd.:rnX99-UD4-CF:rvrx.x:cvnDefaultstring:ct3:cvrDefaultstring:
dmi.product.family: Default string
dmi.product.name: Default string
dmi.product.sku: Default string
dmi.product.version: Default string
dmi.sys.vendor: Gigabyte Technology Co., Ltd.

Revision history for this message
Bounty (gregr-arsfabula) wrote :
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
Cristian Aravena Romero (caravena) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds. Please test the latest v4.19-rc7 kernel [0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19-rc7/
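
For reference, a minimal sketch of installing a mainline test kernel on an amd64 system; the .deb file names vary per build, so take them from the index page above rather than from this example:

# Download the kernel .deb packages for amd64 listed on the v4.19-rc7 page
# above into an empty directory, then install them together and reboot:
cd "$(mktemp -d)"
# ...download the .deb files here (e.g. with wget)...
sudo dpkg -i ./linux-*.deb
sudo reboot
# After the reboot, confirm the running kernel before re-testing:
uname -r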

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Changed in linux (Ubuntu):
importance: Undecided → Medium
Revision history for this message
Bounty (gregr-arsfabula) wrote :

This issue started with the upgrade to the 18.10 beta; I was not having it before.
I installed v4.19-rc7 and still have the issue, though it seems to happen less frequently.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Bounty (gregr-arsfabula)
tags: added: kernel-bug-exists-upstream
Revision history for this message
In , caravena (caravena-linux-kernel-bugs) wrote :

Hello,

Bug opened in launchpad.net:
https://bugs.launchpad.net/bugs/1796385

"I can use the system for a while, then at random, the screen blinks and freezes. Must reboot.
Seems to happen both with Wayland and Xorg."

Best regards,
--
Cristian Aravena Romero (caravena)

Revision history for this message
Cristian Aravena Romero (caravena) wrote :

https://bugzilla.kernel.org/show_bug.cgi?id=201379
--
Cristian Aravena Romero (caravena)

tags: added: rls-cc-incoming
Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :

Talk about shooting the messenger. The kernel's soft-lockup detector (kernel/watchdog.c) is reporting a stalled CPU; it only reports the problem, it does not cause it. It is also not a problem with the watchdog driver subsystem, which is not even involved.
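
As a side note, the detector's configuration can be inspected on the affected machine with the standard kernel sysctls; a small sketch (a soft lockup is reported when a CPU hogs the kernel for about 2 * watchdog_thresh seconds):

sysctl kernel.watchdog kernel.soft_watchdog kernel.nmi_watchdog
sysctl kernel.watchdog_thresh
# Optionally make the next soft lockup fatal so a crash dump can be captured:
# sudo sysctl kernel.softlockup_panic=1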

Revision history for this message
In , caravena (caravena-linux-kernel-bugs) wrote :

@Guenter,

Could you move this report to the appropriate product/component in bugzilla, if it does not belong under 'watchdog'?

Best regards,
--
Cristian Aravena Romero (caravena)

Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :

The bug in launchpad, unless I am missing something, does not provide a single actionable traceback. I don't think it is even possible to identify where exactly the CPU hangs unless additional information is provided. There is no traceback in dmesg, and OopsText doesn't include it either.

Given that, it is not possible to identify the responsible subsystem, much less to fix the underlying problem. The only thing we can say for sure is that it is _not_ a watchdog driver problem.
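
One way to capture that missing traceback on the next occurrence, as a rough sketch assuming systemd's persistent journal can be enabled on the reporter's machine:

# Enable a persistent journal so kernel messages survive the forced reboot:
sudo mkdir -p /var/log/journal
sudo systemctl restart systemd-journald
# After the next freeze and reboot, pull the kernel log of the previous boot
# and look for the soft lockup splat and its call trace:
journalctl -k -b -1 | grep -B 5 -A 40 'soft lockup'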

Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :

Also, I don't think I have permission to change any of the bug status fields.

Revision history for this message
In , caravena (caravena-linux-kernel-bugs) wrote :

@Guenter,

I can change it, but I do not know which 'Product' and 'Component' to use.

Best regards,
--
Cristian Aravena Romero (caravena)

Revision history for this message
Cristian Aravena Romero (caravena) wrote :

https://bugzilla.kernel.org/show_bug.cgi?id=201379#c3

We are missing the 'Call Trace'.
--
Cristian Aravena Romero (caravena)

Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :

Unfortunately we do not have information to determine 'Product' and 'Component'.

The only information we have is that the hanging process is gnome-shell (or at least that this was the case in at least one instance), that the screen blinks and freezes when the problem is observed, and that the hanging CPU served most of the graphics card interrupts. If it is persistent, it _might_ suggest that graphics (presumably the Radeon graphics driver and/or the graphics hardware) is involved. This would be even more likely if the observed PCIe errors point to the graphics card (not sure if the information provided shows the PCIe bus tree; if so I have not found it).
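
A few rough checks along those lines, as a sketch (the grep patterns assume the radeon driver reported in ProcFB above):

grep -i radeon /proc/interrupts   # which CPU column is absorbing the GPU interrupts
lspci -tv                         # PCIe bus tree, to match any reported PCIe/AER errors
journalctl -k | grep -iE 'AER|pcie bus error|radeon'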

Revision history for this message
Cristian Aravena Romero (caravena) wrote :

https://bugzilla.kernel.org/show_bug.cgi?id=201379#c6

"Unfortunately we do not have information to determine 'Product' and 'Component'.

The only information we have is that the hanging process is gnome-shell (or at least that this was the case in at least one instance), that the screen blinks and freezes when the problem is observed, and that the hanging CPU served most of the graphics card interrupts. If it is persistent, it _might_ suggest that graphics (presumably the Radeon graphics driver and/or the graphics hardware) is involved. This would be even more likely if the observed PCIe errors point to the graphics card (not sure if the information provided shows the PCIe bus tree; if so I have not found it)."
--
Cristian Aravena Romero (caravena)

Revision history for this message
Cristian Aravena Romero (caravena) wrote :

@Bounty

Could you temporarily change the video card to rule out problems with it?

Your current video card is:
[AMD/ATI] Curacao PRO [Radeon R7 370 / R9 270/370 OEM]

Best regards,
--
Cristian Aravena Romero (caravena)
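
Before swapping hardware, it may also help to confirm which card and kernel driver are actually bound; a small sketch:

lspci -nnk | grep -A 3 -i 'vga\|3d'
# The "Kernel driver in use:" line should name radeon (or amdgpu) for this card.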

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Cristian Aravena Romero (caravena) wrote :

Hello,

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1796385/comments/6

This suggests another error -> Bug 1797625
--
Cristian Aravena Romero (caravena)

Revision history for this message
Bounty (gregr-arsfabula) wrote :

Hello,

I won't be able to test another video card quickly, sorry about that.

Greg

Changed in linux:
importance: Unknown → Medium
status: Unknown → Confirmed
Revision history for this message
Amer Hwitat (amer.hwitat) wrote :
Revision history for this message
Amer Hwitat (amer.hwitat) wrote :

Message from syslogd@amer at Jan 27 19:26:19 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 26s! [swapper/5:0]

Message from syslogd@amer at Jan 27 19:26:19 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#1 stuck for 27s! [dmeventd:71548]

Message from syslogd@amer at Jan 27 19:27:30 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [6_scheduler:64928]

Message from syslogd@amer at Jan 27 19:31:25 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [ksoftirqd/5:34]

Message from syslogd@amer at Jan 27 19:32:42 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#3 stuck for 33s! [swift-object-up:11358]

Message from syslogd@amer at Jan 27 19:33:55 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#3 stuck for 24s! [dmeventd:71548]

Message from syslogd@amer at Jan 27 19:34:25 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 65s! [kworker/2:0:59993]

Message from syslogd@amer at Jan 27 19:37:50 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 24s! [kworker/u256:3:8447]

Message from syslogd@amer at Jan 27 19:37:50 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [ksoftirqd/5:34]

Message from syslogd@amer at Jan 27 19:37:51 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [systemd:11968]

The CPU has been disabled by the guest operating system. Power off or reset the virtual machine.

Revision history for this message
In , amer.hwaitat (amer.hwaitat-linux-kernel-bugs) wrote :

Message from syslogd@amer at Jan 27 19:26:19 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 26s! [swapper/5:0]

Message from syslogd@amer at Jan 27 19:26:19 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#1 stuck for 27s! [dmeventd:71548]

Message from syslogd@amer at Jan 27 19:27:30 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [6_scheduler:64928]

Message from syslogd@amer at Jan 27 19:31:25 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [ksoftirqd/5:34]

Message from syslogd@amer at Jan 27 19:32:42 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#3 stuck for 33s! [swift-object-up:11358]

Message from syslogd@amer at Jan 27 19:33:55 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#3 stuck for 24s! [dmeventd:71548]

Message from syslogd@amer at Jan 27 19:34:25 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 65s! [kworker/2:0:59993]

Message from syslogd@amer at Jan 27 19:37:50 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 24s! [kworker/u256:3:8447]

Message from syslogd@amer at Jan 27 19:37:50 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [ksoftirqd/5:34]

Message from syslogd@amer at Jan 27 19:37:51 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [systemd:11968]

The CPU has been disabled by the guest operating system. Power off or reset the virtual machine.

On my side I upgraded to VMware 14 instead of using VMware 12, on RHEL 7.6 (Maipo).

Revision history for this message
Amer Hwitat (amer.hwitat) wrote :

[root@localhost network-scripts]# systemctl status network -l
● network.service - LSB: Bring up/down networking
   Loaded: loaded (/etc/rc.d/init.d/network; bad; vendor preset: disabled)
   Active: failed (Result: exit-code) since Sat 2019-01-19 03:47:01 EST; 21s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 86319 ExecStop=/etc/rc.d/init.d/network stop (code=exited, status=0/SUCCESS)
  Process: 86591 ExecStart=/etc/rc.d/init.d/network start (code=exited, status=1/FAILURE)
    Tasks: 0

Jan 19 03:47:01 localhost.localdomain dhclient[86963]: Please report for this software via the Red Hat Bugzilla site:
Jan 19 03:47:01 localhost.localdomain dhclient[86963]: http://bugzilla.redhat.com
Jan 19 03:47:01 localhost.localdomain dhclient[86963]: ution.
Jan 19 03:47:01 localhost.localdomain dhclient[86963]: exiting.
Jan 19 03:47:01 localhost.localdomain network[86591]: failed.
Jan 19 03:47:01 localhost.localdomain network[86591]: [FAILED]
Jan 19 03:47:01 localhost.localdomain systemd[1]: network.service: control process exited, code=exited status=1
Jan 19 03:47:01 localhost.localdomain systemd[1]: Failed to start LSB: Bring up/down networking.
Jan 19 03:47:01 localhost.localdomain systemd[1]: Unit network.service entered failed state.
Jan 19 03:47:01 localhost.localdomain systemd[1]: network.service failed.
[root@localhost network-scripts]#

[root@localhost log]#
Message from syslogd@localhost at Jan 23 02:23:31 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [ovsdb-server:10088]

[root@amer network-scripts]#
Message from syslogd@amer at Jan 27 12:46:38 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [nova-api:102738]

Message from syslogd@amer at Jan 27 19:26:19 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 26s! [swapper/5:0]

Message from syslogd@amer at Jan 27 19:26:19 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#1 stuck for 27s! [dmeventd:71548]

Message from syslogd@amer at Jan 27 19:27:30 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [6_scheduler:64928]

Message from syslogd@amer at Jan 27 19:31:25 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [ksoftirqd/5:34]

Message from syslogd@amer at Jan 27 19:32:42 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#3 stuck for 33s! [swift-object-up:11358]

Message from syslogd@amer at Jan 27 19:33:55 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#3 stuck for 24s! [dmeventd:71548]

Message from syslogd@amer at Jan 27 19:34:25 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 65s! [kworker/2:0:59993]

Message from syslogd@amer at Jan 27 19:37:50 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 24s! [kworker/u256:3:8447]

Message from syslogd@amer at Jan 27 19:37:50 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [ksoftirqd/5:34]

Message from syslogd@amer at Jan 27 19:37:51 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [systemd:11968]

The CPU has been disabled by the guest operating system. Power off or reset the virtual machine.

Revision history for this message
In , tony.freeman (tony.freeman-linux-kernel-bugs) wrote :

I'm seeing the same here.

Using Red Hat 7.7

kernel version: 3.10.0-1062.8.1.el7.x86_64

All 10 of my machines are on the same hardware and RHEL 7 release; just this one machine is reporting the problem. Eventually the machine becomes unusable and a hard reboot is needed.

Revision history for this message
In , amer.hwaitat (amer.hwaitat-linux-kernel-bugs) wrote :

I had the same problem with OSP on RHEL 7.6.

In my case it turned out to be a network problem. If that seems relevant, please check network connectivity and/or I/O latency on your VM, and check the logs for both.

good luck
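
A few quick checks in that direction, as a sketch (iostat comes from the sysstat package, and the address below is only a hypothetical example):

ping -c 5 192.168.122.1            # substitute your actual gateway or peer address
iostat -x 5 3                      # per-device I/O latency (await) and utilisation
dmesg -T | grep -i 'soft lockup'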

Revision history for this message
In , amer.hwaitat (amer.hwaitat-linux-kernel-bugs) wrote :

You can also check the RAID card, since a defective controller is a common cause of problems on servers, and check disk I/O latency: a failing disk will hurt performance. I have also seen this on HDDs for services that really needed an SSD, as a vendor had recommended to me.

I checked messages and audit.log for errors; if you are using OSP on your machines, check the related nova.log ... etc.

BR
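
Possible follow-up checks for the disk/RAID theory, sketched with smartmontools (/dev/sda below is only an example device):

sudo smartctl -H -a /dev/sda       # overall health plus full SMART attributes
grep -iE 'i/o error|ata[0-9]+.*error' /var/log/messages | tail -n 50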

Revision history for this message
In , tony.freeman (tony.freeman-linux-kernel-bugs) wrote :

I opened up the server, re-seated the cards, and vacuumed it out a couple of days ago. I'll know whether things are okay next week.

Revision history for this message
In , amer.hwaitat (amer.hwaitat-linux-kernel-bugs) wrote :

Hi,

grep -i error /var/log/messages >> messages-errors.txt
grep -i error /var/log/nova/nova.log >> nova-errors.txt
grep -i error /var/log/audit/audit.log >> audit-errors.txt

You will find the .txt files in the directory you run the commands from.

Take time to trace the errors; maybe you will find answers.

If you have Dell PowerEdge servers, RAID controller (RC) problems are common.

Otherwise you may have to check the network connections between the server and the switch, replace the UTP cables, and ping the other servers. RabbitMQ in OSP depends on a heartbeat to stay in sync between servers; if the heartbeat fails, it causes this.

Best regards
Amer Hwitat
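
If the RabbitMQ heartbeat theory applies, the cluster state can be checked from one of the OSP controllers; a sketch (the peer hostname is a placeholder):

sudo rabbitmqctl cluster_status          # a non-empty "partitions" section indicates a split
ping -c 5 other-controller.example.com   # placeholder; substitute a real peer controller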

Revision history for this message
In , amer.hwaitat (amer.hwaitat-linux-kernel-bugs) wrote :

Hi,

You can also check the journal with journalctl:

journalctl -l | grep -i error > journal-errors.txt

cheers

Revision history for this message
In , tony.freeman (tony.freeman-linux-kernel-bugs) wrote :

Thanks ... I reviewed the log files this morning and had a look at the output from journalctl. It appears the system is good to go. I guess going through, blowing out the machine, and re-seating everything in its slots helped. Thank you for your assistance!
