ubuntu 19.10: unresponsive/freezes on ThunderX2 if system is idle for ~22min

Bug #1862559 reported by Naresh Bhat
This bug affects 1 person
Affects: gdm3 (Ubuntu)
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned

Bug Description

Ubuntu 19.10 is installed on a ThunderX2 Saber ARM64 machine. If the system is left idle for ~22 minutes, it becomes unresponsive: it no longer accepts any input (UART, ping, etc.), freezes completely, and has to be hard reset. It is possible that the system enters a hibernate state and is unable to come out of it.

dmesg log from when the system halts and stops responding on Saber boards (TX2):

Ubuntu 19.10 ubuntu ttyAMA0

ubuntu login: ubuntu
Password:
Last login: Wed Jan 29 20:08:22 PST 2020 on ttyAMA0
Welcome to Ubuntu 19.10 (GNU/Linux 5.3.0-26-generic aarch64)

* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage

ubuntu@ubuntu:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 19.10
Release: 19.10
Codename: eoan

[ +0.732512] audit: type=1400 audit(1580358920.412:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lsb_release" pid=3087 comm="apparmor_parser"
[ +0.000064] audit: type=1400 audit(1580358920.412:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/ippusbxd" pid=3091 comm="apparmor_parser"
[ +0.000168] audit: type=1400 audit(1580358920.412:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=3086 comm="apparmor_parser"
[ +0.000005] audit: type=1400 audit(1580358920.412:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=3086 comm="apparmor_parser"
[ +0.000546] audit: type=1400 audit(1580358920.412:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=3085 comm="apparmor_parser"
[ +0.000006] audit: type=1400 audit(1580358920.412:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_filter" pid=3085 comm="apparmor_parser"
[ +0.000005] audit: type=1400 audit(1580358920.412:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_groff" pid=3085 comm="apparmor_parser"
[ +0.000854] audit: type=1400 audit(1580358920.412:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/tcpdump" pid=3089 comm="apparmor_parser"
[ +0.002667] audit: type=1400 audit(1580358920.416:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/sbin/dhclient" pid=3093 comm="apparmor_parser"
[ +0.000004] audit: type=1400 audit(1580358920.416:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=3093 comm="apparmor_parser"
[ +1.382579] igb 0000:92:00.1 enp146s0f1: igb: enp146s0f1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[ +0.000299] IPv6: ADDRCONF(NETDEV_CHANGE): enp146s0f1: link becomes ready
[ +3.131173] mpt3sas_cm0: port enable: SUCCESS
[Jan29 20:36] random: crng init done
[ +0.000004] random: 7 urandom warning(s) missed due to ratelimiting
*[Jan29 20:37] rfkill: input handler disabled*
*[Jan29 20:57] PM: suspend entry (deep)*
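
The last line of the log above is a kernel suspend message. A minimal sketch for picking such lines out of a kernel log follows; `find_suspend_events` is a hypothetical helper name, and the pattern assumes the kernel's "PM: suspend entry" / "PM: suspend exit" wording seen in this log.

```shell
# Hypothetical helper: filter kernel log text on stdin and keep only
# the lines that record a suspend being entered or left.
find_suspend_events() {
    grep -E 'PM: suspend (entry|exit)'
}
```

It could be fed from `dmesg | find_suspend_events` on a running system. After a hard reset, `journalctl -k -b -1 | find_suspend_events` reads the previous boot instead, assuming the journal is stored persistently.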

Tags: eoan
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1862559

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: eoan
Ike Panhc (ikepanhc) wrote :

I cannot reproduce this issue with the 5.3.0-29.31 kernel:

$ uname -a;w;sleep 1800;uname -a;w
Linux helo 5.3.0-29-generic #31-Ubuntu SMP Fri Jan 17 17:30:16 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux
 10:22:36 up 45 min, 1 user, load average: 0.04, 0.01, 0.00
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
ubuntu ttyAMA0 - 10:21 2.00s 0.19s 0.08s w
Linux helo 5.3.0-29-generic #31-Ubuntu SMP Fri Jan 17 17:30:16 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux
 10:52:36 up 1:15, 1 user, load average: 0.00, 0.00, 0.00
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
ubuntu ttyAMA0 - 10:21 30:02 0.18s 0.07s w

dann frazier (dannf) wrote :

I also could not reproduce, even with the same kernel version the submitter used (5.3.0-26.28). What firmware are you using? My system (dualla) reports:

BIOS Date: 03/19/2019 14:27:32 Ver: 0ACKL026

Please also attach the apport info requested in Comment #1.

dann frazier (dannf) wrote :

Do you happen to have a desktop installation? If so, it does appear that GNOME now defaults to auto-suspending the system after ~20 minutes of inactivity:
  https://gitlab.gnome.org/GNOME/gnome-control-center/issues/22
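
One way to see whether this auto-suspend policy is active is to read the GNOME power settings. This is a minimal sketch, assuming the `gsettings` CLI and the `org.gnome.settings-daemon.plugins.power` schema are installed; `show_idle_suspend_policy` is a hypothetical helper name.

```shell
# Hypothetical helper: print GNOME's idle-suspend policy on AC power
# for the user running the command.
show_idle_suspend_policy() {
    schema=org.gnome.settings-daemon.plugins.power
    for key in sleep-inactive-ac-type sleep-inactive-ac-timeout; do
        # A type of 'suspend' together with a non-zero timeout (in seconds)
        # means the session suspends the machine after that much idle time.
        printf '%s = %s\n' "$key" "$(gsettings get "$schema" "$key")"
    done
}
```

Calling `show_idle_suspend_policy` reports the values for the current user. The login screen runs as a separate user (assumed here to be `gdm` on Ubuntu), so its values may differ; reading them would likely need something like `sudo -u gdm dbus-launch gsettings get ...`.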

dann frazier (dannf) wrote :

I tried installing ubuntu-desktop on a Saber, but I was still unable to reproduce. That's still my best hypothesis (Comment #4), but I think we're stuck here without more info from Naresh.

Naresh Bhat (nbhat) wrote :

Thank you very much. I have attached the dmesg log collected from the system. Maybe it gives some hints about the freeze behavior. BTW, from the logs it looks like we are using a different BIOS version, TX2-FW-Release-7.4-build_05. I will confirm the equivalent AMI BIOS version. If they are not using the AMI BIOS, I will ask our developers to check with the AMI BIOS version. Thank you very much for the investigations.

Naresh Bhat (nbhat) wrote :

We are still able to reproduce the issue with the latest AMI BIOS firmware, version "BIOS Date: 09/09/2019 16:06:39 Ver: 0ACKL028". Are the logs attached in the previous comment useful?

dann frazier (dannf) wrote :

@Naresh: see my comment #4, asking if you are running Ubuntu Desktop. dmesg doesn't tell me that - but an sosreport would, if you can attach that.

We've recently hit this on an x86 server, and did tie that back to the system having a desktop installed (Nvidia CUDA 11 somehow brings in gdm as a dependency), so I'm still suspecting the same here.

Changed in linux (Ubuntu):
status: Incomplete → Invalid
status: Invalid → Incomplete
Daniel van Vugt (vanvugt) wrote :

Thank you for reporting this bug to Ubuntu.
Ubuntu 19.10 (eoan) reached end-of-life on July 17, 2020.

See this document for currently supported Ubuntu releases:
https://wiki.ubuntu.com/Releases

We appreciate that this bug may be old and you might not be interested in discussing it any more. But if you are then please upgrade to the latest Ubuntu version and re-test. If you then find the bug is still present in the newer Ubuntu version, please add a comment here telling us which new version it is in.

Changed in gdm3 (Ubuntu):
status: New → Incomplete
dann frazier (dannf)
Changed in gdm3 (Ubuntu):
status: Incomplete → Confirmed
Changed in gdm3 (Ubuntu):
status: Confirmed → Incomplete
dann frazier (dannf) wrote :

I asked our desktop team about this, and Iain Lane mentioned that Ubuntu overrides the GNOME default of auto-suspending, but that override only takes effect if you have the ubuntu-settings package installed. I tested this out on a Saber system, and I can confirm that the freeze seems to happen only when gdm is installed/running and ubuntu-settings is *not* installed. That explains why I was unable to reproduce in Comment #5: there I had installed ubuntu-desktop, which brings in ubuntu-settings. It also explains why the CUDA-11 dependency chain *did* cause the problem on the x86 server in Comment #8. In that case, gdm3 is getting installed as a Recommends somewhere in the dependency chain - but it does not pull in ubuntu-settings.

I'll therefore go ahead and close the 'linux' task as Invalid - this isn't a kernel bug. I'll mark the gdm3 task as "Won't Fix": in theory, we could change the defaults inside of gdm3 itself, but we instead recommend that users install ubuntu-settings to override the defaults.
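
The condition described above (gdm3 present, ubuntu-settings absent) can be checked from the package database. This is a minimal sketch, assuming a dpkg-based system; the function names and the `DPKG_QUERY` override variable are hypothetical, and only `dpkg-query` itself is a real tool.

```shell
# DPKG_QUERY can be overridden (e.g. for testing); it defaults to the real tool.
DPKG_QUERY=${DPKG_QUERY:-dpkg-query}

# Succeeds only if the named package is fully installed.
pkg_installed() {
    status=$("$DPKG_QUERY" -W -f='${Status}' "$1" 2>/dev/null) || return 1
    [ "$status" = "install ok installed" ]
}

# Succeeds when gdm3 is installed without the ubuntu-settings override.
auto_suspend_risk() {
    pkg_installed gdm3 && ! pkg_installed ubuntu-settings
}

# Print a one-line summary of the check.
report_auto_suspend_risk() {
    if auto_suspend_risk; then
        echo "gdm3 is installed without ubuntu-settings: GNOME's auto-suspend default applies"
    else
        echo "no gdm3-without-ubuntu-settings combination found"
    fi
}
```

Running `report_auto_suspend_risk` prints the summary. If it reports the risky combination, the recommendation in this comment translates to `sudo apt install ubuntu-settings`.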

Changed in linux (Ubuntu):
status: Incomplete → Invalid
Changed in gdm3 (Ubuntu):
status: Incomplete → Won't Fix
Mathew Hodson (mhodson)
no longer affects: linux (Ubuntu)