Hang after upgrade to 16.04

Bug #1680502 reported by Arlie Stephens
This bug affects 1 person
Affects          Status   Importance  Assigned to  Milestone
linux (Ubuntu)   Expired  High        Unassigned
Xenial           Expired  High        Unassigned

Bug Description

Last week I upgraded from 12.04 LTS to 14.04 LTS and then immediately to 16.04 LTS.

12.04 was not entirely stable; something was crashing regularly, and the Ubuntu tools make it hard for a user to determine what. The upgrade went moderately well; I now get error messages during system startup (about an unnamed file not being found) and a couple of other bits of flakiness, but I counted it as a success and the system as functional.

This morning I tried to wake up my screen, and nothing much happened. I then attempted to ssh to the ubuntu box from another system. This requested my password almost instantly, as normal - but then nothing else happened, and the connection eventually dropped.

I conclude that IP and TCP are functional, and it's possible for some processes to respond, but not many. So it's not a complete kernel hang. (In particular, I'm seeing evidence that it's getting beyond things done at interrupt level.)

I don't have any debugging aids installed, so I don't believe I can get a kernel crash dump, which is what I'd want if I were debugging this. I *can* potentially retrieve and attach logs, but you'll have to tell me which ones are relevant, and do so before they roll over. (Also, logging will have to be functioning; IIRC, there were syslog issues in 12.04, and while I'd implemented whatever fix was recommended at the time, I haven't looked at my logs since the upgrade.)

This is a desktop system originally from System 76 - i.e. built for Linux - that's also running a bunch of server software (postfix, apache, ...). I was not (knowingly) running anything unusual at the time - probably Unity, a few shells, firefox, maybe gnucash and/or emacs - and all the usual daemons.

IIRC, I was not at the very latest versions of all software installed - some new versions had come out since I upgraded, and I was going to deal with installing them on the weekend.

I'm going to hard reboot the system now. I can then gather identifying info. If I have time this AM before work, I'll check for standard things you want in all bugs, and add them. (Right now I'm posting from my Mac laptop ;-))
---
ApportVersion: 2.20.1-0ubuntu2.5
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC1: arlie 2507 F.... pulseaudio
 /dev/snd/controlC0: arlie 2507 F.... pulseaudio
CurrentDesktop: Unity
DistroRelease: Ubuntu 16.04
HibernationDevice: RESUME=UUID=e206b01d-6cec-4b56-b469-25b106536f09
InstallationDate: Installed on 2012-04-26 (1811 days ago)
InstallationMedia: Ubuntu 12.04 LTS "Precise Pangolin" - Release amd64 (20120425)
MachineType: System76, Inc. Wild Dog Performance
NonfreeKernelModules: nvidia
Package: linux (not installed)
ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-72-generic root=UUID=96551326-e461-4071-ab9c-0e81ad7015d7 ro quiet splash
ProcVersionSignature: Ubuntu 4.4.0-72.93-generic 4.4.49
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-72-generic N/A
 linux-backports-modules-4.4.0-72-generic N/A
 linux-firmware 1.157.8
RfKill:
 0: phy0: Wireless LAN
  Soft blocked: yes
  Hard blocked: no
Tags: xenial
Uname: Linux 4.4.0-72-generic x86_64
UpgradeStatus: Upgraded to xenial on 2017-03-31 (11 days ago)
UserGroups: adm cdrom dip lpadmin plugdev sambashare sudo
_MarkForUpload: True
dmi.bios.date: 02/24/2012
dmi.bios.vendor: Intel Corp.
dmi.bios.version: KCH7710H.86A.0069.2012.0224.1825
dmi.board.name: DH77KC
dmi.board.vendor: Intel Corporation
dmi.board.version: AAG39641-400
dmi.chassis.type: 3
dmi.chassis.vendor: System76, Inc.
dmi.chassis.version: WilP9
dmi.modalias: dmi:bvnIntelCorp.:bvrKCH7710H.86A.0069.2012.0224.1825:bd02/24/2012:svnSystem76,Inc.:pnWildDogPerformance:pvrwilp9:rvnIntelCorporation:rnDH77KC:rvrAAG39641-400:cvnSystem76,Inc.:ct3:cvrWilP9:
dmi.product.name: Wild Dog Performance
dmi.product.version: wilp9
dmi.sys.vendor: System76, Inc.

Revision history for this message
Arlie Stephens (arlie) wrote :

After I power cycled the system, I got 3 of the "system problem detected" pop-ups, and told it I wanted them reported. I wasn't given an opportunity to see what was reported, or add to it, so I was unable to reference this bug number. If I understand what's behind the "user friendly" wrapping, this probably means there were 3 core dumps lying around from the crash. Good luck linking them with this bug report.

Also, I tried Alt-SysRq before power cycling, and determined that, as I half remembered, this functionality is not available on Ubuntu - or conceivably was broken by the same bug I was trying to gather data for.
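
For context on the SysRq remark: on Linux the magic SysRq key is gated by the kernel.sysrq sysctl (readable at /proc/sys/kernel/sysrq), which is 0 (off), 1 (all functions) or a bitmask. The sketch below decodes such a value using the bit meanings from the kernel's sysrq documentation. The value 176 in the example is an assumption (it is believed to be Ubuntu's shipped default); the actual value on the affected machine was not reported in this bug.

```python
# Bit meanings as listed in the kernel's admin-guide sysrq documentation.
SYSRQ_BITS = {
    2: "console log level control",
    4: "keyboard control (SAK, unraw)",
    8: "debugging dumps of processes",
    16: "sync",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}


def decode_sysrq(value):
    """Return the SysRq function groups enabled by a kernel.sysrq value."""
    if value == 0:
        return []                          # SysRq disabled entirely
    if value == 1:
        return list(SYSRQ_BITS.values())   # everything enabled
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]


def read_sysrq(path="/proc/sys/kernel/sysrq"):
    """Read the live setting (Linux only); not called in this example."""
    with open(path) as f:
        return int(f.read().strip())


# 176 is an assumed example value, not one taken from this report.
enabled = decode_sysrq(176)
print(enabled)
```

If the machine really was at 176, the process-dump group (bit 8) would be off, so key combinations that dump task state would be ignored while sync, remount and reboot still work. That could look like "not available" from the keyboard, though this is only one possible explanation.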

Revision history for this message
Arlie Stephens (arlie) wrote :

I tried to run apport-collect specifying this bug, but I think something may not have worked.

$ apport-collect 1680502
The authorization page:
 (https://launchpad.net/+authorize-token?oauth_token=Km4k3CpdglJ0WgdZk2DF&allow_permission=DESKTOP_INTEGRATION)
should be opening in your browser. Use your browser to authorize
this program to access Launchpad on your behalf.
Waiting to hear from Launchpad about your decision...
Gtk-Message: GtkDialog mapped without a transient parent. This is discouraged.

I gave it 1 hour permission when the new tab appeared in my web browser.

Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1680502/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
Revision history for this message
Arlie Stephens (arlie) wrote :

It's done it again today. What do I need to gather to give you something useful? And is there any way to get some action? Perhaps assign it to kernel, even though like as not it's an application-level hangup?

Revision history for this message
Arlie Stephens (arlie) wrote :

Assigning this to "linux" package. I have no evidence as to whether this is a kernel issue, or a case of user space processes with crazy priorities and tight loops etc. However, in my experience, (a) it takes a kernel person to distinguish the two, and (b) kernel people are more responsive than whoever monitors the no-package-assigned category.

affects: ubuntu → linux (Ubuntu)
Revision history for this message
Arlie Stephens (arlie) wrote :

No pop-ups after today's hang, so that symptom is probably coincidental.

Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1680502

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Arlie Stephens (arlie) wrote : AlsaInfo.txt

apport information

tags: added: apport-collected xenial
description: updated
Revision history for this message
Arlie Stephens (arlie) wrote : CRDA.txt

apport information

Revision history for this message
Arlie Stephens (arlie) wrote : CurrentDmesg.txt

apport information

Revision history for this message
Arlie Stephens (arlie) wrote : IwConfig.txt

apport information

Revision history for this message
Arlie Stephens (arlie) wrote : JournalErrors.txt

apport information

Revision history for this message
Arlie Stephens (arlie) wrote : Lspci.txt

apport information

Revision history for this message
Arlie Stephens (arlie) wrote : Lsusb.txt

apport information

Revision history for this message
Arlie Stephens (arlie) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
Arlie Stephens (arlie) wrote : ProcEnviron.txt

apport information

Revision history for this message
Arlie Stephens (arlie) wrote : ProcInterrupts.txt

apport information

Revision history for this message
Arlie Stephens (arlie) wrote : ProcModules.txt

apport information

Revision history for this message
Arlie Stephens (arlie) wrote : PulseList.txt

apport information

Revision history for this message
Arlie Stephens (arlie) wrote : UdevDb.txt

apport information

Revision history for this message
Arlie Stephens (arlie) wrote : WifiSyslog.txt

apport information

Revision history for this message
Arlie Stephens (arlie) wrote :

This time, apport-collect appears to have worked.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.11 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.11-rc6

Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Xenial):
importance: Undecided → High
status: New → Incomplete
tags: added: kernel-da-key
Revision history for this message
Arlie Stephens (arlie) wrote :

My apologies for going dark on you. I intend to try the upstream kernel, but I've been having trouble finding time. With luck, I'll get it installed this coming weekend.

Meanwhile, I've since had 2 or 3 standard recurrences, and one this morning involving significantly different behaviour. (Of course it could be a different bug, but the primary symptom is the same ... complete lack of response on the system console.)

The difference this morning is that I was able to log in over the LAN and attempt a controlled shutdown - which failed.

Also notable - I've installed at least one batch of fixes since the first incident.

-- Snippets from window follow. Note the load average. Note also the hysterical compiz process --
arlie@ansuz$ uptime
 08:09:42 up 3 days, 11:19, 2 users, load average: 9.95, 6.86, 3.43
arlie@ansuz$ sudo shutdown -r now
[sudo] password for arlie:
Failed to start reboot.target: Connection timed out
See system logs and 'systemctl status reboot.target' for details.

arlie@ansuz$ systemctl status reboot.target
Failed to get properties: Connection timed out

top - 08:17:44 up 3 days, 11:27, 2 users, load average: 10.00, 9.39, 6.11
Tasks: 312 total, 5 running, 234 sleeping, 0 stopped, 73 zombie
%Cpu(s): 0.0 us, 25.1 sy, 0.0 ni, 74.9 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 8140712 total, 1111908 free, 288328 used, 6740476 buff/cache
KiB Swap: 7781712 total, 7495504 free, 286208 used. 7202464 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2527 arlie     20   0       0      0      0 Z  99.7  0.0  44:05.00 compiz
24400 arlie     20   0   41936   3780   3048 R   0.3  0.0   0:00.02 top
    1 root      20   0  120072   3980   2132 D   0.0  0.0   0:06.37 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.02 kthreadd
    3 root      20   0       0      0      0 R   0.0  0.0   0:00.46 ksoftirqd/0
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
    7 root      20   0       0      0      0 S   0.0  0.0   0:40.51 rcu_sched
    8 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh

Revision history for this message
Arlie Stephens (arlie) wrote :

Note that the compiz process appears to be a zombie, and "kill -9" naturally fails to make it go away. I suppose it might be racking up all this time attempting to dump core.

I wish I knew whether compiz has been berserk in all the cases where I power cycled the system to regain control. If it has, the implication seems to be that this is a Unity bug. Otherwise not.

Revision history for this message
Arlie Stephens (arlie) wrote :

Hmm - 73 zombies?! And systemd is in an uninterruptible sleep. It moreover has not accumulated any time since I first ran top, so I'd say it's still in the *same* uninterruptible sleep... for at least 14 minutes. Also, the number of zombies has been increasing; as of right now, it's 187.
I don't have the ability to live debug this kernel, but I'm morally convinced that if we could trace the resource systemd is waiting on,

---
top - 08:34:16 up 3 days, 11:44, 2 users, load average: 10.00, 9.99, 8.69
Tasks: 472 total, 5 running, 280 sleeping, 0 stopped, 187 zombie
%Cpu(s): 1.2 us, 0.4 sy, 0.1 ni, 98.1 id, 0.2 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 8140712 total, 1037684 free, 336272 used, 6766756 buff/cache
KiB Swap: 7781712 total, 7495516 free, 286196 used. 7137692 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2527 arlie     20   0       0      0      0 Z 106.7  0.0  60:36.87 compiz
    1 root      20   0  120072   3980   2132 D   0.0  0.0   0:06.37 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.02 kthreadd
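
The two top snapshots in this comment thread (73 zombies at 08:17:44, 187 at 08:34:16) allow a rough estimate of how fast zombies were piling up. The sketch below only does that arithmetic; the function name is made up for illustration, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime


def zombie_rate_per_minute(t0, z0, t1, z1):
    """Average number of new zombies per minute between two top snapshots."""
    fmt = "%H:%M:%S"
    elapsed_s = (datetime.strptime(t1, fmt) - datetime.strptime(t0, fmt)).total_seconds()
    return (z1 - z0) / (elapsed_s / 60.0)


# Figures copied from the two "top" headers quoted in this report.
rate = zombie_rate_per_minute("08:17:44", 73, "08:34:16", 187)
print(f"{rate:.1f} new zombies per minute")
```

That works out to 114 new zombies in 992 seconds, or roughly 7 per minute. A steady accumulation like this would be consistent with PID 1 no longer reaping orphaned children while it sits in state D, although the report does not confirm that as the cause.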

---
And look at this, we have a defunct pager!!

arlie@ansuz$ ps -Fa -p1
UID PID PPID C SZ RSS PSR STIME TTY TIME CMD
root 1 0 0 30018 3980 3 Apr15 ? 00:00:06 /sbin/init splash
arlie 24386 1 0 0 0 3 08:15 pts/2 00:00:00 [pager] <defunct>
arlie 24859 24139 0 9342 3284 3 08:37 pts/2 00:00:00 ps -Fa -p1

--
I can't get the WCHAN for systemd out of ps. Sorry.

--
arlie@ansuz$ ps -e -o pid,wchan=WIDE-WCHAN-COLUMN -o comm | head
  PID WIDE-WCHAN-COLUMN COMMAND
    1 - systemd
    2 - kthreadd
    3 - ksoftirqd/0
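
When ps prints "-" for WCHAN, the same information can sometimes be read straight from procfs: /proc/<pid>/stat carries the state letter, and /proc/<pid>/wchan or (as root, if the kernel exposes it) /proc/<pid>/stack may name the kernel function a D-state task is blocked in. The following is a minimal sketch under those assumptions. The sample stat line is illustrative rather than captured from the affected machine, and the helper names are hypothetical.

```python
def parse_stat(line):
    """Extract pid, comm, state and ppid from a /proc/<pid>/stat line.

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so the split is done on the last closing parenthesis.
    """
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    fields = tail.split()
    return {
        "pid": int(pid_str),
        "comm": comm,
        "state": fields[0],
        "ppid": int(fields[1]),
    }


def read_proc(pid, name):
    """Return the stripped contents of /proc/<pid>/<name>, or None if unreadable."""
    try:
        with open(f"/proc/{pid}/{name}") as f:
            return f.read().strip()
    except OSError:
        return None


# Illustrative line only; a real stat line has many more fields.
sample = "1 (systemd) D 0 1 1 0 -1 4194560"
info = parse_stat(sample)
print(info)
```

On the hung system, calling read_proc(1, "wchan") and read_proc(1, "stack") over ssh would be the next thing to try; both may come back empty or None depending on kernel configuration and privileges.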

Revision history for this message
Arlie Stephens (arlie) wrote :

Today's "hang" involved a zombie compiz consuming 100% of a cpu, along with an emacs instance consuming another 100%. Load average around 11, and climbing. Only 22 zombies currently, but it was 4 when I managed to get on with ssh.

I was in the process of installing software updates, using the GUI tool (rather than direct use of apt-get from the shell) when this happened.

Parts of the update still seem to be running.

arlie@ansuz$ ps -Fa -p1 -www
UID PID PPID C SZ RSS PSR STIME TTY TIME CMD
root 1 0 0 30034 4656 2 Apr28 ? 00:00:08 /sbin/init splash
root 25826 25775 0 1127 1712 0 07:57 pts/18 00:00:00 /bin/sh -e /var/lib/dpkg/info/udev.postrm upgrade 229-4ubuntu17
root 25843 25826 0 6542 1352 0 07:57 pts/18 00:00:00 systemctl --system daemon-reload
arlie 25846 22284 0 9342 3232 2 07:57 pts/4 00:00:00 ps -Fa -p1 -www

I'm wondering now whether my first guess of a kernel issue is dead wrong, and the root cause is actually compiz. Or perhaps we have multiple causes, for the same basic symptom.

Here's the current crop of defunct processes

arlie@ansuz$ ps aux | grep defunct
arlie 2488 0.0 0.0 0 0 ? Z<l Apr28 0:00 [pulseaudio] <defunct>
arlie 2503 0.8 0.0 0 0 ? Zsl Apr28 55:08 [compiz] <defunct>
arlie 2692 0.0 0.0 0 0 ? Z Apr28 0:00 [gconf-helper] <defunct>
root 22212 0.0 0.0 0 0 ? Z 07:42 0:00 [check-new-relea] <defunct>
sshd 24480 0.0 0.0 0 0 ? Z 07:52 0:00 [sshd] <defunct>
sshd 24489 0.0 0.0 0 0 ? Z 07:52 0:00 [sshd] <defunct>
sshd 24491 0.0 0.0 0 0 ? Z 07:52 0:00 [sshd] <defunct>
sshd 24494 0.0 0.0 0 0 ? Z 07:53 0:00 [sshd] <defunct>
sshd 24496 0.0 0.0 0 0 ? Z 07:53 0:00 [sshd] <defunct>
sshd 24500 0.0 0.0 0 0 ? Z 07:53 0:00 [sshd] <defunct>
sshd 24504 0.0 0.0 0 0 ? Z 07:53 0:00 [sshd] <defunct>
sshd 24508 0.0 0.0 0 0 ? Z 07:53 0:00 [sshd] <defunct>
sshd 24510 0.0 0.0 0 0 ? Z 07:54 0:00 [sshd] <defunct>
sshd 24514 0.0 0.0 0 0 ? Z 07:54 0:00 [sshd] <defunct>
sshd 24518 0.0 0.0 0 0 ? Z 07:54 0:00 [sshd] <defunct>
sshd 24523 0.0 0.0 0 0 ? Z 07:54 0:00 [sshd] <defunct>
sshd 24532 0.0 0.0 0 0 ? Z 07:54 0:00 [sshd] <defunct>
sshd 24538 0.0 0.0 0 0 ? Z 07:55 0:00 [sshd] <defunct>
sshd 24541 0.0 0.0 0 0 ? Z 07:55 0:00 [sshd] <defunct>
sshd 24543 0.0 0.0 0 0 ? Z 07:55 0:00 [sshd] <defunct>
sshd 25708 0.0 0.0 0 0 ? Z 07:55 0:00 [sshd] <defunct>
sshd 25711 0.0 0.0 0 0 ? Z 07:56 0:00 [sshd] <defunct>
arlie 26946 0.0 0.0 14228 964 pts/4 S+ 08:00 0:00 grep defunct

Systemd is in top's state "D" - just like last time. That's an uninterruptible sleep. It does not appear to have accumulat...
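
The defunct listing above is easier to read when grouped by command. The sketch below tallies zombie entries from "ps aux" text. It assumes the standard 11-column "ps aux" layout (STAT is the 8th column, COMMAND the last), and the function name is hypothetical. The sample input is a shortened copy of the listing in this comment.

```python
from collections import Counter


def count_defunct(ps_aux_text):
    """Count zombie (STAT starting with 'Z') processes per command name."""
    counts = Counter()
    for line in ps_aux_text.splitlines():
        fields = line.split(None, 10)   # keep COMMAND (with spaces) as one field
        if len(fields) < 11:
            continue
        stat, command = fields[7], fields[10]
        if stat.startswith("Z"):
            # "[sshd] <defunct>" -> "sshd"
            counts[command.split()[0].strip("[]")] += 1
    return counts


sample = """\
arlie 2488 0.0 0.0 0 0 ? Z<l Apr28 0:00 [pulseaudio] <defunct>
arlie 2503 0.8 0.0 0 0 ? Zsl Apr28 55:08 [compiz] <defunct>
sshd 24480 0.0 0.0 0 0 ? Z 07:52 0:00 [sshd] <defunct>
sshd 24489 0.0 0.0 0 0 ? Z 07:52 0:00 [sshd] <defunct>
arlie 26946 0.0 0.0 14228 964 pts/4 S+ 08:00 0:00 grep defunct
"""
summary = count_defunct(sample)
print(summary.most_common())
```

In the full listing above, 18 of the 22 defunct entries are sshd children created between 07:52 and 07:56, which suggests that most of the zombie growth here came from repeated ssh login attempts rather than from the desktop session.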


Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu Xenial) because there has been no activity for 60 days.]

Changed in linux (Ubuntu Xenial):
status: Incomplete → Expired
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired