Activity log for bug #1154876

Date Who What changed Old value New value Message
2013-03-14 00:40:57 Marc Hasson bug added bug
2013-03-14 00:40:57 Marc Hasson attachment added Crash dump taken by sysrq-g after hang and subsequent kdb session to demo where hung https://bugs.launchpad.net/bugs/1154876/+attachment/3573038/+files/dump.201303072055
2013-03-14 00:44:38 Marc Hasson attachment added boot up messages until standard running state of OOMs spew out https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876/+attachment/3573090/+files/console_boot_output.txt
2013-03-14 00:45:47 Marc Hasson attachment added dmesg file from boot, mostly duplicates start of console_boot_output.txt https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876/+attachment/3573109/+files/dmesg_of_boot.txt
2013-03-14 00:46:44 Marc Hasson attachment added last messages on serial console when system hung https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876/+attachment/3573110/+files/console_last_output.txt
2013-03-14 00:47:59 Marc Hasson attachment added kdb session demo'ing where system is "spinning" https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876/+attachment/3573111/+files/console_kdb_session.txt
2013-03-14 00:49:07 Marc Hasson attachment added Machine environment and script/data used in our testbed https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876/+attachment/3573112/+files/reproduction_info.txt
2013-03-14 00:50:05 Marc Hasson attachment added Requested version.log https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876/+attachment/3573113/+files/version.log
2013-03-14 00:51:34 Marc Hasson attachment added Requested lspci-vnvn.log https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876/+attachment/3573114/+files/lspci-vnvn.log
2013-03-14 01:00:16 Brad Figg linux (Ubuntu): status New Confirmed
2013-03-14 15:49:14 Tim Hartrick bug added subscriber Tim Hartrick
2013-03-14 19:12:11 Joseph Salisbury linux (Ubuntu): importance Undecided Medium
2013-03-14 19:12:31 Joseph Salisbury linux (Ubuntu): status Confirmed Incomplete
2013-03-25 23:37:30 Marc Hasson linux (Ubuntu): status Incomplete Confirmed
2013-05-23 00:25:19 Marc Hasson tags amd64 apport-bug precise running-unity amd64 apport-bug kernel-bug-exists-upstream precise running-unity
2013-06-06 13:53:30 p bug added subscriber p
2013-08-04 02:16:28 penalvch description Background We've been experiencing mysterious hangs on our 2.6.38-16 Ubuntu 10.04 systems in the field. The systems have large amounts of memory and disk, along with up to a couple dozen CPU threads. Our operations folks have to power-cycle the machines to recover them, they do not panic. Our use of "hang" means the system will no longer respond to any current shell prompts, will not accept new logins, and may not even respond to pings. It appears totally dead. Using log files and the "sar" utility from the "sysstat" package we gradually put together the following clues to the hangs: Numerous "INFO: task <task-name>:<pid> blocked for more than 120 seconds" High CPU usage suddenly on all CPUs heading into the hang, 92% or higher Very high kswapd page scan rates (pgscank/s) - up to 7 million per second Very high direct page scan rates (pgscand/s) - up to 3 million per second In addition to noting the above events just before the hangs, we have some evidence that the high kswapd scans occur at other times for no seemingly obvious reason. Such as when there is a significant (25%) amount of kbmemfree. Also, we've seen cases where there are application errors related to a system's responsiveness and that has sometimes correlated with either high pgscank/s or pgscand/s that lasts for some number of sar records before the system returns to normal running. The peaks of these transients aren't usually as high as those we see leading to a solid system hang/failure. And sometimes these are not "transients", but last for hours with no apparent event related to the starting or stopping of this behavior! So we decided to see if we can reproduce these symptoms on a VMware testbed that we could easily examine with kdb and snapshot/dump.
Through a combination of tar, find, and cat commands launched from a shell script we could recreate a system hang on both our 2.6.38-16 systems as well as the various flavors of the 3.2 kernels, with the one crashdump'ed here being the latest 3.2.0-38 at the time of testing. The "sar" utility on our 2.6 testing confirmed similar behavior of the CPUs, kswapd scans, and direct scans leading up to the testbed hangs as to what we see in the field failures of our servers. Details on the shell scripts can be found in the file referenced below. It's important to read the information below on how the crash dump was taken before investigating it. Reproduction on a 2-CPU VM took 1.5-4 days for a 3.2 kernel, usually considerably less for a 2.6 kernel. Hang/crashdump details: In the crashdump the crash "dmesg" command will also show Call Traces that occurred *after* kdb investigations started. It's important to note the kernel timestamp that indicates the start of those kdb actions and only examine prior to that for clues as to the hang proper: [160512.756748] SysRq : DEBUG [164936.052464] psmouse serio1: Wheel Mouse at isa0060/serio1/input0 lost synchronization, throwing 2 bytes away. [164943.764441] psmouse serio1: resync failed, issuing reconnect request [165296.817468] SysRq : DEBUG Everything previous to the above "dmesg" output occurs prior (or during) the full system hang. The kdb session started over 12 hours after the hang, the system was totally non-responsive at either its serial console or GUI. Did not try a "ping" in this instance. The "kdb actions" taken may be seen in an actual log of that session recorded in console_kdb_session.txt. It shows where these 3.2 kernels are spending their time when hung in our testbed ("spinning" in __alloc_pages_slowpath by failing an allocation, sleeping, retrying). We see the same behavior for the 2.6 kernels/tests as well except for one difference described below.
For the 3.2 dump included here all our script/load processes, as well as system processes, are constantly failing to allocate a page, sleeping briefly, and trying again. This occurs across all CPUs (2 CPUs in this system/dump), which fits with what we believe we see in our field machines for the 2.6 kernels. For the 2.6 kernels the only difference we see is that there is typically a call to the __alloc_pages_may_oom function which in turn selects a process to kill, but we see that there is already a "being killed by oom" process at the hang so no additional ones are selected. And we deadlock, just as the comment in oom_kill.c's select_bad_process() says. In the 3.2 kernels we are now moving our systems to we see in our testbed hang that the code does not go down the __alloc_pages_may_oom path. Yet from the logs we include and the "dmesg" within crash one can see that prior to the hang OOM killing is invoked frequently. The key seems to be a difference in the "did_some_progress" variable returned when we are very low on memory, it's always a "1" in the 3.2 kernels on our testbed. Though the kernel used here is 3.2.0-38-generic we have also caused this to occur with earlier 3.2 Ubuntu kernels. We have also reproduced the failures with 2.6.38-8, 2.6.38-16, and 3.0 Ubuntu kernels.
Quick description of included attachments (assuming this bug tool lets me add them separately): console_boot_output.txt - boot up messages until standard running state of OOMs dmesg_of_boot.txt - dmesg file from boot, mostly duplicates start of the above console_last_output.txt - last messages on serial console when system hung console_kdb_session.txt - kdb session demo'ing where system is "spinning" dump.201303072055 - sysrq-g dump, system was up around 2 days before hanging reproduction_info.txt - Machine environment and script used in our testbed Here are a couple of the miscellaneous things asked for on the bug reporting guidelines: lsb_release -rd Description: Ubuntu 12.04.2 LTS Release: 12.04 uname -a Linux direct-12-04 3.2.0-38-generic #61-Ubuntu SMP Tue Feb 19 12:18:21 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux I will attempt to subsequently attach the lspci output and other attachments, if I can.... ProblemType: Bug DistroRelease: Ubuntu 12.04 Package: linux-image-3.2.0-38-generic 3.2.0-38.61 ProcVersionSignature: Ubuntu 3.2.0-38.61-generic 3.2.37 Uname: Linux 3.2.0-38-generic x86_64 AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.24. ApportVersion: 2.0.1-0ubuntu17.1 Architecture: amd64 ArecordDevices: **** List of CAPTURE Hardware Devices **** card 0: AudioPCI [Ensoniq AudioPCI], device 0: ES1371/1 [ES1371 DAC2/ADC] Subdevices: 1/1 Subdevice #0: subdevice #0 AudioDevicesInUse: USER PID ACCESS COMMAND /dev/snd/controlC0: marc 2591 F.... pulseaudio CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found. 
Card0.Amixer.info: Card hw:0 'AudioPCI'/'Ensoniq AudioPCI ENS1371 at 0x20c0, irq 18' Mixer name : 'Cirrus Logic CS4297A rev 3' Components : 'AC97a:43525913' Controls : 24 Simple ctrls : 13 Date: Wed Mar 13 17:05:30 2013 HibernationDevice: RESUME=UUID=2342cd45-2970-47d7-bb6d-6801d361cb3e InstallationMedia: Ubuntu 12.04 LTS "Precise Pangolin" - Release amd64 (20120425) Lsusb: Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub Bus 002 Device 002: ID 0e0f:0003 VMware, Inc. Virtual Mouse Bus 002 Device 003: ID 0e0f:0002 VMware, Inc. Virtual USB Hub MachineType: VMware, Inc. VMware Virtual Platform MarkForUpload: True ProcEnviron: TERM=xterm PATH=(custom, no user) LANG=en_US.UTF-8 SHELL=/bin/bash ProcFB: 0 svgadrmfb ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.2.0-38-generic root=UUID=2db72c58-0ff6-48f6-87e4-55365ee344df ro crashkernel=384M-2G:64M,2G-:128M rootdelay=60 console=ttyS1,115200n8 kgdboc=kms,kbd,ttyS1,115200n8 splash RelatedPackageVersions: linux-restricted-modules-3.2.0-38-generic N/A linux-backports-modules-3.2.0-38-generic N/A linux-firmware 1.79.1 RfKill: SourcePackage: linux UpgradeStatus: No upgrade log present (probably fresh install) dmi.bios.date: 06/02/2011 dmi.bios.vendor: Phoenix Technologies LTD dmi.bios.version: 6.00 dmi.board.name: 440BX Desktop Reference Platform dmi.board.vendor: Intel Corporation dmi.board.version: None dmi.chassis.asset.tag: No Asset Tag dmi.chassis.type: 1 dmi.chassis.vendor: No Enclosure dmi.chassis.version: N/A dmi.modalias: dmi:bvnPhoenixTechnologiesLTD:bvr6.00:bd06/02/2011:svnVMware,Inc.:pnVMwareVirtualPlatform:pvrNone:rvnIntelCorporation:rn440BXDesktopReferencePlatform:rvrNone:cvnNoEnclosure:ct1:cvrN/A: dmi.product.name: VMware Virtual Platform dmi.product.version: None dmi.sys.vendor: VMware, Inc. Background We've been experiencing mysterious hangs on our 2.6.38-16 Ubuntu 10.04 systems in the field. 
The systems have large amounts of memory and disk, along with up to a couple dozen CPU threads. Our operations folks have to power-cycle the machines to recover them, they do not panic. Our use of "hang" means the system will no longer respond to any current shell prompts, will not accept new logins, and may not even respond to pings. It appears totally dead. Using log files and the "sar" utility from the "sysstat" package we gradually put together the following clues to the hangs:   Numerous "INFO: task <task-name>:<pid> blocked for more than 120 seconds"   High CPU usage suddenly on all CPUs heading into the hang, 92% or higher   Very high kswapd page scan rates (pgscank/s) - up to 7 million per second   Very high direct page scan rates (pgscand/s) - up to 3 million per second In addition to noting the above events just before the hangs, we have some evidence that the high kswapd scans occur at other times for no seemingly obvious reason. Such as when there is a significant (25%) amount of kbmemfree. Also, we've seen cases where there are application errors related to a system's responsiveness and that has sometimes correlated with either high pgscank/s or pgscand/s that lasts for some number of sar records before the system returns to normal running. The peaks of these transients aren't usually as high as those we see leading to a solid system hang/failure. And sometimes these are not "transients", but last for hours with no apparent event related to the starting or stopping of this behavior! So we decided to see if we can reproduce these symptoms on a VMware testbed that we could easily examine with kdb and snapshot/dump. Through a combination of tar, find, and cat commands launched from a shell script we could recreate a system hang on both our 2.6.38-16 systems as well as the various flavors of the 3.2 kernels, with the one crashdump'ed here being the latest 3.2.0-38 at the time of testing.
The "sar" utility on our 2.6 testing confirmed similar behavior of the CPUs, kswapd scans, and direct scans leading up to the testbed hangs as to what we see in the field failures of our servers. Details on the shell scripts can be found in the file referenced below. It's important to read the information below on how the crash dump was taken before investigating it. Reproduction on a 2-CPU VM took 1.5-4 days for a 3.2 kernel, usually considerably less for a 2.6 kernel. Hang/crashdump details: In the crashdump the crash "dmesg" command will also show Call Traces that occurred *after* kdb investigations started. It's important to note the kernel timestamp that indicates the start of those kdb actions and only examine prior to that for clues as to the hang proper: [160512.756748] SysRq : DEBUG [164936.052464] psmouse serio1: Wheel Mouse at isa0060/serio1/input0 lost synchronization, throwing 2 bytes away. [164943.764441] psmouse serio1: resync failed, issuing reconnect request [165296.817468] SysRq : DEBUG Everything previous to the above "dmesg" output occurs prior (or during) the full system hang. The kdb session started over 12 hours after the hang, the system was totally non-responsive at either its serial console or GUI. Did not try a "ping" in this instance. The "kdb actions" taken may be seen in an actual log of that session recorded in console_kdb_session.txt. It shows where these 3.2 kernels are spending their time when hung in our testbed ("spinning" in __alloc_pages_slowpath by failing an allocation, sleeping, retrying). We see the same behavior for the 2.6 kernels/tests as well except for one difference described below. For the 3.2 dump included here all our script/load processes, as well as system processes, are constantly failing to allocate a page, sleeping briefly, and trying again. This occurs across all CPUs (2 CPUs in this system/dump), which fits with what we believe we see in our field machines for the 2.6 kernels.
For the 2.6 kernels the only difference we see is that there is typically a call to the __alloc_pages_may_oom function which in turn selects a process to kill, but we see that there is already a "being killed by oom" process at the hang so no additional ones are selected. And we deadlock, just as the comment in oom_kill.c's select_bad_process() says. In the 3.2 kernels we are now moving our systems to we see in our testbed hang that the code does not go down the __alloc_pages_may_oom path. Yet from the logs we include and the "dmesg" within crash one can see that prior to the hang OOM killing is invoked frequently. The key seems to be a difference in the "did_some_progress" variable returned when we are very low on memory, it's always a "1" in the 3.2 kernels on our testbed. Though the kernel used here is 3.2.0-38-generic we have also caused this to occur with earlier 3.2 Ubuntu kernels. We have also reproduced the failures with 2.6.38-8, 2.6.38-16, and 3.0 Ubuntu kernels. Quick description of included attachments (assuming this bug tool lets me add them separately): console_boot_output.txt - boot up messages until standard running state of OOMs dmesg_of_boot.txt - dmesg file from boot, mostly duplicates start of the above console_last_output.txt - last messages on serial console when system hung console_kdb_session.txt - kdb session demo'ing where system is "spinning" dump.201303072055 - sysrq-g dump, system was up around 2 days before hanging reproduction_info.txt - Machine environment and script used in our testbed ProblemType: Bug DistroRelease: Ubuntu 12.04 Package: linux-image-3.2.0-38-generic 3.2.0-38.61 ProcVersionSignature: Ubuntu 3.2.0-38.61-generic 3.2.37 Uname: Linux 3.2.0-38-generic x86_64 AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.24.
ApportVersion: 2.0.1-0ubuntu17.1 Architecture: amd64 ArecordDevices:  **** List of CAPTURE Hardware Devices ****  card 0: AudioPCI [Ensoniq AudioPCI], device 0: ES1371/1 [ES1371 DAC2/ADC]    Subdevices: 1/1    Subdevice #0: subdevice #0 AudioDevicesInUse:  USER PID ACCESS COMMAND  /dev/snd/controlC0: marc 2591 F.... pulseaudio CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found. Card0.Amixer.info:  Card hw:0 'AudioPCI'/'Ensoniq AudioPCI ENS1371 at 0x20c0, irq 18'    Mixer name : 'Cirrus Logic CS4297A rev 3'    Components : 'AC97a:43525913'    Controls : 24    Simple ctrls : 13 Date: Wed Mar 13 17:05:30 2013 HibernationDevice: RESUME=UUID=2342cd45-2970-47d7-bb6d-6801d361cb3e InstallationMedia: Ubuntu 12.04 LTS "Precise Pangolin" - Release amd64 (20120425) Lsusb:  Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub  Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub  Bus 002 Device 002: ID 0e0f:0003 VMware, Inc. Virtual Mouse  Bus 002 Device 003: ID 0e0f:0002 VMware, Inc. Virtual USB Hub MachineType: VMware, Inc. 
VMware Virtual Platform MarkForUpload: True ProcEnviron:  TERM=xterm  PATH=(custom, no user)  LANG=en_US.UTF-8  SHELL=/bin/bash ProcFB: 0 svgadrmfb ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.2.0-38-generic root=UUID=2db72c58-0ff6-48f6-87e4-55365ee344df ro crashkernel=384M-2G:64M,2G-:128M rootdelay=60 console=ttyS1,115200n8 kgdboc=kms,kbd,ttyS1,115200n8 splash RelatedPackageVersions:  linux-restricted-modules-3.2.0-38-generic N/A  linux-backports-modules-3.2.0-38-generic N/A  linux-firmware 1.79.1 RfKill: SourcePackage: linux UpgradeStatus: No upgrade log present (probably fresh install) dmi.bios.date: 06/02/2011 dmi.bios.vendor: Phoenix Technologies LTD dmi.bios.version: 6.00 dmi.board.name: 440BX Desktop Reference Platform dmi.board.vendor: Intel Corporation dmi.board.version: None dmi.chassis.asset.tag: No Asset Tag dmi.chassis.type: 1 dmi.chassis.vendor: No Enclosure dmi.chassis.version: N/A dmi.modalias: dmi:bvnPhoenixTechnologiesLTD:bvr6.00:bd06/02/2011:svnVMware,Inc.:pnVMwareVirtualPlatform:pvrNone:rvnIntelCorporation:rn440BXDesktopReferencePlatform:rvrNone:cvnNoEnclosure:ct1:cvrN/A: dmi.product.name: VMware Virtual Platform dmi.product.version: None dmi.sys.vendor: VMware, Inc.
2013-08-04 02:22:32 penalvch linux (Ubuntu): status Confirmed Incomplete
2013-08-09 21:51:17 Marc Hasson linux (Ubuntu): status Incomplete Confirmed
2013-08-10 02:21:14 penalvch tags amd64 apport-bug kernel-bug-exists-upstream precise running-unity amd64 apport-bug kernel-bug-exists-upstream-v3.9-rc2 precise running-unity
2013-08-10 02:35:27 penalvch linux (Ubuntu): status Confirmed Incomplete
2013-08-10 03:24:16 Marc Hasson bug watch added http://bugzilla.kernel.org/show_bug.cgi?id=59901
2013-10-16 18:46:05 Marc Hasson tags amd64 apport-bug kernel-bug-exists-upstream-v3.9-rc2 precise running-unity amd64 apport-bug kernel-bug-exists-upstream kernel-bug-exists-upstream-v3.12-rc5 kernel-bug-exists-upstream-v3.9-rc2 kernel-fixed-upstream kernel-fixed-upstream-v3.11-rc7 precise running-unity
2013-10-16 18:46:49 Marc Hasson linux (Ubuntu): status Incomplete Confirmed
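
Note on the 2013-08-04 description entry: it advises reading only the "dmesg" output logged before the first "SysRq : DEBUG" timestamp, since later Call Traces come from the kdb session rather than the hang. A minimal sketch of that filtering step follows, assuming each line carries the usual "[seconds.micros]" kernel timestamp prefix. The function name and the first sample line are hypothetical and added for illustration only; the remaining sample lines are the ones quoted in the description.

```python
import re

# Matches a leading "[seconds.micros]" kernel timestamp and captures the message.
TS = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def lines_before_first_sysrq_debug(lines):
    """Return the dmesg lines logged before the first 'SysRq : DEBUG' entry.

    Hypothetical helper: it stops at the first line whose message is exactly
    'SysRq : DEBUG', which the description identifies as the start of kdb use.
    """
    kept = []
    for line in lines:
        match = TS.match(line)
        if match and match.group(2).strip() == "SysRq : DEBUG":
            break
        kept.append(line)
    return kept


sample = [
    # Hypothetical pre-hang line, not taken from the real dump.
    "[160500.000000] hypothetical message logged before kdb was entered",
    "[160512.756748] SysRq : DEBUG",
    "[164936.052464] psmouse serio1: Wheel Mouse at isa0060/serio1/input0 "
    "lost synchronization, throwing 2 bytes away.",
    "[164943.764441] psmouse serio1: resync failed, issuing reconnect request",
    "[165296.817468] SysRq : DEBUG",
]

pre_hang = lines_before_first_sysrq_debug(sample)
for line in pre_hang:
    print(line)
```

In practice the input would be the text saved from the crash "dmesg" command; how that text is captured is left out here because the log does not specify it.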