System locks up after upgrading to linux-image-2.6.32-32-generic-pae

Bug #805209 reported by Tom Ellis
This bug affects 5 people
Affects         Status        Importance  Assigned to  Milestone
linux (Ubuntu)  Fix Released  Undecided   Unassigned
Maverick        Fix Released  Undecided   Unassigned

Bug Description

# Issue

* System lock-ups are experienced after upgrading to linux-image-2.6.32-32-generic-pae from linux-image-2.6.32-31-generic-pae

* Issue only affects the PAE kernel; when running the non-PAE kernel the issue is not seen

## Environment

Ubuntu 10.04.2 LTS x86

On Lenovo models:
W510
T410
T500
X201

* All systems have 4 GB of RAM
* All systems use the PAE kernel

## Resolution

* (workaround) Revert to linux-image-2.6.32-31-generic-pae
* (workaround) Revert to linux-image-2.6.32-32-generic

## Diagnostic Steps

* Lockups happen at random; the system does not respond to network pings and no console logs are displayed
* Can be triggered by applications that probe the network and hardware (asset collection)

## Other information

* If the freeze does not occur directly, one CPU core goes up to 100% and/or memory consumption increases linearly until memory is depleted

* Additional tests carried out on the T410 (PASS means no lockup was seen during testing, FAIL means the system locked up):
2.6.38-10-PAE (linux-image-generic-pae-lts-backport-natty from lucid-proposed): PASS
2.6.35-30-PAE (linux-image-generic-pae-lts-backport-maverick): FAIL
2.6.32-33-PAE (lucid-proposed): FAIL
2.6.32-33 (lucid-proposed): PASS
2.6.32-32-PAE: FAIL
2.6.32-32: PASS
2.6.32-31-PAE: PASS
2.6.32-31: PASS
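
The matrix above can be summarised programmatically. This is only a minimal sketch, assuming the results are transcribed by hand into a string; the names `RESULTS`, `parse_results` and `failing_flavours` are illustrative and not part of the original report.

```python
# Results transcribed from the T410 matrix above (package notes omitted).
RESULTS = """\
2.6.38-10-PAE: PASS
2.6.35-30-PAE: FAIL
2.6.32-33-PAE: FAIL
2.6.32-33: PASS
2.6.32-32-PAE: FAIL
2.6.32-32: PASS
2.6.32-31-PAE: PASS
2.6.32-31: PASS
"""


def parse_results(text):
    """Return a list of (version, is_pae, passed) tuples."""
    rows = []
    for line in text.splitlines():
        kernel, _, verdict = line.partition(":")
        kernel = kernel.strip()
        is_pae = kernel.endswith("-PAE")
        version = kernel[: -len("-PAE")] if is_pae else kernel
        rows.append((version, is_pae, verdict.strip() == "PASS"))
    return rows


def failing_flavours(rows):
    """Group the failing kernel versions by flavour."""
    failures = {"pae": [], "non-pae": []}
    for version, is_pae, passed in rows:
        if not passed:
            failures["pae" if is_pae else "non-pae"].append(version)
    return failures


print(failing_flavours(parse_results(RESULTS)))
```

Every failure lands in the PAE group, which matches the observation that the non-PAE flavour is unaffected.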

* T410 VGA Controller:
00:02.0 VGA compatible controller [0300]: Intel Corporation Core Processor Integrated Graphics Controller [8086:0046] (rev 02)
        Kernel driver in use: i915
        Kernel modules: i915

ii linux-image-2.6.32-31-generic 2.6.32-31.61 Linux kernel image for version 2.6.32 on x86/x86_64
ii linux-image-2.6.32-31-generic-pae 2.6.32-31.61 Linux kernel image for version 2.6.32 on x86
ii linux-image-2.6.32-32-generic 2.6.32-32.62 Linux kernel image for version 2.6.32 on x86/x86_64
ii linux-image-2.6.32-32-generic-pae 2.6.32-32.62 Linux kernel image for version 2.6.32 on x86
ii linux-image-2.6.32-33-generic 2.6.32-33.69 Linux kernel image for version 2.6.32 on x86/x86_64
ii linux-image-2.6.32-33-generic-pae 2.6.32-33.69 Linux kernel image for version 2.6.32 on x86

Tom Ellis (tellis)
tags: added: kernel-bug pse
Tom Ellis (tellis)
tags: added: regression-update
tags: added: i386 lucid
Tom Ellis (tellis) wrote :

Started a bisect of the lucid kernel for testing, notes below.

Created a lucid i386 schroot for building kernels on my x86_64 natty system using the kernel team build scripts:
https://wiki.ubuntu.com/KernelTeam/KernelMaintenanceStarter

Kernel bisections (https://wiki.ubuntu.com/Kernel/KernelBisection):
$ git clone git://kernel.ubuntu.com/ubuntu/ubuntu-lucid.git

Commit tags for good and bad kernel:
Bad: Ubuntu-2.6.32-32.62
Good: Ubuntu-2.6.32-31.61

All commits between the two releases we think are "good" and "bad":
$ git log --oneline Ubuntu-2.6.32-31.61..Ubuntu-2.6.32-32.62 > ../diff-Ubuntu-2.6.32-31.61-Ubuntu-2.6.32-32.62

$ git log --oneline Ubuntu-2.6.32-31.61..Ubuntu-2.6.32-32.62 | wc -l
179

179 commits between the two tags.

Check out a tree with all commits up to the "bad" kernel tag, Ubuntu-2.6.32-32.62:
$ git checkout -b bisect Ubuntu-2.6.32-32.62

Build from the schroot (follow article above):
$ sudo schroot -c lucid-i386

Start bisection:
$ git bisect start Ubuntu-2.6.32-32.62 Ubuntu-2.6.32-31.61
Bisecting: 89 revisions left to test after this (roughly 7 steps)
[1198ae36cd30092718563b42e2f1d847516a5e45] fbcon: Bugfix soft cursor detection in Tile Blitting
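
As a back-of-the-envelope check on the "roughly N steps" figure: each tested kernel halves the set of candidate commits, so the number of remaining steps is about the base-2 logarithm of the revisions left. The sketch below is an assumption-laden approximation and does not reproduce git's own heuristic, which may give different numbers for other counts; `estimated_steps` is a hypothetical helper name.

```python
def estimated_steps(revisions_left):
    """Smallest k such that 2**k >= revisions_left (0 if at most one is left)."""
    if revisions_left <= 1:
        return 0
    return (revisions_left - 1).bit_length()


# Revision counts reported by git bisect at different points in this thread.
for left in (89, 22, 14):
    print(left, "revisions left ->", estimated_steps(left), "steps")
```

For the three counts that git reports in this thread (89, 22 and 14 revisions left) the estimate gives 7, 5 and 4 steps, in line with the output git printed.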

Update debian.master/changelog with an entry for the custom kernel test package, e.g.:
linux (2.6.32-32.63~tellis01LP805209) lucid; urgency=low

  Test build for bisect of pae kernel lockups regression

 -- Tom Ellis <email address hidden>  Sun, 03 Jul 2011 20:30:52 +0000
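
Since a new changelog entry is needed for every bisect build, the stanza can be generated rather than typed. The following is a hedged sketch, not the kernel team's tooling: `changelog_stanza` is a made-up helper, and the address `tellis@example.com` is a placeholder because the real one is hidden in this report. It relies on the Debian changelog layout (detail lines indented by two spaces, a trailer line starting with " -- ", and two spaces before the date).

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def changelog_stanza(package, version, series, message, name, email, when):
    """Build one Debian changelog entry as a string."""
    date = format_datetime(when)  # RFC 2822 style, e.g. "... +0000"
    return (
        f"{package} ({version}) {series}; urgency=low\n"
        "\n"
        f"  {message}\n"
        "\n"
        f" -- {name} <{email}>  {date}\n"
    )


stanza = changelog_stanza(
    "linux",
    "2.6.32-32.63~tellis01LP805209",
    "lucid",
    "Test build for bisect of pae kernel lockups regression",
    "Tom Ellis",
    "tellis@example.com",  # placeholder address, not the real one
    datetime(2011, 7, 3, 20, 30, 52, tzinfo=timezone.utc),
)
print(stanza)
```

Bumping the `tellis01` part of the version for each build keeps the test packages distinguishable, as with the tellis02 and tellis04 kernels mentioned later in the thread.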

Prepare for the build (this creates the needed files in the debian/ directory from the debian.master copies):
$ fakeroot debian/rules clean

Build it:
$ skipabi=true skipmodule=true fakeroot debian/rules binary-generic-pae

Kernels uploaded to:
http://people.canonical.com/~trellis/bisect-lp805209/

Tom Ellis (tellis) wrote :

So far:
linux-image-2.6.32-32-generic-pae_2.6.32-32.63~tellis02LP805209_i386.deb - PASS - no freeze on 10 tests
linux-image-2.6.32-32-generic-pae_2.6.32-32.63~tellis01LP805209_i386.deb - FAIL - freezes approx every second test
2.6.32-32 PAE standard kernel - FAIL - freezes approx every second test
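
Because the freeze is intermittent, a "good" verdict is only as strong as the number of clean runs behind it. A rough sketch of that reasoning, assuming test runs are independent and that an affected kernel freezes on about half of them (both are assumptions drawn from the "approx every second test" remark, not measured values); `false_pass_probability` is an illustrative name:

```python
def false_pass_probability(freeze_rate, runs):
    """Chance that an affected kernel survives `runs` independent tests."""
    return (1.0 - freeze_rate) ** runs


# Ten clean runs with a freeze rate of about one in two.
print(false_pass_probability(0.5, 10))  # 0.0009765625
```

Under those assumptions, ten clean runs leave roughly a one-in-a-thousand chance of wrongly marking an affected kernel as good during the bisect.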

Bisect log after linux-image-2.6.32-32-generic-pae_2.6.32-32.63~tellis02LP805209_i386.deb tested:
# git bisect good
Bisecting: 22 revisions left to test after this (roughly 5 steps)
[0359bccd80e98612776ec55d93662c1065d1e850] isdn: avoid calling tty_ldisc_flush() in atomic context
# git bisect log
# bad: [cbced76c3674f374a468e7aabe1247fa4fa7a012] UBUNTU: Ubuntu-2.6.32-32.62
# good: [a35a5d8abda4c1a445fba727877c40ff40fdd57c] UBUNTU: Ubuntu-2.6.32-31.61
git bisect start 'Ubuntu-2.6.32-32.62' 'Ubuntu-2.6.32-31.61'
# bad: [1198ae36cd30092718563b42e2f1d847516a5e45] fbcon: Bugfix soft cursor detection in Tile Blitting
git bisect bad 1198ae36cd30092718563b42e2f1d847516a5e45
# good: [f1324476bb889b1a6824581f55d65324df9b505d] ahci: AHCI mode SATA patch for Intel Patsburg SATA RAID controller
git bisect good f1324476bb889b1a6824581f55d65324df9b505d

Tom Ellis (tellis) wrote :

Last bisect passed.

(lucid-i386)root@viper# git bisect good
Bisecting: 14 revisions left to test after this (roughly 4 steps)
[71e56b32596646a8fc6eb139189112bdad1ea117] x86, binutils, xen: Fix another wrong size directive

(lucid-i386)root@viper# git bisect log
# bad: [cbced76c3674f374a468e7aabe1247fa4fa7a012] UBUNTU: Ubuntu-2.6.32-32.62
# good: [a35a5d8abda4c1a445fba727877c40ff40fdd57c] UBUNTU: Ubuntu-2.6.32-31.61
git bisect start 'Ubuntu-2.6.32-32.62' 'Ubuntu-2.6.32-31.61'
# bad: [1198ae36cd30092718563b42e2f1d847516a5e45] fbcon: Bugfix soft cursor detection in Tile Blitting
git bisect bad 1198ae36cd30092718563b42e2f1d847516a5e45
# good: [f1324476bb889b1a6824581f55d65324df9b505d] ahci: AHCI mode SATA patch for Intel Patsburg SATA RAID controller
git bisect good f1324476bb889b1a6824581f55d65324df9b505d
# good: [f1809971f60bf42ff9b47eac395e8729ed6bd525] SUNRPC: Ensure we always run the tk_callback before tk_action
git bisect good f1809971f60bf42ff9b47eac395e8729ed6bd525
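
The remaining good/bad window can be read straight out of such a log. The sketch below is one possible way to do it, with a hypothetical helper `last_marks`; it simply takes the most recent good and bad marks, which is only an approximation of the candidate set when the history is not linear.

```python
import re

# Non-comment lines of the bisect log shown above.
BISECT_LOG = """\
git bisect start 'Ubuntu-2.6.32-32.62' 'Ubuntu-2.6.32-31.61'
git bisect bad 1198ae36cd30092718563b42e2f1d847516a5e45
git bisect good f1324476bb889b1a6824581f55d65324df9b505d
git bisect good f1809971f60bf42ff9b47eac395e8729ed6bd525
"""


def last_marks(log_text):
    """Return (last_good, last_bad) revisions recorded in a bisect log."""
    good = bad = None
    for line in log_text.splitlines():
        if line.startswith("#"):
            continue  # skip the explanatory comment lines
        start = re.match(r"git bisect start '([^']+)' '([^']+)'", line)
        if start:
            # `git bisect start <bad> <good>`: the bad revision comes first.
            bad, good = start.group(1), start.group(2)
            continue
        mark = re.match(r"git bisect (good|bad) ([0-9a-f]{40})$", line)
        if mark and mark.group(1) == "good":
            good = mark.group(2)
        elif mark:
            bad = mark.group(2)
    return good, bad


good, bad = last_marks(BISECT_LOG)
print(f"git log --oneline {good}..{bad}")
```

The printed command lists the commits still under suspicion between the last good and last bad revisions.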

Tom Ellis (tellis) wrote :

I'm having issues recompiling after this portion of the bisect. I've followed advice from apw to run 'git reset --hard HEAD^', but after 10 of these it still fails to compile.

Tom Ellis (tellis) wrote :

Some error messages seen on screen.

Herton R. Krzesinski (herton) wrote :

@Tom Ellis: can you try reverting the commit cdaa050bd1c658c5b808a35a1fc061038ad19bc5 ("x86: Flush TLB if PGD entry is changed in i386 PAE mode") from the Ubuntu-2.6.32-32.62 kernel, build a -pae flavour and ask for testing?

I think it's likely that this commit plays a role in this bug: it's inside the last good/bad window from your latest bisect log (f1809971f60bf42ff9b47eac395e8729ed6bd525..1198ae36cd30092718563b42e2f1d847516a5e45), and it affects the PAE path, adding a flush_tlb_mm call that also appears in the soft lockup trace.
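
The shortcut described here, picking out commits in the window whose subject touches the affected code path, can be sketched as a simple keyword filter over `git log --oneline` output. This is an illustration only: the sample list is just the handful of subjects quoted in this thread (with hashes abbreviated to seven characters), not the real contents of the window, and `suspects` and the keyword choice are assumptions.

```python
# Commit subjects quoted in this thread, in `git log --oneline` style.
ONELINE = """\
1198ae3 fbcon: Bugfix soft cursor detection in Tile Blitting
0359bcc isdn: avoid calling tty_ldisc_flush() in atomic context
71e56b3 x86, binutils, xen: Fix another wrong size directive
cdaa050 x86: Flush TLB if PGD entry is changed in i386 PAE mode
"""


def suspects(oneline_text, keywords):
    """Return (sha, subject) pairs whose subject contains any keyword."""
    hits = []
    for line in oneline_text.splitlines():
        sha, _, subject = line.partition(" ")
        lowered = subject.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append((sha, subject))
    return hits


# Keywords chosen because the lockups are PAE-only and the trace shows a TLB flush.
print(suspects(ONELINE, ("pae", "tlb")))
```

A filter like this only narrows the list for a human to review; the candidate still has to be confirmed by reverting it and testing.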

Tom Ellis (tellis) wrote :

Thanks Herton, I did see that commit and wondered... compiling now.

Tom Ellis (tellis) wrote :

http://people.canonical.com/~trellis/bisect-lp805209/linux-image-2.6.32-32-generic-pae_2.6.32-32.63~tellis04LP805209_i386.deb

This kernel is not a bisect, it just removes the one patch mentioned above:
cdaa050bd1c658c5b808a35a1fc061038ad19bc5 "x86: Flush TLB if PGD entry is changed in i386 PAE mode"

Thorsten Hesemeyer (thorsten-hesemeyer) wrote :

New kernel passed!
No freeze!

Thank you Tom,
the last tellis04 kernel does not show the heavy freeze issue.

The current standard Ubuntu kernel 2.6.32-32-generic-pae showed a freeze every time executing our application (or at least the second time).
I ran the application thirty (30) times with the 2.6.32-32.63-tellis04 on the same machine - no freeze.

Congratulations!!!
Thorsten

Tom Ellis (tellis) wrote :

Good news: without cdaa050bd1c658c5b808a35a1fc061038ad19bc5 the kernel passed 30 test iterations. So this is the commit causing the regression.

Murilo Opsfelder Araújo (mopsfelder) wrote :

Hi guys,

I'd like to add that my T410 didn't freeze with linux-image-2.6.32-32-generic-pae_2.6.32-32.63~tellis04LP805209_i386.deb either.

Thanks,
Murilo

Tom Ellis (tellis) wrote :

Thanks Gustavo, Thorsten & Murilo.

We are respinning the lucid kernel and it should be out soon, hopefully within two weeks. Before then we'll have it in lucid-proposed for testing, with the headers and other bits you need.

Herton R. Krzesinski (herton) wrote :

This bug is awaiting verification that the kernel in lucid-proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-lucid' to 'verification-done-lucid'.

If verification is not done by one week from today, this revert will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-lucid
Herton R. Krzesinski (herton) wrote :

Ok, please ignore comment #15: since this is a known revert, we don't need to verify it (and will not drop the revert).

Anyway, it would be great if anyone here could verify the kernel in -proposed and confirm that the issue is gone. Just report the results here, thanks.

tags: removed: verification-needed-lucid
Thorsten Hesemeyer (thorsten-hesemeyer) wrote :

Hi Herton,

I've enabled lucid-proposed, installed and tested this kernel:
   Ubuntu 2.6.32-33-generic-pae #70

Ran 30 iterations of the application that triggered the bug; not a single freeze.

Thank you.

Kind regards,
Thorsten Hesemeyer

tags: added: verification-done-lucid
Gustavo Yokoyama Ribeiro (gutoyr) wrote :

Hi Herton,

Same here, tested lucid-proposed kernel (Ubuntu 2.6.32-33-generic-pae #70) and system did not lock up.

Regards,
Gustavo

Herton R. Krzesinski (herton) wrote :

As this bug affects maverick too, a new maverick kernel is now also available in -proposed (2.6.35-30.56), with the same revert.

Thorsten Hesemeyer (thorsten-hesemeyer) wrote :

Dear Herton,

I just got very positive feedback on the new 2.6.35-30.56 kernel from a colleague who formerly complained about multiple freezes per day.
Thank you.

Kind regards,
Thorsten

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 2.6.32-33.70

---------------
linux (2.6.32-33.70) lucid-proposed; urgency=low

  [Steve Conklin]

  * Release Tracking Bug
    - LP: #807175

  [ Upstream Kernel Changes ]

  * Revert "x86: Flush TLB if PGD entry is changed in i386 PAE mode"
    - LP: #805209

linux (2.6.32-33.69) lucid-proposed; urgency=low

  [Steve Conklin]

  * Release Tracking Bug
    - LP: #802554

  [ Upstream Kernel Changes ]

  * Revert "af_unix: Only allow recv on connected seqpacket sockets."

linux (2.6.32-33.68) lucid-proposed; urgency=low

  [ Steve Conklin ]

  * Release Tracking Bug
    - LP: #798305
  * Fix abi directory

linux (2.6.32-33.67) lucid-proposed; urgency=low

  [ Upstream Kernel Changes ]

  * Revert "iwlagn: Support new 5000 microcode."

linux (2.6.32-33.66) lucid-proposed; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #794098

  [ Upstream Kernel Changes ]

  * Revert "xhci: Fix full speed bInterval encoding."
  * Revert "USB: xhci - fix math in xhci_get_endpoint_interval()"
  * Revert "USB: xhci - fix unsafe macro definitions"

linux (2.6.32-33.65) lucid-proposed; urgency=low

  [ Upstream Kernel Changes ]

  * xhci: Fix full speed bInterval encoding.
    - LP: #792959

linux (2.6.32-33.64) lucid-proposed; urgency=low

  [ Herton R. Krzesinski ]

   * Release Tracking Bug
     - LP: #789325

  [ Leann Ogasawara ]

  * SAUCE: (no-up) Fix up KVM: VMX: Fix host userspace gsbase corruption
    - LP: #787675

  [ Thomas Schlichter ]

  * SAUCE: vesafb: mtrr module parameter is uint, not bool
    - LP: #778043

  [ Tim Gardner ]

  * Revert "(pre-stable): input: Support Clickpad devices in ClickZone
    mode"
    - LP: #780588

  [ Upstream Kernel Changes ]

  * Revert "GFS2: Fix writing to non-page aligned gfs2_quota structures"
    - LP: #780588
  * Revert "mmc: build fix: mmc_pm_notify is only available with
    CONFIG_PM=y"
    - LP: #780588
  * Revert "mmc: fix all hangs related to mmc/sd card insert/removal during
    suspend/resume"
    - LP: #780588
  * Revert "econet: fix CVE-2010-3848"
    - LP: #780588
  * Revert "dell-laptop: Add another Dell laptop family to the DMI
    whitelist"
    - LP: #780588
  * Revert "dell-laptop: Add another Dell laptop family to the DMI
    whitelist"
    - LP: #780588
  * Revert "xen: set max_pfn_mapped to the last pfn mapped"
  * cifs: always do is_path_accessible check in cifs_mount
    - LP: #770050
  * video: sn9c102: world-wirtable sysfs files
    - LP: #770050
  * UBIFS: restrict world-writable debugfs files
    - LP: #770050
  * NET: cdc-phonet, handle empty phonet header
    - LP: #770050
  * x86: Fix a bogus unwind annotation in lib/semaphore_32.S
    - LP: #770050
  * tioca: Fix assignment from incompatible pointer warnings
    - LP: #770050
  * mca.c: Fix cast from integer to pointer warning
    - LP: #770050
  * ramfs: fix memleak on no-mmu arch
    - LP: #770050
  * MAINTAINERS: update STABLE BRANCH info
    - LP: #770050
  * UBIFS: fix oops when R/O file-system is fsync'ed
    - LP: #770050
  * x86, cpu: AMD errata checking framework
    - LP: #770050
  * x86, cpu: Clean up AMD erratum 400 workaround
    - LP: #770050
  * x86, AMD: Se...

Changed in linux (Ubuntu):
status: New → Fix Released
Greg Gorman (gregg-public) wrote :

Booted on my x201 and will see how it goes. Older kernels with the bug tended to freeze up overnight. First impressions are good in that it didn't freeze when the wifi associated (I filed it as bug #800798).

Steve Conklin (sconklin) wrote :

This revert has been applied to the Maverick kernel now in -proposed

Changed in linux (Ubuntu Maverick):
status: New → Confirmed
status: Confirmed → Fix Committed
Tom Ellis (tellis) wrote :

We now have confirmation that the -proposed kernels for both lucid and maverick are working well.

Herton R. Krzesinski (herton) wrote :

As this was a revert of an upstream patch that introduced this regression, it doesn't need verification. We also have confirmation from comments here that it's fixed in the -proposed kernel. Tagging verification-done-maverick.

tags: added: verification-done-maverick
tags: added: maverick
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 2.6.35-30.56

---------------
linux (2.6.35-30.56) maverick-proposed; urgency=low

  [Herton R. Krzesinski]

  * Release Tracking Bug
    - LP: #808934

  [ Herton Ronaldo Krzesinski ]

  * Revert "SAUCE: mmc: Enable MMC card reader for RICOH [1180:e823]"

  [ Upstream Kernel Changes ]

  * Revert "x86: Flush TLB if PGD entry is changed in i386 PAE mode"
    - LP: #805209

linux (2.6.35-30.55) maverick-proposed; urgency=low

  [Steve Conklin]

  * Release Tracking Bug
    - LP: #801690

  [ Jeremy Kerr ]

  * SAUCE: cx23885: Fix argument to videobuf_dma_unmap
    - LP: #800527

  [ Manoj Iyer ]

  * SAUCE: mmc: Enable MMC card reader for RICOH [1180:e823]
    - LP: #790754

  [ Upstream Kernel Changes ]

  * agp: fix OOM and buffer overflow
    - LP: #791918
    - CVE-2011-1746
  * tty: icount changeover for other main devices, CVE-2010-4076,
    CVE-2010-4077
    - LP: #720189
    - CVE-2010-4077
  * fs/partitions/efi.c: corrupted GUID partition tables can cause kernel
    oops
    - LP: #795418
    - CVE-2011-1577
  * Fix corrupted OSF partition table parsing
    - LP: #796606
    - CVE-2011-1163
  * can: Add missing socket check in can/bcm release.
    - LP: #796502
    - CVE-2011-1598
  * nfs4: Ensure that ACL pages sent over NFS were not allocated from the
    slab (v3) CVE-2011-1090
    - LP: #800775
    - CVE-2011-1090
 -- Herton Ronaldo Krzesinski <email address hidden> Mon, 11 Jul 2011 15:17:32 -0300

Changed in linux (Ubuntu Maverick):
status: Fix Committed → Fix Released