serial deadlock with SMP

Bug #280821 reported by pbeeson on 2008-10-09
8
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
High
Unassigned
Hardy
High
Andy Whitcroft

Bug Description

Binary package hint: linux-image-2.6.24-19-generic

Upon switching from Dapper to Hardy, I noticed one of my systems kept getting serial timeouts. Upon further investigation, it seems that there is a serial deadlock issue in kernels < 2.6.27, where machines with multiple processors and multiple serial devices can call the serial interrupt handler at the same time and become deadlocked. This was fixed in 2.6.27 with 4 lines of code. I have created a patch with the same four line based on the linux-source-2.6.24.tbz (from installing linux-source package), and I have tested it with many serial deives talking at once. My problems go away. Given that this is such a fundamental bug (deadlock in interrupt handlers), I feel it is necessary to apply this to the LTS kernels in Hardy.

The original discussion of this bug and the patch (by the usual kernel programmers) is at:
http://kerneltrap.org/mailarchive/linux-kernel/2008/7/1/2313364

Note, that the final fix in 2.6.27 keeps the lock and unlock lines as they were (the original discussion at the thread above changed them). My patch does not change them (as in the final 2.6.27 patch for this problem).

===

SRU Justification

Justification: Systems with multiple serial ports on the will deadlock losing serial output.

Impact: Systems may lose serial output including console output, particularly serious on servers.

Fix Description: Cherry-picked the upstream fixes for this, which introduces proper locking.

Patch:
    http://kernel.ubuntu.com/git?p=ubuntu/ubuntu-hardy.git;a=commit;h=45485c71ce3099b7c7349f6c32dadc53087794cf
    http://kernel.ubuntu.com/git?p=ubuntu/ubuntu-hardy.git;a=commit;h=6dc468aa85f0cf60ab1b8897143226203510efe8

Risks: This has been tested by the affected users to good effect. Risk should be low for normal use.

TEST CASE: See bug report.

pbeeson (pbeeson) wrote :
pbeeson (pbeeson) on 2008-10-09
description: updated
description: updated

Hi pbeeson,

Thanks for the report. I'll go ahead and open a Hardy nomination. I'm also including the git commit id's whose final patch result is what you have attached:

commit 768aec0b5bccbd460bcf6e9131f19b5a26f3862d
Author: Anton Vorontsov <email address hidden>

    serial: 8250: fix shared interrupts issues with SMP and RT kernels

diff --git a/drivers/serial/8250.c b/drivers/serial/8250.c
index ce948b6..27f34a9 100644
--- a/drivers/serial/8250.c
+++ b/drivers/serial/8250.c
@@ -1874,7 +1874,9 @@ static int serial8250_startup(struct uart_port *port)
                 * the interrupt is enabled. Delays are necessary to
                 * allow register changes to become visible.
                 */
- spin_lock_irqsave(&up->port.lock, flags);
+ spin_lock(&up->port.lock);
+ if (up->port.flags & UPF_SHARE_IRQ)
+ disable_irq_nosync(up->port.irq);

                wait_for_xmitr(up, UART_LSR_THRE);
                serial_out_sync(up, UART_IER, UART_IER_THRI);
@@ -1886,7 +1888,9 @@ static int serial8250_startup(struct uart_port *port)
                iir = serial_in(up, UART_IIR);
                serial_out(up, UART_IER, 0);

- spin_unlock_irqrestore(&up->port.lock, flags);
+ if (up->port.flags & UPF_SHARE_IRQ)
+ enable_irq(up->port.irq);
+ spin_unlock(&up->port.lock);

                /*
                 * If the interrupt is not reasserted, setup a timer to

commit c389d27b5e643d745f55ffb939b1426060ba63d4
Author: Borislav Petkov <email address hidden>

    8250.c: port.lock is irq-safe

diff --git a/drivers/serial/8250.c b/drivers/serial/8250.c
index a97f1ae..342e12f 100644
--- a/drivers/serial/8250.c
+++ b/drivers/serial/8250.c
@@ -1885,7 +1885,7 @@ static int serial8250_startup(struct uart_port *port)
                 * the interrupt is enabled. Delays are necessary to
                 * allow register changes to become visible.
                 */
- spin_lock(&up->port.lock);
+ spin_lock_irqsave(&up->port.lock, flags);
                if (up->port.flags & UPF_SHARE_IRQ)
                        disable_irq_nosync(up->port.irq);

@@ -1901,7 +1901,7 @@ static int serial8250_startup(struct uart_port *port)

                if (up->port.flags & UPF_SHARE_IRQ)
                        enable_irq(up->port.irq);
- spin_unlock(&up->port.lock);
+ spin_unlock_irqrestore(&up->port.lock, flags);

                /*
                 * If the interrupt is not reasserted, setup a timer to

Changed in linux:
assignee: nobody → ubuntu-kernel-team
importance: Undecided → High
status: New → Triaged

I've marked this as "Triaged" for Hardy and am setting this to "Fix Released" for Intrepid. Thanks.

Changed in linux:
status: New → Fix Released

Per a decision made by the Ubuntu Kernel Team, bugs will longer be assigned to the ubuntu-kernel-team in Launchpad as part of the bug triage process. The ubuntu-kernel-team is being unassigned from this bug report. Refer to https://wiki.ubuntu.com/KernelTeamBugPolicies for more information. Thanks.

pbeeson (pbeeson) wrote :

>>Per a decision made by the Ubuntu Kernel Team, bugs will longer be assigned to the ubuntu-kernel-team in Launchpad as part of the bug triage process. The ubuntu-kernel-team is being unassigned from this bug report. Refer to https://wiki.ubuntu.com/KernelTeamBugPolicies for more information. Thanks.

Does this mean that this bug patch will not be considered for Hardy? I've been hoping that it would be included in one of the several kernel updates that have been released since I sent this patch, but it doesn't seem to have been included. I don't understand why, as a deadlock in the serial driver seems important to me -- in fact it took me a week to figure out that this was causing hug problem with my system. Sorry to seem impatient, but I would like to get other kernel updates without having to manually patch the source and rebuild the kernel each time.

Andy Whitcroft (apw) wrote :

@pbeeson -- it simply means that the assignment to the launchpad group ubuntu-kernel-team was not adding any value, and indeed leading to bugs being missed. So we have removed that assignment. It does not mean that the bug is not the responsibility of that group, nor that it is not being tracked.

Andy Whitcroft (apw) on 2009-02-09
Changed in linux:
assignee: nobody → apw
status: Triaged → In Progress
Andy Whitcroft (apw) wrote :

@pbeeson -- I have built some test kernels based on the latest Hardy -proposed kernel with these two commits applied. As part of the SRU process I need to get these tested. Could you test them and report back here. Kernels will be at the URL below shortly:

    http://people.ubuntu.com/~apw/lp280821-hardy/

Changed in linux:
importance: Undecided → High

I will test these the next time I am at the facility that contains the
machine that does multi-threaded serial I/O.

It may be a week or so before I get back to you.

Thanks.

Andy Whitcroft wrote:
> @pbeeson -- I have built some test kernels based on the latest Hardy
> -proposed kernel with these two commits applied. As part of the SRU
> process I need to get these tested. Could you test them and report back
> here. Kernels will be at the URL below shortly:
>
> http://people.ubuntu.com/~apw/lp280821-hardy/
>
> ** Changed in: linux (Ubuntu)
> Importance: Undecided => High
>

--
Patrick Beeson
http://www.cs.utexas.edu/~pbeeson

pbeeson (pbeeson) wrote :

I have done some initial testing of this kernel . Multiple serial devices work fine with the patched kernel. Of course, even in the non-patched kernels these work OK sometimes, so I'll continue to use this kernel and report if/when I see a failure. But I can verify that the same patch built on my own version of the kernel has worked successfully for many months, and this new patched kernel has worked successfully for several hours so far.

pbeeson (pbeeson) wrote :

Long term testing of this kernel confirms the fact that this patch to the 8250 serial driver does indeed fix the deadlock problem. This test kernel has worked great so far, with no problems. Pleaswe condiser adding this patch to the publicly distributed kernel.

Andy Whitcroft (apw) on 2009-03-02
description: updated
Stefan Bader (smb) on 2009-03-18
Changed in linux:
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :
Download full text (5.1 KiB)

This bug was fixed in the package linux - 2.6.24-24.53

---------------
linux (2.6.24-24.53) hardy-proposed; urgency=low

  [Stefan Bader]

  * Rebuild of 2.6.24-24.51 with 2.6.24-23.52 security patches applied.

linux (2.6.24-24.51) hardy-proposed; urgency=low

  [Alessio Igor Bogani]

  * rt: Updated PREEMPT_RT support to rt27
    - LP: #324275

  [Steve Beattie]

  * fix apparmor memory leak on deleted file ops
    - LP: #329489

  [Upstream Kernel Changes]

  * KVM: MMU: Add locking around kvm_mmu_slot_remove_write_access()
    - LP: #335097, #333409
  * serial: 8250: fix shared interrupts issues with SMP and RT kernels
    - LP: #280821
  * 8250.c: port.lock is irq-safe
    - LP: #280821
  * ACPI: Clear WAK_STS on resume
    - LP: #251338

linux (2.6.24-24.50) hardy-proposed; urgency=low

  [Alok Kataria]

  * x86: add X86_FEATURE_HYPERVISOR feature bit
    - LP: #319945
  * x86: add a synthetic TSC_RELIABLE feature bit
    - LP: #319945
  * x86: vmware: look for DMI string in the product serial key
    - LP: #319945
  * x86: Hypervisor detection and get tsc_freq from hypervisor
    - LP: #319945
  * x86: Use the synthetic TSC_RELIABLE bit to workaround virtualization
    anomalies.
    - LP: #319945
  * x86: Skip verification by the watchdog for TSC clocksource.
    - LP: #319945
  * x86: Mark TSC synchronized on VMware.
    - LP: #319945

  [Colin Ian King]

  * SAUCE: Bluetooth USB: fix kernel panic during suspend while streaming
    audio to bluetooth headset
    - LP: #331106

  [James Troup]

  * XEN: Enable architecture specific get_unmapped_area_topdown
    - LP: #237724

  [Stefan Bader]

  * Xen: Fix FTBS after Vmware TSC updates.
    - LP: #319945

  [Upstream Kernel Changes]

  * r8169: fix RxMissed register access
    - LP: #324760
  * r8169: Tx performance tweak helper
    - LP: #326891
  * r8169: use pci_find_capability for the PCI-E features
    - LP: #326891
  * r8169: add 8168/8101 registers description
    - LP: #326891
  * r8169: add hw start helpers for the 8168 and the 8101
    - LP: #326891
  * r8169: additional 8101 and 8102 support
    - LP: #326891
  * Fix memory corruption in console selection
    - LP: #329007

linux (2.6.24-23.52) hardy-security; urgency=low

  [Stefan Bader]
  * rt: Fix FTBS caused by shm changes
    - CVE-2009-0859

  [Steve Beattie]

  * fix apparmor memory leak on deleted file ops
    - LP: #329489

  [Upstream Kernel Changes]

  * NFS: Remove the buggy lock-if-signalled case from do_setlk()
    - CVE-2008-4307
  * sctp: Avoid memory overflow while FWD-TSN chunk is received with bad
    stream ID
    - CVE-2009-0065
  * net: 4 bytes kernel memory disclosure in SO_BSDCOMPAT gsopt try #2
    - CVE-2009-0676
  * sparc: Fix mremap address range validation.
    - CVE-2008-6107
  * copy_process: fix CLONE_PARENT && parent_exec_id interaction
    - CVE-2009-0028
  * security: introduce missing kfree
    - CVE-2009-0031
  * eCryptfs: check readlink result was not an error before using it
    - CVE-2009-0269
  * dell_rbu: use scnprintf() instead of less secure sprintf()
    - CVE-2009-0322
  * drivers/net/skfp: if !capable(CAP_NET_ADMIN): inverted logic
    - CVE-2009-0675
  * Ext4: Fix online res...

Read more...

Changed in linux (Ubuntu Hardy):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers