CPU soft lockup in Xen PTE allocation on m2.2xlarge instances

Bug #919431 reported by Matt Wilson
This bug affects 2 people
Affects: linux-meta-ec2 (Ubuntu)
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

The following soft lockup is seen randomly on m2.2xlarge instances in EC2:

[1284451.875485] BUG: soft lockup - CPU#3 stuck for 61s! [identify:24060]
[1284451.875485] Modules linked in: ipv6 ipt_REJECT ipt_LOG xt_limit nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_tcpudp xt_owner iptable_filter ip_tables x_tables raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 multipath linear md_mod
[1284451.875485] CPU 3:
[1284451.875485] Modules linked in: ipv6 ipt_REJECT ipt_LOG xt_limit nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_tcpudp xt_owner iptable_filter ip_tables x_tables raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 multipath linear md_mod
[1284451.875485] Pid: 24060, comm: identify Tainted: G D 2.6.32-316-ec2 #31-Ubuntu
[1284451.875485] RIP: e030:[<ffffffff810063aa>] [<ffffffff810063aa>] 0xffffffff810063aa
[1284451.875485] RSP: e02b:ffff8800eba4d930 EFLAGS: 00000246
[1284451.875485] RAX: 0000000000000000 RBX: ffff88025b7634c8 RCX: ffffffff810063aa
[1284451.875485] RDX: 0000000000000019 RSI: ffff8800eba4d948 RDI: 0000000000000003
[1284451.875485] RBP: ffff8800eba4d968 R08: ffff88025b763708 R09: 0000000000000040
[1284451.875485] R10: 0000000000007ff0 R11: 0000000000000246 R12: 0000000000000015
[1284451.875485] R13: 0000000000000045 R14: ffff88025b7634a8 R15: 0000000000000001
[1284451.875485] FS: 00007f1a7f487700(0000) GS:ffff880002f35000(0000) knlGS:0000000000000000
[1284451.875485] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[1284451.875485] CR2: 00007f16bfb0a398 CR3: 0000000001001000 CR4: 0000000000002660
[1284451.875485] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1284451.875485] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[1284451.875485] Call Trace:
[1284451.875485] [<ffffffff813a1e5f>] ? xen_poll_irq+0x7f/0xc0
[1284451.875485] [<ffffffff813a70d4>] xen_spin_wait+0x84/0x170
[1284451.875485] [<ffffffff814b112e>] ? _spin_lock+0x3e/0x60
[1284451.875485] [<ffffffff814b1143>] _spin_lock+0x53/0x60
[1284451.875485] [<ffffffff8101bf28>] _pin_lock+0x28/0x290
[1284451.875485] [<ffffffff8101d0ff>] mm_unpin+0x1f/0x40
[1284451.875485] [<ffffffff8101d1b1>] arch_exit_mmap+0x91/0xa0
[1284451.875485] [<ffffffff810d24f8>] exit_mmap+0x38/0x1b0
[1284451.875485] [<ffffffff8103a8ad>] mmput+0x2d/0x100
[1284451.875485] [<ffffffff8103f94d>] exit_mm+0x10d/0x150
[1284451.875485] [<ffffffff81040192>] do_exit+0x132/0x370
[1284451.875485] [<ffffffff814b2067>] oops_end+0xa7/0xf0
[1284451.875485] [<ffffffff8100dbf6>] die+0x56/0x90
[1284451.875485] [<ffffffff814b1794>] do_trap+0xc4/0x170
[1284451.875485] [<ffffffff8100b3c0>] do_invalid_op+0xb0/0xd0
[1284451.875485] [<ffffffff8101f5e4>] ? do_lN_entry_update+0x174/0x180
[1284451.875485] [<ffffffff810b2e35>] ? zone_watermark_ok+0x25/0xe0
[1284451.875485] [<ffffffff810b38de>] ? prep_new_page+0x11e/0x1c0
[1284451.875485] [<ffffffff8100a4b5>] invalid_op+0x25/0x30
[1284451.875485] [<ffffffff8101f5e4>] ? do_lN_entry_update+0x174/0x180
[1284451.875485] [<ffffffff8101f58e>] ? do_lN_entry_update+0x11e/0x180
[1284451.875485] [<ffffffff810b4636>] ? __alloc_pages_nodemask+0xd6/0x180
[1284451.875485] [<ffffffff8101fafb>] xen_l2_entry_update+0x13b/0x160
[1284451.875485] [<ffffffff810c99c6>] __pte_alloc+0x166/0x170
[1284451.875485] [<ffffffff810cdf65>] handle_mm_fault+0x495/0x4f0
[1284451.875485] [<ffffffff814b3767>] do_page_fault+0x147/0x390
[1284451.875485] [<ffffffff814b1598>] page_fault+0x28/0x30
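
When reading a trace like the one above, it helps to separate the frames the unwinder considers reliable from the ones marked `?`, which are only return addresses found on the stack and may be stale. The following is a minimal sketch for doing that, not part of the original report; the helper names `parse_frames` and `reliable_chain` are hypothetical, and the regex assumes the line layout shown in this log.

```python
import re

# Matches call-trace lines such as:
#   [1284451.875485] [<ffffffff813a70d4>] xen_spin_wait+0x84/0x170
#   [1284451.875485] [<ffffffff813a1e5f>] ? xen_poll_irq+0x7f/0xc0
# Register lines (RIP:, RSP:, ...) do not match because the address
# does not directly follow the timestamp.
FRAME_RE = re.compile(
    r"\[\s*\d+\.\d+\]\s+\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<unreliable>\?\s+)?(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_frames(log_text):
    """Return (symbol, is_reliable) tuples in the order they appear."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            frames.append((m.group("symbol"), m.group("unreliable") is None))
    return frames


def reliable_chain(log_text):
    """Return only the symbols of frames not marked with '?'."""
    return [sym for sym, reliable in parse_frames(log_text) if reliable]
```

Applied to this report, the reliable frames read from the bottom up as page_fault, do_page_fault, handle_mm_fault, __pte_alloc, xen_l2_entry_update, then the invalid-op and exit path ending in _spin_lock and xen_spin_wait.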

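The pinning hypercall in do_lN_entry_update() failed and triggered a BUG(), which is why do_invalid_op() appears in the trace. Unfortunately we never get the BUG() message itself: the oops handler calls do_exit(), whose cleanup path (exit_mm() -> mmput() -> exit_mmap() -> arch_exit_mmap() -> mm_unpin()) tries to take the pin lock, and we end up deadlocked on it.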
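
The deadlock pattern described here can be modelled in user space. This is only an illustrative sketch, not kernel code: it assumes the faulting path already holds the pin lock when BUG() fires, which the report implies but the trace does not show directly. The names `pin_lock`, `update_entry`, and `exit_path_can_take_lock` are hypothetical stand-ins, and a non-blocking acquire is used so the example reports the problem instead of hanging.

```python
import threading

# Stand-in for the kernel pin lock. Like a spinlock, threading.Lock is
# not recursive: the holder cannot take it a second time.
pin_lock = threading.Lock()


def exit_path_can_take_lock():
    """Model of the do_exit() ... mm_unpin() cleanup path.

    A real spinlock would spin forever here; the non-blocking acquire
    just reports whether the lock was available.
    """
    got = pin_lock.acquire(blocking=False)
    if got:
        pin_lock.release()
    return got


def update_entry(hypercall_ok):
    """Model of the page-table update path (assumed to hold the pin lock).

    Returns True if the exit path could take the lock, False if it
    would have deadlocked.
    """
    with pin_lock:
        if not hypercall_ok:
            # BUG() fires while the lock is still held, and the oops
            # handler runs the exit path in the same context.
            return exit_path_can_take_lock()
    # Normal case: the lock has been released before any exit path runs.
    return exit_path_can_take_lock()
```

In this model, `update_entry(True)` leaves the exit path free to proceed, while `update_entry(False)` shows the exit path finding the lock still held by its own context, which matches the CPU spinning in xen_spin_wait() until the soft lockup watchdog fires.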

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: linux-ec2 2.6.32.341.22
ProcVersionSignature: Ubuntu 2.6.32-341.42-ec2 2.6.32.49+drm33.21
Uname: Linux 2.6.32-341-ec2 x86_64
Architecture: amd64
Date: Fri Jan 20 22:43:24 2012
Ec2AMI: ami-55dc0b3c
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: us-east-1c
Ec2InstanceType: m2.2xlarge
Ec2Kernel: aki-427d952b
Ec2Ramdisk: unavailable
ProcEnviron:
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-meta-ec2

Matt Wilson (msw-amazon) wrote :

The hypercall fails due to invalid write permissions on the page that is being pinned. Perhaps the page being pinned for PTEs was reused?

One fix that was applied to the upstream kernel for such problems was this: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=64141da587241301ce8638cc945f8b67853156ec

I don't think that's the cause in this case since XFS isn't in use. Perhaps some other kernel subsystem is leaving pages behind in the vmalloc area with write permissions set?

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-meta-ec2 (Ubuntu):
status: New → Confirmed
Andy Whitcroft (apw)
Changed in linux-meta-ec2 (Ubuntu):
status: Confirmed → Won't Fix