Ubuntu Crashes/Freeze on XenMotion

Bug #681083 reported by Luiz Ozaki on 2010-11-24
This bug affects 15 people
Affects          Importance   Assigned to
linux (Ubuntu)   Medium       Unassigned
Lucid            Medium       Stefan Bader
Maverick         Medium       Stefan Bader

Bug Description

Binary package hint: linux-image-2.6.32-25-generic

With one processor, migrating the Ubuntu guest between hosts dumps:

[518258.206396] INFO: task xenwatch:12 blocked for more than 120 seconds.
[518258.206405] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[518258.206412] xenwatch D ffff88000294cbc0 0 12 2 0x00000000
[518258.206417] ffff88007dfe5bf0 0000000000000246 0000000000015bc0 0000000000015bc0
[518258.206423] ffff88007dfd9ab0 ffff88007dfe5fd8 0000000000015bc0 ffff88007dfd96f0
[518258.206428] 0000000000015bc0 ffff88007dfe5fd8 0000000000015bc0 ffff88007dfd9ab0
[518258.206434] Call Trace:
[518258.206443] [<ffffffff8155a0be>] ? _spin_unlock_irqrestore+0x1e/0x30
[518258.206448] [<ffffffff8155841d>] schedule_timeout+0x22d/0x300
[518258.206454] [<ffffffff8100f302>] ? check_events+0x12/0x20
[518258.206459] [<ffffffff8106085f>] ? __enqueue_rt_entity+0x11f/0x220
[518258.206463] [<ffffffff8105a616>] ? update_curr+0xe6/0x1e0
[518258.206466] [<ffffffff815576c6>] wait_for_common+0xd6/0x180
[518258.206470] [<ffffffff8105a350>] ? default_wake_function+0x0/0x20
[518258.206474] [<ffffffff8155782d>] wait_for_completion+0x1d/0x20
[518258.206478] [<ffffffff81083f1b>] kthread_stop+0x4b/0xd0
[518258.206482] [<ffffffff8108724f>] ? hrtimer_force_reprogram+0x7f/0x90
[518258.206487] [<ffffffff8107fbbe>] cleanup_workqueue_thread+0x3e/0x80
[518258.206490] [<ffffffff8107fda3>] destroy_workqueue+0x93/0xe0
[518258.206495] [<ffffffff810b5764>] stop_machine_destroy+0x34/0x50
[518258.206499] [<ffffffff8131ff7f>] do_suspend+0xaf/0x120
[518258.206502] [<ffffffff813200f9>] shutdown_handler+0x109/0x160
[518258.206505] [<ffffffff81321472>] xenwatch_thread+0xc2/0x190
[518258.206509] [<ffffffff81084240>] ? autoremove_wake_function+0x0/0x40
[518258.206512] [<ffffffff813213b0>] ? xenwatch_thread+0x0/0x190
[518258.206515] [<ffffffff81083ec6>] kthread+0x96/0xa0
[518258.206519] [<ffffffff810131ea>] child_rip+0xa/0x20
[518258.206523] [<ffffffff810123d1>] ? int_ret_from_sys_call+0x7/0x1b
[518258.206526] [<ffffffff81012b5d>] ? retint_restore_args+0x5/0x6
[518258.206530] [<ffffffff810131e0>] ? child_rip+0x0/0x20
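
The hung-task report above starts with a header line that names the blocked task, its PID, and the timeout. A minimal sketch, assuming the kernel log has been saved as text, of how such headers could be picked out when comparing logs from several guests (the function name `find_hung_tasks` is hypothetical, not part of any tool mentioned in this report):

```python
import re

# Matches the header line printed by the kernel's hung-task detector,
# e.g. "[518258.206396] INFO: task xenwatch:12 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<task>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def find_hung_tasks(log_text):
    """Return one dict per hung-task header found in kernel log text."""
    reports = []
    for line in log_text.splitlines():
        match = HUNG_TASK.match(line.strip())
        if match:
            reports.append({
                "timestamp": float(match.group("ts")),
                "task": match.group("task"),
                "pid": int(match.group("pid")),
                "seconds": int(match.group("secs")),
            })
    return reports


sample = "[518258.206396] INFO: task xenwatch:12 blocked for more than 120 seconds."
print(find_hung_tasks(sample))
```

This only identifies which task was reported as blocked; the call trace that follows the header still has to be read by hand.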

With more processors, the VM either freezes (no access from console or network) or loses disk I/O and network connectivity.

I've tried the kernel http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.33.5-lucid/linux-image-2.6.33-02063305-generic_2.6.33-02063305_amd64.deb and the problem doesn't happen in this kernel.

I'm using XenServer 5.6 Build 31188p

I've tried some older kernels and backporting some patches from 2.6.33, but so far without success.

I'm gonna still try to fix it, but any help will be appreciated.

Best regards,

Luiz Ozaki (luiz-ozaki) wrote :

Testing the PPA kernels: those before http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.32.16.5-lucid/ work well; after this release (2.6.32.16.5) the problem appears.

Looking at the changes, this one seems related: it touches both disk and network, which matches the two symptoms I'm getting:

      xen: avoid allocation causing potential swap activity on the resume path

I tried rolling back the patch, and the Ubuntu guest VM gets powered off at the end of the migration.

I'll keep looking.

Luiz Ozaki (luiz-ozaki) wrote :

Updated to the new kernel release 2.6.32-26-generic

[1285568.771462] ------------[ cut here ]------------
[1285568.771473] kernel BUG at /build/buildd/linux-2.6.32/arch/x86/xen/spinlock.c:343!
[1285568.771486] invalid opcode: 0000 [#1] SMP
[1285568.771500] last sysfs file: /sys/power/pm_trace
[1285568.771508] CPU 0
[1285568.771517] Modules linked in: xenfs lp parport xen_netfront xen_blkfront
[1285568.771554] Pid: 41, comm: xenwatch Not tainted 2.6.32-26-generic #47-Ubuntu
[1285568.771567] RIP: e030:[<ffffffff8100fca4>] [<ffffffff8100fca4>] dummy_handler+0x4/0x10
[1285568.771590] RSP: e02b:ffff880003669e88 EFLAGS: 00010046
[1285568.771594] RAX: ffffffffff57b000 RBX: ffff88007fc1b060 RCX: 0000000000000000
[1285568.771600] RDX: 0000000000400200 RSI: 0000000000000000 RDI: 0000000000000001
[1285568.771605] RBP: ffff880003669e88 R08: 0000000000000000 R09: 0000000000000000
[1285568.771612] R10: ffff880003671028 R11: 0000000000012eb0 R12: 0000000000000000
[1285568.771617] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000100
[1285568.771626] FS: 00007fae6666c700(0000) GS:ffff880003666000(0000) knlGS:0000000000000000
[1285568.771632] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[1285568.771637] CR2: 0000000000000000 CR3: 000000007d7f7000 CR4: 0000000000002660
[1285568.771645] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1285568.771653] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[1285568.771663] Process xenwatch (pid: 41, threadinfo ffff88007d50c000, task ffff88007d4cadc0)
[1285568.771669] Stack:
[1285568.771672] ffff880003669ed8 ffffffff810c4550 0000000000000000 0000000000000000
[1285568.771681] <0> 0000000000000000 ffffffff817b1300 0000000000000001 0000000000000000
[1285568.771690] <0> 0000000000000001 0000000000000100 ffff880003669ef8 ffffffff810c6882
[1285568.771702] Call Trace:
[1285568.771705] <IRQ>
[1285568.771713] [<ffffffff810c4550>] handle_IRQ_event+0x60/0x170
[1285568.771723] [<ffffffff810c6882>] handle_percpu_irq+0x42/0x80
[1285568.771734] [<ffffffff81014d12>] handle_irq+0x22/0x30
[1285568.771743] [<ffffffff8131da99>] xen_evtchn_do_upcall+0x199/0x1c0
[1285568.771749] [<ffffffff8101333e>] xen_do_hypervisor_callback+0x1e/0x30
[1285568.771754] <EOI>
[1285568.771763] [<ffffffff8100922a>] ? hypercall_page+0x22a/0x1010
[1285568.771769] [<ffffffff8100922a>] ? hypercall_page+0x22a/0x1010
[1285568.771776] [<ffffffff8100eb6d>] ? xen_force_evtchn_callback+0xd/0x10
[1285568.771783] [<ffffffff8100f302>] ? check_events+0x12/0x20
[1285568.771791] [<ffffffff8100f2a9>] ? xen_irq_enable_direct_end+0x0/0x7
[1285568.771800] [<ffffffff810578b9>] ? finish_task_switch+0x59/0xe0
[1285568.771808] [<ffffffff815417a8>] ? thread_return+0x48/0x420
[1285568.771815] [<ffffffff8106240a>] ? __cond_resched+0x2a/0x40
[1285568.771823] [<ffffffff81543e5e>] ? _spin_unlock_irqrestore+0x1e/0x30
[1285568.771829] [<ffffffff81541c80>] ? _cond_resched+0x30/0x40
[1285568.771836] [<ffffffff81080726>] ? flush_workqueue+0x36/0x80
[1285568.771843] [<ffffffff810b5b34>] ? __stop_machine+0xf4/0x120
[1285568.771850] [<ffffffff8131eda0>] ? xen_suspend+0x0/0xf0
[1285568.771855] [<ffffffff810b5d8e>] ? s...


Luiz Ozaki (luiz-ozaki) wrote :

Using 2.6.32-26-server, I think it is the same problem, but this time it hit the swapper process. Anyway, here it goes:

[1286941.792445] ------------[ cut here ]------------
[1286941.792454] kernel BUG at /build/buildd/linux-2.6.32/arch/x86/xen/spinlock.c:343!
[1286941.792462] invalid opcode: 0000 [#1] SMP
[1286941.792469] last sysfs file: /sys/power/pm_trace
[1286941.792473] CPU 1
[1286941.792477] Modules linked in: xenfs lp xen_netfront parport xen_blkfront
[1286941.792490] Pid: 0, comm: swapper Not tainted 2.6.32-26-server #47-Ubuntu
[1286941.792495] RIP: e030:[<ffffffff8100fca4>] [<ffffffff8100fca4>] dummy_handler+0x4/0x10
[1286941.792507] RSP: e02b:ffff880003680e88 EFLAGS: 00010046
[1286941.792511] RAX: ffffffffff57b000 RBX: ffff88007fc1b2a0 RCX: 0000000000000000
[1286941.792520] RDX: 0000000000400200 RSI: 0000000000000000 RDI: 0000000000000007
[1286941.792527] RBP: ffff880003680e88 R08: 0000000000000000 R09: 0000000000000000
[1286941.792535] R10: ffff880003688028 R11: 0000000000012eb0 R12: 0000000000000000
[1286941.792541] R13: 0000000000000000 R14: 0000000000000007 R15: 0000000000000100
[1286941.792550] FS: 00007f639b1fa700(0000) GS:ffff88000367d000(0000) knlGS:0000000000000000
[1286941.792557] CS: e033 DS: 002b ES: 002b CR0: 000000008005003b
[1286941.792562] CR2: 0000000000000000 CR3: 0000000001001000 CR4: 0000000000002660
[1286941.792568] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1286941.792574] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[1286941.792580] Process swapper (pid: 0, threadinfo ffff88007dfc8000, task ffff88007dfb2dc0)
[1286941.792585] Stack:
[1286941.792588] ffff880003680ed8 ffffffff810c4060 ffff880003680ec8 ffffffff8108e553
[1286941.792597] <0> ffff88000368d400 ffffffff817cb780 0000000000000007 0000000000000200
[1286941.792607] <0> 0000000000000001 0000000000000100 ffff880003680ef8 ffffffff810c6392
[1286941.792618] Call Trace:
[1286941.792622] <IRQ>
[1286941.792628] [<ffffffff810c4060>] handle_IRQ_event+0x60/0x170
[1286941.792637] [<ffffffff8108e553>] ? ktime_get+0x63/0xe0
[1286941.792643] [<ffffffff810c6392>] handle_percpu_irq+0x42/0x80
[1286941.792650] [<ffffffff81014d12>] handle_irq+0x22/0x30
[1286941.792658] [<ffffffff8131f189>] xen_evtchn_do_upcall+0x199/0x1c0
[1286941.792665] [<ffffffff8101333e>] xen_do_hypervisor_callback+0x1e/0x30
[1286941.792669] <EOI>
[1286941.792675] [<ffffffff810093aa>] ? hypercall_page+0x3aa/0x1010
[1286941.792681] [<ffffffff810093aa>] ? hypercall_page+0x3aa/0x1010
[1286941.792688] [<ffffffff8100ebd0>] ? xen_safe_halt+0x10/0x20
[1286941.792693] [<ffffffff8100c285>] ? xen_idle+0x35/0x50
[1286941.792699] [<ffffffff81010e63>] ? cpu_idle+0xb3/0x110
[1286941.792704] [<ffffffff8100f2a9>] ? xen_irq_enable_direct_end+0x0/0x7
[1286941.792711] [<ffffffff8154d2e5>] ? cpu_bringup_and_idle+0x13/0x15
[1286941.792716] Code: 89 e5 c9 0f 95 c0 c3 55 b8 01 00 00 00 86 07 84 c0 48 89 e5 0f 94 c0 c9 0f b6 c0 c3 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 <0f> 0b eb fe 0f 1f 84 00 00 00 00 00 55 ba ff ff ff ff 48 89 e5
[1286941.792788] RIP [<ffffffff8100fca4>] dummy_handler+0x4/0x10
[1286941.792794] RSP <ffff880003680e88>
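
Both oopses above stop at the same `kernel BUG at` location and the same `RIP` symbol, even though they were raised in different processes. A rough sketch, assuming the oops text has been saved from the console or dmesg, of how those two fields could be extracted to compare traces quickly (`summarize_oops` is a hypothetical helper, not an existing tool):

```python
import re

# "kernel BUG at <source path>:<line>!"
BUG_AT = re.compile(r"kernel BUG at (?P<path>\S+):(?P<line>\d+)!")
# Final "RIP [<address>] symbol+0xoff/0xlen" line of the oops.
RIP_SYMBOL = re.compile(
    r"RIP\s+\[<[0-9a-f]+>\]\s+(?P<symbol>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_oops(log_text):
    """Return the BUG source location and faulting symbol, or None if absent."""
    bug = BUG_AT.search(log_text)
    rip = RIP_SYMBOL.search(log_text)
    return {
        "source_file": bug.group("path") if bug else None,
        "source_line": int(bug.group("line")) if bug else None,
        "rip_symbol": rip.group("symbol") if rip else None,
    }


sample = """\
[1286941.792454] kernel BUG at /build/buildd/linux-2.6.32/arch/x86/xen/spinlock.c:343!
[1286941.792788] RIP [<ffffffff8100fca4>] dummy_handler+0x4/0x10
"""
print(summarize_oops(sample))
```

Two traces with the same source location and symbol are likely the same failure, though that is a heuristic rather than proof.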

Jeremy Foshee (jeremyfoshee) wrote :

Hi Luiz,

Please be sure to confirm this issue exists with the latest development release of Ubuntu. ISO CD images are available from http://cdimage.ubuntu.com/daily/current/ . If the issue remains, please run the following command from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux 681083

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

    [This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: needs-kernel-logs
tags: added: needs-upstream-testing
tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
Luiz Ozaki (luiz-ozaki) wrote :

Testing the PPA kernels: those before http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.32.16.5-lucid/ work well; after this release (2.6.32.16.5) the problem appears.

tags: removed: needs-upstream-testing
Luiz Ozaki (luiz-ozaki) wrote :

Testing the latest version (natty) I get:
xenfs: not registering filesystem on non-xen platform

So I can't do a migration between hosts to test this.

Luiz Ozaki (luiz-ozaki) wrote :

Problem still happening in the XenServer 5.6 FP1 Beta

tags: removed: kj-triage needs-kernel-logs
Muriel (tudamp) wrote :

Hi all,
this is my experience with this bug.
Ubuntu Server 10.04, x86_64, kernels from the standard repo:
2.6.32.27 doesn't work
2.6.32.26 doesn't work
2.6.32.25 works unreliably (sometimes yes, often not)
2.6.32.24 and earlier work.

The problem is also present in mainline kernels:
2.6.32-0206321505 works
2.6.32-0206321606 works unreliably
2.6.32-0206321709 works unreliably
2.6.32-0206322210 doesn't boot
2.6.32-0206322310 doesn't boot
2.6.32-0206322411 doesn't work
2.6.32-0206322511 doesn't work
2.6.32-0206322611 doesn't work
2.6.32-0206322712 doesn't work
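
The mainline results above narrow the regression to a window between two builds. A small sketch of that reasoning, using only the results listed in this comment; the status labels (`works`, `unstable`, `no-boot`, `broken`) are shorthand invented for this example:

```python
# Mainline build results as reported in this comment, oldest first.
# The status labels are this sketch's own shorthand.
MAINLINE_RESULTS = [
    ("2.6.32-0206321505", "works"),
    ("2.6.32-0206321606", "unstable"),
    ("2.6.32-0206321709", "unstable"),
    ("2.6.32-0206322210", "no-boot"),
    ("2.6.32-0206322310", "no-boot"),
    ("2.6.32-0206322411", "broken"),
    ("2.6.32-0206322511", "broken"),
    ("2.6.32-0206322611", "broken"),
    ("2.6.32-0206322712", "broken"),
]


def regression_window(results):
    """Return (last_good, first_bad) for results ordered oldest to newest."""
    last_good = None
    for version, status in results:
        if status == "works":
            last_good = version
        else:
            # Anything other than a clean pass counts as the first bad build.
            return last_good, version
    return last_good, None


print(regression_window(MAINLINE_RESULTS))
```

Treating "unstable" as bad is a deliberate choice here: an intermittent failure still means the regression is already present in that build.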

Do you need more logs to confirm the bug?

thanks for your work

Muriel (tudamp) wrote :
Andy Whitcroft (apw) on 2011-01-19
Changed in linux (Ubuntu):
status: Incomplete → Triaged
tags: added: regression-proposed
Stefan Bader (smb) wrote :

For Natty the fix mentioned is included with 2.6.37 final:

commit 6903591f314b8947d0e362bda7715e90eb9df75e
Author: Ian Campbell <email address hidden>
Date: Mon Nov 1 16:30:09 2010 +0000

    xen: events: do not unmask event channels on resume

Now it needs to be backported to our ec2 topic branch for Lucid.

Changed in linux (Ubuntu Lucid):
assignee: nobody → Stefan Bader (stefan-bader-canonical)
importance: Undecided → Medium
status: New → Triaged
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Triaged → Fix Released
Luiz Ozaki (luiz-ozaki) wrote :

Hi Muriel,

Which processor are you using?

It seems that the kernel doesn't crash any more, but it still freezes, at least for me.

Either I can't get console output, or there is an I/O problem that triggers the hung-process warning.

The weird thing is that migrating hostA -> hostB gives a console freeze.
HostB -> hostA gives an I/O freeze: the console still works, but after any I/O request it freezes and turns into a process hang.
This happens every time.

I wonder why HostB -> HostA and HostA -> HostB don't give the same problem...

I'm going to look into the Xen mailing list as well; maybe I'm hitting a different error now.

Stefan Bader (smb) wrote :

SRU Justification:

Impact: With the current ec2 kernels the kernel oops described in comment #3 is experienced as a result of enabling interrupts on the pv spinlock event channel.

Fix: The following patch is taken from upstream and is included in 2.6.37. It has been reported to successfully prevent the oops.

Testcase: Migration of a guest (using suspend)

Changed in linux (Ubuntu Lucid):
status: Triaged → In Progress
Muriel (tudamp) wrote :

Hi Luiz,
I have two different pools with two different processors:
1) AMD
when I migrate A -> B and B is the master of the pool, the console (and the VM) freezes;
when I migrate B -> A and B is the master, all works fine (with the patch);
when I migrate B -> C and all are members, all works fine (with the patch).

2) Intel
all works fine with the patch

Ian Campbell says it could be a problem on my master. Have you had the same problem with the same processor?
------------

Stefan,
the patch I proposed is Ian's patch adapted for the Ubuntu kernel 2.6.32-27.49. The only differences are the line numbers at which the change is made.

Stefan Bader (smb) on 2011-01-21
Changed in linux (Ubuntu Lucid):
status: In Progress → Fix Committed
Steve Conklin (sconklin) on 2011-02-04
tags: added: verification-needed-lucid
Andy Whitcroft (apw) on 2011-02-11
tags: added: regression-update
removed: regression-proposed

Is there any news on this?

Stefan Bader (smb) wrote :

Apparently we got hit by confusion. Usually Xen in Lucid means the ec2 topic branch. But in this case this is the generic kernel. Actually the patch has no effect when applied to the ec2 topic branch as the file does exist but is completely ignored in that build.
So at the moment it was only applied to the topic branch without effect and we need to pull it out to the master branch.

Muriel (tudamp) wrote :

No news?

I've done a test with Debian 6.0 and it works fine (no crash/lost network connection) on vmotion with XenServer 5.6 FP1.
Debian included this patch in their last kernel:

  [ Ian Campbell ]
  * xen: blkback: fix potential leak of kernel thread. (CVE-2010-3699)

Their previous kernel, linux-2.6 (2.6.32-30), didn't work.
Their current kernel, linux-2.6 (2.6.32-31), which includes this patch, works.

I've tested this Debian kernel with Ubuntu 10.04 and it works fine (no crash/lost network connection) on vmotion.

Can this patch be applied to the Ubuntu 10.04 kernel?

description: updated

I have the same problem. Anyone got the solution?

There is no solution yet I think, maybe in the next kernel release.

However, I have tried this kernel, which does allow XenMotion (as posted above):
http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.33.5-lucid/linux-image-2.6.33-02063305-generic_2.6.33-02063305_amd64.deb
It works well although it probably has a few issues with certain applications.

Davim (davim) wrote :

I have no problem moving VMs around but if I suspend one Ubuntu VM it freezes on resume...

Accepted linux into lucid-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

jnoer (jnoer) wrote :

Installed linux-image-2.6.32-32-server and migrated slave-master-slave, and I didn't see any problems. It seems to work fine now.

Luiz Ozaki (luiz-ozaki) wrote :

2.6.32-32-server #62-Ubuntu SMP Wed Apr 20 22:07:43 UTC 2011 x86_64 GNU/Linux

Working fine !!! Nice.

I'm going to run more tests, but simple migrations seem to work fine.

Tks

Martin Pitt (pitti) on 2011-04-26
tags: added: verification-done-lucid
removed: verification-needed-lucid

I installed 2.6.32-32-server on our Lucid guests running under XenServer 5.6 FP1, and migration now works only intermittently. So there is some progress, because before that it wasn't working at all. When it doesn't work, the same xenwatch backtrace is shown. SLES 11 SP1 guests migrate flawlessly...

Davim (davim) wrote :

Same problem here: most of the time it works, but sometimes it crashes...
I would say it's crashing in about one out of 5 migrations...

Luiz Ozaki (luiz-ozaki) wrote :

Yeah, it seems the problem continues, but sometimes it works.

At first it seemed to crash only when I migrated to the master.

Then it seemed more random. But, for example, if I migrate to a destination and that migration fails, all migrations to that same destination will fail.

In some of my pools I have one "bugged" host: I can migrate from it, and to other hosts, but if I try to migrate to that "bugged" host it will fail.

Luiz Ozaki (luiz-ozaki) wrote :

I don't know if it's the same bug or another one; here is the stack trace from the occasional failures:

[6368472.738379] Call Trace:
[6368472.738389] [<ffffffff81041abe>] ? pick_next_task_fair+0xca/0xd6
[6368472.738395] [<ffffffff812fae40>] ? thread_return+0x79/0xe0
[6368472.738401] [<ffffffff8100e160>] ? xen_vcpuop_set_next_event+0x0/0x60
[6368472.738405] [<ffffffff812fb1fd>] ? schedule_timeout+0x2e/0xdd
[6368472.738408] [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
[6368472.738412] [<ffffffff812fc1da>] ? _spin_unlock_irqrestore+0xd/0xe
[6368472.738418] [<ffffffff810ad168>] ? cpupri_set+0x10c/0x135
[6368472.738425] [<ffffffff8121cc16>] ? serial8250_resume+0x0/0x3a
[6368472.738428] [<ffffffff812fb0b4>] ? wait_for_common+0xde/0x15b
[6368472.738433] [<ffffffff8104a42f>] ? default_wake_function+0x0/0x9
[6368472.738439] [<ffffffff81061b0c>] ? flush_cpu_workqueue+0x5e/0x75
[6368472.738442] [<ffffffff81064c2e>] ? kthread_stop+0x5d/0xa2
[6368472.738446] [<ffffffff81061b67>] ? cleanup_workqueue_thread+0x44/0x51
[6368472.738449] [<ffffffff81061c0e>] ? destroy_workqueue+0x76/0xad
[6368472.738454] [<ffffffff8108ad7f>] ? stop_machine_destroy+0x2e/0x47
[6368472.738458] [<ffffffff811efd2d>] ? shutdown_handler+0x230/0x25c
[6368472.738462] [<ffffffff812fb776>] ? mutex_lock+0xd/0x31
[6368472.738465] [<ffffffff811f1038>] ? xenwatch_thread+0x117/0x14a
[6368472.738469] [<ffffffff81064e96>] ? autoremove_wake_function+0x0/0x2e
[6368472.738472] [<ffffffff811f0f21>] ? xenwatch_thread+0x0/0x14a
[6368472.738474] [<ffffffff81064bc9>] ? kthread+0x79/0x81
[6368472.738479] [<ffffffff81011baa>] ? child_rip+0xa/0x20
[6368472.738482] [<ffffffff81010d61>] ? int_ret_from_sys_call+0x7/0x1b
[6368472.738485] [<ffffffff8101151d>] ? retint_restore_args+0x5/0x6
[6368472.738489] [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
[6368472.738492] [<ffffffff81011ba0>] ? child_rip+0x0/0x20

But after some time it gets back to normal.

Luiz Ozaki (luiz-ozaki) wrote :

Ummm... Disregard that stack trace... I was using the Debian 6... =/

Sorry.

Davim (davim) wrote :

What do you mean by "bugged" host?
What is the problem with that host?
Would reinstalling the host solve the problem?
I have a pool with 10 hosts and all of my linux VMs (about 20VMs) are running the same ubuntu kernel (2.6.32-31-generic-pae) and some of them always crash on migration, some never crashed and some only crashed a few times...

I think that the VMs with the highest memory usage are the ones that crash most often.

Davim (davim) wrote :

I've just tested the 2.6.35-25-generic-pae kernel on one of my VMs that always crashes on XenMotion, and the problem is the same :(

Does anyone have a solution for this???

Stefan Bader (smb) wrote :

Seems the patch we added for 2.6.32 (Lucid) is still missing for 2.6.35 (Maverick). I will get it SRUed there as well.

Changed in linux (Ubuntu Maverick):
assignee: nobody → Stefan Bader (stefan-bader-canonical)
importance: Undecided → Medium
status: New → In Progress
Davim (davim) wrote :

But the 2.6.32 is not working either...
Before testing the 2.6.35 I was on the 2.6.32-31-generic-pae.

Davim (davim) wrote :

Is there any other information I can add to help determine the cause/solution of this problem?

Stefan Bader (smb) wrote :

It seems that there are probably two problems involved. On .32 some people were seeing improvement, but others still see a crash that seemed to look a bit different. So at least having that fix in .35 as well sounds reasonable. And maybe no other problem exists there, and the remaining problem is only in .32.

Koszta, Tamas (tamas-koszta) wrote :

There is an article in Citrix knowledge center about this problem. Their workaround is to install natty's backported kernel, which is based on 2.6.38.
I've tested it on 5.6 FP1 with a 64-bit Lucid guest, and it works fine: every migration completed successfully.
However it would be nice if the official lucid kernel could do the same.

Stefan Bader (smb) wrote :

Davim, I placed some 2.6.35 kernels that include the proposed patch at http://people.canonical.com/~smb/lp681083/. Could you try one of those and let me know the result? Then we can decide whether this is an issue with some setups across all releases or a secondary issue only with Lucid. Thanks.

Davim (davim) wrote :

Thanks Stefan. I haven't been able to test those kernels yet; I intend to test them by the end of this week...

I've noticed that the problem does not only occur on xenmotion but also on suspend/resume.

I will get back to you as soon as I have the chance to test those kernels.

Davim (davim) wrote :

Success!!!

I've just tested the kernels provided by Stefan and they solve the problem :)

The test I made was:

 * Installed a new Ubuntu 10.04.2 VM (net install) with two vCPUs
 * Tried to migrate the VM to another Xenserver and confirmed it crashed
 * Downloaded and installed the generic-pae kernels provided by Stefan and rebooted the VM into the new kernel.
 * Migrated the VM around several hosts, including the master, and the VM never crashed.
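
The last step of the test above (migrating the VM around several hosts) could be scripted. The following is only a sketch under assumptions: the thread does not say how the migrations were triggered, so the use of the `xe vm-migrate` command with `vm=`, `host=` and `live=true` is an assumption, and the VM and host names are hypothetical. It would have to run where the `xe` CLI is available, for example on a XenServer host.

```python
import subprocess


def migrate_command(vm_name, host_name):
    """Build an xe command line for one live migration (assumed invocation)."""
    return ["xe", "vm-migrate", f"vm={vm_name}", f"host={host_name}", "live=true"]


def migration_round(vm_name, hosts, run=False):
    """Build, and optionally execute, one migration per destination host."""
    commands = [migrate_command(vm_name, host) for host in hosts]
    if run:
        for command in commands:
            # check=True stops the round at the first failed migration.
            subprocess.run(command, check=True)
    return commands


# Hypothetical names; by default this only prints the commands (dry run).
for command in migration_round("lucid-test", ["xs-host-a", "xs-host-b", "xs-master"]):
    print(" ".join(command))
```

Keeping the dry run as the default makes it easy to review the planned sequence, including a hop to the pool master, before anything is actually migrated.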

Now the question is, when will we see these kernels as official for Ubuntu 10.04.2 LTS?

Thanks Stefan.

Works for me.

I've migrated a test VM from and to the pool master a couple of times.
Everything still seems to work like it should.

Hoping to see this in the official kernel soon.

Stefan Bader (smb) wrote :

Well, the kernels supplied are 2.6.35 ones (10.10 Maverick). And the patch is queued for the next update (after a currently pending one). The same patch is in the currently pending 2.6.32 (10.04 Lucid) kernel too. Just the feeling there from some feedback is that there might still be some (other) issue. Both updates are not yet in the normal place (for Lucid you would need to enable proposed to get it) but if there is still a problem with the proposed version of Lucid this will continue to be one because that then needs something more. But I would wait for this patch to get out on both, then check the fallout and probably work on the follow up in a new bug report.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 2.6.32-32.62

---------------
linux (2.6.32-32.62) lucid-proposed; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #767370

  [ Stefan Bader ]

  * (config) Disable CONFIG_NET_NS
    - LP: #720095

  [ Upstream Kernel Changes ]

  * Revert "drm/radeon/kms: Fix retrying ttm_bo_init() after it failed
    once."
    - LP: #736234
  * Revert "drm/radeon: fall back to GTT if bo creation/validation in VRAM
    fails."
    - LP: #736234
  * x86: pvclock: Move scale_delta into common header
  * KVM: x86: Fix a possible backwards warp of kvmclock
  * KVM: x86: Fix kvmclock bug
  * cpuset: add a missing unlock in cpuset_write_resmask()
    - LP: #736234
  * keyboard: integer underflow bug
    - LP: #736234
  * RxRPC: Fix v1 keys
    - LP: #736234
  * ixgbe: fix for 82599 erratum on Header Splitting
    - LP: #736234
  * mm: fix possible cause of a page_mapped BUG
    - LP: #736234
  * powerpc/kdump: CPUs assume the context of the oopsing CPU
    - LP: #736234
  * powerpc/kdump: Use chip->shutdown to disable IRQs
    - LP: #736234
  * powerpc: Use more accurate limit for first segment memory allocations
    - LP: #736234
  * powerpc/pseries: Add hcall to read 4 ptes at a time in real mode
    - LP: #736234
  * powerpc/kexec: Speedup kexec hash PTE tear down
    - LP: #736234
  * powerpc/crashdump: Do not fail on NULL pointer dereferencing
    - LP: #736234
  * powerpc/kexec: Fix orphaned offline CPUs across kexec
    - LP: #736234
  * netfilter: nf_log: avoid oops in (un)bind with invalid nfproto values
    - LP: #736234
  * nfsd: wrong index used in inner loop
    - LP: #736234
  * r8169: use RxFIFO overflow workaround for 8168c chipset.
    - LP: #736234
  * Staging: comedi: jr3_pci: Don't ioremap too much space. Check result.
    - LP: #736234
  * net: don't allow CAP_NET_ADMIN to load non-netdev kernel modules,
    CVE-2011-1019
    - LP: #736234
    - CVE-2011-1019
  * ip6ip6: autoload ip6 tunnel
    - LP: #736234
  * Linux 2.6.32.33
    - LP: #736234
  * drm/radeon: fall back to GTT if bo creation/validation in VRAM fails.
    - LP: #652934, #736234
  * drm/radeon/kms: Fix retrying ttm_bo_init() after it failed once.
    - LP: #652934, #736234
  * drm: fix unsigned vs signed comparison issue in modeset ctl ioctl,
    CVE-2011-1013
    - LP: #736234
    - CVE-2011-1013
  * Linux 2.6.32.33+drm33.15
    - LP: #736234
  * econet: Fix crash in aun_incoming(). CVE-2010-4342
    - LP: #736394
    - CVE-2010-4342
  * igb: only use vlan_gro_receive if vlans are registered, CVE-2010-4263
    - LP: #737024
    - CVE-2010-4263
  * irda: prevent integer underflow in IRLMP_ENUMDEVICES, CVE-2010-4529
    - LP: #737823
    - CVE-2010-4529
  * hwmon/f71882fg: Set platform drvdata to NULL later
    - LP: #742056
  * mtd: add "platform:" prefix for platform modalias
    - LP: #742056
  * libata: no special completion processing for EH commands
    - LP: #742056
  * MIPS: MTX-1: Make au1000_eth probe all PHY addresses
    - LP: #742056
  * x86/mm: Handle mm_fault_error() in kernel space
    - LP: #742056
  * ftrace: Fix memory leak with function graph and cpu hotplug
    - LP: #742056
  * x86: Fix panic when ...

Changed in linux (Ubuntu Lucid):
status: Fix Committed → Fix Released
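
Since the fix is reported as released in linux 2.6.32-32.62, a guest can be checked for whether its running Lucid kernel is at least at that ABI level. A minimal sketch, assuming release strings of the form shown in this thread (for example `2.6.32-32-server`); it compares only the ABI number, ignores the upload number (`.62`), and says nothing about whether migration will actually succeed:

```python
import re

# ABI level taken from "This bug was fixed in the package linux - 2.6.32-32.62".
FIXED_LUCID = (2, 6, 32, 32)


def kernel_abi(release):
    """Parse e.g. '2.6.32-32-server' into (2, 6, 32, 32)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def includes_lucid_fix(release):
    """True if a Lucid 2.6.32 kernel release is at or beyond the fixed ABI."""
    version = kernel_abi(release)
    return version[:3] == FIXED_LUCID[:3] and version[3] >= FIXED_LUCID[3]


for release in ("2.6.32-26-generic", "2.6.32-31-generic-pae", "2.6.32-32-server"):
    print(release, includes_lucid_fix(release))
```

On a running guest the release string could be obtained with `platform.release()` from the Python standard library. Kernels outside the 2.6.32 series, such as the mainline or Maverick builds mentioned in this thread, are deliberately reported as `False` because this check only covers the Lucid fix.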
Davim (davim) wrote :

The kernel released with this fix has a nastier bug that causes VMs to freeze on boot if they're configured with more than 512M of RAM; see this bug report:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/790747

Herton R. Krzesinski (herton) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-maverick' to 'verification-done-maverick'.

If verification is not done by one week from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-maverick
Steve Conklin (sconklin) wrote :

Note that comment #43 only applies to Lucid and not to maverick. So we are still awaiting verification that the problem is resolved in Maverick by the kernel in -proposed.

Thanks

Steve Conklin (sconklin) wrote :

The verification period for this kernel will close soon, and we have not received verification that the problem is resolved in Maverick. Please test and post results to this bug or the fix will be reverted from the release.

Stefan Bader (smb) wrote :

It is probably stretching things a bit, but Davim sort of does a verification for Maverick kernels in comment #39 as he installed the kernels that I did provide in comment #37 into a Lucid installation (and I only provided 2.6.35 kernels).

The bug was fixed for me by applying the how-to on Citrix's website, as mentioned in #36 of this thread.
The Lucid kernel 2.6.32-33.66, released a few days ago, didn't help, but 2.6.38 solved it.

Luiz Ozaki (luiz-ozaki) wrote :

Okay, here are my tests:

2.6.35-28-server #50-Ubuntu SMP Fri Mar 18 18:59:25 UTC 2011 x86_64 GNU/Linux
[2819125.429932] ------------[ cut here ]------------
[2819125.429943] kernel BUG at /build/buildd/linux-2.6.35/arch/x86/xen/spinlock.c:344!
[2819125.429950] invalid opcode: 0000 [#1] SMP
[2819125.429956] last sysfs file: /sys/kernel/uevent_seqnum
[2819125.429961] CPU 3
[2819125.429964] Modules linked in: xenfs lp parport xen_netfront xen_blkfront
[2819125.429977]
[2819125.429981] Pid: 12, comm: migration/3 Not tainted 2.6.35-28-server #50-Ubuntu /
[2819125.429987] RIP: e030:[<ffffffff81007d04>] [<ffffffff81007d04>] dummy_handler+0x4/0x10
[2819125.430000] RSP: e02b:ffff880003f60ea8 EFLAGS: 00010046
[2819125.430005] RAX: ffffffffff57b000 RBX: ffff88001fc1a8a0 RCX: 0000000000000000
[2819125.430011] RDX: 0000000000400200 RSI: 0000000000000000 RDI: 0000000000000013
[2819125.430016] RBP: ffff880003f60ea8 R08: 0000000000000600 R09: 0000000000000000
[2819125.430022] R10: ffff880003f68028 R11: 0000000000012ed0 R12: 0000000000000000
[2819125.430028] R13: 0000000000000000 R14: 0000000000000013 R15: 0000000000000100
[2819125.430038] FS: 00007f2137153700(0000) GS:ffff880003f5d000(0000) knlGS:0000000000000000
[2819125.430045] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[2819125.430050] CR2: 0000000000000000 CR3: 0000000003acf000 CR4: 0000000000002660
[2819125.430057] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[2819125.430063] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[2819125.430069] Process migration/3 (pid: 12, threadinfo ffff88001fdd2000, task ffff88001fd9dbc0)
[2819125.430076] Stack:
[2819125.430079] ffff880003f60ef8 ffffffff810ca310 0000000000000000 000000000072f59a
[2819125.430088] <0> ffff880003f60ef8 ffff88001fc05900 0000000000000013 0000000000000600
[2819125.430099] <0> 0000000000000001 0000000000000100 ffff880003f60f18 ffffffff810cca52
[2819125.430111] Call Trace:
[2819125.430115] <IRQ>
[2819125.430123] [<ffffffff810ca310>] handle_IRQ_event+0x50/0x160
[2819125.430130] [<ffffffff810cca52>] handle_percpu_irq+0x42/0x80
[2819125.430139] [<ffffffff81348706>] xen_evtchn_do_upcall+0x1d6/0x200
[2819125.430147] [<ffffffff810b2400>] ? stop_machine_cpu_stop+0x0/0xe0
[2819125.430154] [<ffffffff8100b02e>] xen_do_hypervisor_callback+0x1e/0x30
[2819125.430159] <EOI>
[2819125.430164] [<ffffffff810b2400>] ? stop_machine_cpu_stop+0x0/0xe0
[2819125.430172] [<ffffffff8100122a>] ? hypercall_page+0x22a/0x1010
[2819125.430179] [<ffffffff8100122a>] ? hypercall_page+0x22a/0x1010
[2819125.430186] [<ffffffff81006b9d>] ? xen_force_evtchn_callback+0xd/0x10
[2819125.430192] [<ffffffff81007332>] ? check_events+0x12/0x20
[2819125.430199] [<ffffffff810072d9>] ? xen_irq_enable_direct_end+0x0/0x7
[2819125.430206] [<ffffffff810b2301>] ? cpu_stopper_thread+0xd1/0x1d0
[2819125.430214] [<ffffffff8159f6f1>] ? schedule+0x3e1/0x830
[2819125.430221] [<ffffffff815a1aee>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[2819125.430227] [<ffffffff810b2230>] ? cpu_stopper_thread+0x0/0x1d0
[2819125.430235] [<ffffffff8107f616>] ? kthread+0x96/0xa0
[2819125.430240] [<ffffffff8100aee4>] ? kernel_thread_helper+0x4/0x10
[2...

Steve Conklin (sconklin) on 2011-06-17
tags: added: verification-done-maverick
removed: verification-needed-maverick
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 2.6.35-30.54

---------------
linux (2.6.35-30.54) maverick-proposed; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #794114

  [ Upstream Kernel Changes ]

  * Revert "xhci: Fix full speed bInterval encoding."
  * Revert "USB: xhci - also free streams when resetting devices"
  * Revert "USB: xhci - fix math in xhci_get_endpoint_interval()"
  * Revert "USB: xhci - fix unsafe macro definitions"

linux (2.6.35-30.53) maverick-proposed; urgency=low

  [ Upstream Kernel Changes ]

  * xhci: Fix full speed bInterval encoding.
    - LP: #792959

linux (2.6.35-30.52) maverick-proposed; urgency=low

  [ Herton R. Krzesinski ]

  * Release Tracking Bug
    - LP: #790653

  [ Stefan Bader ]

  * Include nls_iso8859-1 for virtual images
    - LP: #732046

  [ Thomas Schlichter ]

  * SAUCE: vesafb: mtrr module parameter is uint, not bool
    - LP: #778043

  [ Tim Gardner ]

  * [Config] Add cachefiles.ko to virtual flavour
    - LP: #770430

  [ Upstream Kernel Changes ]

  * Revert "intel_idle: PCI quirk to prevent Lenovo Ideapad s10-3 boot
    hang"
    - LP: #772560
  * Revert "TPM: Long default timeout fix"
    - LP: #772560
  * Revert "tpm_tis: Use timeouts returned from TPM"
    - LP: #772560
  * Revert "xen: set max_pfn_mapped to the last pfn mapped"
  * CAN: Use inode instead of kernel address for /proc file, CVE-2010-4565
    - LP: #765007
    - CVE-2010-4565
  * xfs: prevent leaking uninitialized stack memory in FSGEOMETRY_V1,
    CVE-2011-0711
    - LP: #767740
    - CVE-2011-0711
  * Treat writes as new when holes span across page boundaries,
    CVE-2011-0463
    - LP: #770483
    - CVE-2011-0463
  * fs/partitions/ldm.c: fix oops caused by corrupted partition table,
    CVE-2011-1017
    - LP: #771382
    - CVE-2011-1017
  * qla2xxx: Make the FC port capability mutual exclusive.
    - LP: #772560
  * staging: usbip: bugfixes related to kthread conversion
    - LP: #772560
  * staging: usbip: bugfix add number of packets for isochronous frames
    - LP: #772560
  * staging: usbip: bugfix for isochronous packets and optimization
    - LP: #772560
  * staging: hv: Fix GARP not sent after Quick Migration
    - LP: #772560
  * staging: hv: use sync_bitops when interacting with the hypervisor
    - LP: #772560
  * irda: validate peer name and attribute lengths
    - LP: #772560
  * irda: prevent heap corruption on invalid nickname
    - LP: #772560
  * nilfs2: fix data loss in mmap page write for hole blocks
    - LP: #772560
  * ASoC: Explicitly say registerless widgets have no register
    - LP: #772560
  * ALSA: ens1371: fix Creative Ectiva support
    - LP: #772560
  * ROSE: prevent heap corruption with bad facilities
    - LP: #772560
  * Btrfs: Fix uninitialized root flags for subvolumes
    - LP: #772560
  * x86, mtrr, pat: Fix one cpu getting out of sync during resume
    - LP: #772560
  * UBIFS: do not read flash unnecessarily
    - LP: #772560
  * UBIFS: fix oops on error path in read_pnode
    - LP: #772560
  * UBIFS: fix debugging failure in dbg_check_space_info
    - LP: #772560
  * quota: Don't write quota info in dquot_commit()
    - LP: #772560
  * mm: avoid wrapping vm_...

Changed in linux (Ubuntu Maverick):
status: In Progress → Fix Released
Davim (davim) wrote :

This problem returned in 2.6.32-35....

Stefan Bader (smb) wrote :

Likely the regression reported in bug #881542. Trying to get it fixed in the next update.

David Ehle (ehle-p) wrote :

I'm also seeing this with the 2.6.32-35 kernel. When I try to live migrate a system with that kernel, it hangs in some fashion. The console shows a black screen. When I try to ssh in, it responds but cannot finish the auth process.
2.6.32-34 had a kernel panic when I migrated.

Installing linux-image-server-lts-backport-maverick package to get 2.6.35.30.38 seems to let the VMs migrate without crashing.

Xen environment is XenServer 5.6SP2

jagudo (jagudo) wrote :

Same problem for Lucid (10.04) in today's update: 2.6.32-36-generic

Davim (davim) wrote :

I've just tested 2.6.32-37-generic-pae on Lucid and the problem is solved.

Please make sure this problem stays solved. It has been a pain having this problem come back on almost every update...

Luiz Ozaki (luiz-ozaki) wrote :

Hmmm... Now it doesn't crash. But I lose network access and I get a stack trace

[7744382.224271] eth0: no IPv6 routers present
[7744715.716789] PM: suspend of devices complete after 0.119 msecs
[7744715.716794] suspending xenstore...
[7744715.716825] PM: late suspend of devices complete after 0.026 msecs
[7746693.008996] trying to map vcpu_info 0 at ffff8800034ab020, mfn 18aa756, offset 32
[7746693.009004] cpu 0 using vcpu_info at ffff8800034ab020
[7746693.009359] PM: early resume of devices complete after 0.034 msecs
[7746693.016403] PM: resume of devices complete after 5.629 msecs
[7746827.319483] INFO: task xenwatch:11 blocked for more than 120 seconds.
[7746827.319498] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[7746827.319508] xenwatch D ffff8800034b5f00 0 11 2 0x00000000
[7746827.319516] ffff88001fd9bbf0 0000000000000246 0000000000015f00 0000000000015f00
[7746827.319524] ffff88001fda03d0 ffff88001fd9bfd8 0000000000015f00 ffff88001fda0000
[7746827.319531] 0000000000015f00 ffff88001fd9bfd8 0000000000015f00 ffff88001fda03d0
[7746827.319539] Call Trace:
[7746827.319551] [<ffffffff81559aed>] schedule_timeout+0x22d/0x300
[7746827.319559] [<ffffffff81062549>] ? __enqueue_rt_entity+0x129/0x240
[7746827.319565] [<ffffffff81558c66>] wait_for_common+0xd6/0x180
[7746827.319577] [<ffffffff8105deb0>] ? default_wake_function+0x0/0x20
[7746827.319583] [<ffffffff81558dcd>] wait_for_completion+0x1d/0x20
[7746827.319589] [<ffffffff81085e1b>] kthread_stop+0x4b/0xd0
[7746827.319595] [<ffffffff8108914f>] ? hrtimer_force_reprogram+0x7f/0x90
[7746827.319600] [<ffffffff81081aae>] cleanup_workqueue_thread+0x3e/0x80
[7746827.319605] [<ffffffff81081c93>] destroy_workqueue+0x93/0xe0
[7746827.319612] [<ffffffff810b7d44>] stop_machine_destroy+0x34/0x50
[7746827.319619] [<ffffffff8132431f>] do_suspend+0xaf/0x120
[7746827.319623] [<ffffffff81324499>] shutdown_handler+0x109/0x160
[7746827.319628] [<ffffffff81325812>] xenwatch_thread+0xc2/0x190
[7746827.319634] [<ffffffff81086140>] ? autoremove_wake_function+0x0/0x40
[7746827.319639] [<ffffffff81325750>] ? xenwatch_thread+0x0/0x190
[7746827.319644] [<ffffffff81085dc6>] kthread+0x96/0xa0
[7746827.319650] [<ffffffff810141aa>] child_rip+0xa/0x20
[7746827.319658] [<ffffffff81013391>] ? int_ret_from_sys_call+0x7/0x1b
[7746827.319663] [<ffffffff81013b1d>] ? retint_restore_args+0x5/0x6
[7746827.319668] [<ffffffff810141a0>] ? child_rip+0x0/0x20

Linux x 2.6.32-37-server #81-Ubuntu SMP Fri Dec 2 20:49:12 UTC 2011 x86_64 GNU/Linux @ XenServer 5.6 SP2

Gonna test on XS 6.0
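
For comparing traces like the one above across kernels, a small helper can pull out the blocked task and the frames the kernel considers reliable (lines marked with "?" are addresses found on the stack that may not be part of the real call chain). This is only a sketch: the function name summarize_hung_task and both regular expressions are assumptions modelled on the log lines quoted in this report, not part of any kernel tool.

```python
import re

# Matches e.g. "INFO: task xenwatch:11 blocked for more than 120 seconds."
HUNG_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")
# Matches e.g. "[<ffffffff81559aed>] schedule_timeout+0x22d/0x300",
# with an optional "? " marker in front of the symbol name.
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_hung_task(log):
    """Return ((task, pid, seconds), [reliable frame names]) for a hung-task report.

    Frames seen before the "INFO: task ..." line are ignored, as are
    frames flagged with "?".
    """
    task = None
    frames = []
    for line in log.splitlines():
        hung = HUNG_RE.search(line)
        if hung:
            task = (hung.group(1), int(hung.group(2)), int(hung.group(3)))
            continue
        frame = FRAME_RE.search(line)
        if frame and task is not None and not frame.group(1):
            frames.append(frame.group(2))
    return task, frames
```

Fed the trace above, this would report xenwatch (pid 11) and a chain running from schedule_timeout down through kthread_stop, destroy_workqueue, stop_machine_destroy and do_suspend, which makes it easier to see that this trace matches the one in the original description.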

Stefan Bader (smb) wrote :

Hm, that stack trace looks a bit like the task somehow got starved on its way into suspend. There have been regression reports (unfortunately after testing in proposed) on real hardware which were tracked to

commit f0cf1db8f15e8f95f5085f191313694cb623a558
Author: Thomas Gleixner <email address hidden>
Date: Fri Dec 2 16:02:45 2011 +0100

    clockevents: Set noop handler in clockevents_exchange_device()
    BugLink: http://bugs.launchpad.net/bugs/902317
    commit de28f25e8244c7353abed8de0c7792f5f883588c upstream.

That has been reverted now upstream and it will come down via stable. Luiz, does that happen to you all the time or at least often enough to verify with a kernel that has that patch reverted (which I would provide)?

Luiz Ozaki (luiz-ozaki) wrote :

Yep, all the time.

Sure Stefan, give me the kernel and I'll test it.

Cheers.

Stefan Bader (smb) wrote :

Seems I got confused on the kernel versions this morning. The patch I was suspecting actually was not present in 2.6.32-37.81. The code with it did not yet get moved into updates. So clearly not the problem here. That will make it take a bit longer until I can post a kernel to try... :(

Luiz Ozaki (luiz-ozaki) wrote :

Does reverting this make any sense?

http://patchwork.ozlabs.org/patch/129004/

I looked at the kernel source and it seems to still be applied in this release, but I'm not sure if that's causing the problem.

By the way, it works fine in XS 6.0; the problem only occurs in 5.6.

Stefan Bader (smb) wrote :

Tentatively I would say no (it does not make sense). The story there is that for 2.6.35 the code was at a point where save/restore seemed to work. Upstream changed after that to use functionality introduced into the generic interrupt handling code. But that had the problem of not reactivating interrupts early enough. This has now been fixed in a backport to 2.6.32, but 2.6.35 is no longer really cared for upstream. So there it will make sense to revert (or not apply the change) but for .32 not so much.

Besides, you say XS 6.0 works. So there seems to be at least a dependency on the host code. I am not that familiar with the various XS versions. Maybe you could check what Xen version the guest reports in dmesg (for 5.6 and 6.0)?
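
One way to collect that information on both hosts is to scan the guest's dmesg output for the Xen banner. The sketch below assumes the guest prints a line of the form "Xen version: 3.4.2 (preserve-AD)" at boot; that format, the sample line, and the helper name xen_version_from_dmesg are assumptions for illustration and may not match every kernel.

```python
import re

# Assumed boot banner format, e.g. "Xen version: 3.4.2 (preserve-AD)".
XEN_VERSION_RE = re.compile(r"Xen version:\s*(\d+)\.(\d+)(\S*)")


def xen_version_from_dmesg(text):
    """Return (major, minor, extra) from the first Xen banner line, or None."""
    match = XEN_VERSION_RE.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)


# Hypothetical sample line; on a real guest, pass in the saved output of dmesg.
sample = "[    0.000000] Xen version: 3.4.2 (preserve-AD)"
print(xen_version_from_dmesg(sample))  # (3, 4, '.2')
```

Running it against dmesg captured on the 5.6 host and again on the 6.0 host would give two tuples that can be compared directly.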
