Xen VCPUs hotplug does not work

Bug #1007002 reported by Frediano Ziglio
This bug affects 2 people
Affects: linux (Ubuntu)
Status: Won't Fix
Importance: Medium
Assigned to: Unassigned

Bug Description

Hi,
  I discovered a problem with Xen on Ubuntu installed with the default kernel (linux-image-generic). It occurs with both 10.04 and 10.10. I used XenServer 6.0 and an updated, not-yet-released version (I'm a Xen developer). The problem does not occur with linux-image-ec2 (nor with the -server and -virtual flavours). Trying to start the machine with fewer CPUs than the maximum (or trying to remove a CPU after the system is up) causes an OOPS in arch/x86/xen/spinlock.c.

The OOPS is

[ 0.171019] Cannot set affinity for irq 0
[ 0.171027] Broke affinity for irq 6
[ 0.171030] Broke affinity for irq 7
[ 0.171033] Broke affinity for irq 8
[ 0.171035] Broke affinity for irq 9
[ 0.171038] Broke affinity for irq 10
[ 0.171040] Broke affinity for irq 11
[ 0.180000] ------------[ cut here ]------------
[ 0.180000] kernel BUG at /build/buildd/linux-2.6.32/arch/x86/xen/spinlock.c:343!
[ 0.180000] invalid opcode: 0000 [#1] SMP
[ 0.180000] last sysfs file:
[ 0.180000] CPU 4
[ 0.180000] Modules linked in:
[ 0.180000] Pid: 46, comm: kstop/4 Not tainted 2.6.32-33-generic #72-Ubuntu
[ 0.180000] RIP: e030:[<ffffffff8100fcc4>] [<ffffffff8100fcc4>] dummy_handler+0x4/0x10
[ 0.180000] RSP: e02b:ffff880003414e88 EFLAGS: 00010046
[ 0.180000] RAX: ffffffffff57b000 RBX: ffff88003fc1b960 RCX: 0000000000000000
[ 0.180000] RDX: 0000000000400200 RSI: 0000000000000000 RDI: 0000000000000019
[ 0.180000] RBP: ffff880003414e88 R08: 0000000000000000 R09: 0000000000000000
[ 0.180000] R10: ffff88000341c028 R11: 0000000000012ed0 R12: 0000000000000000
[ 0.180000] R13: 0000000000000000 R14: 0000000000000019 R15: 0000000000000100
[ 0.180000] FS: 0000000000000000(0000) GS:ffff880003411000(0000) knlGS:0000000000000000
[ 0.180000] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 0.180000] CR2: 0000000000000000 CR3: 0000000001001000 CR4: 0000000000000660
[ 0.180000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.180000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 0.180000] Process kstop/4 (pid: 46, threadinfo ffff88003e484000, task ffff88003e47ade0)
[ 0.180000] Stack:
[ 0.180000] ffff880003414ed8 ffffffff810c4d80 ffffffff8100f322 0000000000012ed0
[ 0.180000] <0> ffff88000341c028 ffff88003ee1b900 0000000000000019 0000000000000800
[ 0.180000] <0> 0000000000000001 0000000000000100 ffff880003414ef8 ffffffff810c70b2
[ 0.180000] Call Trace:
[ 0.180000] <IRQ>
[ 0.180000] [<ffffffff810c4d80>] handle_IRQ_event+0x60/0x170
[ 0.180000] [<ffffffff8100f322>] ? check_events+0x12/0x20
[ 0.180000] [<ffffffff810c70b2>] handle_percpu_irq+0x42/0x80
[ 0.180000] [<ffffffff81014d12>] handle_irq+0x22/0x30
[ 0.180000] [<ffffffff8131ee29>] xen_evtchn_do_upcall+0x199/0x1c0
[ 0.180000] [<ffffffff810b63a0>] ? stop_cpu+0x0/0xf0
[ 0.180000] [<ffffffff8101333e>] xen_do_hypervisor_callback+0x1e/0x30
[ 0.180000] <EOI>
[ 0.180000] [<ffffffff810b63a0>] ? stop_cpu+0x0/0xf0
[ 0.180000] [<ffffffff8100922a>] ? hypercall_page+0x22a/0x1010
[ 0.180000] [<ffffffff8100922a>] ? hypercall_page+0x22a/0x1010
[ 0.180000] [<ffffffff8100eb7d>] ? xen_force_evtchn_callback+0xd/0x10
[ 0.180000] [<ffffffff8100f322>] ? check_events+0x12/0x20
[ 0.180000] [<ffffffff8100f2c9>] ? xen_irq_enable_direct_end+0x0/0x7
[ 0.180000] [<ffffffff810b645f>] ? stop_cpu+0xbf/0xf0
[ 0.180000] [<ffffffff810800e7>] ? run_workqueue+0xc7/0x1a0
[ 0.180000] [<ffffffff81080263>] ? worker_thread+0xa3/0x110
[ 0.180000] [<ffffffff81084cc0>] ? autoremove_wake_function+0x0/0x40
[ 0.180000] [<ffffffff810801c0>] ? worker_thread+0x0/0x110
[ 0.180000] [<ffffffff81084946>] ? kthread+0x96/0xa0
[ 0.180000] [<ffffffff810131ea>] ? child_rip+0xa/0x20
[ 0.180000] [<ffffffff810123d1>] ? int_ret_from_sys_call+0x7/0x1b
[ 0.180000] [<ffffffff81012b5d>] ? retint_restore_args+0x5/0x6
[ 0.180000] [<ffffffff810131e0>] ? child_rip+0x0/0x20
[ 0.180000] Code: 89 e5 c9 0f 95 c0 c3 55 b8 01 00 00 00 86 07 84 c0 48 89 e5 0f 94 c0 c9 0f b6 c0 c3 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 <0f> 0b eb fe 0f 1f 84 00 00 00 00 00 55 ba ff ff ff ff 48 89 e5
[ 0.180000] RIP [<ffffffff8100fcc4>] dummy_handler+0x4/0x10
[ 0.180000] RSP <ffff880003414e88>
[ 0.180000] ---[ end trace f17e946d22a56015 ]---
[ 0.180579] ------------[ cut here ]------------
[ 0.180589] kernel BUG at /build/buildd/linux-2.6.32/arch/x86/xen/spinlock.c:343!
[ 0.180595] invalid opcode: 0000 [#2] SMP
[ 0.180600] last sysfs file:
[ 0.180602] CPU 0
[ 0.180606] Modules linked in:
[ 0.180612] Pid: 42, comm: kstop/0 Tainted: G D 2.6.32-33-generic #72-Ubuntu
[ 0.180618] RIP: e030:[<ffffffff8100fcc4>] [<ffffffff8100fcc4>] dummy_handler+0x4/0x10
[ 0.180630] RSP: e02b:ffff88000339ce88 EFLAGS: 00010046
[ 0.180633] RAX: ffffffffff57b000 RBX: ffff88003fc1b060 RCX: 0000000000000000
[ 0.180638] RDX: 0000000000400200 RSI: 0000000000000000 RDI: 0000000000000001
[ 0.180641] RBP: ffff88000339ce88 R08: 0000000000000000 R09: 0000000000000000
[ 0.180644] R10: ffff8800033a4028 R11: 0000000000012ed0 R12: 0000000000000000
[ 0.180647] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000100
[ 0.180653] FS: 0000000000000000(0000) GS:ffff880003399000(0000) knlGS:0000000000000000
[ 0.180656] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 0.180660] CR2: 0000000000000000 CR3: 0000000001001000 CR4: 0000000000000660
[ 0.180663] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.180666] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 0.180670] Process kstop/0 (pid: 42, threadinfo ffff88003e454000, task ffff88003ef3c4d0)
[ 0.180672] Stack:
[ 0.180675] ffff88000339ced8 ffffffff810c4d80 ffffffff8100f322 0000000000012ed0
[ 0.180680] <0> ffff8800033a4028 ffffffff817af300 0000000000000001 0000000000000000
[ 0.180686] <0> 0000000000000001 0000000000000100 ffff88000339cef8 ffffffff810c70b2
[ 0.180692] Call Trace:
[ 0.180695] <IRQ>
[ 0.180701] [<ffffffff810c4d80>] handle_IRQ_event+0x60/0x170
[ 0.180707] [<ffffffff8100f322>] ? check_events+0x12/0x20
[ 0.180711] [<ffffffff810c70b2>] handle_percpu_irq+0x42/0x80
[ 0.180719] [<ffffffff81014d12>] handle_irq+0x22/0x30
[ 0.180729] [<ffffffff8131ee29>] xen_evtchn_do_upcall+0x199/0x1c0
[ 0.180737] [<ffffffff810b63a0>] ? stop_cpu+0x0/0xf0
[ 0.180743] [<ffffffff8101333e>] xen_do_hypervisor_callback+0x1e/0x30
[ 0.180747] <EOI>
[ 0.180750] [<ffffffff810b63a0>] ? stop_cpu+0x0/0xf0
[ 0.180756] [<ffffffff8100922a>] ? hypercall_page+0x22a/0x1010
[ 0.180760] [<ffffffff8100922a>] ? hypercall_page+0x22a/0x1010
[ 0.180764] [<ffffffff8100eb7d>] ? xen_force_evtchn_callback+0xd/0x10
[ 0.180768] [<ffffffff8100f322>] ? check_events+0x12/0x20
[ 0.180772] [<ffffffff8100f2c9>] ? xen_irq_enable_direct_end+0x0/0x7
[ 0.180776] [<ffffffff810b645f>] ? stop_cpu+0xbf/0xf0
[ 0.180782] [<ffffffff810800e7>] ? run_workqueue+0xc7/0x1a0
[ 0.180786] [<ffffffff81080263>] ? worker_thread+0xa3/0x110
[ 0.180791] [<ffffffff81084cc0>] ? autoremove_wake_function+0x0/0x40
[ 0.180796] [<ffffffff810801c0>] ? worker_thread+0x0/0x110
[ 0.180799] [<ffffffff81084946>] ? kthread+0x96/0xa0
[ 0.180803] [<ffffffff810131ea>] ? child_rip+0xa/0x20
[ 0.180808] [<ffffffff810123d1>] ? int_ret_from_sys_call+0x7/0x1b
[ 0.180812] [<ffffffff81012b5d>] ? retint_restore_args+0x5/0x6
[ 0.180815] [<ffffffff810131e0>] ? child_rip+0x0/0x20
[ 0.180817] Code: 89 e5 c9 0f 95 c0 c3 55 b8 01 00 00 00 86 07 84 c0 48 89 e5 0f 94 c0 c9 0f b6 c0 c3 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 <0f> 0b eb fe 0f 1f 84 00 00 00 00 00 55 ba ff ff ff ff 48 89 e5
[ 0.180859] RIP [<ffffffff8100fcc4>] dummy_handler+0x4/0x10
[ 0.180863] RSP <ffff88000339ce88>
[ 0.180868] ---[ end trace f17e946d22a56016 ]---
[ 0.180871] Kernel panic - not syncing: Fatal exception in interrupt
[ 0.180874] Pid: 42, comm: kstop/0 Tainted: G D 2.6.32-33-generic #72-Ubuntu
[ 0.180877] Call Trace:
[ 0.180879] <IRQ> [<ffffffff8153ea9e>] panic+0x78/0x139
[ 0.180888] [<ffffffff81542a9a>] oops_end+0xea/0xf0
[ 0.180892] [<ffffffff8101613b>] die+0x5b/0x90
[ 0.180896] [<ffffffff81542334>] do_trap+0xc4/0x170
[ 0.180900] [<ffffffff81013ee5>] do_invalid_op+0x95/0xb0
[ 0.180904] [<ffffffff8100fcc4>] ? dummy_handler+0x4/0x10
[ 0.180908] [<ffffffff8106084a>] ? update_curr_rt+0x12a/0x230
[ 0.180913] [<ffffffff81086b2c>] ? run_posix_cpu_timers+0x2c/0x260
[ 0.180918] [<ffffffff8103886f>] ? pvclock_clocksource_read+0x4f/0xb0
[ 0.180922] [<ffffffff81012f7b>] invalid_op+0x1b/0x20
[ 0.180926] [<ffffffff8100fcc4>] ? dummy_handler+0x4/0x10
[ 0.180930] [<ffffffff810c4d80>] handle_IRQ_event+0x60/0x170
[ 0.180933] [<ffffffff8100f322>] ? check_events+0x12/0x20
[ 0.180937] [<ffffffff810c70b2>] handle_percpu_irq+0x42/0x80
[ 0.180941] [<ffffffff81014d12>] handle_irq+0x22/0x30
[ 0.180945] [<ffffffff8131ee29>] xen_evtchn_do_upcall+0x199/0x1c0
[ 0.180949] [<ffffffff810b63a0>] ? stop_cpu+0x0/0xf0
[ 0.180952] [<ffffffff8101333e>] xen_do_hypervisor_callback+0x1e/0x30
[ 0.180954] <EOI> [<ffffffff810b63a0>] ? stop_cpu+0x0/0xf0
[ 0.180961] [<ffffffff8100922a>] ? hypercall_page+0x22a/0x1010
[ 0.180965] [<ffffffff8100922a>] ? hypercall_page+0x22a/0x1010
[ 0.180969] [<ffffffff8100eb7d>] ? xen_force_evtchn_callback+0xd/0x10
[ 0.180972] [<ffffffff8100f322>] ? check_events+0x12/0x20
[ 0.180976] [<ffffffff8100f2c9>] ? xen_irq_enable_direct_end+0x0/0x7
[ 0.180980] [<ffffffff810b645f>] ? stop_cpu+0xbf/0xf0
[ 0.180984] [<ffffffff810800e7>] ? run_workqueue+0xc7/0x1a0
[ 0.180988] [<ffffffff81080263>] ? worker_thread+0xa3/0x110
[ 0.180992] [<ffffffff81084cc0>] ? autoremove_wake_function+0x0/0x40
[ 0.180996] [<ffffffff810801c0>] ? worker_thread+0x0/0x110
[ 0.181000] [<ffffffff81084946>] ? kthread+0x96/0xa0
[ 0.181003] [<ffffffff810131ea>] ? child_rip+0xa/0x20
[ 0.181007] [<ffffffff810123d1>] ? int_ret_from_sys_call+0x7/0x1b
[ 0.181011] [<ffffffff81012b5d>] ? retint_restore_args+0x5/0x6
[ 0.181014] [<ffffffff810131e0>] ? child_rip+0x0/0x20
[ 0.180000] Kernel panic - not syncing: Fatal exception in interrupt
[ 0.180000] Pid: 46, comm: kstop/4 Tainted: G D 2.6.32-33-generic #72-Ubuntu
[ 0.180000] Call Trace:
[ 0.180000] <IRQ> [<ffffffff8153ea9e>] panic+0x78/0x139
[ 0.180000] [<ffffffff81542a9a>] oops_end+0xea/0xf0
[ 0.180000] [<ffffffff8101613b>] die+0x5b/0x90
[ 0.180000] [<ffffffff81542334>] do_trap+0xc4/0x170
[ 0.180000] [<ffffffff81013ee5>] do_invalid_op+0x95/0xb0
[ 0.180000] [<ffffffff8100fcc4>] ? dummy_handler+0x4/0x10
[ 0.180000] [<ffffffff8106084a>] ? update_curr_rt+0x12a/0x230
[ 0.180000] [<ffffffff81086b2c>] ? run_posix_cpu_timers+0x2c/0x260
[ 0.180000] [<ffffffff8103886f>] ? pvclock_clocksource_read+0x4f/0xb0
[ 0.180000] [<ffffffff81012f7b>] invalid_op+0x1b/0x20
[ 0.180000] [<ffffffff8100fcc4>] ? dummy_handler+0x4/0x10
[ 0.180000] [<ffffffff810c4d80>] handle_IRQ_event+0x60/0x170
[ 0.180000] [<ffffffff8100f322>] ? check_events+0x12/0x20
[ 0.180000] [<ffffffff810c70b2>] handle_percpu_irq+0x42/0x80
[ 0.180000] [<ffffffff81014d12>] handle_irq+0x22/0x30
[ 0.180000] [<ffffffff8131ee29>] xen_evtchn_do_upcall+0x199/0x1c0
[ 0.180000] [<ffffffff810b63a0>] ? stop_cpu+0x0/0xf0
[ 0.180000] [<ffffffff8101333e>] xen_do_hypervisor_callback+0x1e/0x30
[ 0.180000] <EOI> [<ffffffff810b63a0>] ? stop_cpu+0x0/0xf0
[ 0.180000] [<ffffffff8100922a>] ? hypercall_page+0x22a/0x1010
[ 0.180000] [<ffffffff8100922a>] ? hypercall_page+0x22a/0x1010
[ 0.180000] [<ffffffff8100eb7d>] ? xen_force_evtchn_callback+0xd/0x10
[ 0.180000] [<ffffffff8100f322>] ? check_events+0x12/0x20
[ 0.180000] [<ffffffff8100f2c9>] ? xen_irq_enable_direct_end+0x0/0x7
[ 0.180000] [<ffffffff810b645f>] ? stop_cpu+0xbf/0xf0
[ 0.180000] [<ffffffff810800e7>] ? run_workqueue+0xc7/0x1a0
[ 0.180000] [<ffffffff81080263>] ? worker_thread+0xa3/0x110
[ 0.180000] [<ffffffff81084cc0>] ? autoremove_wake_function+0x0/0x40
[ 0.180000] [<ffffffff810801c0>] ? worker_thread+0x0/0x110
[ 0.180000] [<ffffffff81084946>] ? kthread+0x96/0xa0
[ 0.180000] [<ffffffff810131ea>] ? child_rip+0xa/0x20
[ 0.180000] [<ffffffff810123d1>] ? int_ret_from_sys_call+0x7/0x1b
[ 0.180000] [<ffffffff81012b5d>] ? retint_restore_args+0x5/0x6
[ 0.180000] [<ffffffff810131e0>] ? child_rip+0x0/0x20

Looking at http://www.gossamer-threads.com/lists/xen/devel/195525 and http://xen.1045712.n5.nabble.com/PATCH-xen-events-do-not-unmask-polled-ipis-on-restore-td3241695.html, the problem is known, but for some reason the patch was never integrated into the regular kernels.

Could you integrate the patch (from -ec2 or from mainline Linux)?
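For reference, the in-guest equivalent of the vCPU removal that triggers the oops is toggling the CPU's sysfs "online" flag. A minimal dry-run sketch (it only prints the writes rather than performing them; the helper name and the CPU number are illustrative assumptions, and on a real test guest the printed commands would be run as root):

```shell
# Sketch: compute the sysfs "online" path for a given vCPU and print the
# writes that offline/online it. vcpu_online_path is an assumed helper;
# cpu0 usually cannot be offlined, so cpu1 is used as the example.
vcpu_online_path() {
    echo "/sys/devices/system/cpu/cpu$1/online"
}

CPU=1                             # assumed example vCPU
ONLINE=$(vcpu_online_path "$CPU")
echo "echo 0 > ${ONLINE}   # offline the vCPU (the path that oopses here)"
echo "echo 1 > ${ONLINE}   # bring it back online"
```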

Revision history for this message
Frediano Ziglio (frediano-ziglio) wrote :
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-meta (Ubuntu):
status: New → Confirmed
Brad Figg (brad-figg)
affects: linux-meta (Ubuntu) → linux (Ubuntu)
Revision history for this message
Frediano Ziglio (frediano-ziglio) wrote :

I forgot to mention: the problem happens on 64-bit machines.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds. Please test the latest v3.5 kernel [0] (not a kernel in the daily directory) and install both the linux-image and linux-image-extra .deb packages.

Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag (only that one tag; please leave the other tags). This can be done by clicking on the yellow pencil icon next to the tags at the bottom of the bug description and deleting the 'needs-upstream-testing' text.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5-rc1-quantal/

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
tags: added: lucid needs-upstream-testing
Revision history for this message
Gonçalo Gomes (gagomes) wrote :

The problem is not reproducible using the images below in a Lucid guest on XenServer 6.0.2

linux-image-3.5.0-030500rc1-generic_3.5.0-030500rc1.201206022335_amd64.deb
linux-image-extra-3.5.0-030500rc1-generic_3.5.0-030500rc1.201206022335_amd64.deb

(Images supplied by Joseph Salisbury, see comment above)

tags: added: kernel-fixed-upstream
removed: needs-upstream-testing
Revision history for this message
Frediano Ziglio (frediano-ziglio) wrote :
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Frediano Ziglio (frediano-ziglio) wrote :

The same issue occurs with an up-to-date 11.04 (amd64).

Revision history for this message
Stefan Bader (smb) wrote :

The patch mentioned in comment #1 has already been in the 10.04 kernel for a while, and I remember there have been some follow-ups on it since then. So I would appreciate seeing a backtrace produced by what is considered an up-to-date kernel (10.04 should be at least 2.6.32-41.91), ideally repeated for all the kernel versions that are assumed to be broken.
I believe I remember the issue going through some loops: fixing the oops then caused migration to fail, which needed another fix.

Stefan Bader (smb)
Changed in linux (Ubuntu):
assignee: nobody → Stefan Bader (stefan-bader-canonical)
Revision history for this message
Frediano Ziglio (frediano-ziglio) wrote :

Thank you Stefan.

Ok, I'll do the test again with all 3 versions.

Revision history for this message
Frediano Ziglio (frediano-ziglio) wrote :

Lucid (10.04) with kernel 3.0.0-23-generic works correctly.

Revision history for this message
Frediano Ziglio (frediano-ziglio) wrote :

Ubuntu 10.10 (Maverick) with linux-image-2.6.35-32-generic crashes (8 → 7 vCPUs):

[ 150.864022] Cannot set affinity for irq 0
[ 150.864040] Broke affinity for irq 42
[ 150.864042] Broke affinity for irq 43
[ 150.864044] Broke affinity for irq 44
[ 150.864046] Broke affinity for irq 45
[ 150.864048] Broke affinity for irq 46
[ 150.864050] Broke affinity for irq 47
[ 150.865076] ------------[ cut here ]------------
[ 150.865102] kernel BUG at /build/buildd/linux-2.6.35/arch/x86/xen/spinlock.c:344!
[ 150.865110] invalid opcode: 0000 [#1] SMP
[ 150.865118] last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
[ 150.865126] CPU 1
[ 150.865129] Modules linked in: lp parport xen_blkfront xen_netfront
[ 150.865143]
[ 150.865148] Pid: 6, comm: migration/1 Not tainted 2.6.35-32-generic #67-Ubuntu /
[ 150.865157] RIP: e030:[<ffffffff81007d24>] [<ffffffff81007d24>] dummy_handler+0x4/0x10
[ 150.865173] RSP: e02b:ffff880003e89ea8 EFLAGS: 00010046
[ 150.865179] RAX: ffffffffff57b000 RBX: ffff88003fc0c2a0 RCX: 0000000000000000
[ 150.865188] RDX: 0000000000400200 RSI: 0000000000000000 RDI: 0000000000000007
[ 150.865196] RBP: ffff880003e89ea8 R08: 0000000000000200 R09: 0000000000000000
[ 150.865204] R10: ffff880003e91028 R11: 0000000000012ef0 R12: 0000000000000000
[ 150.865212] R13: 0000000000000000 R14: 0000000000000007 R15: 0000000000000100
[ 150.865225] FS: 00007f4000384700(0000) GS:ffff880003e86000(0000) knlGS:0000000000000000
[ 150.865235] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 150.865242] CR2: 00007f3fff9b5cc1 CR3: 0000000001a2a000 CR4: 0000000000002660
[ 150.865250] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 150.865259] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 150.865267] Process migration/1 (pid: 6, threadinfo ffff88003ea5a000, task ffff88003ea52de0)
[ 150.865276] Stack:
[ 150.865280] ffff880003e89ef8 ffffffff810cb300 0000000000000200 0000000000000000
[ 150.865293] <0> ffffffff81a23840 ffffffff81a23940 0000000000000007 0000000000000200
[ 150.865307] <0> 0000000000000001 0000000000000100 ffff880003e89f18 ffffffff810cda42
[ 150.865323] Call Trace:
[ 150.865328] <IRQ>
[ 150.865337] [<ffffffff810cb300>] handle_IRQ_event+0x50/0x160
[ 150.865347] [<ffffffff810cda42>] handle_percpu_irq+0x42/0x80
[ 150.865358] [<ffffffff81349a56>] xen_evtchn_do_upcall+0x1d6/0x200
[ 150.865368] [<ffffffff810b3300>] ? cpu_stopper_thread+0x110/0x1d0
[ 150.865378] [<ffffffff8100b02e>] xen_do_hypervisor_callback+0x1e/0x30
[ 150.865385] <EOI>
[ 150.865392] [<ffffffff810b3300>] ? cpu_stopper_thread+0x110/0x1d0
[ 150.865403] [<ffffffff8100122a>] ? hypercall_page+0x22a/0x1010
[ 150.865409] [<ffffffff8100122a>] ? hypercall_page+0x22a/0x1010
[ 150.865416] [<ffffffff81006bad>] ? xen_force_evtchn_callback+0xd/0x10
[ 150.865421] [<ffffffff81007352>] ? check_events+0x12/0x20
[ 150.865427] [<ffffffff810072f9>] ? xen_irq_enable_direct_end+0x0/0x7
[ 150.865433] [<ffffffff810b346f>] ? stop_machine_cpu_stop+0xaf/0xe0
[ 150.865438] [<ffffffff810b33c0>] ? stop_machine_cpu_stop+0x0/0xe0
[ 150.865444] [<ffffffff810b32e6>] ? cpu_stopper_thread+0xf6...

Revision history for this message
Frediano Ziglio (frediano-ziglio) wrote :

Ubuntu Natty (11.04) with linux-image-2.6.38-15-generic crashes as well:

[ 91.662162] ------------[ cut here ]------------
[ 91.662169] kernel BUG at /build/buildd/linux-2.6.38/arch/x86/xen/spinlock.c:344!
[ 91.662173] invalid opcode: 0000 [#1] SMP
[ 91.662178] last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
[ 91.662182] CPU 1
[ 91.662184] Modules linked in: lp parport
[ 91.662190]
[ 91.662193] Pid: 7, comm: migration/1 Not tainted 2.6.38-15-generic #64-Ubuntu
[ 91.662200] RIP: e030:[<ffffffff81009964>] [<ffffffff81009964>] dummy_handler+0x4/0x10
[ 91.662209] RSP: e02b:ffff88001ffa0ea8 EFLAGS: 00010046
[ 91.662212] RAX: ffff88001ffa8060 RBX: ffff88001e9a8300 RCX: 0000000000000001
[ 91.662216] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000128
[ 91.662219] RBP: ffff88001ffa0ea8 R08: ffff88001e824b00 R09: ffff88001e4005c8
[ 91.662223] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000128
[ 91.662226] R13: 0000000000000128 R14: 0000000000000000 R15: ffff88001ffa8060
[ 91.662234] FS: 00007f10fd122740(0000) GS:ffff88001ff9d000(0000) knlGS:0000000000000000
[ 91.662237] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 91.662241] CR2: 00007f10fd129000 CR3: 0000000003938000 CR4: 0000000000002660
[ 91.662244] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 91.662248] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 91.662252] Process migration/1 (pid: 7, threadinfo ffff88001e9a6000, task ffff88001e99db80)
[ 91.662255] Stack:
[ 91.662257] ffff88001ffa0ef8 ffffffff810d3ba4 000000000000000a 0000000000000200
[ 91.662264] ffff88001ffa0ee8 ffff88001e824b00 0000000000000128 0000000000000200
[ 91.662271] 0000000000000001 ffff88001ffa8060 ffff88001ffa0f18 ffffffff810d6783
[ 91.662278] Call Trace:
[ 91.662281] <IRQ>
[ 91.662287] [<ffffffff810d3ba4>] handle_IRQ_event+0x54/0x180
[ 91.662292] [<ffffffff810d6783>] handle_percpu_irq+0x43/0x80
[ 91.662297] [<ffffffff8137249f>] __xen_evtchn_do_upcall+0x19f/0x1c0
[ 91.662303] [<ffffffff810bb740>] ? stop_machine_cpu_stop+0x0/0xe0
[ 91.662327] [<ffffffff813745bf>] xen_evtchn_do_upcall+0x2f/0x50
[ 91.662332] [<ffffffff8100cf6e>] xen_do_hypervisor_callback+0x1e/0x30
[ 91.662335] <EOI>
[ 91.662339] [<ffffffff810bb740>] ? stop_machine_cpu_stop+0x0/0xe0
[ 91.662346] [<ffffffff8100122a>] ? hypercall_page+0x22a/0x1000
[ 91.662350] [<ffffffff8100122a>] ? hypercall_page+0x22a/0x1000
[ 91.662355] [<ffffffff810077ed>] ? xen_force_evtchn_callback+0xd/0x10
[ 91.662360] [<ffffffff81007ee2>] ? check_events+0x12/0x20
[ 91.662364] [<ffffffff81007e89>] ? xen_irq_enable_direct_end+0x0/0x7
[ 91.662369] [<ffffffff810bb7e7>] ? stop_machine_cpu_stop+0xa7/0xe0
[ 91.662373] [<ffffffff810bb9a8>] ? cpu_stopper_thread+0xe8/0x1b0
[ 91.662379] [<ffffffff81055eaa>] ? finish_task_switch+0x4a/0xe0
[ 91.662385] [<ffffffff815c79ec>] ? schedule+0x3ec/0x760
[ 91.662390] [<ffffffff810bb8c0>] ? cpu_stopper_thread+0x0/0x1b0
[ 91.662395] [<ffffffff81087996>] ? kthread+0x96/0xa0
[ 91.662400] [<ffffffff8100ce24>] ? kernel_thread_helper+0x4/0x10
[ 91.662405] ...


Revision history for this message
Stefan Bader (smb) wrote :

Ok, I think I got confused by the traces from the Lucid kernel (2.6.32). For Maverick (2.6.35) there won't be much we can do, as it went out of support (see https://wiki.ubuntu.com/Releases). Natty (2.6.38) is just about still in support (until October) and looks like it is missing the additional changes that went into Lucid (IRQD_RESUME_EARLY, IRQF_FORCE_RESUME). So let me look at this for Natty.

Revision history for this message
Stefan Bader (smb) wrote :

Oh, and to be really sure: the current 2.6.32 (Lucid) kernel, that is ok, right?

Revision history for this message
Frediano Ziglio (frediano-ziglio) wrote :

Ubuntu Lucid (10.04) with kernel 2.6.32-41-generic hangs.

Revision history for this message
Stefan Bader (smb) wrote :

Ok, so Lucid needs some more investigation. What you do is basically writing 0/1 into /sys/devices/system/cpu/cpu?/online, right?

For Natty I have prepared some test kernels at: http://people.canonical.com/~smb/lp1007002/
I did a quick offline/online of one CPU and a save/restore on 32bit. This looked ok. Could you give it a try as well?

Revision history for this message
Stefan Bader (smb) wrote :

I did try offlining/onlining a vCPU within a HVM Lucid guest here and that seems to be working. Maybe you could give me the details of your test procedure. Note that I am using a Xen 4.1.2 hypervisor.

Revision history for this message
Frediano Ziglio (frediano-ziglio) wrote :

I'm actually using a PV guest on XenServer (not the open-source version of Xen). The command I use is something like:

xe vm-vcpu-hotplug vm=Ubuntu\ Lucid\ Lynx\ 10.04\ \(64-bit\) new-vcpus=4

(changing the machine name and vCPU count as appropriate)

Revision history for this message
Mike McClurg (mike-mcclurg) wrote : Re: [Bug 1007002] Re: Xen VCPUs hotplug does not work

We can probably provide Canonical with a pre-release build of
XenServer, if that would be helpful to you, Stefan.

Revision history for this message
Stefan Bader (smb) wrote :

@Mike, let me check whether I can trigger the problem using a PV guest and xm vcpu-set. If that is not successful, then we can consider the next steps. It should be somewhat independent of the host, since 3.0 guests do work.

Revision history for this message
Stefan Bader (smb) wrote :

Hm, interesting. Using a 64-bit PV Lucid guest, I could use "xm vcpu-set 1" to change the vcpus from 2 to 1. This produced some worrying lines in dmesg, though:

[ 48.208923] CPU0 attaching NULL sched-domain.
[ 48.208940] CPU1 attaching NULL sched-domain.
[ 48.270196] CPU0 attaching NULL sched-domain.
[ 48.271423] Cannot set affinity for irq 6
[ 48.271433] Broke affinity for irq 7
[ 48.271439] Broke affinity for irq 8
[ 48.271444] Broke affinity for irq 9
[ 48.271450] Broke affinity for irq 10
[ 48.271455] Broke affinity for irq 11
[ 48.370172] SMP alternatives: switching to UP code

But at least the guest kept running. A subsequent "xm vcpu-set 2" appeared to have no visible effect on the guest: the vcpu count remained 1, but it did not hang either. I was, however, able to online the second cpu from within the guest. The kernel used was 2.6.32-41-server. That said, the recommended non-ec2 kernel flavours for running in PV guests would be -server for 64-bit and -generic-pae for 32-bit (mostly because those had the drivers for virtual environments built in instead of as modules).

@Frediano, would using the server flavour make a difference for you? And when would the hang occur, when adding or when removing a vcpu? And when running tail -f on the syslog, does that show any other hints?
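The host-side/guest-side sequence Stefan describes can be sketched as follows. This is a dry-run sketch that only prints the commands; the domain name "lucid-pv", the helper function, and the vCPU numbers are assumed placeholders:

```shell
# Dry-run sketch of the sequence: shrink/grow vCPUs from dom0 with xm,
# then bring the re-added (still offline) vCPU online from inside the guest.
vcpu_set_cmd() {                # assumed helper: build the dom0 command line
    echo "xm vcpu-set $1 $2"
}

DOMAIN=lucid-pv                 # placeholder domain name
echo "# on dom0:"
vcpu_set_cmd "$DOMAIN" 1        # shrink the guest from 2 to 1 vCPU
vcpu_set_cmd "$DOMAIN" 2        # re-add the vCPU (it comes back offline)
echo "# inside the guest, bring the re-added vCPU online:"
echo "echo 1 > /sys/devices/system/cpu/cpu1/online"
```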

Revision history for this message
Stefan Bader (smb) wrote :

Oh, I just realized the "xm vcpu-set 2" did have some effect: it added the vcpu, but offline.

Revision history for this message
Frediano Ziglio (frediano-ziglio) wrote :

I don't know the open-source side that well; on our side we have utilities (XS Tools) that bring these CPUs online.

The problem (hang or crash) occurs when removing CPUs.

I'll try the tail check.

Revision history for this message
Stefan Bader (smb) wrote :

I am closing this as "Won't Fix" for lack of a better closing code. The described problem should be fixed in Precise or later (which also no longer use a non-standard kernel to run under Xen). I cannot remember whether, in the end, it was also working on Lucid, but given the limited support time left there, and that there are now two more recent LTS versions which should work better, I do not think there will be any change on Lucid even if the old issue persists. Feel free to re-open if you think I did the wrong thing.

Changed in linux (Ubuntu):
assignee: Stefan Bader (smb) → nobody
status: Confirmed → Won't Fix