e1000 irq problems after live migration with qemu-kvm 0.12.4

Bug #585113 reported by Peter Lieven
Affects: QEMU
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

Sorry for resubmitting. I accidentally moved this bug to qemu-kvm on Launchpad, where it is stuck...

After live migrating Ubuntu 9.10 Server (2.6.31-14-server) and SUSE Linux 10.1 (2.6.16.13-4-smp) guests,
the guest sometimes runs into IRQ problems. I mention these two guest OSes
because I have seen the error there; there are likely others around with the same problem.

On the host I run 2.6.33.3 (kernel+mod) and qemu-kvm 0.12.4.

I started a VM with:
/usr/bin/qemu-kvm-0.12.4 -net tap,vlan=141,script=no,downscript=no,ifname=tap0 -net nic,vlan=141,model=e1000,macaddr=52:54:00:ff:00:72 -drive file=/dev/sdb,if=ide,boot=on,cache=none,aio=native -m 1024 -cpu qemu64,model_id='Intel(R) Xeon(R) CPU E5430 @ 2.66GHz' -monitor tcp:0:4001,server,nowait -vnc :1 -name 'migration-test-9-10' -boot order=dc,menu=on -k de -incoming tcp:172.21.55.22:5001 -pidfile /var/run/qemu/vm-155.pid -mem-path /hugepages -mem-prealloc -rtc base=utc,clock=host -usb -usbdevice tablet

For testing I have a clean Ubuntu 9.10 Server 64-bit install and created a small script which fetches a DVD ISO from a local server and checks its md5sum in an endless loop.

The download performance is approx. 50 MB/s on that VM.
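The test loop described above is simple enough to sketch. A minimal version, assuming a placeholder URL and checksum (the report does not give the actual ones):

```python
import hashlib
import urllib.request

# Sketch of the load generator described above: fetch a DVD ISO in an
# endless loop and verify its md5sum. The URL and expected checksum are
# placeholders, not values from the report.

def md5_of(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def fetch_and_check(url: str, expected_md5: str) -> bool:
    with urllib.request.urlopen(url) as resp:
        return md5_of(resp.read()) == expected_md5

# Endless loop as in the report (runs until a checksum mismatch):
# while fetch_and_check("http://local-server/dvd.iso", "placeholder-md5"):
#     pass
```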

To trigger the error I did several migrations of the VM throughout the last days. Finally I ended up with the following oops in the guest:

[64442.298521] irq 10: nobody cared (try booting with the "irqpoll" option)
[64442.299175] Pid: 0, comm: swapper Not tainted 2.6.31-14-server #48-Ubuntu
[64442.299179] Call Trace:
[64442.299185] <IRQ> [<ffffffff810b4b96>] __report_bad_irq+0x26/0xa0
[64442.299227] [<ffffffff810b4d9c>] note_interrupt+0x18c/0x1d0
[64442.299232] [<ffffffff810b5415>] handle_fasteoi_irq+0xd5/0x100
[64442.299244] [<ffffffff81014bdd>] handle_irq+0x1d/0x30
[64442.299246] [<ffffffff810140b7>] do_IRQ+0x67/0xe0
[64442.299249] [<ffffffff810129d3>] ret_from_intr+0x0/0x11
[64442.299266] [<ffffffff810b3234>] ? handle_IRQ_event+0x24/0x160
[64442.299269] [<ffffffff810b529f>] ? handle_edge_irq+0xcf/0x170
[64442.299271] [<ffffffff81014bdd>] ? handle_irq+0x1d/0x30
[64442.299273] [<ffffffff810140b7>] ? do_IRQ+0x67/0xe0
[64442.299275] [<ffffffff810129d3>] ? ret_from_intr+0x0/0x11
[64442.299290] [<ffffffff81526b14>] ? _spin_unlock_irqrestore+0x14/0x20
[64442.299302] [<ffffffff8133257c>] ? scsi_dispatch_cmd+0x16c/0x2d0
[64442.299307] [<ffffffff8133963a>] ? scsi_request_fn+0x3aa/0x500
[64442.299322] [<ffffffff8125fafc>] ? __blk_run_queue+0x6c/0x150
[64442.299324] [<ffffffff8125fcbb>] ? blk_run_queue+0x2b/0x50
[64442.299327] [<ffffffff8133899f>] ? scsi_run_queue+0xcf/0x2a0
[64442.299336] [<ffffffff81339a0d>] ? scsi_next_command+0x3d/0x60
[64442.299338] [<ffffffff8133a21b>] ? scsi_end_request+0xab/0xb0
[64442.299340] [<ffffffff8133a50e>] ? scsi_io_completion+0x9e/0x4d0
[64442.299348] [<ffffffff81036419>] ? default_spin_lock_flags+0x9/0x10
[64442.299351] [<ffffffff8133224d>] ? scsi_finish_command+0xbd/0x130
[64442.299353] [<ffffffff8133aa95>] ? scsi_softirq_done+0x145/0x170
[64442.299356] [<ffffffff81264e6d>] ? blk_done_softirq+0x7d/0x90
[64442.299368] [<ffffffff810651fd>] ? __do_softirq+0xbd/0x200
[64442.299370] [<ffffffff810131ac>] ? call_softirq+0x1c/0x30
[64442.299372] [<ffffffff81014b85>] ? do_softirq+0x55/0x90
[64442.299374] [<ffffffff81064f65>] ? irq_exit+0x85/0x90
[64442.299376] [<ffffffff810140c0>] ? do_IRQ+0x70/0xe0
[64442.299379] [<ffffffff810129d3>] ? ret_from_intr+0x0/0x11
[64442.299380] <EOI> [<ffffffff810356f6>] ? native_safe_halt+0x6/0x10
[64442.299390] [<ffffffff8101a20c>] ? default_idle+0x4c/0xe0
[64442.299395] [<ffffffff815298f5>] ? atomic_notifier_call_chain+0x15/0x20
[64442.299398] [<ffffffff81010e02>] ? cpu_idle+0xb2/0x100
[64442.299406] [<ffffffff815123c6>] ? rest_init+0x66/0x70
[64442.299424] [<ffffffff81838047>] ? start_kernel+0x352/0x35b
[64442.299427] [<ffffffff8183759a>] ? x86_64_start_reservations+0x125/0x129
[64442.299429] [<ffffffff81837698>] ? x86_64_start_kernel+0xfa/0x109
[64442.299433] handlers:
[64442.299840] [<ffffffffa0000b80>] (e1000_intr+0x0/0x190 [e1000])
[64442.300046] Disabling IRQ #10

After this the guest is still alive, but download performance drops to approx. 500 KB/s.
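One way to confirm from inside the guest that the kernel really disabled the NIC's interrupt line: after "Disabling IRQ #10", the per-CPU counters on that IRQ's row in /proc/interrupts stop advancing. A small sketch (the device name and polling interval are assumptions, not from the report):

```python
import time

# Check whether the NIC's IRQ counter in /proc/interrupts still advances.
# A stalled counter while traffic is flowing suggests the kernel has
# disabled the interrupt line, as in the oops above.

def irq_count(line: str) -> int:
    """Sum the per-CPU counters on one /proc/interrupts line."""
    total = 0
    for tok in line.split(":", 1)[1].split():
        if not tok.isdigit():
            break  # first non-numeric token is the interrupt-chip name
        total += int(tok)
    return total

def nic_irq_stalled(device: str = "eth0", interval: float = 2.0) -> bool:
    """True if the device's IRQ counter made no progress over `interval`."""
    def read_count() -> int:
        with open("/proc/interrupts") as f:
            for line in f:
                if device in line:
                    return irq_count(line)
        return -1
    before = read_count()
    time.sleep(interval)
    return read_count() == before
```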

This error is definitely not triggerable with the option -no-kvm-irqchip. I have seen this error occasionally
since my first experiments with qemu-kvm-88, and also without hugetlbfs.

Help appreciated.

Revision history for this message
Peter Lieven (plieven) wrote :

I did two additional tests:

1) Stop VM, live migrate, continue -> triggers the BUG.
2) Stop VM, continue -> does NOT trigger the BUG.

My guess is that pending interrupts are incorrectly transferred with the kernel irqchip.
As said earlier, the userspace irqchip does not trigger the bug.
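The two test sequences above can be driven through the QEMU human monitor, which the report's VM exposes with -monitor tcp:0:4001,server,nowait. A minimal sketch; the helper names are mine, the destination URI is taken from the report's -incoming option, and error handling is deliberately omitted:

```python
import socket

# Build and send the monitor command sequences for the two test variants.

def migration_test_commands(dest_uri="tcp:172.21.55.22:5001", migrate=True):
    """Monitor command sequence: stop, (optionally) live-migrate, continue."""
    cmds = ["stop"]
    if migrate:  # variant 1: stop, live migrate, continue -> triggers the bug
        cmds.append("migrate -d %s" % dest_uri)
    cmds.append("cont")  # variant 2 (stop, continue) does not trigger it
    return cmds

def send_commands(host, port, cmds):
    with socket.create_connection((host, port)) as s:
        for c in cmds:
            s.sendall(c.encode() + b"\n")

# e.g.: send_commands("localhost", 4001, migration_test_commands())
```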

Revision history for this message
Peter Lieven (plieven) wrote :

Additional info:

1) If I use the rtl8139 NIC model instead of e1000, the VM freezes at 100% CPU after migration.
2) Ubuntu Lucid LTS 64-bit Server is also affected and shows the same symptoms.

Revision history for this message
Anthony Liguori (anthony-codemonkey) wrote :

Can you attempt to reproduce this against the latest upstream git? I believe a fix for this has been committed and we probably need to backport it to stable.

Revision history for this message
Peter Lieven (plieven) wrote : Re: [Bug 585113] Re: e1000 irq problems after live migration with qemu-kvm 0.12.4

Sent from my iPhone

On 28.05.2010 at 14:50, Anthony Liguori <email address hidden> wrote:

> Can you attempt to reproduce this against the latest upstream git? I
> believe a fix for this has been committed and we probably need to
> backport it to stable.
>
Anthony, can you specify which commit contains the bug fix, please? I
would like to cherry-pick it and apply it to 0.12.4, as I did with the
DMA cancel patch. Thx, Peter

Revision history for this message
Peter Lieven (plieven) wrote :

Update: testing done with latest git.

I can confirm that I no longer see "irq ... nobody cared" errors in Ubuntu 9.10 and 10.04 64-bit guests, but I now run directly
into 100% CPU after migration (#584516).

If I boot the guest with the kernel parameter no-kvmclock or clocksource=acpi_pm, the migration succeeds.

Can you please point out which commit fixed the issue with the kernel irqchip, so we can backport it to stable?

Ryan Harper (raharper)
Changed in qemu:
status: New → Fix Committed
Revision history for this message
Michael Tokarev (mjt+launchpad-tls) wrote :

Please note that this bug affects 0.12 stable as well. It'd be really nice to know the commit which fixed the issue, in order to backport it to -stable...

Aurelien Jarno (aurel32)
Changed in qemu:
status: Fix Committed → Fix Released
Revision history for this message
Angelo Pantano (ghilteras) wrote :

Unfortunately I cannot confirm the fix. With version:

ii qemu-kvm 0.12.5+dfsg-5+squeeze8

and a Karmic 9.10 guest,

when I live migrate it from serverA (Dell PowerEdge R210) to serverB (dual-core
Intel 3.20GHz), my guest always gets stuck; oddly enough, this does
not happen the other way around.

If I migrate the guest back, it revives: it goes back to being pingable and I can again issue commands on old open shells.

Both hosts have the identical environment as hypervisors, same
packages, same versions.

cheers

Revision history for this message
Michael Tokarev (mjt+launchpad-tls) wrote :

Which fix are you talking about? I don't remember seeing any backport to 0.12, or even identification of the commit which fixed this issue, and I don't remember adding it to the Debian qemu-kvm package. Was there a fix?

Thanks,

/mjt

Revision history for this message
Michael Tokarev (mjt+launchpad-tls) wrote :

Oh. The fix is part of the 0.12.5 stable series; see also Debian #580649. Well, it fixed the issue for me and for the original reporter of #580649, and it was a real bug with a real fix. So I don't think yours is the same issue.

And again, please verify whether this issue is still present in the latest release. I'm not sure I'll be able to provide much help for it, however; the squeeze version is very old now, even though it is still used by many people.

Revision history for this message
Michael Tokarev (mjt+launchpad-tls) wrote :

And the original issue is indeed very different: without that fix you would never see your guest un-stick when you migrate it back.
