Live Migration Guest Network Stops Responding

Bug #612671 reported by Moshe Ortov
This bug affects 1 person
Affects           Status   Importance  Assigned to  Milestone
libvirt           Expired  Medium
libvirt (Ubuntu)  Invalid  Low         Unassigned

Bug Description

After configuring 2 identical servers with KVM and then trying to migrate an image using bridged networking from one to the other, the guest loses network connectivity after the migration completes.

A tcpdump shows the bridged network passing packets /from/ the guest (a test ping run in the guest via the VNC console). The reply packets are received and bridged back onto the guest's network interface, but the guest never sees them: the request comes from vnet0 and is sent out on br0, the reply comes back on br0 and is forwarded onto vnet0, yet the guest never receives it.

# tcpdump -n -i vnet0
18:19:47.072378 IP 192.168.35.49 > 192.168.35.4: ICMP echo request, id 512, seq 2816, length 40
This is the echo request sent by the guest.

18:19:47.073119 IP 192.168.35.4 > 192.168.35.49: ICMP echo reply, id 512, seq 2816, length 40
This is the reply coming back; the guest itself never sees it.

# lsb_release -rd
Description: Ubuntu 10.04.1 LTS
Release: 10.04

The packet forwarding seems to go amiss in the bridge, in qemu, or in libvirt. There are a number of posts online from people reporting similar problems, but none that I've found explains why it happens or how to fix it.
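One way to narrow down where the reply is lost is to compare the destination MAC of the frames arriving on vnet0 with the MAC configured on the guest's virtual NIC; if they differ, the emulated NIC will silently drop the frames. A hedged sketch (the domain name "myguest" is a placeholder; the interface names are from this report):

```shell
# MAC address libvirt configured for the guest's interface
# ("myguest" is a placeholder domain name)
virsh dumpxml myguest | grep -A 3 '<interface'

# Print Ethernet headers (-e) so the destination MAC of each
# reply frame on the guest's tap device is visible
tcpdump -n -e -i vnet0 icmp
```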

# brctl show
bridge name bridge id STP enabled interfaces
br0 8000.001517c8ef5e no eth1
       eth2
       vnet0

# brctl showstp br0
br0
 bridge id 8000.001517c8ef5e
 designated root 8000.001517c8ef5e
 root port 0 path cost 0
 max age 20.00 bridge max age 20.00
 hello time 2.00 bridge hello time 2.00
 forward delay 0.00 bridge forward delay 0.00
 ageing time 300.00
 hello timer 1.64 tcn timer 0.00
 topology change timer 0.00 gc timer 0.64
 flags

eth1 (1)
 port id 8001 state disabled
 designated root 8000.001517c8ef5e path cost 100
 designated bridge 8000.001517c8ef5e message age timer 0.00
 designated port 8001 forward delay timer 0.00
 designated cost 0 hold timer 0.00
 flags

eth2 (2)
 port id 8002 state forwarding
 designated root 8000.001517c8ef5e path cost 4
 designated bridge 8000.001517c8ef5e message age timer 0.00
 designated port 8002 forward delay timer 0.00
 designated cost 0 hold timer 0.64
 flags

vnet0 (3) <-- the guest nic
 port id 8003 state forwarding
 designated root 8000.001517c8ef5e path cost 100
 designated bridge 8000.001517c8ef5e message age timer 0.00
 designated port 8003 forward delay timer 0.00
 designated cost 0 hold timer 0.64
 flags
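When replies reach vnet0 but the guest misses them, the bridge's forwarding database is also worth checking; `brctl showmacs` lists which port each learned MAC was last seen on, so a guest MAC stuck on the wrong port would stand out. A sketch, assuming the stock output format (port, MAC, is-local?, ageing timer):

```shell
# Full forwarding database of the bridge
brctl showmacs br0

# Filter to learned (non-local) entries only: port, MAC, ageing timer
brctl showmacs br0 | awk 'NR > 1 && $3 == "no" {print $1, $2, $4}'
```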

Revision history for this message
Lijian (lijian-redhat-bugs) wrote :

Description of problem:
    I'm testing the live migration feature following the instructions in https://fedoraproject.org/wiki/QA:Testcase_Live_Migration_using_libvirt/virsh. The migration does work, but pinging the guest OS is interrupted just after the migration succeeds, and the network doesn't work even if I restart NetworkManager in the guest OS.
    This happens with both Fedora 12 and Windows XP SP3 guests, so it may be a bug.
    BTW, I'm using bridged networking on both the source and destination machines, and the interface names are the same ("br0").

Version-Release number of selected component (if applicable):
    * kernel-2.6.33.1-19.fc13
    * libvirt-0.7.7-1.fc13
    * python-virtinst-0.500.2-3.fc13
    * qemu-0.12.3-6.fc13
    * seabios-0.5.1-1.fc13
    * virt-manager-0.8.3-2.fc13

How reproducible:
    Always.

Steps to Reproduce:
1. Share the image with dest host via NFS.
2. Start Guest OS, and ping it.
3. Run "virsh migrate --live testxp qemu+ssh://10.66.65.51/system"
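The steps above can be driven from two terminals; a minimal sketch using the names from this report (GUEST_IP is a placeholder for the guest's address):

```shell
# Terminal 1: continuous ping of the guest; note the exact
# sequence number where replies stop (GUEST_IP is hypothetical)
ping GUEST_IP

# Terminal 2: live-migrate while the ping runs
virsh migrate --live testxp qemu+ssh://10.66.65.51/system
```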

Actual results:
    Ping returns nothing after the guest OS on the source host shuts down; re-pinging it outputs:

From 10.66.65.190 icmp_seq=2611 Destination Host Unreachable
From 10.66.65.190 icmp_seq=2612 Destination Host Unreachable
From 10.66.65.190 icmp_seq=2613 Destination Host Unreachable

    The same happens when pinging from the Guest OS to the Host OS.

Expected results:
    Ping continues uninterrupted.

Additional info:

Revision history for this message
Cole (cole-redhat-bugs) wrote :

Can you still reproduce this issue? On the remote host, if you do 'virsh define $vmname' for the migrated VM and then reboot the VM, do you get network connectivity?

Do VMs on the remote host have network connectivity to start? Can you provide the XML of the migrated guest? Can the guest ping public websites at all?

Revision history for this message
Cole (cole-redhat-bugs) wrote :

Closing as INSUFFICIENT_DATA. Please reopen if you can reproduce with latest F13 or rawhide packages.

Revision history for this message
Moshe Ortov (moshe-ortov) wrote :

In further testing, I have found an Ubuntu 9.04 guest /does/ keep network connectivity.

The guests that are failing are Windows XP SP3. I've tried configurations with and without the virtio network driver, but this does not seem to make any difference.

I will post additional information as my debugging progresses. If there are specific details that would help track this down, please let me know and I'll work to assist.

Revision history for this message
Moshe Ortov (moshe-ortov) wrote :

After much searching, I found what looks like the same bug reported on RedHat : https://bugzilla.redhat.com/show_bug.cgi?id=580806

My setup is the same: bridged networking, XP SP3 (Pro). I am live migrating (migrate --live <vmname> qemu+ssh://server/system).

The servers are AMD64 quad-core; the hardware is identical in every way between them, so I know there is no hardware or CPU variation that accounts for this. (The fact that the Ubuntu VMs migrate and keep working also establishes that the system can and does work for other guests.)

Revision history for this message
Moshe Ortov (moshe-ortov) wrote :

Forgot to also add that a reboot of the XP VM /does/ return networking to a working state. The reboot has to be done via the VNC connection to the console, using the Windows shutdown/restart option. The VM itself is running and operational; it just seems to have lost the returning network packets.

Revision history for this message
Moshe (moshe-redhat-bugs) wrote :

I have what seems to be an identical issue.

I can confirm that a reboot of the XP SP3 guest OS /does/ return the networking to operation - at the expense of making live migration a rather pointless exercise, since the intention is to avoid the reboot and keep things running seamlessly.

The identical cluster migrating a Linux VM works properly - there is no interruption to operation. NeatX/NoMachine sessions continue to operate and no pings are lost.

Right now, this seems to affect XP SP3. I've tried it with and without the virtio network driver but the result is identical.

My systems are AMD64 quad core CPUs - both servers are identical in all hardware.

A tcpdump shows the XP guest sends out packets; these are forwarded over the bridge, and the reply packets come back onto the guest's vnet0 device, but the guest does not see them. E.g. a tcpdump shows both the ping request and the reply packet, yet the guest itself never receives the reply even though tcpdump shows it arriving.
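A frequent cause of exactly this symptom after live migration is stale ARP state upstream: qemu normally announces the guest's MAC on the destination host after migration, but if that announcement is lost, forcing a fresh ARP from inside the guest can restore traffic. A hedged workaround sketch, not confirmed by this report (iputils arping; eth0 and the IP address are the guest's own):

```shell
# Inside a Linux guest after migration:
# send unsolicited ARP announcing our own IP to update neighbours
arping -c 3 -U -I eth0 192.168.35.49

# Inside a Windows XP guest, renewing the lease has a similar effect:
#   ipconfig /release && ipconfig /renew
```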

Mathias Gug (mathiaz)
Changed in libvirt (Ubuntu):
importance: Undecided → Low
Revision history for this message
Moshe Ortov (moshe-ortov) wrote :

Further experimentation, building a brand new XP SP2 VM, upgrading it to SP3, and then performing migrations, seems a little better, but not perfect.

A migration from node 1 to node 2 (in either order) works for the first migration. If a second migration is tried, then when the VM arrives back at the original node (i.e. node 1 -> node 2 -> node 1) it is frozen (hung). CPU usage for the VM process is pegged at 100%.

I tried switching the CPU emulation to i686 from x86_64 - this was based on a posting about XP VMs locking up but that does not seem to have made any difference.

However, if there is a 'long' delay between migrations (> 30 minutes in what I tried), the migration back /does/ seem to work. This has happened a couple of times, but I've not found out how long the VM needs to stay on node 2 before it can successfully migrate back to node 1 (30 minutes was not specifically chosen as a threshold).

I'm not sure if the wait works in absolutely every case - I'll need to try more experiments. If there are any suggestions on what I can do to help get the cause identified & fixed, let me know.
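Before retrying the return migration, it may be worth confirming the previous job has fully finished; a sketch using virsh (command availability depends on the libvirt version; testxp is the domain name from the Red Hat report):

```shell
# Any migration job still in progress for the domain?
virsh domjobinfo testxp

# The domain should report "running" on the node it migrated to
virsh domstate testxp
```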

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Hi,

Thanks for taking the time to help diagnose this problem.

Are you still having this problem? If so, can you look at the MAC addresses of the bridges and the veth* devices before and after migration, as well as the 'iptables -L' output?

Can you describe the physical network topology? Are both the source and destination hosts on the same router?
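The requested information can be captured on each host before and after a migration; a minimal sketch:

```shell
# MAC addresses of the bridge and the guest's tap device
ip link show br0
ip link show vnet0

# Firewall rules that could be dropping forwarded traffic
iptables -L -n -v

# Whether bridged frames are passed through iptables at all
sysctl net.bridge.bridge-nf-call-iptables
```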

Changed in libvirt (Ubuntu):
status: New → Incomplete
Changed in libvirt (Ubuntu):
status: Incomplete → Invalid
Changed in libvirt:
importance: Unknown → Medium
status: Unknown → Expired