KVM instance stops communicating after some time

Bug #1016848 reported by Christian Parpart
This bug affects 3 people
Affects: OpenStack Compute (nova)
Status: Incomplete
Importance: Undecided
Assigned to: Unassigned

Bug Description

I am running a single nova-network node as gateway and have about 20 KVM instances spread over 4 compute nodes (one of which is also the controller node). Everything runs Ubuntu 12.04 LTS.

From time to time, one instance or another loses connectivity: it still has its IP address (DHCP lease times were raised to 7 days), but no communication is possible in either direction.

This pretty much looks like some kind of networking problem, but what exactly stopped working?

I connected to the failing KVM instance via VNC and checked its interface, which looks normal (like those of the other, working instances).

On the hypervisor, I see the following state:
root@colossus09:~# ifconfig
br100 Link encap:Ethernet HWaddr 00:25:90:49:d9:04
          inet6 addr: fe80::2c48:74ff:fe22:a6cb/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:85845 errors:0 dropped:15 overruns:0 frame:0
          TX packets:7463 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2906526 (2.9 MB) TX bytes:641770 (641.7 KB)

eth0 Link encap:Ethernet HWaddr 00:25:90:49:d9:04
          inet addr:10.10.30.189 Bcast:10.10.31.255 Mask:255.255.224.0
          inet6 addr: fe80::225:90ff:fe49:d904/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:1359563761 errors:0 dropped:174156 overruns:2 frame:0
          TX packets:1222020947 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1111996716949 (1.1 TB) TX bytes:673176161112 (673.1 GB)
          Memory:fafe0000-fb000000

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:4628041 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4628041 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1060925632 (1.0 GB) TX bytes:1060925632 (1.0 GB)

vlan100 Link encap:Ethernet HWaddr 00:25:90:49:d9:04
          inet6 addr: fe80::225:90ff:fe49:d904/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:909059394 errors:0 dropped:0 overruns:0 frame:0
          TX packets:907044613 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1053993706102 (1.0 TB) TX bytes:641297608033 (641.2 GB)

vnet0 Link encap:Ethernet HWaddr fe:16:3e:3e:f4:58
          inet6 addr: fe80::fc16:3eff:fe3e:f458/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:62963968 errors:0 dropped:0 overruns:0 frame:0
          TX packets:61960786 errors:0 dropped:0 overruns:1 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:52542425624 (52.5 GB) TX bytes:84912733569 (84.9 GB)

vnet1 Link encap:Ethernet HWaddr fe:16:3e:01:ec:81
          inet6 addr: fe80::fc16:3eff:fe01:ec81/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:1280 errors:0 dropped:0 overruns:0 frame:0
          TX packets:56964 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:110032 (110.0 KB) TX bytes:2461222 (2.4 MB)

vnet2 Link encap:Ethernet HWaddr fe:16:3e:3c:46:1b
          inet6 addr: fe80::fc16:3eff:fe3c:461b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:34725792 errors:0 dropped:0 overruns:0 frame:0
          TX packets:35909449 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:2321718823 (2.3 GB) TX bytes:10039460160 (10.0 GB)

vnet0 is almost certainly the tap device of the failing KVM instance.

root@colossus09:~# ps afx | grep /kvm
 5080 pts/12 S+ 0:00 \_ grep --color=auto /kvm
 1811 ? Sl 848:32 /usr/bin/kvm -S -M pc-1.0 -enable-kvm -m 16384 -smp 8,sockets=8,cores=1,threads=1 -name instance-00000036 -uuid 6dee1800-6e1e-42dd-abe9-8d8efa752bc5 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-00000036.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -drive file=/var/lib/nova/instances/instance-00000036/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/var/lib/nova/instances/instance-00000036/disk.local,if=none,id=drive-virtio-disk1,format=qcow2,cache=none -device virtio-blk-pci,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1 -netdev tap,fd=21,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:3c:46:1b,bus=pci.0,addr=0x3 -chardev file,id=charserial0,path=/var/lib/nova/instances/instance-00000036/console.log -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -usb -device usb-tablet,id=input0 -vnc 0.0.0.0:2 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
 2275 ? Sl 4235:21 /usr/bin/kvm -S -M pc-1.0 -enable-kvm -m 32768 -smp 20,sockets=20,cores=1,threads=1 -name instance-00000011 -uuid 48e3db02-a8ec-4140-8faa-d1f1f101ef29 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-00000011.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -drive file=/var/lib/nova/instances/instance-00000011/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=17,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:3e:f4:58,bus=pci.0,addr=0x3 -chardev file,id=charserial0,path=/var/lib/nova/instances/instance-00000011/console.log -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -usb -device usb-tablet,id=input0 -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
 2667 ? Sl 28:37 /usr/bin/kvm -S -M pc-1.0 -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -name instance-0000000f -uuid cb9aed4b-5daa-4c1c-85a6-9101adddde8d -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-0000000f.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -drive file=/var/lib/nova/instances/instance-0000000f/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=16,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:01:ec:81,bus=pci.0,addr=0x3 -chardev file,id=charserial0,path=/var/lib/nova/instances/instance-0000000f/console.log -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -usb -device usb-tablet,id=input0 -vnc 0.0.0.0:1 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
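
As a rough cross-check of which vnet device belongs to which instance, the MAC addresses from the ifconfig and ps output above can be compared. This is only a sketch: it assumes that a host-side tap device and its guest NIC share the last five octets of the MAC (the output above shows fe:16:3e:... on the host and fa:16:3e:... in the kvm command lines), and the function names are made up for illustration.

```python
# MAC addresses copied from the ifconfig and ps output above.
TAP_MACS = {
    "vnet0": "fe:16:3e:3e:f4:58",
    "vnet1": "fe:16:3e:01:ec:81",
    "vnet2": "fe:16:3e:3c:46:1b",
}
GUEST_MACS = {
    "instance-00000036": "fa:16:3e:3c:46:1b",
    "instance-00000011": "fa:16:3e:3e:f4:58",
    "instance-0000000f": "fa:16:3e:01:ec:81",
}


def same_nic(tap_mac, guest_mac):
    """Assumption: tap and guest MAC differ only in the first octet."""
    return tap_mac.lower().split(":")[1:] == guest_mac.lower().split(":")[1:]


def map_taps_to_instances(tap_macs, guest_macs):
    """Return {tap device: instance name} for every MAC pair that matches."""
    mapping = {}
    for tap, tap_mac in tap_macs.items():
        for instance, guest_mac in guest_macs.items():
            if same_nic(tap_mac, guest_mac):
                mapping[tap] = instance
    return mapping


if __name__ == "__main__":
    for tap, instance in sorted(map_taps_to_instances(TAP_MACS, GUEST_MACS).items()):
        print(tap, "->", instance)
```

Under that assumption, vnet0 pairs with instance-00000011, vnet1 with instance-0000000f, and vnet2 with instance-00000036.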

root@colossus09:~# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 10.10.30.4 0.0.0.0 UG 100 0 0 eth0
10.10.0.0 0.0.0.0 255.255.224.0 U 0 0 0 eth0

root@colossus09:~# brctl show
bridge name bridge id STP enabled interfaces
br100 8000.00259049d904 no vlan100
                                                        vnet0
                                                        vnet1
                                                        vnet2

root@colossus09:~# dmesg | grep vnet0 | tail -n 5
[827452.395730] br100: port 2(vnet0) entering disabled state
[827468.595961] device vnet0 entered promiscuous mode
[827468.661699] br100: port 2(vnet0) entering forwarding state
[827468.661705] br100: port 2(vnet0) entering forwarding state
[827479.315601] vnet0: no IPv6 routers present

root@colossus09:~# brctl showmacs br100
port no mac addr is local? ageing timer
  1 00:25:90:2b:63:de no 22.36
  1 00:25:90:49:bf:ce no 21.91
  1 00:25:90:49:bf:e2 no 22.56
  1 00:25:90:49:d9:04 yes 0.00
  3 fa:16:3e:01:ec:81 no 107.35
  1 fa:16:3e:14:b4:16 no 21.90
  1 fa:16:3e:28:e0:ab no 21.74
  1 fa:16:3e:2b:be:38 no 21.65
  1 fa:16:3e:31:92:53 no 21.78
  1 fa:16:3e:3b:74:7a no 21.92
  4 fa:16:3e:3c:46:1b no 0.00
  1 fa:16:3e:3d:ff:f3 no 0.00
  2 fa:16:3e:3e:f4:58 no 0.77
  1 fa:16:3e:42:8f:59 no 22.36
  1 fa:16:3e:43:bb:04 no 21.92
  1 fa:16:3e:47:65:c8 no 21.99
  1 fa:16:3e:4f:e6:4c no 21.78
  1 fa:16:3e:57:a7:e6 no 22.50
  1 fa:16:3e:59:8d:93 no 0.50
  1 fa:16:3e:64:fc:b5 no 21.71
  1 fa:16:3e:67:dc:73 no 22.27
  1 fa:16:3e:72:7f:3d no 21.80
  1 fa:16:3e:7f:8c:5c no 22.03
  3 fe:16:3e:01:ec:81 yes 0.00
  4 fe:16:3e:3c:46:1b yes 0.00
  2 fe:16:3e:3e:f4:58 yes 0.00

I then added an IP address to br100 so that I could test directly with ping.

root@colossus09:~# ip addr add 10.10.40.90/21 dev br100
root@colossus09:~# ip route flush table cache

root@colossus09:~# ping -c1 10.10.40.17
PING 10.10.40.17 (10.10.40.17) 56(84) bytes of data.
64 bytes from 10.10.40.17: icmp_req=1 ttl=64 time=0.533 ms

--- 10.10.40.17 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.533/0.533/0.533/0.000 ms
root@colossus09:~# ping -c1 10.10.40.9
PING 10.10.40.9 (10.10.40.9) 56(84) bytes of data.

--- 10.10.40.9 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

The first ping, to 10.10.40.17 (a KVM instance directly on this host, on vnet1 or vnet2), works.
The second ping, to the failing KVM instance at 10.10.40.9, times out.
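
For reference, the address added to br100 (10.10.40.90/21) puts both ping targets on the same subnet, so the host reaches them directly over the bridge rather than through the default gateway. A minimal sketch with Python's standard `ipaddress` module, where `on_link` is a made-up helper name:

```python
import ipaddress


def on_link(interface_cidr, target):
    """True if target lies inside the subnet of the given interface address."""
    network = ipaddress.ip_interface(interface_cidr).network
    return ipaddress.ip_address(target) in network


if __name__ == "__main__":
    bridge = "10.10.40.90/21"  # address added to br100 above
    print("subnet:", ipaddress.ip_interface(bridge).network)
    for target in ("10.10.40.17", "10.10.40.9", "10.10.40.1"):
        print(target, "on-link:", on_link(bridge, target))
```

The /21 subnet is 10.10.40.0/21, which covers 10.10.40.0 through 10.10.47.255, so 10.10.40.17, 10.10.40.9 and 10.10.40.1 are all on-link for br100.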

root@colossus09:~# tcpdump -c 4 -n -i vnet0
tcpdump: WARNING: vnet0: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vnet0, link-type EN10MB (Ethernet), capture size 65535 bytes
10:44:19.514639 ARP, Request who-has 10.10.40.1 tell 10.10.40.9, length 28
10:44:20.514110 ARP, Request who-has 10.10.40.1 tell 10.10.40.9, length 28
10:44:21.902920 ARP, Request who-has 10.10.40.1 tell 10.10.40.9, length 28
10:44:22.902762 ARP, Request who-has 10.10.40.1 tell 10.10.40.9, length 28
4 packets captured
4 packets received by filter

The above shows (I think) that 10.10.40.9 is asking for the MAC address of 10.10.40.1 and that nobody answers, though I might be misinterpreting this. At the very least, something is not answering.
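
The same reading of the capture can be checked mechanically by counting ARP requests that never get a matching reply. This is a sketch under assumptions: the function name is hypothetical, the regular expressions assume tcpdump's usual `ARP, Request who-has X tell Y` and `ARP, Reply X is-at MAC` line formats, and four packets is a very small sample.

```python
import re

# ARP lines copied from the tcpdump output above.
CAPTURE = [
    "10:44:19.514639 ARP, Request who-has 10.10.40.1 tell 10.10.40.9, length 28",
    "10:44:20.514110 ARP, Request who-has 10.10.40.1 tell 10.10.40.9, length 28",
    "10:44:21.902920 ARP, Request who-has 10.10.40.1 tell 10.10.40.9, length 28",
    "10:44:22.902762 ARP, Request who-has 10.10.40.1 tell 10.10.40.9, length 28",
]

REQUEST = re.compile(r"ARP, Request who-has (\S+) tell (\S+),")
REPLY = re.compile(r"ARP, Reply (\S+) is-at (\S+),")


def unanswered_arp_targets(lines):
    """Return {target IP: request count} for targets with no reply in lines."""
    requested = {}
    answered = set()
    for line in lines:
        match = REQUEST.search(line)
        if match:
            target = match.group(1)
            requested[target] = requested.get(target, 0) + 1
            continue
        match = REPLY.search(line)
        if match:
            answered.add(match.group(1))
    return {ip: count for ip, count in requested.items() if ip not in answered}


if __name__ == "__main__":
    print(unanswered_arp_targets(CAPTURE))
```

On the four captured lines this reports 10.10.40.1 as requested four times with no reply seen on vnet0, which matches the interpretation above. It does not show whether the requests leave the bridge or whether replies are dropped on the way back.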

I can see the same ARP requests via tcpdump when inside the KVM instance (via VNC).

What can I do to *fix* this?

For me, this incident is major, since we just cannot add more production instances until we have fixed this. :-(

Best regards,
Christian.

Christian Parpart (trapni) wrote :

This happened twice more within the last day.

`nova boot $INSTANCE` worked around it.

What can I do?

Thierry Carrez (ttx) wrote :

Looks like a duplicate of bug 997978 -- could you confirm that the symptoms are the same?

Changed in nova:
status: New → Incomplete
no longer affects: ubuntu

Gunni (fgunni) wrote :

For me, the workaround of changing the NICs from virtio to e1000 helped.

Christian Parpart (trapni) wrote :

Gunni, can you please mark yourself as "this also affects me" in bug 997978, as you seem to have the same issue?

Also, where exactly did you make that change? And can it be set as the default somewhere, so that new instances use e1000 instead of virtio?

Many thanks,
Christian.
