KVM instance stops communicating after some time
Affects: OpenStack Compute (nova)
Status: Incomplete
Importance: Undecided
Assigned to: Unassigned
Bug Description
I am running a single nova-network node as gateway and have about 20 KVM instances spread over 4 compute nodes (one of which is also the controller node); everything runs Ubuntu 12.04 LTS.
From time to time one instance or another loses connectivity: it still has its IP address (DHCP lease times were raised to 7 days), but no communication is possible in either direction.
This looks very much like some kind of networking problem, but what exactly has stopped working?
I connected to the failing KVM instance via VNC and checked its interface, which looks perfectly normal (just like the working ones).
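The guest-side checks over VNC were roughly along these lines (standard iproute2 commands; eth0 is the interface name inside this particular guest):
root@instance:~# ip -s link show eth0   # link state plus RX/TX and error counters
root@instance:~# ip addr show eth0      # confirm the DHCP address is still assigned
root@instance:~# ip neigh show          # ARP cache; the gateway entry is the interesting one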
On the hypervisor, I see the following state:
root@colossus09:~# ifconfig
br100 Link encap:Ethernet HWaddr 00:25:90:49:d9:04
inet6 addr: fe80::2c48:
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:85845 errors:0 dropped:15 overruns:0 frame:0
TX packets:7463 errors:0 dropped:0 overruns:0 carrier:0
RX bytes:2906526 (2.9 MB) TX bytes:641770 (641.7 KB)
eth0 Link encap:Ethernet HWaddr 00:25:90:49:d9:04
inet addr:10.10.30.189 Bcast:10.10.31.255 Mask:255.255.224.0
inet6 addr: fe80::225:
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1359563761 errors:0 dropped:174156 overruns:2 frame:0
TX packets:1222020947 errors:0 dropped:0 overruns:0 carrier:0
RX bytes:1111996716949 (1.1 TB) TX bytes:673176161112 (673.1 GB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:4628041 errors:0 dropped:0 overruns:0 frame:0
TX packets:4628041 errors:0 dropped:0 overruns:0 carrier:0
RX bytes:1060925632 (1.0 GB) TX bytes:1060925632 (1.0 GB)
vlan100 Link encap:Ethernet HWaddr 00:25:90:49:d9:04
inet6 addr: fe80::225:
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:909059394 errors:0 dropped:0 overruns:0 frame:0
TX packets:907044613 errors:0 dropped:0 overruns:0 carrier:0
RX bytes:1053993706102 (1.0 TB) TX bytes:641297608033 (641.2 GB)
vnet0 Link encap:Ethernet HWaddr fe:16:3e:3e:f4:58
inet6 addr: fe80::fc16:
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:62963968 errors:0 dropped:0 overruns:0 frame:0
TX packets:61960786 errors:0 dropped:0 overruns:1 carrier:0
RX bytes:52542425624 (52.5 GB) TX bytes:84912733569 (84.9 GB)
vnet1 Link encap:Ethernet HWaddr fe:16:3e:01:ec:81
inet6 addr: fe80::fc16:
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1280 errors:0 dropped:0 overruns:0 frame:0
TX packets:56964 errors:0 dropped:0 overruns:0 carrier:0
RX bytes:110032 (110.0 KB) TX bytes:2461222 (2.4 MB)
vnet2 Link encap:Ethernet HWaddr fe:16:3e:3c:46:1b
inet6 addr: fe80::fc16:
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:34725792 errors:0 dropped:0 overruns:0 frame:0
TX packets:35909449 errors:0 dropped:0 overruns:0 carrier:0
RX bytes:2321718823 (2.3 GB) TX bytes:10039460160 (10.0 GB)
vnet0 is almost certainly the tap device of the failing KVM instance.
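To double-check the vnet0-to-instance mapping, libvirt can list each domain's interfaces (instance-00000001 below is a hypothetical domain name; take the real ones from virsh list):
root@colossus09:~# virsh list
root@colossus09:~# virsh domiflist instance-00000001   # columns include target device (vnetX), MAC and bridge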
root@colossus09:~# ps afx | grep /kvm
5080 pts/12 S+ 0:00 \_ grep --color=auto /kvm
1811 ? Sl 848:32 /usr/bin/kvm -S -M pc-1.0 -enable-kvm -m 16384 -smp 8,sockets=
2275 ? Sl 4235:21 /usr/bin/kvm -S -M pc-1.0 -enable-kvm -m 32768 -smp 20,sockets=
2667 ? Sl 28:37 /usr/bin/kvm -S -M pc-1.0 -enable-kvm -m 512 -smp 1,sockets=
root@colossus09:~# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 10.10.30.4 0.0.0.0 UG 100 0 0 eth0
10.10.0.0 0.0.0.0 255.255.224.0 U 0 0 0 eth0
root@colossus09:~# brctl show
bridge name bridge id STP enabled interfaces
br100 8000.00259049d904 no vlan100
root@colossus09:~# dmesg | grep vnet0 | tail -n 5
[827452.395730] br100: port 2(vnet0) entering disabled state
[827468.595961] device vnet0 entered promiscuous mode
[827468.661699] br100: port 2(vnet0) entering forwarding state
[827468.661705] br100: port 2(vnet0) entering forwarding state
[827479.315601] vnet0: no IPv6 routers present
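The dmesg lines show port 2 (vnet0) re-entering forwarding state; to confirm it is still in that state now rather than disabled, bridge-utils can dump the per-port STP state:
root@colossus09:~# brctl showstp br100   # per-port state: forwarding / learning / blocking / disabled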
root@colossus09:~# brctl showmacs br100
port no mac addr is local? ageing timer
1 00:25:90:2b:63:de no 22.36
1 00:25:90:49:bf:ce no 21.91
1 00:25:90:49:bf:e2 no 22.56
1 00:25:90:49:d9:04 yes 0.00
3 fa:16:3e:01:ec:81 no 107.35
1 fa:16:3e:14:b4:16 no 21.90
1 fa:16:3e:28:e0:ab no 21.74
1 fa:16:3e:2b:be:38 no 21.65
1 fa:16:3e:31:92:53 no 21.78
1 fa:16:3e:3b:74:7a no 21.92
4 fa:16:3e:3c:46:1b no 0.00
1 fa:16:3e:3d:ff:f3 no 0.00
2 fa:16:3e:3e:f4:58 no 0.77
1 fa:16:3e:42:8f:59 no 22.36
1 fa:16:3e:43:bb:04 no 21.92
1 fa:16:3e:47:65:c8 no 21.99
1 fa:16:3e:4f:e6:4c no 21.78
1 fa:16:3e:57:a7:e6 no 22.50
1 fa:16:3e:59:8d:93 no 0.50
1 fa:16:3e:64:fc:b5 no 21.71
1 fa:16:3e:67:dc:73 no 22.27
1 fa:16:3e:72:7f:3d no 21.80
1 fa:16:3e:7f:8c:5c no 22.03
3 fe:16:3e:01:ec:81 yes 0.00
4 fe:16:3e:3c:46:1b yes 0.00
2 fe:16:3e:3e:f4:58 yes 0.00
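Reading this table: fa:16:3e:3e:f4:58 on port 2 is the failing guest's MAC, and its ageing timer of 0.77 means the bridge saw a frame from that guest less than a second ago, so traffic from the guest is still reaching br100. The entry can be watched live with plain bridge-utils:
root@colossus09:~# watch -n 1 'brctl showmacs br100 | grep -i 3e:f4:58'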
I then added an IP address to br100 so I can test directly via ping.
root@colossus09:~# ip addr add 10.10.40.90/21 dev br100
root@colossus09:~# ip route flush table cache
root@colossus09:~# ping -c1 10.10.40.17
PING 10.10.40.17 (10.10.40.17) 56(84) bytes of data.
64 bytes from 10.10.40.17: icmp_req=1 ttl=64 time=0.533 ms
--- 10.10.40.17 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.533/0.
root@colossus09:~# ping -c1 10.10.40.9
PING 10.10.40.9 (10.10.40.9) 56(84) bytes of data.
--- 10.10.40.9 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
The first ping works: 10.10.40.17 is a KVM instance running directly on this host (vnet1 or vnet2). I then tried to ping the failing instance at 10.10.40.9, which times out.
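A layer-2 probe would separate ARP from IP here; assuming the iputils-arping package is installed, the guest's MAC can be queried directly over the bridge:
root@colossus09:~# arping -c 3 -I br100 10.10.40.9   # ARP who-has sent straight out br100; a reply means layer 2 is fine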
root@colossus09:~# tcpdump -c 4 -n -i vnet0
tcpdump: WARNING: vnet0: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vnet0, link-type EN10MB (Ethernet), capture size 65535 bytes
10:44:19.514639 ARP, Request who-has 10.10.40.1 tell 10.10.40.9, length 28
10:44:20.514110 ARP, Request who-has 10.10.40.1 tell 10.10.40.9, length 28
10:44:21.902920 ARP, Request who-has 10.10.40.1 tell 10.10.40.9, length 28
10:44:22.902762 ARP, Request who-has 10.10.40.1 tell 10.10.40.9, length 28
4 packets captured
4 packets received by filter
The above shows, I think, that 10.10.40.9 wants to learn the MAC address of 10.10.40.1, but nobody answers. I might be misinterpreting this, but at the very least someone is not answering.
I can see the same ARP requests via tcpdump from inside the KVM instance (via VNC).
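Given that the requests leave the guest and appear on vnet0, two follow-up checks could narrow down where they die (ARP itself sits below iptables, so the second check only covers the IP side):
root@colossus09:~# tcpdump -n -i vlan100 arp and host 10.10.40.9   # do the who-has frames make it past the bridge onto the VLAN?
root@colossus09:~# iptables-save | grep 10.10.40.9                 # any nova-network rule that mentions the failing instance's IP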
What can I do to *fix* this?
For me, this incident is major, since we simply cannot add more production instances until this is fixed. :-(
Best regards,
Christian.
It happened again twice yesterday.
`nova boot $INSTANCE` worked around it.
What can I do?
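Two lighter things that might be worth trying before rebuilding the instance (untested guesses on my part, plain iproute2/bridge-utils; they bounce the tap device in case the bridge port itself is wedged):
root@colossus09:~# ip link set vnet0 down && ip link set vnet0 up
root@colossus09:~# brctl delif br100 vnet0 && brctl addif br100 vnet0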