Tap interface does not automatically get an IP address upon a hypervisor reboot

Bug #1084355 reported by Salman Baset
This bug affects 8 people
Affects: neutron
Status: Invalid
Importance: Medium
Assigned to: dan wendlandt

Bug Description

A very simple configuration: one network with one subnet, and no floating IPs.

The tap interface gets an IP address when a subnet is created. For instance, for the 172.16.10.0/24 subnet, the tap interface gets an IP address of 172.16.10.2.

However, upon a reboot, the tap interface does not always show up and does not automatically get an IP address. As a result, IP assignment to a new instance fails.

Tags: ovs
Revision history for this message
Gary Kotton (garyk) wrote :

Hi,
Can you please clarify a few things:
1. Are you working from packages?
2. The tap interface that you are referring to is from the DHCP agent. Can you please check if this is running after reboot (please also check the log file)
Thanks
Gary

Changed in quantum:
status: New → Incomplete
Revision history for this message
Salman Baset (salman-h) wrote :

I did an install from Folsom packages. The DHCP agent is running (verified from the log file and service status).

Revision history for this message
Sumit Naiksatam (snaiksat) wrote :

Can you please provide the output of: ip link show

I am guessing that the tap is not set to UP by the dhcp agent after reboot. "ip link show" will tell us if that's the case.
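A minimal manual check along those lines (the tap device name below is purely illustrative; substitute the one your own "ip link show" output reports):

```shell
# List links and look for tap devices; "state DOWN" on a tap line would
# confirm the agent did not bring the interface up after the reboot.
ip link show | grep 'tap' || echo "no tap devices present"

# If a tap is DOWN, bringing it up by hand (as root) is the manual
# workaround; the device name here is only an example:
#   ip link set tapaacc2584-09 up
```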

Revision history for this message
Phil Hopkins (phil-hopkins-a) wrote :
Download full text (11.4 KiB)

I have run into the same problem. Here is what I found comparing RHEL 6.3, Fedora 17, and Ubuntu 12.10.

Using an "all-in-one" install.
In all three scenarios nova is configured with:
start_guests_on_host_boot=true # (this seems to cause problems on RHEL 6.3; it is set to false there)
resume_guests_state_on_host_boot=true

Quantum is configured with one network and a minimum of one subnet. In this case, the output of quantum subnet-list:

quantum subnet-list
+--------------------------------------+-----------------+-------------+--------------------------------------------+
| id | name | cidr | allocation_pools |
+--------------------------------------+-----------------+-------------+--------------------------------------------+
| 1fff853c-6949-483b-8bdc-3d3aa0fdc23b | private-subnet2 | 10.0.0.8/29 | {"start": "10.0.0.10", "end": "10.0.0.14"} |
| c14577fe-24ed-4af8-9bdd-0a7b976ca20b | private-subnet1 | 10.0.0.0/29 | {"start": "10.0.0.2", "end": "10.0.0.6"} |
+--------------------------------------+-----------------+-------------+--------------------------------------------+

One or more instances are running and are accessible through their network interfaces.

After issuing a reboot on the "all-in-one" node, the system reboots and the instance(s) are restarted; however, network access to the instance(s) does not work. It can be restored using the following process:
For RHEL 6.3:
after a reboot:
ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:99:14:71 brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.58/24 brd 192.168.122.255 scope global eth0
    inet6 fe80::5054:ff:fe99:1471/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 52:54:00:09:7b:24 brd ff:ff:ff:ff:ff:ff
4: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
    link/ether 4e:6f:72:f9:e3:ed brd ff:ff:ff:ff:ff:ff
5: br-int: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether 5e:61:a5:f5:f0:4f brd ff:ff:ff:ff:ff:ff
    inet6 fe80::7438:28ff:feee:e5a6/64 scope link
       valid_lft forever preferred_lft forever
6: tapaacc2584-09: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether 0a:89:f4:70:5a:e5 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.2/29 brd 10.0.0.7 scope global tapaacc2584-09
    inet6 fe80::889:f4ff:fe70:5ae5/64 scope link
       valid_lft forever preferred_lft forever
8: br-ex: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether 52:fa:95:eb:a9:4d brd ff:ff:ff:ff:ff:ff
    inet6 fe80::ce2:5cff:fe82:9c5e/64 scope link
       valid_lft forever preferred_lft forever
10: tap6de885fb-d0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 500
    link/ether 8e:2d:05:45:cd:7b brd ff:ff:ff:ff:ff:ff
11: br-tun: <BROADCA...

Revision history for this message
dan wendlandt (danwent) wrote :

Thanks for the very detailed report Phil!

This behavior probably depends on what vif-plugging mechanism you are using in Nova (and hence is likely a Nova change, not a Quantum change, but the quantum team is probably best placed to debug it, so I'd keep this issue also filed against Quantum).

Based on the fact that you're using OVS and you are seeing tap devices, it is correct to assume you are using

libvirt_vif_driver=nova.virt.libvirt.vif.LibvirtOpenVswitchDriver

It may be that this line should not be underneath the if-check that checks if the device already exists. I wonder if libvirt somehow saves the fact that the tap device exists, and thus it exists when plug() is called, which prevents us from setting it up. If that is the case, we should move this line out from under the if-check that tests if the device exists.

https://github.com/openstack/nova/blob/master/nova/virt/libvirt/vif.py#L159

Are you able to directly edit the code and check?

It would also be very interesting to understand what happens in this same scenario with other vif-drivers.

In particular, if you are using libvirt 0.9.11 or newer, the preferred vif-driver is actually:

libvirt_vif_driver=nova.virt.libvirt.vif.LibvirtOpenVswitchVirtualPortDriver

In this case, libvirt automatically manages the devices connected to br-int, and thus I would expect that you wouldn't see this problem (but can't say for sure...)

If libvirt behaves as described above when "resuming" VMs, there may also be negative complications for the hybrid driver, which is used when a plugin like the OVS plugin also needs to use iptables rules (e.g., for security groups).

libvirt_vif_driver=nova.virt.libvirt.vif.LibvirtHybridOVSBridgeDriver

Changed in quantum:
status: Incomplete → Confirmed
importance: Undecided → High
assignee: nobody → dan wendlandt (danwent)
Changed in nova:
assignee: nobody → dan wendlandt (danwent)
status: New → Confirmed
Revision history for this message
Phil Hopkins (phil-hopkins-a) wrote :

First, I am using all of the standard packages from EPEL for RHEL (http://repos.fedorapeople.org/repos/openstack/openstack-folsom/epel-6) on:

RHEL 6.3

libvirt-0.9.10-21.el6.x86_64

also

/etc/nova/nova.conf:libvirt_vif_driver=nova.virt.libvirt.vif.LibvirtOpenVswitchDriver

changing line 159 in /usr/lib/python2.6/site-packages/nova/virt/libvirt/vif.py from:

        if not linux_net._device_exists(dev):
            # Older version of the command 'ip' from the iproute2 package
            # don't have support for the tuntap option (lp:882568). If it
            # turns out we're on an old version we work around this by using
            # tunctl.
            try:
                # First, try with 'ip'
                utils.execute('ip', 'tuntap', 'add', dev, 'mode', 'tap',
                          run_as_root=True)
            except exception.ProcessExecutionError:
                # Second option: tunctl
                utils.execute('tunctl', '-b', '-t', dev, run_as_root=True)
                utils.execute('ip', 'link', 'set', dev, 'up', run_as_root=True)

to:
        if not linux_net._device_exists(dev):
            # Older version of the command 'ip' from the iproute2 package
            # don't have support for the tuntap option (lp:882568). If it
            # turns out we're on an old version we work around this by using
            # tunctl.
            try:
                # First, try with 'ip'
                utils.execute('ip', 'tuntap', 'add', dev, 'mode', 'tap',
                          run_as_root=True)
            except exception.ProcessExecutionError:
                # Second option: tunctl
                utils.execute('tunctl', '-b', '-t', dev, run_as_root=True)
        utils.execute('ip', 'link', 'set', dev, 'up', run_as_root=True)

This change seems to fix the problem on RHEL 6.3: the tap interface is in the UP state after a reboot.

Making that change did not affect either the Ubuntu or Fedora systems. I suspect that their packaging systems appearing to track different points of the OpenStack release will have some effect. All three of these systems are virtual machines that I run using KVM on a Fedora workstation, which allows for quick comparison between them. I also had to set start_guests_on_host_boot=false on the RHEL system; it was causing very bizarre behaviour, which I will be documenting next.

That change did fix the RHEL system.

Do you need anything else? I may try other VIF drivers if I get a chance. If you think that is essential for this bug, let me know and I will give it some priority.

Phil

Thierry Carrez (ttx)
Changed in nova:
importance: Undecided → High
Revision history for this message
Akihiro Motoki (amotoki) wrote :

This issue still exists in nova master (after libvirt-vif-driver refactoring)
https://github.com/openstack/nova/blob/3fd1c63e37436eaf4621df62f112ae1886d238cc/nova/network/linux_net.py#L1170

As Phil's testing showed, the fix is to bring the tap interface up even if it already exists.
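A minimal, hypothetical sketch of that fix (ensure_tap_up, device_exists, and execute are illustrative stand-ins for nova's create-and-plug path, linux_net._device_exists, and utils.execute; this is not nova's actual code):

```python
# Sketch of the fix under discussion: the link-up call must run
# unconditionally, not only in the branch that creates the tap device.

def ensure_tap_up(dev, device_exists, execute):
    """Create the tap device if it is missing, then always bring it up.

    device_exists(dev) -> bool and execute(*cmd) are injected stand-ins
    for linux_net._device_exists and utils.execute.
    """
    if not device_exists(dev):
        try:
            # Newer iproute2 can create the tap directly.
            execute('ip', 'tuntap', 'add', dev, 'mode', 'tap')
        except OSError:
            # Older 'ip' lacks the tuntap option (lp:882568); fall
            # back to tunctl.
            execute('tunctl', '-b', '-t', dev)
    # The fix: this runs outside the if-check, so a tap that survived
    # the reboot in the DOWN state still gets linked up.
    execute('ip', 'link', 'set', dev, 'up')
```

The key design point is simply that the final `ip link set ... up` sits outside the existence check, matching the diff Phil applied above.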

Revision history for this message
Gary Kotton (garyk) wrote :

Hi,
I think that there are two problems here, and we have addressed them both:
1. The first is that when the host was rebooting, the OVS tap devices were being saved by OVS. We introduced the quantum-ovs-cleanup utility; when it is invoked on reboot, it enables the DHCP agent to receive the necessary IP address.
2. The resync interval of the DHCP agent was 30 seconds (bug 1128180). After a reboot it could take up to 2 minutes for the tap device to get an IP address. This too has been addressed upstream and in stable Folsom.
I think that the above-mentioned problems have been addressed. We just need to make sure that they are included in the latest stable Folsom packages.
Thanks
Gary

Revision history for this message
YunQiang Su (wzssyqa) wrote :

We just upgraded our quantum version to 2012.2.3 with a custom-built package based on the cloud archive, but we're still seeing the issues described by Phil Hopkins on Ubuntu 12.04.

After a reboot the instance is not able to get an IP, but if we launch a new instance after the reboot and then reboot the instance that didn't get an IP, it is able to get its IP.

If we execute the following commands, the instances are also able to get an IP again:
ip netns exec qdhcp-338a57f5-aa60-4b3e-b519-0683d26467e9 bash
ip link set tap98eb6fb8-e4 up
service openvswitch-switch restart

Changing the following flag also fixed the issue:
libvirt_vif_driver=nova.virt.libvirt.vif.LibvirtHybridOVSBridgeDriver
to
libvirt_vif_driver=nova.virt.libvirt.vif.LibvirtOpenVswitchVirtualPortDriver

So I think this issue is not solved in the latest stable Folsom packages.

Revision history for this message
YunQiang Su (wzssyqa) wrote :

Edit/correction: changing the flag to nova.virt.libvirt.vif.LibvirtOpenVswitchVirtualPortDriver does not fix the problem. Sorry for the confusion.

Revision history for this message
Gary Kotton (garyk) wrote :

Hi,
Can you please check if the quantum-ovs-cleanup script is running on boot?
Thanks
Gary

Revision history for this message
YunQiang Su (wzssyqa) wrote :

Ohhh.

Why add such a binary and make us call it manually?
Why not call it directly from the quantum-ovs service?

Revision history for this message
Gary Kotton (garyk) wrote :

There are a number of reasons:
1. Some plugins make use of openvswitch but do not use the openvswitch-agent.
2. It complicates the boot process to have this in the agent: if the agent restarts, we would need to know whether or not to invoke it, and you would not want it to delete a tap device of the DHCP agent.
Hence we added the binary, which enables the packages to run this prior to all other quantum services, and the user to run it if and when they choose.
We need to try and ensure that it is added to the Ubuntu startup scripts.
Thanks
Gary

Revision history for this message
YunQiang Su (wzssyqa) wrote :

I ran quantum-ovs-cleanup in the upstart configuration of quantum-plugin-openvswitch-agent, at either pre-start or post-start.

Neither of them works.

Revision history for this message
YunQiang Su (wzssyqa) wrote :

This is my current workaround

1. /etc/init/{quantum-dhcp-agent,quantum-l3-agent}.conf
replace
start on runlevel [2345]
with
start on starting nova-compute

2. edit /etc/init/quantum-plugin-openvswitch-agent.conf to

start on starting nova-compute
stop on stopped openvswitch-switch

chdir /var/run

pre-start script
        mkdir -p /var/run/quantum
        chown quantum:root /var/run/quantum
        service openvswitch-switch restart
        quantum-ovs-cleanup
end script

exec start-stop-daemon --start --chuid quantum --exec /usr/bin/quantum-openvswitch-agent -- --config-file=/etc/quantum/quantum.conf --config-file=/etc/quantum/plugins/openvswitch/ovs_quantum_plugin.ini --log-file=/var/log/quantum/openvswitch-agent.log

post-start script
        service openvswitch-switch restart
end script

Revision history for this message
YunQiang Su (wzssyqa) wrote :

With the cleanup, the suspend function becomes unusable.

Revision history for this message
Gary Kotton (garyk) wrote :

Sorry, I do not understand the part about the suspend function. Can you please clarify?
Thanks
Gary

Revision history for this message
YunQiang Su (wzssyqa) wrote :

It is not caused by the cleanup; it is a bug in quantum itself.

When suspend is used, the instance cannot wake up again.

dan wendlandt (danwent)
Changed in quantum:
milestone: none → grizzly-rc1
dan wendlandt (danwent)
summary: - Tap interface does not automatically get an IP address upon a reboot
+ Tap interface does not automatically get an IP address upon a hypervisor
+ reboot
Revision history for this message
dan wendlandt (danwent) wrote :

Ok, we need to figure out what to do with this bug.

My understanding is that when a hypervisor (or a network node?) is rebooted, in some cases, devices do not seem to get IPs.

When I worked with Phil on this thread earlier, it seems like at least part of the problem was that we were only if-up'ing a device if it also needed to be added to ovs. He said that doing the if-up outside of the check if the device already exists helped on RHEL 6.3, but not on Ubuntu. I'm now confused about why that would help at all though, as tap devices should not persist across a reboot of the physical box (I had originally thought this bug was about the reboot or suspend of a VM).

I suspect that garyk is correct that a combination of the resync interval changing and the quantum-cleanup script are a viable explanation. If anyone is able to still repro this, please update this bug.

Changed in quantum:
milestone: grizzly-rc1 → none
no longer affects: nova
Changed in quantum:
status: Confirmed → Incomplete
importance: High → Medium
tags: added: ovs
Revision history for this message
chetandiwani (chetandiwani) wrote :

I have consolidated a multi-node setup into a single node and was facing the same problem: when the physical node was rebooted, the guest was not able to get an IP address.

Setup Details : Ubuntu 12.04.2 LTS : Quantum Version
ii python-quantum 1:2013.1.2-0ubuntu1~cloud0 Quantum is a virtual network service for Openstack - Python library
ii python-quantumclient 1:2.2.0-0ubuntu1~cloud0 client - Quantum is a virtual network service for Openstack
ii quantum-common 1:2013.1.2-0ubuntu1~cloud0 Quantum is a virtual network service for Openstack - common
ii quantum-dhcp-agent 1:2013.1.2-0ubuntu1~cloud0 Quantum is a virtual network service for Openstack - DHCP agent
ii quantum-l3-agent 1:2013.1.2-0ubuntu1~cloud0 Quantum is a virtual network service for Openstack - l3 agent
ii quantum-metadata-agent 1:2013.1.2-0ubuntu1~cloud0 Quantum is a virtual network service for Openstack - metadata agent
ii quantum-plugin-openvswitch 1:2013.1.2-0ubuntu1~cloud0 Quantum is a virtual network service for Openstack - Open vSwitch plugin
ii quantum-plugin-openvswitch-agent 1:2013.1.2-0ubuntu1~cloud0 Quantum is a virtual network service for Openstack - Open vSwitch plugin agent
ii quantum-server 1:2013.1.2-0ubuntu1~cloud0 Quantum is a virtual network service for Openstack - server

For me, putting "quantum-ovs-cleanup -v &> /root/cleanupovs.log" in /etc/rc.local allows the VM to get the IP address.
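As a boot-configuration sketch, that rc.local approach amounts to the following (note that rc.local typically runs under /bin/sh, so the portable "> file 2>&1" redirection is used here instead of bash's "&>"; paths are as in the comment above):

```shell
#!/bin/sh -e
# /etc/rc.local: run the OVS cleanup once at boot, before instances need
# DHCP, keeping its verbose output for later inspection.
quantum-ovs-cleanup -v > /root/cleanupovs.log 2>&1
exit 0
```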

Revision history for this message
Marios Andreou (marios-b) wrote :

Is this still a reproducible bug? From the discussion it seems it may be fixed now. Can we mark this bug as done?

Revision history for this message
Ryan Moats (rmoats) wrote :

Marking as Invalid since the bug was Incomplete and hasn't been updated in over a year.

Changed in neutron:
status: Incomplete → Invalid