IP address isn't assigned to a subnet gateway interface in some cases.

Bug #1750563 reported by Anton Kremenetsky on 2018-02-20
This bug affects 1 person
Affects: neutron
Importance: High
Assigned to: Unassigned

Bug Description

Hello everyone,

It seems I have caught a race condition bug in neutron-l3-agent.
We have automated tests. One of them performs the following scenario: it creates several resources (network, subnet, and so on), then connects the subnet to a router and does other things unrelated to this bug. The test runs in a loop with different parameters, but the parameters for the Neutron resources stay the same: the test always creates a subnet with the CIDR 192.168.0.0/24, and the subnet gateway interface gets the IP address 192.168.0.1. The bug occurs at the moment the subnet is connected to the router. Note that it is intermittent: sometimes it happens and sometimes it does not.
The symptom is that the instances (VMs) cannot be reached through their floating IPs; it is not even possible to ping them. After some debugging, it turned out that the subnet gateway interface sometimes does not get an IP address. For example, when the bug occurs the interface looks like this:

root@network-N6-rmfqne:/var/log/neutron# sudo /usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ip netns exec qrouter-a35f384b-549e-41c6-8076-2283be384e1b ip a
...
389: qr-7dc17e0a-97@if786: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether fa:16:3e:1e:47:78 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::f816:3eff:fe1e:4778/64 scope link
       valid_lft forever preferred_lft forever

In the successful case it looks like this:

root@network-N6-rmfqne:/var/log/neutron# sudo /usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ip netns exec qrouter-a35f384b-549e-41c6-8076-2283be384e1b ip a
...
393: qr-cccf794e-86@if794: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether fa:16:3e:54:0c:11 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.0.1/24 scope global qr-cccf794e-86
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe54:c11/64 scope link
       valid_lft forever preferred_lft forever
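
The check above can be automated. The following is only a minimal sketch, assuming the output of "ip a" has already been captured as text (for example from the ip netns exec command shown above); the function name qr_interfaces_without_ipv4 and the two sample strings are made up for illustration and are not part of Neutron.

```python
import re

# Header line of an interface block, e.g.
# "389: qr-7dc17e0a-97@if786: <BROADCAST,...> mtu 1450 ..."
IFACE_RE = re.compile(r"^\d+:\s+([^:@\s]+)")
# An IPv4 address line; "inet6" does not match because of the \s+ after "inet".
INET_RE = re.compile(r"^\s+inet\s+(\S+)")


def qr_interfaces_without_ipv4(ip_a_output):
    """Return names of qr-* interfaces that carry no IPv4 ("inet") address."""
    has_ipv4 = {}
    current = None
    for line in ip_a_output.splitlines():
        header = IFACE_RE.match(line)
        if header:
            current = header.group(1)
            has_ipv4[current] = False
        elif current is not None and INET_RE.match(line):
            has_ipv4[current] = True
    return [name for name, ok in has_ipv4.items()
            if name.startswith("qr-") and not ok]


# Sample inputs copied from the two outputs in this report.
FAILED_CASE = """\
389: qr-7dc17e0a-97@if786: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether fa:16:3e:1e:47:78 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::f816:3eff:fe1e:4778/64 scope link
       valid_lft forever preferred_lft forever
"""

SUCCESS_CASE = """\
393: qr-cccf794e-86@if794: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether fa:16:3e:54:0c:11 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.0.1/24 scope global qr-cccf794e-86
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe54:c11/64 scope link
       valid_lft forever preferred_lft forever
"""

print(qr_interfaces_without_ipv4(FAILED_CASE))
print(qr_interfaces_without_ipv4(SUCCESS_CASE))
```

A test loop could call such a helper right after attaching the subnet to the router and report the affected interface name before the floating IP check fails.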

We are using Juju to deploy OpenStack. The neutron-gateway version is 9.4.1, and the charm version is 244.


More debugging: I slightly modified a couple of functions in neutron/agent/linux/keepalived.py, purely for debugging purposes.

def add_vip(self, ip_cidr, interface_name, scope):
    vip = KeepalivedVipAddress(ip_cidr, interface_name, scope)
    if vip not in self.vips:
        self.vips.append(vip)
        LOG.debug('VIP %s was added', vip)
    else:
        LOG.debug('VIP %s already present in %s', vip, self.vips)

def remove_vips_vroutes_by_interface(self, interface_name):
    LOG.debug('remove_vips_vroutes_by_interface: %s', interface_name)
    self.vips = [vip for vip in self.vips
                 if vip.interface_name != interface_name]

    self.virtual_routes.remove_routes_on_interface(interface_name)

def remove_vip_by_ip_address(self, ip_address):
    LOG.debug('remove_vip_by_ip_address: %s', ip_address)
    self.vips = [vip for vip in self.vips
                 if vip.ip_address != ip_address]

The original code can be found here: https://github.com/openstack/neutron/blob/master/neutron/agent/l3/ha_router.py#L213

I got the following logs. Unimportant parts were deleted for readability, and a comment was added before each line.

# Create a gateway interface for the subnet.
2018-02-16 10:04:37.152 14476 DEBUG neutron.agent.linux.keepalived [req-f3c3106a-d72b-4eb4-b9ba-3ba6ba3b35fd - 0d042481da874d229f542e35ef7ac589 - - -] VIP [fe80::f816:3eff:fe8c:7ef8/64, qr-8c8b4c01-40, link] was added add_vip /usr/lib/python2.7/dist-packages/neutron/agent/linux/keepalived.py:206
# Assign an IP address to the interface
2018-02-16 10:04:37.153 14476 DEBUG neutron.agent.linux.keepalived [req-f3c3106a-d72b-4eb4-b9ba-3ba6ba3b35fd - 0d042481da874d229f542e35ef7ac589 - - -] VIP [192.168.0.1/24, qr-8c8b4c01-40, None] was added add_vip /usr/lib/python2.7/dist-packages/neutron/agent/linux/keepalived.py:206
# Update keepalived configuration
2018-02-16 10:04:40.970 14476 DEBUG neutron.agent.linux.keepalived [req-f3c3106a-d72b-4eb4-b9ba-3ba6ba3b35fd - 0d042481da874d229f542e35ef7ac589 - - -] Keepalived spawned with config /var/lib/neutron/ha_confs/a35f384b-549e-41c6-8076-2283be384e1b/keepalived.conf spawn /usr/lib/python2.7/dist-packages/neutron/agent/linux/keepalived.py:447
...
# Create a new gateway interface for the subnet.
2018-02-16 10:06:11.728 14476 DEBUG neutron.agent.linux.keepalived [req-7ae4a777-fa22-42e1-b9a4-726d75ff3620 - 0d042481da874d229f542e35ef7ac589 - - -] VIP [fe80::f816:3eff:fe1e:4778/64, qr-7dc17e0a-97, link] was added add_vip /usr/lib/python2.7/dist-packages/neutron/agent/linux/keepalived.py:206
# Tries to assign an address but this operation fails as the address (192.168.0.1/24) is already in use.
2018-02-16 10:06:11.729 14476 DEBUG neutron.agent.linux.keepalived [req-7ae4a777-fa22-42e1-b9a4-726d75ff3620 - 0d042481da874d229f542e35ef7ac589 - - -] VIP [192.168.0.1/24, qr-7dc17e0a-97, None] already present in [<neutron.agent.linux.keepalived.KeepalivedVipAddress object at 0x7f6d5750cd90>, <neutron.agent.linux.keepalived.KeepalivedVipAddress object at 0x7f6d571ce7d0>, <neutron.agent.linux.keepalived.KeepalivedVipAddress object at 0x7f6d5722dc50>, <neutron.agent.linux.keepalived.KeepalivedVipAddress object at 0x7f6d57235490>, <neutron.agent.linux.ke...


tags: added: l3-ha
Changed in neutron:
status: New → Confirmed
importance: Undecided → High
Brian Haley (brian-haley) wrote :

I'm a little confused by this bug.

In your debug above, we are adding both IPv6 and IPv4 subnets to the router, on two different interfaces - qr-8c8b4c01-40 and qr-7dc17e0a-97. The IPv6 (first vip) of each succeeds since the link-local is based on the MAC address. The IPv4 (second) of each fails because it matches an existing vip - this is because of this in the KeepalivedVipAddress() class:

    def __eq__(self, other):
        return (isinstance(other, KeepalivedVipAddress) and
                self.ip_address == other.ip_address)

Only the address is checked for uniqueness and not the interface or scope.
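
To make the effect of that comparison concrete, here is a small self-contained sketch. The Vip class and the two helper functions are hypothetical stand-ins written for this comment, not the real KeepalivedVipAddress or manager code; they only mirror the address-only __eq__ quoted above and the "if vip not in self.vips" check from the debug patch. That a stale entry for the old interface is what blocks the add in the reporter's environment is an assumption drawn from the posted log, not something confirmed here.

```python
class Vip:
    """Hypothetical stand-in for KeepalivedVipAddress (not the real class)."""

    def __init__(self, ip_address, interface_name, scope=None):
        self.ip_address = ip_address
        self.interface_name = interface_name
        self.scope = scope

    def __eq__(self, other):
        # Same rule as quoted above: only the address is compared,
        # the interface name and scope are ignored.
        return isinstance(other, Vip) and self.ip_address == other.ip_address


def add_vip(vips, vip):
    """Append vip unless an equal one is present; return True if it was added."""
    if vip not in vips:  # list membership relies on __eq__
        vips.append(vip)
        return True
    return False


def remove_vips_by_interface(vips, interface_name):
    """Return a new list without the VIPs bound to interface_name."""
    return [v for v in vips if v.interface_name != interface_name]


vips = []
first_added = add_vip(vips, Vip("192.168.0.1/24", "qr-8c8b4c01-40"))

# The entry for the old interface is still in the list, so the same
# address on the new interface is treated as a duplicate and skipped.
second_added = add_vip(vips, Vip("192.168.0.1/24", "qr-7dc17e0a-97"))

# Once the entry for the old interface is removed, the same add succeeds.
vips_clean = remove_vips_by_interface(vips, "qr-8c8b4c01-40")
third_added = add_vip(vips_clean, Vip("192.168.0.1/24", "qr-7dc17e0a-97"))

print(first_added, second_added, third_added)
```

In this sketch the outcome depends purely on whether the removal for the old interface runs before the add for the new one, which would be consistent with the intermittent behaviour described in the report.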

But checking the interface name in addition doesn't seem correct, since you could then have two interfaces with the same IP. Trying to do this manually on a non-HA router seems to blow up early with an overlapping address error from the API.

Is it possible to create a list of commands we can run from the client to reproduce this? If we really are trying to add a subnet with the same cidr to two router interfaces then that might be a bug on the server not detecting the overlap.

Hello Brian,

The logs are in the attachments. Please take a look at them.

My comments regarding the logs: as I mentioned before, we have automated tests. These tests use the Python API (Python clients) to interact with OpenStack, so the list of commands is a list of Python API calls; it is in the python_api_calls.log file. I also specified there the versions of the Python clients we are using. The other files:
juju_status.log - Output of the "juju status" commands.
neutron-l3-agent_network-gateway-0.log - The log of neutron-l3-agent on the first network-gateway node.
neutron-l3-agent_network-gateway-1.log - The log of neutron-l3-agent on the second network-gateway node.
neutron-openvswitch-agent_network-gateway-0.log - The log of neutron-openvswitch-agent on the first network-gateway node.
neutron-openvswitch-agent_network-gateway-1.log - The log of neutron-openvswitch-agent on the second network-gateway node.

Let me know if you have any questions.
