Neutron starts radvd and messes up the routing table when ipv6_ra_mode is not set and ipv6_address_mode=slaac

Bug #1888256 reported by Peter
This bug affects 2 people
Affects: neutron
Status: Invalid
Importance: Medium
Assigned to: Slawek Kaplonski

Bug Description

Hello!

I would like to report a possible bug.
We are currently running Rocky on Ubuntu 18.04.
We use custom Ansible for deployment.

We have a setup where the upstream core Cisco Nexus DC switches send the RAs. This works fine with a network that we have had for years (upgraded from Kilo).

Now we have built a new region, with new network nodes etc., and IPv6 does not work there the way it does in the old region.

In the new region, we have this subnet:

[PROD][root(cc1:0)] <~> openstack subnet show Flat1-subnet-v6
+-------------------+------------------------------------------------------+
| Field | Value |
+-------------------+------------------------------------------------------+
| allocation_pools | 2001:738:0:527::2-2001:738:0:527:ffff:ffff:ffff:ffff |
| cidr | 2001:738:0:527::/64 |
| created_at | 2020-07-01T22:59:53Z |
| description | |
| dns_nameservers | |
| enable_dhcp | True |
| gateway_ip | 2001:738:0:527::1 |
| host_routes | |
| id | a5a9991c-62f3-4f46-b1ef-e293dc0fb781 |
| ip_version | 6 |
| ipv6_address_mode | slaac |
| ipv6_ra_mode | None |
| name | Flat1-subnet-v6 |
| network_id | fa55bfc7-ab42-4d97-987e-645cca7a0601 |
| project_id | b48a9319a66e45f3b04cc8bb70e3113c |
| revision_number | 0 |
| segment_id | None |
| service_types | |
| subnetpool_id | None |
| tags | |
| updated_at | 2020-07-01T22:59:53Z |
+-------------------+------------------------------------------------------+

As you can see, the address mode is SLAAC, the RA mode is: None.

Checking on the network node, we see this in the qrouter namespace:

[PROD][root(net1:0)] </home/ocadmin> ip netns exec qrouter-4ffa4f55-95aa-4ce1-b4f8-8bbb2f9d53e1 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
35: ha-5dfb8647-f7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:1c:4d:8d brd ff:ff:ff:ff:ff:ff
    inet 169.254.192.3/18 brd 169.254.255.255 scope global ha-5dfb8647-f7
       valid_lft forever preferred_lft forever
    inet 169.254.0.162/24 scope global ha-5dfb8647-f7
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe1c:4d8d/64 scope link
       valid_lft forever preferred_lft forever
36: qr-a6d7ceab-80: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:a1:7e:69 brd ff:ff:ff:ff:ff:ff
    inet 193.224.218.251/24 scope global qr-a6d7ceab-80
       valid_lft forever preferred_lft forever
    inet6 2001:738:0:527:f816:3eff:fea1:7e69/64 scope global nodad
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fea1:7e69/64 scope link nodad
       valid_lft forever preferred_lft forever

If I check the running processes on our net1 node, I get this:

[PROD][root(net1:0)] </home/ocadmin> ps aux |grep radvd |grep 4ffa4f55-95aa-4ce1-b4f8-8bbb2f9d53e1
neutron 32540 0.0 0.0 19604 2372 ? Ss júl02 0:05 radvd -C /var/lib/neutron/ra/4ffa4f55-95aa-4ce1-b4f8-8bbb2f9d53e1.radvd.conf -p /var/lib/neutron/external/pids/4ffa4f55-95aa-4ce1-b4f8-8bbb2f9d53e1.pid.radvd -m syslog -u neutron

The specific radvd config:
[PROD][root(net1:0)] </home/ocadmin> cat /var/lib/neutron/ra/4ffa4f55-95aa-4ce1-b4f8-8bbb2f9d53e1.radvd.conf
interface qr-a6d7ceab-80
{
   AdvSendAdvert on;
   MinRtrAdvInterval 30;
   MaxRtrAdvInterval 100;
   AdvLinkMTU 1500;
};

If I spin up an instance, I see this:

debian@test:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether fa:16:3e:71:ca:8d brd ff:ff:ff:ff:ff:ff
    inet 193.224.218.9/24 brd 193.224.218.255 scope global dynamic eth0
       valid_lft 86353sec preferred_lft 86353sec
    inet6 2001:738:0:527:f816:3eff:fe71:ca8d/64 scope global dynamic mngtmpaddr
       valid_lft 2591994sec preferred_lft 604794sec
    inet6 fe80::f816:3eff:fe71:ca8d/64 scope link
       valid_lft forever preferred_lft forever
debian@test:~$ ip -6 route
::1 dev lo proto kernel metric 256 pref medium
2001:738:0:527::/64 dev eth0 proto kernel metric 256 expires 2591990sec pref medium
fe80::/64 dev eth0 proto kernel metric 256 pref medium
default via fe80::f816:3eff:fea1:7e69 dev eth0 proto ra metric 1024 expires 251sec hoplimit 64 pref medium
default via fe80::5:73ff:fea0:2cf dev eth0 proto ra metric 1024 expires 1790sec hoplimit 64 pref medium

As you can see, I've got two default routes, and the upper one is not meant to be there.
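
To spot this condition on many instances, the routing table can be checked mechanically. The following is only a minimal Python sketch, not part of any OpenStack tooling: the helper name ra_default_gateways is made up for illustration, and the sample input is the `ip -6 route` output of the test instance in this report.

```python
def ra_default_gateways(ip6_route_output):
    """Return the next hops of all default routes learned via RA."""
    gateways = []
    for line in ip6_route_output.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] != "default":
            continue
        # After "default", the line is a sequence of "key value" pairs:
        #   via <next-hop> dev <iface> proto ra metric 1024 ...
        attrs = dict(zip(fields[1::2], fields[2::2]))
        if attrs.get("proto") == "ra" and "via" in attrs:
            gateways.append(attrs["via"])
    return gateways


SAMPLE = """\
::1 dev lo proto kernel metric 256 pref medium
2001:738:0:527::/64 dev eth0 proto kernel metric 256 expires 2591990sec pref medium
fe80::/64 dev eth0 proto kernel metric 256 pref medium
default via fe80::f816:3eff:fea1:7e69 dev eth0 proto ra metric 1024 expires 251sec hoplimit 64 pref medium
default via fe80::5:73ff:fea0:2cf dev eth0 proto ra metric 1024 expires 1790sec hoplimit 64 pref medium
"""

gateways = ra_default_gateways(SAMPLE)
if len(gateways) > 1:
    print("more than one RA default gateway:", ", ".join(gateways))
```

More than one entry in the result means that more than one router on the link is sending RAs with a non-zero router lifetime.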

Could you point out what I missed, or is there some kind of bug that causes this?

Thanks:
 Peter ERDOSI (Fazy)

Tags: ipv6
Brian Haley (brian-haley) wrote :

What type of network is this? Is it marked external? I'm just wondering if a neutron router is supposed to be attached at all.

It could also be that a bug was introduced since I'm not sure this specific case is tested.

tags: removed: ra-mode
Peter (fazy) wrote :

This is a shared, but not external router (so no NAT-ing). We allocate public v4/v6 addresses to instances from it.
Actually, we use it as a "VPS" network (it's a network of the admin user anyway).

So we have routers here, because we use v4 DHCP and metadata from it.

To be more specific:
[PROD][root(cc1:0)] <~> openstack network show Flat1
+---------------------------+----------------------------------------------------------------------------+
| Field | Value |
+---------------------------+----------------------------------------------------------------------------+
| admin_state_up | UP |
| availability_zone_hints | |
| availability_zones | nova |
| created_at | 2020-07-01T22:59:47Z |
| description | |
| dns_domain | None |
| id | fa55bfc7-ab42-4d97-987e-645cca7a0601 |
| ipv4_address_scope | None |
| ipv6_address_scope | None |
| is_default | None |
| is_vlan_transparent | None |
| mtu | 1500 |
| name | Flat1 |
| port_security_enabled | True |
| project_id | b48a9319a66e45f3b04cc8bb70e3113c |
| provider:network_type | vlan |
| provider:physical_network | vltrunk |
| provider:segmentation_id | 719 |
| qos_policy_id | None |
| revision_number | 3 |
| router:external | Internal |
| segments | None |
| shared | True |
| status | ACTIVE ...


Brian Haley (brian-haley) wrote :

Sorry for the slow response.

I have more questions as I'm trying to understand the use case.

1) You have a shared provider network, but instead of being external it's internal.

2) You allocated public v4/v6 addresses to instances on it, so don't require NAT.

3) What purpose does the neutron router perform? Is it routing between subnets on this network, or multiple shared internal networks? It almost doesn't seem like you need the neutron router.

When I tried to recreate this using a subnet I created with ra-mode=None/address-mode=slaac, and then adding an interface to a router I get:

Error: Failed to add interface: Bad router request: IPv6 subnet 6c7c4a89-15fd-4627-b30f-92f306c8e11f configured to receive RAs from an external router cannot be added to Neutron Router.. Neutron server returns request_ids: ['req-ef172dc4-b5ea-4edc-84f1-53173287b4bc']

I could successfully add interfaces for IPv6 subnets with None/None for the modes, and I don't see radvd advertising a prefix, so I'm not sure how you did this yet.

Peter (fazy) wrote :

1) Yes. We want to have a "simple" or, as we call it, "Flat" network, which can be used by all projects and gives out public v4/v6 addresses (next to our "Smart" networks, where dual qrouters, floating IPs, VPNaaS etc. are available, but no IPv6 for now).
Since the project administrators cannot manage ports, they get a new IP with each new instance.

2) Yes.

3) We use the router only for metadata in our "Flat" networks (I cannot remember why, but we have used this method since Kilo).
As you can see, "Flat1-subnet-v4" has a static route (destination='169.254.169.254/32', gateway='193.224.218.251'), where 193.224.218.251 is the floating IP of the qrouter.

Maybe I misunderstand you, but should I try with None/None?

I thought that address-mode=slaac makes Neutron allocate the proper address (which is calculated from the MAC), and that this is a must for the iptables rules behind the qrouters.
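
For reference, the "calculated from the MAC" part is the modified EUI-64 scheme. A minimal Python sketch of it, using only the standard library (the function name slaac_address is made up for illustration, and this is not the Neutron implementation):

```python
import ipaddress


def slaac_address(prefix, mac):
    """Build the modified EUI-64 SLAAC address for a MAC inside a /64 prefix."""
    network = ipaddress.IPv6Network(prefix)
    if network.prefixlen != 64:
        raise ValueError("EUI-64 based SLAAC needs a /64 prefix")
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    # Flip the universal/local bit of the first octet ...
    octets[0] ^= 0x02
    # ... and insert ff:fe in the middle to get a 64-bit interface ID.
    eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]
    interface_id = int.from_bytes(bytes(eui64), "big")
    return ipaddress.IPv6Address(int(network.network_address) | interface_id)


# MAC of the test instance and the Flat1 IPv6 prefix from this report.
print(slaac_address("2001:738:0:527::/64", "fa:16:3e:71:ca:8d"))
# 2001:738:0:527:f816:3eff:fe71:ca8d
```

The result matches the address the instance configured on eth0, and the same calculation on the qr- interface MAC (fa:16:3e:a1:7e:69) gives the address seen in the qrouter namespace.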

However, I used the documentation (https://docs.openstack.org/neutron/rocky/admin/config-ipv6.html), and it clearly says for RA mode=none and address mode=slaac: "Guest instance obtains IPv6 address from non-OpenStack router using SLAAC."

For some reason, the radvd process is spawned with the qrouter in this configuration, and I cannot really understand why.

Another odd thing is that our RegionOne, which was upgraded from Kilo to Rocky over the past few years, works fine with the old networks (our Flat1 and Flat2, which were created in Kilo). Only our new "Flat3" network in RegionOne and the two new "Flat" networks in RegionTwo behave this way.

Brian Haley (brian-haley) wrote :

I don't think you should attach the router, in which case the dhcp-agent should add a route in its reply to use it for metadata. I don't think you'll need to change dhcp_agent.ini, but if this fails you can set enable_isolated_metadata=True.

As far as the docs go, I thought they were correct, but this bug is making me think we need to revisit them to double-check.

Peter (fazy) wrote :

If I remember well, the router-based metadata was chosen because of high availability considerations.
As I mentioned, we started with Kilo in 2015.
Our design consists of two dedicated network nodes, which run the neutron-{ovs,l3,dhcp,metadata} agents.

And... I may not remember the next part well (or it's just not true now, and may never have been :) )

So, DHCP high availability is provided by running two independent DHCP agents (one qdhcp per network on each network node, with different IP addresses).

The router (one router per network on each network node), however, has only one, but floating, IP address with VRRP.

With DHCP-based metadata, we were (at least back in Kilo) not able to hand both qdhcp instances to the guests with double static routes, therefore HA was not guaranteed (or we missed something back then).

With the router-based metadata, we added one static route with the floating IP of the "side" qrouter pair, so if one of our network nodes was restarted/died/etc., the other one got the floating IP via keepalived, and the metadata worked just fine.

Since our setup runs in production, this kind of change (qrouter to qdhcp metadata) is hard, because the config change will have an impact on all of our networks (but it is not impossible).

That's the reason why I want to figure out our qrouter - radvd problem.

Maybe our unsuccessful attempts with qdhcp metadata were caused by a bug [*] and it just works fine with Rocky, or we missed something in the configuration.

[*] https://bugzilla.redhat.com/show_bug.cgi?id=1256816

Back to the radvd problem:
I tried to understand the neutron code, and the radvd behaviour, and I may have a guess...
Neutron relevant code (in master, it's the same): https://github.com/openstack/neutron/blob/8c80267bb6699c86e10aade13c54b715e1eae1bf/neutron/agent/linux/ra.py

I have a few thoughts:

1)
AdvSendAdvert on|off
A flag indicating whether or not the router sends periodic router advertisements and responds to router solicitations.

This option no longer has to be specified first, but it needs to be on to enable advertisement on this interface.

If I'm right, this option makes radvd "work", and because my ra_mode is none, no prefix specification is generated in radvd.conf, and radvd possibly listens and advertises on all interfaces that have an IPv6 address.
Since this parameter is hardcoded in the Jinja template, it cannot be avoided.

2) _spawn_radvd(self, radvd_conf) function.
There is no condition that checks the ra_mode, so even if I set it to none, the process will be spawned.
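
The kind of condition meant in 2) could look roughly like the sketch below. This is an illustration only: needs_radvd and the plain subnet dicts are assumptions made for this example and are not taken from ra.py.

```python
def needs_radvd(subnets):
    """Return True only if at least one IPv6 subnet has an RA mode set."""
    return any(
        subnet.get("ip_version") == 6
        and subnet.get("ipv6_ra_mode") is not None
        for subnet in subnets
    )


# Subnets behind the qr- interface in this report: RA mode is not set
# on the v6 subnet, so no radvd process would be wanted for it.
flat1 = [
    {"name": "Flat1-subnet-v4", "ip_version": 4},
    {"name": "Flat1-subnet-v6", "ip_version": 6,
     "ipv6_ra_mode": None, "ipv6_address_mode": "slaac"},
]

if needs_radvd(flat1):
    print("spawn radvd")
else:
    print("skip radvd: no subnet asks Neutron to send RAs")
```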

Peter (fazy) wrote :

I've just made a test setup to clarify the radvd configuration.
My setup is built on Debian 10 (we use Ubuntu for OpenStack, but the behaviour is probably the same) with two virtual machines:
 - v6rtr
 - snet1

The v6rtr has two NICs:

2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether 56:6f:87:b2:00:0c brd ff:ff:ff:ff:ff:ff
    inet 193.224.186.112/25 brd 193.224.186.127 scope global dynamic ens3
       valid_lft 30993sec preferred_lft 30993sec
    inet6 2001:738:0:534:546f:87ff:feb2:c/64 scope global dynamic mngtmpaddr
       valid_lft 2591720sec preferred_lft 604520sec
    inet6 fe80::546f:87ff:feb2:c/64 scope link
       valid_lft forever preferred_lft forever
3: ens8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 56:6f:87:b2:00:11 brd ff:ff:ff:ff:ff:ff
    inet 10.1.0.1/24 brd 10.1.0.255 scope global ens8
       valid_lft forever preferred_lft forever
    inet6 2001:738:0:51d::/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::546f:87ff:feb2:11/64 scope link
       valid_lft forever preferred_lft forever

I've made a configuration based on the qrouter radvd config in question (I only changed the interface name and removed some \n-s).

It looks like this:
interface ens8
{
   AdvSendAdvert on;
   MinRtrAdvInterval 30;
   MaxRtrAdvInterval 100;
   AdvLinkMTU 1500;
};

The snet1 machine is in the same L2 VLAN, of course.
The result on snet1:

[fazy(snet1:0)] <~> ip a
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 56:6f:87:b2:00:0e brd ff:ff:ff:ff:ff:ff
    inet 10.1.0.10/24 brd 10.1.0.255 scope global dynamic ens3
       valid_lft 14052sec preferred_lft 14052sec
    inet6 fe80::546f:87ff:feb2:e/64 scope link
       valid_lft forever preferred_lft forever

[fazy(snet1:0)] <~> ip -6 route
::1 dev lo proto kernel metric 256 pref medium
fe80::/64 dev ens3 proto kernel metric 256 pref medium
default via fe80::546f:87ff:feb2:11 dev ens3 proto ra metric 1024 expires 247sec hoplimit 64 pref medium

As you can see, fe80::546f:87ff:feb2:11/64 (which is the LL address of ens8 on the v6rtr machine) is advertised to the snet1 machine as the default gateway, but without any prefix (since no prefix is configured in radvd).

So overall, I assume we get this in production:
 - the instance gets an IPv6 prefix and the (proper) gateway from the Ciscos
 - the instance gets another IPv6 gateway from the qrouter

Changed in neutron:
assignee: nobody → Slawek Kaplonski (slaweq)
importance: Undecided → Medium
Peter (fazy) wrote :

Thanks for the assignment!

I just can't stop thinking about this bug, so I've tried to understand ra.py.
https://github.com/openstack/neutron/blob/8c80267bb6699c86e10aade13c54b715e1eae1bf/neutron/agent/linux/ra.py

So far, I think a small change in the template generation would be enough.

(I've tried to find out which value ends up in ra_modes when the mode is None by reading and searching the code, but without luck.)

It's possibly not the best, because of the three-part IF, but something like this should work:

CONFIG_TEMPLATE = jinja2.Template("""interface {{ interface_name }}
{

   {% if (constants.DHCPV6_STATELESS not in ra_modes) and (constants.DHCPV6_STATEFUL not in ra_modes) and (constants.IPV6_SLAAC not in ra_modes) %}
   AdvSendAdvert off;
   {% else %}
   AdvSendAdvert on;
   {% endif %}
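
The same idea can be tried outside of Neutron with a small stand-alone sketch. It uses plain Python string formatting instead of Jinja2, the function name render_radvd_conf is made up for illustration, and the three mode strings are assumed to be the values behind the constants used in the condition above:

```python
# Assumed string values of IPV6_SLAAC, DHCPV6_STATELESS and DHCPV6_STATEFUL.
RA_MODES = {"slaac", "dhcpv6-stateless", "dhcpv6-stateful"}


def render_radvd_conf(interface_name, ra_modes, mtu=1500):
    """Render a radvd interface block, advertising only if an RA mode is set."""
    send_adverts = "on" if RA_MODES & set(ra_modes) else "off"
    lines = [
        "interface %s" % interface_name,
        "{",
        "   AdvSendAdvert %s;" % send_adverts,
        "   MinRtrAdvInterval 30;",
        "   MaxRtrAdvInterval 100;",
        "   AdvLinkMTU %d;" % mtu,
        "};",
    ]
    return "\n".join(lines) + "\n"


# No RA mode on any subnet behind the interface: adverts are switched off.
print(render_radvd_conf("qr-a6d7ceab-80", []))
```

With an empty ra_modes list the block contains "AdvSendAdvert off;", with ["slaac"] it contains "AdvSendAdvert on;" as before.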

Slawek Kaplonski (slaweq) wrote :

Hi,

I was looking into that issue today. When I tried to reproduce it locally, I created a network, then a subnet with the same settings as yours, and then I tried to attach this subnet to the neutron router and I got an error like:

(overcloud) [stack@undercloud-0 ~]$ neutron router-interface-add test-router 3982beee-c13f-4e51-beca-60b82405d34a
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
Bad router request: IPv6 subnet 3982beee-c13f-4e51-beca-60b82405d34a configured to receive RAs from an external router cannot be added to Neutron Router..
Neutron server returns request_ids: ['req-af19e405-49d9-4efd-ad78-dd033dab9532']

But when I created a port in this network and added it to the router, it worked without error.
So IMHO it sounds like plugging such a subnet into a Neutron router shouldn't be supported by Neutron at all, but we have a bug in the validation when we are plugging a port into the router.

This validation has been in neutron since https://review.opendev.org/#/c/136733/ which was merged many years ago.

I opened a separate bug for that validation issue, https://bugs.launchpad.net/neutron/+bug/1889619, and I will propose a patch to fix it so Neutron will no longer allow plugging such a port into the router.
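
To illustrate what checking the port path could mean, here is a rough Python sketch. It is not the Neutron code and not the proposed patch: BadRouterRequest, expects_external_ra and validate_router_port are hypothetical names, and ports and subnets are plain dicts.

```python
class BadRouterRequest(Exception):
    """Hypothetical error type used only in this sketch."""


def expects_external_ra(subnet):
    """IPv6 subnet with an address mode but no RA mode: RAs come from outside."""
    return (subnet["ip_version"] == 6
            and subnet.get("ipv6_ra_mode") is None
            and subnet.get("ipv6_address_mode") is not None)


def validate_router_port(port, subnets_by_id):
    """Apply the subnet rule to every subnet the port has a fixed IP on."""
    for fixed_ip in port["fixed_ips"]:
        subnet = subnets_by_id[fixed_ip["subnet_id"]]
        if expects_external_ra(subnet):
            raise BadRouterRequest(
                "IPv6 subnet %s configured to receive RAs from an external "
                "router cannot be added to Neutron Router." % subnet["id"])


SUBNETS = {
    "v4": {"id": "v4", "ip_version": 4},
    "v6": {"id": "v6", "ip_version": 6,
           "ipv6_ra_mode": None, "ipv6_address_mode": "slaac"},
}

# A port that already holds an address from the SLAAC subnet is rejected.
dual_stack_port = {"fixed_ips": [{"subnet_id": "v4"}, {"subnet_id": "v6"}]}
try:
    validate_router_port(dual_stack_port, SUBNETS)
except BadRouterRequest as exc:
    print("rejected:", exc)
```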

Peter (fazy) wrote :

I created the network with these commands:

openstack network create Flat1 --share --provider-network-type vlan --provider-physical-network vltrunk --provider-segment 719

openstack subnet create \
--gateway 193.224.218.254 \
--subnet-range 193.224.218.0/24 \
--allocation-pool start=193.224.218.1,end=193.224.218.250 \
--host-route destination=169.254.169.254/32,gateway=193.224.218.251 \
--dns-nameserver 193.224.161.70 --dns-nameserver 193.224.161.86 \
--network Flat1 Flat1-subnet-v4

openstack subnet create \
--subnet-range 2001:738:0:527::/64 \
--ip-version 6 --ipv6-address-mode slaac \
--network Flat1 Flat1-subnet-v6

openstack router create Flat1-Router

openstack port create --network Flat1 --fixed-ip ip-address=193.224.218.251 flat1-rtr-port

openstack router add port Flat1-Router flat1-rtr-port

So I did not plug any v6 subnet or port into the router; however, in Horizon I can see an IPv6 address assigned to every port in this network (flat1-rtr-port, DHCP ports, etc.).
Screenshot here: https://drive.kifu.niif.hu/index.php/s/y5DrkLyP7TcoLps

Slawek Kaplonski (slaweq) wrote :

Hi Peter,

This is exactly what I did too. And by doing "openstack router add port Flat1-Router flat1-rtr-port" you are effectively also plugging the IPv6 subnet into this router, which shouldn't be possible (and it isn't when you plug the subnet "directly" into the router by doing "neutron router-interface-add <router-id> <subnet-id>").

Peter (fazy) wrote :

Hi Slawek,

I think I've started to understand the problem, but still not 100%.

Just to clarify:

If I have an IPv4 and an IPv6 subnet in the network, and this v6 subnet has RA Mode: None, the qrouter interface (qr-a6d7ceab-80 in this case) should not get the IPv6 address, and/or radvd should not be spawned?

How does this affect IPv6 for the end users (if at all)?

Right now, Horizon shows the users an IPv6 address for all instances in these networks (possibly it's just calculated from the MAC, but it's the same one the instance chooses), and Neutron also creates ip6tables rules with MAC-IP pairs based on this address (AFAIK), plus any other rules from the security groups.

I think this is because the IPv6 address mode is "slaac" in our setup.

So after patching, will the users still get the IPv6 address shown in Horizon/CLI for the instance, and will the security groups still work?

If I can get a pre-release test package (in a ppa repository for example), we have a dev environment, where I can test it.

Thanks,
 Peter

Slawek Kaplonski (slaweq) wrote :

Hi Peter,

What you are saying is correct. In the code it's in https://github.com/openstack/neutron/blob/master/neutron/db/ipam_backend_mixin.py#L452 - neutron will always allocate an IPv6 address from such a SLAAC subnet if the port is not a router port.
And in your workflow, the port is created first, and it's not a router port yet, so it gets such an IPv6 address assigned. Later this port is attached to the router, but it already has this IPv6 address allocated.
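
The effect of the ordering can be shown with a toy model. This is a deliberately simplified Python sketch of the behaviour described above, not Neutron code: ToyNetwork, build and the order labels are made up for illustration.

```python
class ToyNetwork:
    """Toy stand-in for a network with subnets and ports."""

    def __init__(self):
        self.subnets = []   # list of (name, ipv6_address_mode)
        self.ports = {}     # port name -> {"subnets": [...], "router_port": bool}

    def add_subnet(self, name, ipv6_address_mode=None):
        self.subnets.append((name, ipv6_address_mode))
        if ipv6_address_mode == "slaac":
            # Existing ports pick up a SLAAC address, except router ports.
            for port in self.ports.values():
                if not port["router_port"]:
                    port["subnets"].append(name)

    def create_port(self, name, subnet):
        # A new port is not a router port yet, so it also gets an address
        # from every SLAAC subnet that already exists on the network.
        slaac = [n for n, mode in self.subnets if mode == "slaac" and n != subnet]
        self.ports[name] = {"subnets": [subnet] + slaac, "router_port": False}

    def attach_to_router(self, name):
        # Attaching keeps whatever addresses the port already has.
        self.ports[name]["router_port"] = True


def build(order):
    """Return the subnets the router port ends up with for a given order."""
    net = ToyNetwork()
    net.add_subnet("v4")
    if order == "v6-first":
        net.add_subnet("v6", "slaac")
    net.create_port("flat1-rtr-port", "v4")
    net.attach_to_router("flat1-rtr-port")
    if order == "v6-last":
        net.add_subnet("v6", "slaac")
    return net.ports["flat1-rtr-port"]["subnets"]


print(build("v6-first"))  # ['v4', 'v6']
print(build("v6-last"))   # ['v4']
```

In the first order the port carries an IPv6 address into the router; in the second order it is already a router port when the SLAAC subnet appears, so it gets none.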

I understand your use case but I'm not sure what would be the best way to address it.

To work around this issue, you can plug the IPv4 subnet into the router directly. It will then not allocate an IP from this IPv6 subnet. But that means it will always use the IP address defined as gateway_ip in the subnet.

Peter (fazy) wrote :

Hi Slawek,

That gives me an idea why this bug happened only with our new subnets...
When we started with Kilo 5 years ago, we only used IPv4; the v6 subnet was added 1-2 months later.

So if I remove the whole subnet/network/etc. and create it again (this time first the v4 subnet, then adding the port to the router, then adding the IPv6 subnet), it may work?

Slawek Kaplonski (slaweq) wrote :

@Peter,

It may be a workaround for you, but you will have to test that first.

Peter (fazy) wrote :

Hi!

I've tried this reordering, and it looks good!

I did this (this is our DEV environment, so different subnets, but the same setup):

openstack network create Flat1 --share --provider-network-type vlan --provider-physical-network vltrunk --provider-segment 712

openstack subnet create \
--gateway 193.224.110.94 \
--subnet-range 193.224.110.64/27 \
--allocation-pool start=193.224.110.65,end=193.224.110.90 \
--host-route destination=169.254.169.254/32,gateway=193.224.110.91 \
--dns-nameserver 193.224.161.70 --dns-nameserver 193.224.161.86 \
--network Flat1 Flat1-subnet-v4

openstack router create Flat1-Router

openstack port create --network Flat1 --fixed-ip ip-address=193.224.110.91 flat1-rtr-port

openstack router add port Flat1-Router flat1-rtr-port

openstack subnet create \
--subnet-range 2001:738:0:5b2::/64 \
--ip-version 6 --ipv6-address-mode slaac \
--network Flat1 Flat1-subnet-v6

This time the DHCP agents got an IPv6 address too after adding the subnet, but there is no IPv6 address on flat1-rtr-port, and no radvd processes were spawned on the network nodes.

Thanks for pointing this out!

Slawek Kaplonski (slaweq) wrote :

Thx for testing that and for the info that it works.
Based on your last comment and on the fact that this shouldn't work any other way in Neutron, I'm going to close this bug now.
Feel free to reopen it if you think there is something more to do there.

Changed in neutron:
status: New → Invalid