Some VMs get a bad metadata route

Bug #1450548 reported by Mark Rawlings
This bug affects 2 people
Affects: neutron
Status: Expired
Importance: Medium
Assigned to: Unassigned

Bug Description

In a configuration using the dhcp_agent.ini setting

           enable_isolated_metadata = True

When creating a network configuration that is *not* isolated, it has been observed that the dnsmasq processes are being configured with a static route for the metadata service (169.254.169.254) that points at the local DHCP server.

ci-info: +-------+-----------------+------------+-----------------+-----------+-------+
ci-info: | Route |   Destination   |  Gateway   |     Genmask     | Interface | Flags |
ci-info: +-------+-----------------+------------+-----------------+-----------+-------+
ci-info: |   0   |     0.0.0.0     | 71.0.0.161 |     0.0.0.0     |   eth0    |  UG   |
ci-info: |   1   |   71.0.0.160    |  0.0.0.0   | 255.255.255.240 |   eth0    |   U   |
ci-info: |   2   | 169.254.169.254 | 71.0.0.163 | 255.255.255.255 |   eth0    |  UGH  |
ci-info: +-------+-----------------+------------+-----------------+-----------+-------+

However, in this particular scenario no metadata-proxy processes are running alongside the dnsmasq processes.

When a VM boots it gets the static route via DHCP and is unable to access the metadata service.
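For reference, the route reaches the VM via the DHCP classless-static-route option (121, plus its Microsoft counterpart 249), which the dnsmasq driver emits through its per-network opts file. The entries look roughly like this (illustrative only, using the addresses from the table above; the exact tag format varies by release):

    tag:tag0,option:classless-static-route,169.254.169.254/32,71.0.0.163,0.0.0.0/0,71.0.0.161
    tag:tag0,249,169.254.169.254/32,71.0.0.163,0.0.0.0/0,71.0.0.161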

This issue seems to have appeared due to patch #116832 "Don't spawn metadata-proxy for non-isolated nets".

Is it possible that the basis for that optimisation is flawed?

The optimisation checks whether a subnet is considered isolated; among other things, it checks whether the subnet has a neutron router port available. However, that decision can change while a network is being constructed or modified, and such a change defeats the optimisation.
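For context, the check being discussed amounts to something like the following simplified sketch (not the exact Neutron code; names are illustrative):

    # Simplified sketch of the isolation check -- not the exact Neutron
    # code. A subnet counts as isolated when no router interface port on
    # the network carries an address inside that subnet.
    ROUTER_INTERFACE_OWNER = 'network:router_interface'

    def subnet_is_isolated(network, subnet):
        for port in network.ports:
            if port.device_owner != ROUTER_INTERFACE_OWNER:
                continue
            if any(ip['subnet_id'] == subnet['id'] for ip in port.fixed_ips):
                return False  # a router serves this subnet
        return True

The result is only as stable as the port list it is computed from, which is exactly the problem when routers are attached or detached later.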

Once it has been decided that a network is isolated, the static route for the metadata service may be passed to VMs. At that point we cannot run without metadata proxies on the DHCP servers, even if a neutron router later becomes available and the network becomes non-isolated.

A proposal would be to remove the optimisation of not launching metadata proxies on DHCP servers, which means we would return to carrying the extra metadata-proxy processes.

Revision history for this message
Mark Rawlings (mark-rawlings) wrote :

This has been observed in a Juno environment.
It has not yet been verified on Kilo.

Revision history for this message
John Schwarz (jschwarz) wrote :

I don't see how removing the optimization so that your use case works as intended is good for the use case the optimization solved: two processes were being used where only one is truly needed in plenty of deployments, and when you have a lot of networks with DHCP and routers, that adds up very quickly.

Instead, let's focus on making the transition actually work. Perhaps the DHCP agent could be made aware of the state change and remove the routing rules once the change is made. I want to see what more people say about this before heading to the code, since I believe this will be a somewhat complicated fixup.

Changed in neutron:
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Mark Rawlings (mark-rawlings) wrote :

Hi John,
In our testing we have only removed a portion of the optimisation code because, as you say, it would be somewhat complex to do much more.
The only portion we removed is the decision to deploy proxy agents only if there are isolated subnets; it now always returns True, which forces the 'extra' proxies to deploy. However, these are currently required to ensure there is a metadata proxy available at the end of the static route previously handed out by DHCP. In this way we have disabled the major goal of the original optimisation.
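Concretely, the piece we short-circuited looks roughly like this (a sketch with simplified names, not the exact Juno code):

    # Sketch of the workaround, with simplified names -- not the exact
    # Juno code. Upstream, the helper returns True only when the network
    # has at least one isolated subnet; we force it to True so that a
    # metadata proxy is always deployed alongside dnsmasq.
    def should_enable_metadata(conf, network):
        if not conf.enable_isolated_metadata:
            return False
        # Original optimisation (the part we disabled):
        #   return any(get_isolated_subnets(network).values())
        return True  # always deploy the metadata proxy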

I completely agree this requires further and wider discussion before any further approach is proposed.

Revision history for this message
John Schwarz (jschwarz) wrote :

I'm thinking a possible solution could involve the neutron-server informing the DHCP agent that a neutron router has been attached to one of its networks; the DHCP agent would then know to withdraw those static routes when that happens.
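On the agent side, that could look something like the following (purely hypothetical sketch; no such RPC endpoint exists today):

    # Purely hypothetical sketch of the proposed flow -- no such RPC
    # endpoint exists today. When the server reports that a router
    # interface was added, the agent regenerates its dnsmasq options so
    # the 169.254.169.254/32 static route is no longer handed out.
    class DhcpAgentRouterEvents(object):

        def router_interface_added(self, context, network_id):
            network = self.cache.get_network_by_id(network_id)
            if network:
                # Rewrites the dnsmasq config and reloads it.
                self.call_driver('reload_allocations', network)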

Still waiting for other replies on this subject.

Revision history for this message
Mark Rawlings (mark-rawlings) wrote :

Is there a DHCP mechanism for pushing such updates to already-booted clients?
As far as I am aware, DHCP uses client-pull requests at boot/lease-renewal time, not server pushes.

Changed in neutron:
assignee: nobody → Cedric Brandily (cbrandily)
Revision history for this message
Cedric Brandily (cbrandily) wrote :

@John: such an approach does not work: you can update the routes that dnsmasq provides, but not the ones already installed in the VMs, because you cannot control when existing VMs will query dnsmasq again, nor whether they will refresh their routes when they do (afaik, in general they don't).

It implies that once a subnet has, at any moment, been considered isolated by the DHCP driver, we cannot undeploy its associated metadata proxy when the subnet later becomes non-isolated: we can destroy a metadata proxy only when all subnets of the network are deleted, or when the network itself is deleted.

IMO, the trouble comes from the change [1] associated with bug [2], which changed how we decide whether a subnet is isolated.

[1] https://review.openstack.org/50292
[2] https://bugs.launchpad.net/neutron/+bug/1236783

Revision history for this message
John Schwarz (jschwarz) wrote :

Cedric, that is a correct observation.

I think we should talk about a transition-based solution here. This bug report discusses a once-isolated subnet that becomes non-isolated and stops receiving the metadata service, but if I'm not mistaken there is a problem the other way around as well: a non-isolated subnet becomes isolated (its router is removed) and the DHCP agent doesn't know to provide the metadata static route until some event occurs (subnet update/delete, port update/delete, etc.).

The optimal solution should (IMO) be aware, to an extent, of what has happened in the network (router deleted/added) and respond appropriately. Is it possible to make the DHCP agent aware of router changes? If so, this would allow us to, for example, keep supplying the metadata service to existing instances but not to new ones ("don't distribute static routes") in the case of an isolated subnet gone non-isolated.

Revision history for this message
Mark Rawlings (mark-rawlings) wrote :

@John, I believe you are correct: the issue can occur on the transitions both to and from isolation.

We have:
A) Isolated networks: {<metadata via VM static route to DHCP agent>, <metadata proxy on DHCP agents>}
B) Non-isolated networks: {<metadata via VM default route and router>, <metadata proxy on routers>}

These pairs cannot be mixed and matched.

The transitions are when these issues can occur (a small sketch pairing the two halves follows the list):
    A->B: we may have pushed static routes into VMs that point metadata requests at DHCP agents with no metadata proxies.
    B->A: we are relying on the VM's default route to reach metadata, but there is no longer a router to answer metadata requests.
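To make the mismatch concrete, here is a minimal sketch (purely illustrative, not Neutron code): metadata is reachable only when the VM's route source and the proxy location come from the same pair, and each transition leaves existing VMs with a mismatched pair.

    # Purely illustrative -- not Neutron code. Metadata is reachable only
    # when the VM's route and the deployed proxy form a matching pair.
    CONSISTENT_PAIRS = {
        "A (isolated)":     ("static route via DHCP port", "proxy on dhcp-agent"),
        "B (non-isolated)": ("default route via router",   "proxy on router"),
    }

    def metadata_reachable(vm_route, proxy_location):
        return (vm_route, proxy_location) in CONSISTENT_PAIRS.values()

    # A->B: existing VMs keep the static route, but the proxy moved.
    assert not metadata_reachable("static route via DHCP port", "proxy on router")
    # B->A: existing VMs rely on the default route, but the router is gone.
    assert not metadata_reachable("default route via router", "proxy on dhcp-agent")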

The issue reported here was on the A->B transition, but I take your point that B->A should also have issues, potentially with much harder failures.
I don't have a viable solution for B->A: if metadata was reached via a router and that router is removed, I don't know how we support VMs caught in the window.

I'm still tending towards avoiding the transitions altogether.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

This bug has had no activity for more than 180 days. We are unsetting the assignee and milestone and setting the status to Incomplete in order to allow it to expire in 60 days.

If the bug is still valid, then update the bug status.

Changed in neutron:
assignee: Cedric Brandily (cbrandily) → nobody
status: Confirmed → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired