ha dhcp agents and port mismatch kills dhcp for a tenant network

Bug #1489912 reported by Eric Peterson
This bug affects 2 people
Affects: neutron
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

We have a setup with 3 control nodes, and each tenant network gets 2 DHCP agents. Most of the time, this works fine.

However, we have seen cases where the DHCP agents assigned to a network are on control nodes 1 and 2, but the port listing for the tenant network shows DHCP ports that belong to agents on control nodes 1 and 3. We are not sure why or when this happens, but it seems to occur over time.
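A minimal sketch of how the mismatch described above could be detected, assuming the agent and port listings have already been fetched as lists of dicts. The field names (`host`, `device_owner`, `binding:host_id`) are modeled on the Neutron API, and the function name `find_dhcp_mismatch` is hypothetical; none of this comes from the original report.

```python
from typing import Dict, Iterable, List

# Assumption: DHCP ports carry this device_owner value, as in the Neutron API.
DHCP_OWNER = "network:dhcp"


def find_dhcp_mismatch(agents: Iterable[dict], ports: Iterable[dict]) -> Dict[str, List[str]]:
    """Compare hosts of the scheduled DHCP agents with hosts of the DHCP ports.

    agents: dicts with a "host" key (agents hosting the network).
    ports:  dicts with "device_owner" and "binding:host_id" keys.
    """
    agent_hosts = {a["host"] for a in agents}
    port_hosts = {
        p["binding:host_id"]
        for p in ports
        if p.get("device_owner") == DHCP_OWNER
    }
    return {
        # A DHCP port exists on a host with no agent scheduled for the network.
        "ports_without_agent": sorted(port_hosts - agent_hosts),
        # An agent is scheduled on a host that has no DHCP port.
        "agents_without_port": sorted(agent_hosts - port_hosts),
    }


if __name__ == "__main__":
    # Hypothetical data mirroring the report: agents on nodes 1 and 2,
    # DHCP ports on nodes 1 and 3.
    agents = [{"host": "control1"}, {"host": "control2"}]
    ports = [
        {"device_owner": DHCP_OWNER, "binding:host_id": "control1"},
        {"device_owner": DHCP_OWNER, "binding:host_id": "control3"},
    ]
    print(find_dhcp_mismatch(agents, ports))
```

Run periodically against every tenant network, a check like this might help narrow down when the mismatch first appears.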

When this happens, the tenant VMs no longer get their DHCP requests fulfilled.
Once their leases expire, they lose floating IP (FIP) networking and tend to get pretty
upset... Even though one of the ports points to a valid agent, DHCP requests go out but never get a reply.

We have found a workaround: delete all the DHCP ports in the tenant network, then remove the agents from the network, and allow neutron to recreate both. Once this happens, DHCP works again.
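The workaround steps can be sketched roughly as follows. This is an illustration under assumptions, not the reporter's actual procedure: `client` is any object exposing methods named after those in python-neutronclient (`list_ports`, `delete_port`, `list_dhcp_agent_hosting_networks`, `remove_network_from_dhcp_agent`), and the function name `reset_dhcp_for_network` is hypothetical. Check the method names against the client version in use before relying on them.

```python
from typing import List, Tuple


def reset_dhcp_for_network(client, network_id: str) -> Tuple[List[str], List[str]]:
    """Delete the network's DHCP ports, then unschedule its DHCP agents.

    Neutron is then expected to reschedule agents and recreate the ports,
    as described in the report. Returns (deleted port IDs, removed agent IDs).
    """
    # Step 1: delete every DHCP port in the tenant network.
    ports = client.list_ports(
        network_id=network_id, device_owner="network:dhcp"
    )["ports"]
    for port in ports:
        client.delete_port(port["id"])

    # Step 2: remove the network from each DHCP agent currently hosting it.
    agents = client.list_dhcp_agent_hosting_networks(network_id)["agents"]
    for agent in agents:
        client.remove_network_from_dhcp_agent(agent["id"], network_id)

    return [p["id"] for p in ports], [a["id"] for a in agents]
```

Passing the client in as a parameter keeps the sketch independent of how authentication is set up and makes it easy to dry-run against a stub first.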

Tags: l3-ipam-dhcp
Matt Fischer (mfisch)
description: updated
Carl Baldwin (carl-baldwin) wrote :

Without a definite way to reproduce this, it will be difficult to work on. Could you provide any more data to help understand how serious this is? How often have you seen it? Have you talked to anyone else who has seen this?

Could you possibly add some instrumentation to the code to catch this problem when it happens? Maybe that could give us a better understanding.

Changed in neutron:
status: New → Incomplete
Eric Peterson (ericpeterson-l) wrote :

We are now on Kilo, and see this happen in about 1 out of 100 tenant networks, every 24-48 hours.

I wish we could reproduce it as well. I'm not sure whether this is triggered by an HA DHCP option where the agent gets moved, or by something else. Right now we can see when it happens, and we are still trying to dial in on what happened in the moments preceding it.

Ihar Hrachyshka (ihar-hrachyshka) wrote :

Eric, you could try to collect debug logs for the issue, and attach them to the bug. Without that kind of information, the bug is doomed to expire.

tags: added: l3-ipam-dhcp
Eugene Nikanorov (enikanorov) wrote :

There was an issue in Kilo with DHCP ports where port bindings were not updated when a network was rescheduled from one DHCP agent to another. But that issue is only relevant to core plugins/network backends where port binding matters for connectivity.
With OVS this was not an issue.

Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired