[RFE] Prevent DHCP agent from processing stale RPC messages when starting up

Bug #1795212 reported by Kailun Qin
Affects: neutron
Status: In Progress
Importance: Wishlist
Assigned to: Kailun Qin

Bug Description

Network rescheduling is triggered when the neutron server discovers that agents are down. At the same time, some bare metal and node management systems will reboot those same nodes. When these two actions coincide, the server sends RPC notifications to agents that have just been rebooted, which leads to stale RPC messages when the DHCP agents return to service. These messages were sent to the agent before the node was rebooted but were never processed, because the agent was shut down at the time.

The negative effects in this case are:
When an agent receives a stale network-create-end notification, it is triggered to start servicing a network even though the server may have already assigned that network to a different agent. Since the agent does not periodically audit the list of networks it is servicing, it could continue servicing a network that was not assigned to it indefinitely. Similarly, a stale delete message may be processed, causing the agent to stop servicing a network that it was actually supposed to service.

Kailun Qin (kailun.qin)
Changed in neutron:
assignee: nobody → Kailun Qin (kailun.qin)
Revision history for this message
Kailun Qin (kailun.qin) wrote :

Two solutions are being considered to tackle this issue:
1) Introduce a new configurable startup delay to the DHCP agent. Within that delay, no potentially stale RPC messages are processed; instead, we wait for state to be synced during the startup delay.
2) Introduce a timestamp/sequence number in the RPC messages sent between the server and the DHCP agents. This way, we can define a dead time/limit beyond which stale RPC messages are discarded.
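The second option can be sketched as a small staleness filter. This is a minimal illustration, not the proposed patch: the `StaleMessageFilter` class and the `sent_at` payload field are hypothetical names chosen for the sketch.

```python
import time


class StaleMessageFilter:
    """Sketch of option 2: discard RPC payloads stamped before agent start."""

    def __init__(self, max_age=None):
        # Messages timestamped before the agent (re)started are stale.
        self.started_at = time.time()
        self.max_age = max_age  # optional absolute age limit in seconds

    def is_stale(self, payload):
        sent_at = payload.get('sent_at')
        if sent_at is None:
            # No timestamp: an unmodified server; accept for compatibility.
            return False
        if sent_at < self.started_at:
            return True
        return self.max_age is not None and time.time() - sent_at > self.max_age


# Usage: a message stamped before the agent started is dropped, a fresh
# one is processed.
f = StaleMessageFilter()
old_msg = {'network_id': 'net-1', 'sent_at': f.started_at - 60}
new_msg = {'network_id': 'net-1', 'sent_at': time.time()}
print(f.is_stale(old_msg), f.is_stale(new_msg))  # True False
```

The backward-compatibility cost discussed later in this thread shows up here as the `sent_at is None` branch: an unmodified server sends no timestamp, so the filter must fail open.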

tags: added: rfe
tags: added: l3-ipam-dhcp
Changed in neutron:
importance: Undecided → Wishlist
Revision history for this message
Miguel Lavalle (minsel) wrote :

Hi Kailun Qin,

Is this proposal a replacement for the spec that we discussed at the Denver PTG: https://review.openstack.org/#/c/595978/ ?

Revision history for this message
Kailun Qin (kailun.qin) wrote :

Hi Miguel,
It is not a replacement for the cited spec, which relates to agent load re-balancing.
This one addresses a different issue: stale RPC messages processed by DHCP agents when they start up, which is usually seen in a rescheduling scenario.
Let me know if there are any further questions, thanks!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/609463

Changed in neutron:
status: New → In Progress
Changed in neutron:
assignee: Kailun Qin (kailun.qin) → Brian Haley (brian-haley)
Changed in neutron:
assignee: Brian Haley (brian-haley) → Kailun Qin (kailun.qin)
Revision history for this message
Kailun Qin (kailun.qin) wrote :

Dear drivers team,
Since the patch is pretty much in good shape, can we have this RFE triaged/discussed during this week's drivers meeting on October 26th? Thanks!

Kailun Qin (kailun.qin)
Changed in neutron:
status: In Progress → New
Changed in neutron:
status: New → In Progress
Kailun Qin (kailun.qin)
description: updated
Revision history for this message
Kailun Qin (kailun.qin) wrote :

Questions to follow up based on the discussions during the drivers meeting on October 26th:
1. amotoki: what kind of corner cases and negative behaviors could happen (even with a full sync)?
2. slaweq: in some (very) old releases, the RPC queues that agents registered had a random id, so when an agent was e.g. restarted it created a new queue and would not consume messages from the old one.
AR: double check slaweq's statement and clarify the failure modes in question, i.e., what actually happens, together with Q1.
3. haleyb: could we do it without a config option somehow? e.g., the resource queue with ExclusiveResourceProcessor used in the L3 agent

Revision history for this message
Miguel Lavalle (minsel) wrote :

Moving this RFE to the confirmed stage. We are waiting for the submitter to update his proposal before moving the RFE to the triaged state, where we will discuss it again in the drivers meeting.

tags: added: rfe-confirmed
removed: rfe
Revision history for this message
Kailun Qin (kailun.qin) wrote :

Some answers to the questions raised in the last drivers meeting.

1. amotoki: failure mode.
[kailun] Input from Wind River, who hit this issue in the first place:
We only observed this type of issue in a large office configuration where the neutron-server is overloaded during a DOR (dead office recovery) test, where all nodes are powered off and back on. In such a scenario the system is overloaded for an extended period and there is a long delay between when events occur and when notifications are received by subscribers. It is difficult to reproduce this on small systems where the time between event and notification is short.

I don’t remember the exact details of the entire scenario, but the high level issue was that we wanted to avoid agents receiving and processing RPC messages that were sent to them before they started up. That happens more frequently in a DOR test because the server has a stale view of the system state and can send RPC messages to nodes that are not enabled yet. That is, its agent DB table may show that all agents are healthy depending on how long it took for the DOR to recover the controller node.

What we found was that it was possible for the server to think that the agent was up when it was actually down. During the window where the server sees the agent as up it can send it RPC messages. Those messages get queued up and delivered to the agent once it is finally up. The problem is since the agent was not actually up in the first place those messages were never really valid. Therefore we wanted the agent to discard any RPC requests until after it was able to resync to the server. This allowed the system to avoid unnecessary transitions based on old data.

One of the specific problems that this was addressing was something like this:
1) A subnet had no remaining IP addresses to allocate.
2) A DHCP agent (agent-X) received a stale "create network" message, so it reserved a DHCP port with an IP address (this used the last available IP address).
3) Meanwhile, the DHCP agent (agent-Y) that was actually assigned the network came up and was not able to reserve a DHCP port because there were no IP addresses available.
4) The first agent (agent-X) was taken down because its node was rebooted by system maintenance.
5) The second agent (agent-Y) never retries the DHCP port creation because the DHCP agent has no periodic audit, so there was no DHCP server servicing the network.

2. slaweq: in some (very) old releases, the RPC queues that agents registered had a random id, so when an agent was e.g. restarted it created a new queue and would not consume messages from the old one.
[kailun] The queue name is the identifier of the specific RPC queue. The broker can generate a unique queue name if none is specified. However, in Neutron we have queues with fixed names in the format "dhcp_agent.host_name". Thus, when an agent is restarted, it will consume messages from its old queue.
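The difference between the two naming schemes can be illustrated with a toy in-memory broker. This is a minimal sketch: the `broker` dict and helper functions are stand-ins for AMQP, and only the "dhcp_agent.<host>" naming convention comes from Neutron.

```python
from collections import defaultdict

# Toy broker: queues live in the broker keyed by name, like AMQP queues.
broker = defaultdict(list)


def publish(queue_name, msg):
    broker[queue_name].append(msg)


def consume_all(queue_name):
    msgs, broker[queue_name] = broker[queue_name][:], []
    return msgs


# Fixed name (current Neutron behaviour): the queue outlives the agent
# process, so a message published while the agent was down is delivered
# to the restarted agent -- this is exactly how stale messages arrive.
publish('dhcp_agent.host-1', 'network_create_end sent while agent was down')
delivered = consume_all('dhcp_agent.host-1')  # restarted agent sees it
print(delivered)

# Random per-incarnation name (the old behaviour slaweq described): the
# restarted agent declares a fresh queue and never sees the old message.
publish('dhcp_agent.host-1.uuid-aaa', 'stale message')
missed = consume_all('dhcp_agent.host-1.uuid-bbb')  # new incarnation: empty
print(missed)  # []
```

With random queue names the staleness problem disappears but so does reliable delivery, which is why the fixed-name scheme is kept and the staleness has to be handled on the agent side.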

3. haleyb: could we do it without a config option somehow? e.g., the resource queue with ExclusiveResourceProcessor used in the L3 agent
[kailun] The resource queue with ExclusiveResourceProcessor used in the L3 agent can make a timestamp tag the first time an agent has *received* a rp...


Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I understand the problem which you described here and I can imagine how it can happen in some corner cases.
I'm however afraid that the approach with a configurable delay isn't a good one, because you can't know how much time the agent will need to do a full sync after restart.
IMO if you want to solve this problem, you should try to do something based on timestamps and discard messages which came before the agent was started.

Revision history for this message
Kailun Qin (kailun.qin) wrote :

@slaweq
Thanks for the comments.
From our point of view, it does not matter how long the full sync takes. Processing any RPC messages, even ones that are not stale, before the initial full sync completes is not guaranteed to provide consistent results.
For example, if a port-update-end arrives before that port is received as part of the initial sync, it will unnecessarily result in a full resync of that port's network. Similarly, if a port-delete-end arrives before that port is received as part of the initial sync, the port will be added to the "deleted_ports" list; but that list is not referenced during the full sync, so the information for that port will remain in the DHCP configuration for that network even though the port no longer exists. That will cause issues later when a new port is created and uses the IP address of the deleted port.
We opted for the agent-delay approach in the spirit of avoiding compatibility changes and changes that would impact running against an unmodified server. We agree that a timestamp is a good approach, but it comes with backward-compatibility constraints and additional complexity.
Let us know if there are any further questions or concerns. Thanks!

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Ok, one more question. Imagine the case when:
1. The agent starts and is doing a full sync - so it gets the list of ports and networks from the server and starts configuring them one by one, right?
2. During this time, processing of incoming RPC messages is blocked, right?
3. Now (still during the initial full sync) someone deletes a port, so a port-delete-end message is sent to the DHCP agent, but the agent refuses to process this message, right?
4. The full sync ends and the agent is still handling the port which was deleted in step 3 - am I right? Or will it be cleaned up somehow?

Revision history for this message
Kailun Qin (kailun.qin) wrote :

@slaweq
The RPC handlers (e.g., port_update_end) are all wrapped with "_wait_if_syncing", so they don't actually start processing until after the sync has completed. We are only trying to prevent messages from being processed between the start of the process lifetime and the beginning of the initial sync; that window is what leads to the issues we have noted.
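The wait-if-syncing pattern can be sketched with a `threading.Event` standing in for the lock used in the real agent. This is a simplified illustration: `DhcpAgentSketch` and its members are hypothetical; only the `_wait_if_syncing` name (here `wait_if_syncing`) and `sync_state`/`port_update_end` come from the agent code.

```python
import functools
import threading


def wait_if_syncing(func):
    """Sketch of neutron's _wait_if_syncing decorator (simplified: the
    real agent uses a read/write lock, not a bare Event)."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        # RPC handlers block here until the initial sync has completed,
        # so messages that arrive mid-sync are deferred, not dropped.
        self._sync_done.wait()
        return func(self, *args, **kwargs)
    return wrapper


class DhcpAgentSketch:
    def __init__(self):
        self._sync_done = threading.Event()
        self.handled = []

    def sync_state(self):
        # ... the full resync against the server would go here ...
        self._sync_done.set()  # releases any handlers waiting above

    @wait_if_syncing
    def port_update_end(self, payload):
        self.handled.append(payload)


# Usage: a handler invoked from another thread blocks until sync_state()
# finishes, then processes its message.
agent = DhcpAgentSketch()
t = threading.Thread(target=agent.port_update_end, args=({'port': 'p1'},))
t.start()
agent.sync_state()
t.join()
print(agent.handled)  # [{'port': 'p1'}]
```

This is why the mid-sync delete in Slawek's scenario is deferred rather than lost: the handler is parked at the wait, not discarded.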

Revision history for this message
Miguel Lavalle (minsel) wrote :

Looking at the proposed code (https://review.openstack.org/#/c/609463/10/neutron/agent/dhcp/agent.py@168), while that initial sync_state() is running, the corner case described by Slawek can still happen, right? Because at that point self._block_rpc == True if conf.initial_state_delay > 0

tags: added: rfe-triaged
removed: rfe-confirmed
Revision history for this message
Kailun Qin (kailun.qin) wrote :

@Miguel
The RPC handlers (e.g., port_update_end) are all wrapped with "_wait_if_syncing" (https://review.openstack.org/#/c/609463/10/neutron/agent/dhcp/agent.py@491).
When that initial "sync_state()" is running, it acquires a write lock to block all operations for the global sync call (https://review.openstack.org/#/c/609463/10/neutron/agent/dhcp/agent.py@491). During this time, if an RPC message is sent to the DHCP agent, it waits while any sync operations/writers are in progress, due to the "_wait_if_syncing" decorator mentioned above, and starts processing after the sync ("sync_state()") has completed.
So, IMO the corner case described by Slawek will not happen, since a message received during the sync process will not be dropped and will be handled later.

Miguel Lavalle (minsel)
tags: added: rfe-confirmed
removed: rfe-triaged
Miguel Lavalle (minsel)
tags: added: rfe-postponed
removed: l3-ipam-dhcp rfe-confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Rodolfo Alonso Hernandez (<email address hidden>) on branch: master
Review: https://review.opendev.org/609463
Reason: Please, feel free to retake this patch if needed.
