[RFE] Prevent DHCP agent from processing stale RPC messages when starting up

Bug #1795212 reported by Kailun Qin
Affects: neutron
Status: In Progress
Importance: Wishlist
Assigned to: Kailun Qin

Bug Description

Network rescheduling is triggered when the neutron server discovers that agents are down. At the same time, some bare metal and node management systems will reboot those same nodes. When these two actions coincide, the server sends RPC notifications to agents that have just been rebooted, which leads to stale RPC messages when the DHCP agents return to service. These messages were sent to the agent before the node was rebooted but were never processed, because the agent was shut down at the time.

The negative effects in this case are:
When an agent receives a stale network-create-end notification, it is triggered to start servicing a network even though the server may have already assigned that network to a different agent. Since the agent does not periodically audit the list of networks it is servicing, it could continue servicing a network that was not assigned to it indefinitely. Similarly, a stale delete message may be processed, causing the agent to stop servicing a network that it was actually supposed to service.

Kailun Qin (kailun.qin)
Changed in neutron:
assignee: nobody → Kailun Qin (kailun.qin)
Revision history for this message
Kailun Qin (kailun.qin) wrote :

Two solutions are being considered to tackle this issue:
1) Introduce a new configurable startup delay to the DHCP agent. Within that delay, no potentially stale RPC messages are processed; instead, we wait for state to be synced during the startup delay.
2) Introduce a timestamp/sequence number in the RPC messages sent between the server and the DHCP agents. This way, we can define a dead time/limit beyond which stale RPC messages are discarded.
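The second option can be sketched as a small staleness filter. This is a minimal illustration, not the proposed patch: the `StaleMessageFilter` class and the `sent_at` payload field are hypothetical names chosen for the sketch.

```python
import time


class StaleMessageFilter:
    """Sketch of option 2: discard RPC payloads stamped before agent start."""

    def __init__(self, max_age=None):
        # Messages timestamped before the agent (re)started are stale.
        self.started_at = time.time()
        self.max_age = max_age  # optional absolute age limit in seconds

    def is_stale(self, payload):
        sent_at = payload.get('sent_at')
        if sent_at is None:
            # No timestamp: an unmodified server; accept for compatibility.
            return False
        if sent_at < self.started_at:
            return True
        return self.max_age is not None and time.time() - sent_at > self.max_age


# Usage: a message stamped before the agent started is dropped, a fresh
# one is processed.
f = StaleMessageFilter()
old_msg = {'network_id': 'net-1', 'sent_at': f.started_at - 60}
new_msg = {'network_id': 'net-1', 'sent_at': time.time()}
print(f.is_stale(old_msg), f.is_stale(new_msg))  # True False
```

The backward-compatibility cost discussed later in this thread shows up here as the `sent_at is None` branch: an unmodified server sends no timestamp, so the filter must fail open.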

tags: added: rfe
tags: added: l3-ipam-dhcp
Changed in neutron:
importance: Undecided → Wishlist
Revision history for this message
Miguel Lavalle (minsel) wrote :

Hi Kailun Qin,

Is this proposal a replacement for the spec that we discussed at the Denver PTG: https://review.openstack.org/#/c/595978/ ?

Revision history for this message
Kailun Qin (kailun.qin) wrote :

Hi Miguel,
It is not a replacement for the cited spec, which relates to agent load re-balancing.
This one addresses a different issue: stale RPC messages processed by DHCP agents when they start up, which is usually seen in a rescheduling scenario.
Let me know if there are any further questions, thanks!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/609463

Changed in neutron:
status: New → In Progress
Changed in neutron:
assignee: Kailun Qin (kailun.qin) → Brian Haley (brian-haley)
Changed in neutron:
assignee: Brian Haley (brian-haley) → Kailun Qin (kailun.qin)
Revision history for this message
Kailun Qin (kailun.qin) wrote :

Dear drivers team,
Since the patch is pretty much in good shape, can we have this RFE triaged/discussed during this week's drivers meeting on October 26th? Thanks!

Kailun Qin (kailun.qin)
Changed in neutron:
status: In Progress → New
Changed in neutron:
status: New → In Progress
Kailun Qin (kailun.qin)
description: updated
Revision history for this message
Kailun Qin (kailun.qin) wrote :

Questions to follow up based on the discussions during the drivers meeting on October 26th:
1. amotoki: what kind of corner cases and negative behaviors could happen (even with a full sync)?
2. slaweq: in some (very) old releases, the RPC queues that agents registered had a random id, so when an agent was e.g. restarted it created a new queue and would not consume messages from the old one.
AR: double check slaweq's statement and clarify the failure modes in question, i.e., what actually happens, together with Q1.
3. haleyb: could we do it without a config option somehow? e.g., the resource queue with ExclusiveResourceProcessor used in the L3 agent

Revision history for this message
Miguel Lavalle (minsel) wrote :

Moving this RFE to the confirmed stage. We are waiting for the submitter to update his proposal before moving the RFE to the triaged state, where we will discuss it again in the drivers meeting.

tags: added: rfe-confirmed
removed: rfe
Revision history for this message
Kailun Qin (kailun.qin) wrote :

Some answers to the questions raised in the last drivers meeting.

1. amotoki: failure mode.
[kailun] Input from Wind River, who hit this issue in the first place:
We only observed this type of issue in a large office configuration where the neutron-server is overloaded during a DOR (dead office recovery) test, where all nodes are powered off and back on. In such a scenario the system is overloaded for an extended period and there is a long delay between when events occur and when notifications are received by subscribers. It is difficult to reproduce this on small systems where the time between event and notification is short.

I don’t remember the exact details of the entire scenario, but the high level issue was that we wanted to avoid agents receiving and processing RPC messages that were sent to them before they started up. That happens more frequently in a DOR test because the server has a stale view of the system state and can send RPC messages to nodes that are not enabled yet. That is, its agent DB table may show that all agents are healthy depending on how long it took for the DOR to recover the controller node.

What we found was that it was possible for the server to think that the agent was up when it was actually down. During the window where the server sees the agent as up it can send it RPC messages. Those messages get queued up and delivered to the agent once it is finally up. The problem is since the agent was not actually up in the first place those messages were never really valid. Therefore we wanted the agent to discard any RPC requests until after it was able to resync to the server. This allowed the system to avoid unnecessary transitions based on old data.

One of the specific problems that this was addressing was something like this:
1) A subnet had no remaining IP addresses to allocate.
2) A DHCP agent (agent-X) received a stale "create network" message, so it reserved a DHCP port with an IP address (this used the last available IP address).
3) Meanwhile, the DHCP agent (agent-Y) that was actually assigned the network came up and was not able to reserve a DHCP port because there were no IP addresses available.
4) The first agent (agent-X) was taken down because its node was rebooted by system maintenance.
5) The second agent (agent-Y) never retries the DHCP port creation because the DHCP agent has no periodic audit, so there was no DHCP server servicing the network.

2. slaweq: in some (very) old releases, the RPC queues that agents registered had a random id, so when an agent was e.g. restarted it created a new queue and would not consume messages from the old one.
[kailun] The queue name is the identifier of the specific RPC queue. The broker can generate a unique queue name if none is specified. However, in Neutron we have queues with fixed names in the format "dhcp_agent.host_name". Thus, when an agent is restarted, it will consume messages from its old queue.
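The difference between the two naming schemes can be illustrated with a toy in-memory broker. This is a minimal sketch: the `broker` dict and helper functions are stand-ins for AMQP, and only the "dhcp_agent.<host>" naming convention comes from Neutron.

```python
from collections import defaultdict

# Toy broker: queues live in the broker keyed by name, like AMQP queues.
broker = defaultdict(list)


def publish(queue_name, msg):
    broker[queue_name].append(msg)


def consume_all(queue_name):
    msgs, broker[queue_name] = broker[queue_name][:], []
    return msgs


# Fixed name (current Neutron behaviour): the queue outlives the agent
# process, so a message published while the agent was down is delivered
# to the restarted agent -- this is exactly how stale messages arrive.
publish('dhcp_agent.host-1', 'network_create_end sent while agent was down')
delivered = consume_all('dhcp_agent.host-1')  # restarted agent sees it
print(delivered)

# Random per-incarnation name (the old behaviour slaweq described): the
# restarted agent declares a fresh queue and never sees the old message.
publish('dhcp_agent.host-1.uuid-aaa', 'stale message')
missed = consume_all('dhcp_agent.host-1.uuid-bbb')  # new incarnation: empty
print(missed)  # []
```

With random queue names the staleness problem disappears but so does reliable delivery, which is why the fixed-name scheme is kept and the staleness has to be handled on the agent side.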

3. haleyb: could we do it without a config option somehow? e.g., the resource queue with ExclusiveResourceProcessor used in the L3 agent
[kailun] The resource queue with ExclusiveResourceProcessor used in the L3 agent can make a timestamp tag the first time an agent has *received* a rp...


Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I understand the problem which you described here and I can imagine how it can happen in some corner cases.
I'm however afraid that the approach with a configurable delay isn't a good one, because you can't know how much time the agent will need to do a full sync after restart.
IMO if you want to solve this problem, you should try to do something based on timestamps and discard messages which came before the agent was started.

Revision history for this message
Kailun Qin (kailun.qin) wrote :

@slaweq
Thanks for the comments.
From our point of view, it does not matter how long the full sync takes. Processing any RPC messages, even ones that are not stale, before the initial full sync completes is not guaranteed to provide consistent results.
For example, if a port-update-end arrives before that port is received as part of the initial sync, it will unnecessarily result in a full resync of that port's network. Similarly, if a port-delete-end arrives before that port is received as part of the initial sync, the port will be added to the "deleted_ports" list; but that list is not referenced during the full sync, so the information for that port will remain in the DHCP configuration for that network even though the port no longer exists. That will cause issues later when a new port is created and uses the IP address of the deleted port.
We opted for the agent-delay approach in the spirit of avoiding compatibility changes and changes that would impact running against an unmodified server. We agree that a timestamp is a good approach, but it comes with backward-compatibility constraints and additional complexity.
Let us know if there are any further questions or concerns. Thanks!

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Ok, one more question. Imagine the case when:
1. The agent starts and is doing a full sync - so it gets the list of ports and networks from the server and starts configuring them one by one, right?
2. During this time, processing of incoming RPC messages is blocked, right?
3. Now (still during the initial full sync) someone deletes a port, so a port-delete-end message is sent to the DHCP agent, but the agent refuses to process this message, right?
4. The full sync ends and the agent is still handling the port which was deleted in step 3 - am I right? Or will it be cleaned up somehow?

Revision history for this message
Kailun Qin (kailun.qin) wrote :

@slaweq
The RPC handlers (e.g., port_update_end) are all wrapped with "_wait_if_syncing", so they don't actually start processing until after the sync has completed. We are only trying to prevent messages from being processed between the start of the process lifetime and the beginning of the initial sync; that window is what leads to the issues we have noted.
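The wait-if-syncing pattern can be sketched with a `threading.Event` standing in for the lock used in the real agent. This is a simplified illustration: `DhcpAgentSketch` and its members are hypothetical; only the `_wait_if_syncing` name (here `wait_if_syncing`) and `sync_state`/`port_update_end` come from the agent code.

```python
import functools
import threading


def wait_if_syncing(func):
    """Sketch of neutron's _wait_if_syncing decorator (simplified: the
    real agent uses a read/write lock, not a bare Event)."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        # RPC handlers block here until the initial sync has completed,
        # so messages that arrive mid-sync are deferred, not dropped.
        self._sync_done.wait()
        return func(self, *args, **kwargs)
    return wrapper


class DhcpAgentSketch:
    def __init__(self):
        self._sync_done = threading.Event()
        self.handled = []

    def sync_state(self):
        # ... the full resync against the server would go here ...
        self._sync_done.set()  # releases any handlers waiting above

    @wait_if_syncing
    def port_update_end(self, payload):
        self.handled.append(payload)


# Usage: a handler invoked from another thread blocks until sync_state()
# finishes, then processes its message.
agent = DhcpAgentSketch()
t = threading.Thread(target=agent.port_update_end, args=({'port': 'p1'},))
t.start()
agent.sync_state()
t.join()
print(agent.handled)  # [{'port': 'p1'}]
```

This is why the mid-sync delete in Slawek's scenario is deferred rather than lost: the handler is parked at the wait, not discarded.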

Revision history for this message
Miguel Lavalle (minsel) wrote :

Looking at the proposed code (https://review.openstack.org/#/c/609463/10/neutron/agent/dhcp/agent.py@168), while that initial sync_state() is running, the corner case described by Slawek can still happen, right? Because at that point self._block_rpc == True if conf.initial_state_delay > 0

tags: added: rfe-triaged
removed: rfe-confirmed
Revision history for this message
Kailun Qin (kailun.qin) wrote :

@Miguel
The RPC handlers (e.g., port_update_end) are all wrapped with "_wait_if_syncing" (https://review.openstack.org/#/c/609463/10/neutron/agent/dhcp/agent.py@491).
When that initial "sync_state()" is running, it acquires a write lock to block all operations for the global sync call (https://review.openstack.org/#/c/609463/10/neutron/agent/dhcp/agent.py@491). During this time, if an RPC message is sent to the DHCP agent, it waits while any sync operations/writers are in progress, due to the "_wait_if_syncing" decorator mentioned above, and starts processing after the sync ("sync_state()") has completed.
So, IMO the corner case described by Slawek will not happen, since a message received during the sync process will not be dropped and will be handled later.

Miguel Lavalle (minsel)
tags: added: rfe-confirmed
removed: rfe-triaged
Miguel Lavalle (minsel)
tags: added: rfe-postponed
removed: l3-ipam-dhcp rfe-confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Rodolfo Alonso Hernandez (<email address hidden>) on branch: master
Review: https://review.opendev.org/609463
Reason: Please, feel free to retake this patch if needed.
