DHCP Worker skips queued tasks if earlier tasks fail

Bug #2037876 reported by Grisha Tsukerman
Affects / Status / Importance / Assigned to / Milestone
OpenStack Security Advisory: Incomplete / Undecided / Unassigned
neutron: New / Critical / Unassigned

Bug Description

Steps to reproduce:

1. Create a network with several subnets and a router.
2. Delete the router and, quickly afterwards, delete the subnets and finally the network.

Expected behavior:
 - Subnets and networks should be deleted as expected after deleting the router.

Actual behavior:
1. The router is not deleted properly (its port is not deleted).
2. Because of the above, the subnet and network deletion tasks are dropped, owing to the design of the task management in the DHCP agent.

RCA:
1. Router deletion failure:
   a. Eventually the task port_delete_end is called from the router deletion for the port: https://github.com/openstack/neutron/blob/stable/yoga/neutron/agent/dhcp/agent.py
   b. As part of the event queue handling, the resource __lt__ function is called to compare resources by IP:
https://github.com/openstack/neutron/blob/cf096344b07b80524c3888e44e0b895465598a74/neutron/agent/common/resource_processing_queue.py#L177C1-L178C1
   c. The __lt__ function fails because, when port_delete_end is triggered for a router's port, the 'ip_address' key is missing from the port's fixed_ips entries.
https://github.com/openstack/neutron/blob/cf096344b07b80524c3888e44e0b895465598a74/neutron/agent/dhcp/agent.py#L86
   d. Since there is no error handling in the primary loop, all other tasks in the queue are dropped:
https://github.com/openstack/neutron/blob/cf096344b07b80524c3888e44e0b895465598a74/neutron/agent/common/resource_processing_queue.py#L156

As far as I understand, there are two problems:
1. In this commit https://github.com/openstack/neutron/commit/53000704f211bbbd5e439890015891039ef6752e the __lt__ functionality was changed, but the change did not account for router port deletion.

2. The primary worker loop mechanism does not handle unexpected failures such as exceptions or crashes. Is it by design that all remaining tasks are dropped in this case?
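To illustrate problem 2, here is a minimal, self-contained sketch. This is hypothetical code, not the actual neutron implementation; the event names and payload shapes are only illustrative. A queue ordered via __lt__ on fixed_ips[0]['ip_address'] raises KeyError on a router-port deletion payload that lacks that key, and without error handling the remaining tasks never run:

```python
class DHCPEvent:
    """Hypothetical stand-in for a queued DHCP agent resource event."""

    def __init__(self, name, payload):
        self.name = name
        self.payload = payload

    def __lt__(self, other):
        # Like the agent's resource ordering, this assumes every
        # fixed_ips entry carries an 'ip_address' key.
        return (self.payload['fixed_ips'][0]['ip_address'] <
                other.payload['fixed_ips'][0]['ip_address'])


pending = [
    # Router port deletion: the fixed_ips entry has no 'ip_address' key.
    DHCPEvent('port_delete_end', {'fixed_ips': [{'subnet_id': 'abc'}]}),
    DHCPEvent('subnet_delete_end', {'fixed_ips': [{'ip_address': '10.0.0.2'}]}),
    DHCPEvent('network_delete_end', {'fixed_ips': [{'ip_address': '10.0.0.3'}]}),
]

processed = []
try:
    # Ordering the queue calls __lt__, which raises KeyError on the
    # router-port payload before any event is handled.
    for event in sorted(pending):
        processed.append(event.name)
except KeyError:
    # No per-task error handling: the subnet and network deletion
    # events queued behind the bad one are silently dropped.
    pass
```

After the KeyError, `processed` is still empty: neither the subnet nor the network deletion ever ran, which matches the observed behavior.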

Here's a small visualization: TBD

Version:
Yoga

Tags: l3-ipam-dhcp
Revision history for this message
Grisha Tsukerman (grishatsuker) wrote :
tags: added: l3-ipam-dhcp
Changed in neutron:
importance: Undecided → Critical
Revision history for this message
Lajos Katona (lajos-katona) wrote :

Thanks for reporting this issue. As far as I can see in the code, master is affected as well, but I will try to reproduce it anyway.
If I understand your steps correctly, you propose a partial revert of https://review.opendev.org/c/openstack/neutron/+/815853/5/neutron/agent/dhcp/agent.py ?

Revision history for this message
Jeremy Stanley (fungi) wrote :

Since this report concerns a possible security risk, an incomplete security advisory task has been added while the core security reviewers for the affected project or projects confirm the bug and discuss the scope of any vulnerability along with potential solutions.

Can someone explain what the suspected security vulnerability is in this report? Ideally include an exploit scenario describing how a malicious actor would use it to their advantage, and any mitigating factors they'd need to overcome in order to do so. Thanks!

description: updated
Changed in ossa:
status: New → Incomplete
Revision history for this message
Grisha Tsukerman (grishatsuker) wrote :

@Lajos

Part of what I propose is a partial revert of https://review.opendev.org/c/openstack/neutron/+/815853/5/neutron/agent/dhcp/agent.py (restoring the first if statement).

The other part is possibly to add a protection layer to the agent worker queue. Roughly speaking, wrap a try/except around this part: https://github.com/openstack/neutron/blob/ea0bc376a2ec93a78ee5c9d4772a970832dc367b/neutron/agent/common/resource_processing_queue.py#L156-L160
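A rough sketch of what I mean (hypothetical code, not the actual neutron worker loop; the function and names are only illustrative):

```python
import logging

LOG = logging.getLogger(__name__)


def process_queue(tasks, handler):
    """Hypothetical protection layer for the worker loop: a failing
    task is logged and skipped instead of killing the loop and
    silently dropping everything queued behind it."""
    processed, failed = [], []
    for task in tasks:
        try:
            handler(task)
            processed.append(task)
        except Exception:
            LOG.exception("Task %r failed; continuing with the queue", task)
            failed.append(task)
    return processed, failed
```

With something like this in place, a malformed port_delete_end would be logged and skipped, and the subnet/network deletion tasks behind it would still run.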

@Jeremy
As far as I understand, there's a resource that is not properly cleaned up
(specifically, a hanging DHCP process).

Revision history for this message
Grisha Tsukerman (grishatsuker) wrote :

Just an important note - I automate deletion of a network with an extra step of deleting the router beforehand, so I was able to reproduce this 100% of the time.

Revision history for this message
Jeremy Stanley (fungi) wrote :

Thanks, so the vulnerability concern is that a registered user with permission to create and delete networks/subnets/routers could intentionally create a denial of service condition by rapidly repeating that action, starving the environment of available address space to allocate to other tenants before the operators find out and disable their account?

Revision history for this message
Jeremy Stanley (fungi) wrote :

Oh, also, do the leaked network resources continue to count against the tenant's quotas? Otherwise, presumably basic quota checks would prevent them from consuming all available address space.

Revision history for this message
Grisha Tsukerman (grishatsuker) wrote :

@Jeremy, I haven't tried these out. In theory, since the router usually uses the first address of the subnet, it should be fine, but maybe there's a way to work around this; I'm not sure.

But for sure, if one creates a network/subnet and router and then deletes them rapidly, there will be hanging dnsmasq processes and hanging ports on the host.
Repeating this can potentially starve the host of memory and port resources.

Revision history for this message
Lajos Katona (lajos-katona) wrote :

@Grisha: do you have time to propose the change, or shall I find somebody who can work on it? (I am not sure I can allocate time to it in the coming weeks, but I will check.)

Revision history for this message
Grisha Tsukerman (grishatsuker) wrote :

@lajos Sure, I can propose the change (I tested it in my environment and it fixed the issue), but I'm unfamiliar with the testing and CI.

Revision history for this message
Jeremy Stanley (fungi) wrote :

If the Neutron security reviewers don't consider this to be a severe enough exploit risk to warrant privately notifying operators and distributors prior to public disclosure, we could simply switch the report to public now and just follow OpenStack's normal code review and testing workflow instead of bothering with attaching patches to the bug and trying to test locally in secret.

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

This bug is invalid. The referred patches ([1] in particular, in the comments) have the information for this issue: the port resource always has the "fixed_ips" information since [2]. The comment:
    # NOTE(ralonsoh): both resources should have "fixed_ips"
    # information. That key was added to the deleted ports in this
    # patch but this code runs in the Neutron API (server). Both the
    # server and the DHCP agent should be updated.
    # This check could be removed in Y release.

That means that since Wallaby, the Neutron API provides this information in the port resource sent to the DHCP agent via RPC. Any "port" dictionary sent to the DHCP agent has the "ip_address" key, in particular in the port deletion command [3].

The removal of this condition [1] was done in Yoga. That suggests you have updated your DHCP agents to Yoga but your Neutron API is still on a version older than Wallaby, one that doesn't contain [2]. ***Please confirm that.***

This bug, IMO, should not be under an embargo. This issue will leave a leftover resource (a router port) that will need to be removed manually.

Regards.

[1]https://review.opendev.org/c/openstack/neutron/+/815853
[2]https://review.opendev.org/c/openstack/neutron/+/773160
[3]https://review.opendev.org/c/openstack/neutron/+/773160/7/neutron/api/rpc/agentnotifiers/dhcp_rpc_agent_api.py#272

Revision history for this message
Jeremy Stanley (fungi) wrote :

Grisha, Lajos: Can you check Rodolfo's comments against your previous assumptions? If the issue is an outdated/incomplete upgrade or misconfiguration, then we should be safe to switch this report to public immediately. Thanks!

Revision history for this message
Grisha Tsukerman (grishatsuker) wrote :

I'm pretty sure this is not a partial upgrade, since both the server and the agent were upgraded to Yoga.
Unfortunately, I won't be available to verify and recreate this in the coming months.
Please leave me an email address and I'll ask a colleague of mine to take over for me while I'm gone.

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Grisha:

You can update this bug with any new information. If that is not possible, please use <email address hidden> and I'll update this bug with any new info provided.

If you can reproduce this issue, please provide a set of steps (with CLI commands) to try to reproduce it locally and investigate it.

Regards.

Revision history for this message
Jeremy Stanley (fungi) wrote :

We're now several days past the publication deadline for this embargo, so we should not continue to leave the bug report private indefinitely. Since the details requested from the reporter in comment #15 were not provided after nearly 3 months, I'm switching the report to public security now.

Jeremy Stanley (fungi)
description: updated
information type: Private Security → Public Security