[agent][dhcp] When revision_number matches the one in cache it is not considered stale

Bug #1874400 reported by Arun S A G
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Triaged
Low
Miguel Lavalle

Bug Description

The revision_number is used to check if the message received at the agent is stale or not. A message is treated as stale only if the revision_number in NetworkCache is greater than the revision_number in incoming payload

https://opendev.org/openstack/neutron/src/commit/10230683a2ce2f26279feaa34af4c0eccbfcb16c/neutron/agent/dhcp/agent.py#L852

IMO, we should consider making a message stale if the NetworkCache['revision_number'] >= payload['revision_number']

This will discard lot more messages as stale than the current logic and remove dhcp bottleneck in busy openstack environments.

Arun S A G (sagarun)
description: updated
description: updated
Revision history for this message
Brian Haley (brian-haley) wrote :

I think this was slightly broken with https://review.opendev.org/#/c/373566/ a few years ago:

     def is_port_message_stale(self, payload):
- orig = self.get_port_by_id(payload['id'])
- if orig and orig.get('revision', 0) > payload.get('revision', 0):
+ orig = self.get_port_by_id(payload['id']) or {}
+ if orig.get('revision_number', 0) > payload.get('revision_number', 0):
             return True
         if payload['id'] in self.deleted_ports:
             return True

If the line had stayed 'if orig and orig.get(... ' then it would have also returned False if there was no original port, where now it can return True. So I think you are correct and something like this works better:

if orig and orig.get('revision_number', 0) >= payload.get('revision_number', 0):
    return True

We can then add tests to cover the two new cases:

1) original port not found
2) revision numbers the same

Changed in neutron:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Nate Johnston (nate-johnston) wrote :

I believe this request was anticipated by the author of the relevant code [1]:

[1] https://github.com/openstack/neutron/blob/master/neutron/agent/resource_cache.py#L158-L161

            # NOTE(kevinbenton): we could be strict and check for >=, but this
            # makes us more tolerant of bugs on the server where we forget to
            # bump the revision_number.

Do you have estimates of how much the potential improvement would be, so that we could weigh the risks against the benefits?

Changed in neutron:
importance: Medium → Low
status: Confirmed → Triaged
Revision history for this message
Arun S A G (sagarun) wrote :

To process the first payload without revision_number we could potentially make the comparison

if orig and orig.get('revision_number', -1) >= payload.get('revision_number', 0):
    return True

So if the orig is empty or revision_number is zero (First time) We still process the first message and not discard it. I tested this

 + if orig.get('revision_number', 0) > payload.get('revision_number', 0):

in production this didn't cause any issues.

Revision history for this message
Arun S A G (sagarun) wrote :

Our downstream driver an ISC dhcpd driver deletes and recreates dhcp leases whenever it receives 'port_update_end' message before the BUG* was fixed it deleted and recreated the lease 18 times and caused issues when i tried to boot 50 baremetal servers in a loop, Only 25 of them would succeed and 25 of them failed to get dhcp, because neutron-dhcp-agent was busy deleting and recreating leases 18 times per server.

After the BUG* was fixed the deletes and recreates reduced from 18 to 2, I ran the same test this time all 50 out of 50 hosts succeeded (got the DHCP lease fine and became active).

Most users won't see this problem because if baremetal servers long time to POST and by the time they PXE boot the agent would have done deleting and recreating (responding to port_update_end) messages 18 times.

But if there is a burst of boots at the same time (rally scale test for example) The first bottle neck is neutron-dhcp-agent in our case.

Revision history for this message
Arun S A G (sagarun) wrote :

If we are not comfortable with making this change everywhere, I'd at-least suggest making it just in the DHCP agent.
https://opendev.org/openstack/neutron/src/commit/10230683a2ce2f26279feaa34af4c0eccbfcb16c/neutron/agent/dhcp/agent.py#L852

Miguel Lavalle (minsel)
Changed in neutron:
assignee: nobody → Miguel Lavalle (minsel)
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.