dhcp bulk reload fails with python3

Bug #1886969 reported by Boyvinall
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Boyvinall

Bug Description

In Ussuri, setting `bulk_reload_interval` in neutron.conf enables bulk reload mode. The current code appears to be incompatible with Python 3.

With the current Ussuri code, which looks unchanged on master, I get the following error in the Docker logs:

---
Running command: 'neutron-dhcp-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini'
+ exec neutron-dhcp-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 461, in fire_timers
    timer()
  File "/usr/lib/python3.6/site-packages/eventlet/hubs/timer.py", line 59, in __call__
    cb(*args, **kw)
  File "/usr/lib/python3.6/site-packages/eventlet/semaphore.py", line 147, in _do_acquire
    waiter.switch()
  File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 154, in _reload_bulk_allocations
    for network_id in self._network_bulk_allocations.keys():
RuntimeError: dictionary changed size during iteration
---

After this, I see no further updates to the dnsmasq hosts file.
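For reference, the underlying Python 3 behaviour can be reproduced outside neutron in a few lines (a minimal standalone sketch, not the agent code):

```python
# In Python 3, dict.keys() returns a live view of the dict, so deleting
# entries while iterating over it raises RuntimeError, exactly as in the
# traceback above.
allocations = {"net-1": True, "net-2": True}
error = None
try:
    for network_id in allocations.keys():
        del allocations[network_id]
except RuntimeError as exc:
    error = exc

print(error)  # dictionary changed size during iteration
```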

The current code looks like this:

https://github.com/openstack/neutron/blob/56bb42fcbc43b619c3c07897c7de88f29158e4b8/neutron/agent/dhcp/agent.py#L157
---
    def _reload_bulk_allocations(self):
        while True:
            for network_id in self._network_bulk_allocations.keys():
                network = self.cache.get_network_by_id(network_id)
                self.call_driver('bulk_reload_allocations', network)
                del self._network_bulk_allocations[network_id]
            eventlet.greenthread.sleep(self.conf.bulk_reload_interval)
---

I think the problem is the `del` statement: in Python 2, `dict.keys()` returns a list copy, so deleting entries while iterating is safe, but in Python 3 it returns a live view, and mutating the dict during iteration raises RuntimeError. However, I also wonder if there's a race hazard here, with new IDs being appended while the loop runs.

I suspect something like this might work better:

---
    def _reload_bulk_allocations(self):
        while True:
            deleted = self._network_bulk_allocations.copy()
            self._network_bulk_allocations = {}

            for network_id in deleted:
                network = self.cache.get_network_by_id(network_id)
                self.call_driver('bulk_reload_allocations', network)
            eventlet.greenthread.sleep(self.conf.bulk_reload_interval)
---

However, even this is probably susceptible to a race hazard; we probably need a mutex around any update to `self._network_bulk_allocations`.
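For illustration, a lock-guarded version might look something like the sketch below. The class and method names (`BulkReloader`, `add_network`, `drain`) are hypothetical, not from neutron, and the stdlib `threading.Lock` stands in for whatever synchronisation primitive the agent would actually use:

```python
import threading

class BulkReloader:
    """Sketch: guard the pending-network dict with a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}

    def add_network(self, network_id):
        # Writers take the lock too, so an add can never interleave
        # with the swap in drain().
        with self._lock:
            self._pending[network_id] = True

    def drain(self):
        # Swap the dict out atomically under the lock; any add_network()
        # call after the swap lands in the fresh dict and is not lost.
        with self._lock:
            to_reload, self._pending = self._pending, {}
        return list(to_reload)
```

The lock is held only for the dict swap, not for the reload calls themselves, so it stays cheap.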

Revision history for this message
Brian Haley (brian-haley) wrote :

Strange we haven't seen this before, thanks for filing the bug.

I think doing something like you propose would work, and it shouldn't need the `copy()`:

while True:
    to_reload = self._network_bulk_allocations
    self._network_bulk_allocations = {}
    for network_id in to_reload:
        network = self.cache.get_network_by_id(network_id)
        self.call_driver('bulk_reload_allocations', network)
    eventlet.greenthread.sleep(self.conf.bulk_reload_interval)
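To illustrate why the `copy()` is unnecessary: rebinding the attribute to a fresh dict leaves the old dict object intact, still referenced by the local name (a toy sketch, not the agent code):

```python
allocations = {"net-1": True, "net-2": True}

to_reload = allocations   # second reference to the same dict object
allocations = {}          # rebind the name; the old dict is untouched

print(sorted(to_reload))  # ['net-1', 'net-2']
print(allocations)        # {}
```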

Changed in neutron:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Boyvinall (boyvinall) wrote :

I have a fix here https://github.com/boyvinall/neutron/commit/56d3d6552912f59965a36464a7c5071823809a9a but sadly I'm struggling with getting gerrit running. (Not used it before, but `git-review -s` is giving me a 404.) Will try to poke that again a bit later, but currently I'm working more to verify some things before a critical deployment.

Revision history for this message
Boyvinall (boyvinall) wrote :

But yeah, `to_reload` is probably a better name than the `deleted` that I used.

Revision history for this message
Boyvinall (boyvinall) wrote :

Actually, I guess the alternative to a lock is simply to iterate through the `to_reload` list/dict and delete each of its items from `_network_bulk_allocations`. I can see from the history that this code is very sensitive to locks; we've certainly seen problems with the DHCP agent RabbitMQ queues building up. My thinking was that the locks I've put in are held only for a very short time, so they're probably OK. The code above is susceptible to a race hazard if a new item is set in `_network_bulk_allocations` between copying and clearing it.
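That lock-free alternative might look something like this sketch (the function name and the `reload` callback are illustrative, not neutron's):

```python
def drain_pending(pending, reload):
    """Copy the current keys, reload each, then delete only those keys.

    Entries added to `pending` under a *new* key while this runs are
    untouched and get picked up on the next pass; only the keys we
    actually reloaded are removed.
    """
    to_reload = list(pending)
    for network_id in to_reload:
        reload(network_id)
        del pending[network_id]
    return to_reload
```

Note this still races if an existing key is re-added between the copy and the `del`, which is the scenario the lock is meant to cover.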

Revision history for this message
Brian Haley (brian-haley) wrote :

Can I assign this bug to you then?

This link has a good description of the basic workflow in Openstack, hopefully that helps:

https://docs.opendev.org/opendev/infra-manual/latest/developers.html

Revision history for this message
Boyvinall (boyvinall) wrote :

Sure, no problem. FYI, we're already running this patch in production, the bulk reload has made a huge difference for us. :)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/740988

Changed in neutron:
assignee: nobody → Boyvinall (boyvinall)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/742363

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Just a heads-up: this bug is not the same as, but closely related to, https://bugs.launchpad.net/neutron/+bug/1890027. I proposed solving both at once in the patch; everything is focused on the same four lines of code.

Revision history for this message
Slawek Kaplonski (slaweq) wrote : auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

Changed in neutron:
assignee: Boyvinall (boyvinall) → nobody
status: In Progress → New
tags: added: timeout-abandon
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/ussuri)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: stable/ussuri
Review: https://review.opendev.org/740988
Reason: This review is > 4 weeks without comment and currently blocked by a core reviewer with a -2. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and contacting the reviewer with the -2 on this review to ensure you address their concerns.

Boyvinall (boyvinall)
Changed in neutron:
assignee: nobody → Boyvinall (boyvinall)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/742363
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=20b138ff3118029e86f0525695160c4c7ca8b551
Submitter: Zuul
Branch: master

commit 20b138ff3118029e86f0525695160c4c7ca8b551
Author: Matt Vinall <email address hidden>
Date: Thu Jul 9 21:08:21 2020 +0100

    fix dhcp bulk reload exceptions

    1886969 - The bulk reload code was written for python2 and caused
    an exception running under python3. This change works under python3.

    1890027 - There was an additional exception triggered when
    deleting networks - reading the network from the cache returned 'None'
    and this was not properly checked before use.

    Change-Id: I4e546c0e37146b1f34d8b5e6637c407b0c04ad4d
    Closes-Bug: 1886969
    Closes-Bug: 1890027
    Signed-off-by: Matt Vinall <email address hidden>

Changed in neutron:
status: New → Fix Released
tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 16.3.0

This issue was fixed in the openstack/neutron 16.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 17.1.0

This issue was fixed in the openstack/neutron 17.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 18.0.0.0rc1

This issue was fixed in the openstack/neutron 18.0.0.0rc1 release candidate.
