dhcp bulk reload fails with python3

Bug #1886969 reported by Boyvinall
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Boyvinall

Bug Description

In Ussuri, setting `bulk_reload_interval` in neutron.conf enables bulk reload mode. The current code appears to be incompatible with Python 3.

With the current Ussuri code, which looks unchanged on master, I get the following error in the Docker logs:

---
Running command: 'neutron-dhcp-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini'
+ exec neutron-dhcp-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 461, in fire_timers
    timer()
  File "/usr/lib/python3.6/site-packages/eventlet/hubs/timer.py", line 59, in __call__
    cb(*args, **kw)
  File "/usr/lib/python3.6/site-packages/eventlet/semaphore.py", line 147, in _do_acquire
    waiter.switch()
  File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 154, in _reload_bulk_allocations
    for network_id in self._network_bulk_allocations.keys():
RuntimeError: dictionary changed size during iteration
---

After this, I see no further updates to the dnsmasq hosts file.
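For reference, the underlying Python 3 behaviour can be reproduced outside neutron in a few lines (a minimal standalone sketch, not the agent code):

```python
# In Python 3, dict.keys() returns a live view of the dict, so deleting
# entries while iterating over it raises RuntimeError, exactly as in the
# traceback above.
allocations = {"net-1": True, "net-2": True}
error = None
try:
    for network_id in allocations.keys():
        del allocations[network_id]
except RuntimeError as exc:
    error = exc

print(error)  # dictionary changed size during iteration
```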

The current code looks like this:

https://github.com/openstack/neutron/blob/56bb42fcbc43b619c3c07897c7de88f29158e4b8/neutron/agent/dhcp/agent.py#L157
---
    def _reload_bulk_allocations(self):
        while True:
            for network_id in self._network_bulk_allocations.keys():
                network = self.cache.get_network_by_id(network_id)
                self.call_driver('bulk_reload_allocations', network)
                del self._network_bulk_allocations[network_id]
            eventlet.greenthread.sleep(self.conf.bulk_reload_interval)
---

I think the problem is the `del` statement: in Python 2, `dict.keys()` returns a list copy, so deleting entries while iterating is safe, but in Python 3 it returns a live view, and mutating the dict during iteration raises RuntimeError. However, I also wonder if there's a race hazard here, with new IDs being appended while the loop runs.

I suspect something like this might work better:

---
    def _reload_bulk_allocations(self):
        while True:
            deleted = self._network_bulk_allocations.copy()
            self._network_bulk_allocations = {}

            for network_id in deleted:
                network = self.cache.get_network_by_id(network_id)
                self.call_driver('bulk_reload_allocations', network)
            eventlet.greenthread.sleep(self.conf.bulk_reload_interval)
---

However, even this is probably susceptible to a race hazard; we probably need a mutex around any update to `self._network_bulk_allocations`.
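For illustration, a lock-guarded version might look something like the sketch below. The class and method names (`BulkReloader`, `add_network`, `drain`) are hypothetical, not from neutron, and the stdlib `threading.Lock` stands in for whatever synchronisation primitive the agent would actually use:

```python
import threading

class BulkReloader:
    """Sketch: guard the pending-network dict with a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}

    def add_network(self, network_id):
        # Writers take the lock too, so an add can never interleave
        # with the swap in drain().
        with self._lock:
            self._pending[network_id] = True

    def drain(self):
        # Swap the dict out atomically under the lock; any add_network()
        # call after the swap lands in the fresh dict and is not lost.
        with self._lock:
            to_reload, self._pending = self._pending, {}
        return list(to_reload)
```

The lock is held only for the dict swap, not for the reload calls themselves, so it stays cheap.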

Revision history for this message
Brian Haley (brian-haley) wrote :

Strange we haven't seen this before, thanks for filing the bug.

I think doing something like you propose would work, and it shouldn't need the `copy()`:

while True:
    to_reload = self._network_bulk_allocations
    self._network_bulk_allocations = {}
    for network_id in to_reload:
        network = self.cache.get_network_by_id(network_id)
        self.call_driver('bulk_reload_allocations', network)
    eventlet.greenthread.sleep(self.conf.bulk_reload_interval)
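To illustrate why the `copy()` is unnecessary: rebinding the attribute to a fresh dict leaves the old dict object intact, still referenced by the local name (a toy sketch, not the agent code):

```python
allocations = {"net-1": True, "net-2": True}

to_reload = allocations   # second reference to the same dict object
allocations = {}          # rebind the name; the old dict is untouched

print(sorted(to_reload))  # ['net-1', 'net-2']
print(allocations)        # {}
```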

Changed in neutron:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Boyvinall (boyvinall) wrote :

I have a fix here https://github.com/boyvinall/neutron/commit/56d3d6552912f59965a36464a7c5071823809a9a but sadly I'm struggling with getting gerrit running. (Not used it before, but `git-review -s` is giving me a 404.) Will try to poke that again a bit later, but currently I'm working more to verify some things before a critical deployment.

Revision history for this message
Boyvinall (boyvinall) wrote :

But yeah, `to_reload` is probably a better name than the `deleted` that I used.

Revision history for this message
Boyvinall (boyvinall) wrote :

Actually, I guess the alternative to a lock is simply to iterate through the `to_reload` list/dict and delete each of its items from `_network_bulk_allocations`. I can see from the history that this code is very sensitive to locks; we've certainly seen problems with the DHCP agent RabbitMQ queues building up. My thinking was that the locks I've put in are held only for a very short time, so they're probably OK. The code above is susceptible to a race hazard if a new item is set in `_network_bulk_allocations` between copying and clearing it.
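That lock-free alternative might look something like this sketch (the function name and the `reload` callback are illustrative, not neutron's):

```python
def drain_pending(pending, reload):
    """Copy the current keys, reload each, then delete only those keys.

    Entries added to `pending` under a *new* key while this runs are
    untouched and get picked up on the next pass; only the keys we
    actually reloaded are removed.
    """
    to_reload = list(pending)
    for network_id in to_reload:
        reload(network_id)
        del pending[network_id]
    return to_reload
```

Note this still races if an existing key is re-added between the copy and the `del`, which is the scenario the lock is meant to cover.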

Revision history for this message
Brian Haley (brian-haley) wrote :

Can I assign this bug to you then?

This link has a good description of the basic workflow in Openstack, hopefully that helps:

https://docs.opendev.org/opendev/infra-manual/latest/developers.html

Revision history for this message
Boyvinall (boyvinall) wrote :

Sure, no problem. FYI, we're already running this patch in production, the bulk reload has made a huge difference for us. :)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/740988

Changed in neutron:
assignee: nobody → Boyvinall (boyvinall)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/742363

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Just a heads-up: this bug is not the same as, but closely related to, https://bugs.launchpad.net/neutron/+bug/1890027. I proposed solving both at once in the patch; everything is focused on the same four lines of code.

Revision history for this message
Slawek Kaplonski (slaweq) wrote : auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

Changed in neutron:
assignee: Boyvinall (boyvinall) → nobody
status: In Progress → New
tags: added: timeout-abandon
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/ussuri)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: stable/ussuri
Review: https://review.opendev.org/740988
Reason: This review is > 4 weeks without comment and currently blocked by a core reviewer with a -2. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and contacting the reviewer with the -2 on this review to ensure you address their concerns.

Boyvinall (boyvinall)
Changed in neutron:
assignee: nobody → Boyvinall (boyvinall)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/742363
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=20b138ff3118029e86f0525695160c4c7ca8b551
Submitter: Zuul
Branch: master

commit 20b138ff3118029e86f0525695160c4c7ca8b551
Author: Matt Vinall <email address hidden>
Date: Thu Jul 9 21:08:21 2020 +0100

    fix dhcp bulk reload exceptions

    1886969 - The bulk reload code was written for python2 and caused
    an exception running under python3. This change works under python3.

    1890027 - There was an additional exception triggered when
    deleting networks - reading the network from the cache returned 'None'
    and this was not properly checked before use.

    Change-Id: I4e546c0e37146b1f34d8b5e6637c407b0c04ad4d
    Closes-Bug: 1886969
    Closes-Bug: 1890027
    Signed-off-by: Matt Vinall <email address hidden>

Changed in neutron:
status: New → Fix Released
tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 16.3.0

This issue was fixed in the openstack/neutron 16.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 17.1.0

This issue was fixed in the openstack/neutron 17.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 18.0.0.0rc1

This issue was fixed in the openstack/neutron 18.0.0.0rc1 release candidate.
