Adding a third DHCP agent to the network fails due to binding index error

Bug #1883513 reported by Justinas Balciunas
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
High
Rodolfo Alonso

Bug Description

Using neutron-16.0.0, Ussuri.

Problem: unable to add the third DHCP agent to the network due to a binding index issue, as follows:

2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource [req-df0a70d4-268e-4c00-a208-e564fde17fb3 390ed4681a7995dc8833b1b176b41595b9d1053357a38de2da9b4de2ab0fa5b5 4e04be9ff9824855be22d0f54176f552 - d1c23cab572446aab6742e5185563575 d1c23cab572446aab6742e5185563575] create failed: No details.: sqlalchemy.exc.InvalidRequestError: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (pymysql.err.IntegrityError) (1062, "Duplicate entry '51893269-80b5-43ed-9696-e44716c8e111-2' for key 'uniq_network_dhcp_agent_binding0network_id0binding_index0'")
[SQL: INSERT INTO networkdhcpagentbindings (network_id, dhcp_agent_id, binding_index) VALUES (%(network_id)s, %(dhcp_agent_id)s, %(binding_index)s)]
[parameters: {'network_id': '51893269-80b5-43ed-9696-e44716c8e111', 'dhcp_agent_id': '90181fe1-4a97-4a5e-a913-386d26ab9fae', 'binding_index': 2}]
(Background on this error at: http://sqlalche.me/e/gkpj) (Background on this error at: http://sqlalche.me/e/7s2a)
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource Traceback (most recent call last):
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/api/v2/resource.py", line 98, in resource
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource result = method(request=request, **args)
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/extensions/dhcpagentscheduler.py", line 50, in create
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource network_id)
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/db/agentschedulers_db.py", line 393, in add_network_to_dhcp_agent
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource context, [agent_db], network_id, force_scheduling=True)
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource next(self.gen)
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 1064, in _transaction_scope
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource yield resource
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource next(self.gen)
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 666, in _session
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource self.session.rollback()
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource self.force_reraise()
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource six.reraise(self.type_, self.value, self.tb)
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource File "/var/lib/kolla/venv/lib/python3.6/site-packages/six.py", line 703, in reraise
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource raise value
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 663, in _session
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource self._end_session_transaction(self.session)
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 691, in _end_session_transaction
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource session.commit()
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource File "/var/lib/kolla/venv/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 1036, in commit
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource self.transaction.commit()
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource File "/var/lib/kolla/venv/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 501, in commit
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource self._assert_active(prepared_ok=True)
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource File "/var/lib/kolla/venv/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 295, in _assert_active
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource code="7s2a",
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource sqlalchemy.exc.InvalidRequestError: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (pymysql.err.IntegrityError) (1062, "Duplicate entry '51893269-80b5-43ed-9696-e44716c8e111-2' for key 'uniq_network_dhcp_agent_binding0network_id0binding_index0'")
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource [SQL: INSERT INTO networkdhcpagentbindings (network_id, dhcp_agent_id, binding_index) VALUES (%(network_id)s, %(dhcp_agent_id)s, %(binding_index)s)]
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource [parameters: {'network_id': '51893269-80b5-43ed-9696-e44716c8e111', 'dhcp_agent_id': '90181fe1-4a97-4a5e-a913-386d26ab9fae', 'binding_index': 2}]
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource (Background on this error at: http://sqlalche.me/e/gkpj) (Background on this error at: http://sqlalche.me/e/7s2a)
2020-06-15 12:23:10.874 25 ERROR neutron.api.v2.resource

Database output for the DHCP agent bindings shows that:

MariaDB [neutron]> select * from networkdhcpagentbindings;
+--------------------------------------+--------------------------------------+---------------+
| network_id | dhcp_agent_id | binding_index |
+--------------------------------------+--------------------------------------+---------------+
| 51893269-80b5-43ed-9696-e44716c8e111 | 4544bcca-5d9c-4573-a76d-1db641d5b79d | 1 |
| 51893269-80b5-43ed-9696-e44716c8e111 | 4f0671c6-60e5-4d5b-adff-ee29a91f7574 | 2 |
+--------------------------------------+--------------------------------------+---------------+

There are three up and running DHCP agents:

root@kolla-ussuri:/etc/kolla# neutron agent-list | grep -i dhcp
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
| 4544bcca-5d9c-4573-a76d-1db641d5b79d | DHCP agent | z141-ost-ctrl1 | AZ2 | :-) | True | neutron-dhcp-agent |
| 4f0671c6-60e5-4d5b-adff-ee29a91f7574 | DHCP agent | z141-ost-ctrl2 | AZ2 | :-) | True | neutron-dhcp-agent |
| 90181fe1-4a97-4a5e-a913-386d26ab9fae | DHCP agent | z141-ost-ctrl3 | AZ2 | :-) | True | neutron-dhcp-agent |

And the error happens upon adding the third DHCP agent, however, as visible from the error output above, the binding index stays at 2, even though it should be 3.

tags: added: db l3-ipam-dhcp
Revision history for this message
Bernard Cafarelli (bcafarel) wrote :

Is it on initial deployment, or after a scaleout like changing dhcp_agents_per_network? 100% reproducible?

If not 100% and at initial deploy, my hunch would be some race condition

Revision history for this message
Justinas Balciunas (justinas-balciunas) wrote :

After the inital deployment. I've tried removing both DHCP agents from the network and add them back.

I can add two of them only, regardless of the specific agent location, i.e. ctrl-1 and ctrl-2, or ctrl-1 and ctrl-3, or ctrl-2 and ctrl-3, but not all three of them as I hit the binding index error upon adding the third one.

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

Same questions as Bernard and a couple more:
- Did you schedule more than one agent at the same time? Could be (this is just a quick guess) related to [1], when the agent binding index is calculated, as you mentioned.
- Do you have the same problem with other networks?

Regards.

[1]I1bc3f8b69c337f7c1cf7375509a0da61def9baf1

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

What is the value you have in "dhcp_agents_per_network"?

Revision history for this message
Justinas Balciunas (justinas-balciunas) wrote :

I've tried adding them one by one via CLI, e.g. like neutron dhcp-agent-network-add 90181fe1-4a97-4a5e-a913-386d26ab9fae az2-vlan1083 and via Horizon - same result.

The issue persists regardless of the network, i.e. not more than 2 DHCP agents for any given network, of any type: private, provider, external.

The issue is reproducible 100%.

Revision history for this message
Justinas Balciunas (justinas-balciunas) wrote :

I'm a quite embarrased now because dhcp_agents_per_network = 3 was present in the kolla-ansible configuration files but were not merged to the actual deployment, after adding the entry manually to neutron.conf and restarting neutron-server all three DHCP agents were added sucessfully, apologies.

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

No problem! At least we found the error. Good to know this.

Changed in neutron:
status: New → Invalid
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Even if dhcp_agents_per_network is set to some lower value than number of agents that user wants to use, it should still works when adding manually. So IMO this issue shouldn't be closed and we should check it.

Changed in neutron:
status: Invalid → New
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hi Slawek:

You are right: [1] if "force_scheduling" is True, we should provide a valid index and schedule this new DHCP agent.

The bug is legit and we should fix it.

Regards.

[1]https://review.opendev.org/#/c/288271/28/neutron/scheduler/dhcp_agent_scheduler.py@211

Changed in neutron:
importance: Undecided → Medium
importance: Medium → High
status: New → Incomplete
status: Incomplete → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/735872

Changed in neutron:
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/738178

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/735872
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ed56429548dd8d08de10a7d451680a43e34c865c
Submitter: Zuul
Branch: master

commit ed56429548dd8d08de10a7d451680a43e34c865c
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Tue Jun 16 10:28:50 2020 +0000

    DhcpFilter should always return a valid index if "force_scheduling"

    WHen retrieving a vacant DHCP agent binding index, if
    "force_scheduling" is set, the method should return a valid binding
    index. If the existing binding indexes are sequentially aligned,
    the method will return a new one on top; if there is a gap in the
    binding indexes list, the first free index will be returned.

    Change-Id: Ib4cbeb7c9f0c1e959ad53570320610925ff3d88f
    Closes-Bug: #1883513

Changed in neutron:
status: In Progress → Fix Released
tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ussuri)

Reviewed: https://review.opendev.org/738178
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1b86aff85e9d3c17e8243070d004da0cde1c0bdb
Submitter: Zuul
Branch: stable/ussuri

commit 1b86aff85e9d3c17e8243070d004da0cde1c0bdb
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Tue Jun 16 10:28:50 2020 +0000

    DhcpFilter should always return a valid index if "force_scheduling"

    WHen retrieving a vacant DHCP agent binding index, if
    "force_scheduling" is set, the method should return a valid binding
    index. If the existing binding indexes are sequentially aligned,
    the method will return a new one on top; if there is a gap in the
    binding indexes list, the first free index will be returned.

    Change-Id: Ib4cbeb7c9f0c1e959ad53570320610925ff3d88f
    Closes-Bug: #1883513
    (cherry picked from commit ed56429548dd8d08de10a7d451680a43e34c865c)

tags: added: in-stable-ussuri
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.