[OVN] Deadlock when starting neutron server, during the OVN hash ring deletion
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
neutron | Fix Released | Medium | Lucas Alvares Gomes | | |
Bug Description
Related bugzilla: https:/
Description of problem:
Neutron server often fails to start, and systemd has to restart it. This is a problem at scale because every worker then needs to reconnect to the OVN DBs.
How reproducible:
50%
Steps to Reproduce:
1. Start neutron server
Error log: https:/
Changed in neutron: | |
assignee: | nobody → Rodolfo Alonso (rodolfo-alonso-hernandez) |
importance: | Undecided → Medium |
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote : | #1 |
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master) | #2 |
Fix proposed to branch: master
Review: https:/
Changed in neutron: | |
status: | New → In Progress |
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master) | #3 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit 819a1bb3e6f3b10
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Tue Sep 20 13:32:04 2022 +0200
Move the "ovn_hash_ring" clean up to maintenance worker
The "ovn_hash_ring" procedure to clean up the stale/old registers
is now executed on the ``HashRingHealt
is executed on the ``MaintenanceWo
In a HA scenario, if several servers are rebooted at the same time,
the "ovn_hash_ring" clean up operation can clash with API worker
method "_load_hash_ring", that executes a SQL read from this table.
In some highly loaded environments, if the OVN database takes time
to be locally cached, this read operation is executed thousands of
times; basically any time an OVN database event occurs.
In order to avoid/skip a deadlock when deleting the "ovn_hash_ring"
table, this clean up is executed in a periodic task. If this task
succeeds, the task is stopped. If the task raises a database
exception, it is processed again.
Now the "ovn_hash_ring" registers are retrieved using the
"created_at" time as a filter. The initial time is taken when the
OVN mechanism driver is initialized, before any API worker is spawned
and any new "ovn_hash_ring" register has been created (an API
worker, when started, will create a new "ovn_hash_ring" register).
Any stale/old register stored in this table will be ignored; that
means any register created before the OVN mechanism driver was
started.
Closes-Bug: #1990174
Change-Id: I07c4cb6e20b8a8
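The "created_at" filter described in this commit message can be illustrated with a minimal sketch. This is not the Neutron code: it uses an in-memory SQLite database, a simplified three-column schema, and hypothetical helper names (`create_hash_ring_table`, `add_node`, `get_nodes`), all of which are assumptions made for illustration. It only shows that rows created before the driver start time are ignored by the read, so they do not need to be deleted first.

```python
import datetime
import sqlite3


def create_hash_ring_table(conn):
    # Simplified, hypothetical schema: only the columns this sketch needs.
    conn.execute(
        "CREATE TABLE ovn_hash_ring ("
        " node_uuid TEXT PRIMARY KEY,"
        " hostname TEXT NOT NULL,"
        " created_at TEXT NOT NULL)"
    )


def add_node(conn, node_uuid, hostname, created_at):
    # Fixed-width ISO 8601 strings compare in chronological order.
    conn.execute(
        "INSERT INTO ovn_hash_ring VALUES (?, ?, ?)",
        (node_uuid, hostname, created_at.isoformat(timespec="microseconds")),
    )


def get_nodes(conn, hostname, from_time):
    """Return node UUIDs for `hostname` created at or after `from_time`."""
    rows = conn.execute(
        "SELECT node_uuid FROM ovn_hash_ring"
        " WHERE hostname = ? AND created_at >= ?"
        " ORDER BY node_uuid",
        (hostname, from_time.isoformat(timespec="microseconds")),
    )
    return [row[0] for row in rows]


def demo():
    conn = sqlite3.connect(":memory:")
    create_hash_ring_table(conn)
    # Time taken when the (hypothetical) driver starts, before any worker.
    driver_start = datetime.datetime(2022, 9, 20, 13, 0, 0)
    # A register left over from a previous run of the server.
    add_node(conn, "stale-node", "controller-0",
             driver_start - datetime.timedelta(hours=1))
    # Registers created by API workers after the driver started.
    add_node(conn, "worker-a", "controller-0",
             driver_start + datetime.timedelta(seconds=2))
    add_node(conn, "worker-b", "controller-0",
             driver_start + datetime.timedelta(seconds=3))
    return get_nodes(conn, "controller-0", driver_start)
```

Calling `demo()` returns only the two worker registers; the stale one is still stored in the table but is filtered out by the read.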
Changed in neutron: | |
status: | In Progress → Fix Released |
Lucas Alvares Gomes (lucasagomes) wrote : | #4 |
Re-opening this as the fix has been reverted at https:/
Changed in neutron: | |
status: | Fix Released → Confirmed |
assignee: | Rodolfo Alonso (rodolfo-alonso-hernandez) → Lucas Alvares Gomes (lucasagomes) |
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master) | #5 |
Fix proposed to branch: master
Review: https:/
Changed in neutron: | |
status: | Confirmed → In Progress |
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote : | #6 |
Thanks Lucas for taking care of this LP bug.
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master) | #7 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit b7b8f7c57144057
Author: Lucas Alvares Gomes <email address hidden>
Date: Tue Oct 11 11:08:23 2022 +0100
[OVN] Avoid deadlock when cleaning hash ring nodes
This patch avoids the clash between the hash ring cleaning operation and
the API workers by ensuring that the cleaning happens before the nodes for
that host are added to the ring and before the connections to the OVSDBs
are established (meaning no events, and therefore no SELECTs on the hash
ring table for that hostname).
This patch does this by re-using the same hash ring lock that starts
the probing thread. Now, the first worker that acquires the lock is
responsible for cleaning the hash ring for its own host as well as
starting the probing thread. Subsequent workers only need to register
themselves to the hash ring.
Change-Id: Iba73f7944592a0
Closes-Bug: #1990174
Signed-off-by: Lucas Alvares Gomes <email address hidden>
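The "first worker cleans, later workers only register" ordering described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: Neutron API workers are separate processes, while this sketch uses threads and a `threading.Lock` as a stand-in for the shared hash ring lock, and the names `HostRingBootstrap`, `register` and `start_workers` are hypothetical.

```python
import threading


class HostRingBootstrap:
    """Hypothetical stand-in for the per-host state shared by API workers."""

    def __init__(self, stale_nodes):
        self._lock = threading.Lock()
        self._initialized = False
        # Registers left over from a previous run on this host.
        self.nodes = list(stale_nodes)
        self.cleanups = 0
        self.probing_started = False

    def register(self, node_id):
        with self._lock:
            if not self._initialized:
                # First worker only: clean the ring for this host before
                # any node is added, then start the probing thread
                # (represented here by a flag).
                self.nodes.clear()
                self.cleanups += 1
                self.probing_started = True
                self._initialized = True
            # Every worker, including the first one, registers itself.
            self.nodes.append(node_id)


def start_workers(bootstrap, count):
    threads = [
        threading.Thread(target=bootstrap.register, args=(f"worker-{i}",))
        for i in range(count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

Because the clean up and the first registration happen under the same lock, no worker can add its node before the stale nodes are removed, and the clean up runs exactly once per host regardless of which worker wins the race.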
Changed in neutron: | |
status: | In Progress → Fix Released |
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/zed) | #8 |
Fix proposed to branch: stable/zed
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/yoga) | #9 |
Fix proposed to branch: stable/yoga
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/xena) | #10 |
Fix proposed to branch: stable/xena
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/wallaby) | #11 |
Fix proposed to branch: stable/wallaby
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/victoria) | #12 |
Fix proposed to branch: stable/victoria
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ussuri) | #13 |
Fix proposed to branch: stable/ussuri
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/zed) | #14 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/zed
commit 3f8b96ec647eb4d
Author: Lucas Alvares Gomes <email address hidden>
Date: Tue Oct 11 11:08:23 2022 +0100
[OVN] Avoid deadlock when cleaning hash ring nodes
Change-Id: Iba73f7944592a0
Closes-Bug: #1990174
Signed-off-by: Lucas Alvares Gomes <email address hidden>
(cherry picked from commit b7b8f7c57144057
tags: | added: in-stable-zed |
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ussuri) | #15 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/ussuri
commit 2adb471f7575ddc
Author: Lucas Alvares Gomes <email address hidden>
Date: Tue Oct 11 11:08:23 2022 +0100
[OVN] Avoid deadlock when cleaning hash ring nodes
Conflicts:
neutron/
Change-Id: Iba73f7944592a0
Closes-Bug: #1990174
Signed-off-by: Lucas Alvares Gomes <email address hidden>
(cherry picked from commit b7b8f7c57144057
tags: | added: in-stable-ussuri |
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/xena) | #16 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/xena
commit 921713cbdcf8e96
Author: Lucas Alvares Gomes <email address hidden>
Date: Tue Oct 11 11:08:23 2022 +0100
[OVN] Avoid deadlock when cleaning hash ring nodes
Conflicts:
neutron/
Change-Id: Iba73f7944592a0
Closes-Bug: #1990174
Signed-off-by: Lucas Alvares Gomes <email address hidden>
(cherry picked from commit b7b8f7c57144057
tags: | added: in-stable-xena |
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/yoga) | #17 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/yoga
commit dd7fc476841649d
Author: Lucas Alvares Gomes <email address hidden>
Date: Tue Oct 11 11:08:23 2022 +0100
[OVN] Avoid deadlock when cleaning hash ring nodes
Conflicts:
neutron/
Change-Id: Iba73f7944592a0
Closes-Bug: #1990174
Signed-off-by: Lucas Alvares Gomes <email address hidden>
(cherry picked from commit b7b8f7c57144057
tags: | added: in-stable-yoga |
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/wallaby) | #18 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/wallaby
commit 049435eeabaef2f
Author: Lucas Alvares Gomes <email address hidden>
Date: Tue Oct 11 11:08:23 2022 +0100
[OVN] Avoid deadlock when cleaning hash ring nodes
Conflicts:
neutron/
Change-Id: Iba73f7944592a0
Closes-Bug: #1990174
Signed-off-by: Lucas Alvares Gomes <email address hidden>
(cherry picked from commit b7b8f7c57144057
tags: | added: in-stable-wallaby |
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/victoria) | #19 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/victoria
commit c336814d0dcdb60
Author: Lucas Alvares Gomes <email address hidden>
Date: Tue Oct 11 11:08:23 2022 +0100
[OVN] Avoid deadlock when cleaning hash ring nodes
Conflicts:
neutron/
Change-Id: Iba73f7944592a0
Closes-Bug: #1990174
Signed-off-by: Lucas Alvares Gomes <email address hidden>
(cherry picked from commit b7b8f7c57144057
tags: | added: in-stable-victoria |
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 18.6.0 | #20 |
This issue was fixed in the openstack/neutron 18.6.0 release.
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/networking-ovn train-eol | #21 |
This issue was fixed in the openstack/
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 19.5.0 | #22 |
This issue was fixed in the openstack/neutron 19.5.0 release.
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 22.0.0.0rc1 | #23 |
This issue was fixed in the openstack/neutron 22.0.0.0rc1 release candidate.
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 20.3.0 | #25 |
This issue was fixed in the openstack/neutron 20.3.0 release.
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 21.1.0 | #26 |
This issue was fixed in the openstack/neutron 21.1.0 release.
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron ussuri-eol | #27 |
This issue was fixed in the openstack/neutron ussuri-eol release.
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron victoria-eom | #28 |
This issue was fixed in the openstack/neutron victoria-eom release.
The "ovn_hash_ring" register deletion [1] is done during the OVN mech driver initialization. This initialization is performed before any API worker is spawned, which means the Neutron service (and its workers) cannot interfere with the database access.
However, in an HA scenario, several servers can be rebooted at the same time. If the OVN database sync is not performed quickly, the API workers will call "_load_hash_ring" [2] each time a new event is received, in order to check whether the event needs to be notified. This method executes a read on the "ovn_hash_ring" table of the Neutron database (NOTE 1). In some environments, we have seen up to a quarter of a million calls to this method from a single controller node during the transient period while the OVN database is being locally cached.
These continuous Neutron database read calls can block a write (delete) operation, as reported. The proposed idea (that will be posted in a patch) is to move the table clean up method out of the mech driver initialization and add it as a periodic method in the "HashRingHealthCheckPeriodics" class.
The issue to be resolved is how to deal with stale/old "ovn_hash_ring" registers before the clean up method has been called.
NOTE 1: Neutron database != OVN database. The Neutron database is the SQL database that stores the Neutron information; it is not tied to any specific network backend.
[1] https://github.com/openstack/neutron/blob/0a6b9cc39527c13fc663709ae9da3ac033407423/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py#L128
[2] https://github.com/openstack/neutron/blob/d0e33b9f3212d30654470ea5f218c6a8fb1662bf/neutron/common/ovn/hash_ring_manager.py#L75-L94
[3] https://github.com/openstack/neutron/blob/d0e33b9f3212d30654470ea5f218c6a8fb1662bf/neutron/db/ovn_hash_ring_db.py#L61
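The periodic clean up proposed above (run the delete as a periodic task, stop once it succeeds, run it again if the database raises) can be sketched with the standard library only. This is not the Neutron maintenance worker: `run_until_success` and `FlakyCleanup` are hypothetical names, and `sqlite3.OperationalError` merely stands in for whatever database exception the real delete would raise.

```python
import sqlite3
import time


def run_until_success(task, retry_on, interval=1.0, max_attempts=10,
                      sleep=time.sleep):
    """Call `task` until it stops raising `retry_on`; return the attempt count."""
    for attempt in range(1, max_attempts + 1):
        try:
            task()
        except retry_on:
            # The delete clashed with other database activity: wait and
            # process the task again on the next iteration.
            if attempt < max_attempts:
                sleep(interval)
        else:
            # The clean up succeeded, so the task is stopped for good.
            return attempt
    raise RuntimeError(
        "clean up did not succeed after %d attempts" % max_attempts)


class FlakyCleanup:
    """Hypothetical clean up that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures_left = failures
        self.done = False

    def __call__(self):
        if self.failures_left > 0:
            self.failures_left -= 1
            raise sqlite3.OperationalError("database is locked")
        self.done = True
```

For example, `run_until_success(FlakyCleanup(2), sqlite3.OperationalError, interval=0)` performs three attempts: two that fail and one that completes the clean up. The `max_attempts` bound is an addition of this sketch, so that a permanently failing delete does not retry forever.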