Metadata service race condition (WARNING: Metadata service is not ready for port)

Bug #1831224 reported by Lucas Alvares Gomes on 2019-05-31
This bug affects 1 person
Affects: networking-ovn
Importance: High
Assigned to: Lucas Alvares Gomes

Bug Description

The metadata agent maintains a record of the list of networks that it currently has proxies running on in the external_ids column of the Chassis entry (reference to the hypervisor the service is running on) in the OVN Southbound database (neutron-metadata-proxy-networks=UUID0,UUID1).

When multiple instances are being spawned and/or torn down at the same time, multiple workers may race while updating that value. One of the problems is that in OVSDB the external_ids column only supports values of type String, so in order to update it the Python code has to: read that string, transform it into a list, add/remove a value, transform it back to a string and commit it to the database [0].
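
To make the lost update concrete, here is a minimal sketch of that read/transform/commit sequence with two workers interleaved by hand. It is not the agent's code: the plain dict stands in for the Chassis external_ids column, and the add_network helper and worker names are hypothetical.

```python
KEY = "neutron-metadata-proxy-networks"

# Stand-in for the Chassis external_ids column (assumption: a plain dict,
# not a real OVSDB connection).
external_ids = {KEY: "UUID0"}


def add_network(current, network_id):
    """Hypothetical helper: string -> set -> add value -> string."""
    networks = set(filter(None, current.split(",")))
    networks.add(network_id)
    return ",".join(sorted(networks))


# Both workers read the value before either of them commits.
seen_by_a = external_ids[KEY]
seen_by_b = external_ids[KEY]

# Worker A commits its update, then worker B commits one built from a
# stale read, which overwrites A's change.
external_ids[KEY] = add_network(seen_by_a, "UUID1")
external_ids[KEY] = add_network(seen_by_b, "UUID2")

result = external_ids[KEY]
print(result)  # UUID0,UUID2 (UUID1 has been lost)
```

Because each commit writes the whole string rather than a single element, whichever worker commits last wins.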

In this process, multiple updates can overwrite each other, and when that happens the following warning can be seen in neutron/server.log:

WARNING networking_ovn.ml2.mech_driver [req-62840f48-7a4d-4c00-8978-3e2ad90e2b1d - - - - -] Metadata service is not ready for port 77863715-28f8-446b-8acb-2ea460373ecb, check networking-ovn-metadata-agent status/logs.

This also results in instances failing to become accessible, since the datapath that serves the metadata hasn't been set up correctly.

I was able to reproduce this problem consistently by running BrowBeat [1] with the netcreate-boot-ping [2] scenario (concurrency: 50, times: 150) in a lab with 3 controllers and 2 computes.

[0] https://github.com/openstack/networking-ovn/blob/6e909e252d8f10b2d9c9aeba49a31850962cef1d/networking_ovn/agent/metadata/agent.py#L437-L457
[1] https://github.com/openstack/browbeat
[2] https://github.com/openstack/browbeat/blob/86bfc78e253677cf6029b2aabff1ba1300500fd1/browbeat-config.yaml#L224-L229

Changed in networking-ovn:
status: New → Confirmed
assignee: nobody → Lucas Alvares Gomes (lucasagomes)
importance: Undecided → High

Fix proposed to branch: master
Review: https://review.opendev.org/662431

Changed in networking-ovn:
status: Confirmed → In Progress

Reviewed: https://review.opendev.org/662431
Committed: https://git.openstack.org/cgit/openstack/networking-ovn/commit/?id=d84cb7938a153bd22a6fae5adbfbf8ec3b819bda
Submitter: Zuul
Branch: master

commit d84cb7938a153bd22a6fae5adbfbf8ec3b819bda
Author: Lucas Alvares Gomes <email address hidden>
Date: Fri May 31 11:39:39 2019 +0100

    Fix metadata agent proxy list updates race condition

    When multiple instances are being spawned and/or torn down at the
    same time, multiple updates to the "neutron-metadata-proxy-networks"
    list may race each other, causing some network UUIDs to go missing
    in the process.

    One of the problems with these updates is that the external_ids column
    in the OVSDBs only supports string type values, so the "list" is
    actually a comma-separated string. In order to add or remove an item
    from it, the code in networking-ovn needs to transform it from and
    back to a string and commit the whole thing to the OVN database when
    saving it.

    When it races, commits may overwrite each other.

    In order to fix this problem, this patch adds a file lock on the
    method that manipulates this list and commits it to the database. It's
    important to note that the metadata agent only updates the Chassis entry
    which it's currently running on; therefore we do not need a distributed
    lock (the file lock, being local, works).

    I was able to reproduce this problem fairly consistently by concurrently
    spawning 50 instances each time with BrowBeat (see bug ticket for more
    information). After this fix was applied I could attest that the problem
    was resolved.

    Change-Id: Idc7bae88ed6f3ec44541b38f559a0296d2159449
    Closes-Bug: #1831224
    Signed-off-by: Lucas Alvares Gomes <email address hidden>
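
The idea behind the fix can be sketched as follows. This is only an illustration using the standard library (fcntl.flock, Unix only), not the code from the patch: the lock file path, the file_lock and update_networks helpers, and the dict standing in for the Chassis external_ids column are all assumptions made for the example.

```python
import fcntl
import os
import tempfile
import threading
import time
from contextlib import contextmanager

KEY = "neutron-metadata-proxy-networks"
# Hypothetical lock file location, chosen only for this example.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "metadata-proxy-networks.lock")

# Stand-in for the Chassis external_ids column (not a real OVSDB connection).
external_ids = {KEY: ""}


@contextmanager
def file_lock(path):
    """Hold an exclusive advisory lock on path for the duration of the block."""
    with open(path, "a") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def update_networks(network_id, remove=False):
    """Read, modify and commit the comma-separated string under the lock."""
    with file_lock(LOCK_PATH):
        networks = set(filter(None, external_ids[KEY].split(",")))
        if remove:
            networks.discard(network_id)
        else:
            networks.add(network_id)
        time.sleep(0.001)  # widen the window in which a race could happen
        external_ids[KEY] = ",".join(sorted(networks))


expected = {"UUID%d" % i for i in range(20)}
threads = [threading.Thread(target=update_networks, args=(n,)) for n in expected]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

stored = set(external_ids[KEY].split(","))
print(stored == expected)
```

The whole read/modify/commit sequence sits inside the lock, so no worker can commit a value built from a stale read. A lock file on the local filesystem is enough here for the reason given in the commit message: only the agent on a given host updates that host's Chassis entry.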

Changed in networking-ovn:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/663314
Committed: https://git.openstack.org/cgit/openstack/networking-ovn/commit/?id=5fe83cbcb5d2acc8af5e05d65345b114f859a359
Submitter: Zuul
Branch: stable/stein

commit 5fe83cbcb5d2acc8af5e05d65345b114f859a359
Author: Lucas Alvares Gomes <email address hidden>
Date: Fri May 31 11:39:39 2019 +0100

    Fix metadata agent proxy list updates race condition

    (cherry picked from commit d84cb7938a153bd22a6fae5adbfbf8ec3b819bda)

tags: added: in-stable-stein

Reviewed: https://review.opendev.org/663316
Committed: https://git.openstack.org/cgit/openstack/networking-ovn/commit/?id=9d9a2b0d55e808d155461df6c4614a3b506a6c4e
Submitter: Zuul
Branch: stable/rocky

commit 9d9a2b0d55e808d155461df6c4614a3b506a6c4e
Author: Lucas Alvares Gomes <email address hidden>
Date: Fri May 31 11:39:39 2019 +0100

    Fix metadata agent proxy list updates race condition

    (cherry picked from commit d84cb7938a153bd22a6fae5adbfbf8ec3b819bda)

tags: added: in-stable-rocky

Reviewed: https://review.opendev.org/663317
Committed: https://git.openstack.org/cgit/openstack/networking-ovn/commit/?id=dfa4692eb27a99104cc8df24fcf4aed7fa330747
Submitter: Zuul
Branch: stable/queens

commit dfa4692eb27a99104cc8df24fcf4aed7fa330747
Author: Lucas Alvares Gomes <email address hidden>
Date: Fri May 31 11:39:39 2019 +0100

    Fix metadata agent proxy list updates race condition

    (cherry picked from commit d84cb7938a153bd22a6fae5adbfbf8ec3b819bda)

tags: added: in-stable-queens
tags: added: networking-ovn-proactive-backport-potential