OVS polling loop created by ovsdbapp and os-vif starving n-cpu threads

Bug #1929446 reported by Lee Yarwood
This bug affects 1 person
Affects    Status         Importance   Assigned to   Milestone
os-vif     Fix Released   Medium       sean mooney
ovsdbapp   Fix Released   Undecided    Unassigned

Bug Description

I've been seeing lots of failures caused by timeouts in test_volume_backed_live_migration during the live-migration and multinode grenade jobs, for example:

https://zuul.opendev.org/t/openstack/build/bb6fd21b5d8c471a89f4f6598aa84e5d/logs

During check_can_live_migrate_source I'm seeing the following gap in the logs that I can't explain:

12225 May 24 10:23:02.637600 ubuntu-focal-inap-mtl01-0024794054 nova-compute[107012]: DEBUG nova.virt.libvirt.driver [None req-b5288b85-d642-426f-a525-c64724fe4091 tempest-LiveMigrationTest-312230369 tempest-LiveMigrationTest-312230369-project-admin] [instance: 91a0e0ca-e6a8-43ab-8e68-a10a77ad615b] Check if temp file /opt/stack/data/nova/instances/tmp5lcmhuri exists to indicate shared storage is being used for migration. Exists? False {{(pid=107012) _check_shared_storage_test_file /opt/stack/nova/nova/virt/libvirt/driver.py:9367}}
[..]
12282 May 24 10:24:22.385187 ubuntu-focal-inap-mtl01-0024794054 nova-compute[107012]: DEBUG nova.virt.libvirt.driver [None req-b5288b85-d642-426f-a525-c64724fe4091 tempest-LiveMigrationTest-312230369 tempest-LiveMigrationTest-312230369-project-admin] skipping disk /dev/sdb (vda) as it is a volume {{(pid=107012) _get_instance_disk_info_from_config /opt/stack/nova/nova/virt/libvirt/driver.py:10458}}

^ This gap of roughly 80 seconds leads to both the HTTP request to live migrate (that's still a synchronous call at this point [1]) *and* the RPC call from the dest to the source timing out.

[1] https://docs.openstack.org/nova/latest/reference/live-migration.html

sean mooney (sean-k-mooney) wrote :

This is caused by the OVS polling loop in ovsdbapp as a result of the connection created in os-vif

Changed in nova:
importance: Undecided → Medium
status: New → Triaged
Lee Yarwood (lyarwood) wrote :
summary: - check_can_live_migrate_source taking > 60 seconds in CI
+ OVS polling loop created by ovsdbapp and os-vif starving n-cpu threads
Terry Wilson (otherwiseguy) wrote :

There are several things that I think need to happen:

1) This patch needs to be merged in python-ovs that stops using a non-monkeypatched select.poll() when checking for ovsdb connection completion - http://patchwork.ozlabs<email address hidden>/

2) os-vif should limit the tables that it registers here: https://github.com/openstack/os-vif/blob/d8af3568b8b92748f61029a96c46fd513b6795c2/vif_plug_ovs/ovsdb/impl_idl.py#L26 to only the tables that it uses. It currently pulls in the whole database on connection, which takes a long time.

3) In the next line, it could pass a probe_interval argument to idl.Idl to raise the interval for sending echo probes to the server above the 5s default. If downloading the entire DB from a busy ovsdb-server with lots of connections takes longer than 5s, it'll currently disconnect and try again without ever successfully connecting.

4) Something like https://review.opendev.org/c/openstack/neutron/+/794892 can be done, overriding the python-ovs streams to use TCP keepalives instead of ovsdb echo requests, making reconnections less likely on a loaded single-threaded ovsdb-server since the kernel will take care of them. Note that that patch probably should have configured TCP_KEEPIDLE, TCP_KEEPCNT, and TCP_KEEPINTVL instead of relying on whatever is configured with sysctl, as the defaults are very long.
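
As a rough illustration of point 4, the following Python sketch enables keepalives on a plain socket and sets the three per-socket timers explicitly instead of inheriting the sysctl defaults. It uses only the standard library; the helper name enable_keepalive and the timer values are made up for this example and are not taken from the neutron or os-vif patches. The TCP_KEEP* constants are platform-specific (they exist on Linux), so the sketch skips any that are missing.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalives with explicit timers (hypothetical helper).

    idle:     seconds of silence before the first keepalive probe
    interval: seconds between unanswered probes
    count:    unanswered probes before the kernel drops the connection
    """
    # SO_KEEPALIVE alone would use the system-wide sysctl defaults.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket overrides; skipped where the platform lacks the option.
    for name, value in (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", count),
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("keepalive on:", bool(s.getsockopt(socket.SOL_SOCKET,
                                             socket.SO_KEEPALIVE)))
    s.close()
```

With the example values a dead peer would be noticed after roughly idle + interval * count seconds (60 + 10 * 5 = 110s), with the kernel doing the probing rather than the ovsdb client.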

sean mooney (sean-k-mooney) wrote :

setting to invalid for nova as the error is in the ovs python bindings.
marked as triaged for os-vif to track the enhancements proposed in comment 3 above.

Changed in os-vif:
status: New → Triaged
importance: Undecided → Medium
Changed in nova:
status: Triaged → Invalid
Changed in os-vif:
assignee: nobody → sean mooney (sean-k-mooney)
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/796813

Lee Yarwood (lyarwood) wrote :
no longer affects: nova
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-vif (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/os-vif/+/805223

Changed in os-vif:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/os-vif/+/805625

OpenStack Infra (hudson-openstack) wrote : Fix merged to os-vif (master)

Reviewed: https://review.opendev.org/c/openstack/os-vif/+/805223
Committed: https://opendev.org/openstack/os-vif/commit/09c0629bb728ad342a41d844143d8e7437c925c4
Submitter: "Zuul (22348)"
Branch: master

commit 09c0629bb728ad342a41d844143d8e7437c925c4
Author: Sean Mooney <email address hidden>
Date: Thu Aug 19 14:32:42 2021 +0100

    Use TCP keepalives for ovsdb connections

    Ultimately, this is something that should be fixed in python-ovs,
    but setting the SO_KEEPALIVE socket option benefits the client by
    removing the need to send 'echo' requests, which can time out on
    an overloaded ovsdb-server, which causes a disconnection which then
    adds even more load on the ovsdb-server as it has to send the entire
    db contents over the wire after the connection is restored.

    This patch ports the optimisation from neutron to reduce the likelihood
    of a reconnection which can cause the nova compute agent to hang
    temporarily while the connection is reestablished.

    Change-Id: I984ec62730276f8ee60d71a02a98fbfc4c37f7d8
    Related-Bug: #1930926
    Partial-Bug: #1929446

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/os-vif/+/805625
Committed: https://opendev.org/openstack/os-vif/commit/e4dc8b5664ccee8bde9e90fc9e618d6b705a0b68
Submitter: "Zuul (22348)"
Branch: master

commit e4dc8b5664ccee8bde9e90fc9e618d6b705a0b68
Author: Sean Mooney <email address hidden>
Date: Mon Aug 23 13:11:16 2021 +0100

    only register tables used by os-vif

    This change limits the tables registered in the native driver
    to the set actually used by os-vif. This will shorten the initial
    startup time and reconnection time if the ovs db connection is dropped.
    As a result this will help mitigate bug #1929446 where on reconnection
    the nova compute agent can stall until reconnection is completed.

    Change-Id: I635dff2b4fcff905ca8f431eb7e928265200f92a
    Partial-Bug: #1929446
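
A minimal sketch of the idea in this commit, assuming a schema helper object that exposes register_table() the way the python-ovs SchemaHelper does. The table list and the RecordingHelper stand-in are illustrative assumptions so that the example runs without an ovsdb-server; they are not copied from the os-vif change.

```python
# Tables this hypothetical client actually reads or writes (assumed list).
USED_TABLES = ("Bridge", "Port", "Interface")


def register_used_tables(helper, tables=USED_TABLES):
    """Register only the named tables instead of the whole schema.

    `helper` is anything with a register_table(name) method. Registering a
    subset means only those tables are replicated on (re)connection.
    """
    for table in tables:
        helper.register_table(table)
    return helper


class RecordingHelper:
    """Stand-in for a schema helper; it just records what was registered."""

    def __init__(self):
        self.registered = []

    def register_table(self, table):
        self.registered.append(table)


if __name__ == "__main__":
    helper = register_used_tables(RecordingHelper())
    print(helper.registered)
```

The smaller the registered set, the less data the server has to send on each connection or reconnection, which is the window in which the compute agent can stall.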

OpenStack Infra (hudson-openstack) wrote : Change abandoned on ovsdbapp (master)

Change abandoned by "Terry Wilson <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/ovsdbapp/+/795789
Reason: This should be fixed in upstream python-ovs by adding a cooperative_yield() method that can be overridden. ovsdbapp may go ahead and also add a default cooperative_yield() that is just time.sleep(0).

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ovsdbapp (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/ovsdbapp/+/818446

OpenStack Infra (hudson-openstack) wrote : Related fix merged to ovsdbapp (master)

Reviewed: https://review.opendev.org/c/openstack/ovsdbapp/+/818446
Committed: https://opendev.org/openstack/ovsdbapp/commit/a2d3ef2a6491eb63b5ee961fc930070207a79d84
Submitter: "Zuul (22348)"
Branch: master

commit a2d3ef2a6491eb63b5ee961fc930070207a79d84
Author: Terry Wilson <email address hidden>
Date: Fri Nov 12 10:58:22 2021 -0600

    Add cooperative_yield() to OvsdbIdl

    On python-ovs 2.16.0+, overriding cooperative_yield() will allow
    potentially long-running CPU-intensive methods to cooperatively
    yield to greenthreads. This patch + ovs 2.16.0+ will resolve
    the related bug. I'll see if I can get cooperative_yield
    backported in OVS as well.

    Related-Bug: #1929446
    Change-Id: Ibd3c7427cbcab81253e0ed700174be09908bdef7
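
To illustrate the hook pattern this commit message describes, here is a stdlib-only sketch. BaseIdl, YieldingIdl, process_rows and yield_every are hypothetical names for this example and are not python-ovs or ovsdbapp code; only the method name cooperative_yield() and the time.sleep(0) body come from the discussion in this bug. Under eventlet monkey patching, time.sleep(0) hands control to other greenthreads; in plain Python it simply returns.

```python
import time


class BaseIdl:
    """Toy base class with a long CPU-bound loop and an overridable hook."""

    def cooperative_yield(self):
        """No-op by default; subclasses may override to yield control."""

    def process_rows(self, rows, yield_every=100):
        total = 0
        for index, row in enumerate(rows, start=1):
            total += row  # stands in for expensive per-row work
            if index % yield_every == 0:
                # Give other cooperative threads a chance to run.
                self.cooperative_yield()
        return total


class YieldingIdl(BaseIdl):
    """Overrides the hook the way the abandoned-change comment suggests."""

    def __init__(self):
        self.yields = 0

    def cooperative_yield(self):
        self.yields += 1  # counted only to make the behaviour visible
        time.sleep(0)


if __name__ == "__main__":
    idl = YieldingIdl()
    print(idl.process_rows(range(1000)), idl.yields)
```

The point of the design is that the library does not need to know which concurrency framework is in use: it only calls the hook at regular points, and the application decides what yielding means.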

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ovsdbapp (stable/xena)

Related fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/ovsdbapp/+/841738

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ovsdbapp (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/ovsdbapp/+/841739

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ovsdbapp (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/ovsdbapp/+/841740

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ovsdbapp (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/ovsdbapp/+/841741

OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-vif (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/os-vif/+/841771

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/os-vif/+/841772

OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-vif (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/os-vif/+/841773

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/os-vif/+/841774

OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-vif (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/os-vif/+/841775

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/os-vif/+/841776

OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-vif (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/os-vif/+/841777

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/os-vif/+/841778

OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-vif (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/os-vif/+/841779

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/os-vif/+/841780

sean mooney (sean-k-mooney) wrote :

this is basically as fixed as it is going to be in os-vif and ovsdbapp

the actual fix was in ovs
the mitigations for it are also merged in os-vif and ovsdbapp master (since xena)

i have proposed backports to os-vif down to train
and we can review them as normal.

Changed in os-vif:
status: In Progress → Fix Released
Changed in ovsdbapp:
status: New → Fix Released
sean mooney (sean-k-mooney) wrote :

by the way the os-vif patches just reduce the time it takes to reconnect and make it happen less often, but that is only a mitigation for the issue, not a fix.

OpenStack Infra (hudson-openstack) wrote : Related fix merged to ovsdbapp (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/ovsdbapp/+/841738
Committed: https://opendev.org/openstack/ovsdbapp/commit/62ee94741206de8340eed16e89e4ddb55a8723d3
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 62ee94741206de8340eed16e89e4ddb55a8723d3
Author: Terry Wilson <email address hidden>
Date: Fri Nov 12 10:58:22 2021 -0600

    Add cooperative_yield() to OvsdbIdl

    On python-ovs 2.16.0+, overriding cooperative_yield() will allow
    potentially long-running CPU-intensive methods to cooperatively
    yield to greenthreads. This patch + ovs 2.16.0+ will resolve
    the related bug. I'll see if I can get cooperative_yield
    backported in OVS as well.

    Related-Bug: #1929446
    Change-Id: Ibd3c7427cbcab81253e0ed700174be09908bdef7
    (cherry picked from commit a2d3ef2a6491eb63b5ee961fc930070207a79d84)

tags: added: in-stable-xena
tags: added: in-stable-ussuri
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ovsdbapp (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/ovsdbapp/+/841740
Committed: https://opendev.org/openstack/ovsdbapp/commit/84c07ca8a3bd827406edb3275a829f5feba5a54e
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 84c07ca8a3bd827406edb3275a829f5feba5a54e
Author: Terry Wilson <email address hidden>
Date: Fri Nov 12 10:58:22 2021 -0600

    Add cooperative_yield() to OvsdbIdl

    On python-ovs 2.16.0+, overriding cooperative_yield() will allow
    potentially long-running CPU-intensive methods to cooperatively
    yield to greenthreads. This patch + ovs 2.16.0+ will resolve
    the related bug. I'll see if I can get cooperative_yield
    backported in OVS as well.

    Related-Bug: #1929446
    Change-Id: Ibd3c7427cbcab81253e0ed700174be09908bdef7
    (cherry picked from commit a2d3ef2a6491eb63b5ee961fc930070207a79d84)

OpenStack Infra (hudson-openstack) wrote : Related fix merged to ovsdbapp (stable/train)

Reviewed: https://review.opendev.org/c/openstack/ovsdbapp/+/841741
Committed: https://opendev.org/openstack/ovsdbapp/commit/4d9ea84cbafa9cb15d8d91aae29672e98473c3de
Submitter: "Zuul (22348)"
Branch: stable/train

commit 4d9ea84cbafa9cb15d8d91aae29672e98473c3de
Author: Terry Wilson <email address hidden>
Date: Fri Nov 12 10:58:22 2021 -0600

    Add cooperative_yield() to OvsdbIdl

    On python-ovs 2.16.0+, overriding cooperative_yield() will allow
    potentially long-running CPU-intensive methods to cooperatively
    yield to greenthreads. This patch + ovs 2.16.0+ will resolve
    the related bug. I'll see if I can get cooperative_yield
    backported in OVS as well.

    Related-Bug: #1929446
    Change-Id: Ibd3c7427cbcab81253e0ed700174be09908bdef7
    (cherry picked from commit a2d3ef2a6491eb63b5ee961fc930070207a79d84)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ovsdbapp (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/ovsdbapp/+/841739
Committed: https://opendev.org/openstack/ovsdbapp/commit/e1a0d7c85783e87be6ca440f223d37ba2e2a888f
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit e1a0d7c85783e87be6ca440f223d37ba2e2a888f
Author: Terry Wilson <email address hidden>
Date: Fri Nov 12 10:58:22 2021 -0600

    Add cooperative_yield() to OvsdbIdl

    On python-ovs 2.16.0+, overriding cooperative_yield() will allow
    potentially long-running CPU-intensive methods to cooperatively
    yield to greenthreads. This patch + ovs 2.16.0+ will resolve
    the related bug. I'll see if I can get cooperative_yield
    backported in OVS as well.

    Related-Bug: #1929446
    Change-Id: Ibd3c7427cbcab81253e0ed700174be09908bdef7
    (cherry picked from commit a2d3ef2a6491eb63b5ee961fc930070207a79d84)

tags: added: in-stable-wallaby