Evacuation takes too long when destination host has a large number of NICs

Bug #1837075 reported by Artom Lifshitz on 2019-07-18
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Low
Artom Lifshitz
Queens
Low
Artom Lifshitz
Rocky
Low
Artom Lifshitz
Stein
Low
Artom Lifshitz

Bug Description

Description
===========

Evacuation takes a long time if the destination host a large number of network interfaces.

Steps to reproduce
==================

1. Have a host down, or force it down.

2. Evacuate instances to a host with a large number of network interfaces.

Expected result
===============

Evacuation completes in a reasonable time frame.

Actual result
=============

Evacuation takes too long.

Additional info
===============

This was initially reported against OSP10/Newton [1]. In that case, based on the included sosreports, the compute host has 1324 network interfaces, and 109 instances are being evacuated. That means in total, there's 109 * 1324 = 144316 iterations over the loop in get_machine_ips().

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1709400

Changed in nova:
assignee: nobody → Artom Lifshitz (notartom)
status: New → In Progress
Changed in nova:
assignee: Artom Lifshitz (notartom) → Eric Fried (efried)
Eric Fried (efried) on 2019-07-19
Changed in nova:
assignee: Eric Fried (efried) → Artom Lifshitz (notartom)

Reviewed: https://review.opendev.org/671471
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=30d8159d4ee51a26a03de1cb134ea64c6c07ffb2
Submitter: Zuul
Branch: master

commit 30d8159d4ee51a26a03de1cb134ea64c6c07ffb2
Author: Artom Lifshitz <email address hidden>
Date: Fri Jul 19 11:35:24 2019 -0400

    libvirt: move checking CONF.my_ip to init_host()

    Migrations use the libvirt driver's get_host_ip_addr() method to
    determine the dest_host field of the migration object.
    get_host_ip_addr() checks whether CONF.my_ip is actually assigned to
    one of the host's interfaces. It does so by calling
    get_machine_ips(), which iterates over all of the host's interfaces.
    If the host has many interfaces, this can take a long time, and
    introduces needless delays in processing the migration.
    get_machine_ips() is only used to print a warning, so this patch moves
    the get_machine_ips() call to a single method in init_host(). This
    way, a warning is still emitted at compute service startup, and
    migration progress is not needlessly slowed down.

    This patch also has a chicken and egg problem with the patch on top of
    it, which poisons use of netifaces.interfaces() in tests. While this
    patch fixes all the tests that break with that poison, it starts
    breaking different tests because of the move of get_machine_ips() into
    init_host(). Therefore, while not directly related to the bug, this
    patch also preventatively mocks or stubs out any use of
    get_machine_ips() that will get poisoned with the subsequent patch.

    Closes-bug: 1837075
    Change-Id: I58a4038b04d5a9c28927d914e71609e4deea3d9f

Changed in nova:
status: In Progress → Fix Released
Matt Riedemann (mriedem) on 2019-07-22
tags: added: evac
tags: added: evacuate performance
removed: evac
Changed in nova:
importance: Undecided → Low

Reviewed: https://review.opendev.org/672154
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=560317c766afc4a4c4c5017ecfc9ce432fe63ea7
Submitter: Zuul
Branch: stable/stein

commit 560317c766afc4a4c4c5017ecfc9ce432fe63ea7
Author: Artom Lifshitz <email address hidden>
Date: Fri Jul 19 11:35:24 2019 -0400

    libvirt: move checking CONF.my_ip to init_host()

    Migrations use the libvirt driver's get_host_ip_addr() method to
    determine the dest_host field of the migration object.
    get_host_ip_addr() checks whether CONF.my_ip is actually assigned to
    one of the host's interfaces. It does so by calling
    get_machine_ips(), which iterates over all of the host's interfaces.
    If the host has many interfaces, this can take a long time, and
    introduces needless delays in processing the migration.
    get_machine_ips() is only used to print a warning, so this patch moves
    the get_machine_ips() call to a single method in init_host(). This
    way, a warning is still emitted at compute service startup, and
    migration progress is not needlessly slowed down.

    NOTE(artom) While the following paragraph still applies, the poison
    patch will not be backported. Stubbing out use of
    netifaces.interfaces() is still a good thing to do, however.

    This patch also has a chicken and egg problem with the patch on top of
    it, which poisons use of netifaces.interfaces() in tests. While this
    patch fixes all the tests that break with that poison, it starts
    breaking different tests because of the move of get_machine_ips() into
    init_host(). Therefore, while not directly related to the bug, this
    patch also preventatively mocks or stubs out any use of
    get_machine_ips() that will get poisoned with the subsequent patch.

    Closes-bug: 1837075
    Change-Id: I58a4038b04d5a9c28927d914e71609e4deea3d9f
    (cherry picked from commit 30d8159d4ee51a26a03de1cb134ea64c6c07ffb2)

Reviewed: https://review.opendev.org/672155
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=65d2e455e323a627ef228ed57a1f0c86d8252665
Submitter: Zuul
Branch: stable/rocky

commit 65d2e455e323a627ef228ed57a1f0c86d8252665
Author: Artom Lifshitz <email address hidden>
Date: Fri Jul 19 11:35:24 2019 -0400

    libvirt: move checking CONF.my_ip to init_host()

    Migrations use the libvirt driver's get_host_ip_addr() method to
    determine the dest_host field of the migration object.
    get_host_ip_addr() checks whether CONF.my_ip is actually assigned to
    one of the host's interfaces. It does so by calling
    get_machine_ips(), which iterates over all of the host's interfaces.
    If the host has many interfaces, this can take a long time, and
    introduces needless delays in processing the migration.
    get_machine_ips() is only used to print a warning, so this patch moves
    the get_machine_ips() call to a single method in init_host(). This
    way, a warning is still emitted at compute service startup, and
    migration progress is not needlessly slowed down.

    NOTE(artom) While the following paragraph still applies, the poison
    patch will not be backported. Stubbing out use of
    netifaces.interfaces() is still a good thing to do, however.

    This patch also has a chicken and egg problem with the patch on top of
    it, which poisons use of netifaces.interfaces() in tests. While this
    patch fixes all the tests that break with that poison, it starts
    breaking different tests because of the move of get_machine_ips() into
    init_host(). Therefore, while not directly related to the bug, this
    patch also preventatively mocks or stubs out any use of
    get_machine_ips() that will get poisoned with the subsequent patch.

    Closes-bug: 1837075
    Change-Id: I58a4038b04d5a9c28927d914e71609e4deea3d9f
    (cherry picked from commit 30d8159d4ee51a26a03de1cb134ea64c6c07ffb2)
    (cherry picked from commit 560317c766afc4a4c4c5017ecfc9ce432fe63ea7)

This issue was fixed in the openstack/nova 19.0.2 release.

Reviewed: https://review.opendev.org/672161
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=028a8e949e9b9b5731e7f5f283375533ab6f85fa
Submitter: Zuul
Branch: stable/queens

commit 028a8e949e9b9b5731e7f5f283375533ab6f85fa
Author: Artom Lifshitz <email address hidden>
Date: Fri Jul 19 11:35:24 2019 -0400

    libvirt: move checking CONF.my_ip to init_host()

    Migrations use the libvirt driver's get_host_ip_addr() method to
    determine the dest_host field of the migration object.
    get_host_ip_addr() checks whether CONF.my_ip is actually assigned to
    one of the host's interfaces. It does so by calling
    get_machine_ips(), which iterates over all of the host's interfaces.
    If the host has many interfaces, this can take a long time, and
    introduces needless delays in processing the migration.
    get_machine_ips() is only used to print a warning, so this patch moves
    the get_machine_ips() call to a single method in init_host(). This
    way, a warning is still emitted at compute service startup, and
    migration progress is not needlessly slowed down.

    NOTE(artom) While the following paragraph still applies, the poison
    patch will not be backported. Stubbing out use of
    netifaces.interfaces() is still a good thing to do, however.

    This patch also has a chicken and egg problem with the patch on top of
    it, which poisons use of netifaces.interfaces() in tests. While this
    patch fixes all the tests that break with that poison, it starts
    breaking different tests because of the move of get_machine_ips() into
    init_host(). Therefore, while not directly related to the bug, this
    patch also preventatively mocks or stubs out any use of
    get_machine_ips() that will get poisoned with the subsequent patch.

    (cherry picked from commit 30d8159d4ee51a26a03de1cb134ea64c6c07ffb2)

    (cherry picked from commit 560317c766afc4a4c4c5017ecfc9ce432fe63ea7)

    (cherry picked from commit 65d2e455e323a627ef228ed57a1f0c86d8252665)
    Conflicts:
      nova/tests/unit/virt/libvirt/fakelibvirt.py
        Due to 23fd6c2287f1e68336e7752246999de739b9f7c0 which mocked out
        get_fs_info() at the same place as this patch mocks out
        get_machine_ips().
    nova/virt/libvirt/driver.py
        Due to cbc28f0d15287dcf24a07f835210affa41c38993 which added
        _check_file_backed_memory_support() to init_host() at the same
        place this patch added _check_my_ip().

    Closes-bug: 1837075
    Change-Id: I58a4038b04d5a9c28927d914e71609e4deea3d9f

This issue was fixed in the openstack/nova 17.0.12 release.

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers