[ironic] list_instances/list_instance_uuid does not respect conductor_group/partition_key

Bug #2043036 reported by Jay Faulkner
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Undecided
Unassigned

Bug Description

The methods on the Ironic driver, list_instances and list_instance_uuids are not currently respecting the conductor_group option: https://opendev.org/openstack/nova/src/branch/master/nova/conf/ironic.py#L71.

This leads to significant performance degradation, as querying Ironic for all nodes (/v1/nodes) instead of all nodes managed by the compute (/v1/nodes?conductor_group=blah) is a significantly more expensive API call.

In addition, this can lead to unexpected behavior for operators, such as an action being taken by a compute serving conductor group "A" to resolve an issue that would normally be resolved by a compute service conductor group "B".

While troubleshooting this error, we dug deeply into what this data is used for; it's used for two things:
- Reconciling deleted instances as a periodic job
- Ensuring no instances exist on a newly-started compute host

These are tasks which either could use stale data or would not be impacted by using the Ironic driver's existing node cache. Therefore, a suggested fix is:

Revise list_instances and list_instance_uuids to reuse the node cache to reduce the overall API calls being made to Ironic, and ensure all /v1/nodes calls use the same codepath in the Ironic driver. It's the belief of JayF, TheJulia, and Johnthetubaguy (on a video call right now) that using stale data, without refreshing the cache, should be safe for these use cases. (Even if we decide to refresh the cache, we should use this code path anyway.)

Changed in ironic:
status: New → Confirmed
importance: Undecided → Medium
Changed in ironic:
assignee: nobody → Jay Faulkner (jason-oldos)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/900831

Changed in nova:
status: New → In Progress
Dmitry Tantsur (divius)
Changed in ironic:
status: Confirmed → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/900831
Committed: https://opendev.org/openstack/nova/commit/fa3cf7d50cba921ea67eb161e6a199067ea62deb
Submitter: "Zuul (22348)"
Branch: master

commit fa3cf7d50cba921ea67eb161e6a199067ea62deb
Author: Jay Faulkner <email address hidden>
Date: Mon Nov 13 15:21:31 2023 -0800

    [ironic] Partition & use cache for list_instance*

    list_instances and list_instance_uuids, as written in the Ironic driver,
    do not currently respect conductor_group paritioning. Given a nova
    compute is intended to limit it's scope of work to the conductor group
    it is configured to work with; this is a bug.

    Additionally, this should be a significant performance boost for a
    couple of reasons; firstly, instead of calling the Ironic API and
    getting all nodes, instead of the subset (when using conductor group),
    we're now properly getting the subset of nodes -- this is the optimized
    path in the Ironic DB and API code. Secondly, we're now using the
    driver's node cache to respond to these requests. Since list_instances
    and list_instance_uuids is used by periodic tasks, these operating with
    data that may be slightly stale should have minimal impact compared to
    the performance benefits.

    Closes-bug: #2043036
    Change-Id: If31158e3269e5e06848c29294fdaa147beedb5a5

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/2023.2)

Fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/nova/+/906155

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 29.0.0.0rc1

This issue was fixed in the openstack/nova 29.0.0.0rc1 release candidate.

no longer affects: ironic
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/2023.2)

Reviewed: https://review.opendev.org/c/openstack/nova/+/906155
Committed: https://opendev.org/openstack/nova/commit/555d7d0ad02d31476e2d751aa52be4087878f1a2
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit 555d7d0ad02d31476e2d751aa52be4087878f1a2
Author: Jay Faulkner <email address hidden>
Date: Mon Nov 13 15:21:31 2023 -0800

    [ironic] Partition & use cache for list_instance*

    list_instances and list_instance_uuids, as written in the Ironic driver,
    do not currently respect conductor_group paritioning. Given a nova
    compute is intended to limit it's scope of work to the conductor group
    it is configured to work with; this is a bug.

    Additionally, this should be a significant performance boost for a
    couple of reasons; firstly, instead of calling the Ironic API and
    getting all nodes, instead of the subset (when using conductor group),
    we're now properly getting the subset of nodes -- this is the optimized
    path in the Ironic DB and API code. Secondly, we're now using the
    driver's node cache to respond to these requests. Since list_instances
    and list_instance_uuids is used by periodic tasks, these operating with
    data that may be slightly stale should have minimal impact compared to
    the performance benefits.

    Conflicts:
      nova/tests/unit/virt/ironic/test_driver.py
      nova/virt/ironic/driver.py

    NOTE(elod.illes): conflict is caused by openstacksdk related feature
    patch (I9fb5a385487f601329bccc9efde325d86be397b6) which was merged in
    development branch (2024.1 Caracal), hence it does not exist in 2023.2.

    Closes-bug: #2043036
    Change-Id: If31158e3269e5e06848c29294fdaa147beedb5a5
    (cherry picked from commit fa3cf7d50cba921ea67eb161e6a199067ea62deb)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/nova/+/922155

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (unmaintained/zed)

Fix proposed to branch: unmaintained/zed
Review: https://review.opendev.org/c/openstack/nova/+/922224

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/nova/+/922155
Committed: https://opendev.org/openstack/nova/commit/3226318b534847a5c553f2d94f9df19a631fd617
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 3226318b534847a5c553f2d94f9df19a631fd617
Author: Jay Faulkner <email address hidden>
Date: Mon Nov 13 15:21:31 2023 -0800

    [ironic] Partition & use cache for list_instance*

    list_instances and list_instance_uuids, as written in the Ironic driver,
    do not currently respect conductor_group paritioning. Given a nova
    compute is intended to limit it's scope of work to the conductor group
    it is configured to work with; this is a bug.

    Additionally, this should be a significant performance boost for a
    couple of reasons; firstly, instead of calling the Ironic API and
    getting all nodes, instead of the subset (when using conductor group),
    we're now properly getting the subset of nodes -- this is the optimized
    path in the Ironic DB and API code. Secondly, we're now using the
    driver's node cache to respond to these requests. Since list_instances
    and list_instance_uuids is used by periodic tasks, these operating with
    data that may be slightly stale should have minimal impact compared to
    the performance benefits.

    Closes-bug: #2043036
    Change-Id: If31158e3269e5e06848c29294fdaa147beedb5a5
    (cherry picked from commit fa3cf7d50cba921ea67eb161e6a199067ea62deb)
    (cherry picked from commit 555d7d0ad02d31476e2d751aa52be4087878f1a2)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 27.4.0

This issue was fixed in the openstack/nova 27.4.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 28.2.0

This issue was fixed in the openstack/nova 28.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (unmaintained/zed)

Reviewed: https://review.opendev.org/c/openstack/nova/+/922224
Committed: https://opendev.org/openstack/nova/commit/bae6152d4269bd0b3f5fd8ea4f0abe483a6d3061
Submitter: "Zuul (22348)"
Branch: unmaintained/zed

commit bae6152d4269bd0b3f5fd8ea4f0abe483a6d3061
Author: Jay Faulkner <email address hidden>
Date: Mon Nov 13 15:21:31 2023 -0800

    [ironic] Partition & use cache for list_instance*

    list_instances and list_instance_uuids, as written in the Ironic driver,
    do not currently respect conductor_group paritioning. Given a nova
    compute is intended to limit it's scope of work to the conductor group
    it is configured to work with; this is a bug.

    Additionally, this should be a significant performance boost for a
    couple of reasons; firstly, instead of calling the Ironic API and
    getting all nodes, instead of the subset (when using conductor group),
    we're now properly getting the subset of nodes -- this is the optimized
    path in the Ironic DB and API code. Secondly, we're now using the
    driver's node cache to respond to these requests. Since list_instances
    and list_instance_uuids is used by periodic tasks, these operating with
    data that may be slightly stale should have minimal impact compared to
    the performance benefits.

    Closes-bug: #2043036
    Change-Id: If31158e3269e5e06848c29294fdaa147beedb5a5
    (cherry picked from commit fa3cf7d50cba921ea67eb161e6a199067ea62deb)
    (cherry picked from commit 555d7d0ad02d31476e2d751aa52be4087878f1a2)
    (cherry picked from commit 3226318b534847a5c553f2d94f9df19a631fd617)

tags: added: in-unmaintained-zed
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.