ResourceTracker.stats can leak across multiple ironic nodes

Bug #1784705 reported by Matt Riedemann on 2018-07-31
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
High
Matt Riedemann
Ocata
High
Matt Riedemann
Pike
High
Matt Riedemann
Queens
High
Matt Riedemann

Bug Description

A single nova-compute service host can manage multiple ironic nodes, which creates multiple ComputeNode records per compute service host, and ironic instances are 1:1 with each compute node.

Before change https://review.openstack.org/#/c/398473/ in Ocata, the ComputeManager would manage multiple ResourceTracker instances, one per compute node (so one per ironic instance managed by that host). As a result of that change, the ComputeManager manages a single ResourceTracker instance, and the ResourceTracker's compute_node entry was changed to a dict, so the RT could manage multiple compute nodes (one per ironic instance).

The problem is the ResourceTracker.stats variable was left to be "shared" across all compute nodes being managed by the single RT, which can cause problems in the ResourceTracker._update_usage_from_instance() method which updates the stats and then assigns it to a compute node record, so it could leak/accumulate information about the stats for a different node.

The compute node stats are used by the ComputeCapabilitiesFilter in the scheduler so it could be possible for a compute node B to be reporting node capabilities which only apply to node A.

This was discovered during code review of this change:

https://review.openstack.org/#/c/576099/2/nova/compute/resource_tracker.py@1130

Matt Riedemann (mriedem) on 2018-07-31
Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)

Fix proposed to branch: master
Review: https://review.openstack.org/587636

Changed in nova:
status: Triaged → In Progress

Reviewed: https://review.openstack.org/587614
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=fc05626d43571733da0803df0fd9a7c69766b8fd
Submitter: Zuul
Branch: master

commit fc05626d43571733da0803df0fd9a7c69766b8fd
Author: Matt Riedemann <email address hidden>
Date: Tue Jul 31 16:00:28 2018 -0400

    Add recreate test for RT.stats bug 1784705

    With change I6827137f35c0cb4f9fc4c6f753d9a035326ed01b in
    Ocata, we changed the ComputeManager to manage a single
    ResourceTracker and that single ResourceTracker will
    manage multiple compute nodes. The only time a single
    nova-compute service hosts multiple compute nodes is for
    ironic where there is a compute node per instance. The
    problem is the ResourceTracker.stats variable, unlike the
    ResourceTracker.compute_nodes variable, is not node-specific
    so it's possible for node stats to leak across nodes based
    on how the stats are used (and copied).

    This change adds a functional recreate test to show the issue
    before it's fixed. The fixture setup had to be tweaked a
    bit to avoid modifying class variables by reference between
    test cases.

    Change-Id: Icc5f615baa1042347ec1699eb84ba0670445b995
    Related-Bug: #1784705

Changed in nova:
assignee: Matt Riedemann (mriedem) → Dan Smith (danms)
Matt Riedemann (mriedem) on 2018-08-01
Changed in nova:
assignee: Dan Smith (danms) → Matt Riedemann (mriedem)

Reviewed: https://review.openstack.org/587636
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b5b7d86bb04f92d21cf954cd6b3463c9fcc637e6
Submitter: Zuul
Branch: master

commit b5b7d86bb04f92d21cf954cd6b3463c9fcc637e6
Author: Matt Riedemann <email address hidden>
Date: Tue Jul 31 17:26:47 2018 -0400

    Make ResourceTracker.stats node-specific

    As of change I6827137f35c0cb4f9fc4c6f753d9a035326ed01b in
    Ocata, the ResourceTracker manages multiple compute nodes
    via its "compute_nodes" variable, but the "stats" variable
    was still being shared across all nodes, which leads to
    leaking stats across nodes in an ironic deployment where
    a single nova-compute service host is managing multiple
    ironic instances (nodes).

    This change makes ResourceTracker.stats node-specific
    which fixes the ironic leak but also allows us to remove
    the stats deepcopy while iterating over instances which
    should improve performance for single-node deployments with
    potentially a large number of instances, i.e. vCenter.

    Change-Id: I0b9e5b711878fa47ba90e43c0b41437b57cf8ef6
    Closes-Bug: #1784705
    Closes-Bug: #1777422

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/587921
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=cc8167a1914a70ea7b336856285df2571a754ca0
Submitter: Zuul
Branch: stable/queens

commit cc8167a1914a70ea7b336856285df2571a754ca0
Author: Matt Riedemann <email address hidden>
Date: Tue Jul 31 16:00:28 2018 -0400

    Add recreate test for RT.stats bug 1784705

    With change I6827137f35c0cb4f9fc4c6f753d9a035326ed01b in
    Ocata, we changed the ComputeManager to manage a single
    ResourceTracker and that single ResourceTracker will
    manage multiple compute nodes. The only time a single
    nova-compute service hosts multiple compute nodes is for
    ironic where there is a compute node per instance. The
    problem is the ResourceTracker.stats variable, unlike the
    ResourceTracker.compute_nodes variable, is not node-specific
    so it's possible for node stats to leak across nodes based
    on how the stats are used (and copied).

    This change adds a functional recreate test to show the issue
    before it's fixed. The fixture setup had to be tweaked a
    bit to avoid modifying class variables by reference between
    test cases.

    Conflicts:
          nova/tests/functional/compute/test_resource_tracker.py

    NOTE(mriedem): Conflicts are due to not having change
    I433700e833f97c0fec946dafc2cdda9d49e1100b or change
    I075785abcd4f4a8e180959daeadf215b9cd175c8 in Queens.

    Change-Id: Icc5f615baa1042347ec1699eb84ba0670445b995
    Related-Bug: #1784705
    (cherry picked from commit fc05626d43571733da0803df0fd9a7c69766b8fd)

tags: added: in-stable-queens

Reviewed: https://review.openstack.org/587976
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7d99f5753f992bd4b58b0aaa7f83f71fb044776f
Submitter: Zuul
Branch: stable/queens

commit 7d99f5753f992bd4b58b0aaa7f83f71fb044776f
Author: Matt Riedemann <email address hidden>
Date: Tue Jul 31 17:26:47 2018 -0400

    Make ResourceTracker.stats node-specific

    As of change I6827137f35c0cb4f9fc4c6f753d9a035326ed01b in
    Ocata, the ResourceTracker manages multiple compute nodes
    via its "compute_nodes" variable, but the "stats" variable
    was still being shared across all nodes, which leads to
    leaking stats across nodes in an ironic deployment where
    a single nova-compute service host is managing multiple
    ironic instances (nodes).

    This change makes ResourceTracker.stats node-specific
    which fixes the ironic leak but also allows us to remove
    the stats deepcopy while iterating over instances which
    should improve performance for single-node deployments with
    potentially a large number of instances, i.e. vCenter.

    Conflicts:
          nova/compute/resource_tracker.py

    NOTE(mriedem): The conflict is due to not having change
    Ic5188d4c5a7e8aa4c79c37e63124fe3a305ef568 in Queens.

    Change-Id: I0b9e5b711878fa47ba90e43c0b41437b57cf8ef6
    Closes-Bug: #1784705
    Closes-Bug: #1777422
    (cherry picked from commit b5b7d86bb04f92d21cf954cd6b3463c9fcc637e6)

This issue was fixed in the openstack/nova 18.0.0.0rc1 release candidate.

Reviewed: https://review.openstack.org/588036
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dbce613a9aede3fc5463a2d383b355ffb8cda73a
Submitter: Zuul
Branch: stable/pike

commit dbce613a9aede3fc5463a2d383b355ffb8cda73a
Author: Matt Riedemann <email address hidden>
Date: Tue Jul 31 16:00:28 2018 -0400

    Add recreate test for RT.stats bug 1784705

    With change I6827137f35c0cb4f9fc4c6f753d9a035326ed01b in
    Ocata, we changed the ComputeManager to manage a single
    ResourceTracker and that single ResourceTracker will
    manage multiple compute nodes. The only time a single
    nova-compute service hosts multiple compute nodes is for
    ironic where there is a compute node per instance. The
    problem is the ResourceTracker.stats variable, unlike the
    ResourceTracker.compute_nodes variable, is not node-specific
    so it's possible for node stats to leak across nodes based
    on how the stats are used (and copied).

    This change adds a functional recreate test to show the issue
    before it's fixed. The fixture setup had to be tweaked a
    bit to avoid modifying class variables by reference between
    test cases.

    Change-Id: Icc5f615baa1042347ec1699eb84ba0670445b995
    Related-Bug: #1784705
    (cherry picked from commit fc05626d43571733da0803df0fd9a7c69766b8fd)
    (cherry picked from commit cc8167a1914a70ea7b336856285df2571a754ca0)

tags: added: in-stable-pike

Reviewed: https://review.openstack.org/588037
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e279ac404673784f33610cad55288e2e60649f88
Submitter: Zuul
Branch: stable/pike

commit e279ac404673784f33610cad55288e2e60649f88
Author: Matt Riedemann <email address hidden>
Date: Tue Jul 31 17:26:47 2018 -0400

    Make ResourceTracker.stats node-specific

    As of change I6827137f35c0cb4f9fc4c6f753d9a035326ed01b in
    Ocata, the ResourceTracker manages multiple compute nodes
    via its "compute_nodes" variable, but the "stats" variable
    was still being shared across all nodes, which leads to
    leaking stats across nodes in an ironic deployment where
    a single nova-compute service host is managing multiple
    ironic instances (nodes).

    This change makes ResourceTracker.stats node-specific
    which fixes the ironic leak but also allows us to remove
    the stats deepcopy while iterating over instances which
    should improve performance for single-node deployments with
    potentially a large number of instances, i.e. vCenter.

    Conflicts:
          nova/compute/manager.py

    NOTE(mriedem): The conflict was due to not having change
    Iae904afb6cb4fcea8bb27741d774ffbe986a5fb4 in Pike.

    NOTE(mriedem): The check for node is None was moved from
    _do_build_and_run_instance to build_and_run_instance because
    the functional tests are using the ChanceScheduler since we
    don't have change I12de2e195022593ea2a3e2894f2c3b5226930d4f
    and the ChanceScheduler does not return the nodename during
    select_destinations so the compute has to get it from the
    driver.

    Change-Id: I0b9e5b711878fa47ba90e43c0b41437b57cf8ef6
    Closes-Bug: #1784705
    Closes-Bug: #1777422
    (cherry picked from commit b5b7d86bb04f92d21cf954cd6b3463c9fcc637e6)
    (cherry picked from commit 7d99f5753f992bd4b58b0aaa7f83f71fb044776f)

Reviewed: https://review.openstack.org/588076
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d4179e3e96b3f3820b1055624ca81c441b94332e
Submitter: Zuul
Branch: stable/ocata

commit d4179e3e96b3f3820b1055624ca81c441b94332e
Author: Matt Riedemann <email address hidden>
Date: Tue Jul 31 16:00:28 2018 -0400

    Add recreate test for RT.stats bug 1784705

    With change I6827137f35c0cb4f9fc4c6f753d9a035326ed01b in
    Ocata, we changed the ComputeManager to manage a single
    ResourceTracker and that single ResourceTracker will
    manage multiple compute nodes. The only time a single
    nova-compute service hosts multiple compute nodes is for
    ironic where there is a compute node per instance. The
    problem is the ResourceTracker.stats variable, unlike the
    ResourceTracker.compute_nodes variable, is not node-specific
    so it's possible for node stats to leak across nodes based
    on how the stats are used (and copied).

    This change adds a functional recreate test to show the issue
    before it's fixed. The fixture setup had to be tweaked a
    bit to avoid modifying class variables by reference between
    test cases.

    Conflicts:
          nova/tests/functional/compute/test_resource_tracker.py

    NOTE(mriedem): The conflict is due to this test module not
    existing in Ocata. It was introduced in Pike with change
    I59be1cbedc99dcbb0ccde089a9f4737305176324 and changes were
    made to it over time. In this backport, the basics needed
    for the one test case specific to this patch are added and
    everything else, like the placement stuff, is removed.

    Change-Id: Icc5f615baa1042347ec1699eb84ba0670445b995
    Related-Bug: #1784705
    (cherry picked from commit fc05626d43571733da0803df0fd9a7c69766b8fd)
    (cherry picked from commit cc8167a1914a70ea7b336856285df2571a754ca0)
    (cherry picked from commit dbce613a9aede3fc5463a2d383b355ffb8cda73a)

tags: added: in-stable-ocata

Reviewed: https://review.openstack.org/588077
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2f6ea7b871f5273f16921125647bdcd2f4e93c4e
Submitter: Zuul
Branch: stable/ocata

commit 2f6ea7b871f5273f16921125647bdcd2f4e93c4e
Author: Matt Riedemann <email address hidden>
Date: Tue Jul 31 17:26:47 2018 -0400

    Make ResourceTracker.stats node-specific

    As of change I6827137f35c0cb4f9fc4c6f753d9a035326ed01b in
    Ocata, the ResourceTracker manages multiple compute nodes
    via its "compute_nodes" variable, but the "stats" variable
    was still being shared across all nodes, which leads to
    leaking stats across nodes in an ironic deployment where
    a single nova-compute service host is managing multiple
    ironic instances (nodes).

    This change makes ResourceTracker.stats node-specific
    which fixes the ironic leak but also allows us to remove
    the stats deepcopy while iterating over instances which
    should improve performance for single-node deployments with
    potentially a large number of instances, i.e. vCenter.

    Conflicts:
          nova/compute/manager.py
          nova/compute/resource_tracker.py

    NOTE(mriedem): The conflicts are due to not having change
    I02b7cd87d399d487dd1d650540f503a70bc27749 in Ocata which
    added the consecutive failed builds stats counting. As a
    result, the Ocata cherry pick does not need to move the
    "if node is None" check in _do_build_and_run_instance like
    in Pike.

    Change-Id: I0b9e5b711878fa47ba90e43c0b41437b57cf8ef6
    Closes-Bug: #1784705
    Closes-Bug: #1777422
    (cherry picked from commit b5b7d86bb04f92d21cf954cd6b3463c9fcc637e6)
    (cherry picked from commit 7d99f5753f992bd4b58b0aaa7f83f71fb044776f)
    (cherry picked from commit e279ac404673784f33610cad55288e2e60649f88)

This issue was fixed in the openstack/nova 16.1.5 release.

This issue was fixed in the openstack/nova 17.0.6 release.

This issue was fixed in the openstack/nova 15.1.4 release.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers