ResourceTracker._update should restore previous old_resources value if ComputeNode.save fails

Bug #1834712 reported by Matt Riedemann on 2019-06-28
This bug affects 1 person
Affects                    Importance  Assigned to
OpenStack Compute (nova)   Medium      Matt Riedemann
Stein                      Medium      Matt Riedemann

Bug Description

This is a follow-up to bug 1834694, with the debug information here:

https://review.opendev.org/#/c/668252/1/nova/scheduler/host_manager.py@626

This is on an overloaded system where conductor and mysql are having problems and database connections are getting dropped.

On the first start of the compute service, the compute node record is created without the free_disk_gb field set.

Later, in ResourceTracker._update(), the _resource_change method returns True and, as a side effect, updates the self.old_resources value:

https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/compute/resource_tracker.py#L908

Then the ComputeNode.save() fails with a DB error here:

https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/compute/resource_tracker.py#L1010

That kills the update_available_resource run but doesn't kill the service, because the ComputeManager catches the exception and just logs it:

https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/compute/manager.py#L8130

The next time update_available_resource runs, _resource_change does not detect any changes here because old_resources was already updated on the failed run:

https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/compute/resource_tracker.py#L906

So we don't try to call ComputeNode.save() again but instead call _update_to_placement here:

https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/compute/resource_tracker.py#L1012

This can create the resource provider with inventory in the placement service.
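
The sequence above can be reproduced with a small toy model. This is only a sketch under stated assumptions: FlakyNode, ToyTracker, and run_periodic are hypothetical stand-ins, not nova classes, and the real old_resources structure and placement call are simplified to a dict and a counter.

```python
import copy


class FlakyNode:
    """Hypothetical stand-in for a ComputeNode whose save() can fail."""

    def __init__(self, failures):
        # The record is created without free_disk_gb set.
        self.fields = {"free_disk_gb": None}
        self.saved = dict(self.fields)  # what the "database" holds
        self._failures = failures

    def save(self):
        if self._failures > 0:
            self._failures -= 1
            raise ConnectionError("simulated dropped DB connection")
        self.saved = dict(self.fields)


class ToyTracker:
    """Hypothetical tracker that mimics the buggy ordering."""

    def __init__(self):
        self.old_resources = {}
        self.placement_updates = 0

    def _resource_change(self, node):
        # The cache is updated here, before save() has succeeded.
        if self.old_resources != node.fields:
            self.old_resources = copy.deepcopy(node.fields)
            return True
        return False

    def _update(self, node):
        if self._resource_change(node):
            node.save()
        self.placement_updates += 1  # stands in for _update_to_placement


def run_periodic(tracker, node):
    """Swallow the error, as the periodic task wrapper is described to do."""
    try:
        tracker._update(node)
    except Exception as exc:
        return exc
    return None


node = FlakyNode(failures=1)
tracker = ToyTracker()
node.fields["free_disk_gb"] = 100     # set in memory by usage accounting

first = run_periodic(tracker, node)   # save() fails, cache already updated
second = run_periodic(tracker, node)  # no change detected, save() skipped

print(type(first).__name__, second)
print(node.saved, tracker.placement_updates)
```

After the second run the toy database still holds free_disk_gb as None, yet the placement counter has been incremented, which mirrors the inconsistent state described above.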

As a result, the scheduler can get the compute node resource provider back from placement even though the compute node record in the database was never updated, which results in hitting this code in the scheduler:

https://github.com/openstack/nova/blob/324da0532f3b59aa16233a93a260d289e55860fb/nova/scheduler/host_manager.py#L193

That leaves some of the HostState fields unset which in turn results in issues like bug 1834691 and bug 1834694.
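
To illustrate how unset HostState fields can break later scheduling steps, here is a hedged toy example. ToyHostState and ram_filter are hypothetical names, and the guard on free_disk_gb is an assumption modeled on the behavior described above, not a copy of the scheduler code.

```python
class ToyHostState:
    """Hypothetical, simplified stand-in for the scheduler's HostState."""

    def __init__(self):
        self.free_ram_mb = None
        self.free_disk_mb = None

    def update_from_compute_node(self, compute):
        # Assumed guard: a record without free_disk_gb is treated as
        # "not yet updated" and every field is left unset.
        if compute.get("free_disk_gb") is None:
            return
        self.free_ram_mb = compute["free_ram_mb"]
        self.free_disk_mb = compute["free_disk_gb"] * 1024


def ram_filter(host, requested_mb):
    """Hypothetical filter that assumes free_ram_mb is always a number."""
    return host.free_ram_mb >= requested_mb


stale_host = ToyHostState()
stale_host.update_from_compute_node({"free_disk_gb": None, "free_ram_mb": 8192})

fresh_host = ToyHostState()
fresh_host.update_from_compute_node({"free_disk_gb": 100, "free_ram_mb": 8192})

print(ram_filter(fresh_host, 2048))
try:
    ram_filter(stale_host, 2048)
except TypeError as exc:
    print("filter failed:", exc)
```

In this sketch the stale host never gets free_ram_mb populated, so the comparison in the filter raises a TypeError instead of returning a result.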

We could deal with the RT issues in a few ways, such as refusing to start the compute service if we can't create and update the compute node (rather than just catching and swallowing Exception in the ComputeManager), but that might have other side effects. An easy fix here is to roll back the changes to old_resources in the RT if compute_node.save() fails.
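
A minimal sketch of that rollback idea, assuming a simplified tracker: FlakyNode and FixedTracker are hypothetical names, and the actual change under review may be structured differently.

```python
import copy


class FlakyNode:
    """Hypothetical stand-in for a ComputeNode whose save() can fail."""

    def __init__(self, failures):
        self.fields = {"free_disk_gb": None}
        self.saved = dict(self.fields)  # what the "database" holds
        self._failures = failures

    def save(self):
        if self._failures > 0:
            self._failures -= 1
            raise ConnectionError("simulated dropped DB connection")
        self.saved = dict(self.fields)


class FixedTracker:
    """Hypothetical tracker that restores its cache when save() fails."""

    def __init__(self):
        self.old_resources = {}

    def _resource_change(self, node):
        if self.old_resources != node.fields:
            self.old_resources = copy.deepcopy(node.fields)
            return True
        return False

    def _update(self, node):
        # Remember the cached state before _resource_change mutates it.
        previous = copy.deepcopy(self.old_resources)
        if self._resource_change(node):
            try:
                node.save()
            except Exception:
                # Roll back so the next run sees a change and retries.
                self.old_resources = previous
                raise


node = FlakyNode(failures=1)
node.fields["free_disk_gb"] = 100
tracker = FixedTracker()

try:
    tracker._update(node)           # first run: save() fails
except ConnectionError:
    cache_after_failure = copy.deepcopy(tracker.old_resources)

tracker._update(node)               # second run: change detected, save() retried
print(cache_after_failure, node.saved)
```

Because the cache is restored before the exception propagates, the second run detects the change again and the save is retried.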

Matt Riedemann (mriedem) on 2019-06-28
tags: added: db scheduler

Fix proposed to branch: master
Review: https://review.opendev.org/668263

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress
Changed in nova:
assignee: Matt Riedemann (mriedem) → Chris Dent (cdent)
Chris Dent (cdent) on 2019-07-17
Changed in nova:
assignee: Chris Dent (cdent) → Matt Riedemann (mriedem)

Reviewed: https://review.opendev.org/668263
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=11cb42f396fdbc1d973e1a1b592c00896f646015
Submitter: Zuul
Branch: master

commit 11cb42f396fdbc1d973e1a1b592c00896f646015
Author: Matt Riedemann <email address hidden>
Date: Fri Jun 28 18:50:33 2019 -0400

    Restore RT.old_resources if ComputeNode.save() fails

    When starting nova-compute for the first time with a new node,
    the ResourceTracker will create a new ComputeNode record in
    _init_compute_node but without all of the fields set on the
    ComputeNode, for example "free_disk_gb".

    Later _update_usage_from_instances will set some fields on the
    ComputeNode record (even if there are no instances on the node,
    why - I don't know) like free_disk_gb.

    This will make the eventual call from _update() to _resource_change()
    update the value in the old_resources dict and return True, and then
    _update() will try to update those ComputeNode changes to the database.
    If that update fails, for example due to a DBConnectionError, the
    value in old_resources will still be for the current version of the node
    in memory but not what is actually in the database.

    Note that this failure does not result in the compute service failing
    to start because ComputeManager._update_available_resource_for_node
    traps the Exception and just logs it.

    A subsequent trip through the RT._update() method - because of the
    update_available_resource periodic task - will call _resource_change
    but because old_resources matches the current state of the node, it
    returns False and the RT does not attempt to persist the changes to
    the DB. _update() will then go on to call _update_to_placement
    which will create the resource provider in placement along with its
    inventory, making it potentially a candidate for scheduling.

    This can be a problem later in the scheduler because the
    HostState._update_from_compute_node method may skip setting fields
    on the HostState object if free_disk_gb is not set in the
    ComputeNode record - which can then break filters and weighers
    later in the scheduling process (see bug 1834691 and bug 1834694).

    The fix proposed here is simple: if the ComputeNode.save() in
    RT._update() fails, restore the previous value in old_resources
    so that the subsequent run through _resource_change will compare the
    correct state of the object and retry the update.

    An alternative to this would be killing the compute service on startup
    if there is a DB error but that could have unintended side effects,
    especially if the DB error is transient and can be fixed on the next
    try.

    Obviously the scheduler code needs to be more robust also, but those
    improvements are left for separate changes related to the other bugs
    mentioned above.

    Also, ComputeNode.update_from_virt_driver could be updated to set
    free_disk_gb if possible to work around the tight coupling in the
    HostState._update_from_compute_node code, but that's also sort of
    a whack-a-mole type change best made separatel...


Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/672038
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=39ed79ea09fbab590092ef6b1bd4899270f19c9b
Submitter: Zuul
Branch: stable/stein

commit 39ed79ea09fbab590092ef6b1bd4899270f19c9b
Author: Matt Riedemann <email address hidden>
Date: Fri Jun 28 18:50:33 2019 -0400

    Restore RT.old_resources if ComputeNode.save() fails

This issue was fixed in the openstack/nova 19.0.2 release.
