Most instance actions can be called concurrently

Bug #1821373 reported by Matthew Booth
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Low
Matthew Booth
Queens
Fix Committed
Low
Matthew Booth
Rocky
Fix Released
Low
Matthew Booth
Stein
Fix Released
Low
Matthew Booth

Bug Description

A customer reported that they were getting DB corruption if they called shelve twice in quick succession on the same instance. This should be prevented by the guard in nova.API.shelve, which does:

  instance.task_state = task_states.SHELVING
  instance.save(expected_task_state=[None])

This is intended to act as a robust gate against 2 instance actions happening concurrently. The first will set the task state to SHELVING, the second will fail because the task state is not SHELVING. The comparison is done atomically in db.instance_update_and_get_original(), and should be race free.

However, instance.save() shortcuts if there is no update and does not call db.instance_update_and_get_original(). Therefore this guard fails if we call the same operation twice:

  instance = get_instance()
    => Returned instance.task_state is None
  instance.task_state = task_states.SHELVING
  instance.save(expected_task_state=[None])
    => task_state was None, now SHELVING, updates = {'task_state': SHELVING}
    => db.instance_update_and_get_original() executes and succeeds

  instance = get_instance()
    => Returned instance.task_state is SHELVING
  instance.task_state = task_states.SHELVING
  instance.save(expected_task_state=[None])
    => task_state was SHELVING, still SHELVING, updates = {}
    => db.instance_update_and_get_original() does not execute, therefore doesn't raise the expected exception

This pattern is common to almost all instance actions in nova api. A quick scan suggests that all of the following actions are affected by this bug, and can therefore all potentially be executed multiple times concurrently for the same instance:

restore
force_stop
start
backup
snapshot
soft reboot
hard reboot
rebuild
revert_resize
resize
shelve
shelve_offload
unshelve
pause
unpause
suspend
resume
rescue
unrescue
set_admin_password
live_migrate
evacuate

tags: added: api
tags: added: race-condition
Revision history for this message
Matthew Booth (mbooth-9) wrote :

I intend to fix this by changing instance.save() such that it will:

1. return the expected InstanceUpdateConflict directly without hitting the db if instance.task_state doesn't match expected_task_state.

2. assert that task state is being changed if expected_task_state is set (this would be a programming error).

3. ditto for vm_state.

I believe unit test coverage in test_instance will be sufficient for this change.

Revision history for this message
Matthew Booth (mbooth-9) wrote :

When writing tests for the above I discovered that this is not the issue. Specifically, OVO doesn't check the previous value when setting an attribute, so when we do:

  instance.task_state = SHELVING

this always makes the object dirty, regardless of the previous value of task_state. Therefore in the above case we always hit the DB, and this isn't the problem.

Revision history for this message
Matthew Booth (mbooth-9) wrote :

The root cause of this issue appears to be in sqlalchemy.api._instance_update. This method modifies the contents of the values dict when called, including popping the value of expected_task_state. If this results in db conflict due a concurrent update and it is retried, the second invocation is called without expected_task_state in values, and therefore it isn't checked.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/658845

Changed in nova:
assignee: nobody → Matthew Booth (mbooth-9)
status: New → In Progress
Changed in nova:
importance: Undecided → Low
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/659317

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/659318

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/659320

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/658845
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=aae5c7aa3819ad9161fd2effed3872d540099230
Submitter: Zuul
Branch: master

commit aae5c7aa3819ad9161fd2effed3872d540099230
Author: Matthew Booth <email address hidden>
Date: Mon May 13 16:04:39 2019 +0100

    Fix retry of instance_update_and_get_original

    _instance_update modifies its 'values' argument. Consequently if it is
    retried due to an update conflict, the second invocation has the wrong
    arguments.

    A specific issue this causes is that if we called it with
    expected_task_state a concurrent modification to task_state will cause
    us to fail and retry. However, expected_task_state will have been popped
    from values on the first invocation and will not be present for the
    second. Consequently the second invocation will fail to perform the
    task_state check and therefore succeed, resulting in a race.

    We rewrite the old race unit test which wasn't testing the correct
    thing for 2 reasons:

    1. Due to the bug fixed in this patch, although we were calling
       update_on_match() twice, the second call didn't check the task state.
    2. side_effect=iterable returns function items without executing them,
       but we weren't hitting this due to the bug fixed in this patch.

    Closes-Bug: #1821373
    Change-Id: I01c63e685113bf30e687ccb14a4d18e344b306f6

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/659317
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=61fef49b1555c6509398fc47c21319c1b5e50505
Submitter: Zuul
Branch: stable/stein

commit 61fef49b1555c6509398fc47c21319c1b5e50505
Author: Matthew Booth <email address hidden>
Date: Mon May 13 16:04:39 2019 +0100

    Fix retry of instance_update_and_get_original

    _instance_update modifies its 'values' argument. Consequently if it is
    retried due to an update conflict, the second invocation has the wrong
    arguments.

    A specific issue this causes is that if we called it with
    expected_task_state a concurrent modification to task_state will cause
    us to fail and retry. However, expected_task_state will have been popped
    from values on the first invocation and will not be present for the
    second. Consequently the second invocation will fail to perform the
    task_state check and therefore succeed, resulting in a race.

    We rewrite the old race unit test which wasn't testing the correct
    thing for 2 reasons:

    1. Due to the bug fixed in this patch, although we were calling
       update_on_match() twice, the second call didn't check the task state.
    2. side_effect=iterable returns function items without executing them,
       but we weren't hitting this due to the bug fixed in this patch.

    Closes-Bug: #1821373
    Change-Id: I01c63e685113bf30e687ccb14a4d18e344b306f6
    (cherry picked from commit aae5c7aa3819ad9161fd2effed3872d540099230)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/659318
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4540cd6ef9731837cf4ed1e56850bcfa3512ae1c
Submitter: Zuul
Branch: stable/rocky

commit 4540cd6ef9731837cf4ed1e56850bcfa3512ae1c
Author: Matthew Booth <email address hidden>
Date: Mon May 13 16:04:39 2019 +0100

    Fix retry of instance_update_and_get_original

    _instance_update modifies its 'values' argument. Consequently if it is
    retried due to an update conflict, the second invocation has the wrong
    arguments.

    A specific issue this causes is that if we called it with
    expected_task_state a concurrent modification to task_state will cause
    us to fail and retry. However, expected_task_state will have been popped
    from values on the first invocation and will not be present for the
    second. Consequently the second invocation will fail to perform the
    task_state check and therefore succeed, resulting in a race.

    We rewrite the old race unit test which wasn't testing the correct
    thing for 2 reasons:

    1. Due to the bug fixed in this patch, although we were calling
       update_on_match() twice, the second call didn't check the task state.
    2. side_effect=iterable returns function items without executing them,
       but we weren't hitting this due to the bug fixed in this patch.

    Closes-Bug: #1821373
    Change-Id: I01c63e685113bf30e687ccb14a4d18e344b306f6
    (cherry picked from commit aae5c7aa3819ad9161fd2effed3872d540099230)
    (cherry picked from commit 61fef49b1555c6509398fc47c21319c1b5e50505)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.1

This issue was fixed in the openstack/nova 19.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.2.1

This issue was fixed in the openstack/nova 18.2.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/659320
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f9c2503609bfad0b871d9e87227aa2bb5c137467
Submitter: Zuul
Branch: stable/queens

commit f9c2503609bfad0b871d9e87227aa2bb5c137467
Author: Matthew Booth <email address hidden>
Date: Mon May 13 16:04:39 2019 +0100

    Fix retry of instance_update_and_get_original

    _instance_update modifies its 'values' argument. Consequently if it is
    retried due to an update conflict, the second invocation has the wrong
    arguments.

    A specific issue this causes is that if we called it with
    expected_task_state a concurrent modification to task_state will cause
    us to fail and retry. However, expected_task_state will have been popped
    from values on the first invocation and will not be present for the
    second. Consequently the second invocation will fail to perform the
    task_state check and therefore succeed, resulting in a race.

    We rewrite the old race unit test which wasn't testing the correct
    thing for 2 reasons:

    1. Due to the bug fixed in this patch, although we were calling
       update_on_match() twice, the second call didn't check the task state.
    2. side_effect=iterable returns function items without executing them,
       but we weren't hitting this due to the bug fixed in this patch.

    Closes-Bug: #1821373
    Change-Id: I01c63e685113bf30e687ccb14a4d18e344b306f6
    (cherry picked from commit aae5c7aa3819ad9161fd2effed3872d540099230)
    (cherry picked from commit 61fef49b1555c6509398fc47c21319c1b5e50505)
    (cherry picked from commit 4540cd6ef9731837cf4ed1e56850bcfa3512ae1c)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.11

This issue was fixed in the openstack/nova 17.0.11 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.0.0.0rc1

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.