Cells: Race between instance 'unlock' and 'stop' can cause 'stop' to fail

Bug #1466696 reported by melanie witt
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Won't Fix
Importance: Low
Assigned to: Unassigned
Milestone:

Bug Description

Observed in the tempest-dsvm-cells job during tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_lock_unlock_server

The test locks an instance, attempts to stop it, makes sure that fails, unlocks it, attempts to stop it, and makes sure that succeeds.

The problem happens in the sequence of actions "unlock" then "stop". The "unlock" does an instance.save() of the lock state (locked=False) at the top cell, which then syncs down to the child cell. If the "stop" request reaches the child cell before that locked=False save has synced, the "stop" fails with the following trace in screen-n-cell-child.txt:

2015-06-18 19:09:23.852 ERROR nova.cells.messaging [req-6b3584bb-b52a-41ba-8e88-000e14ba6ec6 ServerActionsTestJSON-1466672685 ServerActionsTestJSON-639331338] Error processing message locally
2015-06-18 19:09:23.852 16161 ERROR nova.cells.messaging Traceback (most recent call last):
2015-06-18 19:09:23.852 16161 ERROR nova.cells.messaging File "/opt/stack/new/nova/nova/cells/messaging.py", line 200, in _process_locally
2015-06-18 19:09:23.852 16161 ERROR nova.cells.messaging resp_value = self.msg_runner._process_message_locally(self)
2015-06-18 19:09:23.852 16161 ERROR nova.cells.messaging File "/opt/stack/new/nova/nova/cells/messaging.py", line 1256, in _process_message_locally
2015-06-18 19:09:23.852 16161 ERROR nova.cells.messaging return fn(message, **message.method_kwargs)
2015-06-18 19:09:23.852 16161 ERROR nova.cells.messaging File "/opt/stack/new/nova/nova/cells/messaging.py", line 850, in stop_instance
2015-06-18 19:09:23.852 16161 ERROR nova.cells.messaging clean_shutdown=clean_shutdown)
2015-06-18 19:09:23.852 16161 ERROR nova.cells.messaging File "/opt/stack/new/nova/nova/cells/messaging.py", line 839, in _call_compute_api_with_obj
2015-06-18 19:09:23.852 16161 ERROR nova.cells.messaging return fn(ctxt, instance, *args, **kwargs)
2015-06-18 19:09:23.852 16161 ERROR nova.cells.messaging File "/opt/stack/new/nova/nova/compute/api.py", line 214, in inner
2015-06-18 19:09:23.852 16161 ERROR nova.cells.messaging raise exception.InstanceIsLocked(instance_uuid=instance.uuid)
2015-06-18 19:09:23.852 16161 ERROR nova.cells.messaging InstanceIsLocked: Instance cb2485ba-e3e5-4668-869b-f145e5f28a1a is locked
2015-06-18 19:09:23.852 16161 ERROR nova.cells.messaging

Logstash query: message:"InstanceIsLocked" AND tags:"screen-n-cell-child.txt"

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiSW5zdGFuY2VJc0xvY2tlZFwiIEFORCB0YWdzOlwic2NyZWVuLW4tY2VsbC1jaGlsZC50eHRcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiNjA0ODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTQzNDY3NDMyOTQ0Mn0=
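
The check that raises in the trace above is the lock guard in nova/compute/api.py (the "inner" frame). A simplified sketch of that style of decorator, not the exact nova source, shows why the child cell rejects the "stop" while its copy of the instance still reads locked=True:

    import functools

    from nova import exception


    def check_instance_lock(function):
        """Simplified sketch of the lock guard seen in the traceback."""
        @functools.wraps(function)
        def inner(self, context, instance, *args, **kwargs):
            # In a cells deployment this reads the child cell's copy of the
            # instance. If the top cell's locked=False save has not synced
            # down yet, the child still sees locked=True and raises, which
            # is the race described above.
            if instance.locked and not context.is_admin:
                raise exception.InstanceIsLocked(instance_uuid=instance.uuid)
            return function(self, context, instance, *args, **kwargs)
        return inner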

Revision history for this message
Jordan Pittier (jordan-pittier) wrote :
Revision history for this message
melanie witt (melwitt) wrote :

I don't think so. The race makes it so the stop reaches the child cell first and fails; then the instance.save() of locked=False hits the child cell and unlocks it, so the teardown ends up succeeding.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/243327

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/243327
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=02de17c72ac0f0824df2b3d460739a58cb6b5285
Submitter: Jenkins
Branch: master

commit 02de17c72ac0f0824df2b3d460739a58cb6b5285
Author: Sean Dague <email address hidden>
Date: Mon Nov 9 16:54:19 2015 -0500

    skip lock_unlock_server test for cells

    The cells architecture of doing db replay makes this call
    fundamentally racey for cells. The contract is that if you request an
    action on a server and it is locked, you get a 409 at the API, and are
    done. If you get a 2xx, the resource is not locked, and life is good.

    Except in cells, the API db will be updated, and the API will return
    all is well. Except the db replay hasn't happened for the cell, so the
    additional check in the nova compute for locked blows up.

    The fix probably involves the nova-compute check doing a timed retry
    under cells because of replication lags.

    Change-Id: Ic18407b8372c286d8e258860f3e2bf76eceef2aa
    Related-Bug: #1466696
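
The merged change above only skips the test; the "timed retry" idea from the commit message is left as a possible follow-up. A hypothetical sketch of that approach, with an illustrative helper name, attempt count, and interval that are not part of nova, might look like:

    import time

    from nova import exception

    # Illustrative values only, not nova configuration options.
    LOCK_RECHECK_ATTEMPTS = 5
    LOCK_RECHECK_INTERVAL = 2  # seconds


    def wait_for_unlock(context, instance):
        """Hypothetical helper: give the top-to-child cell DB sync a chance
        to deliver locked=False before rejecting the action."""
        for _ in range(LOCK_RECHECK_ATTEMPTS):
            if not instance.locked or context.is_admin:
                return
            time.sleep(LOCK_RECHECK_INTERVAL)
            instance.refresh()  # re-read this cell's copy from the DB
        raise exception.InstanceIsLocked(instance_uuid=instance.uuid)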

Tony Breeds (o-tony)
tags: added: liberty-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/liberty)

Related fix proposed to branch: stable/liberty
Review: https://review.openstack.org/316150

Matt Riedemann (mriedem)
Changed in nova:
status: Confirmed → Won't Fix
tags: removed: liberty-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/liberty)

Reviewed: https://review.openstack.org/316150
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d836c7172d5662207ea89a4489d41fc9b28e34f5
Submitter: Jenkins
Branch: stable/liberty

commit d836c7172d5662207ea89a4489d41fc9b28e34f5
Author: Sean Dague <email address hidden>
Date: Mon Nov 9 16:54:19 2015 -0500

    skip lock_unlock_server test for cells

    The cells architecture of doing db replay makes this call
    fundamentally racey for cells. The contract is that if you request an
    action on a server and it is locked, you get a 409 at the API, and are
    done. If you get a 2xx, the resource is not locked, and life is good.

    Except in cells, the API db will be updated, and the API will return
    all is well. Except the db replay hasn't happened for the cell, so the
    additional check in the nova compute for locked blows up.

    The fix probably involves the nova-compute check doing a timed retry
    under cells because of replication lags.

    Conflicts:
            devstack/tempest-dsvm-cells-rc

    NOTE(mriedem): The conflict is due to an additional test being
    skipped in liberty that wasn't skipped in mitaka.

    Change-Id: Ic18407b8372c286d8e258860f3e2bf76eceef2aa
    Related-Bug: #1466696
    (cherry picked from commit 02de17c72ac0f0824df2b3d460739a58cb6b5285)

tags: added: in-stable-liberty