Race failure to delete security group

Bug #1460112 reported by Matt Riedemann
Affects                  Status        Importance  Assigned to      Milestone
grenade                  Fix Released  Undecided   Matt Riedemann
python-openstackclient   Fix Released  Medium      Matt Riedemann

Bug Description

Grenade deletes a server and then the security group used by that server in quick succession, and there is a race where the security group is still associated with the instance when the security group DELETE request hits nova-api:

http://logs.openstack.org/97/186497/4/gate/gate-grenade-dsvm/6e5d0fe/logs/grenade.sh.txt.gz#_2015-05-28_20_14_57_391

The server delete request comes in here:

http://logs.openstack.org/97/186497/4/gate/gate-grenade-dsvm/6e5d0fe/logs/new/screen-n-api.txt.gz#_2015-05-28_20_14_56_024

2015-05-28 20:14:56.024 INFO nova.osapi_compute.wsgi.server [req-369befb4-c78d-4ef7-9c07-7034a9f15432 nova_grenade nova_grenade] 127.0.0.1 "DELETE /v2/0ae150f86d8441dba2264ccb22cea2cd/servers/715ba0c5-15d8-47f7-a630-4be5433a8ab4 HTTP/1.1" status: 204 len: 203 time: 0.2008679

And a second later the security group request comes in and fails:

http://logs.openstack.org/97/186497/4/gate/gate-grenade-dsvm/6e5d0fe/logs/new/screen-n-api.txt.gz#_2015-05-28_20_14_57_391

2015-05-28 20:14:57.351 DEBUG nova.api.openstack.wsgi [req-f52e4b8e-4db8-4e4f-a1a2-eb30bb65bd25 nova_grenade nova_grenade] Calling method '<bound method SecurityGroupController.delete of <nova.api.openstack.compute.contrib.security_groups.SecurityGroupController object at 0x7fd1c92b7dd0>>' _process_stack /opt/stack/new/nova/nova/api/openstack/wsgi.py:783
2015-05-28 20:14:57.390 INFO nova.api.openstack.wsgi [req-f52e4b8e-4db8-4e4f-a1a2-eb30bb65bd25 nova_grenade nova_grenade] HTTP exception thrown: Security group is still in use

--

Checking the n-net logs on the instance uuid, we see network deallocation happening here:

http://logs.openstack.org/97/186497/4/gate/gate-grenade-dsvm/6e5d0fe/logs/new/screen-n-net.txt.gz#_2015-05-28_20_14_57_376

2015-05-28 20:14:57.376 DEBUG nova.network.manager [req-369befb4-c78d-4ef7-9c07-7034a9f15432 nova_grenade nova_grenade] [instance: 715ba0c5-15d8-47f7-a630-4be5433a8ab4] Network deallocation for instance deallocate_for_instance /opt/stack/new/nova/nova/network/manager.py:559
2015-05-28 20:14:57.423 DEBUG nova.network.manager [req-369befb4-c78d-4ef7-9c07-7034a9f15432 nova_grenade nova_grenade] [instance: 715ba0c5-15d8-47f7-a630-4be5433a8ab4] Deallocate fixed ip 10.1.0.2 deallocate_fixed_ip /opt/stack/new/nova/nova/network/manager.py:975

And finishes up here:

http://logs.openstack.org/97/186497/4/gate/gate-grenade-dsvm/6e5d0fe/logs/new/screen-n-net.txt.gz#_2015-05-28_20_14_57_795

2015-05-28 20:14:57.795 INFO nova.network.manager [req-369befb4-c78d-4ef7-9c07-7034a9f15432 nova_grenade nova_grenade] [instance: 715ba0c5-15d8-47f7-a630-4be5433a8ab4] Network deallocated for instance (fixed ips: '['10.1.0.2']')

In the compute API, it's doing a local delete by calling instance.destroy(), which calls the instance_destroy() DB API method, and that goes through the related associations, like security groups, and deletes them:

http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py?id=2015.1.0#n1706

The odd thing is, if this is a local delete then the call shouldn't have returned to grenade until the server delete request was complete and a 204 came back, so by the time grenade issues the security group delete, the association should already be gone. This isn't an RPC cast from what I can tell.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Oh, never mind, it is a cast: the instance has a host and the service group API reports the compute service as up, so the API sends the delete request to the compute manager:

http://git.openstack.org/cgit/openstack/nova/tree/nova/compute/api.py?id=2015.1.0#n1775

which eventually gets us here:

http://git.openstack.org/cgit/openstack/nova/tree/nova/compute/manager.py?id=2015.1.0#n2638

So it makes sense that we have a race here, since grenade isn't waiting for the instance to be gone before it tries to delete the security group, which is still associated with the instance.

We could add a 1-second sleep to grenade, which would probably fix this, or loop on a server GET until we get a 404.
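
For example, a polling loop along these lines (just a sketch of the idea, not actual grenade code; it leans on openstack server show failing once the server is gone, which stands in for the 404):

# hypothetical: keep polling until the server GET starts failing, then move on
while openstack server show $NOVA_SERVER >/dev/null 2>&1; do
    sleep 1
done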

Matt Riedemann (mriedem)
description: updated
Revision history for this message
Matt Riedemann (mriedem) wrote :

We could add a --wait option to openstack server delete in python-openstackclient and then use that here:

http://git.openstack.org/cgit/openstack-dev/grenade/tree/projects/60_nova/resources.sh#n113

openstack server delete $NOVA_SERVER

So we wait for the server to be deleted before we call:

nova secgroup-delete $NOVA_USER || /bin/true
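
Once that option exists, the relevant part of resources.sh could look something like this (hypothetical ordering, assuming the --wait flag is available in the installed client):

# block until the server is actually gone, then remove its security group
openstack server delete --wait $NOVA_SERVER
nova secgroup-delete $NOVA_USER || /bin/true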

Matt Riedemann (mriedem)
Changed in python-openstackclient:
status: New → Confirmed
assignee: nobody → Matt Riedemann (mriedem)
Changed in grenade:
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to python-openstackclient (master)

Fix proposed to branch: master
Review: https://review.openstack.org/186882

Changed in python-openstackclient:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to python-openstackclient (master)

Reviewed: https://review.openstack.org/186882
Committed: https://git.openstack.org/cgit/openstack/python-openstackclient/commit/?id=224d375ef4120998dc51fbf55f1778d1ccf118a0
Submitter: Jenkins
Branch: master

commit 224d375ef4120998dc51fbf55f1778d1ccf118a0
Author: Matt Riedemann <email address hidden>
Date: Fri May 29 11:55:09 2015 -0700

    Add --wait to server delete

    This allows the server delete command to wait for the server to be
    deleted (obviously).

    The wait method is the same model that Tempest uses, i.e. wait for a 404
    on server GET (successful deletion), fail if the server went to ERROR
    status, or fail if a timeout is reached. The default timeout of 300
    seconds is also what Tempest uses.

    Closes-Bug: #1460112

    Change-Id: I0e66c400903e82832944d1cad61e7eb30177c3e8

Changed in python-openstackclient:
status: In Progress → Fix Committed
Dean Troyer (dtroyer)
Changed in python-openstackclient:
milestone: none → m12
Changed in python-openstackclient:
importance: Undecided → Medium
status: Fix Committed → Fix Released
Changed in grenade:
assignee: nobody → Matt Riedemann (mriedem)
status: Confirmed → In Progress
Revision history for this message
Matt Riedemann (mriedem) wrote :

Here is the patch to grenade: https://review.openstack.org/#/c/199771/

Revision history for this message
Matt Riedemann (mriedem) wrote :

Related global-requirements update: https://review.openstack.org/#/c/199773/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to grenade (master)

Reviewed: https://review.openstack.org/199771
Committed: https://git.openstack.org/cgit/openstack-dev/grenade/commit/?id=48d23a6bae0ffa7d413f564bc710e30596e187f7
Submitter: Jenkins
Branch: master

commit 48d23a6bae0ffa7d413f564bc710e30596e187f7
Author: Matt Riedemann <email address hidden>
Date: Wed Jul 8 16:18:38 2015 -0700

    Wait for nova_server1 to be deleted before deleting its security group

    This adds a 30 second wait loop for the nova_server1 resource to be
    deleted before trying to delete its related security group so we don't
    hit a 400 error from nova-api.

    This is basically a copy of what is done when deleting the
    cinder_server1 resource.

    Added notes that we can clean this up with the --wait option on the
    `openstack server delete` command when python-openstackclient>=1.4.0 at
    least in global-requirements, which will have to wait until at least the
    M release since stable/kilo caps osc at <1.1.0.

    Closes-Bug: #1460112

    Change-Id: I16981765c12db7145f7c4754d454431afb6d5cbb
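
A bounded wait of the kind the commit describes would be roughly this shape (a sketch only, not the literal merged grenade code): poll for up to 30 seconds until the server GET fails, then delete the security group.

# hypothetical 30 second wait loop, mirroring the commit description
for i in $(seq 30); do
    if ! openstack server show $NOVA_SERVER >/dev/null 2>&1; then
        break
    fi
    sleep 1
done
nova secgroup-delete $NOVA_USER || /bin/true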

Changed in grenade:
status: In Progress → Fix Committed
Sean Dague (sdague)
Changed in grenade:
status: Fix Committed → Fix Released