OpenStack Compute (Nova)

Deleting a server will cause temporary 404 from GET /servers

Reported by Brian Waldon on 2011-11-02
This bug affects 7 people
Affects: OpenStack Compute (nova)
Importance: Medium
Assigned to: Dean Troyer

Bug Description

I did a 'nova list' (GET /servers) while server '443c3600-7724-4450-9b21-47ffa8544ad3' was being deleted. For some reason, I got a 404 back as it was being removed from the list. This should not happen.

root@nova1:~# nova list
+--------------------------------------+-------+--------+----------+
| ID                                   | Name  | Status | Networks |
+--------------------------------------+-------+--------+----------+
| 443c3600-7724-4450-9b21-47ffa8544ad3 | pants | ACTIVE |          |
+--------------------------------------+-------+--------+----------+
root@nova1:~# nova list
The resource could not be found. (HTTP 404)

Thierry Carrez (ttx) on 2011-11-08
Changed in nova:
importance: Undecided → Medium
status: New → Confirmed
Jesse Andrews (anotherjesse) wrote :

I have run into this as well. Here is my traceback:

    File "/opt/python-novaclient/novaclient/v1_1/servers.py", line 247, in list
      return self._list("/servers%s%s" % (detail, query_string), "servers")
    File "/opt/python-novaclient/novaclient/base.py", line 69, in _list
      resp, body = self.api.client.get(url)
    File "/opt/python-novaclient/novaclient/client.py", line 131, in get
      return self._cs_request(url, 'GET', **kwargs)
    File "/opt/python-novaclient/novaclient/client.py", line 119, in _cs_request
      **kwargs)
    File "/opt/python-novaclient/novaclient/client.py", line 102, in request
      raise exceptions.from_response(resp, body)

I have a script that creates a VM, waits for it to launch, then deletes it and verifies that the deletion occurs. It fails a significant fraction of the time because of this exception.

Jesse Andrews (anotherjesse) wrote :

my script that triggers this.

It fails on:

    if not any([s.id == server_id for s in nc.servers.list()]):
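
A minimal sketch of how a check like the one above could tolerate the transient 404 while the bug is unfixed. Everything here is an assumption for illustration: `list_servers` is a hypothetical callable standing in for `nc.servers.list()`, `NotFound` is a hypothetical stand-in for the client's HTTP 404 exception, and `wait_for_delete` is not part of the script in this report.

```python
import time


class NotFound(Exception):
    """Hypothetical stand-in for the client's HTTP 404 exception."""


def wait_for_delete(list_servers, server_id, attempts=10, delay=0.0):
    """Return True once server_id no longer appears in the listing.

    list_servers is a hypothetical callable that returns objects with an
    ``id`` attribute. A NotFound raised by the listing itself is treated
    as a transient failure and retried.
    """
    for _ in range(attempts):
        try:
            servers = list_servers()
        except NotFound:
            # The listing raced with the delete; try again.
            time.sleep(delay)
            continue
        if not any(s.id == server_id for s in servers):
            return True
        time.sleep(delay)
    return False
```

This only works around the symptom on the client side; the listing call itself should not return a 404.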

Dean Troyer (dtroyer) on 2012-01-05
Changed in nova:
assignee: nobody → Dean Troyer (dtroyer)
Dean Troyer (dtroyer) wrote :

I see this in the logs a bit before the HTTP 404:

2012-01-04 22:40:00,376 DEBUG nova.api.openstack.common [1ca75227-460c-4cca-93bc-1e9d70a571c7 demo 2] Generated ACTIVE from vm_state=active task_state=deleting. from (pid=22694) status_from_state /opt/stack/nova/nova/api/openstack/common.py:93

Is the ACTIVE status here causing this indirectly?

Dean - I was thinking the bug was somewhere in the detail server code (perhaps where you cite), which iterates over each of the servers and grabs a whole bunch of extra info about the instance, but I couldn't isolate any specific problem area (I was looking for an errant join). In the meantime, some extra logging could help here: https://github.com/openstack/nova/blob/master/nova/api/openstack/v2/servers.py#L71

Dean Troyer (dtroyer) wrote :

I finally found the source of the exception: https://github.com/openstack/nova/blob/master/nova/api/openstack/v2/contrib/extended_status.py#L69. Extensions are still new to me, so the flow here isn't obvious, but it would explain why every /servers/detail API call appears to be duplicated in the logs.

I'm still not certain about the status=ACTIVE when task_state=deleting, but I don't think that is in play here.

Dean Troyer (dtroyer) wrote :

The race condition we see here is between the original call to compute.api.get_all() and when extended_status _get_and_extend_all() gets around to looping through the server list to call compute.api.routing_get() for each one. I think the Right Thing here would be to remove the server from body['servers'], log a warning and continue. I'm letting the dev cluster test this overnight.
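
A rough sketch of the approach described above, not the actual patch. The names are assumptions: `get_instance` is a hypothetical stand-in for the per-server lookup (compute.api.routing_get() in the real code), `NotFound` stands in for whatever exception that lookup raises for a missing instance, and the `vm_state`/`task_state` keys are illustrative rather than the extension's real attribute names.

```python
import logging

LOG = logging.getLogger(__name__)


class NotFound(Exception):
    """Hypothetical stand-in for the 'instance not found' exception."""


def extend_all(servers, get_instance):
    """Add extended status to each server dict, dropping vanished ones.

    servers is the list from the initial get_all(); get_instance is a
    hypothetical per-server lookup that raises NotFound if the instance
    was deleted after that initial listing.
    """
    # Iterate over a copy so removing from the original list is safe.
    for server in list(servers):
        try:
            instance = get_instance(server["id"])
        except NotFound:
            # The server was deleted between the two lookups: warn and skip
            # it instead of failing the whole request with a 404.
            LOG.warning("Server %s not found during listing; skipping",
                        server["id"])
            servers.remove(server)
            continue
        server["vm_state"] = instance.get("vm_state")
        server["task_state"] = instance.get("task_state")
    return servers
```

The point of catching the exception per server is that one instance vanishing mid-request only shortens the list; the rest of the response is still returned.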

Dean Troyer (dtroyer) on 2012-01-06
Changed in nova:
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/2874
Committed: http://github.com/openstack/nova/commit/51c0d545253b9f5618d1923aea3f7061da6cd60b
Submitter: Jenkins
Branch: master

commit 51c0d545253b9f5618d1923aea3f7061da6cd60b
Author: Dean Troyer <email address hidden>
Date: Fri Jan 6 00:22:52 2012 -0600

    Bug 885267: Fix GET /servers during instance delete

    There is a period during an instance delete when GET /servers
    will fail occasionally. The race condition is during GET /servers
    between the initial get_all() and when the extended_status extension
    re-retrieves individual servers via compute.api.routing_get().
    We log a warning and remove the offending server from the list
    as it no longer exists.

    Change-Id: Id75723a21c0d6dc20f446560847e5b8522ec3262

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2012-01-25
Changed in nova:
milestone: none → essex-3
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2012-04-05
Changed in nova:
milestone: essex-3 → 2012.1