Volumes left in Error state prevent tear down of nova compute resource

Bug #1213215 reported by Matthew Treinish
This bug affects 4 people
Affects: tempest
Status: Fix Released
Importance: High
Assigned to: John Griffith

Bug Description

Occasionally, running tempest in parallel fails several tests with timeout errors. The only non-timeout failure is that the ServerRescueTest failed to delete a volume because it was still marked as in use. My guess is that the leftover volume is somehow interfering with the other tests and causing them to time out, but I haven't looked at the logs in detail, so it's just a wild guess.

2013-08-16 14:11:42.074 | ======================================================================
2013-08-16 14:11:42.075 | FAIL: tempest.api.compute.servers.test_disk_config.ServerDiskConfigTestJSON.test_rebuild_server_with_auto_disk_config[gate]
2013-08-16 14:11:42.075 | tempest.api.compute.servers.test_disk_config.ServerDiskConfigTestJSON.test_rebuild_server_with_auto_disk_config[gate]
2013-08-16 14:11:42.075 | ----------------------------------------------------------------------
2013-08-16 14:11:42.075 | _StringException: Empty attachments:
2013-08-16 14:11:42.075 | stderr
2013-08-16 14:11:42.076 | stdout
2013-08-16 14:11:42.076 |
2013-08-16 14:11:42.076 | Traceback (most recent call last):
2013-08-16 14:11:42.076 | File "tempest/api/compute/servers/test_disk_config.py", line 64, in test_rebuild_server_with_auto_disk_config
2013-08-16 14:11:42.076 | wait_until='ACTIVE')
2013-08-16 14:11:42.076 | File "tempest/api/compute/base.py", line 140, in create_server
2013-08-16 14:11:42.076 | server['id'], kwargs['wait_until'])
2013-08-16 14:11:42.077 | File "tempest/services/compute/json/servers_client.py", line 160, in wait_for_server_status
2013-08-16 14:11:42.077 | time.sleep(self.build_interval)
2013-08-16 14:11:42.077 | File "/usr/local/lib/python2.7/dist-packages/fixtures/_fixtures/timeout.py", line 52, in signal_handler
2013-08-16 14:11:42.077 | raise TimeoutException()
2013-08-16 14:11:42.077 | TimeoutException
2013-08-16 14:11:42.077 |
2013-08-16 14:11:42.077 |
2013-08-16 14:11:42.078 | ======================================================================
2013-08-16 14:11:42.078 | FAIL: setUpClass (tempest.api.compute.images.test_image_metadata.ImagesMetadataTestXML)
2013-08-16 14:11:42.078 | setUpClass (tempest.api.compute.images.test_image_metadata.ImagesMetadataTestXML)
2013-08-16 14:11:42.078 | ----------------------------------------------------------------------
2013-08-16 14:11:42.078 | _StringException: Traceback (most recent call last):
2013-08-16 14:11:42.078 | File "tempest/api/compute/images/test_image_metadata.py", line 46, in setUpClass
2013-08-16 14:11:42.078 | cls.client.wait_for_image_status(cls.image_id, 'ACTIVE')
2013-08-16 14:11:42.079 | File "tempest/services/compute/xml/images_client.py", line 167, in wait_for_image_status
2013-08-16 14:11:42.079 | raise exceptions.TimeoutException
2013-08-16 14:11:42.079 | TimeoutException: Request timed out
2013-08-16 14:11:42.079 |
2013-08-16 14:11:42.079 |
2013-08-16 14:11:42.079 | ======================================================================
2013-08-16 14:11:42.079 | FAIL: tempest.api.compute.servers.test_server_rescue.ServerRescueTestJSON.test_rescued_vm_detach_volume[gate,negative]
2013-08-16 14:11:42.080 | tempest.api.compute.servers.test_server_rescue.ServerRescueTestJSON.test_rescued_vm_detach_volume[gate,negative]
2013-08-16 14:11:42.080 | ----------------------------------------------------------------------
2013-08-16 14:11:42.080 | _StringException: Empty attachments:
2013-08-16 14:11:42.080 | stderr
2013-08-16 14:11:42.080 | stdout
2013-08-16 14:11:42.080 |
2013-08-16 14:11:42.081 | Traceback (most recent call last):
2013-08-16 14:11:42.081 | File "tempest/api/compute/servers/test_server_rescue.py", line 184, in test_rescued_vm_detach_volume
2013-08-16 14:11:42.081 | self.servers_client.wait_for_server_status(self.server_id, 'RESCUE')
2013-08-16 14:11:42.081 | File "tempest/services/compute/json/servers_client.py", line 160, in wait_for_server_status
2013-08-16 14:11:42.081 | time.sleep(self.build_interval)
2013-08-16 14:11:42.081 | File "/usr/local/lib/python2.7/dist-packages/fixtures/_fixtures/timeout.py", line 52, in signal_handler
2013-08-16 14:11:42.081 | raise TimeoutException()
2013-08-16 14:11:42.082 | TimeoutException
2013-08-16 14:11:42.082 |
2013-08-16 14:11:42.082 |
2013-08-16 14:11:42.082 | ======================================================================
2013-08-16 14:11:42.082 | FAIL: tearDownClass (tempest.api.compute.servers.test_server_rescue.ServerRescueTestJSON)
2013-08-16 14:11:42.082 | tearDownClass (tempest.api.compute.servers.test_server_rescue.ServerRescueTestJSON)
2013-08-16 14:11:42.082 | ----------------------------------------------------------------------
2013-08-16 14:11:42.083 | _StringException: Traceback (most recent call last):
2013-08-16 14:11:42.083 | File "tempest/api/compute/servers/test_server_rescue.py", line 92, in tearDownClass
2013-08-16 14:11:42.083 | client.delete_volume(str(cls.volume_to_detach['id']).strip())
2013-08-16 14:11:42.083 | File "tempest/services/compute/json/volumes_extensions_client.py", line 84, in delete_volume
2013-08-16 14:11:42.083 | return self.delete("os-volumes/%s" % str(volume_id))
2013-08-16 14:11:42.083 | File "tempest/common/rest_client.py", line 265, in delete
2013-08-16 14:11:42.083 | return self.request('DELETE', url, headers)
2013-08-16 14:11:42.084 | File "tempest/common/rest_client.py", line 387, in request
2013-08-16 14:11:42.084 | resp, resp_body)
2013-08-16 14:11:42.084 | File "tempest/common/rest_client.py", line 437, in _error_checker
2013-08-16 14:11:42.084 | raise exceptions.BadRequest(resp_body)
2013-08-16 14:11:42.084 | BadRequest: Bad request
2013-08-16 14:11:42.084 | Details: {u'badRequest': {u'message': u'Invalid input received: Invalid volume: Volume status must be available or error, but current status is: in-use', u'code': 400}}
2013-08-16 14:11:42.084 |
2013-08-16 14:11:42.085 |
2013-08-16 14:11:42.085 | ======================================================================
2013-08-16 14:11:42.085 | FAIL: process-returncode
2013-08-16 14:11:42.085 | process-returncode
2013-08-16 14:11:42.085 | ----------------------------------------------------------------------
2013-08-16 14:11:42.085 | _StringException: Binary content:
2013-08-16 14:11:42.085 | traceback (test/plain; charset="utf8")
2013-08-16 14:11:42.086 |
2013-08-16 14:11:42.086 |
2013-08-16 14:11:42.086 | ======================================================================
2013-08-16 14:11:42.086 | FAIL: process-returncode
2013-08-16 14:11:42.086 | process-returncode
2013-08-16 14:11:42.086 | ----------------------------------------------------------------------
2013-08-16 14:11:42.086 | _StringException: Binary content:
2013-08-16 14:11:42.087 | traceback (test/plain; charset="utf8")
2013-08-16 14:11:42.087 |
2013-08-16 14:11:42.087 |
2013-08-16 14:11:42.087 | ======================================================================
2013-08-16 14:11:42.087 | FAIL: process-returncode
2013-08-16 14:11:42.087 | process-returncode
2013-08-16 14:11:42.087 | ----------------------------------------------------------------------
2013-08-16 14:11:42.088 | _StringException: Binary content:
2013-08-16 14:11:42.088 | traceback (test/plain; charset="utf8")
2013-08-16 14:11:42.088 |

A set of failing logs can be found here:
http://logs.openstack.org/97/41397/5/gate/gate-tempest-devstack-vm-testr-full/a6ab053/
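
The tearDownClass failure above is Cinder refusing to delete a volume that is still attached: a delete is only accepted once the volume status is 'available' or 'error'. Below is a minimal sketch of a more defensive teardown, assuming hypothetical client objects and method names (get_volume, detach_volume, delete_volume) rather than the real tempest clients:

    import time

    def safe_delete_volume(volumes_client, servers_client, server_id, volume_id,
                           timeout=300, interval=5):
        """Best-effort volume removal for a test teardown (illustrative only)."""
        volume = volumes_client.get_volume(volume_id)          # hypothetical getter
        if volume['status'] == 'in-use':
            # Cinder rejects deleting an attached volume, so ask Nova to
            # detach it first and wait for the status to settle.
            servers_client.detach_volume(server_id, volume_id)  # hypothetical call
            deadline = time.time() + timeout
            while volume['status'] not in ('available', 'error'):
                if time.time() > deadline:
                    raise RuntimeError('volume %s never became deletable' % volume_id)
                time.sleep(interval)
                volume = volumes_client.get_volume(volume_id)
        volumes_client.delete_volume(volume_id)

With something along those lines in the teardown path, the BadRequest above would be avoided even when an earlier failure left the volume attached.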

Revision history for this message
Attila Fazekas (afazekas) wrote :

I wonder why this https://bugs.launchpad.net/tempest/+bug/1205344 is not critical in nova.

The other thing that could be related is rescue task state management:
https://bugs.launchpad.net/tempest/+bug/1170118

tags: added: testing
Revision history for this message
ChangBo Guo(gcb) (glongwave) wrote :

I met a similar issue showing "Invalid volume: Volume status must be available or error, but current status is: in-use".
See http://logs.openstack.org/55/46855/3/check/gate-tempest-devstack-vm-full/9fc9c69/testr_results.html.gz

ft13.1: tearDownClass (tempest.scenario.test_minimum_basic.TestMinimumBasicScenario)_StringException: Traceback (most recent call last):
  File "tempest/scenario/manager.py", line 239, in tearDownClass
    thing.delete()
  File "/opt/stack/new/python-cinderclient/cinderclient/v1/volumes.py", line 35, in delete
    self.manager.delete(self)
  File "/opt/stack/new/python-cinderclient/cinderclient/v1/volumes.py", line 215, in delete
    self._delete("/volumes/%s" % base.getid(volume))
  File "/opt/stack/new/python-cinderclient/cinderclient/base.py", line 162, in _delete
    resp, body = self.api.client.delete(url)
  File "/opt/stack/new/python-cinderclient/cinderclient/client.py", line 216, in delete
    return self._cs_request(url, 'DELETE', **kwargs)
  File "/opt/stack/new/python-cinderclient/cinderclient/client.py", line 174, in _cs_request
    **kwargs)
  File "/opt/stack/new/python-cinderclient/cinderclient/client.py", line 157, in request
    raise exceptions.from_response(resp, body)
BadRequest: Invalid volume: Volume status must be available or error, but current status is: in-use (HTTP 400) (Request-ID: req-2ec82421-8c3a-4f86-9b6e-f6ce2de3c4db)

Sean Dague (sdague)
Changed in cinder:
importance: Undecided → High
Changed in nova:
importance: Undecided → High
Changed in cinder:
status: New → Confirmed
Changed in nova:
status: New → Confirmed
Changed in tempest:
assignee: Adalberto Medeiros (adalbas) → nobody
Sean Dague (sdague)
summary: - ServerRescueTest tearDownClass fails with volume status being in-use
+ Volumes left in Error state prevent tear down of nova compute resource
tags: added: gate-failure
Revision history for this message
John Griffith (john-griffith) wrote :

The issue from the perspective of the Cinder delete is that the tempest minimum scenario test doesn't bother to deal with things like failures in its sequence. What's happening here is that the ssh step raises a timeout exception which is not handled and blows things up. So we dump out of the scenario test and try to do cleanup, which is fine, but we left the instance in its current state with a volume attached.

From the volume perspective, the fix is to just catch the exception and do some proper cleanup. I'll put a patch up in tempest in a moment to at least address that portion of it.
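
Roughly, the shape of the fix John describes is to register the detach as a cleanup as soon as the attach succeeds, so it runs even when a later step such as ssh raises. A self-contained toy sketch (plain unittest, with stubbed attach/detach/ssh helpers standing in for the scenario manager, not the real tempest code):

    import unittest

    class SSHTimeout(Exception):
        """Stand-in for the timeout raised by the scenario's ssh helper."""

    class MinimumBasicSketch(unittest.TestCase):
        """Toy version of the failure mode; the helpers below are stubs."""

        def nova_volume_attach(self):
            self.attached = True       # pretend Nova attached the volume

        def nova_volume_detach(self):
            self.attached = False      # pretend Nova detached it again

        def ssh_to_server(self):
            raise SSHTimeout()         # simulate the unhandled timeout in the bug

        def test_attach_volume_and_ssh(self):
            self.nova_volume_attach()
            # Register the detach immediately, so teardown still gets the
            # volume back to a deletable state if anything below fails.
            self.addCleanup(self.nova_volume_detach)
            try:
                self.ssh_to_server()
            except SSHTimeout:
                self.fail('ssh to the server timed out')

    if __name__ == '__main__':
        unittest.main()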

no longer affects: cinder
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tempest (master)

Fix proposed to branch: master
Review: https://review.openstack.org/62014

Changed in tempest:
assignee: nobody → John Griffith (john-griffith)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tempest (master)

Reviewed: https://review.openstack.org/62014
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=5462b33673ea23eb9db16d44035d46d29fa455bb
Submitter: Jenkins
Branch: master

commit 5462b33673ea23eb9db16d44035d46d29fa455bb
Author: john-griffith <email address hidden>
Date: Fri Dec 13 10:04:43 2013 -0500

    Catch ssh exception and detach in min scenario

    The test_minimum_basic test goes through a number of steps but
    doesn't do any clean up if a failure occurs. The master clean-up
    routine is ok, but it doesn't account for things like attached volumes.

    One issue here is that ssh_to_server times out due to some other
    problem and raises an SSHTimeout exception. This exception isn't dealt
    with, so the test dumps out and tries to clean up by deleting the instance
    and the volume, but the volume delete fails because its status is
    still in-use.

    This exposes another problem: it seems that nova delete will clean up
    an instance but fairly regularly neglects to detach the volume. This patch
    adds an addCleanup() call to nova.volume_detach.

    Change-Id: I76d19f096d6cb4122eea2e7e5de4eb37b97a4b87
    Partial-Bug: #1213215
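
As a side note, an addCleanup() is sufficient here because per-test cleanups run at the end of each test, before the class-level teardown that deletes the volume, and within a test they run in LIFO order. A tiny demonstration of that ordering with plain unittest (not tempest code):

    import unittest

    class CleanupOrder(unittest.TestCase):
        def test_lifo_cleanups(self):
            calls = []
            self.addCleanup(calls.append, 'delete volume')   # registered first, runs last
            self.addCleanup(calls.append, 'detach volume')   # registered last, runs first
            self.doCleanups()                                # run the cleanups now
            self.assertEqual(calls, ['detach volume', 'delete volume'])

    if __name__ == '__main__':
        unittest.main()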

Revision history for this message
Joe Gordon (jogo) wrote :

It's unclear to me how this is a nova bug; nova is in the same boat as cinder ("The issue from the perspective of the Cinder delete is that the tempest min scenario test doesn't bother to deal with things like failures in its sequence").

Changed in nova:
status: Confirmed → Invalid
importance: High → Undecided
no longer affects: nova
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tempest (stable/havana)

Fix proposed to branch: stable/havana
Review: https://review.openstack.org/72056

Changed in cinder:
importance: Undecided → High
assignee: nobody → John Griffith (john-griffith)
Revision history for this message
John Griffith (john-griffith) wrote :

@Joe
I think there's some confusion because a number of different situations are being logged and commented on in this bug. You're correct that all of them boil down to cleanup in one way or another, but a number of them have a failing nova instance boot as their root cause.

I've opened a separate bug for one of these common cases:
https://bugs.launchpad.net/tempest/+bug/1278255

I'm still sifting through a number of other failures that have this signature and are getting logged against this item.

Revision history for this message
Mike Perez (thingee) wrote :

John, what's left for Cinder here?

Changed in tempest:
status: In Progress → Fix Committed
no longer affects: cinder
Changed in tempest:
status: Fix Committed → Fix Released