Resource cleanup fails to properly handle broken resources.

Bug #1469622 reported by Pawel Rozlach on 2015-06-29
This bug affects 4 people
Affects: Rally
Importance: Low
Assigned to: Unassigned

Bug Description

When OpenStack is unable to remove a resource after a test, or fails to remove it quickly enough, Rally silently waits for 600 seconds and then proceeds as if nothing happened. Subsequent invocations of "rally task start" also wait for the 600-second timeout, without any message whatsoever.

Recently we had a situation where a VM created by the Rally tests got stuck in the "Deleting" state, and neither Rally nor the OpenStack command-line tools were able to remove it. We run Rally periodically, and this caused subsequent executions of "rally task start" to wait for the full 600-second timeout before reporting success. Before we cleaned up the instance in the OpenStack database, we investigated Rally a bit, and it seems there is a cleanup procedure that tries to remove all instances in the tenant used by Rally. What is more, this procedure has a hardcoded timeout of 600 s (I think it is rally/benchmark/context/cleanup/base.py, the definition of the resource decorator, with its timeout=600 function argument).
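For illustration, the silent wait described above boils down to a polling loop like the following sketch (hypothetical code; the real implementation in rally/benchmark/context/cleanup/base.py is more involved):

```python
import time


def wait_for_deletion(is_deleted, timeout=600, interval=1):
    """Poll until the resource disappears or the timeout elapses.

    ``is_deleted`` is any zero-argument callable that returns True once
    the resource is gone. Note that on timeout the function simply
    returns False -- mirroring the "proceeds as if nothing happened"
    behaviour this bug complains about.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if is_deleted():
            return True
        time.sleep(interval)
    return False
```

With a stuck "Deleting" VM, is_deleted() never becomes True, so every run burns the full timeout and then carries on silently.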

Expected behaviour would be to:
a) expose this timeout somewhere, so that users can configure it in a way that suits their environment;
b) rename resources that Rally fails to delete to something meaningful (e.g. failed_remove_rally-XXXXXXX), and not retry deleting them in future runs;
c) provide a way to limit the number of such "unremovable" resources, defined in the config file, to prevent Rally from creating too many of them. If the number of renamed resources reaches the threshold, Rally should fail the task, IMO.

Thanks
pr

Chris Kirkland (bifurcationman) wrote :

We also encountered a similar situation. We have a set of scenarios which we run using a deployment with an existing tenant and users registered. We often run these scenarios multiple times overnight. Sometimes, during the testing, a number of servers are created which end up in an error state and cannot be deleted via Rally or the OpenStack CLI. Because they were created by the tenant registered with the Rally deployment, any subsequent scenario which uses that tenant and specifies Nova in its cleanup context will attempt to delete these "bad" servers; this addition to the cleanup process has often added 30 minutes or more to scenarios which take less than a minute to execute.

Is there some way in the existing Rally source or configurations to localize the cleanup process to only resources created by the current scenario? If not, I suggest that this would be a nice feature.

Best,
Chris Kirkland

Kun Huang (academicgareth) wrote :

Dear Pawel and Chris,

Firstly, a proper cleanup design is still being worked on. For now, your questions can be answered as follows:

a) expose this timeout somewhere, so that users can configure it in a way that suits their environment
I'll submit a patch for it

b) rename resources that Rally fails to delete to something meaningful (e.g. failed_remove_rally-XXXXXXX), and not retry deleting them in future runs.
The current naming style is already meaningful. Rally creates resources with a certain name prefix and deletes the resources that carry that prefix. If a first run leaves something undeleted, the next run will still find those Rally resources and delete them. If some resources created by Rally lack the proper prefix, that is a bug.
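The prefix-based cleanup described above can be illustrated with a toy matcher (the "rally_" prefix and the helper name here are assumptions for illustration, not Rally's actual code):

```python
# Toy illustration of prefix-based cleanup: a later run recognizes
# leftovers from an earlier one purely by their name prefix.
RALLY_PREFIX = "rally_"


def find_leftovers(all_names):
    """Return the resource names that carry the Rally prefix."""
    return [name for name in all_names if name.startswith(RALLY_PREFIX)]


# A subsequent task run would pick these up and retry deletion:
find_leftovers(["rally_server_01", "prod_db", "rally_net_7"])
# -> ["rally_server_01", "rally_net_7"]
```

This is also why an undeletable "stuck" resource keeps being retried on every run, which is the behaviour request (b) wants to avoid.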

c) provide a way to limit the number of "unremovable" resources, defined in the config file, to prevent Rally from creating too many of them. If the number of renamed resources reaches the threshold, Rally should fail the task, IMO.
That is more of a workaround. I think it could be incorporated into the new cleanup design rather than implemented in the current one.

Other notes:

1) new designs:
Please follow https://review.openstack.org/172967 and contribute your ideas, or join our meeting: wiki.openstack.org/wiki/Meetings/Rally#Agenda

2) example workaround for removing undeleted resources:
    source admin-token.rc
    nova list --all-tenants | grep rally
    nova delete <rally_xxxxxx or its id>

Reviewed: https://review.openstack.org/215010
Committed: https://git.openstack.org/cgit/openstack/rally/commit/?id=3ecce33d6eee7d77934acece82a8f2147f545d52
Submitter: Jenkins
Branch: master

commit 3ecce33d6eee7d77934acece82a8f2147f545d52
Author: Kun Huang <email address hidden>
Date: Thu Aug 20 16:24:44 2015 +0800

    Expose config for timeout of resource deletion

    The users who run rally in CI or periodically would like this option,
    because they could config this under different cases.

    Change-Id: I6fd70eb5565aeeb0442475439633baf2f441d174
    Partial-Bug: #1469622
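The merged patch wires the timeout through Rally's configuration machinery. As a dependency-free sketch of the same idea (the RALLY_CLEANUP_TIMEOUT variable name is an assumption for illustration, not the option the patch actually adds):

```python
import os


def deletion_timeout(default=600):
    """Return the cleanup timeout, falling back to the previous
    hardcoded default of 600 seconds.

    The real patch registers a proper config option; this stand-in
    only demonstrates the principle of making the value configurable
    per environment instead of baking in a literal.
    """
    return int(os.environ.get("RALLY_CLEANUP_TIMEOUT", default))
```

CI and periodic users can then shorten the wait for environments where deletions that take this long are known failures.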

Changed in rally:
importance: Undecided → Low
milestone: none → 0.1.0
Changed in rally:
milestone: 0.1.0 → none
Changed in rally:
status: New → Triaged
Bo Chi (bochi-michael) on 2015-11-03
Changed in rally:
status: Triaged → Fix Committed
status: Fix Committed → New
Bo Chi (bochi-michael) wrote :

Sorry, I changed the status by mistake, but I can't change it back to Triaged.

Changed in rally:
status: New → Triaged