test_create_backup test not noticing that server has become active

Bug #1257561 reported by Clark Boylan
Affects: tempest
Status: Fix Released
Importance: High
Assigned to: Matt Riedemann

Bug Description

tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_create_backup is failing with an exception:
TimeoutException: Request timed out
Details: Server 024383a6-4a63-4658-96d1-4feaaca90b65 failed to reach ACTIVE status within the required time (196 s). Current status: ACTIVE.

But if you look at the responses from nova below, the server status is ACTIVE. Is tempest not noticing that the server has come up, or are there additional checks being made? This is beyond my ability to debug right now, but I want to track it. The test that ran into it is logged at http://logs.openstack.org/03/48803/23/gate/gate-tempest-dsvm-full/b5ae916/console.html

2013-12-03 12:40:04.552 | 2013-12-03 12:20:04,756 Response Body: {"server": {"status": "ACTIVE", "updated": "2013-12-03T12:16:47Z", "hostId": "bec8169d47d9ec02476bcbb498b77ca607e39a2747ebe40f1b09e192", "addresses": {"private": [{"OS-EXT-IPS-MAC:mac_addr": "fa:16:3e:cc:a1:dd", "version": 4, "addr": "10.1.0.6", "OS-EXT-IPS:type": "fixed"}]}, "links": [{"href": "http://127.0.0.1:8774/v2/8a96efb75d39436caac8a0a3a505c620/servers/024383a6-4a63-4658-96d1-4feaaca90b65", "rel": "self"}, {"href": "http://127.0.0.1:8774/8a96efb75d39436caac8a0a3a505c620/servers/024383a6-4a63-4658-96d1-4feaaca90b65", "rel": "bookmark"}], "key_name": null, "image": {"id": "0cac589c-1e4d-4945-a3dc-9b8e1ebca13e", "links": [{"href": "http://127.0.0.1:8774/8a96efb75d39436caac8a0a3a505c620/images/0cac589c-1e4d-4945-a3dc-9b8e1ebca13e", "rel": "bookmark"}]}, "OS-EXT-STS:task_state": "image_backup", "OS-EXT-STS:vm_state": "active", "OS-SRV-USG:launched_at": "2013-12-03T12:16:40.000000", "flavor": {"id": "42", "links": [{"href": "http://127.0.0.1:8774/8a96efb75d39436caac8a0a3a505c620/flavors/42", "rel": "bookmark"}]}, "id": "024383a6-4a63-4658-96d1-4feaaca90b65", "security_groups": [{"name": "default"}], "OS-SRV-USG:terminated_at": null, "OS-EXT-AZ:availability_zone": "nova", "user_id": "026a608aa1764dacb327c37f0bdf9284", "name": "ServerActionsTestJSON-instance-tempest-975976752", "created": "2013-12-03T12:16:32Z", "tenant_id": "8a96efb75d39436caac8a0a3a505c620", "OS-DCF:diskConfig": "MANUAL", "os-extended-volumes:volumes_attached": [], "accessIPv4": "", "accessIPv6": "", "progress": 0, "OS-EXT-STS:power_state": 1, "config_drive": "", "metadata": {}}}
2013-12-03 12:40:04.552 | 2013-12-03 12:20:05,758 Request: GET http://127.0.0.1:8774/v2/8a96efb75d39436caac8a0a3a505c620/servers/024383a6-4a63-4658-96d1-4feaaca90b65
2013-12-03 12:40:04.552 | 2013-12-03 12:20:05,758 Request Headers: {'X-Auth-Token': '<Token omitted>'}
2013-12-03 12:40:04.552 | 2013-12-03 12:20:05,809 Response Status: 200
2013-12-03 12:40:04.553 | 2013-12-03 12:20:05,810 Nova request id: req-3e9f63a0-3aa3-4fd0-8cf6-5a50b09b0ca4

Tags: gate-failure
Revision history for this message
Matt Riedemann (mriedem) wrote :
Changed in tempest:
status: New → Confirmed
tags: added: gate-failure
Revision history for this message
Matt Riedemann (mriedem) wrote :

Seems to me like just a logic/timing issue in the tempest code that's waiting for the server to reach a given status:

https://github.com/openstack/tempest/blob/master/tempest/common/waiters.py#L77

It's a while loop that runs until the status is reached or we time out. The problem is that the server is probably changing status while we sleep in the loop, but we don't re-check the status before we calculate and check the timeout.
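The timing issue described above can be sketched roughly as follows. This is a simplified, hypothetical version of the wait loop, not tempest's actual waiters.py code; `get_server` is an illustrative callable standing in for the GET-server API call.

```python
import time


def wait_for_server_status(get_server, desired_status, timeout=196, interval=1):
    """Simplified sketch of a polling wait loop with the race described above.

    get_server is a hypothetical callable returning a server dict; the
    names here are illustrative, not tempest's real API.
    """
    start = time.time()
    server = get_server()
    while server['status'] != desired_status:
        time.sleep(interval)
        # The server may have reached the desired status during the sleep,
        # but because the timeout is checked before the next status fetch,
        # a slow run can time out even though the status is already correct.
        if time.time() - start >= timeout:
            raise TimeoutError(
                'Server failed to reach %s status within the required time '
                '(%d s). Current status: %s.'
                % (desired_status, timeout, server['status']))
        server = get_server()
    return server
```

Re-fetching the status once more before raising (or checking status after the sleep but before the timeout) would close that window.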

Revision history for this message
Matt Riedemann (mriedem) wrote :

Actually it might be a problem with this test hitting a window where the task_state changes quickly between the points where we check the status. From looking at the logs, the vm_state is 'active' most of the time but the task_state is constantly changing, going from 'image_backup' to 'image_pending_upload' to 'null' to not being in the response at all and then back to 'image_backup'.

If anything, we should log the current task_state here along with the current vm_state:

https://github.com/openstack/tempest/blob/master/tempest/common/waiters.py#L81

Because I think it's the task_state that we're timing out on.
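A minimal sketch of what logging the task_state alongside the vm_state could look like. This is a hypothetical helper, not the actual tempest change; the `OS-EXT-STS:task_state` key follows the nova extended-status attribute visible in the response body quoted above.

```python
def timeout_message(server_id, desired_status, timeout, server):
    """Hypothetical sketch: build a timeout error message that includes the
    task_state, so failures like "Current status: ACTIVE" are less confusing.
    """
    return ('Server %s failed to reach %s status within the required time '
            '(%d s). Current status: %s. Current task state: %s.'
            % (server_id, desired_status, timeout,
               server['status'],
               # Extended-status attribute from the nova response; may be
               # absent entirely while the state transitions.
               server.get('OS-EXT-STS:task_state')))
```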

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tempest (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/61625

Revision history for this message
Matt Riedemann (mriedem) wrote :

I see the last time this instance uuid shows up in the nova scheduler log is when the weighed host places it:

2013-12-11 23:03:29.068 INFO nova.scheduler.filter_scheduler [req-0819732f-7ccf-4f35-b370-b1ba5b06a611 ServerActionsTestJSON-tempest-2134667295-user ServerActionsTestJSON-tempest-2134667295-tenant] Choosing host WeighedHost [host: devstack-precise-hpcloud-az3-834864, weight: 6450.0] for instance a9950df7-f1ad-4805-bab4-a6ce28c9d776T

Looks like at this point the instance is still snapshotting the 2nd backup when we've hit the timeout:

2013-12-11 23:10:24.188 AUDIT nova.compute.manager [req-dbd34c67-8ff6-442b-9789-c87c33fe48d7 ServerActionsTestJSON-tempest-2134667295-user ServerActionsTestJSON-tempest-2134667295-tenant] [instance: a9950df7-f1ad-4805-bab4-a6ce28c9d776] instance snapshotting

Revision history for this message
Matt Riedemann (mriedem) wrote :

I'm not seeing anything suspicious for recent nova commits in the compute manager or libvirt driver around snapshots or creating/rotating backups, so my guess is this is just a racy test and the timeout is going to be hit sometimes if the backing host is slow to snapshot the image.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Looking at some of the other compute and image API tests in tempest related to creating images from a server instance, they use a helper method in the base class here:

https://github.com/openstack/tempest/blob/master/tempest/api/compute/base.py#L165

That waits for not only the instance to be active but also the image. When creating backups of a server instance, we're just creating snapshot images and deleting older snapshots (via the rotation parameter to the API). So the test_create_backup test for the compute API should probably also be waiting for the snapshot images to be ACTIVE before continuing.
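The proposed test flow can be sketched as follows. All of the callables here (`create_backup`, `wait_for_image`, `wait_for_server`) are hypothetical stand-ins for the tempest client and waiter methods, used only to show the ordering of the checks.

```python
def create_backup_and_wait(create_backup, wait_for_image, wait_for_server,
                           server_id, backup_name):
    """Hypothetical sketch of the proposed flow: after each backup request,
    wait for the snapshot image to be ACTIVE as well as the server.
    """
    image_id = create_backup(server_id, backup_name)
    # Waiting only on the server races with the image_backup /
    # image_pending_upload task states; the snapshot image reaching ACTIVE
    # is the better indication that the state transition is complete.
    wait_for_image(image_id, 'ACTIVE')
    wait_for_server(server_id, 'ACTIVE')
    return image_id
```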

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tempest (master)

Fix proposed to branch: master
Review: https://review.openstack.org/61647

Changed in tempest:
assignee: nobody → Matt Riedemann (mriedem)
status: Confirmed → In Progress
Sean Dague (sdague)
Changed in tempest:
importance: Undecided → Medium
importance: Medium → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tempest (master)

Reviewed: https://review.openstack.org/61625
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=629fa7cd59de102261a8e9fc6a2f303af1336056
Submitter: Jenkins
Branch: master

commit 629fa7cd59de102261a8e9fc6a2f303af1336056
Author: Matt Riedemann <email address hidden>
Date: Wed Dec 11 18:20:56 2013 -0800

    Make the wait_for_server_status timeout message a bit more clear

    The current error message when wait_for_server_status times out only
    includes the vm_state but the method, by default, is really waiting for
    a state combination of vm_state/task_state.

    Sometimes the vm_state is the desired status but the task_state is not,
    for example:

    "Server 613b9fd8-9f3f-41a6-86cc-a36301aa7fc8 failed to reach ACTIVE
    status within the required time (196 s). Current status: ACTIVE."

    This patch adds the current task_state to the error message for clarity
    when the timeout occurs.

    Related-Bug: #1257561

    Change-Id: Icfce12035bf4938eac3a13a7b640db4398ef832f

Changed in tempest:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tempest (master)

Reviewed: https://review.openstack.org/61647
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=46c1b2cd4a42285ee1a790de7ffff8facc2fb906
Submitter: Jenkins
Branch: master

commit 46c1b2cd4a42285ee1a790de7ffff8facc2fb906
Author: Matt Riedemann <email address hidden>
Date: Wed Dec 11 19:51:39 2013 -0800

    Wait for backup images to be ACTIVE in test_create_backup

    The test_create_backup test creates two backup snapshot images of the
    instance right away and waits for the instance to be ACTIVE before
    continuing each time, and before creating the third and final backup
    image.

    The problem is the instance is always ACTIVE, it's the task state that
    is transitioning on the instance while it's creating backup snapshot
    instances and checking for old rotations to delete.

    The test needs to not only check that the instance is ACTIVE before
    creating each new backup but also that the snapshot image backups
    themselves are ACTIVE which is a better indication that the instance's
    state transition is complete.

    Closes-Bug: #1257561

    Change-Id: I5b423b3424e1dfe6cf5f8fef0dc9538961c1f28d
