tempest.api.compute.servers.test_server_rescue.ServerRescueTestXML setUpClass times-out on attaching volume

Bug #1254772 reported by Julien Danjou
Affects: tempest
Status: Fix Released
Importance: Undecided
Assigned to: Matt Riedemann

Bug Description

2013-11-25 15:42:45.769 | ======================================================================
2013-11-25 15:42:45.770 | FAIL: setUpClass (tempest.api.compute.servers.test_server_rescue.ServerRescueTestXML)
2013-11-25 15:42:45.770 | setUpClass (tempest.api.compute.servers.test_server_rescue.ServerRescueTestXML)
2013-11-25 15:42:45.770 | ----------------------------------------------------------------------
2013-11-25 15:42:45.770 | _StringException: Traceback (most recent call last):
2013-11-25 15:42:45.770 | File "tempest/api/compute/servers/test_server_rescue.py", line 51, in setUpClass
2013-11-25 15:42:45.770 | cls.volume_to_attach['id'], 'available')
2013-11-25 15:42:45.771 | File "tempest/services/compute/xml/volumes_extensions_client.py", line 140, in wait_for_volume_status
2013-11-25 15:42:45.771 | raise exceptions.TimeoutException(message)
2013-11-25 15:42:45.771 | TimeoutException: Request timed out
2013-11-25 15:42:45.771 | Details: Volume test_attach failed to reach available status within the required time (196 s).
2013-11-25 15:42:45.771 |

http://logs.openstack.org/77/56577/9/check/check-tempest-devstack-vm-postgres-full/f5fe3ff/console.html

Revision history for this message
David Kranz (david-kranz) wrote :

The "model server went away" error showed up 11 times in the last two weeks, the last one being on Dec. 3. That sample size is too small for me to close this at this time.

This shows up in n-cpu:

2013-11-25 15:24:22.099 21076 ERROR nova.servicegroup.drivers.db [-] model server went away
2013-11-25 15:24:32.814 ERROR nova.compute.manager [req-ecacaa21-3f07-4b44-9896-8b5bd2238a19 ServersTestManualDisk-tempest-1962756300-user ServersTestManualDisk-tempest-1962756300-tenant] [instance: 1f872097-8ad8-44f8-ba03-89a14115efe0] Failed to deallocate network for instance.
2013-11-25 15:25:32.855 21076 ERROR root [-] Original exception being dropped: ['Traceback (most recent call last):\n', ' File "/opt/stack/new/nova/nova/compute/manager.py", line 1809, in _try_deallocate_network\n self._deallocate_network(context, instance, requested_networks)\n', ' File "/opt/stack/new/nova/nova/compute/manager.py", line 1491, in _deallocate_network\n context, instance, requested_networks=requested_networks)\n', ' File "/opt/stack/new/nova/nova/network/api.py", line 93, in wrapped\n return func(self, context, *args, **kwargs)\n', ' File "/opt/stack/new/nova/nova/network/api.py", line 318, in deallocate_for_instance\n self.network_rpcapi.deallocate_for_instance(context, **args)\n', ' File "/opt/stack/new/nova/nova/network/rpcapi.py", line 199, in deallocate_for_instance\n host=host, requested_networks=requested_networks)\n', ' File "/opt/stack/new/nova/nova/rpcclient.py", line 85, in call\n return self._invoke(self.proxy.call, ctxt, method, **kwargs)\n', ' File "/opt/stack/new/nova/nova/rpcclient.py", line 63, in _invoke\n return cast_or_call(ctxt, msg, **self.kwargs)\n', ' File "/opt/stack/new/nova/nova/openstack/common/rpc/proxy.py", line 130, in call\n exc.info, real_topic, msg.get(\'method\'))\n', 'Timeout: Timeout while waiting on RPC response - topic: "network", RPC method: "deallocate_for_instance" info: "<unknown>"\n']
2013-11-25 15:25:38.371 21076 ERROR nova.openstack.common.periodic_task [-] Error during ComputeManager.update_available_resource: Timeout while waiting on RPC response - topic: "conductor", RPC method: "compute_node_update" info: "<unknown>"
2013-11-25 15:26:32.903 21076 ERROR root [-] Original exception being dropped: ['Traceback (most recent call last):\n', ' File "/opt/stack/new/nova/nova/compute/manager.py", line 1919, in _delete_instance\n self._shutdown_instance(context, db_inst, bdms)\n', ' File "/opt/stack/new/nova/nova/compute/manager.py", line 1854, in _shutdown_instance\n self._try_deallocate_network(context, instance, requested_networks)\n', ' File "/opt/stack/new/nova/nova/compute/manager.py", line 1814, in _try_deallocate_network\n self._set_instance_error_state(context, instance[\'uuid\'])\n', ' File "/opt/stack/new/nova/nova/compute/manager.py", line 484, in _set_instance_error_state\n vm_state=vm_states.ERROR)\n', ' File "/opt/stack/new/nova/nova/compute/manager.py", line 473, in _instance_update\n **kwargs)\n', ' File "/opt/stack/new/nova/nova/conductor/api.py", line 389, in instance_update\n updates, \'conductor\')\n', ' File "/opt/stack/new/nova/nova/conductor/rpcapi.py", li...


Changed in tempest:
status: New → Invalid
Revision history for this message
Matt Riedemann (mriedem) wrote :

I'm not getting any hits for this in logstash:

message:"ERROR nova.servicegroup.drivers.db [-] Recovered model server connection!" AND filename:"logs/screen-n-cpu.txt"

Or this:

message:"ERROR nova.servicegroup.drivers.db [-] model server went away" AND filename:"logs/screen-n-cpu.txt"

I am getting 15 hits in the last 2 weeks for this though:

message:"model server went away" AND filename:"logs/screen-n-cpu.txt"

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwibW9kZWwgc2VydmVyIHdlbnQgYXdheVwiIEFORCBmaWxlbmFtZTpcImxvZ3Mvc2NyZWVuLW4tY3B1LnR4dFwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiJhbGwiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzkwODQ1NDc5NTUxfQ==

However, it also shows up on successful test runs.

Revision history for this message
Matt Riedemann (mriedem) wrote :

The "model server went away" race is also tracked in bug 1270470, so we should leave that issue to that bug; I'm not convinced it has anything to do with this one.

We do have hits on this though:

message:"Volume test_attach failed to reach available status within the required time" AND filename:"console.html"

10 hits in the last 2 weeks, none in the gate:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiVm9sdW1lIHRlc3RfYXR0YWNoIGZhaWxlZCB0byByZWFjaCBhdmFpbGFibGUgc3RhdHVzIHdpdGhpbiB0aGUgcmVxdWlyZWQgdGltZVwiIEFORCBmaWxlbmFtZTpcImNvbnNvbGUuaHRtbFwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiJhbGwiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzkwODQ1OTA4MTUxfQ==

tags: added: testing volumes
Revision history for this message
Matt Riedemann (mriedem) wrote :

I also see this show up twice in the last 7 days:

message:"Volume test_detach failed to reach available status within the required time" AND filename:"console.html"

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiVm9sdW1lIHRlc3RfZGV0YWNoIGZhaWxlZCB0byByZWFjaCBhdmFpbGFibGUgc3RhdHVzIHdpdGhpbiB0aGUgcmVxdWlyZWQgdGltZVwiIEFORCBmaWxlbmFtZTpcImNvbnNvbGUuaHRtbFwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzkwODQ2MjEwNjM1fQ==

test_server_rescue sets up two volumes, one for attach and one for detach. My guess is that some parallel setup is making the volumes step on each other in cinder, which slows the test setup down to the point where it can sometimes fail.

So I think it might be valid to keep this as a tempest test failure. Maybe the volumes should be uniquely named in the tempest setup when it runs this?
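
For illustration only, a minimal sketch of what uniquely named volumes could look like in the class setup, assuming tempest's data_utils.rand_name helper; the import path and variable names here are assumptions, not the actual change:

    from tempest.common.utils import data_utils

    # In setUpClass: give each test class its own volume names so
    # parallel workers don't collide on identically named volumes
    # in cinder.
    vol_attach_name = data_utils.rand_name(cls.__name__ + '-vol-attach')
    vol_detach_name = data_utils.rand_name(cls.__name__ + '-vol-detach')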

Revision history for this message
Matt Riedemann (mriedem) wrote :

Looked this over with sdague. We're thinking the problem is that test_server_rescue does two volume creates in the setUpClass method and then, on tear down, doesn't wait for the volumes to be deleted; it just moves on. With tests running in parallel, this can cause cinder to back up, and then we get racy failures on volume creation.

Also, only two of the seven test cases actually use the volumes, and those cases are isolated, so there's no need for class-level volumes at all: we can create a volume in the test cases that need one and clean it up with an inline addCleanup call.
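
As a rough sketch of that approach, not the actual patch: the client and method names below are assumptions based on the volumes extensions client seen in the traceback above, and _delete_volume is the base-class helper the eventual fix describes.

    # Fragment of a hypothetical test method; data_utils is
    # tempest.common.utils.data_utils as in the naming sketch above.
    def test_rescued_vm_attach_volume(self):
        # Create the volume in the test itself instead of setUpClass,
        # so nothing is shared across tests.
        vol_name = data_utils.rand_name(self.__class__.__name__ + '-volume')
        resp, volume = self.volumes_extensions_client.create_volume(
            1, display_name=vol_name)
        # Make sure the volume is deleted, and fully gone, even if the
        # test fails part way through.
        self.addCleanup(self._delete_volume,
                        self.volumes_extensions_client, volume['id'])
        self.volumes_extensions_client.wait_for_volume_status(
            volume['id'], 'available')
        # ... rescue the server and exercise attach/detach here ...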

no longer affects: nova
no longer affects: cinder
Changed in tempest:
status: Invalid → Triaged
assignee: nobody → Matt Riedemann (mriedem)
Revision history for this message
Matt Riedemann (mriedem) wrote :

Here is a good query for both setup failures:

(message:"Volume test_attach" OR message:"Volume test_detach") AND message:"failed to reach available status within the required time" AND filename:"console.html"

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tempest (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/69454

Changed in tempest:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tempest (master)

Fix proposed to branch: master
Review: https://review.openstack.org/69455

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tempest (master)

Reviewed: https://review.openstack.org/69455
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=807d0561765fde17682b0d7dffe7bf82315b7f43
Submitter: Jenkins
Branch: master

commit 807d0561765fde17682b0d7dffe7bf82315b7f43
Author: Matt Riedemann <email address hidden>
Date: Mon Jan 27 12:03:10 2014 -0800

    Remove unnecessary volume creation in test_server_rescue

    The setup in test_server_rescue is creating two volumes where really
    only one shared volume is necessary, so remove the second volume. This
    patch also uses the common base class code to wait for the volume
    deletion on tear down and makes the volume name random within the
    context of the class name for easier debugging.

    This patch moves the volume create/delete into the specific test cases
    that need them, uses the common base class code to delete the volume and
    wait for it to be gone, and also moves the _delete_volume method from
    the v2 compute base test class into its parent so the v3 compute base
    test case can also use it with its own volumes client.

    Partial-Bug: #1254772

    Change-Id: Id98f65cc2843a97221222e86c60b3859e15af869
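
The "wait for the volume deletion on tear down" piece described in the commit above boils down to polling until the volume is actually gone. A rough sketch of such a helper, assuming the volumes client exposes get_volume/delete_volume and raises a NotFound exception once the volume has been removed; these names are assumptions, not the exact tempest code:

    # Module-level imports assumed for this class fragment.
    import time

    from tempest import exceptions

    @classmethod
    def _delete_volume(cls, volumes_client, volume_id, timeout=196):
        # Delete the volume, then block until cinder no longer knows
        # about it, so parallel runs don't pile up half-deleted volumes.
        # The 196 s default mirrors the timeout seen in the failure above.
        volumes_client.delete_volume(volume_id)
        start = time.time()
        while time.time() - start < timeout:
            try:
                volumes_client.get_volume(volume_id)
            except exceptions.NotFound:
                return
            time.sleep(1)
        raise exceptions.TimeoutException(
            'Volume %s failed to be deleted within %d s' % (volume_id, timeout))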

Revision history for this message
Joe Gordon (jogo) wrote :

No hits in elastic-recheck, marking as resolved.

Changed in tempest:
status: In Progress → Fix Committed
Revision history for this message
Matt Riedemann (mriedem) wrote :

I probably busted the e-r query when I made this change, since the volumes are no longer named test_attach/test_detach for the fingerprint to match:

https://review.openstack.org/69455

Revision history for this message
Matt Riedemann (mriedem) wrote :
Sanju Abraham (asanju)
Changed in tempest:
status: Fix Committed → Fix Released