tempest.api.compute.servers.test_server_rescue.ServerRescueTestXML setUpClass times-out on attaching volume

Bug #1254772 reported by Julien Danjou
Affects: tempest
Status: Fix Released
Importance: Undecided
Assigned to: Matt Riedemann

Bug Description

2013-11-25 15:42:45.769 | ======================================================================
2013-11-25 15:42:45.770 | FAIL: setUpClass (tempest.api.compute.servers.test_server_rescue.ServerRescueTestXML)
2013-11-25 15:42:45.770 | setUpClass (tempest.api.compute.servers.test_server_rescue.ServerRescueTestXML)
2013-11-25 15:42:45.770 | ----------------------------------------------------------------------
2013-11-25 15:42:45.770 | _StringException: Traceback (most recent call last):
2013-11-25 15:42:45.770 | File "tempest/api/compute/servers/test_server_rescue.py", line 51, in setUpClass
2013-11-25 15:42:45.770 | cls.volume_to_attach['id'], 'available')
2013-11-25 15:42:45.771 | File "tempest/services/compute/xml/volumes_extensions_client.py", line 140, in wait_for_volume_status
2013-11-25 15:42:45.771 | raise exceptions.TimeoutException(message)
2013-11-25 15:42:45.771 | TimeoutException: Request timed out
2013-11-25 15:42:45.771 | Details: Volume test_attach failed to reach available status within the required time (196 s).
2013-11-25 15:42:45.771 |

http://logs.openstack.org/77/56577/9/check/check-tempest-devstack-vm-postgres-full/f5fe3ff/console.html

Revision history for this message
David Kranz (david-kranz) wrote :

The "model server went away" error showed up 11 times in the last two weeks, the last one being on Dec. 3. That sample size is too small for me to close this at this time.

This shows up in n-cpu:

2013-11-25 15:24:22.099 21076 ERROR nova.servicegroup.drivers.db [-] model server went away
2013-11-25 15:24:32.814 ERROR nova.compute.manager [req-ecacaa21-3f07-4b44-9896-8b5bd2238a19 ServersTestManualDisk-tempest-1962756300-user ServersTestManualDisk-tempest-1962756300-tenant] [instance: 1f872097-8ad8-44f8-ba03-89a14115efe0] Failed to deallocate network for instance.
2013-11-25 15:25:32.855 21076 ERROR root [-] Original exception being dropped: ['Traceback (most recent call last):\n', ' File "/opt/stack/new/nova/nova/compute/manager.py", line 1809, in _try_deallocate_network\n self._deallocate_network(context, instance, requested_networks)\n', ' File "/opt/stack/new/nova/nova/compute/manager.py", line 1491, in _deallocate_network\n context, instance, requested_networks=requested_networks)\n', ' File "/opt/stack/new/nova/nova/network/api.py", line 93, in wrapped\n return func(self, context, *args, **kwargs)\n', ' File "/opt/stack/new/nova/nova/network/api.py", line 318, in deallocate_for_instance\n self.network_rpcapi.deallocate_for_instance(context, **args)\n', ' File "/opt/stack/new/nova/nova/network/rpcapi.py", line 199, in deallocate_for_instance\n host=host, requested_networks=requested_networks)\n', ' File "/opt/stack/new/nova/nova/rpcclient.py", line 85, in call\n return self._invoke(self.proxy.call, ctxt, method, **kwargs)\n', ' File "/opt/stack/new/nova/nova/rpcclient.py", line 63, in _invoke\n return cast_or_call(ctxt, msg, **self.kwargs)\n', ' File "/opt/stack/new/nova/nova/openstack/common/rpc/proxy.py", line 130, in call\n exc.info, real_topic, msg.get(\'method\'))\n', 'Timeout: Timeout while waiting on RPC response - topic: "network", RPC method: "deallocate_for_instance" info: "<unknown>"\n']
2013-11-25 15:25:38.371 21076 ERROR nova.openstack.common.periodic_task [-] Error during ComputeManager.update_available_resource: Timeout while waiting on RPC response - topic: "conductor", RPC method: "compute_node_update" info: "<unknown>"
2013-11-25 15:26:32.903 21076 ERROR root [-] Original exception being dropped: ['Traceback (most recent call last):\n', ' File "/opt/stack/new/nova/nova/compute/manager.py", line 1919, in _delete_instance\n self._shutdown_instance(context, db_inst, bdms)\n', ' File "/opt/stack/new/nova/nova/compute/manager.py", line 1854, in _shutdown_instance\n self._try_deallocate_network(context, instance, requested_networks)\n', ' File "/opt/stack/new/nova/nova/compute/manager.py", line 1814, in _try_deallocate_network\n self._set_instance_error_state(context, instance[\'uuid\'])\n', ' File "/opt/stack/new/nova/nova/compute/manager.py", line 484, in _set_instance_error_state\n vm_state=vm_states.ERROR)\n', ' File "/opt/stack/new/nova/nova/compute/manager.py", line 473, in _instance_update\n **kwargs)\n', ' File "/opt/stack/new/nova/nova/conductor/api.py", line 389, in instance_update\n updates, \'conductor\')\n', ' File "/opt/stack/new/nova/nova/conductor/rpcapi.py", li...


Changed in tempest:
status: New → Invalid
Revision history for this message
Matt Riedemann (mriedem) wrote :

I'm not getting any hits for this in logstash:

message:"ERROR nova.servicegroup.drivers.db [-] Recovered model server connection!" AND filename:"logs/screen-n-cpu.txt"

Or this:

message:"ERROR nova.servicegroup.drivers.db [-] model server went away" AND filename:"logs/screen-n-cpu.txt"

I am getting 15 hits in the last 2 weeks for this though:

message:"model server went away" AND filename:"logs/screen-n-cpu.txt"

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwibW9kZWwgc2VydmVyIHdlbnQgYXdheVwiIEFORCBmaWxlbmFtZTpcImxvZ3Mvc2NyZWVuLW4tY3B1LnR4dFwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiJhbGwiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzkwODQ1NDc5NTUxfQ==

However, it also shows up on successful test runs.

Revision history for this message
Matt Riedemann (mriedem) wrote :

The "model server went away" race is also tracked in bug 1270470, so we should leave that issue to that bug; I'm not convinced it has anything to do with this one.

We do have hits on this though:

message:"Volume test_attach failed to reach available status within the required time" AND filename:"console.html"

10 hits in the last 2 weeks, none in the gate:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiVm9sdW1lIHRlc3RfYXR0YWNoIGZhaWxlZCB0byByZWFjaCBhdmFpbGFibGUgc3RhdHVzIHdpdGhpbiB0aGUgcmVxdWlyZWQgdGltZVwiIEFORCBmaWxlbmFtZTpcImNvbnNvbGUuaHRtbFwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiJhbGwiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzkwODQ1OTA4MTUxfQ==

tags: added: testing volumes
Revision history for this message
Matt Riedemann (mriedem) wrote :

I also see this show up twice in the last 7 days:

message:"Volume test_detach failed to reach available status within the required time" AND filename:"console.html"

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiVm9sdW1lIHRlc3RfZGV0YWNoIGZhaWxlZCB0byByZWFjaCBhdmFpbGFibGUgc3RhdHVzIHdpdGhpbiB0aGUgcmVxdWlyZWQgdGltZVwiIEFORCBmaWxlbmFtZTpcImNvbnNvbGUuaHRtbFwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzkwODQ2MjEwNjM1fQ==

test_server_rescue sets up two volumes, one for attach and one for detach. My guess is that some parallel setup is making the volumes step on each other in cinder, which slows the test setup down to the point where it can sometimes fail.

So I think it might be valid to keep this as a tempest test failure. Maybe the volumes should be uniquely named in the tempest setup when it runs this?
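
For illustration only, a minimal sketch of what uniquely named volumes could look like in the class setup, assuming tempest's data_utils.rand_name helper; the import path and variable names here are assumptions, not the actual change:

    from tempest.common.utils import data_utils

    # In setUpClass: give each test class its own volume names so
    # parallel workers don't collide on identically named volumes
    # in cinder.
    vol_attach_name = data_utils.rand_name(cls.__name__ + '-vol-attach')
    vol_detach_name = data_utils.rand_name(cls.__name__ + '-vol-detach')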

Revision history for this message
Matt Riedemann (mriedem) wrote :

Looked this over with sdague. We're thinking the problem is that test_server_rescue does two volume creates in the setUpClass method and then, on tear down, doesn't wait for the volumes to be deleted; it just moves on. With tests running in parallel, this can cause cinder to back up, and then we get racy failures on volume creation.

Also, only two of the seven test cases actually use the volumes, and those cases are isolated, so there's no need for class-level volumes at all: we can create a volume in the test cases that need one and clean it up with an inline addCleanup call.
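
As a rough sketch of that approach, not the actual patch: the client and method names below are assumptions based on the volumes extensions client seen in the traceback above, and _delete_volume is the base-class helper the eventual fix describes.

    # Fragment of a hypothetical test method; data_utils is
    # tempest.common.utils.data_utils as in the naming sketch above.
    def test_rescued_vm_attach_volume(self):
        # Create the volume in the test itself instead of setUpClass,
        # so nothing is shared across tests.
        vol_name = data_utils.rand_name(self.__class__.__name__ + '-volume')
        resp, volume = self.volumes_extensions_client.create_volume(
            1, display_name=vol_name)
        # Make sure the volume is deleted, and fully gone, even if the
        # test fails part way through.
        self.addCleanup(self._delete_volume,
                        self.volumes_extensions_client, volume['id'])
        self.volumes_extensions_client.wait_for_volume_status(
            volume['id'], 'available')
        # ... rescue the server and exercise attach/detach here ...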

no longer affects: nova
no longer affects: cinder
Changed in tempest:
status: Invalid → Triaged
assignee: nobody → Matt Riedemann (mriedem)
Revision history for this message
Matt Riedemann (mriedem) wrote :

Here is a good query for both setup failures:

(message:"Volume test_attach" OR message:"Volume test_detach") AND message:"failed to reach available status within the required time" AND filename:"console.html"

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tempest (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/69454

Changed in tempest:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tempest (master)

Fix proposed to branch: master
Review: https://review.openstack.org/69455

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tempest (master)

Reviewed: https://review.openstack.org/69455
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=807d0561765fde17682b0d7dffe7bf82315b7f43
Submitter: Jenkins
Branch: master

commit 807d0561765fde17682b0d7dffe7bf82315b7f43
Author: Matt Riedemann <email address hidden>
Date: Mon Jan 27 12:03:10 2014 -0800

    Remove unnecessary volume creation in test_server_rescue

    The setup in test_server_rescue is creating two volumes where really
    only one shared volume is necessary, so remove the second volume. This
    patch also uses the common base class code to wait for the volume
    deletion on tear down and makes the volume name random within the
    context of the class name for easier debugging.

    This patch moves the volume create/delete into the specific test cases
    that need them, uses the common base class code to delete the volume and
    wait for it to be gone, and also moves the _delete_volume method from
    the v2 compute base test class into its parent so the v3 compute base
    test case can also use it with its own volumes client.

    Partial-Bug: #1254772

    Change-Id: Id98f65cc2843a97221222e86c60b3859e15af869
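
The "wait for the volume deletion on tear down" piece described in the commit above boils down to polling until the volume is actually gone. A rough sketch of such a helper, assuming the volumes client exposes get_volume/delete_volume and raises a NotFound exception once the volume has been removed; these names are assumptions, not the exact tempest code:

    # Module-level imports assumed for this class fragment.
    import time

    from tempest import exceptions

    @classmethod
    def _delete_volume(cls, volumes_client, volume_id, timeout=196):
        # Delete the volume, then block until cinder no longer knows
        # about it, so parallel runs don't pile up half-deleted volumes.
        # The 196 s default mirrors the timeout seen in the failure above.
        volumes_client.delete_volume(volume_id)
        start = time.time()
        while time.time() - start < timeout:
            try:
                volumes_client.get_volume(volume_id)
            except exceptions.NotFound:
                return
            time.sleep(1)
        raise exceptions.TimeoutException(
            'Volume %s failed to be deleted within %d s' % (volume_id, timeout))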

Revision history for this message
Joe Gordon (jogo) wrote :

No hits in elastic-recheck, marking as resolved.

Changed in tempest:
status: In Progress → Fix Committed
Revision history for this message
Matt Riedemann (mriedem) wrote :

I probably busted the e-r query when I made this change, since the volumes are no longer named test_attach/test_detach for the fingerprint to match:

https://review.openstack.org/69455

Revision history for this message
Matt Riedemann (mriedem) wrote :
Sanju Abraham (asanju)
Changed in tempest:
status: Fix Committed → Fix Released