Flaky failures of instances to reach BUILD and ACTIVE states

Bug #1079687 reported by David Kranz
This bug affects 1 person
Affects                   Status        Importance  Assigned to
OpenStack Compute (nova)  Invalid       High        Sean Dague
tempest                   Fix Released  Critical    Sean Dague

Bug Description

This has been happening in the hourly tempest runs. Here are some recent examples:

https://jenkins.openstack.org/job/periodic-tempest-devstack-vm-check-hourly/77/console
03:44:06 ERROR: Specify a keypair while creating a server
03:44:06 ----------------------------------------------------------------------
03:44:06 Traceback (most recent call last):
03:44:06 File "/opt/stack/tempest/tempest/tests/compute/servers/test_servers.py", line 86, in test_create_specify_keypair
03:44:06 self.client.wait_for_server_status(server['id'], 'ACTIVE')
03:44:06 File "/opt/stack/tempest/tempest/services/compute/xml/servers_client.py", line 195, in wait_for_server_status
03:44:06 raise exceptions.BuildErrorException(server_id=server_id)
03:44:06 BuildErrorException: Server 809a992b-9e34-4e14-9ead-3b536355ff71 failed to build and is in ERROR status
03:44:06

https://jenkins.openstack.org/job/periodic-tempest-devstack-vm-check-hourly/78/console
04:43:26 ERROR: Delete a server while it's VM state is Building
04:43:26 ----------------------------------------------------------------------
04:43:26 Traceback (most recent call last):
04:43:26 File "/opt/stack/tempest/tempest/tests/compute/servers/test_servers.py", line 148, in test_delete_server_while_in_building_state
04:43:26 self.client.wait_for_server_status(server['id'], 'BUILD')
04:43:26 File "/opt/stack/tempest/tempest/services/compute/xml/servers_client.py", line 195, in wait_for_server_status
04:43:26 raise exceptions.BuildErrorException(server_id=server_id)
04:43:26 BuildErrorException: Server 8b152f72-4f69-406f-a45a-a40490013bae failed to build and is in ERROR status
04:43:26
04:43:26 ======================================================================
04:43:26 ERROR: The server's access addresses should reflect the provided values
04:43:26 ----------------------------------------------------------------------
04:43:26 Traceback (most recent call last):
04:43:26 File "/opt/stack/tempest/tempest/tests/compute/servers/test_servers.py", line 125, in test_update_access_server_address
04:43:26 self.client.wait_for_server_status(server['id'], 'ACTIVE')
04:43:26 File "/opt/stack/tempest/tempest/services/compute/xml/servers_client.py", line 195, in wait_for_server_status
04:43:26 raise exceptions.BuildErrorException(server_id=server_id)
04:43:26 BuildErrorException: Server b708f30f-69d3-463d-98df-099b68acc308 failed to build and is in ERROR status
04:43:26
04:43:26 ======================================================================
04:43:26 ERROR: The server name should be changed to the the provided value
04:43:26 ----------------------------------------------------------------------
04:43:26 Traceback (most recent call last):
04:43:26 File "/opt/stack/tempest/tempest/tests/compute/servers/test_servers.py", line 100, in test_update_server_name
04:43:26 self.client.wait_for_server_status(server['id'], 'ACTIVE')
04:43:26 File "/opt/stack/tempest/tempest/services/compute/xml/servers_client.py", line 195, in wait_for_server_status
04:43:26 raise exceptions.BuildErrorException(server_id=server_id)
04:43:26 BuildErrorException: Server 60d60b56-6d87-48ef-aed3-2a57c30b014f failed to build and is in ERROR status
04:4

https://jenkins.openstack.org/job/periodic-tempest-devstack-vm-check-hourly/85/console
11:43:18 ERROR: Specify a keypair while creating a server
11:43:18 ----------------------------------------------------------------------
11:43:18 Traceback (most recent call last):
11:43:18 File "/opt/stack/tempest/tempest/tests/compute/servers/test_servers.py", line 86, in test_create_specify_keypair
11:43:18 self.client.wait_for_server_status(server['id'], 'ACTIVE')
11:43:18 File "/opt/stack/tempest/tempest/services/compute/xml/servers_client.py", line 195, in wait_for_server_status
11:43:18 raise exceptions.BuildErrorException(server_id=server_id)
11:43:18 BuildErrorException: Server 646858c1-109c-42bf-921f-d58281686ea7 failed to build and is in ERROR status
11:43:18
11:43:18 ======================================================================
11:43:18 ERROR: Creating a server with a name that already exists is allowed
11:43:18 ----------------------------------------------------------------------
11:43:18 Traceback (most recent call last):
11:43:18 File "/opt/stack/tempest/tempest/tests/compute/servers/test_servers.py", line 68, in test_create_with_existing_server_name
11:43:18 for server_id in (id1, id2):
11:43:18 UnboundLocalError: local variable 'id1' referenced before assignment
11:43:18
11:43:18 ======================================================================
11:43:18 ERROR: Delete a server while it's VM state is Building
11:43:18 ----------------------------------------------------------------------
11:43:18 Traceback (most recent call last):
11:43:18 File "/opt/stack/tempest/tempest/tests/compute/servers/test_servers.py", line 148, in test_delete_server_while_in_building_state
11:43:18 self.client.wait_for_server_status(server['id'], 'BUILD')
11:43:18 File "/opt/stack/tempest/tempest/services/compute/xml/servers_client.py", line 195, in wait_for_server_status
11:43:18 raise exceptions.BuildErrorException(server_id=server_id)
11:43:18 BuildErrorException: Server d3bc54f7-8914-4156-abba-600bc54d2550 failed to build and is in ERROR status
11:43:18
11:43:18 ======================================================================
11:43:18 ERROR: The server's access addresses should reflect the provided values
11:43:18 ----------------------------------------------------------------------
11:43:18 Traceback (most recent call last):
11:43:18 File "/opt/stack/tempest/tempest/tests/compute/servers/test_servers.py", line 125, in test_update_access_server_address
11:43:18 self.client.wait_for_server_status(server['id'], 'ACTIVE')
11:43:18 File "/opt/stack/tempest/tempest/services/compute/xml/servers_client.py", line 195, in wait_for_server_status
11:43:18 raise exceptions.BuildErrorException(server_id=server_id)
11:43:18 BuildErrorException: Server 245309e8-1645-4331-a9d0-d0ed9e8028bb failed to build and is in ERROR status
11:43:18
11:43:18 ======================================================================
11:43:18 ERROR: The server name should be changed to the the provided value
11:43:18 ----------------------------------------------------------------------
11:43:18 Traceback (most recent call last):
11:43:18 File "/opt/stack/tempest/tempest/tests/compute/servers/test_servers.py", line 100, in test_update_server_name
11:43:18 self.client.wait_for_server_status(server['id'], 'ACTIVE')
11:43:18 File "/opt/stack/tempest/tempest/services/compute/xml/servers_client.py", line 195, in wait_for_server_status
11:43:18 raise exceptions.BuildErrorException(server_id=server_id)
11:43:18 BuildErrorException: Server 4db93eed-365e-4721-9f5b-09c10b745c3f failed to build and is in ERROR status
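One failure in run 85 above is different in kind: the UnboundLocalError is a bug in the test itself rather than a nova failure. If server creation raises before the ids are assigned, the cleanup code references names that were never bound, which masks the real error. A minimal sketch of the pattern and the obvious fix (FakeClient and both test bodies are hypothetical simplifications, not the actual tempest code):

```python
class FakeClient:
    """Hypothetical stand-in for the compute test client."""
    def create_server(self, name):
        raise RuntimeError('build failed')  # simulate the flaky build failure
    def delete_server(self, server_id):
        pass


def buggy_test(client):
    # If create_server() raises, id1/id2 are never bound, so the cleanup
    # loop raises UnboundLocalError and hides the real failure.
    try:
        id1 = client.create_server('dup')['id']
        id2 = client.create_server('dup')['id']
    finally:
        for server_id in (id1, id2):
            client.delete_server(server_id)


def fixed_test(client):
    # Bind the names up front and only clean up what was actually created,
    # so the original failure propagates intact.
    id1 = id2 = None
    try:
        id1 = client.create_server('dup')['id']
        id2 = client.create_server('dup')['id']
    finally:
        for server_id in (id1, id2):
            if server_id is not None:
                client.delete_server(server_id)
```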

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Any way to get the server-side logs when this problem occurs?

Revision history for this message
David Kranz (david-kranz) wrote :
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Looks like all the logs get truncated just before we get the ERROR back.

For example, in run 96 the ERROR happens at 22:43:03 (https://jenkins.openstack.org/job/periodic-tempest-devstack-vm-check-hourly/96/consoleFull)

The last entry in n-api.txt is 22:41:39 and n-cpu stops at 22:42:44

Is some kind of timeout killing all the process(es)? Even the kern_log.txt stops at 22:42:43

Revision history for this message
David Kranz (david-kranz) wrote :

This happened three or four times today.

Revision history for this message
David Kranz (david-kranz) wrote :

I did an analysis of the last 200+ hourly runs. 40 of them failed and almost all of the errors are from /tempest/tests/compute/servers/test_servers.py. In these failures the instance reaches ERROR state quickly. This is not a timeout issue. There is also an intermittent failure in exercises where an instance fails to create in the 60 seconds allowed. Here is an example:

http://logs.openstack.org/16591/1/gate/gate-tempest-devstack-vm/19391/console.html

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

David,

That 16591 failure seems to be RPC timeout related:

http://logs.openstack.org/16591/1/gate/gate-tempest-devstack-vm/19391/logs/screen-n-cpu.txt.gz

-- dims

2012-11-20 23:16:14 21855 DEBUG nova.manager [-] Running periodic task ComputeManager._poll_rebooting_instances periodic_tasks /opt/stack/nova/nova/manager.py:171
2012-11-20 23:16:14 21855 DEBUG nova.manager [-] Skipping ComputeManager._cleanup_running_deleted_instances, 28 ticks left until next run periodic_tasks /opt/stack/nova/nova/manager.py:166
2012-11-20 23:16:14 21855 DEBUG nova.manager [-] Running periodic task ComputeManager._check_instance_build_time periodic_tasks /opt/stack/nova/nova/manager.py:171
2012-11-20 23:16:14 21855 DEBUG nova.manager [-] Running periodic task ComputeManager._heal_instance_info_cache periodic_tasks /opt/stack/nova/nova/manager.py:171
2012-11-20 23:16:14 21855 DEBUG nova.openstack.common.rpc.amqp [-] Making asynchronous call on network ... multicall /opt/stack/nova/nova/openstack/common/rpc/amqp.py:352
2012-11-20 23:16:14 21855 DEBUG nova.openstack.common.rpc.amqp [-] MSG_ID is e0b7dcdbf50f4ef99c907163773d8281 multicall /opt/stack/nova/nova/openstack/common/rpc/amqp.py:355
2012-11-20 23:16:39 21855 ERROR nova.openstack.common.rpc.common [-] Timed out waiting for RPC response: timed out
2012-11-20 23:16:39 21855 TRACE nova.openstack.common.rpc.common Traceback (most recent call last):
2012-11-20 23:16:39 21855 TRACE nova.openstack.common.rpc.common File "/opt/stack/nova/nova/openstack/common/rpc/impl_kombu.py", line 552, in ensure
2012-11-20 23:16:39 21855 TRACE nova.openstack.common.rpc.common return method(*args, **kwargs)
2012-11-20 23:16:39 21855 TRACE nova.openstack.common.rpc.common File "/opt/stack/nova/nova/openstack/common/rpc/impl_kombu.py", line 630, in _consume
2012-11-20 23:16:39 21855 TRACE nova.openstack.common.rpc.common return self.connection.drain_events(timeout=timeout)
2012-11-20 23:16:39 21855 TRACE nova.openstack.common.rpc.common File "/usr/local/lib/python2.7/dist-packages/kombu/connection.py", line 110, in drain_events
2012-11-20 23:16:39 21855 TRACE nova.openstack.common.rpc.common return self.transport.drain_events(self.connection, **kwargs)
2012-11-20 23:16:39 21855 TRACE nova.openstack.common.rpc.common File "/usr/local/lib/python2.7/dist-packages/kombu/transport/pyamqplib.py", line 200, in drain_events
2012-11-20 23:16:39 21855 TRACE nova.openstack.common.rpc.common return connection.drain_events(**kwargs)
2012-11-20 23:16:39 21855 TRACE nova.openstack.common.rpc.common File "/usr/local/lib/python2.7/dist-packages/kombu/transport/pyamqplib.py", line 50, in drain_events
2012-11-20 23:16:39 21855 TRACE nova.openstack.common.rpc.common return self.wait_multi(self.channels.values(), timeout=timeout)
2012-11-20 23:16:39 21855 TRACE nova.openstack.common.rpc.common File "/usr/local/lib/python2.7/dist-packages/kombu/transport/pyamqplib.py", line 56, in wait_multi
2012-11-20 23:16:39 21855 TRACE nova.openstack.common.rpc.common chanmap.keys(), allowed_methods, timeout=timeout)
2012-11-20 23:16:39 21855 TRACE nova.openstack.common.rpc.common File "/usr/local/lib/python2.7/dist-packages/kombu/tran...

Revision history for this message
David Kranz (david-kranz) wrote :

When I said this was not a timeout issue, I meant that, from the tempest perspective, the test is not failing because it took too long to reach the expected state: the instance goes to ERROR quickly. It is of course possible that the bad behavior is caused by some internal timeout. This bug continues to block many attempts to check code into tempest. I hope it can be fixed soon so we can add more tests to the project gate and prevent this kind of bug from slipping in again.
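For clarity on that distinction, the wait helper behaves roughly like this (a minimal sketch under assumed names, not the actual tempest servers_client code): it fails the moment the server reports ERROR, regardless of how much of the overall timeout remains.

```python
import time


class BuildErrorException(Exception):
    """Raised when a server enters ERROR instead of the expected state."""


def wait_for_server_status(get_status, server_id, target_status,
                           interval=3, timeout=400):
    """Poll until the server reaches target_status.

    The key point: an ERROR status fails the wait immediately, long
    before the overall timeout would expire.
    """
    start = time.time()
    while time.time() - start < timeout:
        status = get_status(server_id)
        if status == target_status:
            return
        if status == 'ERROR':
            raise BuildErrorException(
                'Server %s failed to build and is in ERROR status' % server_id)
        time.sleep(interval)
    raise BuildErrorException(
        'Server %s never reached %s within %ss'
        % (server_id, target_status, timeout))
```

So a BuildErrorException seconds after the create call, as in the logs above, means the instance itself errored out, not that a timer fired.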

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

David,

The only thing I can think of is some sort of problem with rabbitmq. We don't seem to capture logs related to queueing, do we? Where could we make the change to capture that log?

thanks,
dims

Revision history for this message
David Kranz (david-kranz) wrote :

This bug is standing in the way of applying most of the full tempest gate to nova. Another nova bug slipped through the partial gate yesterday. I see that this ticket is not even assigned or marked as important. This behavior seems like a pretty serious regression, and I hope it can be resolved soon.

Revision history for this message
David Kranz (david-kranz) wrote :

As of 9am on 12/8, this is happening on every run of periodic-tempest-devstack-vm-check-hourly.

Sean Dague (sdague)
Changed in nova:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Sean Dague (sdague-b)
Sean Dague (sdague)
Changed in tempest:
assignee: nobody → Sean Dague (sdague-b)
importance: Undecided → Critical
status: New → In Progress
Revision history for this message
Sean Dague (sdague) wrote :

This is a possible fix: https://review.openstack.org/#/c/18098/. It turns out we were being very sloppy about cleaning up servers; the net result was that we occasionally overran the memory of the CI systems when deletes didn't process fast enough, which caused VMs that we assumed would start to fail to start.
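In sketch form, the fix amounts to making teardown block on the asynchronous deletes instead of fire-and-forget (delete_server and server_exists here are hypothetical client methods, stand-ins for whatever the real test client exposes, not the actual tempest change):

```python
import time


def cleanup_servers(client, server_ids, interval=2, timeout=120):
    """Delete servers and block until the deletes actually complete.

    Waiting is the point: because delete is asynchronous, issuing the
    delete call alone lets dead servers pile up and exhaust memory on
    small CI hosts; polling until they are gone keeps resource usage
    bounded between tests.
    """
    for sid in server_ids:
        client.delete_server(sid)
    deadline = time.time() + timeout
    remaining = set(server_ids)
    while remaining and time.time() < deadline:
        remaining = {sid for sid in remaining if client.server_exists(sid)}
        if remaining:
            time.sleep(interval)
    if remaining:
        raise RuntimeError('servers not deleted in time: %s' % sorted(remaining))
```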

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tempest (master)

Reviewed: https://review.openstack.org/18098
Committed: http://github.com/openstack/tempest/commit/dd523b1f3f56e083a338c630e0eace90aa625615
Submitter: Jenkins
Branch: master

commit dd523b1f3f56e083a338c630e0eace90aa625615
Author: Sean Dague <email address hidden>
Date: Thu Dec 13 17:31:22 2012 -0500

    ensure servers are deleted between tests

    Possible fix for bug #1079687

    When running tests we create a lot of servers, and do delete
    calls at the end of tests. However delete is an async action,
    so the previous servers will often be hanging around after our
    tests. In memory constrained environments like the OpenStack CI
    system if deletes happen too slowly we hit the memory limits,
    nova scheduler can no longer schedule VMs, and the tests get flakey.

    This *may* fix the periodic fails we've seen on tempest in nova.

    Pair programmed with Matt Treinish <email address hidden>

    Change-Id: I1f93ac26064c68bf529e2efa1bf0c35e1f4e7d2c

Changed in tempest:
status: In Progress → Fix Released
Revision history for this message
Sean Dague (sdague) wrote :

This is solved in the general case in tempest.

Changed in nova:
status: Triaged → Invalid
Sean Dague (sdague)
Changed in tempest:
milestone: none → havana-3