[Image] failed to reach ACTIVE status within the required time (196 s). Current status: SAVING

Bug #1320617 reported by Attila Fazekas on 2014-05-18
22
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Ceilometer
Invalid
Undecided
Unassigned
OpenStack Compute (nova)
Undecided
Unassigned

Bug Description

http://logs.openstack.org/08/92608/1/gate/gate-tempest-dsvm-postgres-full/57b137a/console.html.gz

The relevant part of the test case:
1. Boot servet1
2. Boot server2 wait until active
3. wait until server1 ACTIVE
4. Create snapshot from server1, wait until ACTIVE with 196 sec timeout
5. Cleanup, die with the first failure

Normally the test case would create 2 additional image at the beginning,
but now died at fist image creation.

2014-05-09 21:44:09.857 | Captured traceback:
2014-05-09 21:44:09.858 | ~~~~~~~~~~~~~~~~~~~
2014-05-09 21:44:09.858 | Traceback (most recent call last):
2014-05-09 21:44:09.858 | File "tempest/api/compute/images/test_list_image_filters.py", line 45, in setUpClass
2014-05-09 21:44:09.858 | cls.server1['id'], wait_until='ACTIVE')
2014-05-09 21:44:09.858 | File "tempest/api/compute/base.py", line 302, in create_image_from_server
2014-05-09 21:44:09.858 | kwargs['wait_until'])
2014-05-09 21:44:09.858 | File "tempest/services/compute/xml/images_client.py", line 140, in wait_for_image_status
2014-05-09 21:44:09.858 | waiters.wait_for_image_status(self, image_id, status)
2014-05-09 21:44:09.858 | File "tempest/common/waiters.py", line 129, in wait_for_image_status
2014-05-09 21:44:09.858 | raise exceptions.TimeoutException(message)
2014-05-09 21:44:09.859 | TimeoutException: Request timed out
2014-05-09 21:44:09.859 | Details: (ListImageFiltersTestXML:setUpClass) Image 20b6e7a9-f65d-4d17-b025-59f9237ff8cb failed to reach ACTIVE status within the required time (196 s). Current status: SAVING.

logstash:
http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiZmFpbGVkIHRvIHJlYWNoIEFDVElWRSBzdGF0dXMgd2l0aGluIHRoZSByZXF1aXJlZCB0aW1lXCIgQU5EIGZpbGVuYW1lOlwiY29uc29sZS5odG1sXCIgQU5EIG1lc3NhZ2U6XCJDdXJyZW50IHN0YXR1czogU0FWSU5HXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6ImFsbCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjE0MDA0MTEyNDcwMzksIm1vZGUiOiIiLCJhbmFseXplX2ZpZWxkIjoiIn0=

message:"failed to reach ACTIVE status within the required time" AND filename:"console.html" AND message:"Current status: SAVING"

description: updated
description: updated
Matt Riedemann (mriedem) wrote :

We could have a ton of these various timeout issues in Tempest, we're never going to nail these down properly so we need to step back and start looking at Tempest and thinking about how to make that smarter than assuming a hard-coded value is going to fit all scenarios, i.e. there could be concurrent load on the host causing it to run slow at the time of failure, or we could look at getting diagnostic information back about the instance and logging that to see if there is something more specific to fingerprint when we hit a timeout failure.

Matt Riedemann (mriedem) wrote :

For example, when this failed in the log link above, around time 21:44:09.859, you can see in dstat that ceilometer agent notification was really impacting the CPU:

http://logs.openstack.org/08/92608/1/gate/gate-tempest-dsvm-postgres-full/57b137a/logs/dstat.txt.gz

09-05 21:44:09| 68 14 16 1 0 1|3933M 184M 2529M 837M| 0 0 | 48k 7780k|5.00 192 |4588 8707 |8.20 6.81 3.49| 13 0 52|ceilometer-agent-noti2052 7.5%3552B 97k

It'd be cool if we could dynamically adjust timeouts based on current system load...but maybe that's just overly complicating things - the alternative is probably just making ceilometer better and consuming less CPU all the time.

Matt Riedemann (mriedem) wrote :

I have a partial bug proof of concept up for Tempest here: https://review.openstack.org/#/c/94245/

Matt Riedemann (mriedem) wrote :

Also pushed this related bug fix to devstack: https://review.openstack.org/#/c/94221/

Matt Riedemann (mriedem) on 2014-05-20
summary: - failed to reach ACTIVE status within the required time (196 s). Current
- status: SAVING
+ [Image] failed to reach ACTIVE status within the required time (196 s).
+ Current status: SAVING
Matt Riedemann (mriedem) wrote :

Another idea that would be nice to help debug this is change tempest to launch the snapshot as a task to glance and then if it fails we can get information about the task when tempest times out.

Tracy Jones (tjones-i) on 2014-05-21
tags: added: testing
Eoghan Glynn (eglynn) wrote :

Due to some recent improvements in the ceilometer sqlalcmehy driver (on master as opposed to stable/icehouse), the CPU load imposed by ceilometer has significantly reduced.

Are you still seeing this issue occur regularly on master?

Attila Fazekas (afazekas) wrote :

196 sec sounds like an eternity, compered to the normal 6-7 sec.

At the first look of the logs seamed like n-cpu always picked another task instead of serving an earlier arrived snapshot create.

The build timeout should be configured on the devstack-gate side and devstack should allow to override the default value with environment variables.
An another option is switching to local.conf on the gate and post configuring tempest.
(Unfortunately devstack stable/havana version not able to change the tempest.conf by the local.conf.)

AFAIK we still use only 1 glance worker.

Matt Riedemann (mriedem) wrote :

@afazekas, do you think we should set glance workers to nproc in devstack-gate to see if it helps here at all? Note that nova by defaults runs with $nproc workers for the api/meta and conductor workers.

gordon chung (chungg) wrote :

is this still an issue? specifically connected to ceilometer? the logstash query in description still pulls some results but it seems unrelated to Ceilometer:

http://logs.openstack.org/65/98365/6/check/check-tempest-dsvm-full/fde2d04/logs/screen-dstat.txt.gz

if we still think it's a ceilometer issue, we can enable multiple workers (we currently run on a single thread) in ceilometer

Changed in ceilometer:
status: New → Incomplete
Kashyap Chamarthy (kashyapc) wrote :

Any update here? The DevStack patch seems to be abandoned.

In one of the comments, Sean Dague says "For instance, the orchestration image and timeout are coupled, and are set in the gate. The reason this is a timeout is that fedora is sometimes not booting at all in the gate."

Is the above still the case?

Meanwhile, I can confirm this same time-out issue reoccurring for "check-tempest-dsvm-full" job --

http://logs.openstack.org/90/102490/1/check/check-tempest-dsvm-full/b241a13/console.html#_2014-06-25_11_34_43_913

Contextual snippet:
--------------
[. . .]
2014-06-25 11:34:43.906 | Details: (ImagesOneServerTestJSON:tearDown) Server 6df01c6c-27a3-4e87-b25d-bc1d62dff5b9 failed to reach ACTIVE status and task state "None" within the required time (196 s). Current status: ACTIVE. Current task state: image_pending_upload.
2014-06-25 11:34:43.906 | 2014-06-25 11:16:00.362 21498 TRACE tempest.api.compute.base Traceback (most recent call last):
2014-06-25 11:34:43.906 | 2014-06-25 11:16:00.362 21498 TRACE tempest.api.compute.base File "tempest/api/compute/base.py", line 161, in server_check_teardown
2014-06-25 11:34:43.906 | 2014-06-25 11:16:00.362 21498 TRACE tempest.api.compute.base 'ACTIVE')
2014-06-25 11:34:43.906 | 2014-06-25 11:16:00.362 21498 TRACE tempest.api.compute.base File "tempest/services/compute/json/servers_client.py", line 173, in wait_for_server_status
2014-06-25 11:34:43.906 | 2014-06-25 11:16:00.362 21498 TRACE tempest.api.compute.base raise_on_error=raise_on_error)
2014-06-25 11:34:43.907 | 2014-06-25 11:16:00.362 21498 TRACE tempest.api.compute.base File "tempest/common/waiters.py", line 107, in wait_for_server_status
2014-06-25 11:34:43.907 | 2014-06-25 11:16:00.362 21498 TRACE tempest.api.compute.base raise exceptions.TimeoutException(message)
2014-06-25 11:34:43.907 | 2014-06-25 11:16:00.362 21498 TRACE tempest.api.compute.base TimeoutException: Request timed out
2014-06-25 11:34:43.907 | 2014-06-25 11:16:00.362 21498 TRACE tempest.api.compute.base Details: (ImagesOneServerTestJSON:tearDown) Server 6df01c6c-27a3-4e87-b25d-bc1d62dff5b9 failed to reach ACTIVE status and task state "None" within the required time (196 s). Current status: ACTIVE. Current task state: image_pending_upload.
2014-06-25 11:34:43.907 | 2014-06-25 11:16:00.362 21498 TRACE tempest.api.compute.base
[. . .]
--------------

Daniel Berrange (berrange) wrote :

FYI, one option to perhaps help reliability to try to look at whether the guest OS is 'making progress' in booting. eg monitor the disk & network I/O stats - if those are changing it is reasonable to assume the guest is still booting or working & thus allow things to wait longer to complete. If those are not changing over a 60 second period it is more reasonable to assume it has either hung or finished and so abort the tests. This could help avoid having to have a ridiculously long timeout in the general case.

Matt Riedemann (mriedem) wrote :

I'd like to see us use the diagnostics API in Tempest to get information off the instance when we hit a timeout, just haven't had the time yet.

http://docs.openstack.org/api/openstack-compute/2/content/GET_os-diagnostics-v2_getDiagnostics__v2__tenant_id__servers__server_id__diagnostics_ext-diagnostics.html

Note that not all virt drivers implement that so we'd have to handle that appropriately in Tempest. It's also inconsistent data in the V2 API, but that's OK in my opinion for just dumping it to the tempest/console log for debug.

As for the glance/cinder changes to devstack for running as many API workers as there are CPUs on the host, I've moved that work to the actual projects rather than devstack, tracked via bug 1333325 and bug 1333370.

Attila Fazekas (afazekas) wrote :

I generally would like to have more workers with everything by default, but in this case I do not have anything which would prove it would really help.

Matt Riedemann (mriedem) wrote :

This is hitting the _live_snapshot code in the libvirt driver:

http://logs.openstack.org/90/102490/1/check/check-tempest-dsvm-full/b241a13/logs/screen-n-cpu.txt.gz?level=TRACE#_2014-06-25_11_13_07_909

This is due to running trusty with libvirt 1.2.2 in the gate, which made this spike the same as bug 1334398. The errors are slightly different, this one blows up in waiting for the block job abort, but it indicates why they are spiking in the gate at the same time and then drop off with the same workaround for now:

https://review.openstack.org/#/c/102643/

Joe Gordon (jogo) wrote :

no hits in the gate anymore looks like this was resolved or the fingerprint has changed.

Changed in nova:
status: New → Incomplete
Sean Dague (sdague) on 2014-09-17
Changed in nova:
status: Incomplete → Invalid
gordon chung (chungg) wrote :

marking this invalid from ceilometer pov, since this hasn't cropped up. as eglynn mentioned, there were various changes to address issues... not really sure if consuming 7.5% cpu is the root of issue regardless.

Changed in ceilometer:
status: Incomplete → Invalid
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers