test_create_server TimeoutException failed while waiting for server to build in setup

Bug #1315201 reported by Matt Riedemann
Affects                  | Status       | Importance | Assigned to | Milestone
OpenStack Compute (nova) | Expired      | Undecided  | Unassigned  |
tempest                  | Fix Released | Undecided  | Sean Dague  |

Bug Description

There are already several timeout-related bugs, but none of them really fits a timeout while waiting for the server to build in setup for this test. It's not the same as bug 1254890 in terms of where it fails in Tempest, though the underlying issues in nova could be similar.

http://logs.openstack.org/37/84037/8/check/check-grenade-dsvm-partial-ncpu/ab64155/console.html

message:"Details\: Server" AND message:"failed to reach ACTIVE status and task state \"None\" within the required time" AND message:"Current status\: BUILD. Current task state\: spawning." AND tags:console

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiRGV0YWlsc1xcOiBTZXJ2ZXJcIiBBTkQgbWVzc2FnZTpcImZhaWxlZCB0byByZWFjaCBBQ1RJVkUgc3RhdHVzIGFuZCB0YXNrIHN0YXRlIFxcXCJOb25lXFxcIiB3aXRoaW4gdGhlIHJlcXVpcmVkIHRpbWVcIiBBTkQgbWVzc2FnZTpcIkN1cnJlbnQgc3RhdHVzXFw6IEJVSUxELiBDdXJyZW50IHRhc2sgc3RhdGVcXDogc3Bhd25pbmcuXCIgQU5EIHRhZ3M6Y29uc29sZSIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiNjA0ODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM5ODk4ODE3Njc0OX0=

48 hits in 7 days, all failures, in both the check and gate queues across several different jobs. Since it's a timeout, there isn't an error in the nova logs.
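
For context on where the TimeoutException comes from: tempest polls the server until it reaches the expected status with no pending task state, and raises once the build timeout expires. Below is a minimal sketch of that polling loop; the method name, the show_server call, and the OS-EXT-STS:task_state key are approximations of tempest's waiter, not the actual tempest code.

    import time

    class TimeoutException(Exception):
        pass

    def wait_for_server_status(client, server_id, status='ACTIVE',
                               build_timeout=196, build_interval=1):
        """Poll until the server reaches `status` with no task state, or time out."""
        start = time.time()
        while True:
            body = client.show_server(server_id)['server']
            task_state = body.get('OS-EXT-STS:task_state')
            if body['status'] == status and task_state is None:
                return body
            if time.time() - start >= build_timeout:
                raise TimeoutException(
                    'Server %s failed to reach %s status and task state "None" '
                    'within the required time (%s s). Current status: %s. '
                    'Current task state: %s.' %
                    (server_id, status, build_timeout, body['status'], task_state))
            time.sleep(build_interval)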

Tags: testing
Revision history for this message
Matt Riedemann (mriedem) wrote :
tags: added: testing
Revision history for this message
Matt Riedemann (mriedem) wrote :

Similar timeout, different test, probably a different root cause (and certainly a different bug, 1255624). I don't know that there is much point in even reporting it now, since we can't really fingerprint it well; we need to invest more time in figuring out why we are hitting these timeouts in the first place, e.g. system load, or something taking too long in nova. We need better ways to diagnose this - maybe the diagnostics API could help at the point of failure?

Matt Riedemann (mriedem)
Changed in nova:
importance: Undecided → High
Revision history for this message
Matt Riedemann (mriedem) wrote :

Hit another one here, different test:

http://logs.openstack.org/44/95844/3/check/check-grenade-dsvm-partial-ncpu/08f1129/console.html

Details: (ServerRescueTestXML:setUpClass) Server a06f3c8e-8018-44f7-b7f3-b45c7f9c8e6b failed to reach ACTIVE status and task state "None" within the required time (196 s). Current status: BUILD. Current task state: spawning.

Looking at dstat when that failed:

http://logs.openstack.org/44/95844/3/check/check-grenade-dsvm-partial-ncpu/08f1129/logs/old/screen-dstat.txt.gz

02-06 19:24:47| 81 8 10 0 0 0|4333M 291M 2450M 660M|1408B 3028B| 261k 4544k|18.0 370 |7847 12k|9.44 5.48 2.42|7.0 0 50|qemu-system-x86_64 28626 12% 0 0

CPU idle is at 10%, so the system is getting pegged; note that around that time qemu-system is consuming the most CPU.
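
For anyone digging through similar failures, a quick way to scan a screen-dstat capture for overloaded samples is to split on the pipe separators and check the idle column. This is a hedged sketch: the column layout is assumed from the sample line above (timestamp | usr sys idl wai hiq siq | ...) and may differ with other dstat options.

    def low_idle_samples(path, threshold=15):
        """Return (timestamp, idle%) pairs where CPU idle fell below threshold."""
        hits = []
        with open(path) as f:
            for line in f:
                fields = line.split('|')
                if len(fields) < 2:
                    continue
                cpu = fields[1].split()
                try:
                    idle = int(cpu[2])  # third CPU column is 'idl' in this layout
                except (IndexError, ValueError):
                    continue  # header or malformed line
                if idle < threshold:
                    hits.append((fields[0].strip(), idle))
        return hits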

Wondering if this could be related to bug 1320617.

I started a mailing list thread on the topic here also:

http://lists.openstack.org/pipermail/openstack-dev/2014-May/035253.html

For this particular test failure, I think we could get some better logging around these messages (see the sketch after this list):

1. DEBUG nova.openstack.common.processutils [-] Running cmd (subprocess): env LC_ALL=C LANG=C qemu-img info /opt/stack/data/nova/instances/a06f3c8e-8018-44f7-b7f3-b45c7f9c8e6b/disk execute /opt/stack/old/nova/nova/openstack/common/processutils.py:154

2. "During sync_power_state the instance has a pending task. Skip."

In the failure above, the instance domain name is instance-00000008 in libvirt.
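
As a sketch of what "better logging" around item 1 could look like: wrap the qemu-img call with a duration log so that slow invocations stand out in nova-compute's logs. The processutils import path matches the log line quoted above; the wrapper itself and its threshold are illustrative, not nova's actual code.

    import logging
    import time

    from nova.openstack.common import processutils

    LOG = logging.getLogger(__name__)

    def qemu_img_info_timed(disk_path):
        """Run 'qemu-img info' and log how long it took."""
        start = time.time()
        out, err = processutils.execute('env', 'LC_ALL=C', 'LANG=C',
                                        'qemu-img', 'info', disk_path)
        elapsed = time.time() - start
        if elapsed > 1.0:  # arbitrary threshold for "suspiciously slow"
            LOG.warning('qemu-img info on %s took %.2fs', disk_path, elapsed)
        return out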

Revision history for this message
Matt Riedemann (mriedem) wrote :

Talking with danpb and sdague in IRC, it sounds like 'qemu-img info' should be cheap, and nova-rootwrap might actually be the bigger cost. Sean's suggestion was that we should capture console output from the instance when this times out and fails in the rescue tests: the thinking is that something is wrong with the guest and we're never going to get the domain into the running state, so we need the console output to help figure out why.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Related Tempest bug to dump the console output from the instance on timeout or error:

https://review.openstack.org/#/c/97812/

Changed in tempest:
status: New → In Progress
assignee: nobody → Sean Dague (sdague)
Andrew Laski (alaski)
Changed in nova:
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tempest (master)

Reviewed: https://review.openstack.org/97812
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=c1a3bbe37c6202562fad5ff1be6e818d12797aee
Submitter: Jenkins
Branch: master

commit c1a3bbe37c6202562fad5ff1be6e818d12797aee
Author: Sean Dague <email address hidden>
Date: Wed Jun 4 10:49:09 2014 -0400

    dump console if we time out on a server op

    if we timeout waiting for a server to do a thing, also dump the
    console before we raise the exception. This will hopefully help
    on debugging failures.

    Related-Bug: #1315201

    Change-Id: I89d9a93fc02d9604ebc70cd26a2f5d63fbde4415
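
The shape of that change is roughly: wrap the status wait, and on timeout fetch and log the guest console before re-raising, so the failed job leaves evidence of what the guest was doing. A rough sketch, reusing the waiter sketch from the bug description above; the get_console_output call approximates the compute API of the time, and this is not the literal tempest diff:

    def wait_and_dump_console(client, server_id, status='ACTIVE'):
        """Wait for the server; on timeout, log the console, then re-raise."""
        try:
            wait_for_server_status(client, server_id, status)
        except TimeoutException:
            output = client.get_console_output(server_id, length=100)['output']
            LOG.error('Console output for server %s:\n%s', server_id, output)
            raise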

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote : Cleanup EOL bug report

This is an automated cleanup. This bug report has been closed because it
is older than 18 months and there is no open code change to fix it.
After this much time it is unlikely that the circumstances which led to
the observed issue can still be reproduced.

If you can reproduce the bug, please:
* reopen the bug report (set to status "New")
* AND add the detailed steps to reproduce the issue (if applicable)
* AND leave a comment "CONFIRMED FOR: <RELEASE_NAME>"
  Only still supported release names are valid (LIBERTY, MITAKA, OCATA, NEWTON).
  Valid example: CONFIRMED FOR: LIBERTY

Changed in nova:
importance: High → Undecided
status: Confirmed → Expired
Revision history for this message
Richard Theis (rtheis) wrote :

What is the current status of these "failed to reach ACTIVE status and task state 'None' within the required time (196 s)" errors? Logstash shows them as pervasive. In networking-ovn, we have been rechecking past such errors, and we suspect that others are too.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Richard, you generally need better logging in the services themselves to see what's going on and why we're hitting the timeouts. It just takes people putting in the time to debug issues like this and to improve the service logging along the way.

Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

The gate is in good shape now.

Changed in tempest:
status: In Progress → Fix Released