test_create_server TimeoutException failed while waiting for server to build in setup

Bug #1315201 reported by Matt Riedemann
Affects                  | Status       | Importance | Assigned to | Milestone
OpenStack Compute (nova) | Expired      | Undecided  | Unassigned  |
tempest                  | Fix Released | Undecided  | Sean Dague  |

Bug Description

There are already several timeout-related bugs, but none of them really fits a timeout while waiting for the server to build in setup for this test. It's not the same as bug 1254890 in terms of where it fails in Tempest, though the underlying issues in nova could be similar.

http://logs.openstack.org/37/84037/8/check/check-grenade-dsvm-partial-ncpu/ab64155/console.html

message:"Details\: Server" AND message:"failed to reach ACTIVE status and task state \"None\" within the required time" AND message:"Current status\: BUILD. Current task state\: spawning." AND tags:console

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiRGV0YWlsc1xcOiBTZXJ2ZXJcIiBBTkQgbWVzc2FnZTpcImZhaWxlZCB0byByZWFjaCBBQ1RJVkUgc3RhdHVzIGFuZCB0YXNrIHN0YXRlIFxcXCJOb25lXFxcIiB3aXRoaW4gdGhlIHJlcXVpcmVkIHRpbWVcIiBBTkQgbWVzc2FnZTpcIkN1cnJlbnQgc3RhdHVzXFw6IEJVSUxELiBDdXJyZW50IHRhc2sgc3RhdGVcXDogc3Bhd25pbmcuXCIgQU5EIHRhZ3M6Y29uc29sZSIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiNjA0ODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM5ODk4ODE3Njc0OX0=

48 hits in 7 days, all failures, in both the check and gate queues across several different jobs. Since it's a timeout, there isn't an error in the nova logs.
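
For context on where the TimeoutException comes from: tempest polls the server until it reaches the expected status with no pending task state, and raises once the build timeout expires. Below is a minimal sketch of that polling loop; the method name, the show_server call, and the OS-EXT-STS:task_state key are approximations of tempest's waiter, not the actual tempest code.

    import time

    class TimeoutException(Exception):
        pass

    def wait_for_server_status(client, server_id, status='ACTIVE',
                               build_timeout=196, build_interval=1):
        """Poll until the server reaches `status` with no task state, or time out."""
        start = time.time()
        while True:
            body = client.show_server(server_id)['server']
            task_state = body.get('OS-EXT-STS:task_state')
            if body['status'] == status and task_state is None:
                return body
            if time.time() - start >= build_timeout:
                raise TimeoutException(
                    'Server %s failed to reach %s status and task state "None" '
                    'within the required time (%s s). Current status: %s. '
                    'Current task state: %s.' %
                    (server_id, status, build_timeout, body['status'], task_state))
            time.sleep(build_interval)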

Tags: testing
Revision history for this message
Matt Riedemann (mriedem) wrote :
tags: added: testing
Revision history for this message
Matt Riedemann (mriedem) wrote :

Similar timeout, different test, probably a different root cause (and certainly a different bug, 1255624). I don't know that there is much point in even reporting it now, since we can't really fingerprint it well; we need to invest more time in figuring out why we are hitting these timeouts in the first place, e.g. system load, or something taking too long in nova. We need better ways to diagnose this - maybe the diagnostics API could help at the point of failure?

Matt Riedemann (mriedem)
Changed in nova:
importance: Undecided → High
Revision history for this message
Matt Riedemann (mriedem) wrote :

Hit another one here, different test:

http://logs.openstack.org/44/95844/3/check/check-grenade-dsvm-partial-ncpu/08f1129/console.html

Details: (ServerRescueTestXML:setUpClass) Server a06f3c8e-8018-44f7-b7f3-b45c7f9c8e6b failed to reach ACTIVE status and task state "None" within the required time (196 s). Current status: BUILD. Current task state: spawning.

Looking at dstat when that failed:

http://logs.openstack.org/44/95844/3/check/check-grenade-dsvm-partial-ncpu/08f1129/logs/old/screen-dstat.txt.gz

02-06 19:24:47| 81 8 10 0 0 0|4333M 291M 2450M 660M|1408B 3028B| 261k 4544k|18.0 370 |7847 12k|9.44 5.48 2.42|7.0 0 50|qemu-system-x86_64 28626 12% 0 0

CPU idle is at 10%, so the system is getting pegged; note that around that time qemu-system is consuming the most CPU.
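
For anyone digging through similar failures, a quick way to scan a screen-dstat capture for overloaded samples is to split on the pipe separators and check the idle column. This is a hedged sketch: the column layout is assumed from the sample line above (timestamp | usr sys idl wai hiq siq | ...) and may differ with other dstat options.

    def low_idle_samples(path, threshold=15):
        """Return (timestamp, idle%) pairs where CPU idle fell below threshold."""
        hits = []
        with open(path) as f:
            for line in f:
                fields = line.split('|')
                if len(fields) < 2:
                    continue
                cpu = fields[1].split()
                try:
                    idle = int(cpu[2])  # third CPU column is 'idl' in this layout
                except (IndexError, ValueError):
                    continue  # header or malformed line
                if idle < threshold:
                    hits.append((fields[0].strip(), idle))
        return hits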

Wondering if this could be related to bug 1320617.

I started a mailing list thread on the topic here also:

http://lists.openstack.org/pipermail/openstack-dev/2014-May/035253.html

For this particular test failure, I think we could get some better logging around these messages (see the sketch after this list):

1. DEBUG nova.openstack.common.processutils [-] Running cmd (subprocess): env LC_ALL=C LANG=C qemu-img info /opt/stack/data/nova/instances/a06f3c8e-8018-44f7-b7f3-b45c7f9c8e6b/disk execute /opt/stack/old/nova/nova/openstack/common/processutils.py:154

2. "During sync_power_state the instance has a pending task. Skip."

In the failure above, the instance domain name is instance-00000008 in libvirt.
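
As a sketch of what "better logging" around item 1 could look like: wrap the qemu-img call with a duration log so that slow invocations stand out in nova-compute's logs. The processutils import path matches the log line quoted above; the wrapper itself and its threshold are illustrative, not nova's actual code.

    import logging
    import time

    from nova.openstack.common import processutils

    LOG = logging.getLogger(__name__)

    def qemu_img_info_timed(disk_path):
        """Run 'qemu-img info' and log how long it took."""
        start = time.time()
        out, err = processutils.execute('env', 'LC_ALL=C', 'LANG=C',
                                        'qemu-img', 'info', disk_path)
        elapsed = time.time() - start
        if elapsed > 1.0:  # arbitrary threshold for "suspiciously slow"
            LOG.warning('qemu-img info on %s took %.2fs', disk_path, elapsed)
        return out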

Revision history for this message
Matt Riedemann (mriedem) wrote :

Talking with danpb and sdague in IRC, it sounds like 'qemu-img info' should be cheap, and nova-rootwrap might actually be the bigger cost. Sean's suggestion was that we should capture console output from the instance when this times out and fails in the rescue tests: the thinking is that something is wrong with the guest and we're never going to get the domain into the running state, so we need the console output to help figure out why.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Related Tempest bug to dump the console output from the instance on timeout or error:

https://review.openstack.org/#/c/97812/

Changed in tempest:
status: New → In Progress
assignee: nobody → Sean Dague (sdague)
Andrew Laski (alaski)
Changed in nova:
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tempest (master)

Reviewed: https://review.openstack.org/97812
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=c1a3bbe37c6202562fad5ff1be6e818d12797aee
Submitter: Jenkins
Branch: master

commit c1a3bbe37c6202562fad5ff1be6e818d12797aee
Author: Sean Dague <email address hidden>
Date: Wed Jun 4 10:49:09 2014 -0400

    dump console if we time out on a server op

    if we timeout waiting for a server to do a thing, also dump the
    console before we raise the exception. This will hopefully help
    on debugging failures.

    Related-Bug: #1315201

    Change-Id: I89d9a93fc02d9604ebc70cd26a2f5d63fbde4415
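
The shape of that change is roughly: wrap the status wait, and on timeout fetch and log the guest console before re-raising, so the failed job leaves evidence of what the guest was doing. A rough sketch, reusing the waiter sketch from the bug description above; the get_console_output call approximates the compute API of the time, and this is not the literal tempest diff:

    def wait_and_dump_console(client, server_id, status='ACTIVE'):
        """Wait for the server; on timeout, log the console, then re-raise."""
        try:
            wait_for_server_status(client, server_id, status)
        except TimeoutException:
            output = client.get_console_output(server_id, length=100)['output']
            LOG.error('Console output for server %s:\n%s', server_id, output)
            raise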

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote : Cleanup EOL bug report

This is an automated cleanup. This bug report has been closed because it
is older than 18 months and there is no open code change to fix it.
After this much time it is unlikely that the circumstances which led to
the observed issue can still be reproduced.

If you can reproduce the bug, please:
* reopen the bug report (set to status "New")
* AND add the detailed steps to reproduce the issue (if applicable)
* AND leave a comment "CONFIRMED FOR: <RELEASE_NAME>"
  Only still supported release names are valid (LIBERTY, MITAKA, OCATA, NEWTON).
  Valid example: CONFIRMED FOR: LIBERTY

Changed in nova:
importance: High → Undecided
status: Confirmed → Expired
Revision history for this message
Richard Theis (rtheis) wrote :

What is the current status of these "failed to reach ACTIVE status and task state 'None' within the required time (196 s)" errors? Logstash shows them as pervasive. In networking-ovn, we have been rechecking past such errors, and we suspect that others are too.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Richard, you generally need better logging in the services themselves to see what's going on and why we're hitting the timeouts. It just takes people putting in the time to debug issues like this and to improve the service logging along the way.

Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

The gate is in good shape now.

Changed in tempest:
status: In Progress → Fix Released