timeout causing gate-tempest-dsvm-full to fail

Bug #1258682 reported by John Dickinson on 2013-12-06
This bug affects 7 people
Affects                        Status       Importance  Assigned to  Milestone
OpenStack Compute (nova)                    Undecided   Unassigned
OpenStack Core Infrastructure  In Progress  High        Unassigned
tempest                                     Undecided   Unassigned

Bug Description

This has happened several times. A recent example is in https://jenkins02.openstack.org/job/gate-tempest-dsvm-full/775/console

There are several mentions of FAIL in the logs, but since the job timed out, no console logs were saved.

John Dickinson (notmyname) wrote :

See http://paste.openstack.org/show/54727/ for an example of the logs, including tracebacks (the test node was preserved and its logs were examined).

Matthew Treinish (treinish) wrote :

I've uploaded some of the logs from the held test node that John mentioned above. That pastebin was just a grep of all the logs for TRACE. I can upload any of the other service logs from the node; I just figured these would be the most useful.

Abhishek Chanda (abhishek-i) wrote :

Marking as critical since it blocks a gating job

Changed in nova:
status: New → Triaged
importance: Undecided → Critical
Joe Gordon (jogo) wrote :

Abhishek, are you sure this is a nova issue? If we don't have an elastic-recheck fingerprint for this (which makes it harder to see whether it's a duplicate), then we shouldn't mark it as critical. I also don't see any comments about how you confirmed this is a nova issue, so I'm moving it back to New.

Changed in nova:
importance: Critical → Undecided
status: Triaged → New
Sean Dague (sdague) wrote :

If the fix is increasing the timeout in the gate, it's not a tempest bug. It looks like libvirt went off the rails in this case, so nova is probably a good bug choice.

Changed in tempest:
status: New → Invalid
James E. Blair (corvus) on 2013-12-13
tags: added: gate-failure
Matt Riedemann (mriedem) wrote :

Looks like bug 1253185 was the same issue before but was closed with this comment from James Blair:

"This was from when serving zuul refs was very slow; since fixed."

That was on 12/10. I saw it this morning with bug 1260816.

James E. Blair (corvus) wrote :

1253185 really was a different problem -- in the run cited in that bug, it took 25 minutes for devstack to start because zuul refs were being served slowly.

Related fix proposed to branch: master
Review: https://review.openstack.org/62084

Reviewed: https://review.openstack.org/62067
Committed: https://git.openstack.org/cgit/openstack-infra/elastic-recheck/commit/?id=6b60f3b09bc85b4a4791c19821e187313cb2d1db
Submitter: Jenkins
Branch: master

commit 6b60f3b09bc85b4a4791c19821e187313cb2d1db
Author: Matt Riedemann <email address hidden>
Date: Fri Dec 13 11:41:22 2013 -0800

    Add e-r query for bug 1258682

    When the build times out and this fails, there are no logs really
    so we have to base this on the build timeout message in the
    console log.

    Note that we are essentially doing a wildcard for the timeout
    value but we restrict the query based on build_name to avoid
    hits on some swift jobs.

    Related-Bug: #1258682

    Change-Id: I0db0e08627609b44ec8ea132b980021f8d7b7b9d
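
A rough Python sketch of the matching idea in the commit message above, not the actual elastic-recheck query. The timeout message pattern, the `matches_timeout` helper, and the job names in `TRACKED_JOBS` are assumptions made for illustration; the real query lives in the elastic-recheck repository and may look quite different.

```python
import re

# Assumed shape of the console timeout line. The minute count differs per
# job, so it is matched with \d+ rather than a fixed value.
TIMEOUT_RE = re.compile(r"Build timed out \(after \d+ minutes\)")

# Hypothetical allow-list standing in for the build_name restriction.
TRACKED_JOBS = {"gate-tempest-dsvm-full", "check-tempest-dsvm-full"}


def matches_timeout(build_name, console_line):
    """Return True if the line looks like a build timeout on a tracked job."""
    if build_name not in TRACKED_JOBS:
        return False
    return TIMEOUT_RE.search(console_line) is not None
```

Restricting on the job name first keeps unrelated jobs (the swift jobs mentioned above, for instance) from counting as hits even when their console log contains the same timeout line.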

Reviewed: https://review.openstack.org/62084
Committed: https://git.openstack.org/cgit/openstack-infra/elastic-recheck/commit/?id=f06073513af21a93fdbca39cb4bd7e2961541caa
Submitter: Jenkins
Branch: master

commit f06073513af21a93fdbca39cb4bd7e2961541caa
Author: Matt Riedemann <email address hidden>
Date: Fri Dec 13 12:32:57 2013 -0800

    Add grenade jobs to the bug 1258682 e-r query

    Related-Bug: #1258682

    Change-Id: Id9b6e81c40e5bd0dfd9d963eb3f9c9fa055fc100

Morgan Fainberg (mdrnstm) wrote :

This also seems to be a similar run: http://logs.openstack.org/27/56827/3/gate/gate-tempest-dsvm-full/75c3c1b/ Is it expected that a timeout will purge all of the log files, so that there are no screen-* logs?

Joe Gordon (jogo) wrote :

This bug is very, very generic and may cover multiple underlying bugs. We are working on making sure that logs are preserved in the event of a timeout.

Reviewed: https://review.openstack.org/62786
Committed: https://git.openstack.org/cgit/openstack-infra/devstack-gate/commit/?id=7a742e83963f3dabf63a221998375facfa4409bd
Submitter: Jenkins
Branch: master

commit 7a742e83963f3dabf63a221998375facfa4409bd
Author: Joe Gordon <email address hidden>
Date: Tue Dec 17 16:25:43 2013 -0800

    Preserve testr temp files

    If testr is killed before it finishes running, it saves its output in a
    temp file. If something times out this file may provide insight into
    what was making it slow, so preserve the temp file.

    Related-Bug: #1258682

    Change-Id: I5cd1bc2326998bf3a1c29cd3773bf583f04ef3d5
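
The commit above does not show how the files are preserved, so the following is only a minimal Python sketch of the general idea: copy leftover temp files into the directory that gets archived with the job logs. The `preserve_temp_files` helper, the `tmp*` pattern, and both directory arguments are hypothetical; the actual devstack-gate change may work differently.

```python
import glob
import os
import shutil


def preserve_temp_files(repo_dir, log_dir, pattern="tmp*"):
    """Copy leftover temp files from repo_dir into log_dir.

    Returns the base names of the files that were copied. The default
    pattern is an assumption about how the temp files are named.
    """
    os.makedirs(log_dir, exist_ok=True)
    copied = []
    for path in sorted(glob.glob(os.path.join(repo_dir, pattern))):
        if os.path.isfile(path):
            shutil.copy2(path, log_dir)
            copied.append(os.path.basename(path))
    return copied
```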

Reviewed: https://review.openstack.org/63364
Committed: https://git.openstack.org/cgit/openstack-infra/elastic-recheck/commit/?id=446b93491ad64112ed3a024c0f78df284aac6eba
Submitter: Jenkins
Branch: master

commit 446b93491ad64112ed3a024c0f78df284aac6eba
Author: Masayuki Igawa <email address hidden>
Date: Fri Dec 20 18:24:02 2013 +0900

    Add query for bug 1263032

    Logstash says this occurs 3 times in the last 7 days.

    Related-Bug: #1258682

    Change-Id: I24e8b597a8e5e0c15c935c90698221d43147cdc8

David Kranz (david-kranz) wrote :

I am seeing a similar kind of failure here http://logs.openstack.org/77/64077/2/gate/gate-tempest-dsvm-postgres-full/b254ac8/console.html

But the timeout error looks like:

./safe-devstack-vm-gate-wrap.sh: line 213: 2805 Killed timeout -s 9 ${DEVSTACK_GATE_TIMEOUT}m $BASE/new/devstack-gate/devstack-vm-gate.sh

Clark Boylan (cboylan) wrote :

That is the same failure, but new machinery was added to devstack-gate to timeout within the job itself so that the log data can be collected before the Jenkins timeout is hit (when you hit the Jenkins timeout all of the service logs are unavailable).
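
The inner-timeout idea described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the devstack-gate implementation (which is a shell script using `timeout`): `run_with_inner_timeout` and the `collect_logs` callback are hypothetical names, and `subprocess.run` only kills the direct child process when the timeout expires.

```python
import subprocess
import sys


def run_with_inner_timeout(cmd, inner_timeout_s, collect_logs):
    """Run cmd, kill it after inner_timeout_s seconds, and always collect logs.

    Returns the command's exit code, or None if it was killed on timeout.
    """
    try:
        result = subprocess.run(cmd, timeout=inner_timeout_s)
        return result.returncode
    except subprocess.TimeoutExpired:
        # The child has already been killed by subprocess.run at this point.
        return None
    finally:
        # Runs on success, failure, and timeout alike.
        collect_logs()


if __name__ == "__main__":
    # Demo: a child that sleeps longer than the inner timeout allows.
    slow_cmd = [sys.executable, "-c", "import time; time.sleep(5)"]
    code = run_with_inner_timeout(slow_cmd, 1, lambda: print("collecting logs"))
    print("exit code:", code)
```

For this to help, the inner timeout has to be shorter than the outer Jenkins limit by enough margin for log collection to finish.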

Bhuvan Arumugam (bhuvan) wrote :

It affects check-grenade-dsvm as well. This job fails due to a timeout.
http://logs.openstack.org/37/65337/3/check/check-grenade-dsvm/5c60ffc/console.html

The build timed out after 65 minutes.

Jeremy Stanley (fungi) on 2014-01-08
Changed in openstack-ci:
status: New → In Progress
importance: Undecided → High
milestone: none → icehouse
Jeremy Stanley (fungi) wrote :

There is work underway to have zuul push its git refs to the git.openstack.org farm, which we can scale out/up easily to cope with additional load. This should hopefully reduce any delays introduced by zuul being too heavily loaded during frequent gate resets to be able to also serve git repositories to job workers in a timely fashion.

Reviewed: https://review.openstack.org/65303
Committed: https://git.openstack.org/cgit/openstack-infra/elastic-recheck/commit/?id=a9377929df5f0bff3dd2c9667ee4f58ea95346b5
Submitter: Jenkins
Branch: master

commit a9377929df5f0bff3dd2c9667ee4f58ea95346b5
Author: Matt Riedemann <email address hidden>
Date: Tue Jan 7 08:17:40 2014 -0800

    Fix the e-r query for bug 1258682

    The query for bug 1258682 uses wildcards and that doesn't work since
    wildcard analysis is disabled by default in elasticsearch for
    performance reasons.

    Looking at the query again, filtering on build_name is probably not
    necessary and actually limits the number of hits where we see the bug.
    When removing the build_name filters, there are more hits and on
    different jobs, i.e. gate-rally-py26 and gate-heat-py27 start showing
    hits.

    Closes-Bug: #1266833
    Related-Bug: #1258682

    Change-Id: Ib9c3bd05592f40d1bea8f4428e8e8fb0776cdcce

Reviewed: https://review.openstack.org/67094
Committed: https://git.openstack.org/cgit/openstack-infra/elastic-recheck/commit/?id=16fd1efe8dd759c3bc6eac7dc4cfeb41411ffe4b
Submitter: Jenkins
Branch: master

commit 16fd1efe8dd759c3bc6eac7dc4cfeb41411ffe4b
Author: Alexis Lee <email address hidden>
Date: Thu Jan 16 10:12:05 2014 +0000

    Augment the e-r query for bug 1258682

    Work done to retain logs has changed the signature of this bug, see
    comments #26 + #27 on:
        https://bugs.launchpad.net/tempest/+bug/1258682

    Change-Id: Ia0813e64465ee59f351f0ee212681ab2fd256797
    Related-Bug: #1258682

Matt Riedemann (mriedem) wrote :

Hmm, this patch hit this bug in the gate but e-r didn't catch it, wondering if the latest change to the e-r query broke it?

http://logs.openstack.org/21/54521/11/gate/gate-grenade-dsvm/a6f185d/console.html

Matt Riedemann (mriedem) wrote :

This still shows up, but part of that is tripleo CI:

http://goo.gl/XRDrCt

And part of it is horizon/selenium bug 1317630.

Jeremy Stanley (fungi) on 2014-10-27
Changed in openstack-ci:
milestone: icehouse → kilo