test_server_cfn_init failed in gate-tempest-dsvm-neutron-heat-slow: AssertionError: Timed out waiting for 10.1.0.4 to become reachable

Bug #1374175 reported by Matt Riedemann
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Heat
Won't Fix
Undecided
Unassigned
tempest
Fix Released
Undecided
Steven Hardy

Bug Description

http://logs.openstack.org/02/111802/1/gate/gate-tempest-dsvm-neutron-heat-slow/5c70056/console.html

2014-09-25 18:18:25.615 | ==============================
2014-09-25 18:18:25.615 | Failed 1 tests - output below:
2014-09-25 18:18:25.616 | ==============================
2014-09-25 18:18:25.616 |
2014-09-25 18:18:25.616 | tempest.scenario.orchestration.test_server_cfn_init.CfnInitScenarioTest.test_server_cfn_init[compute,orchestration,slow]
2014-09-25 18:18:25.616 | ------------------------------------------------------------------------------------------------------------------------
2014-09-25 18:18:25.616 |
2014-09-25 18:18:25.616 | Captured traceback:
2014-09-25 18:18:25.616 | ~~~~~~~~~~~~~~~~~~~
2014-09-25 18:18:25.616 | Traceback (most recent call last):
2014-09-25 18:18:25.616 | File "tempest/test.py", line 142, in wrapper
2014-09-25 18:18:25.616 | return f(self, *func_args, **func_kwargs)
2014-09-25 18:18:25.616 | File "tempest/scenario/orchestration/test_server_cfn_init.py", line 134, in test_server_cfn_init
2014-09-25 18:18:25.617 | self.check_stack()
2014-09-25 18:18:25.617 | File "tempest/scenario/orchestration/test_server_cfn_init.py", line 89, in check_stack
2014-09-25 18:18:25.617 | "Timed out waiting for %s to become reachable" % server_ip)
2014-09-25 18:18:25.617 | File "/usr/lib/python2.7/unittest/case.py", line 408, in fail
2014-09-25 18:18:25.617 | raise self.failureException(msg)
2014-09-25 18:18:25.617 | AssertionError: Timed out waiting for 10.1.0.4 to become reachable

Looks like the stack resource uuid is 1a4acfcd-dc28-427c-88e3-35411b746509.

There is an error in the heat engine log but I'm not sure if it's related:

http://logs.openstack.org/02/111802/1/gate/gate-tempest-dsvm-neutron-heat-slow/5c70056/logs/screen-h-eng.txt.gz?level=TRACE#_2014-09-25_18_18_15_001

2014-09-25 18:18:15.001 2428 ERROR heat.engine.resource [req-b4056ec6-10b6-4262-92ea-9a829c2b269b None] DB error resource with id 5dfc5aa2-7e01-4ade-9906-ff0072f1dc67 not found

I'm not really seeing anything in the nova or neutron logs.

I'm not seeing anything great to fingerprint on for the failure either. We dump the server instance console output on failure, so I guess we could query on the post that does that:

message:"Request (CfnInitScenarioTest:test_server_cfn_init): 200 POST" AND tags:"console"

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiUmVxdWVzdCAoQ2ZuSW5pdFNjZW5hcmlvVGVzdDp0ZXN0X3NlcnZlcl9jZm5faW5pdCk6IDIwMCBQT1NUXCIgQU5EIHRhZ3M6XCJjb25zb2xlXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjE0MTE2ODEyMTgzODJ9

There are 209 hits in 7 days for that, check and gate, all failures, but they could be bumping up with bug 1374021.

It'd help if Tempest put the test_class.test_case names in the error message here:

"AssertionError: Timed out waiting for 10.1.0.4 to become reachable"

Revision history for this message
Matt Riedemann (mriedem) wrote :

Here is a tempest patch to add the test caller information to the timeout error message:

https://review.openstack.org/#/c/124198/

Revision history for this message
Adam Gandelman (gandelman-a) wrote :
Changed in heat:
status: New → Confirmed
Revision history for this message
Matt Riedemann (mriedem) wrote :

https://review.openstack.org/#/c/124198/ merged so we should be able to fingerprint this failure in elastic-recheck now.

Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Angus Salkeld (asalkeld) wrote :

Not just on icehouse : https://review.openstack.org/#/c/124481/1
I'll remove the icehouse/stable from the bug title.

summary: - test_server_cfn_init failed in stable/icehouse gate-tempest-dsvm-
- neutron-heat-slow: AssertionError: Timed out waiting for 10.1.0.4 to
- become reachable
+ test_server_cfn_init failed in gate-tempest-dsvm-neutron-heat-slow:
+ AssertionError: Timed out waiting for 10.1.0.4 to become reachable
Changed in heat:
importance: Undecided → Medium
Angus Salkeld (asalkeld)
Changed in heat:
importance: Medium → High
Revision history for this message
Angus Salkeld (asalkeld) wrote :
Revision history for this message
Sean Dague (sdague) wrote :

This is now bouncing things out left right and center from the gate. This needs to be the heat team's top priority.

Changed in heat:
importance: High → Critical
milestone: none → juno-rc1
Revision history for this message
Angus Salkeld (asalkeld) wrote :

I have run this tempest test locally (in a vagrant box) ~ dozen times so far one failure.

Ran 1 tests in 374.291s (+5.615s)
Ran 1 tests in 398.951s (+25.890s)
Ran 1 tests in 394.371s (-2.616s)
Ran 1 tests in 389.545s (-2.249s)
Ran 1 tests in 415.649s (+28.306s)
Ran 1 tests in 412.924s (-0.703s)

The failure I had was different tho':-(
  File "tempest/scenario/orchestration/test_server_cfn_init.py", line 134, in test_server_cfn_init
    self.check_stack()
  File "tempest/scenario/orchestration/test_server_cfn_init.py", line 93, in check_stack
    sid, 'WaitCondition', 'CREATE_COMPLETE')
  File "tempest/services/orchestration/json/orchestration_client.py", line 192, in wait_for_resource_status
    time.sleep(self.build_interval)
  File "/usr/local/lib/python2.7/dist-packages/fixtures/_fixtures/timeout.py", line 52, in signal_handler
    raise TimeoutException()

I am not super familiar with all the timeouts, but it might simply be that one of the timeout settings is a bit close and
the test is a bit variable in duration.

Revision history for this message
Thierry Carrez (ttx) wrote :

RC1 is tagged already, moving to juno-rc-potential

Changed in heat:
milestone: juno-rc1 → none
tags: added: juno-rc-potential
Revision history for this message
Steven Hardy (shardy) wrote :

There are quite a few (non Heat) reasons why this could be happening:

1. Network fault meaning we can't connect to the instance
2. Instance is booting but slowly, so the timeout expires before the WaitCondition completes.

All the test does is boot an instance (via heat), then fail when trying to ping it due to exceeding
CONF.compute.ping_timeout (default is 120s).

It's possible 120s is not enough time to boot a Fedora image and/or some new (slower) compute nodes were added to the gate infra recently?

Revision history for this message
Steven Hardy (shardy) wrote :

Posted https://review.openstack.org/#/c/125606/ which aligns the ping timeout with the (much longer) stack timeout, lets see if that resolves the issue.

Revision history for this message
Steven Hardy (shardy) wrote :

I'm of the opinion this is not a Heat bug, but I'll assign it to myself and mark incomplete for now, until we can prove or disprove that

Changed in heat:
status: Confirmed → Incomplete
assignee: nobody → Steven Hardy (shardy)
importance: Critical → Undecided
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tempest (master)

Reviewed: https://review.openstack.org/125545
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=ff4a97e49e6db8a5a7c71f49fdeb8c2fa4d87d82
Submitter: Jenkins
Branch: master

commit ff4a97e49e6db8a5a7c71f49fdeb8c2fa4d87d82
Author: Sean Dague <email address hidden>
Date: Wed Oct 1 22:57:07 2014 -0400

    skip cfn_init test, it is failing too much

    And there is no current progress on the bug.

    Change-Id: I2604805d06a87973b8c30e56457fcd3648a666f4
    Related-Bug: #1374175

Revision history for this message
Zane Bitter (zaneb) wrote :

The error message mentioned occurs during a different test, so it's not related.

This test works by waiting for the server resource to move to CREATE_COMPLETE (this part seems to be working), then starts pinging the server with a timeout of CONF.compute.ping_timeout (120s by default). That's the part that's failing - Tempest can't ping the server within 2 minutes of Nova reporting that it's running.

As Steve said, that could be a connectivity problem or a simple case of booting the server taking slightly too long (or it could be a problem with the image being booted). It doesn't seem like a problem that Heat itself could cause, since it's not involved here beyond reporting success from Nova.

Steven Hardy (shardy)
Changed in tempest:
status: New → In Progress
assignee: nobody → Steven Hardy (shardy)
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

This may be just another way for bug 1323658 to show up. The same connectivity issue after boot/reboot/resize as tracked there. I suspect Neutron to be involved in those failures, but we can't be sure in this particular case, because it seems we run Heat with Neutron only.

Revision history for this message
Steven Hardy (shardy) wrote :

> This may be just another way for bug 1323658 to show up

Is the title on that bug accurate? If so I don't see how it's related, as we don't do a resize or restart in this case, just boot a VM.

In any case, I do think heat was just the messenger in this case. No reviews to my timeout patch yet, and I'm not sure how we confirm if it works now the test has been kicked out of the gate.

Revision history for this message
Steve Baker (steve-stevebaker) wrote :

Likely this issue could be reproduced in a check-tempest-dsvm-neutron-full test which boots Fedora-x86_64-20-20140618-sda directly from nova

This could be a situation where only testing with cirros is leading to particular race bugs not being seen.

Revision history for this message
Steve Baker (steve-stevebaker) wrote :

... or its just a too low ping timeout, as shardy's change fixes.

Zane Bitter (zaneb)
tags: removed: juno-rc-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tempest (master)

Reviewed: https://review.openstack.org/125606
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=da2a83562b297c019b9325023577196ec992e507
Submitter: Jenkins
Branch: master

commit da2a83562b297c019b9325023577196ec992e507
Author: Steven Hardy <email address hidden>
Date: Thu Oct 2 12:52:20 2014 +0100

    Orchestration use stack timeout waiting for ping response

    The test_server_cfn_init scenario test boots a full Fedora image,
    not the cirros image which is presumably used for compute tests
    where the default 120s ping timeout may be appropriate.

    To ensure we don't encounter any spurious failures due to booting the
    much bigger orchestration image (which is Fedora in the gate), use
    the stack timeout (default 900s) to override the ping timeout for
    this test.

    Change-Id: I4eba18333bdaa0ef41122cda85fd3f4d91b2cf5c
    Closes-Bug: #1374175

Changed in tempest:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tempest (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/135079

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tempest (master)

Reviewed: https://review.openstack.org/135079
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=0070808d69300348f621f0c7301e0b5be6f2c30d
Submitter: Jenkins
Branch: master

commit 0070808d69300348f621f0c7301e0b5be6f2c30d
Author: Sean Dague <email address hidden>
Date: Mon Nov 17 15:58:08 2014 -0500

    skip test_server_cfn_init as the failure rate is too high

    Change-Id: I9a102bdd06fc8d3eb94fc3437c1cab3cfc121b88
    Related-Bug: #1374175

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tempest (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/152224

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tempest (master)

Change abandoned by Matthew Treinish (<email address hidden>) on branch: master
Review: https://review.openstack.org/152224

Steven Hardy (shardy)
Changed in heat:
assignee: Steven Hardy (shardy) → nobody
Thomas Herve (therve)
Changed in heat:
status: Incomplete → Won't Fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.