*tempest-dsvm-neutron-heat-slow fails with WaitConditionTimeout

Bug #1297560 reported by Matt Riedemann on 2014-03-26
26
This bug affects 6 people
Affects Status Importance Assigned to Milestone
tempest
Undecided
Attila Fazekas

Bug Description

There are actually two test failures here:

http://logs.openstack.org/10/82410/4/check/check-tempest-dsvm-neutron-heat-slow/094a340/console.html

And I see one stack trace in the h-engine log:

http://logs.openstack.org/10/82410/4/check/check-tempest-dsvm-neutron-heat-slow/094a340/logs/screen-h-eng.txt.gz?level=TRACE#_2014-03-25_14_28_27_890

2014-03-25 14:28:27.890 22679 ERROR heat.engine.resource [-] CREATE : WaitCondition "WaitCondition" [WaitHandle] Stack "heat-1205857075" [ae6bfe9c-5c98-4830-b88f-a66f34e2575a]
2014-03-25 14:28:27.890 22679 TRACE heat.engine.resource Traceback (most recent call last):
2014-03-25 14:28:27.890 22679 TRACE heat.engine.resource File "/opt/stack/new/heat/heat/engine/resource.py", line 420, in _do_action
2014-03-25 14:28:27.890 22679 TRACE heat.engine.resource while not check(handle_data):
2014-03-25 14:28:27.890 22679 TRACE heat.engine.resource File "/opt/stack/new/heat/heat/engine/resources/wait_condition.py", line 252, in check_create_complete
2014-03-25 14:28:27.890 22679 TRACE heat.engine.resource return runner.step()
2014-03-25 14:28:27.890 22679 TRACE heat.engine.resource File "/opt/stack/new/heat/heat/engine/scheduler.py", line 179, in step
2014-03-25 14:28:27.890 22679 TRACE heat.engine.resource self._runner.throw(self._timeout)
2014-03-25 14:28:27.890 22679 TRACE heat.engine.resource File "/opt/stack/new/heat/heat/engine/resources/wait_condition.py", line 227, in _wait
2014-03-25 14:28:27.890 22679 TRACE heat.engine.resource raise timeout
2014-03-25 14:28:27.890 22679 TRACE heat.engine.resource WaitConditionTimeout: 0 of 1 received
2014-03-25 14:28:27.890 22679 TRACE heat.engine.resource

There are 19 hits in the last 7 days, both check and gate queues:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiaGVhdC5lbmdpbmUucmVzb3VyY2UgV2FpdENvbmRpdGlvblRpbWVvdXQ6IDAgb2YgMSByZWNlaXZlZFwiIEFORCBmaWxlbmFtZTpsb2dzKnNjcmVlbi1oLWVuZy50eHQiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzOTU3OTQ2NTg2NTN9

Matt Riedemann (mriedem) wrote :
Sean Dague (sdague) wrote :

This appears that 600s is not an adequate timeout for these tests, it needs to be increased.

Steve Baker (steve-stevebaker) wrote :

I would recommend raising tempest.conf [orchestration] build_timeout to 1200

1200 is also what tempest tox.ini heat-slow OS_TEST_TIMEOUT is already set to.

Reviewed: https://review.openstack.org/87691
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=a2dfd49a3ed6c893c2bc29771c3751adc31a92c3
Submitter: Jenkins
Branch: master

commit a2dfd49a3ed6c893c2bc29771c3751adc31a92c3
Author: Matthew Treinish <email address hidden>
Date: Tue Apr 15 11:15:34 2014 -0400

    Up the default timeout for stack builds

    This commit ups the default stack build timeout from 600 to 1200.
    The stack frequently can take longer than 600 secs to build so
    doubling it should give us a safe margin.

    Related-Bug: #1297560

    Change-Id: I3b13a225793d42e2fa9f4871b2fe81c745954c56

Matt Riedemann (mriedem) wrote :

Looks like this also needs to be changed for the heat-slow job:

https://github.com/openstack/tempest/blob/master/tempest/api/orchestration/stacks/templates/cfn_init_signal.yaml#L71

That's still doing a timeout of 600 seconds in the template and that's how long heat is waiting for the instance to be setup, which is what's timing out.

Steven Hardy (shardy) wrote :

Patch which converts the template timeout to respect orchestration.build_timeout here:

https://review.openstack.org/#/c/87993/

(similar fixes required elsewhere, in-progress)

Changed in heat:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Steve Baker (steve-stevebaker)
Changed in tempest:
assignee: nobody → Steve Baker (steve-stevebaker)

Reviewed: https://review.openstack.org/87993
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=0d00dd8be166c99097d7eb67200eb2a4ab4dc9a2
Submitter: Jenkins
Branch: master

commit 0d00dd8be166c99097d7eb67200eb2a4ab4dc9a2
Author: Steven Hardy <email address hidden>
Date: Wed Apr 16 16:09:58 2014 +0100

    orchestration convert API test Timeouts to respect build_timeout

    Some tests have a hard-coded WaitCondition timeout, which would be better
    controlled from the global timeout set in tempest.conf

    We should ensure future tests follow this convention where Timeout values
    are specified in the template.

    Change-Id: Ie4a9048771e3a106e1d9c6b69f244461f6d5fd8d
    Partial-Bug: #1297560

Steve Baker (steve-stevebaker) wrote :

There is a failure which occurs before the WaitConditionTimeout

See the end of this boot log:
http://logs.openstack.org/37/88137/3/check/check-tempest-dsvm-neutron-heat-slow-icehouse/2854094/logs/tempest.txt.gz?level=DEBUG#_2014-04-22_00_49_19_249

versus this successful boot log:
http://logs.openstack.org/37/88137/3/check/check-tempest-dsvm-neutron-heat-slow/c564185/logs/tempest.txt.gz?level=DEBUG#_2014-04-22_00_38_53_550

some things are apparent in the failed boot logs:
* The boot timestamps are much bigger, around 400s instead of 200s
* cloud-init fails to set the host name, ie [WARNING]: Failed to set the hostname to cf-itscenariotest--384965229-smokeserver-2hxz75x2ybpz.novalocal
* The Login service fails to start on multiple attempts
* cloud-config doesn't execute due to dependency failures
[[1;33mDEPEND[0m] Dependency failed for Cloud-config availability.
[[1;33mDEPEND[0m] Dependency failed for Apply the settings specified in cloud-config.
[[1;33mDEPEND[0m] Dependency failed for Execute cloud user/final scripts.

This is much closer to the root cause, although I still have no idea what is causing the slow boots and timed-out login service.

Some useful logstash queries:
message:"Failed to set the hostname to" AND NOT message:"RESP BODY" AND filename:console.html
message:"Dependency failed for Execute cloud user/final scripts." AND NOT message:"RESP BODY" AND filename:console.html

Suggestions welcome on where to go next with this.

Steve Baker (steve-stevebaker) wrote :

This could be caused by the excessively frequent default of [orchestration]build_interval=1 causing enough load to bring down nova boot times.

I would suggest a better default would be 10s

Reviewed: https://review.openstack.org/88137
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=22c1660072d0c6a17b16136a31a7e90204fcc605
Submitter: Jenkins
Branch: master

commit 22c1660072d0c6a17b16136a31a7e90204fcc605
Author: Steve Baker <email address hidden>
Date: Mon May 5 13:34:19 2014 +1200

    Make test_server_cfn_init a scenario test

    Converting test_server_cfn_init to a scenario test is long overdue
    since it is more of a scenario than an exercising of the heat API.

    The heat-slow job has gone non-voting because this test fails
    ~%50 of the time due to bug #1297560. The server boot log is now
    logged regardless of success or failure to build up data to diagnose
    the issue.

    This also adds several convenience functions to the test base class.

    Co-Authored-By: Steven Hardy <email address hidden>
    Related-Bug: #1297560
    Change-Id: I077aeaf2bf8b292699eb20c5a75c59df35645913

vishal yadav (vishalcdac07) wrote :

This bug re-appeared and causing gate-failure.

http://logs.openstack.org/52/97452/1/check/check-tempest-dsvm-neutron-heat-slow/e0f795c/console.html

2 tests failed due to this:
====================================================================================================
2014-06-03 13:38:24.030 | {2} setUpClass (tempest.api.orchestration.stacks.test_neutron_resources.NeutronResourcesTestJSON) [0.000000s] ... FAILED
2014-06-03 13:38:24.031 |
2014-06-03 13:38:24.031 | Captured traceback:
2014-06-03 13:38:24.031 | ~~~~~~~~~~~~~~~~~~~
2014-06-03 13:38:24.031 | Traceback (most recent call last):
2014-06-03 13:38:24.031 | File "tempest/test.py", line 76, in decorator
2014-06-03 13:38:24.031 | f(cls)
2014-06-03 13:38:24.031 | File "tempest/api/orchestration/stacks/test_neutron_resources.py", line 74, in setUpClass
2014-06-03 13:38:24.031 | raise e
2014-06-03 13:38:24.031 | TimeoutException: Request timed out
2014-06-03 13:38:24.031 | Details: Request timed out
2014-06-03 13:38:24.031 | Details: Stack heat-773615493 failed to reach CREATE_COMPLETE status within the required time (900 s).
====================================================================================================

and

====================================================================================================
2014-06-03 13:38:51.440 | {1} tempest.scenario.orchestration.test_server_cfn_init.CfnInitScenarioTest.test_server_cfn_init [933.619359s] ... FAILED
2014-06-03 13:38:51.440 |
2014-06-03 13:38:51.440 | Captured traceback:
2014-06-03 13:38:51.440 | ~~~~~~~~~~~~~~~~~~~
2014-06-03 13:38:51.440 | Traceback (most recent call last):
2014-06-03 13:38:51.440 | File "tempest/test.py", line 126, in wrapper
2014-06-03 13:38:51.440 | return f(self, *func_args, **func_kwargs)
2014-06-03 13:38:51.440 | File "tempest/scenario/orchestration/test_server_cfn_init.py", line 130, in test_server_cfn_init
2014-06-03 13:38:51.440 | self.check_stack()
2014-06-03 13:38:51.440 | File "tempest/scenario/orchestration/test_server_cfn_init.py", line 92, in check_stack
2014-06-03 13:38:51.440 | raise e
2014-06-03 13:38:51.440 | StackResourceBuildErrorException: Resource WaitCondition in stack CfnInitScenarioTest--2115605652/f718c337-05bd-47ef-9910-4dedb959362c is in CREATE_FAILED status due to 'WaitConditionTimeout: 0 of 1 received'
====================================================================================================

Attila Fazekas (afazekas) wrote :

message:"Return code of 7 after executing" AND NOT message: "RESP BODY" AND tags:"console"
http://logstash.openstack.org/#eyJmaWVsZHMiOltdLCJzZWFyY2giOiJtZXNzYWdlOlwiUmV0dXJuIGNvZGUgb2YgNyBhZnRlciBleGVjdXRpbmdcIiBBTkQgTk9UIG1lc3NhZ2U6IFwiUkVTUCBCT0RZXCIgQU5EIHRhZ3M6XCJjb25zb2xlXCIiLCJ0aW1lZnJhbWUiOiI4NjQwMCIsImdyYXBobW9kZSI6ImNvdW50Iiwib2Zmc2V0IjowLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTQwMjQ3MjYwODgyOX0=

Provision failed with exit code 7

The above looks like the leading failure.

TODO: Check Is the l3 network ready at the cfn-signal time.

NOTE: The image switched for 32bit to 64 bit long time ago, but it does not causes slow down. (at least not this much).
NOTE: There are rare other type of boot up failures for example https://bugs.launchpad.net/cirros/+bug/1312199 , but it is very rare.

Attila Fazekas (afazekas) wrote :

http://logs.openstack.org/76/71476/16/check/check-tempest-dsvm-neutron-heat-slow-icehouse/d066094/logs/screen-q-vpn.txt.gz

Normally two reuters needs to be created in heat job.
But only one created by the q-svc.

e6da91c3-01b0-4963-9359-96a3cdf4b824 not found.

Steve Baker (steve-stevebaker) wrote :

The root cause of this bug used to be fedora slow boot leading to logind start timeout. I haven't checked the logs yet to see if there are still logind timeouts, or if cfn-signal connectivity has taken over as the root cause.

Attila Fazekas (afazekas) wrote :

It cannot be ignored, because actually two test cases failed.
1. uses the the router created by devstack (in this case the extra router does not matters) (fail reason unknown)
2. A neutron related test case where the extra router meters. (fail reason can be the missing router)

The q-vpn logs just contains info about delete attempts, but the creation never happens in the agent perspective, however
the q-svc knows about 2 router creation.

If the router does not created with admin_state_up=down, it is expected to be created soon and it needs to be visible in the q-vpn logs.

(It is late here, I will check the logs again soon.)

Attila Fazekas (afazekas) wrote :

You was right, the second router just created and not used (admin_state_up: false).
Actually the router1 (created by devstack) used by both stack, and both stack has a WaitCondition which needs to be signaled.
The cfn-signal commands needs to able to connect to the heat-cfn-api at the time when it is executed, now it is tried only once in the instance lifetime.

According to the curl manual the 7 exist code means:
"7 Failed to connect to host."

Attila Fazekas (afazekas) wrote :

The net 10.0.3.0 used by both the host and by test_neutron_resources.

test_neutron_resources must use a different range.

no longer affects: neutron
Changed in tempest:
assignee: Steve Baker (steve-stevebaker) → Attila Fazekas (afazekas)

Fix proposed to branch: master
Review: https://review.openstack.org/99694

Changed in tempest:
status: New → In Progress
Steve Baker (steve-stevebaker) wrote :

Last time I calculated the failure rate for this bug it was failing ~20% of the time. Is that likely the same proportion that host gw was 10.0.3.1?

Reviewed: https://review.openstack.org/99694
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=640392b92866bc541f470bba31fc23c9c9bece1c
Submitter: Jenkins
Branch: master

commit 640392b92866bc541f470bba31fc23c9c9bece1c
Author: Attila Fazekas <email address hidden>
Date: Thu Jun 12 15:58:10 2014 +0200

    Heat Overlapping ip issue

    The hard coded cidr used by heat neutron scenario,
    was overlapping with the host systems cidr.
    Because the neutron scenario does not used his own router,
    it also caused issue for the cfn init scenario.

    * The neutron scenario will use the first configured cidr
    * The neutron scenario will use his own router
    * The neutron scenario will retry singaling the heat-cfn-api,
      because if neutron is overload and the vm boot is fast and we
      do not wait even for the meta data api service connectivity (enabled
      config drive), the signal could be sent in a wrong time when
      the L3 connectivity is not ready, but the L2 is.
    * Let neutron decide the allocation_pools (simplification)
    * Using comma_delimited_list for passing not hard coded dns servers
      to the stack

    Change-Id: I06bd197b0f6c012a1416016a40f29ddd080b21b9
    Partial-Bug: #1297560

Steve Baker (steve-stevebaker) wrote :
Download full text (3.3 KiB)

Since the overlapping IP issue has been fixed, cloud-init set_hostname failures have been the root cause of this bug. From the fedora boot log:

2014-06-17 20:24:39.823 | localhost login: [ 134.177677] input: PC Speaker as /devices/platform/pcspkr/input/input3
2014-06-17 20:24:39.823 | [ 134.943002] kvm: Nested Virtualization enabled
2014-06-17 20:24:39.823 | [ 177.146896] cloud-init[248]: Cloud-init v. 0.7.2 running 'init-local' at Tue, 17 Jun 2014 20:12:32 +0000. Up 170.43 seconds.
2014-06-17 20:24:39.823 | [ 244.675684] cloud-init[506]: Cloud-init v. 0.7.2 running 'init' at Tue, 17 Jun 2014 20:13:41 +0000. Up 239.47 seconds.
2014-06-17 20:24:39.823 | [ 248.861947] cloud-init[506]: ci-info: ++++++++++++++++Net device info+++++++++++++++++
2014-06-17 20:24:39.824 | [ 248.932690] cloud-init[506]: ci-info: +--------+-------+---------+------+------------+
2014-06-17 20:24:39.824 | [ 248.945527] cloud-init[506]: ci-info: | Device | Up | Address | Mask | Hw-Address |
2014-06-17 20:24:39.824 | [ 248.956994] cloud-init[506]: ci-info: +--------+-------+---------+------+------------+
2014-06-17 20:24:39.824 | [ 248.969013] cloud-init[506]: ci-info: | lo: | False | . | . | . |
2014-06-17 20:24:39.824 | [ 249.078113] cloud-init[506]: ci-info: | eth0: | False | . | . | . |
2014-06-17 20:24:39.824 | [ 249.228903] cloud-init[506]: ci-info: +--------+-------+---------+------+------------+
2014-06-17 20:24:39.824 | [ 249.241788] cloud-init[506]: ci-info: ++++++++++++++++++++++++++++++Route info++++++++++++++++++++++++++++++
2014-06-17 20:24:39.824 | [ 249.255404] cloud-init[506]: ci-info: +-------+-------------+----------+---------------+-----------+-------+
2014-06-17 20:24:39.824 | [ 249.267934] cloud-init[506]: ci-info: | Route | Destination | Gateway | Genmask | Interface | Flags |
2014-06-17 20:24:39.824 | [ 249.433215] cloud-init[506]: ci-info: +-------+-------------+----------+---------------+-----------+-------+
2014-06-17 20:24:39.824 | [ 249.529811] cloud-init[506]: ci-info: | 0 | 0.0.0.0 | 10.1.0.1 | 0.0.0.0 | eth0 | UG |
2014-06-17 20:24:39.825 | [ 249.545860] cloud-init[506]: ci-info: | 1 | 10.1.0.0 | 0.0.0.0 | 255.255.240.0 | eth0 | U |
2014-06-17 20:24:39.825 | [ 249.560586] cloud-init[506]: ci-info: +-------+-------------+----------+---------------+-----------+-------+
2014-06-17 20:24:39.825 | [ 303.431901] cloud-init[506]: 2014-06-17 20:14:44,484 - cloud-init[WARNING]: Stdout, stderr changing to (| tee -a /var/log/cloud-init-output.log, | tee -a /var/log/cloud-init-output.log)
2014-06-17 20:24:39.825 | [ 317.560994] EXT4-fs (vda1): resizing filesystem from 500000 to 524039 blocks
2014-06-17 20:24:39.825 | [ 317.628958] EXT4-fs (vda1): resized filesystem to 524039
2014-06-17 20:24:39.825 | [ 345.789486] cloud-init[506]: 2014-06-17 20:15:27,641 - util.py[WARNING]: Failed to set the hostname to cf-itscenariotest--561802081-smokeserver-gqs3uyqsmmra.novalocal (cf-itscenariotest--561802081-smokeserver-gqs3uyqsmmra)
2014-06-17 20:24:39.825 | [ 34...

Read more...

Change abandoned by afazekas (<email address hidden>) on branch: master
Review: https://review.openstack.org/99392
Reason: now it is working

Steve Baker (steve-stevebaker) wrote :

I think this bug can be marked as Fix Committed

no longer affects: heat
Changed in tempest:
status: In Progress → Fix Released
Attila Fazekas (afazekas) wrote :

Ok, I'll close this one.

Not seen in the gate pipeline in the past 7 day.

The logstash query on check pipeline picks up wrong patches as well.

Please create new bug, if you see similar issue.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers