VolumeBackupRestoreIntegrationTest WaitConditionFailure: Test Failed

Bug #1382300 reported by Steve Baker
This bug affects 2 people
Affects: OpenStack Heat
Status: Triaged
Importance: Medium
Assigned to: Steven Hardy

Bug Description

check-heat-dsvm-functional has failed 42 times in the last 24 hours with this error

http://logs.openstack.org/65/128365/8/check/check-heat-dsvm-functional/0435765/console.html

2014-10-16 19:06:57.981 | 2014-10-16 19:06:57.963 | FAIL: heat_integrationtests.scenario.test_volumes.VolumeBackupRestoreIntegrationTest.test_cinder_volume_create_backup_restore
2014-10-16 19:06:57.982 | 2014-10-16 19:06:57.965 | tags: worker-0
2014-10-16 19:06:57.984 | 2014-10-16 19:06:57.966 | ----------------------------------------------------------------------
2014-10-16 19:06:57.988 | 2014-10-16 19:06:57.968 | Traceback (most recent call last):
2014-10-16 19:06:57.990 | 2014-10-16 19:06:57.972 | File "heat_integrationtests/scenario/test_volumes.py", line 119, in test_cinder_volume_create_backup_restore
2014-10-16 19:06:57.991 | 2014-10-16 19:06:57.974 | add_parameters={'backup_id': backup.id})
2014-10-16 19:06:57.996 | 2014-10-16 19:06:57.975 | File "heat_integrationtests/scenario/test_volumes.py", line 75, in _create_stack
2014-10-16 19:06:57.997 | 2014-10-16 19:06:57.979 | self._wait_for_stack_status(stack_identifier, 'CREATE_COMPLETE')
2014-10-16 19:06:57.999 | 2014-10-16 19:06:57.981 | File "heat_integrationtests/common/test.py", line 291, in _wait_for_stack_status
2014-10-16 19:06:58.001 | 2014-10-16 19:06:57.983 | stack_status_reason=stack.stack_status_reason)
2014-10-16 19:06:58.002 | 2014-10-16 19:06:57.984 | StackBuildErrorException: Stack VolumeBackupRestoreIntegrationTest-1720252003/39ff0250-a5d4-4ded-a99c-463fc6886596 is in CREATE_FAILED status due to 'Resource CREATE failed: WaitConditionFailure: Test Failed'

logstash.openstack.org query:
  message:"StackBuildErrorException: Stack VolumeBackupRestoreIntegrationTest"

This is likely a nova/cinder/swift interaction issue. We may need to consider skipping this part of the test so that check-heat-dsvm-functional can be made voting.

Steven Hardy (shardy) wrote :

Hmm, so a little history: when this was proposed to tempest, it worked fine for a while. Then something changed in cinder such that, for some reason, the volume attachment to the instance stopped working, in which case the WaitCondition posts failure back to heat because it can't find the expected block device.

I was never able to reproduce this locally; it has always worked fine, so as you say it may be some gate-specific cinder->nova interaction.

As you say it's probably the backup restore part where Swift, Cinder and Nova all have to play nice together otherwise we'll fail. IIRC this scenario is not tested anywhere else so possibly we're just the messenger here for bugginess in other services.

Happy to skip this if it gets our job voting, but I would like to bottom out the reason why this has turned flaky since I first wrote it.

I was discussing with dkranz a "soft failure" mode, where rather than skipping a test, we run it, collect stats about its pass/fail and pass even if it fails. E.g. a SKIP_FAIL assertion on test failure based on some decorator or something - is this something we could implement for our in-tree tests?
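For illustration only, here is a minimal sketch of what such a "soft failure" decorator might look like. The names `soft_fail` and `ExampleTest` and the `SOFT_FAIL` log marker are hypothetical and not part of heat_integrationtests; the sketch assumes plain `unittest` semantics, logs a greppable warning, and reports the failure as a skip so the job result is unaffected.

```python
import functools
import logging
import unittest

LOG = logging.getLogger(__name__)


def soft_fail(bug):
    """Hypothetical decorator: run the test, but report a failure as a skip."""
    def decorator(test_func):
        @functools.wraps(test_func)
        def wrapper(self, *args, **kwargs):
            try:
                return test_func(self, *args, **kwargs)
            except unittest.SkipTest:
                # A genuine skip is passed through untouched.
                raise
            except Exception as exc:
                # Stable marker so the failure rate can still be counted
                # from the logs even though the test no longer fails.
                LOG.warning('SOFT_FAIL bug=%s test=%s error=%s',
                            bug, self.id(), exc)
                self.skipTest('soft failure, see bug %s: %s' % (bug, exc))
        return wrapper
    return decorator


class ExampleTest(unittest.TestCase):
    """Illustrative test case, not taken from the heat tree."""

    @soft_fail(bug='1382300')
    def test_flaky(self):
        self.fail('WaitConditionFailure: Test Failed')
```

Run under a normal test runner, `test_flaky` would show up as skipped rather than failed, while the warning line remains available for log searches.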

Steve Baker (steve-stevebaker) wrote : Re: [Bug 1382300] Re: VolumeBackupRestoreIntegrationTest WaitConditionFailure: Test Failed

On 17/10/14 22:45, Steven Hardy wrote:
> As you say it's probably the backup restore part where Swift, Cinder and
> Nova all have to play nice together otherwise we'll fail. IIRC this
> scenario is not tested anywhere else so possibly we're just the
> messenger here for bugginess in other services.
>
> Happy to skip this if it gets our job voting, but I would like to bottom
> out the reason why this has turned flaky since I first wrote it.
I tried all day yesterday to replicate it locally and couldn't. My only
thought would be to try raising rescan_timeout to higher than 120s.
> I was discussing with dkranz about a "soft failure" mode, where rather
> than skipping a test, we run it, collect stats about its pass/fail and
> pass even if it fails. E.g. a SKIP_FAIL assertion on test failure based
> on some decorator or something - is this something we could implement
> for our in-tree tests?
>
We could log a warning instead of failing, and ask qa/infra if we could set
up elastic-recheck queries to monitor for that message. It seems like
that would be the best of both worlds: tests won't have unrelated
failures, and we still gather the stats on the error rates.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/129746

OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (master)

Reviewed: https://review.openstack.org/129746
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=d16a32886bd6f36cad5cc45f151a2fdb6a51248c
Submitter: Jenkins
Branch: master

commit d16a32886bd6f36cad5cc45f151a2fdb6a51248c
Author: Steve Baker <email address hidden>
Date: Tue Oct 21 11:08:50 2014 +1300

    Halt test_cinder_volume_create_backup_restore on error

    test_cinder_volume_create_backup_restore fails frequently due to issues
    outside heat. Instead of failing, this change halts the test when the
    error condition is triggered. This will allow some of the test to run
    while making the check job voting.

    Once logging is configured in heat_integrationtests an elastic-recheck
    search can monitor for the frequency of this error to aid
    nova/cinder/swift developers to fix it.

    Change-Id: I09722ad725a8d23fb2028c17b0dd9fcab3957649
    Related-Bug: #1382300

Changed in heat:
status: Triaged → In Progress
Steven Hardy (shardy) wrote :

stevebaker: you mentioned increasing the timeout recently, and I hadn't considered increasing the rescan_timeout. The test appears to be failing while waiting for the volume to attach, so this is actually worth a try. I was thinking of the WaitCondition timeout, which has already been increased to the build_timeout which is 1200 seconds.
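To make the two timeouts concrete, here is a rough sketch of the kind of polling loop a guest-side check could use before signalling the WaitCondition. This is not the code from the test template: `wait_for_device` and the `/dev/vdb` path in the usage note are hypothetical, and only the 120 second rescan_timeout default comes from this thread.

```python
import os
import time


def wait_for_device(path, rescan_timeout=120, interval=1.0,
                    exists=os.path.exists, sleep=time.sleep,
                    clock=time.monotonic):
    """Poll until `path` exists or `rescan_timeout` seconds elapse.

    Returns True if the device appeared in time, False otherwise.
    The exists/sleep/clock hooks are only there to make testing easy.
    """
    deadline = clock() + rescan_timeout
    while True:
        if exists(path):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller might then report `'SUCCESS' if wait_for_device('/dev/vdb') else 'FAILURE'` to the WaitCondition. If the device takes longer than rescan_timeout to appear, the failure is posted well before the 1200 second build_timeout is reached, which is why raising only the WaitCondition timeout would not help in that case.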

Patch posted:

https://review.openstack.org/#/c/134502/

OpenStack Infra (hudson-openstack) wrote : Change abandoned on heat (master)

Change abandoned by Steven Hardy (<email address hidden>) on branch: master
Review: https://review.openstack.org/134502
Reason: Abandoning in favour of skipping the test

OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/135347

OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/135347
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=f97f2e7644ddda874b1ac2e5f5cecd59923ac0ea
Submitter: Jenkins
Branch: master

commit f97f2e7644ddda874b1ac2e5f5cecd59923ac0ea
Author: Steven Hardy <email address hidden>
Date: Tue Nov 18 16:36:47 2014 +0000

    Functional tests skip volume_create_backup_restore

    This test is frequently failing and we don't yet understand why,
    so skip to avoid further delaying making the functional job voting.

    Change-Id: I62e0f70f1c27037f374bab8d15512bde1a2ce928
    Partial-Bug: #1382300

Changed in heat:
status: In Progress → Triaged
Rico Lin (rico-lin)
Changed in heat:
milestone: none → no-priority-tag-bugs