Create volume and boot instance from it failed on step server deletion

Bug #1532163 reported by Tatyanka
This bug affects 1 person
Affects               Status        Importance   Assigned to            Milestone
Mirantis OpenStack    In Progress   High         Dmitry Mescheryakov
  8.0.x               In Progress   High         Dmitry Mescheryakov
  9.x                 In Progress   High         Dmitry Mescheryakov

Bug Description

Destroy two controllers and check that the pacemaker status is correct

Scenario:
1. Destroy first controller
2. Check pacemaker status
3. Run OSTF
4. Revert environment
5. Destroy second controller
6. Check pacemaker status
7. Run OSTF

Actual Result:
OSTF failed on step 7:
Create volume and boot instance from it (failure)
The instance does not become active, so deletion starts and then fails by timeout:
fuel_health.test: DEBUG: Waiting for <Server: ost1_test-boot-volume-instance1099375625> to get to ACTIVE status. Currently in build status
fuel_health.test: DEBUG: Sleeping for 10 seconds
fuel_health.common.test_mixins: INFO: STEP:5, verify action: 'server deletion'
fuel_health.nmanager: DEBUG: Deleting server.
fuel_health.test: DEBUG: Sleeping for 10 seconds
fuel_health.test: DEBUG: Sleeping for 10 seconds
fuel_health.common.test_mixins: INFO: Timeout 30s exceeded for server deletion
fuel_health.common.test_mixins: DEBUG: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/fuel_health/common/test_mixins.py", line 177, in verify
    result = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/fuel_health/common/test_mixins.py", line 223, in __exit__
    raise AssertionError(msg)
AssertionError: Time limit exceeded while waiting for server deletion to finish.

So it looks like instance creation after destructive actions takes a little more time, so we may need to increase the timeout for instance creation.
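
For reference, a minimal sketch (not the actual fuel_health code; the helper name, defaults, and error handling are assumptions) of the kind of polling loop the log above reflects, with the timeout pulled out as a parameter so it can be raised for destructive scenarios:

# Sketch only: generic "wait for status with timeout" loop, mirroring the
# OSTF log pattern above (poll, sleep 10 seconds, give up after a deadline).
import time


def wait_for_status(get_status, expected='ACTIVE',
                    timeout=180, poll_interval=10):
    """Poll get_status() until it returns `expected` or `timeout` expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = get_status()
        if status == expected:
            return True
        if status == 'ERROR':
            raise AssertionError('Resource went into ERROR state')
        time.sleep(poll_interval)
    raise AssertionError('Time limit (%ss) exceeded while waiting for %s '
                         'status' % (timeout, expected))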

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "408"
  build_id: "408"
  fuel-nailgun_sha: "9ebbaa0473effafa5adee40270da96acf9c7d58a"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "df16d41cd7a9445cf82ad9fd8f0d53824711fcd8"
  fuel-nailgun-agent_sha: "92ebd5ade6fab60897761bfa084aefc320bff246"
  astute_sha: "c7ca63a49216744e0bfdfff5cb527556aad2e2a5"
  fuel-library_sha: "7ef751bdc0e4601310e85b8bf713a62ed4aee305"
  fuel-ostf_sha: "214e794835acc7aa0c1c5de936e93696a90bb57a"
  fuel-mirror_sha: "8bb8c70efc61bcf633e02d6054dbf5ec8dcf6699"
  fuelmenu_sha: "2a0def56276f0fc30fd949605eeefc43e5d7cc6c"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "62573cb2a8aa54845de9303b4a30935a90e1db61"

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Changed in fuel:
status: New → Confirmed
Egor Kotko (ykotko)
Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Egor Kotko (ykotko)
Revision history for this message
Egor Kotko (ykotko) wrote :

It seems the root of the issue is not only in timeouts. We expect a dictionary as the response, but instead got an error message:
ERROR: Gateway Time-out (HTTP 504)

Steps to reproduce:
1. Deploy a cluster: 3 controllers, 2 computes, 1 Cinder
2. Destroy (force shutoff) one controller
3. Create a Cinder volume

Expected:
The volume is created in ~2 min.
CLI response similar to:
http://paste.openstack.org/show/483768/
In the REST API - a dictionary.

Actual:
The response is an error message:
http://paste.openstack.org/show/483769/
but the volume was eventually created.
Approximate time of volume creation: ~5 min (a rough way to measure this is sketched below).

https://drive.google.com/file/d/0BzWDM1PONYEub0FpdXlMdzQ3S1U/view?usp=sharing
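
To make the timing above easy to re-check, here is a minimal sketch (not the OSTF test itself; the credentials and endpoint are placeholders) that creates a volume with python-cinderclient and measures how long it takes to reach 'available':

# Sketch only: create a volume and time how long it takes to become
# 'available'. Replace the placeholder credentials/endpoint before use.
import time

from cinderclient import client as cinder_client

cinder = cinder_client.Client('2', 'admin', 'admin_password',
                              'admin', 'http://<keystone-vip>:5000/v2.0')

start = time.time()
volume = cinder.volumes.create(size=1, name='ost1_test-volume')

while True:
    volume = cinder.volumes.get(volume.id)
    if volume.status in ('available', 'error'):
        break
    time.sleep(10)

print('Volume %s reached status %s in %.0f seconds'
      % (volume.id, volume.status, time.time() - start))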

Changed in fuel:
assignee: Egor Kotko (ykotko) → Fuel Library Team (fuel-library)
assignee: Fuel Library Team (fuel-library) → Fuel Python Team (fuel-python)
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

This is incorrect behaviour of cinder. Moving to mos-cinder team.

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → MOS Cinder (mos-cinder)
tags: added: area-cinder
Revision history for this message
Yuriy Nesenenko (ynesenenko) wrote :

It looks like the environment used is slow, judging by the volume creation time (the above-mentioned ~2-5 min), so I don't think this is incorrect behaviour of Cinder.

Changed in fuel:
status: Confirmed → Incomplete
assignee: MOS Cinder (mos-cinder) → nobody
Changed in fuel:
assignee: nobody → Yuriy Nesenenko (ynesenenko)
assignee: Yuriy Nesenenko (ynesenenko) → nobody
Changed in fuel:
assignee: nobody → Yuriy Nesenenko (ynesenenko)
assignee: Yuriy Nesenenko (ynesenenko) → nobody
assignee: nobody → MOS Cinder (mos-cinder)
Ivan Kolodyazhny (e0ne)
Changed in fuel:
assignee: MOS Cinder (mos-cinder) → Tatyanka (tatyana-leontovich)
Revision history for this message
Yuriy Nesenenko (ynesenenko) wrote :

I think the environment used is slow, judging by the volume creation time (the above-mentioned ~2-5 min). Please check it out on a faster environment. The expected volume creation time should be < 1 min.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

@Ivan - please use the snapshots and the described steps to reproduce; also ping mos-qa for help here.

Changed in fuel:
assignee: Tatyanka (tatyana-leontovich) → MOS Cinder (mos-cinder)
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Note: according to TestRail, the same issue happens on a baremetal environment https://mirantis.testrail.com/index.php?/tests/view/2465860&group_by=tests:status_id&group_order=asc&group_id=8 - so I do not think the problem is a slow BM environment; it appears only after destructive actions.

Changed in fuel:
status: Incomplete → Confirmed
Ivan Kolodyazhny (e0ne)
Changed in fuel:
assignee: MOS Cinder (mos-cinder) → Yuriy Nesenenko (ynesenenko)
no longer affects: fuel
no longer affects: fuel/8.0.x
Changed in mos:
assignee: nobody → Yuriy Nesenenko (ynesenenko)
status: New → Confirmed
importance: Undecided → High
milestone: none → 8.0
Revision history for this message
Egor Kotko (ykotko) wrote :

Reproduced again on the baremetal environment; sometimes the problem is not reproduced on the first attempt.
If it is not reproduced after destroying a controller and starting the test: start the destroyed controller, wait until it is online, destroy another controller, and then start the test again.

https://drive.google.com/file/d/0BzWDM1PONYEuM0pYOUNwTGt0aG8/view?usp=sharing

Changed in mos:
status: Confirmed → In Progress
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
summary: - Create volume and boot instance from it failed on step server deleteion
+ Create volume and boot instance from it failed on step server deletion
Ivan Kolodyazhny (e0ne)
Changed in mos:
assignee: Yuriy Nesenenko (ynesenenko) → Ivan Kolodyazhny (e0ne)
Revision history for this message
Ivan Kolodyazhny (e0ne) wrote :
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/cinder (openstack-ci/fuel-8.0/liberty)

Fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Ivan Kolodyazhny <email address hidden>
Review: https://review.fuel-infra.org/16539

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Do we have a stable repro? Also, why is RabbitMQ down?

The last thing I want to do here is introduce a last-minute fix, which can break other things. Maybe it makes sense to move this to -updates?

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/cinder (openstack-ci/fuel-8.0/liberty)

Change abandoned by Ivan Kolodyazhny <email address hidden> on branch: openstack-ci/fuel-8.0/liberty
Review: https://review.fuel-infra.org/16539

tags: added: move-to-mu
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

We haven't seen this in the field yet, only on CI (rarely). The proposed fix is too risky to be merged at this stage of the release cycle. We believe it's safe to move this to an MU and continue investigating this issue.

Changed in mos:
milestone: 8.0 → 8.0-updates
Revision history for this message
Ivan Kolodyazhny (e0ne) wrote :

We've got a hacky workaround. A proper fix [1] requires a change in oslo.messaging to support 'retry' and 'timeout' params per client. For now, these params are global.

[1] https://review.openstack.org/274148 - PoC
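
For context on the 'retry'/'timeout' remark above, a minimal sketch (assumptions only, not the Cinder patch or the PoC): oslo.messaging lets a caller set a per-call timeout via prepare(), while the retry behaviour at the time was governed by global configuration, which is what the proper fix needed to change. The topic and method name below are illustrative.

# Sketch only: per-call RPC timeout with oslo.messaging; the topic/method
# names are assumptions and the commented-out call is hypothetical.
import oslo_messaging as messaging
from oslo_config import cfg

transport = messaging.get_transport(cfg.CONF)
target = messaging.Target(topic='cinder-volume', version='2.0')
client = messaging.RPCClient(transport, target)

# Longer timeout for one heavyweight call only; everything else keeps the
# globally configured default.
cctxt = client.prepare(timeout=120)
# cctxt.call(context, 'create_volume', ...)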

Changed in mos:
assignee: Ivan Kolodyazhny (e0ne) → Dmitry Mescheryakov (dmitrymex)
tags: added: release-notes
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Given that we haven't actually seen this in the field, I suggest we skip the release notes part.

tags: removed: release-notes
Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The original issue is definitely the same as #1560097. Since we are already tracking all the work there, I am marking the current bug as a duplicate.

Re the issue that Egor found in comment #2 - it is at least a very different issue, but it might be that that one is a flawed test as well. For instance, it is not clear whether Egor waited for RabbitMQ to recover after the destructive action. Please file a separate issue if you experience it again, and also please post snippets of the logs with timestamps indicating when the issue occurred, because it is completely unclear from the comment.
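
On the 'waited for RabbitMQ to recover' point - one hedged way a test run could check this before starting OSTF (a sketch using kombu; the broker URL and retry budget are placeholders, and this is not part of fuel_health):

# Sketch only: block until the AMQP endpoint is reachable again after the
# destructive action. Replace the placeholder broker URL before use.
import kombu

connection = kombu.Connection('amqp://nova:password@<management-vip>:5673//')
# ensure_connection() retries with backoff and raises if the broker stays down.
connection.ensure_connection(max_retries=30, interval_start=2, interval_step=2)
connection.release()
print('RabbitMQ is reachable again')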
