[CINDER][CEPH]cinder-volume can't be deleted during rally test

Bug #1459781 reported by Leontiy Istomin on 2015-05-28
Affects              Importance  Assigned to
Mirantis OpenStack   High        Yuriy Nesenenko
  5.1.x              High        Alexey Khivin
  6.0.x              High        Alexey Khivin
  6.1.x              High        Alexey Khivin
  7.0.x              High        Yuriy Nesenenko
  8.0.x              High        Yuriy Nesenenko

Bug Description

UPD 06/16 (copy-paste of comment #11):
Clear steps to reproduce and expected result vs actual result:
Run the Rally scenario create_and_attach_volume.

Rough estimate of the probability of a user facing the issue:
Low. It happens in Rally but could not be reproduced in the mod-linux lab. The underlying issue is that once a delete fails in Rally, there is no retry logic at the Cinder level to recover.

What is the real user-facing impact/severity, and is there a workaround available?

IMPACT: the connection to Ceph hangs and users cannot attach volumes to VMs.
WORKAROUND: restart the Cinder services.

Can we deliver the fix later and apply it easily on a running environment?
Yes, add retry logic:
https://bugs.launchpad.net/cinder/+bug/1462970

-----

During the create_and_attach_volume Rally scenario a volume couldn't be deleted:
rally.log:
http://paste.openstack.org/show/242767/
from cinder-all:
<158>May 28 05:10:49 node-49 cinder-volume volume 978205b0-5ff1-4590-82ca-48847a91b130: deleting
But the volume is still alive:
http://paste.openstack.org/show/242768/
http://paste.openstack.org/show/242818/

Diagnostic Snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-05-28_09-09-27.tar.xz

api: '1.0'
astute_sha: b09729c64b695b2e6fcc88c31843321759ec45d5
auth_required: true
build_id: 2015-05-16_20-55-26
build_number: '425'
feature_groups:
- mirantis
fuel-library_sha: 1645fe45f226cdd6d2829bea9912d0baa3be5033
fuel-ostf_sha: 9ce1800749081780b8b2a4a7eab6586583ffaf33
fuelmain_sha: 0e970647a83d9a7d336c4cc253606d4dd0d59a60
nailgun_sha: 076566b5df37f681c3fd5b139c966d680d81e0a5
openstack_version: 2014.2.2-6.1
production: docker
python-fuelclient_sha: 38765563e1a7f14f45201fd47cf507393ff5d673
release: '6.1'

Successfully deployed the following configuration:
Baremetal, CentOS, IBP, HA, Neutron-vlan, Ceph-all, Nova-debug, Nova-quotas, 6.1-425
Controllers: 3, Computes: 47

Leontiy Istomin (listomin) wrote :

"cinder delete 978205b0-5ff1-4590-82ca-48847a91b130" shows the following:
http://paste.openstack.org/show/242819/
from cinder-all.log:
http://paste.openstack.org/show/242840/

Changed in mos:
milestone: none → 7.0
Ivan Kolodyazhny (e0ne) wrote :

Looks like an issue with connection to Ceph:
2015-05-28T05:10:49.613207+00:00 debug: opening connection to ceph cluster (timeout=-1).
2015-05-28T05:10:49.657785+00:00 info: volume 978205b0-5ff1-4590-82ca-48847a91b130: deleting
2015-05-28T05:10:49.658915+00:00 debug: volume 978205b0-5ff1-4590-82ca-48847a91b130: removing export
2015-05-28T05:10:49.659718+00:00 debug: volume 978205b0-5ff1-4590-82ca-48847a91b130: deleting

Dan Hata (dhata) wrote :

This bug affects scale testing efforts and could fail up to 10 scenarios. Addressing it would give a 15% improvement for build 425: our success rate would go from approximately 67% to 82%.

Ivan Kolodyazhny (e0ne) wrote :

Leontiy, Dan,

I can't reproduce this issue on my environments. Please provide me access to an environment once the bug is reproduced.

Ivan Kolodyazhny (e0ne) wrote :

I'm working on a workaround for Cinder, but it requires config changes and changes to the Puppet manifests.

Ivan Kolodyazhny (e0ne) wrote :

Filed an upstream bug, https://bugs.launchpad.net/cinder/+bug/1462970, to implement the workaround.

Ivan Kolodyazhny (e0ne) wrote :

http://tracker.ceph.com/issues/9445 could be the root cause; we'll try to cherry-pick the fix and test with a patched Ceph.

description: updated
Ivan Kolodyazhny (e0ne) wrote :

From conversation with Alyona:
http://obs-1.mirantis.com:82/trusty-fuel-6.1-stable-7699/ubuntu - repo with patched Ceph; waiting for the RPM build.

Kostiantyn Danylov (kdanylov) wrote :

The issue can't be reproduced in our labs; waiting for a scale lab environment to investigate.

Ivan Kolodyazhny (e0ne) wrote :

The workaround for this issue is to restart the cinder-volume services.

Dan Hata (dhata) wrote :

At the request of Eugene Bogdanov:
Clear steps to reproduce and expected result vs actual result:
Run the Rally scenario create_and_attach_volume.

Rough estimate of the probability of a user facing the issue:
Low. It happens in Rally but could not be reproduced in the mod-linux lab. The underlying issue is that once a delete fails in Rally, there is no retry logic at the Cinder level to recover.

What is the real user-facing impact/severity, and is there a workaround available?

IMPACT: the connection to Ceph hangs and users cannot attach volumes to VMs.
WORKAROUND: restart the Cinder services.

Can we deliver the fix later and apply it easily on a running environment?
Yes, add retry logic:
https://bugs.launchpad.net/cinder/+bug/1462970

description: updated
Alyona Kiseleva (akiselyova) wrote :

It seems not to be a Ceph problem: the Ceph cluster has no errors, and all diagnostic commands show a good state.
The manual delete procedure also works without hangs.
Logs for one failed volume:
http://paste.openstack.org/show/301192/
Error in Rally for this volume:
http://paste.openstack.org/show/301194/

Ivan Kolodyazhny (e0ne) wrote :

The issue has not reproduced at scale for the last few weeks. We'll try to apply the workaround https://bugs.launchpad.net/cinder/+bug/1462970 for the next version of MOS.

If a user or customer reproduces this issue, there is a simple workaround: just restart the cinder-volume services.

tags: added: cinder
Dina Belova (dbelova) wrote :

Reproduced with the 6.1 #525 ISO on a 20-node environment. New log levels for Ceph need to be applied.

Leontiy Istomin (listomin) wrote :

Reproduced during the create-and-list-volume Rally scenario. A volume is stuck in the "downloading" status:
from rally.log: http://paste.openstack.org/show/327921/
cinder show 6f2c7593-24aa-490c-97af-8ca20c8bb640: http://paste.openstack.org/show/327922/
from cinder logs: http://paste.openstack.org/show/327924/
volume exists in Ceph: http://paste.openstack.org/show/327925/

environment configuration:
Baremetal, CentOS, IBP, HA, Neutron-VLAN, Ceph-all, Nova-debug, Nova-quotas, 6.1_525; Controllers: 3, Computes+Ceph: 17
api: '1.0'
astute_sha: 1ea8017fe8889413706d543a5b9f557f5414beae
auth_required: true
build_id: 2015-06-19_13-02-31
build_number: '525'
feature_groups:
- mirantis
fuel-library_sha: 2e7a08ad9792c700ebf08ce87f4867df36aa9fab
fuel-ostf_sha: 8fefcf7c4649370f00847cc309c24f0b62de718d
fuelmain_sha: a3998372183468f56019c8ce21aa8bb81fee0c2f
nailgun_sha: dbd54158812033dd8cfd7e60c3f6650f18013a37
openstack_version: 2014.2.2-6.1
production: docker
python-fuelclient_sha: 4fc55db0265bbf39c369df398b9dc7d6469ba13b
release: '6.1'

Diagnostic Snapshot is here: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-06-30_08-57-23.tar.xz

Ivan Kolodyazhny (e0ne) wrote :

Looks like bug https://bugs.launchpad.net/cinder/+bug/1401335 is not fixed; the merged patch causes a new problem. Part of the strace: http://paste.openstack.org/show/329543/

Fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Ivan Kolodyazhny <email address hidden>
Review: https://review.fuel-infra.org/9411

Change abandoned by Ivan Kolodyazhny <email address hidden> on branch: openstack-ci/fuel-7.0/2015.1.0
Review: https://review.fuel-infra.org/9411

Change restored by Ivan Kolodyazhny <email address hidden> on branch: openstack-ci/fuel-7.0/2015.1.0
Review: https://review.fuel-infra.org/9411

Reviewed: https://review.fuel-infra.org/9411
Submitter: mos-infra-ci <>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: e1967e8b4c6ae7448a156d2b8f8b345deabf515d
Author: Ivan Kolodyazhny <email address hidden>
Date: Thu Jul 16 14:13:24 2015

Fix block eventlet threads on rbd calls

Commit Ibaf43858d60e1320c339f2523b5c09c7f7c7f91e caused a new problem
with cross-thread communication. According to the Python documentation,
code can lead to a deadlock if a spawned thread directly or indirectly
attempts to import a module. python-rados spawns a new thread to connect
to the cluster, so I removed spawning a new thread to connect to rados.
All long-running calls into python-rbd are still executed in native
Python threads to avoid blocking the eventlet loop.

Closes-Bug: #1459781
Change-Id: I4b6c3be71f421067e9aa2657b7b1b3c5d30eb2bb

Fix proposed to branch: openstack-ci/fuel-6.1/2014.2
Change author: Ivan Kolodyazhny <email address hidden>
Review: https://review.fuel-infra.org/10114

Fix proposed to branch: openstack-ci/fuel-6.0-updates/2014.2
Change author: Ivan Kolodyazhny <email address hidden>
Review: https://review.fuel-infra.org/10115

Reviewed: https://review.fuel-infra.org/10115
Submitter: Vitaly Sedelnik <email address hidden>
Branch: openstack-ci/fuel-6.0-updates/2014.2

Commit: 271464fa14aba4899e5c9bfb2cc39f28c674af7d
Author: Ivan Kolodyazhny <email address hidden>
Date: Wed Aug 5 15:49:07 2015

Fix block eventlet threads on rbd calls

Commit Ibaf43858d60e1320c339f2523b5c09c7f7c7f91e caused a new problem
with cross-thread communication. According to the Python documentation,
code can lead to a deadlock if a spawned thread directly or indirectly
attempts to import a module. python-rados spawns a new thread to connect
to the cluster, so I removed spawning a new thread to connect to rados.
All long-running calls into python-rbd are still executed in native
Python threads to avoid blocking the eventlet loop.

Closes-Bug: #1459781
Change-Id: I4b6c3be71f421067e9aa2657b7b1b3c5d30eb2bb

tags: added: 6.0-mu-5 done release-notes
tags: removed: 6.0-mu-5 done release-notes
tags: added: 6.0-mu-5 done release-notes
tags: removed: 6.0-mu-5
tags: added: 6.0-mu-5
Vadim Rovachev (vrovachev) wrote :

Verified on 6.0. Fix works.

Reviewed: https://review.fuel-infra.org/10114
Submitter: Vitaly Sedelnik <email address hidden>
Branch: openstack-ci/fuel-6.1/2014.2

Commit: 945a43ea6d6528e7479fc1f4d6f67dc96ab8ea92
Author: Ivan Kolodyazhny <email address hidden>
Date: Wed Aug 5 15:43:18 2015

Fix block eventlet threads on rbd calls

Commit Ibaf43858d60e1320c339f2523b5c09c7f7c7f91e caused a new problem
with cross-thread communication. According to the Python documentation,
code can lead to a deadlock if a spawned thread directly or indirectly
attempts to import a module. python-rados spawns a new thread to connect
to the cluster, so I removed spawning a new thread to connect to rados.
All long-running calls into python-rbd are still executed in native
Python threads to avoid blocking the eventlet loop.

Closes-Bug: #1459781
Change-Id: I4b6c3be71f421067e9aa2657b7b1b3c5d30eb2bb

Vadim Rovachev (vrovachev) wrote :

Verified on 6.1.
Used packages:
cinder-api,cinder-backup,cinder-common,cinder-scheduler,cinder-volume,python-cinder
with version:
2014.2.2-1~u14.04+mos14
mirror:
http://mirror.fuel-infra.org/mos/snapshots/ubuntu-latest/ mos6.1-proposed/main amd64 Packages

tags: added: 7.0
tags: added: release-notes-done-7.0
removed: 7.0 done release-notes
tags: added: release-notes-done rn7.0
removed: release-notes-done-7.0
tags: added: on-verification

Verified on
----------------
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "
----------------

Steps to reproduce
---
1) Run rally scenario create-and-attach-volume:
root@4d26aea48680:/# rally -v task start create-and-attach-volume.json --tag create-and-attach-volume

2) See in /var/log/cinder-all.log "Cannot delete volume <volume_id>: volume is busy":
...
<158>Oct 7 05:29:00 node-6 cinder-volume 2015-10-07 05:29:00.115 16594 INFO cinder.volume.manager [req-6b46fd26-ad2f-42ed-9da5-02a47be922be 5fc88ee5f65d41f39f225c37c7644825 47fe3f9221d64b27ab03afe1625b4eab - - -] volume 4304a9ba-e5ab-4021-9ed4-93d81b88fe51: deleting
<156>Oct 7 05:29:00 node-6 cinder-volume 2015-10-07 05:29:00.212 16594 WARNING cinder.volume.drivers.rbd [req-6b46fd26-ad2f-42ed-9da5-02a47be922be 5fc88ee5f65d41f39f225c37c7644825 47fe3f9221d64b27ab03afe1625b4eab - - -] ImageBusy error raised while deleting rbd volume. This may have been caused by a connection from a client that has crashed and, if so, may be resolved by retrying the delete after 30 seconds has elapsed.
<155>Oct 7 05:29:00 node-6 cinder-volume 2015-10-07 05:29:00.217 16594 ERROR cinder.volume.manager [req-6b46fd26-ad2f-42ed-9da5-02a47be922be 5fc88ee5f65d41f39f225c37c7644825 47fe3f9221d64b27ab03afe1625b4eab - - -] Cannot delete volume 4304a9ba-e5ab-4021-9ed4-93d81b88fe51: volume is busy

Rally log: http://paste.openstack.org/show/475558/

node-6:/var/log/cinder-all.log in attachment

But some time later all volumes were deleted:

root@node-6:~# cinder list --all-tenants
+----+-----------+--------+--------------+------+-------------+----------+-------------+
| ID | Tenant ID | Status | Display Name | Size | Volume Type | Bootable | Attached to |
+----+-----------+--------+--------------+------+-------------+----------+-------------+
+----+-----------+--------+--------------+------+-------------+----------+-------------+

Added to 7.0 MU1, since it needs QA verification.

tags: added: 70mu1-confirmed

Reviewed: https://review.fuel-infra.org/12617
Submitter: Vitaly Sedelnik <email address hidden>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: ee8d0966d011774af1da39258fe2ce0fb293a5ff
Author: Yuriy Nesenenko <email address hidden>
Date: Wed Oct 7 17:24:51 2015

Add retries to delete a volume in the RBD driver

This patch adds retries to delete a volume. After N failed attempts to
delete a volume, a VolumeIsBusy exception is raised.

Closes-Bug: #1459781
Change-Id: I9499be0c5985f9e8a3e55d1c9add01ad5cd11789
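The approach of the patch above can be sketched as a small retry loop: attempt the RBD delete, back off on ImageBusy, and raise VolumeIsBusy once the retries are exhausted. The exception classes, names, and interval below are illustrative stand-ins, not Cinder's exact code:

```python
import time


class ImageBusy(Exception):
    """Stand-in for rbd.ImageBusy (a client still holds a watch)."""


class VolumeIsBusy(Exception):
    """Stand-in for cinder.exception.VolumeIsBusy."""


def delete_volume_with_retries(delete_fn, volume_name,
                               retries=3, interval=5):
    """Call delete_fn(volume_name), retrying on ImageBusy.

    Raises VolumeIsBusy after all retries are exhausted, mirroring
    the behaviour the patch adds (names here are illustrative).
    """
    for attempt in range(retries + 1):
        try:
            delete_fn(volume_name)
            return
        except ImageBusy:
            if attempt == retries:
                raise VolumeIsBusy(
                    "Cannot delete volume %s: volume is busy" % volume_name)
            # A crashed client's watch expires after ~30s; wait and retry.
            time.sleep(interval)
```

The 30-second hint in the earlier ImageBusy log message is why a non-zero interval matters in practice: the crashed client's watch on the image has to expire before the delete can succeed.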

tags: removed: on-verification

Fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Ivan Kolodyazhny <email address hidden>
Review: https://review.fuel-infra.org/13334

tags: removed: 70mu1-confirmed

Change abandoned by Ivan Kolodyazhny <email address hidden> on branch: openstack-ci/fuel-8.0/liberty
Review: https://review.fuel-infra.org/13334
Reason: It's already merged to Liberty

tags: added: area-cinder
removed: cinder
Ivan Lozgachev (ilozgachev) wrote :

Verified on ENV-10 Build 482 and ENV-14 Build 496
