cinder volume create failed

Bug #1828059 reported by Peng Peng on 2019-05-07
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
High
Bob Church

Bug Description

Brief Description
-----------------
Creating a volume from ubuntu image, the volume status was changing from creating to downloading, but eventually in "error" status.

Severity
--------
Major

Steps to Reproduce
------------------
....
TC-name: networking/test_ping_vms.py::test_ping_between_two_vms[ubuntu_14-virtio-virtio]

Expected Behavior
------------------
status is "available"

Actual Behavior
----------------
status in "error"

Reproducibility
---------------
Reproducible

System Configuration
--------------------
One node system

Lab-name: SM-2

Branch/Pull Time/Commit
-----------------------
stx master as of 20190506T233000Z

Last Pass
---------
20190505T233000Z

Timestamp/Logs
--------------
[2019-05-07 08:37:42,352] 262 DEBUG MainThread ssh.send :: Send 'cinder --os-username 'tenant2' --os-password 'Li69nux*' --os-project-name tenant2 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os_region_name RegionOne create --display-name vol-ubuntu_14-2 --image-id 74182c51-2785-4ac5-bee0-0d70c52041a0 3'
[2019-05-07 08:37:44,323] 387 DEBUG MainThread ssh.expect :: Output:
+------------------------------+--------------------------------------+
| Property | Value |
+------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2019-05-07T08:37:43.000000 |
| description | None |
| encrypted | False |
| id | 449bba0d-16fc-48e8-a9c2-d2f8b86ca67e |
| metadata | {} |
| multiattach | False |
| name | vol-ubuntu_14-2 |
| os-vol-tenant-attr:tenant_id | 6c8b8d7302174cb9b3a2929af4a4176f |
| replication_status | None |
| size | 3 |
| snapshot_id | None |
| source_volid | None |
| status | creating |

[2019-05-07 08:37:49,088] 262 DEBUG MainThread ssh.send :: Send 'cinder --os-username 'tenant2' --os-password 'Li69nux*' --os-project-name tenant2 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os_region_name RegionOne show 449bba0d-16fc-48e8-a9c2-d2f8b86ca67e'
[2019-05-07 08:37:50,507] 387 DEBUG MainThread ssh.expect :: Output:
+------------------------------+--------------------------------------+
| Property | Value |
+------------------------------+--------------------------------------+
| attached_servers | [] |
| attachment_ids | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2019-05-07T08:37:43.000000 |
| description | None |
| encrypted | False |
| id | 449bba0d-16fc-48e8-a9c2-d2f8b86ca67e |
| metadata | |
| multiattach | False |
| name | vol-ubuntu_14-2 |
| os-vol-tenant-attr:tenant_id | 6c8b8d7302174cb9b3a2929af4a4176f |
| replication_status | None |
| size | 3 |
| snapshot_id | None |
| source_volid | None |
| status | downloading |

[2019-05-07 08:37:55,191] 262 DEBUG MainThread ssh.send :: Send 'cinder --os-username 'tenant2' --os-password 'Li69nux*' --os-project-name tenant2 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os_region_name RegionOne show 449bba0d-16fc-48e8-a9c2-d2f8b86ca67e'
[2019-05-07 08:37:56,599] 387 DEBUG MainThread ssh.expect :: Output:
+------------------------------+--------------------------------------+
| Property | Value |
+------------------------------+--------------------------------------+
| attached_servers | [] |
| attachment_ids | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2019-05-07T08:37:43.000000 |
| description | None |
| encrypted | False |
| id | 449bba0d-16fc-48e8-a9c2-d2f8b86ca67e |
| metadata | |
| multiattach | False |
| name | vol-ubuntu_14-2 |
| os-vol-tenant-attr:tenant_id | 6c8b8d7302174cb9b3a2929af4a4176f |
| replication_status | None |
| size | 3 |
| snapshot_id | None |
| source_volid | None |
| status | error |

[wrsroot@controller-0 ~(keystone_admin)]$ cinder list --a
+--------------------------------------+----------------------------------+----------------+-----------------+------+-------------+----------+-------------+
| ID | Tenant ID | Status | Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+----------------------------------+----------------+-----------------+------+-------------+----------+-------------+
| 449bba0d-16fc-48e8-a9c2-d2f8b86ca67e | 6c8b8d7302174cb9b3a2929af4a4176f | error_deleting | vol-ubuntu_14-2 | 3 | - | false | |
| ab2305b8-6ff0-4433-9e15-635d21998963 | 6c8b8d7302174cb9b3a2929af4a4176f | error_deleting | vol-tenant2 | 5 | - | false | |
| fd4696b8-9052-44e6-966f-a353a988e6f8 | 6c8b8d7302174cb9b3a2929af4a4176f | error_deleting | vol-tenant2-1 | 5 | - | false | |
+--------------------------------------+----------------------------------+----------------+-----------------+------+-------------+----------+-------------+
[wrsroot@controller-0 ~(keystone_admin)]$

Test Activity
-------------
Sanity

Peng Peng (ppeng) wrote :
Daniel Badea (daniel.badea) wrote :

Time frame when the error occured:

  2019-05-07 08:37:49 ... 2019-05-07 08:37:55

Unfortunately there are no cinder-volume logs for the time frame when the error occured:

  cinder-volume-f9d47fc7d-q85mf_openstack_cinder-volume-98ccc7d2009c797516e62ef407fcf33231d8ba53314bbcaa0475ce125c95978b.log
    starts at 2019-05-07T09:56:28.869172348Z

  cinder-volume-f9d47fc7d-q85mf_openstack_cinder-volume-b753345a608b14dc44623346840b969bde6a67e1b6b827200e3867ef28dce715.log
    starts at 2019-05-07T10:26:36.965982746Z

I assume cinder-volume has crashed, container was restarted and its logs were discarded.

Looking at sm-customer.log there's some mgr-restful-plugin activity:

| 2019-05-07T08:44:43.835 | 268 | service-scn | mgr-restful-plugin | enabled-active | disabled | audit state mismatch
| 2019-05-07T08:44:43.973 | 269 | service-scn | mgr-restful-plugin | disabled | enabling | enabled-active state requested
| 2019-05-07T08:44:44.144 | 270 | service-scn | mgr-restful-plugin | enabling | enabled-active | enable success

and also in ceph-mgr.controller-0.log there's evidence ceph-mgr (providing REST API) was restarted:

  2019-05-07 08:44:37.992 7fbe53950700 -1 received signal: Terminated from /usr/bin/python /etc/init.d/mgr-restful-plugin start (PID: 90056) UID: 0
  2019-05-07 08:44:37.992 7fbe53950700 -1 mgr handle_signal *** Got signal Terminated ***
  2019-05-07 08:44:46.319 7fa91d533380 0 ceph version 13.2.2 (67ecc2fbbca4f602f1172fa8e561b813eb564df5) mimic (stable), process ceph-mgr, pid 430346

At the same time ceph-manager continuously reports Ceph is in HEALTH_WARN:

  2019-05-07 08:44:21.343 90851 INFO ceph_manager.monitor [-] Current Ceph health: HEALTH_WARN detail: Degraded data redundancy: 1183/2835 objects degraded (41.728%), 83 pgs degraded, 264 pgs undersized

So what I think happened:
- controller-0 was locked/unlocked
- after reboot services are starting up and Ceph enters HEALTH_WARN and starts recovery procedure
- openstack services become available before Ceph is in HEALTH_OK
- cinder-volume tries to write downloaded volume but its request is suspended by ceph-mgr (Ceph usually hangs on to a request until it enters HEALTH_OK and then responds to it)
- but meanwhile mgr-restful-plugin detects ceph-mgr is slow to respond and restarts it
- when this happens cinder-wolume connection to ceph-mgr drops and volume write request fails

the same issue is present in 2+2+2 BM

OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190506T233000Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="93"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-05-06 23:30:00 +0000"

Invalid volume: Invalid input received: Invalid volume: Volume attachments can not be created if the volume is in an error state. The Volume dea1ced2-0fa7-440b-91c7-eb07883e67ad currently has a status of: error (HTTP 400) (Request-ID: req-a976b021-078b-4c7b-8a51-8965de5439c9) (HTTP 400) (Request-ID: req-bc382865-8f4b-48e8-8e45-e7253b9ebc88)

Numan Waheed (nwaheed) on 2019-05-08
tags: added: stx.retestneeded
Ghada Khalil (gkhalil) on 2019-05-08
Changed in starlingx:
assignee: nobody → Daniel Badea (daniel.badea)
Daniel Badea (daniel.badea) wrote :

Issue is caused by a synchronization issue in ceph-mgr restful plugin. See fix candidate and pull request here:

https://github.com/starlingx-staging/stx-ceph/pull/29/commits/3564e288e2abebe9b5828d32cdafe4e1b9148ac5

ceph-mgr.controller-0.log shows an "Unknown request":

2019-05-07 08:44:21.739 7fbe4000d700 0 log_channel(audit) log [DBG] : from='mon.0 192.168.204.3:0/84254' entity='mon.controller-0' cmd=[{"prefix": "osd df", "output_method": "tree", "format": "json"}]: dispatch
2019-05-07 08:44:21.739 7fbe4000d700 1 mgr.server reply handle_command (0) Success
2019-05-07 08:44:21.798 7fbe4080e700 1 mgr[restful] Unknown request '140455068498768:0'
2019-05-07 08:44:21.946 7fbe4f948700 1 mgr send_beacon active

which explains why mgr-restful-plugin doesn't get a response back and decides to restart ceph-mgr.

Ghada Khalil (gkhalil) wrote :

Marking as release gating; this is causing stx sanity to fail

tags: added: stx.2.0 stx.distro.other stx.storage
Ghada Khalil (gkhalil) on 2019-05-10
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
Ghada Khalil (gkhalil) wrote :

The root-cause is the ceph api synchronization issue.
This is a duplicate of https://bugs.launchpad.net/starlingx/+bug/1827521

Bob Church and Daniel Badea have identified an upstream fix that was targeted in ceph v13.2.3 (stx has ceph v13.2.2):

https://tracker.ceph.com/issues/37271
https://github.com/ceph/ceph/pull/25113/commits/6d7f6209b76f2d38ff832e556937510228d76bda

This fix was cherry-picked and merged in starlingx by pull request:
https://github.com/starlingx-staging/stx-ceph/pull/30

The fix should be available in May 15 loads.

Changed in starlingx:
status: Triaged → Fix Released
Ghada Khalil (gkhalil) on 2019-05-15
tags: added: stx.sanity
Ghada Khalil (gkhalil) wrote :

Re-opening as this issue was reported by Christopher Lemus using CENGN image 20190515T220331Z
More investigation is required.

Changed in starlingx:
status: Fix Released → Confirmed
Cristopher Lemus (cjlemusc) wrote :

Providing updates to this bug, just in case that is not a duplicated of the other one and we are facing a different issue.

Using CENGN image 20190515T220331Z: http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190515T220331Z/outputs/iso/bootimage.iso

During sanity execution, Volumes cannot be created, they remain on error status. Sanity executes the following command:

# openstack volume create --size 20 --image cirros --bootable vol-cirros-1

However, the volume just stays on error status, and it cannot be deleted. This is the current status of the volume:

controller-0:~# openstack volume show a2de160f-d4c2-4033-87cf-f119adf2e684
+--------------------------------+--------------------------------------------+
| Field | Value |
+--------------------------------+--------------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | true |
| consistencygroup_id | None |
| created_at | 2019-05-16T13:37:25.000000 |
| description | None |
| encrypted | False |
| id | a2de160f-d4c2-4033-87cf-f119adf2e684 |
| migration_status | None |
| multiattach | False |
| name | vol-cirros-1 |
| os-vol-host-attr:host | cinder-volume-worker@ceph-store#ceph-store |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | 020e982de53c44a6a988859a94ecbc82 |
| properties | |
| replication_status | None |
| size | 20 |
| snapshot_id | None |
| source_volid | None |
| status | error_deleting |
| type | None |
| updated_at | 2019-05-16T13:52:23.000000 |
| user_id | aa11b104ece949e68c4d60d404684397 |
| volume_image_metadata | {u'signature_verified': u'False'} |
+--------------------------------+--------------------------------------------+

Collect attached.

Peng Peng (ppeng) wrote :
Download full text (4.2 KiB)

Issue was reproduced on
Lab: WCP_99_103
Load: 2019-05-15_18-01-07

[2019-05-16 18:00:08,616] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'tenant2' --os-password 'Li69nux*' --os-project-name tenant2 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume create --image=6446e145-72f2-49fa-9202-53b6c32fa031 --size=3 --bootable vol-ubuntu_14-5'
[2019-05-16 18:00:10,338] 387 DEBUG MainThread ssh.expect :: Output:
+---------------------+--------------------------------------+
| Field | Value |
+---------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2019-05-16T18:00:10.000000 |
| description | None |
| encrypted | False |
| id | f248623d-78d9-4331-a201-ba3f401ec24c |
| multiattach | False |
| name | vol-ubuntu_14-5 |
| properties | |
| replication_status | None |
| size | 3 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| type | None |
| updated_at | None |
| user_id | 2620f10b8a2d4eeea3b24d6daeee80b1 |
+---------------------+--------------------------------------+

[2019-05-16 18:00:17,147] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'tenant2' --os-password 'Li69nux*' --os-project-name tenant2 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume show f248623d-78d9-4331-a201-ba3f401ec24c'
[2019-05-16 18:00:18,645] 387 DEBUG MainThread ssh.expect :: Output:
+------------------------------+--------------------------------------+
| Field | Value |
+------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | true |
| consistencygroup_id | None |
| created_at | 2019-05-16T18:00:10.000000 |
| description | None |
| encrypted ...

Read more...

Peng Peng (ppeng) wrote :
Peng Peng (ppeng) wrote :
Ghada Khalil (gkhalil) wrote :

From Bob Church:
I observed the following in the new cinder-volume pod logs from Peng:

2019-05-16 17:49:26.845 19 ERROR cinder.volume.utils File "rbd.pyx", line 2072, in rbd.Image.parent_info (/builddir/build/BUILD/ceph-13.2.2/build/src/pybind/rbd/pyrex/rbd.c:24095)
2019-05-16 17:49:26.845 19 ERROR cinder.volume.utils File "rbd.pyx", line 695, in rbd.make_ex (/builddir/build/BUILD/ceph-13.2.2/build/src/pybind/rbd/pyrex/rbd.c:6647)
2019-05-16 17:49:26.845 19 ERROR cinder.volume.utils File "rbd.pyx", line 549, in rbd.ImageNotFound.__init__ (/builddir/build/BUILD/ceph-13.2.2/build/src/pybind/rbd/pyrex/rbd.c:3737)
2019-05-16 17:49:26.845 19 ERROR cinder.volume.utils TypeError: _init_() got an unexpected keyword argument 'errno'

Historically, we have introduced verbose error messages into the librados/librbd python bindings with: https://github.com/starlingx-staging/stx-ceph/commit/53e93f8f800fc3055678c57b4079a86d8ef88ea7

This commit is currenly not compatable with current mimic implementation of make_ex() which translates librados/librbd return codes into python exceptions.

The rbd.ImageNotFound is actually an acceptable and handled excpetion in Cinder but because of the TypeError being raised, Cinder is bailing out of volume creation/deletions early

                I did a very quick test by fixing up the bindings with: https://github.com/rchurch-wrs/stx-ceph/commit/a46969cd51a6b3a0723ce7c44ea34c528544609f and manually updating the cinder docker image.

                This fixed the issue and I was able to create an image.

Changed in starlingx:
assignee: Daniel Badea (daniel.badea) → Bob Church (rchurch)
Ghada Khalil (gkhalil) on 2019-05-17
Changed in starlingx:
status: Confirmed → Fix Committed
Ghada Khalil (gkhalil) wrote :

Merged in the cengn build: 20190517T223724Z
./cgcs-root/stx/git/ceph 143ed7218158183529f928fd6833b85f760a1863 2019-05-17 18:30:11 -0400 brent-rowsell <email address hidden> Update verbose rados & rbd exceptions messages

Changed in starlingx:
status: Fix Committed → Fix Released
Cristopher Lemus (cjlemusc) wrote :

Issue is still present on CENG Image: 20190519T233000Z

A full collect is uploaded.

We observed the same behavior on all 4 configurations. Please, let us know if besides the collect, you need any additional log for troubleshoot.

Bob Church (rchurch) wrote :

I took a look at the cinder logs and saw the same failure signature. Can you confirm that you are running latest docker images? The docker images were rebuilt as part of this fix and are required to resolve this.

controller-0:~$ sudo docker image ls | grep cinder
192.168.204.2:9001/docker.io/starlingx/stx-cinder master-centos-stable-latest 505032920dd0 2 days ago 860MB
starlingx/stx-cinder master-centos-stable-latest 505032920dd0 2 days ago 860MB

I do not see this use when using these images.

Peng Peng (ppeng) wrote :

Have not seen this issue recently

tags: removed: stx.retestneeded
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.