Bug #1828059 “cinder volume create failed” : Bugs : StarlingX

Revision history for this message

Peng Peng (ppeng) wrote on 2019-05-07:

#1

ALL_NODES_20190507.141025.tar Edit (52.8 MiB, application/x-tar)

Revision history for this message

Daniel Badea (daniel.badea) wrote on 2019-05-07:

#2

Time frame when the error occured:

2019-05-07 08:37:49 ... 2019-05-07 08:37:55

Unfortunately there are no cinder-volume logs for the time frame when the error occured:

cinder-volume-f9d47fc7d-q85mf_openstack_cinder-volume-98ccc7d2009c797516e62ef407fcf33231d8ba53314bbcaa0475ce125c95978b.log
starts at 2019-05-07T09:56:28.869172348Z

cinder-volume-f9d47fc7d-q85mf_openstack_cinder-volume-b753345a608b14dc44623346840b969bde6a67e1b6b827200e3867ef28dce715.log
starts at 2019-05-07T10:26:36.965982746Z

I assume cinder-volume has crashed, container was restarted and its logs were discarded.

Looking at sm-customer.log there's some mgr-restful-plugin activity:

and also in ceph-mgr.controller-0.log there's evidence ceph-mgr (providing REST API) was restarted:

  2019-05-07 08:44:37.992 7fbe53950700 -1 received signal: Terminated from /usr/bin/python /etc/init.d/mgr-restful-plugin start (PID: 90056) UID: 0
  2019-05-07 08:44:37.992 7fbe53950700 -1 mgr handle_signal *** Got signal Terminated ***
  2019-05-07 08:44:46.319 7fa91d533380 0 ceph version 13.2.2 (67ecc2fbbca4f602f1172fa8e561b813eb564df5) mimic (stable), process ceph-mgr, pid 430346

At the same time ceph-manager continuously reports Ceph is in HEALTH_WARN:

2019-05-07 08:44:21.343 90851 INFO ceph_manager.monitor [-] Current Ceph health: HEALTH_WARN detail: Degraded data redundancy: 1183/2835 objects degraded (41.728%), 83 pgs degraded, 264 pgs undersized

So what I think happened:
- controller-0 was locked/unlocked
- after reboot services are starting up and Ceph enters HEALTH_WARN and starts recovery procedure
- openstack services become available before Ceph is in HEALTH_OK
- cinder-volume tries to write downloaded volume but its request is suspended by ceph-mgr (Ceph usually hangs on to a request until it enters HEALTH_OK and then responds to it)
- but meanwhile mgr-restful-plugin detects ceph-mgr is slow to respond and restarts it
- when this happens cinder-wolume connection to ceph-mgr drops and volume write request fails

Time frame when the error occured:

2019-05-07 08:37:49 ... 2019-05-07 08:37:55

Unfortunately there are no cinder-volume logs for the time frame when the error occured:

cinder-volume-f9d47fc7d-q85mf_openstack_cinder-volume-98ccc7d2009c797516e62ef407fcf33231d8ba53314bbcaa0475ce125c95978b.log
    starts at 2019-05-07T09:56:28.869172348Z

cinder-volume-f9d47fc7d-q85mf_openstack_cinder-volume-b753345a608b14dc44623346840b969bde6a67e1b6b827200e3867ef28dce715.log
    starts at 2019-05-07T10:26:36.965982746Z

I assume cinder-volume has crashed, container was restarted and its logs were discarded.

Looking at sm-customer.log there's some mgr-restful-plugin activity:

and also in ceph-mgr.controller-0.log there's evidence ceph-mgr (providing REST API) was restarted:

2019-05-07 08:44:37.992 7fbe53950700 -1 received  signal: Terminated from /usr/bin/python /etc/init.d/mgr-restful-plugin start  (PID: 90056) UID: 0
  2019-05-07 08:44:37.992 7fbe53950700 -1 mgr handle_signal *** Got signal Terminated ***
  2019-05-07 08:44:46.319 7fa91d533380  0 ceph version 13.2.2 (67ecc2fbbca4f602f1172fa8e561b813eb564df5) mimic (stable), process ceph-mgr, pid 430346

At the same time ceph-manager continuously reports Ceph is in HEALTH_WARN:

2019-05-07 08:44:21.343 90851 INFO ceph_manager.monitor [-] Current Ceph health: HEALTH_WARN detail: Degraded data redundancy: 1183/2835 objects degraded (41.728%), 83 pgs degraded, 264 pgs undersized

So what I think happened:
- controller-0 was locked/unlocked
- after reboot services are starting up and Ceph enters HEALTH_WARN and starts recovery procedure
- openstack services become available before Ceph is in HEALTH_OK
- cinder-volume tries to write downloaded volume but its request is suspended by ceph-mgr (Ceph usually hangs on to a request until it enters HEALTH_OK and then responds to it)
- but meanwhile mgr-restful-plugin detects ceph-mgr is slow to respond and restarts it
- when this happens cinder-wolume connection to ceph-mgr drops and volume write request fails

Revision history for this message

Maria Guadalupe Perez Ibara (maria-gp) wrote on 2019-05-07:

#3

the same issue is present in 2+2+2 BM

OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190506T233000Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="93"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-05-06 23:30:00 +0000"

Invalid volume: Invalid input received: Invalid volume: Volume attachments can not be created if the volume is in an error state. The Volume dea1ced2-0fa7-440b-91c7-eb07883e67ad currently has a status of: error (HTTP 400) (Request-ID: req-a976b021-078b-4c7b-8a51-8965de5439c9) (HTTP 400) (Request-ID: req-bc382865-8f4b-48e8-8e45-e7253b9ebc88)

Numan Waheed (nwaheed) on 2019-05-08

tags:

added: stx.retestneeded

Ghada Khalil (gkhalil) on 2019-05-08

Changed in starlingx:
assignee:	nobody → Daniel Badea (daniel.badea)

Revision history for this message

Daniel Badea (daniel.badea) wrote on 2019-05-10:

#4

Issue is caused by a synchronization issue in ceph-mgr restful plugin. See fix candidate and pull request here:

https://github.com/starlingx-staging/stx-ceph/pull/29/commits/3564e288e2abebe9b5828d32cdafe4e1b9148ac5

ceph-mgr.controller-0.log shows an "Unknown request":

2019-05-07 08:44:21.739 7fbe4000d700 0 log_channel(audit) log [DBG] : from='mon.0 192.168.204.3:0/84254' entity='mon.controller-0' cmd=[{"prefix": "osd df", "output_method": "tree", "format": "json"}]: dispatch
2019-05-07 08:44:21.739 7fbe4000d700 1 mgr.server reply handle_command (0) Success
2019-05-07 08:44:21.798 7fbe4080e700 1 mgr[restful] Unknown request '140455068498768:0'
2019-05-07 08:44:21.946 7fbe4f948700 1 mgr send_beacon active

which explains why mgr-restful-plugin doesn't get a response back and decides to restart ceph-mgr.

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2019-05-10:

#5

Marking as release gating; this is causing stx sanity to fail

tags:

added: stx.2.0 stx.distro.other stx.storage

Ghada Khalil (gkhalil) on 2019-05-10

Changed in starlingx:
importance:	Undecided → High
status:	New → Triaged

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2019-05-14:

#6

The root-cause is the ceph api synchronization issue.
This is a duplicate of https://bugs.launchpad.net/starlingx/+bug/1827521

Bob Church and Daniel Badea have identified an upstream fix that was targeted in ceph v13.2.3 (stx has ceph v13.2.2):

https://tracker.ceph.com/issues/37271
https://github.com/ceph/ceph/pull/25113/commits/6d7f6209b76f2d38ff832e556937510228d76bda

This fix was cherry-picked and merged in starlingx by pull request:
https://github.com/starlingx-staging/stx-ceph/pull/30

The fix should be available in May 15 loads.

Changed in starlingx:
status:	Triaged → Fix Released

Ghada Khalil (gkhalil) on 2019-05-15

tags:

added: stx.sanity

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2019-05-16:

#7

Re-opening as this issue was reported by Christopher Lemus using CENGN image 20190515T220331Z
More investigation is required.

Changed in starlingx:
status:	Fix Released → Confirmed

Revision history for this message

Cristopher Lemus (cjlemusc) wrote on 2019-05-16:

#8

controller-0_20190516.132547.tar Edit (41.9 MiB, application/x-tar)

Providing updates to this bug, just in case that is not a duplicated of the other one and we are facing a different issue.

Using CENGN image 20190515T220331Z: http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190515T220331Z/outputs/iso/bootimage.iso

During sanity execution, Volumes cannot be created, they remain on error status. Sanity executes the following command:

# openstack volume create --size 20 --image cirros --bootable vol-cirros-1

However, the volume just stays on error status, and it cannot be deleted. This is the current status of the volume:

controller-0:~# openstack volume show a2de160f-d4c2-4033-87cf-f119adf2e684
+--------------------------------+--------------------------------------------+
| Field | Value |
+--------------------------------+--------------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | true |
| consistencygroup_id | None |
| created_at | 2019-05-16T13:37:25.000000 |
| description | None |
| encrypted | False |
| id | a2de160f-d4c2-4033-87cf-f119adf2e684 |
| migration_status | None |
| multiattach | False |
| name | vol-cirros-1 |
| os-vol-host-attr:host | cinder-volume-worker@ceph-store#ceph-store |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | 020e982de53c44a6a988859a94ecbc82 |
| properties | |
| replication_status | None |
| size | 20 |
| snapshot_id | None |
| source_volid | None |
| status | error_deleting |
| type | None |
| updated_at | 2019-05-16T13:52:23.000000 |
| user_id | aa11b104ece949e68c4d60d404684397 |
| volume_image_metadata | {u'signature_verified': u'False'} |
+--------------------------------+--------------------------------------------+

Collect attached.

Providing updates to this bug, just in case that is not a duplicated of the other one and we are facing a different issue.

Using CENGN image 20190515T220331Z: http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190515T220331Z/outputs/iso/bootimage.iso

During sanity execution, Volumes cannot be created, they remain on error status. Sanity executes the following command:

# openstack volume create --size 20 --image cirros --bootable vol-cirros-1

However, the volume just stays on error status, and it cannot be deleted. This is the current status of the volume:

controller-0:~# openstack volume show a2de160f-d4c2-4033-87cf-f119adf2e684
+--------------------------------+--------------------------------------------+
| Field                          | Value                                      |
+--------------------------------+--------------------------------------------+
| attachments                    | []                                         |
| availability_zone              | nova                                       |
| bootable                       | true                                       |
| consistencygroup_id            | None                                       |
| created_at                     | 2019-05-16T13:37:25.000000                 |
| description                    | None                                       |
| encrypted                      | False                                      |
| id                             | a2de160f-d4c2-4033-87cf-f119adf2e684       |
| migration_status               | None                                       |
| multiattach                    | False                                      |
| name                           | vol-cirros-1                               |
| os-vol-host-attr:host          | cinder-volume-worker@ceph-store#ceph-store |
| os-vol-mig-status-attr:migstat | None                                       |
| os-vol-mig-status-attr:name_id | None                                       |
| os-vol-tenant-attr:tenant_id   | 020e982de53c44a6a988859a94ecbc82           |
| properties                     |                                            |
| replication_status             | None                                       |
| size                           | 20                                         |
| snapshot_id                    | None                                       |
| source_volid                   | None                                       |
| status                         | error_deleting                             |
| type                           | None                                       |
| updated_at                     | 2019-05-16T13:52:23.000000                 |
| user_id                        | aa11b104ece949e68c4d60d404684397           |
| volume_image_metadata          | {u'signature_verified': u'False'}          |
+--------------------------------+--------------------------------------------+

Collect attached.

Revision history for this message

Peng Peng (ppeng) wrote on 2019-05-17:

#9

Download full text (4.2 KiB)

Issue was reproduced on
Lab: WCP_99_103
Load: 2019-05-15_18-01-07

[2019-05-16 18:00:17,147] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'tenant2' --os-password 'Li69nux*' --os-project-name tenant2 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume show f248623d-78d9-4331-a201-ba3f401ec24c'
[2019-05-16 18:00:18,645] 387 DEBUG MainThread ssh.expect :: Output:
+------------------------------+--------------------------------------+
| Field | Value |
+------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | true |
| consistencygroup_id | None |
| created_at | 2019-05-16T18:00:10.000000 |
| description | None |
| encrypted ...

Issue was reproduced on
Lab: WCP_99_103
Load: 2019-05-15_18-01-07

[2019-05-16 18:00:17,147] 262  DEBUG MainThread ssh.send    :: Send 'openstack --os-username 'tenant2' --os-password 'Li69nux*' --os-project-name tenant2 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume show f248623d-78d9-4331-a201-ba3f401ec24c'
[2019-05-16 18:00:18,645] 387  DEBUG MainThread ssh.expect  :: Output: 
+------------------------------+--------------------------------------+
| Field                        | Value                                |
+------------------------------+--------------------------------------+
| attachments                  | []                                   |
| availability_zone            | nova                                 |
| bootable                     | true                                 |
| consistencygroup_id          | None                                 |
| created_at                   | 2019-05-16T18:00:10.000000           |
| description                  | None                                 |
| encrypted                    | False                                |
| id                           | f248623d-78d9-4331-a201-ba3f401ec24c |
| multiattach                  | False                                |
| name                         | vol-ubuntu_14-5                      |
| os-vol-tenant-attr:tenant_id | 4962ac2858914305a2ea16be16d250ad     |
| properties                   |                                      |
| replication_status           | None                                 |
| size                         | 3                                    |
| snapshot_id                  | None                                 |
| source_volid                 | None                                 |
| status                       | error                                |
| type                         | None                                 |
| updated_at                   | 2019-05-16T18:00:18.000000           |
| user_id                      | 2620f10b8a2d4eeea3b24d6daeee80b1     |
| volume_image_metadata        | {u'signature_verified': u'False'}    |
+------------------------------+--------------------------------------+
controller-1:~$

Revision history for this message

Peng Peng (ppeng) wrote on 2019-05-17:

#10

ALL_NODES_20190516.195024.tar Edit (34.5 MiB, application/x-tar)

Revision history for this message

Peng Peng (ppeng) wrote on 2019-05-17:

#11

controller-0_20190516.201525.tar Edit (36.9 MiB, application/x-tar)

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2019-05-17:

#12

From Bob Church:
I observed the following in the new cinder-volume pod logs from Peng:

2019-05-16 17:49:26.845 19 ERROR cinder.volume.utils File "rbd.pyx", line 2072, in rbd.Image.parent_info (/builddir/build/BUILD/ceph-13.2.2/build/src/pybind/rbd/pyrex/rbd.c:24095)
2019-05-16 17:49:26.845 19 ERROR cinder.volume.utils File "rbd.pyx", line 695, in rbd.make_ex (/builddir/build/BUILD/ceph-13.2.2/build/src/pybind/rbd/pyrex/rbd.c:6647)
2019-05-16 17:49:26.845 19 ERROR cinder.volume.utils File "rbd.pyx", line 549, in rbd.ImageNotFound.__init__ (/builddir/build/BUILD/ceph-13.2.2/build/src/pybind/rbd/pyrex/rbd.c:3737)
2019-05-16 17:49:26.845 19 ERROR cinder.volume.utils TypeError: _init_() got an unexpected keyword argument 'errno'

Historically, we have introduced verbose error messages into the librados/librbd python bindings with: https://github.com/starlingx-staging/stx-ceph/commit/53e93f8f800fc3055678c57b4079a86d8ef88ea7

This commit is currenly not compatable with current mimic implementation of make_ex() which translates librados/librbd return codes into python exceptions.

The rbd.ImageNotFound is actually an acceptable and handled excpetion in Cinder but because of the TypeError being raised, Cinder is bailing out of volume creation/deletions early

I did a very quick test by fixing up the bindings with: https://github.com/rchurch-wrs/stx-ceph/commit/a46969cd51a6b3a0723ce7c44ea34c528544609f and manually updating the cinder docker image.

This fixed the issue and I was able to create an image.

Changed in starlingx:
assignee:	Daniel Badea (daniel.badea) → Bob Church (rchurch)

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2019-05-17:

#13

Fixed by pull request:
https://github.com/starlingx-staging/stx-ceph/pull/31

Ghada Khalil (gkhalil) on 2019-05-17

Changed in starlingx:
status:	Confirmed → Fix Committed

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2019-05-18:

#14

Merged in the cengn build: 20190517T223724Z
./cgcs-root/stx/git/ceph 143ed7218158183529f928fd6833b85f760a1863 2019-05-17 18:30:11 -0400 brent-rowsell <email address hidden> Update verbose rados & rbd exceptions messages

Changed in starlingx:
status:	Fix Committed → Fix Released

Revision history for this message

Cristopher Lemus (cjlemusc) wrote on 2019-05-20:

#15

controller-0_20190520.122506.tar Edit (33.2 MiB, application/x-tar)

Issue is still present on CENG Image: 20190519T233000Z

A full collect is uploaded.

We observed the same behavior on all 4 configurations. Please, let us know if besides the collect, you need any additional log for troubleshoot.

Revision history for this message

Bob Church (rchurch) wrote on 2019-05-21:

#16

I took a look at the cinder logs and saw the same failure signature. Can you confirm that you are running latest docker images? The docker images were rebuilt as part of this fix and are required to resolve this.

controller-0:~$ sudo docker image ls | grep cinder
192.168.204.2:9001/docker.io/starlingx/stx-cinder master-centos-stable-latest 505032920dd0 2 days ago 860MB
starlingx/stx-cinder master-centos-stable-latest 505032920dd0 2 days ago 860MB

I do not see this use when using these images.

Revision history for this message

Peng Peng (ppeng) wrote on 2019-05-29:

#17

Have not seen this issue recently

tags:

removed: stx.retestneeded

StarlingX

cinder volume create failed

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches