[BVT][OSTF][system_tests] Ubuntu with RadosGW env failed to boot instance from volume due to wrong free space information given to cinder-scheduler

Bug #1429794 reported by Bogdan Dobrelya
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Confirmed
Importance: Critical
Assigned to: Fuel Library (Deprecated)
Milestone: 6.1

Bug Description

http://jenkins-product.srt.mirantis.net:8080/view/6.1/job/6.1.ubuntu.bvt_2/171/console
http://jenkins-product.srt.mirantis.net:8080/view/6.1/job/6.1.ubuntu.bvt_2/172/console

AssertionError: Failed tests, fails: 2 should fail: 0 failed tests name: [{u'Create volume and boot instance from it (failure)': u'Failed to get to expected status. In error state. Please refer to OpenStack logs for more details.'}, {u'Create volume and attach it to instance (failure)': u'Failed to get to expected status. In error state. Please refer to OpenStack logs for more details.'}]

After the last snapshot revert, I cannot reproduce the issue

Sergii Golovatiuk (sgolovatiuk) wrote :

On Friday, I took one of the failed Ceph jobs and was not able to reproduce the issue after the revert. It is a kind of race that is not reproducible once the environment has stabilized.

Changed in fuel:
milestone: none → 6.1
importance: Undecided → Critical
Bogdan Dobrelya (bogdando) wrote :

Here is the root cause: there is no free space to create a volume.

http://paste.openstack.org/show/190958/

Changed in fuel:
status: New → Triaged
assignee: nobody → Fuel QA Team (fuel-qa)
summary: - [BVT] Ubuntu with RadosGW env failed to boot instance from volume OSTF
+ [BVT][OSTF][system_tests] Ubuntu with RadosGW env failed to boot
+ instance from volume due to RadosGW free space issue
Bogdan Dobrelya (bogdando) wrote : Re: [BVT][OSTF][system_tests] Ubuntu with RadosGW env failed to boot instance from volume due to RadosGW free space issue

The free space appears to be reported as '0' until some point in time, for example:
remote/node-1.test.domain.local/cinder-scheduler.log:2015-03-08T11:43:20.644020+00:00 debug: Received volume service update from rbd:volumes: {u'volume_backend_name': u'DEFAULT', u'free_capacity_gb': 0, u'driver_version': u'1.1.0', u'total_capacity_gb': 0, u'reserved_percentage': 0, u'vendor_name': u'Open Source', u'storage_protocol': u'ceph'}
remote/node-1.test.domain.local/cinder-scheduler.log:2015-03-08T11:44:03.764781+00:00 debug: Received volume service update from rbd:volumes: {u'volume_backend_name': u'DEFAULT', u'free_capacity_gb': 283, u'driver_version': u'1.1.0', u'total_capacity_gb': 296, u'reserved_percentage': 0, u'vendor_name': u'Open Source', u'storage_protocol': u'ceph'}

Here it was reported as '0' the whole time before "2015-03-08T11:44:03.764781" and as 283 GB free / 296 GB total afterwards. Could this race be related to the Ceph readiness bug https://launchpad.net/bugs/1415954 ?
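
For context on why the OSTF volume tests end up in the error state: cinder-scheduler's capacity filter compares the reported free_capacity_gb against the requested volume size, so a backend reporting 0 free GB is rejected and the request fails with no valid host. A minimal illustrative sketch of that check (simplified; the function name and the omission of the 'unknown'/'infinite' special cases are mine, not the upstream CapacityFilter code):

import math

def backend_passes(requested_size_gb, free_capacity_gb, reserved_percentage):
    # Simplified version of the capacity check cinder-scheduler applies to
    # the 'Received volume service update' stats shown above.
    reserved = float(reserved_percentage) / 100
    free = math.floor(free_capacity_gb * (1 - reserved))
    return free >= requested_size_gb

print(backend_passes(1, 0, 0))    # False -> no valid host, volume goes to 'error'
print(backend_passes(1, 283, 0))  # True  -> the same request can be scheduled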

Changed in fuel:
status: Triaged → Confirmed
assignee: Fuel QA Team (fuel-qa) → Fuel Library Team (fuel-library)
Bogdan Dobrelya (bogdando) wrote :

This looks strange, as ceph-mon had started reporting the active+clean state much earlier:
2015-03-08T11:42:49.560464 node-1 ./remote/node-1.test.domain.local/ceph-mon.log:2015-03-08T11:42:49.560464+00:00 emerg: 2015-03-08 11:42:49.569606 7fcf613a9700 0 log [INF] : pgmap v28: 1728 pgs: 89 inactive, 288 peering, 1351 active+clean; 0 bytes data, 12543 MB used, 283 GB / 296 GB avail

So it is unknown why cinder-scheduler was misinformed that the free/total space was 0 just a few seconds later.

description: updated
Bogdan Dobrelya (bogdando) wrote :

Note: comments #5 and #6 refer to the logs from the http://jenkins-product.srt.mirantis.net:8080/view/6.1/job/6.1.ubuntu.bvt_2/171/ build.

summary: [BVT][OSTF][system_tests] Ubuntu with RadosGW env failed to boot
- instance from volume due to RadosGW free space issue
+ instance from volume due to wrong free space information given to cinder-
+ scheduler
Andrew Woodward (xarses) wrote :

The code that checks the volume space is at https://github.com/openstack/cinder/blob/stable/juno/cinder/volume/drivers/rbd.py#L350

It runs get_cluster_stats from the rados client:

>>> import rados
>>> client = rados.Rados(rados_id='admin', conffile='')
>>> client.connect()
>>> client.get_cluster_stats()
{'kb': 207172784L, 'num_objects': 59L, 'kb_avail': 198455288L, 'kb_used': 8717496L}

It would return 'unknown' for the sizes if the command failed.
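
To make the failure mode concrete, here is a rough sketch of what that driver path does with the cluster stats (modelled on the stable/juno _update_volume_stats behaviour described above; simplified, and not a verbatim copy of the driver code):

import rados

def report_capacity_gb(conffile='/etc/ceph/ceph.conf', rados_id='admin'):
    # Sketch: derive the capacities cinder-scheduler receives from
    # get_cluster_stats(); the values returned by the cluster are in KiB.
    stats = {'total_capacity_gb': 'unknown', 'free_capacity_gb': 'unknown'}
    try:
        client = rados.Rados(rados_id=rados_id, conffile=conffile)
        client.connect()
        cluster_stats = client.get_cluster_stats()
        stats['total_capacity_gb'] = cluster_stats['kb'] // (1024 * 1024)
        stats['free_capacity_gb'] = cluster_stats['kb_avail'] // (1024 * 1024)
        client.shutdown()
    except rados.Error:
        # only a failed call yields 'unknown'; a successful call against a
        # cluster whose OSDs are not in yet returns kb=0, which is passed on as 0
        pass
    return stats

So a monitor that answers the stats call before any OSD is in can report kb=0, and that 0 (rather than 'unknown') is what gets forwarded to cinder-scheduler, which matches the log in comment #5.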

Andrew Woodward (xarses) wrote :

I haven't gotten as far through the logs as I wanted, but for it to return a size of 0 to the scheduler, my only guess is that the OSDs were not in the cluster when the test started.

Do we have `ceph -s` output from just before or after the scheduler failure? Reverting the environment likely contaminated the result, as coming out of the snapshot could leave the cluster not ready.

Bogdan Dobrelya (bogdando) wrote :

Thanks to Mykola Golub for the clarification: "In the provided example https://bugs.launchpad.net/fuel/+bug/1429794/comments/6 ceph reported:
 pgmap v28: 1728 pgs: 89 inactive, 288 peering, 1351 active+clean; 0 bytes data, 12543 MB used, 283 GB / 296 GB avail
i.e. not ALL pgs were in the active+clean state, only 1351 of 1728; 89 were inactive and 288 were peering. In this state, requests that read or write data mapped to pgs that are not active are blocked or fail."
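
If the root cause is PGs not yet being active+clean, the system tests could wait for full PG readiness before launching the volume checks. A hedged sketch of such a wait (it assumes `ceph -s --format json` exposes a pgmap section with pgs_by_state and num_pgs; the exact field names can differ between Ceph releases):

import json
import subprocess
import time

def wait_for_pgs_active_clean(timeout=300, interval=10):
    # Poll `ceph -s` until every placement group reports active+clean.
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = json.loads(subprocess.check_output(
            ['ceph', '-s', '--format', 'json']))
        pgmap = status['pgmap']
        clean = sum(s['count'] for s in pgmap.get('pgs_by_state', [])
                    if s['state_name'] == 'active+clean')
        if clean == pgmap['num_pgs']:
            return True
        time.sleep(interval)
    return False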
