[BVT][OSTF][system_tests] Ubuntu with RadosGW env failed to boot instance from volume due to wrong free space information given to cinder-scheduler

Bug #1429794 reported by Bogdan Dobrelya
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Confirmed
Importance: Critical
Assigned to: Fuel Library (Deprecated)
Milestone: 6.1

Bug Description

http://jenkins-product.srt.mirantis.net:8080/view/6.1/job/6.1.ubuntu.bvt_2/171/console
http://jenkins-product.srt.mirantis.net:8080/view/6.1/job/6.1.ubuntu.bvt_2/172/console

AssertionError: Failed tests, fails: 2 should fail: 0 failed tests name: [{u'Create volume and boot instance from it (failure)': u'Failed to get to expected status. In error state. Please refer to OpenStack logs for more details.'}, {u'Create volume and attach it to instance (failure)': u'Failed to get to expected status. In error state. Please refer to OpenStack logs for more details.'}]

After the last snapshot revert, I cannot reproduce the issue

Sergii Golovatiuk (sgolovatiuk) wrote :

On Friday, I took one of the failed Ceph jobs and was not able to reproduce the issue after the revert. It is a kind of race that is not reproducible once the environment has stabilized.

Changed in fuel:
milestone: none → 6.1
importance: Undecided → Critical
Bogdan Dobrelya (bogdando) wrote :

Here is the root cause: there is no free space to create a volume.

http://paste.openstack.org/show/190958/

Changed in fuel:
status: New → Triaged
assignee: nobody → Fuel QA Team (fuel-qa)
summary: - [BVT] Ubuntu with RadosGW env failed to boot instance from volume OSTF
+ [BVT][OSTF][system_tests] Ubuntu with RadosGW env failed to boot
+ instance from volume due to RadosGW free space issue
Bogdan Dobrelya (bogdando) wrote : Re: [BVT][OSTF][system_tests] Ubuntu with RadosGW env failed to boot instance from volume due to RadosGW free space issue

The free space appears to be reported as '0' until some point in time, for example:
remote/node-1.test.domain.local/cinder-scheduler.log:2015-03-08T11:43:20.644020+00:00 debug: Received volume service update from rbd:volumes: {u'volume_backend_name': u'DEFAULT', u'free_capacity_gb': 0, u'driver_version': u'1.1.0', u'total_capacity_gb': 0, u'reserved_percentage': 0, u'vendor_name': u'Open Source', u'storage_protocol': u'ceph'}
remote/node-1.test.domain.local/cinder-scheduler.log:2015-03-08T11:44:03.764781+00:00 debug: Received volume service update from rbd:volumes: {u'volume_backend_name': u'DEFAULT', u'free_capacity_gb': 283, u'driver_version': u'1.1.0', u'total_capacity_gb': 296, u'reserved_percentage': 0, u'vendor_name': u'Open Source', u'storage_protocol': u'ceph'}

Here it was reported as '0' the whole time before "2015-03-08T11:44:03.764781" and as 283 GB free / 296 GB total afterwards. Could this race be related to the Ceph readiness bug https://launchpad.net/bugs/1415954 ?
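
For context on why the OSTF volume tests end up in the error state: cinder-scheduler's capacity filter compares the reported free_capacity_gb against the requested volume size, so a backend reporting 0 free GB is rejected and the request fails with no valid host. A minimal illustrative sketch of that check (simplified; the function name and the omission of the 'unknown'/'infinite' special cases are mine, not the upstream CapacityFilter code):

import math

def backend_passes(requested_size_gb, free_capacity_gb, reserved_percentage):
    # Simplified version of the capacity check cinder-scheduler applies to
    # the 'Received volume service update' stats shown above.
    reserved = float(reserved_percentage) / 100
    free = math.floor(free_capacity_gb * (1 - reserved))
    return free >= requested_size_gb

print(backend_passes(1, 0, 0))    # False -> no valid host, volume goes to 'error'
print(backend_passes(1, 283, 0))  # True  -> the same request can be scheduled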

Changed in fuel:
status: Triaged → Confirmed
assignee: Fuel QA Team (fuel-qa) → Fuel Library Team (fuel-library)
Bogdan Dobrelya (bogdando) wrote :

This looks strange, as ceph-mon had started reporting the active+clean state much earlier:
2015-03-08T11:42:49.560464 node-1 ./remote/node-1.test.domain.local/ceph-mon.log:2015-03-08T11:42:49.560464+00:00 emerg: 2015-03-08 11:42:49.569606 7fcf613a9700 0 log [INF] : pgmap v28: 1728 pgs: 89 inactive, 288 peering, 1351 active+clean; 0 bytes data, 12543 MB used, 283 GB / 296 GB avail

So it is unknown why cinder-scheduler was misinformed that the free/total space was 0 just a few seconds later.

description: updated
Bogdan Dobrelya (bogdando) wrote :

Note: comments #5 and #6 refer to the logs from the http://jenkins-product.srt.mirantis.net:8080/view/6.1/job/6.1.ubuntu.bvt_2/171/ build.

summary: [BVT][OSTF][system_tests] Ubuntu with RadosGW env failed to boot
- instance from volume due to RadosGW free space issue
+ instance from volume due to wrong free space information given to cinder-
+ scheduler
Andrew Woodward (xarses) wrote :

The code that checks the volume space is at https://github.com/openstack/cinder/blob/stable/juno/cinder/volume/drivers/rbd.py#L350

It runs get_cluster_stats from the rados client:

>>> import rados
>>> client = rados.Rados(rados_id='admin', conffile='')
>>> client.connect()
>>> client.get_cluster_stats()
{'kb': 207172784L, 'num_objects': 59L, 'kb_avail': 198455288L, 'kb_used': 8717496L}

It would return 'unknown' for the sizes if the command failed.
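
To make the failure mode concrete, here is a rough sketch of what that driver path does with the cluster stats (modelled on the stable/juno _update_volume_stats behaviour described above; simplified, and not a verbatim copy of the driver code):

import rados

def report_capacity_gb(conffile='/etc/ceph/ceph.conf', rados_id='admin'):
    # Sketch: derive the capacities cinder-scheduler receives from
    # get_cluster_stats(); the values returned by the cluster are in KiB.
    stats = {'total_capacity_gb': 'unknown', 'free_capacity_gb': 'unknown'}
    try:
        client = rados.Rados(rados_id=rados_id, conffile=conffile)
        client.connect()
        cluster_stats = client.get_cluster_stats()
        stats['total_capacity_gb'] = cluster_stats['kb'] // (1024 * 1024)
        stats['free_capacity_gb'] = cluster_stats['kb_avail'] // (1024 * 1024)
        client.shutdown()
    except rados.Error:
        # only a failed call yields 'unknown'; a successful call against a
        # cluster whose OSDs are not in yet returns kb=0, which is passed on as 0
        pass
    return stats

So a monitor that answers the stats call before any OSD is in can report kb=0, and that 0 (rather than 'unknown') is what gets forwarded to cinder-scheduler, which matches the log in comment #5.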

Andrew Woodward (xarses) wrote :

I haven't gotten as far through the logs as I wanted, but for it to return a size of 0 to the scheduler, my only guess is that the OSDs were not in the cluster when the test started.

Do we have `ceph -s` output from just before or after the scheduler failure? Reverting the environment likely contaminated the result, as coming out of the snapshot could leave the cluster not ready.

Bogdan Dobrelya (bogdando) wrote :

Thanks to Mykola Golub for the clarification: "In the provided example https://bugs.launchpad.net/fuel/+bug/1429794/comments/6 ceph reported:
 pgmap v28: 1728 pgs: 89 inactive, 288 peering, 1351 active+clean; 0 bytes data, 12543 MB used, 283 GB / 296 GB avail
i.e. not ALL pgs were in the active+clean state, only 1351 of 1728; 89 were inactive and 288 were peering. In this state, requests that read or write data mapped to pgs that are not active are blocked or fail."
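
If the root cause is PGs not yet being active+clean, the system tests could wait for full PG readiness before launching the volume checks. A hedged sketch of such a wait (it assumes `ceph -s --format json` exposes a pgmap section with pgs_by_state and num_pgs; the exact field names can differ between Ceph releases):

import json
import subprocess
import time

def wait_for_pgs_active_clean(timeout=300, interval=10):
    # Poll `ceph -s` until every placement group reports active+clean.
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = json.loads(subprocess.check_output(
            ['ceph', '-s', '--format', 'json']))
        pgmap = status['pgmap']
        clean = sum(s['count'] for s in pgmap.get('pgs_by_state', [])
                    if s['state_name'] == 'active+clean')
        if clean == pgmap['num_pgs']:
            return True
        time.sleep(interval)
    return False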
