CephRadosGW.ceph_rados_gw() does not wait for Ceph to become available for OS services

Bug #1426375 reported by Roman Podoliaka
Affects: Fuel for OpenStack
Status: Triaged
Importance: Critical
Assigned to: Fuel Library (Deprecated)

Bug Description

http://jenkins-product.srt.mirantis.net:8080/job/6.1.staging.ubuntu.bvt_2/108/consoleText

The 6.1 staging job failed with:

======================================================================
FAIL: Deploy ceph HA with RadosGW for objects
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/case.py", line 296, in testng_method_mistake_capture_func
    compatability.capture_type_error(s_func)
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/compatability/exceptions_2_6.py", line 27, in capture_type_error
    func()
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/case.py", line 350, in func
    func(test_case.state.get_state())
  File "/home/jenkins/workspace/6.1.staging.ubuntu.bvt_2/fuelweb_test/helpers/decorators.py", line 65, in wrapper
    return func(*args, **kwargs)
  File "/home/jenkins/workspace/6.1.staging.ubuntu.bvt_2/fuelweb_test/tests/test_ceph.py", line 291, in ceph_rados_gw
    self.fuel_web.run_ostf(cluster_id=cluster_id)
  File "/home/jenkins/workspace/6.1.staging.ubuntu.bvt_2/fuelweb_test/__init__.py", line 48, in wrapped
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/6.1.staging.ubuntu.bvt_2/fuelweb_test/models/fuel_web_client.py", line 601, in run_ostf
    failed_test_name=failed_test_name)
  File "/home/jenkins/workspace/6.1.staging.ubuntu.bvt_2/fuelweb_test/__init__.py", line 48, in wrapped
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/6.1.staging.ubuntu.bvt_2/fuelweb_test/models/fuel_web_client.py", line 197, in assert_ostf_run
    failed_tests_res))
AssertionError: Failed tests, fails: 1 should fail: 0 failed tests name: [{u'Create volume and boot instance from it (failure)': u'Failed to get to expected status. In error state. Please refer to OpenStack logs for more details.'}]

The cinder-scheduler log contains the following error:

2015-02-27 09:50:44.101 9193 ERROR cinder.scheduler.flows.create_volume [req-e0d56160-3dc9-4149-946c-825079893a32 7784bc5b6a284bdbb9f79c912c0b0f83 75e3b6fdc7504d75aa1eabe3abc5656e - - -] Failed to run task cinder.scheduler.flows.create_volume.ScheduleCreateVolumeTask;volume:create: No valid host was found. No weighed hosts available

The problem is that the last report from cinder-volume was:

2015-02-27 09:50:17.526 9193 DEBUG cinder.scheduler.host_manager [req-0817c820-8669-4834-8060-336ba8d5f0ec - - - - -] Received volume service update from rbd:volumes: {u'volume_backend_name': u'DEFAULT', u'free_capacity_gb': 0, u'driver_version': u'1.1.0', u'total_capacity_gb': 0, u'reserved_percentage': 0, u'vendor_name': u'Open Source', u'storage_protocol': u'ceph'} update_service_capabilities /usr/lib/python2.7/dist-packages/cinder/scheduler/host_manager.py:434

which means that the last time cinder-volume checked Ceph status (by default, every 60 seconds), there was no space available.

Looks like we need to add another wait to ceph_rados_gw(), which would wait until 'ceph df' reports available space for our pools (images, volumes) and then give the OS services another 60 seconds to update the information in the DB.
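A minimal sketch of such a wait (the helper names are hypothetical, and the `ceph df --format json` key names shown are assumptions that vary between Ceph releases):

```python
import json
import time


def ceph_reports_free_space(ceph_df_json):
    """Parse 'ceph df --format json' output and report whether the
    cluster advertises any available space."""
    stats = json.loads(ceph_df_json).get("stats", {})
    # The key name differs between Ceph releases; check both variants.
    avail = stats.get("total_avail_bytes", stats.get("total_avail", 0))
    return avail > 0


def wait_for_ceph_space(run_ceph_df, timeout=300, interval=10):
    """Poll 'ceph df' until free space shows up, then allow one more
    60-second cinder-volume report cycle so the scheduler's view of
    the backend gets refreshed before OSTF starts."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if ceph_reports_free_space(run_ceph_df()):
            time.sleep(60)  # default cinder report interval
            return
        time.sleep(interval)
    raise RuntimeError(
        "Ceph reported no free space within %d seconds" % timeout)
```

Here `run_ceph_df` would be a callable that executes `ceph df --format json` on a controller (e.g. via the test framework's SSH helper); passing it in keeps the parsing logic testable.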

Tags: ceph staging
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
Changed in fuel:
milestone: none → 6.1
description: updated
Changed in fuel:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Looks like 2 of the last 3 6.1 staging BVT runs failed because of this.

I suspect https://github.com/stackforge/fuel-qa/commit/f75ee6796164b25497fe10010bbaa671d31d842d might also be involved here.

Revision history for this message
Nastya Urlapova (aurlapova) wrote :

@Roman, two of the last runs were successful; additionally, the last failed run failed on OSTF.
http://jenkins-product.srt.mirantis.net:8080/view/6.1/job/6.1.staging.ubuntu.bvt_2/111/console

Revision history for this message
Nastya Urlapova (aurlapova) wrote :
Changed in fuel:
importance: High → Medium
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

@Nastya, I disagree that this should be treated as Medium. This is the second time we've seen this issue on our duty. I'd say this is at least High, as it prevents new packages from getting into stable mirrors on a regular basis.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

My current understanding is that https://github.com/stackforge/fuel-qa/commit/f75ee6796164b25497fe10010bbaa671d31d842d not only added this test to BVT_2 but also changed the test itself significantly, and now it can fail even if the environment is perfectly OK.

As stated in the description, we start the tests before cinder-scheduler learns that there is available space in Ceph.

Changed in fuel:
importance: Medium → Critical
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Guys, I wonder why this issue is assigned to the QA team. There is no snapshot-revert step before OSTF, so I believe all services should be ready after deployment.

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Fuel Library Team (fuel-library)
tags: added: ceph
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Folks, as Tatyana said, if some service is not ready after deployment, just fix it and don't mark the deployment as successful.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Can we add something like a "ceph df" check to our puppet manifests for Ceph?
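For illustration, a readiness check of that kind might look like the following (a sketch, not actual fuel-library code; the pool names and the `ceph df` JSON keys are assumptions):

```python
import json

# Pools the OpenStack services use, per the bug description.
REQUIRED_POOLS = ("images", "volumes")


def ceph_pools_ready(ceph_df_json, required=REQUIRED_POOLS):
    """Return True if 'ceph df --format json' output shows that all
    required pools exist and the cluster has available space."""
    df = json.loads(ceph_df_json)
    pools = {p.get("name") for p in df.get("pools", [])}
    stats = df.get("stats", {})
    avail = stats.get("total_avail_bytes", stats.get("total_avail", 0))
    return avail > 0 and set(required) <= pools
```

A puppet exec could wrap this (or an equivalent shell pipeline around `ceph df`) and fail the deployment step until the check passes.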

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

My opinion is that this bug is a duplicate of https://bugs.launchpad.net/fuel/+bug/1415954
As soon as we introduce the step that fixes Ceph convergence, we can forget about this bug.
