Volume becomes in 'error' state after scheduler starts

Bug #1409012 reported by Ivan Kolodyazhny on 2015-01-09
Affects: cinder
Status: Fix Released
Importance: High
Assigned to: Michal Dulko
Milestone: 2015.1.0

Bug Description

Steps to reproduce:
1. Deploy working OpenStack (E.g. with Devstack)
2. Stop cinder-scheduler
3. Try to create volume
3.1. Ensure that volume is in 'creating' state
4. Start cinder scheduler

Actual results:
Volume goes into the 'error' state. cinder-scheduler reports:
2015-01-09 15:30:48.301 WARNING cinder.scheduler.filter_scheduler [req-22094d37-6304-488e-8531-9228467406ea 4031ead78394485ea79ff45852fad853 7a3f547d38cc491da92bc27246207e49] No weighed hosts found for volume with properties: {u'name': u'lvmdriver-1', u'qos_specs_id': None, u'deleted': False, u'created_at': u'2014-12-31T13:54:04.000000', u'updated_at': None, u'extra_specs': {u'volume_backend_name': u'lvmdriver-1'}, u'is_public': True, u'deleted_at': None, u'id': u'0153f7f7-e2c5-4294-9c68-01b29b28ef50', u'description': None}
2015-01-09 15:30:48.305 ERROR cinder.scheduler.flows.create_volume [req-22094d37-6304-488e-8531-9228467406ea 4031ead78394485ea79ff45852fad853 7a3f547d38cc491da92bc27246207e49] Failed to run task cinder.scheduler.flows.create_volume.ScheduleCreateVolumeTask;volume:create: No valid host was found. No weighed hosts available
2015-01-09 15:30:48.306 DEBUG cinder.volume.flows.common [req-22094d37-6304-488e-8531-9228467406ea 4031ead78394485ea79ff45852fad853 7a3f547d38cc491da92bc27246207e49] Updating volume: 3115b341-1372-40a6-a5c2-966c1a60bea8 with {'status': 'error'} due to: No valid host was found. No weighed hosts available from (pid=21194) error_out_volume /opt/stack/cinder/cinder/volume/flows/common.py:89

Expected results:
The volume is created successfully and ends up in the 'available' state.

This is a synthetic test case, but it would be a real scenario during failover in a high-availability (HA) production deployment.

Ivan Kolodyazhny (e0ne) on 2015-01-09
Changed in cinder:
assignee: nobody → Ivan Kolodyazhny (e0ne)
Duncan Thomas (duncan-thomas) wrote :

This is caused by the fact that a newly started scheduler needs to receive stats-update broadcasts to know what backends are out there.

Possibly fixable by modifying the scheduler not to listen for incoming create (etc) requests until the stats broadcast period (30 seconds by default) has passed.

Note that this same problem occurs if you spin up a second (or more) scheduler, e.g. for load balancing or active/active H/A - the new scheduler will fail all requests it receives until it hears from backends.
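The failure mode described above can be sketched as a small simulation (hypothetical class and attribute names, not the real cinder code): a scheduler whose in-memory stats map is empty rejects every create request until a capabilities broadcast arrives.

```python
class FilterScheduler:
    """Toy model of the scheduler's in-memory backend state.

    Illustrative only -- the real cinder FilterScheduler/HostManager
    track far richer capability data.
    """

    def __init__(self):
        # A freshly started scheduler knows about no backends yet.
        self.host_stats = {}

    def receive_capabilities(self, host, stats):
        # In cinder this arrives via the periodic stats broadcast
        # from each volume service (every 30 seconds by default).
        self.host_stats[host] = stats

    def schedule_create_volume(self, size_gb):
        # With an empty host_stats map every request fails -- this is
        # the "No valid host was found" error from the bug report.
        candidates = [h for h, s in self.host_stats.items()
                      if s["free_capacity_gb"] >= size_gb]
        if not candidates:
            raise RuntimeError("No valid host was found. "
                               "No weighed hosts available")
        return max(candidates,
                   key=lambda h: self.host_stats[h]["free_capacity_gb"])


scheduler = FilterScheduler()
try:
    scheduler.schedule_create_volume(1)   # fails: no stats received yet
except RuntimeError as exc:
    print(exc)

scheduler.receive_capabilities("lvmdriver-1", {"free_capacity_gb": 100})
print(scheduler.schedule_create_volume(1))  # succeeds once stats arrive
```

The same simulation applies to the multi-scheduler case: each scheduler instance keeps its own in-memory map, so a newly added one starts just as blind as a restarted one.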

Changed in cinder:
status: New → Confirmed
importance: Undecided → High
Ivan Kolodyazhny (e0ne) on 2015-02-04
Changed in cinder:
status: Confirmed → In Progress

Change abandoned by Ivan Kolodyazhny (<email address hidden>) on branch: master
Review: https://review.openstack.org/153379

Ivan Kolodyazhny (e0ne) wrote :

Duncan, thanks for the detailed comment. Unfortunately the proposed solution is not supported by the current oslo.messaging implementation. I've tried a workaround, but it was unstable for me. I've also taken a look at nova-scheduler: it stores its data in the DB, not in memory as cinder-scheduler does.

Fix proposed to branch: master
Review: https://review.openstack.org/156219

Changed in cinder:
assignee: Ivan Kolodyazhny (e0ne) → Michal Dulko (michal-dulko-f)

Fix proposed to branch: master
Review: https://review.openstack.org/158623

Changed in cinder:
assignee: Michal Dulko (michal-dulko-f) → Huang Zhiteng (zhiteng-huang)
Changed in cinder:
assignee: Huang Zhiteng (zhiteng-huang) → Michal Dulko (michal-dulko-f)

Fix proposed to branch: master
Review: https://review.openstack.org/163098

Fix proposed to branch: master
Review: https://review.openstack.org/163099

Reviewed: https://review.openstack.org/158623
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=60c563f72d2d6d008fca03f0a058941f3807e1a8
Submitter: Jenkins
Branch: master

commit 60c563f72d2d6d008fca03f0a058941f3807e1a8
Author: Zhiteng Huang <email address hidden>
Date: Tue Feb 24 17:05:46 2015 +0800

    Allow scheduler to receive volume stats when starting service

    Filter scheduler relies on volume services to report their stats in
    order to make decisions - this nature makes the scheduler service a
    bit special when it starts, because it can't really do anything
    without knowing the stats of available volume backends. So the
    original implementation of Filter Scheduler added a hook in
    init_host() to ask all volume services to send their stats
    immediately (volume RPC call 'publish_service_capabilities()'),
    which reduced to a minimum the chance of a new volume being set to
    'error' state because the scheduler has no knowledge of backends.

                  Scheduler          RPC Exchanges     Volume Service(s)
    service start | RPC init              |
                  | init_host()           |
                  | --pub_src_caps()----->| ------------>| (vol fanout)
                  | <---------------------|<--vol stats--| (sch fanout)

    However, commit 65fa80c361f71158cc492dfc520dc4a63ccfa419 moved
    init_host() ahead of service RPC initialization, because for volume
    services initialization should be done before the service starts
    accepting RPC requests. This change unfortunately invalidated the
    scheduler service's init_host(), because that needs to be done
    *AFTER* RPC is ready (a queue is bound to the scheduler fanout
    exchange). The result is that the scheduler asks the volume
    services to send an update, but it misses the information it asked
    for because its queue hasn't been created and bound to the fanout
    exchange yet. So the scheduler always has to wait for the next
    periodic update from the volume services before it can work
    properly.

                  Scheduler          RPC Exchanges     Volume Service(s)
    service start | init_host()           |
                  | --pub_src_caps()----->| ------------>| (vol fanout)
                  | (vol stats lost)      |<--vol stats--| (sch fanout)
                  | RPC init              |

    This change adds a new hook to the Manager class called
    init_host_with_rpc() to allow services like the scheduler to do
    something once RPC is ready. This should restore scheduler's
    behavior before commit

    Change-Id: If6cf9030eb44c39a06ec501ac5c049d460782481
    Partial-bug: #1409012
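The two-phase startup the commit message describes can be sketched roughly like this (a hypothetical simplification; the real cinder Manager/Service classes are considerably more involved):

```python
class Manager:
    """Minimal sketch of a two-phase startup for service managers."""

    def init_host(self):
        """Runs before the RPC server starts; must not rely on RPC."""

    def init_host_with_rpc(self):
        """Runs after the RPC server is up, so fanout queues exist."""


class SchedulerManager(Manager):
    def __init__(self):
        self.rpc_ready = False
        self.requested_caps = False

    def init_host_with_rpc(self):
        # Safe to ask volume services for their stats now: the
        # scheduler's fanout queue is bound, so replies are not lost.
        assert self.rpc_ready
        self.requested_caps = True


def start_service(manager):
    # Ordering from the commit: init_host(), then RPC init, then the
    # new post-RPC hook.
    manager.init_host()
    manager.rpc_ready = True   # stands in for starting the RPC server
    manager.init_host_with_rpc()


mgr = SchedulerManager()
start_service(mgr)
print(mgr.requested_caps)  # True: the stats request went out after RPC init
```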

Change abandoned by Michal Dulko (<email address hidden>) on branch: master
Review: https://review.openstack.org/156219
Reason: Abandoning in favor of https://review.openstack.org/#/c/163099/

Reviewed: https://review.openstack.org/163098
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=9cf6e694a37149f2e1e7b70c51c82bab4f135057
Submitter: Jenkins
Branch: master

commit 9cf6e694a37149f2e1e7b70c51c82bab4f135057
Author: Michal Dulko <email address hidden>
Date: Thu Mar 12 16:07:03 2015 +0100

    Add is_ready method to scheduler driver

    This commit adds an is_ready method to the scheduler driver to
    indicate whether the driver is ready to accept requests. The
    scheduler is considered ready when the host_manager has received
    capabilities for all active volume services.

    Partial-Bug: 1409012
    Change-Id: I4a90e68a3836c44038ee8c937ae1ecf40e5c1f32
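A rough sketch of the is_ready() idea (illustrative names only; the real host_manager tracks much richer capability data per backend):

```python
class HostManager:
    """Toy host_manager tracking which backends have reported stats."""

    def __init__(self, active_backends):
        self.active_backends = set(active_backends)
        self.reported = set()

    def update_service_capabilities(self, backend):
        # Called when a volume service's stats broadcast arrives.
        self.reported.add(backend)

    def has_all_capabilities(self):
        return self.active_backends <= self.reported


class FilterSchedulerDriver:
    def __init__(self, host_manager):
        self.host_manager = host_manager

    def is_ready(self):
        # Ready only once every active volume service has reported.
        return self.host_manager.has_all_capabilities()


hm = HostManager(["lvmdriver-1", "lvmdriver-2"])
driver = FilterSchedulerDriver(hm)
print(driver.is_ready())            # False: nothing reported yet
hm.update_service_capabilities("lvmdriver-1")
print(driver.is_ready())            # False: one backend still missing
hm.update_service_capabilities("lvmdriver-2")
print(driver.is_ready())            # True: all active backends reported
```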

Reviewed: https://review.openstack.org/163099
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=89106c52720e25f51384b3d77c710c7ddc7f1724
Submitter: Jenkins
Branch: master

commit 89106c52720e25f51384b3d77c710c7ddc7f1724
Author: Michal Dulko <email address hidden>
Date: Thu Mar 12 17:24:09 2015 +0100

    Add waiting for the driver to SchedulerManager

    This patch adds a _wait_for_scheduler method that is called before
    serving any request. The method waits until scheduler.is_ready()
    returns True or until CONF.periodic_interval seconds have passed
    since service startup.

    Change-Id: I9fab9fb076a955a24c1c157229baf027359d9771
    Closes-Bug: 1409012
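The waiting logic can be sketched like this (hypothetical helper function; in cinder the timeout comes from CONF.periodic_interval and the wait happens inside SchedulerManager):

```python
import time


def wait_for_scheduler(driver, timeout, poll_interval=0.05):
    """Block until driver.is_ready() returns True or `timeout` seconds
    elapse. Sketch of the _wait_for_scheduler idea, not cinder's code."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if driver.is_ready():
            return True
        time.sleep(poll_interval)
    # Timed out: give up and serve requests anyway, as before the fix.
    return driver.is_ready()


class FakeDriver:
    """Becomes ready after a fixed number of is_ready() polls."""

    def __init__(self, ready_after_calls):
        self.calls = 0
        self.ready_after_calls = ready_after_calls

    def is_ready(self):
        self.calls += 1
        return self.calls > self.ready_after_calls


print(wait_for_scheduler(FakeDriver(ready_after_calls=2), timeout=5))
print(wait_for_scheduler(FakeDriver(ready_after_calls=10**9), timeout=0.2))
```

The first call returns True once the driver reports ready; the second gives up after the timeout and returns False, which mirrors the fallback of serving requests after CONF.periodic_interval even if not all backends have reported.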

Changed in cinder:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2015-03-20
Changed in cinder:
milestone: none → kilo-3
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2015-04-30
Changed in cinder:
milestone: kilo-3 → 2015.1.0